When server consolidation goes too far

Virtualization has revolutionized the data center, but less isn't always more when it comes to the number of server virtualization hosts

Today we can pack abundant resources into a single physical server, yet we remain beholden to the age-old reality of hardware failure. Ignoring that reality leads to overly optimistic planning and, ultimately, downtime.

It's not unreasonable to spec out a physical server with 128GB, 256GB, or even 1TB of RAM, 16 to 48 CPU cores, and a slew of 10G interfaces. Such a server could easily handle dozens, possibly hundreds, of VMs depending on the workloads. On the face of it, we could run the equivalent of three racks of 1U physical servers from 2004 on a single 1U server today. It truly is an amazing evolution in general computing. It's also dangerous, because when that server tanks for whatever reason, the problems generated by that failure are vast, far surpassing the failure of a 1U server from nine years ago. For some reason, this risk isn't factored into many virtualization builds.

The fact is many small-to-medium-size businesses can run their entire server operations on a single modern server. If we're talking about 40 or 50 general-purpose VMs, it's completely doable. Most builds add a second server for load balancing and failover, so you have the entire business running on four CPUs, however much RAM, and four power supplies. We're back to the mainframe, but without the RAS (reliability, availability, serviceability) features. Internal system failures, power issues, upgrades, and the like can easily take one of those servers out of commission. Then we're down to a single box again, possibly with the added job of powering up the dozens of VMs that went down with the failed server.

It's a very tenuous situation at best and catastrophic at worst, yet I see many builds that try to pack as much as possible into a few physical servers and call it a day. A much better solution is to reduce the resources per server and add more physical systems to the mix.
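To put rough numbers on that trade-off, here is a minimal sketch of the headroom arithmetic. It rests on assumptions that are not part of any particular build: all hosts are the same size, VMs are spread as evenly as possible, and capacity is treated as one pooled figure. The function names are made up for illustration.

```python
from fractions import Fraction


def max_safe_utilization(hosts: int, failures: int = 1) -> Fraction:
    """Highest average utilization at which the surviving hosts can still
    carry the full load after `failures` hosts drop out.
    Assumes identical hosts (an illustrative simplification)."""
    if failures < 0 or hosts <= failures:
        raise ValueError("need more hosts than tolerated failures")
    return Fraction(hosts - failures, hosts)


def vms_displaced(total_vms: int, hosts: int, failures: int = 1) -> int:
    """Worst-case count of VMs that must restart elsewhere when `failures`
    hosts go down, assuming VMs are spread as evenly as possible."""
    if hosts <= 0 or not 0 <= failures <= hosts:
        raise ValueError("failures must be between 0 and the host count")
    base, extra = divmod(total_vms, hosts)
    # `extra` hosts carry one VM more than the rest; the worst case
    # loses as many of those fuller hosts as possible.
    return failures * base + min(failures, extra)


if __name__ == "__main__":
    total_vms = 50  # the upper end of the 40-to-50 VM shop described above
    for hosts in (2, 3, 8):
        ceiling = float(max_safe_utilization(hosts))
        lost = vms_displaced(total_vms, hosts)
        print(f"{hosts} hosts: run at or below {ceiling:.1%}, "
              f"one failure displaces up to {lost} VMs")
```

Under these assumptions, a two-server build has to idle half its capacity to survive one failure and must restart half the VMs when that failure comes. An eight-server build can run much closer to full and loses only a small slice of the workload per failed host.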

Granted, licensing considerations come into play here. Because many virtualization frameworks license on CPU and RAM counts, deploying eight smaller-spec servers can cost considerably more than deploying four high-spec servers. The fact remains that by cutting too close to the bone with the physical platform, we deeply undermine our ability to handle outages and physical server problems. We wouldn't deploy mainline storage as a RAID1, but I've seen too many dual-server solutions that amount to the same thing on the server side: a mirrored pair with nothing else to fall back on.

We often hear about how reliable and resilient modern server hardware has become, how redundancy is built in from the power supplies to the hypervisor, and how we can reduce licensing, power, and cooling costs by running fewer, larger boxes. These are accurate points, but they're useless when a hardware or software event takes down a box. It's not "if," it's "when," no matter how resilient you believe your hardware to be.

A case in point might be a file system hiccup on a particular LUN that locks up the I/O subsystem on a server. VMs on other physical servers might be unaffected, but at the very least the affected server will have to be restarted, and restoring lost or corrupted VMs from backups will likely be required. If there are only a few other boxes to take up the slack, this process becomes even more stressful, because suddenly the entire deployment is in jeopardy. If there are four or five other servers in the mix, the pressure is reduced.

Don't think this example is far-fetched -- I had to deal with just such a problem a few weeks ago. Luckily there were eight servers in that cluster, and fixing the problem involved actually fixing the problem and bringing the three affected servers back up, not trying to triage the loss of dozens of VMs with a greatly reduced resource footprint before getting to the root cause.

If you find yourself considering a few huge boxes rather than several smaller boxes, remember that sometimes more is more. Although those few servers can easily support the virtualization load, they are likely to greatly impede future fixes and make upgrades problematic, because so few servers remain to take up the slack. I'll take eight small boxes over three large boxes any day, and I will definitely sleep better at night because of it.

This story, "When server consolidation goes too far," was originally published at InfoWorld.com. Read more of Paul Venezia's The Deep End blog at InfoWorld.com.