Some inconvenient truths about virtualization

Beyond the hype, there are a few gotchas to keep in mind when venturing into the virtualization world.

In case you haven't heard, virtualization is taking the IT world by storm. Promises of reduced downtime, lower power and cooling costs, better use of hardware, and ease of administration have IT shops moving to virtualization like they're in a race. For the most part, those promises are true, but there are certainly some pitfalls associated with virtualization, and those don't get nearly as much attention as the benefits.

One major downside to enterprise virtualization is the CPUs in the host servers. A common scenario is that a push for virtualization starts with a small test farm running VMware, Xen, or VirtualIron -- a few servers running a dozen VMs, perhaps. Once that farm has proven stable and functional, it will inevitably expand, and more production servers will be brought into the virtual infrastructure, requiring more physical host servers. Calls are made and orders are placed, only to discover that the CPUs used in the original farm are no longer offered by the server vendor, the CPU vendor, or both. This becomes a major issue, since a farm must be built on hosts with identical CPUs to realize the full benefits of virtualization, including live VM migration and high availability.

The problem is that migrating running VMs from one host to another can only realistically happen when both hosts have the same CPU stepping and feature set. Otherwise, VMs can become unstable or crash outright. For instance, if a VM is running on a host with SSE3-equipped CPUs, moving it to a host whose CPUs support only SSE2 will produce myriad problems -- if the VM moves at all. The really frustrating part is that the host hardware might be only six months old, yet the CPUs are no longer available.
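As a concrete illustration, one sanity check before trusting live migration between two hosts is to diff the CPU feature flags they advertise. This is a minimal, hypothetical sketch in shell -- the flag lists are hard-coded stand-ins for what `grep ^flags /proc/cpuinfo` would report on each Linux host:

```shell
# Stand-in flag lists; in practice these would come from each host, e.g.
# via ssh: flags_a=$(ssh hostA "grep -m1 '^flags' /proc/cpuinfo")
flags_a="fpu vme sse sse2 sse3"   # migration source (host A)
flags_b="fpu vme sse sse2"        # migration target (host B)

# Collect every flag host A has that host B lacks
missing=$(for f in $flags_a; do
  case " $flags_b " in
    *" $f "*) ;;                  # flag present on both hosts
    *) printf '%s\n' "$f" ;;      # flag only on host A
  esac
done)

if [ -n "$missing" ]; then
  echo "WARNING: migration target lacks: $missing"
else
  echo "CPU feature sets match"
fi
```

With the sample lists above, the check flags `sse3` as missing on the target -- exactly the mismatch that makes a migrated VM misbehave.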

This situation forces some creative thinking. In some cases, you may be able to push the vendor to dig through the back of the closet for a few CPU kits, but that's a stop-gap measure and can't be relied upon. Another oddly popular option is to hit eBay for new-old-stock CPU kits that match the current hardware. The only other option is to buy a whole new set of servers with new CPUs (or new CPU kits for the old servers), migrate the whole farm, and hope that when you need to add capacity, the CPUs for the new boxes are still around. Of course, you then put the nearly-new-but-now-useless original CPU kits on eBay and sell them to someone else who's in the same boat you were.

Even blade systems have this problem, as evidenced by a recent trip I took down CPU-matching lane with a Sun Blade 6000 chassis and X6250 blades. When I purchased the chassis late last year, I spec'd the top-end 2.66GHz quad-core Intel X5355 CPUs, hoping that since they were the top of the line at the time, they would still be available when more blades were required. When new blades were needed six months later, Sun informed me that those CPU kits were no longer produced. After many gyrations with Sun and a reseller, matching CPU kits were magically located, but they came at a premium -- including $250 an hour for a Sun tech to manually downgrade the BIOS on the blades. Talk about heading in the wrong direction. The other option was to bump all the existing servers up to the new CPUs at very significant cost. That can really put a damper on ROI.

Another problem with virtualization is the all-your-eggs-in-one-basket issue. Running a farm of host servers with DRS and HA enabled can provide a lot of peace of mind, since VMs will migrate around to even out load, and if a host goes down, the VMs that had been running on it will boot on other servers. However, the controller making those decisions must itself be available. In the case of VMware, that's the VirtualCenter server, which runs as a Windows service. Several times in the past month, I've found myself manually restarting the VC service on a Windows 2003 VirtualCenter server due to lockups. Everything comes back together after one of these episodes, but when you're trying to put a host into maintenance mode to upgrade RAM, and a few VMs have been "stuck" mid-migration between two ESX hosts for 20 minutes, it can be nerve-wracking. It's at that point that you fully realize that if there's a big problem with VirtualCenter (or its equivalent on other virtualization platforms), you're not just looking at rebooting a single server -- you might be forced to reboot a dozen or more. Thankfully, in all the virtualization work I've done, I've never had to punt to that particular solution, but more than a few times, on more than a few virtualization platforms, I've come close.
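At the time, nothing babysat the babysitter, so the routine was manual: notice VC is wedged, restart the service, check again. A crude watchdog loop captures that routine -- everything here is a hypothetical stand-in (on a real VirtualCenter box, the probe and restart functions might wrap the Windows `sc query` and `sc start` commands against the VC service):

```shell
# Hypothetical watchdog sketch for a flaky management service. check_vc and
# restart_vc are stand-ins; the VC_UP variable simulates service state so
# the sketch runs anywhere.
VC_UP=no                                  # simulated state: service is down

check_vc() { [ "$VC_UP" = "yes" ]; }      # stand-in for a real health probe
restart_vc() { VC_UP=yes; }               # stand-in for a real service restart

attempts=0
while ! check_vc && [ "$attempts" -lt 3 ]; do
  attempts=$((attempts + 1))              # bounded retries, not an infinite loop
  restart_vc
done

check_vc && echo "VC responding after $attempts restart(s)"
```

The bounded retry count matters: if the service won't come back after a few restarts, you want a human paged, not a loop hammering a broken controller.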

In that particular case, I had to manually remove the offending VMs from the host using the VMware CLI tools on the hosts themselves. It worked, but it wasn't without a few tense moments.
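For reference, the kind of service-console session I mean looked roughly like this on ESX 3.x -- the datastore path is a made-up example, and you should confirm the exact `vmware-cmd` subcommands against your own build before using them in anger:

```
vmware-cmd -l                                         # list registered .vmx config paths
vmware-cmd /vmfs/volumes/vol1/vm01/vm01.vmx getstate  # confirm the VM really is wedged
vmware-cmd /vmfs/volumes/vol1/vm01/vm01.vmx stop hard # hard power-off the stuck VM
```

A hard stop is the last resort -- it's the virtual equivalent of yanking the power cord, so you try a graceful stop first and only escalate when the VM won't respond.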

As virtualization infrastructures mature, these management stability problems will hopefully fade, and perhaps at some point migrations between hosts with differing CPUs will become possible -- though probably at a significant performance cost, if it can be done at all.

Until then, don't let this information deter you from moving toward virtualization -- just keep your eyes open while you walk down the path.