EMC VMware's ESX 3.0 was released a bit more than three years ago. While ESX 2.5 was a solid virtualization platform, ESX 3.0 seemed to push server virtualization into the realm where a lot of small and large businesses alike could really sink their teeth into it. The new high-availability features in ESX 3.0 were a huge draw to many businesses seeking better uptime, and the refined centralized management offered by VirtualCenter 2.0 was compelling. Support for a wider set of hardware such as iSCSI SANs also allowed high-end functionality at a lower price.
Now that we're three years down the road, many of these initial adopters of ESX 3.0 are starting to replace their hosts with new ones and preparing to upgrade to vSphere 4.0. That seems to be leaving a lot of server admins staring at a stack of three-year-old virtualization hosts that aren't yet finished doing their jobs. Sure, they might not be quite fast enough to go the distance with increased production loads, and you might like to have some more performance headroom, but it's always a painful decision to turn off a bunch of expensive servers and not do anything with them.
Instead of tossing their old hosts in a Dumpster, many enterprises are opting to reuse them. Some turn them into development clusters to separate dev loads from production loads. Some make them available for testing and training. My favorite use is as the seed hardware for a warm site. Even if the old hardware can't run all your production resources at 100 percent resource availability, having some immediately available production capability in a production site failure scenario is better than none -- and it bridges the gap between the time of the disaster and the time that you can get replacement hardware on site.
Assuming that business continuity is important to your organization and you have multiple offices or a sufficiently large campus, building a warm site is a great use of your hardware. It certainly isn't free and there are a number of common pitfalls that you'll want to steer clear of, but it's definitely a worthy endeavor if downtime costs you money.
Step 1: Define the service level
First, you need to define the level of service you want to grant with your warm site. Do you want to protect all of your machines or just a subset? How quickly do you want to be able to recover (RTO)? How old can your data be when you do recover (RPO)? Your answers to these questions may change as you work through the design process and start attaching price tags to varying levels of service, but you should never let what you can afford directly drive what you provide.
It may be that, to be useful, a warm site would cost more than you can currently afford to spend on it. In that case it's better to save your pennies and do it correctly than to implement something that won't accomplish your organization's goals.
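One way to make these questions concrete is to write the targets down per tier and test a proposed replication schedule against them. The following is only a minimal sketch: the `ServiceLevel` class, the tier values, and the rule that worst-case data age equals the replication interval plus the transfer time are all illustrative assumptions, not something prescribed by any product.

```python
from dataclasses import dataclass


@dataclass
class ServiceLevel:
    """Hypothetical record of the targets chosen for one group of VMs."""
    name: str
    rto_hours: float  # how quickly the VMs must be running again
    rpo_hours: float  # how old the recovered data is allowed to be


def worst_case_data_age_hours(interval_hours, transfer_hours):
    # Assumption: a change made just after a replication cycle starts waits
    # one full interval, then needs the transfer time to reach the warm site.
    return interval_hours + transfer_hours


def meets_rpo(level, interval_hours, transfer_hours):
    """True if the proposed schedule keeps data age within the RPO."""
    return worst_case_data_age_hours(interval_hours, transfer_hours) <= level.rpo_hours


tier1 = ServiceLevel(name="tier1", rto_hours=4, rpo_hours=2)
print(meets_rpo(tier1, interval_hours=1, transfer_hours=0.5))  # hourly cycles
print(meets_rpo(tier1, interval_hours=2, transfer_hours=0.5))  # two-hour cycles
```

With these example numbers, hourly replication fits inside a two-hour RPO, while two-hour cycles do not once transfer time is counted.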
Step 2: Assess your SAN situation for replication options
The SAN is the first piece of hardware that needs to be looked at, as it tends to be the most expensive. If possible, using asynchronous SAN-to-SAN replication is the best way to implement a warm site. Depending on the SAN platform in use, such replication might simply be impossible or uneconomical.
For example, if you run a FibreChannel SAN with no iSCSI connectivity and don't have the tremendous luck to have dark fiber running to your warm site, implementing SAN replication might be out of the question without hardware such as an FCIP gateway or software such as EMC's RepliStor. If you're in this boat, be sure to consider these factors the next time you are weighing an upgrade to or replacement of your current SAN.
On the other hand, users of devices such as NetApp filers can simply add SnapMirror licensing, and users of EqualLogic PeerStorage arrays (now sold under the Dell brand) have everything they need already. No matter what your SAN, to perform SAN-to-SAN replication, you're going to need a second one.
If performing SAN-to-SAN replication is out of the question, you still have options. There are several good host-based replication software packages available that will run on the ESX hosts and do direct host-to-host replication. These include Vizioncore vReplicator and NSI DoubleTake for VI. They are usually licensed per VM rather than per host, which can make them unattractive depending upon the number of guests you want to replicate. The big caveat here is that you will need a large amount of directly attached storage on the old hosts that are being moved across to the warm site. (If they had been attached to your production SAN, they may no longer have any disks in them.)
No matter how you decide to do it, your storage configuration -- whether it involves SAN or host-based replication -- is the most important part of the warm site design and should not be treated lightly.
Step 3: Figure out your bandwidth needs
Once you've determined what the storage is going to be at your warm site, you need to consider how you're going to get your data there. If your warm site is on your campus or otherwise fiber-attached, there's not much to worry about unless your data sets are truly massive.
Although the network connectivity to your warm site is probably the most straightforward of all of the decisions that you need to make, it can easily blow your budget, since WAN bandwidth generally has a recurring monthly cost. Failing to properly estimate the required WAN bandwidth can have disastrous long-term budgetary consequences.
For example, let's say your initial calculations show that you're going to need two T1s' worth of bandwidth (3.0Mbps) to replicate an estimated 25GB of storage deltas per 24-hour period to maintain whatever RPO you've set. But it turns out you actually need to move 35GB per day to meet that RPO -- a difference of roughly one more T1 circuit. Depending on your bandwidth costs, that small difference could cost as much as an entirely new SAN or a few new virtualization hosts over three years' time.
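The arithmetic behind that example can be sketched in a few lines. This is a rough estimate only: it assumes decimal gigabytes, a nominal T1 rate of 1.544Mbps, and replication traffic spread evenly across the whole window. The `efficiency` parameter is a hypothetical knob for protocol overhead and link contention.

```python
import math

T1_MBPS = 1.544  # nominal line rate of a single T1 circuit


def required_mbps(gb_per_day, window_hours=24.0, efficiency=1.0):
    """Average throughput (Mbps) needed to move gb_per_day within the window."""
    bits = gb_per_day * 1e9 * 8       # decimal gigabytes to bits
    seconds = window_hours * 3600
    return bits / seconds / 1e6 / efficiency


def t1_circuits(mbps):
    """Smallest whole number of T1 circuits that covers the given rate."""
    return math.ceil(mbps / T1_MBPS)


for gb in (25, 35):
    rate = required_mbps(gb)
    print(f"{gb} GB/day -> {rate:.2f} Mbps -> {t1_circuits(rate)} T1(s)")
```

Under these assumptions, 25GB per day averages about 2.31Mbps and fits on two T1s, while 35GB per day averages about 3.24Mbps and needs a third. Shrinking the window or setting `efficiency` below 1.0 raises the requirement further.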
So if estimating your replication bandwidth needs is so important, there must be a tried-and-true way of doing it, right? Not really. There are some tricks to determine how much data is turning over on your VMs, but you can't always trust what they tell you.
The first and easiest method is to use VMware's built-in snapshot functionality. Take a snapshot of every VM you want to replicate, wait a period of time equal to what you'd like your replication period to be based on your RPO, then examine the snapshot files on your VMFS volumes to see how big they are. (Note: Be sure you have enough free space on your VMFS volumes before you do this.) That figure is roughly how much data has changed on those VMs in that period. If you do this at several times during different parts of your production day and month, you should get a reasonably good idea of how quickly your data is changing.
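Adding up the snapshot files by hand gets tedious across many VMs, so a small script can help. The sketch below assumes the snapshot delta files end in `-delta.vmdk` and that the datastore directory is readable from a machine with Python 3. Both are assumptions to verify in your own environment, and the path is passed on the command line rather than hard-coded.

```python
import os
import sys


def snapshot_delta_bytes(root, suffix="-delta.vmdk"):
    """Sum the sizes of all files under root whose names end with suffix."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                total += os.path.getsize(os.path.join(dirpath, name))
    return total


if __name__ == "__main__" and len(sys.argv) > 1:
    changed = snapshot_delta_bytes(sys.argv[1])
    print(f"{changed / 1e9:.2f} GB of snapshot deltas under {sys.argv[1]}")
```

Running it against the same datastore at the end of each measurement period, then removing the snapshots, gives one data point per period.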
However, that's not all there is to it. Depending on your SAN platform, your SAN may replicate data in larger blocks than VMware's snapshot files allocate. Thus, a single change to a 1KB file within a VM may be seen as a change to a 16MB block on your SAN -- essentially magnifying the amount of data that needs to move by roughly 16,000 times. A difference of this magnitude would be fairly rare, but it shows that you can't easily predict actual data volumes based on snapshots.
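To see how block granularity inflates the numbers, consider a toy model in which the array ships every replication block touched by at least one write. The function name and the list of `(offset, length)` writes are hypothetical, and real arrays track changes in their own ways.

```python
KB = 1024
MB = 1024 * KB


def replicated_bytes(writes, block_size):
    """Bytes shipped if every block touched by a write is sent in full.

    writes is an iterable of (offset, length) pairs, both in bytes.
    """
    dirty = set()
    for offset, length in writes:
        if length <= 0:
            continue
        first = offset // block_size
        last = (offset + length - 1) // block_size
        dirty.update(range(first, last + 1))
    return len(dirty) * block_size


writes = [(5 * MB, 1 * KB)]                 # one 1KB change inside a VM
shipped = replicated_bytes(writes, 16 * MB)
written = sum(length for _, length in writes)
print(shipped // written)                   # amplification factor
```

For the single 1KB write against a 16MB block size, the factor comes out to 16,384. Many small writes landing in the same block cost no more than one, which is part of why the worst case is rare in practice.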
To combat this problem and generally increase the amount of data your WAN can carry, using some form of WAN accelerator that includes deduplication technology is a wise move. Examples of such products include Cisco's WAAS and Riverbed's Steelhead. Both platforms have their own strengths and weaknesses, but they operate in much the same way. They optimize the WAN data flow through intelligent re-windowing and other TCP enhancements, but they also retain a remote cache of what has previously been sent over the WAN link.
In the event that they get a cache hit (a packet that has the same data payload as one seen previously), that packet is not re-sent. Instead, just a pointer to that packet's payload is sent to the device on the other end of the circuit. In the example of a 1KB change requiring 16MB of data transmission, a WAN accelerator could essentially nullify the problem.
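The cache-hit idea can be illustrated with a toy sender that remembers a hash of every payload it has already transmitted. This is not how any particular product works internally. The `DedupSender` class and the choice of SHA-256 as the pointer are assumptions made purely for illustration.

```python
import hashlib


class DedupSender:
    """Toy model of the sending side of a deduplicating WAN link."""

    def __init__(self):
        self.seen = set()        # digests of payloads already sent
        self.bytes_on_wire = 0   # running total of bytes actually transmitted

    def send(self, payload):
        digest = hashlib.sha256(payload).digest()
        if digest in self.seen:
            # Cache hit: send only the 32-byte digest as a pointer.
            self.bytes_on_wire += len(digest)
            return ("ref", digest)
        # Cache miss: send the full payload and remember it.
        self.seen.add(digest)
        self.bytes_on_wire += len(payload)
        return ("data", payload)


sender = DedupSender()
chunk = b"\x00" * 4096
print(sender.send(chunk)[0])   # first time: full data
print(sender.send(chunk)[0])   # repeat: pointer only
print(sender.bytes_on_wire)
```

In this model the second copy of the 4KB chunk costs 32 bytes instead of 4,096. Applied to the 16MB block from the earlier example, only the portion that actually changed would need to cross the link in full.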
Step 4: Consider your hypervisor licensing needs
The last thing to think about is what additional hypervisor licensing you may need to acquire for your warm site. You could configure new ESX hosts and run them unlicensed, only moving the production licenses to them in the event of a production site failure, but this will make it impossible to test anything while the production site is active without breaking your license agreement.
Another option is to purchase licensing for VMware vSphere Essentials, which will operate with a much more limited feature set than on your production site, but still be able to start and run your VMs.
Another issue to consider is whether you want to implement VMware's Site Recovery Manager (SRM). SRM requires that you do SAN-to-SAN replication with SANs that support it (most do), and it is somewhat expensive. However, if being able to test your recovery plan frequently and having a completely automated failover process are important to you, implementing SRM is certainly worth a close look. It's also worth noting that vSphere 4.0 support for SRM likely won't be available until later this year.
If you do it right, reusing retired hardware is a great idea
Taking advantage of retired hardware to build a warm-standby datacenter is a fantastic use of resources and builds in backup computational capacity you'll be happy to have if you ever need it. However, blindly building a warm site without a plan -- regardless of how much extra hardware you have kicking around -- isn't likely to work out well in the long run.
Failing to do any of several things -- set goals properly, consider storage resources, keep WAN bandwidth in mind, or take into account software licensing limitations -- will almost certainly make the exercise more expensive and less effective than it could be.
Notwithstanding all of these challenges, today's virtualization technology, coupled with modern storage and networking technology, makes it far easier to build always-on standby failover capacity than it ever has been in the past. If your organization places a high value on uptime, now is the time to put your toe in the water and give it a try.