In my post last week, I dug into some of the perils of moving data to the cloud without having your own plan for bringing your services back online in the event that the provider fails. Judging by some of the email responses, a lot of folks out there are shocked that Amazon.com could consider the loss of its customers' data to be a relatively normal event.
Some of that surprise may stem from a lack of understanding about how Amazon.com has designed the various services that live under the Amazon Web Services (AWS) umbrella. But I think much of it may be due to a more persistent industry-wide problem: widespread confusion around what high availability (HA), disaster recovery (DR), and business continuity (BC) really mean. These terms are thrown around all over the place, but are frequently misused or misunderstood.
Why does this matter? Because these terms have three very important things in common: mission-critical business applications, large amounts of money, and setting expectations with high-placed and often very nontechnical business stakeholders. And let me tell you from firsthand experience, if improperly managed, these ingredients can be a recipe for disaster all by themselves.
Being very clear with yourself, management, and business stakeholders about how you're spending money on HA, DR, and BC, and about what that money is going to buy you, is one of the keys to a happy life in IT.
In the case of the SQL server, there are several common failure vectors to protect against. Most servers today include provisions for redundant power supplies, error-correcting memory, and RAID arrays -- all of which could be considered types of high availability in that they allow a component to fail without causing interruption to service. However, most servers won't protect you against a main board failure or OS instability. Thus, you might implement a second server and configure transactional replication between the two or take it a step further and implement shared storage (SAN) and full clustering. That gives you the ability to weather the failure of either host and significantly decreases your exposure.
The same approach can be used with the virtualization host. By adding a second host, clustering it with the first, and attaching it to the same SAN, you've allowed either host to fail with only short-lived service interruption. However, that still leaves the SAN as a single point of failure. Due to cost, most enterprises won't opt to deliver the same level of HA at the shared storage level (by implementing a second on-site SAN with synchronous data replication) -- instead opting to rely on internal SAN HA components, such as redundant controllers and fabric switches.
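To put rough numbers on why the SAN remains the weak link, here is a minimal sketch of the usual availability arithmetic. It assumes component failures are independent, and the availability figures are made up purely for illustration; the `series` and `parallel` helpers are hypothetical names, not part of any product.

```python
from math import prod

def series(*availabilities):
    """Every component must be up, so the availabilities multiply."""
    return prod(availabilities)

def parallel(*availabilities):
    """Redundant components: the group is down only if all of them are down."""
    return 1 - prod(1 - a for a in availabilities)

# Hypothetical figures, chosen only to illustrate the shape of the result.
HOST = 0.99    # one virtualization host
SAN = 0.999    # shared storage, including its internal redundancy

single_host = series(HOST, SAN)
clustered = series(parallel(HOST, HOST), SAN)

HOURS_PER_YEAR = 24 * 365
for label, a in [("one host + SAN", single_host), ("two hosts + SAN", clustered)]:
    downtime = (1 - a) * HOURS_PER_YEAR
    print(f"{label}: {a:.5f} available, roughly {downtime:.1f} hours down per year")
```

Under these assumptions, the second host removes most of the host-related downtime, but the combined figure can never exceed the availability of the single SAN sitting behind both hosts.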
Cloud-based HA is often a bit tougher to wrap your head around because the infrastructures that back different cloud services vary widely. If you're using AWS, you know you already have a lot of HA built in simply because of how Amazon EC2 and EBS are constructed. If an Amazon EC2 compute node fails and you're using EBS disk resources, your EC2 instance can be restarted on a different compute node without too much fanfare. Similarly, each block of EBS disk is replicated among multiple storage nodes in Amazon's EBS network. So, Amazon has already done most of the common HA footwork for you.
Unlike HA, DR does not seek to make your services more available by avoiding the impact of a disaster, but instead allows you to recover from a disaster when it inevitably happens. The biggest differences between HA and DR from a nontechnical perspective are the amount of time it takes to recover and how stale the recovery data is -- often reflected in terms of RTO (recovery time objective) and RPO (recovery point objective). An HA event (such as a cluster node failing over) might introduce a minute's worth of service interruption, whereas recovering a backup onto a new piece of hardware could take hours or even days. DR's job is not to ensure consistent uptime, but instead to ensure that you can recover your data in the event that it's lost, inaccessible, or compromised.
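As a rough illustration of how RTO and RPO are measured against a real incident, the sketch below compares one recovery to a pair of objectives. The `Objectives` class, the `evaluate` function, and all of the timestamps are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Objectives:
    rto: timedelta  # longest acceptable outage
    rpo: timedelta  # largest acceptable window of lost data

def evaluate(failure_at, last_good_copy_at, restored_at, objectives):
    """Return (downtime, data_loss, rto_met, rpo_met) for a single recovery."""
    downtime = restored_at - failure_at          # compared against the RTO
    data_loss = failure_at - last_good_copy_at   # compared against the RPO
    return (downtime, data_loss,
            downtime <= objectives.rto, data_loss <= objectives.rpo)

# Hypothetical incident: nightly backup at 02:00, failure at 14:00,
# service restored from that backup at 20:00 the same day.
objectives = Objectives(rto=timedelta(hours=8), rpo=timedelta(hours=1))
downtime, data_loss, rto_met, rpo_met = evaluate(
    failure_at=datetime(2024, 1, 2, 14, 0),
    last_good_copy_at=datetime(2024, 1, 2, 2, 0),
    restored_at=datetime(2024, 1, 2, 20, 0),
    objectives=objectives,
)
print(f"downtime {downtime}, RTO met: {rto_met}")
print(f"data loss {data_loss}, RPO met: {rpo_met}")
```

In this made-up case the restore finishes inside the eight-hour RTO, but twelve hours of data are gone, which misses a one-hour RPO by a wide margin. The two objectives are independent, and each one has to be paid for separately.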
As with HA, approaches to DR vary widely. The approach you'll decide to use for any given application depends on the RTO and RPO that you want to achieve and how much money you have. In the case of the example SQL server, your approach might be as simple as tossing a tape drive onto the server and doing nightly backups. As long as your tapes are stored somewhere secure (preferably off-site), you're protected from most disasters that might strike. If you need a shorter RPO, you might layer on periodic transaction log backups to be shipped off to an on-site NAS or perhaps to cloud-based storage such as Amazon S3.
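To see what layering log backups on top of nightly fulls actually buys you, here is a small sketch that picks the restore chain for a given failure time. The `restore_chain` function and the backup schedule are hypothetical, and the sketch assumes every backup listed completed and was shipped successfully.

```python
from datetime import datetime, timedelta

def restore_chain(full_backups, log_backups, failure_at):
    """Pick the newest full backup taken before the failure, plus every later
    log backup that completed before the failure, in replay order."""
    fulls = [t for t in full_backups if t <= failure_at]
    if not fulls:
        raise ValueError("no full backup predates the failure")
    base = max(fulls)
    logs = sorted(t for t in log_backups if base < t <= failure_at)
    recovery_point = logs[-1] if logs else base
    return base, logs, recovery_point

# Hypothetical schedule: full backups at 01:00 each night, log backups hourly.
fulls = [datetime(2024, 1, 1, 1), datetime(2024, 1, 2, 1)]
logs = [datetime(2024, 1, 1, 2) + timedelta(hours=h) for h in range(48)]
failure = datetime(2024, 1, 2, 14, 20)

base, chain, point = restore_chain(fulls, logs, failure)
print("with log backups, data lost:", failure - point)

base_only, _, point_only = restore_chain(fulls, [], failure)
print("nightly backups only, data lost:", failure - point_only)
```

With this made-up schedule, the hourly log backups shrink the data-loss window from most of a day's work to the minutes since the last log backup, at the cost of a longer restore that has to replay each log in order.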
BC for the SAN-based virtualization cluster is again similar to that for the stand-alone SQL server in that you'll want redundant compute (server) capacity located at a remote site. As in the DR example, you'll probably also want a second SAN running asynchronous replication. However, this time you're going to want to locate the secondary instance at a different site and configure it with enough transactional performance to keep up with a full production workload -- probably a mirror of the configuration at the primary site rather than a stripped-down, low-performance configuration.
A solid cloud-based business continuity design really requires a thorough understanding of how your cloud provider works. Using AWS as an example, you'd want a second EC2 instance, but this one should be located in a different AWS region from the first. So, if the first instance is in U.S.-East, you'd want the second to be at least in U.S.-West (if not one of the more expensive international regions). Then you'd do some scripting on the primary server to have it periodically ship incremental live-state backups to the secondary. In fact, you could even include turning the secondary instance on and off before and after the replication to save you some cash. In the event that the primary EC2 instance failed, Amazon's Elastic IP assignment could be used to shift traffic to the backup without any users being the wiser, provided both instances sit in the same region; because an Elastic IP is tied to a single region, failing over from U.S.-East to U.S.-West would mean repointing DNS instead.
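The scripting described above might be shaped something like the following sketch. It deliberately contains no AWS calls: `replicate_once` and `FailoverMonitor` are hypothetical names, and the callables passed in (start, ship, stop, health check, failover) are placeholders for whatever your environment uses, such as SDK calls that start and stop the secondary instance and an Elastic IP reassociation or DNS change to move traffic.

```python
from typing import Callable

def replicate_once(start: Callable[[], None],
                   ship: Callable[[], None],
                   stop: Callable[[], None]) -> None:
    """Power the secondary on, ship one incremental backup, power it off again."""
    start()
    try:
        ship()
    finally:
        stop()  # stop paying for the secondary even if shipping failed

class FailoverMonitor:
    """Trigger failover after `threshold` consecutive failed health checks."""

    def __init__(self, is_healthy: Callable[[], bool],
                 fail_over: Callable[[], None], threshold: int = 3):
        self.is_healthy = is_healthy
        self.fail_over = fail_over
        self.threshold = threshold
        self.failures = 0
        self.failed_over = False

    def poll(self) -> bool:
        """Run one health check; return True once failover has happened."""
        if self.failed_over:
            return True
        if self.is_healthy():
            self.failures = 0          # any success resets the count
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.fail_over()       # expected to start the secondary and move traffic
                self.failed_over = True
        return self.failed_over
```

You would call `replicate_once` on a schedule from the primary and `poll` on a timer from somewhere outside the primary's failure domain. Requiring several consecutive failures is a design choice to keep a single dropped health check from triggering a failover; the right threshold depends on how much of your RTO you can afford to spend on detection.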
Some might even question whether that approach takes things far enough -- especially given that there has been at least one instance where a failure in one AWS availability zone hurt services at others. If you find that you're not comfortable working within a single provider, you could always replicate your data to a completely different cloud provider or to on-premises hardware. However, that would involve designing an addressing redundancy system to replace Amazon's Elastic IP (whether that's simply modifying DNS or something more complicated).
Putting it all together
Whatever approach you end up using to satisfy your HA, DR, and BC requirements, make sure that both you and your stakeholders are using the correct terminology and understand what is actually being bought by the investments being made. Business stakeholders, no matter how nontechnical they are, should understand how quickly you'll be able to recover from the entire range of failures that might occur and what it will cost for them to improve those numbers.