The ugly truth about disaster recovery

High availability, disaster recovery, and business continuity often fail due to poor design. Here's how to do them right -- even in the cloud


The same approach can be used with the virtualization host. By adding a second host, clustering it with the first, and attaching it to the same SAN, you've allowed either host to fail with only a short-lived service interruption. However, that still leaves the SAN as a single point of failure. Due to cost, most enterprises won't deliver the same level of HA at the shared-storage level (by implementing a second on-site SAN with synchronous data replication), instead relying on internal SAN HA components such as redundant controllers and fabric switches.

Cloud-based HA is often a bit tougher to wrap your head around because the infrastructures that back different cloud services vary widely. If you're using AWS, you already have a lot of HA built in simply because of how Amazon EC2 and EBS are constructed. If an Amazon EC2 compute node fails and you're using EBS disk resources, your EC2 instance can be restarted on a different compute node without too much fanfare. Similarly, each block of EBS disk is replicated among multiple storage nodes in Amazon's EBS network. In other words, Amazon has already done most of the common HA footwork for you.
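To see what that recovery looks like in practice, here's a minimal sketch using Amazon's boto3 SDK that stops an EBS-backed EC2 instance and starts it again, which lets AWS place it on healthy hardware. The instance ID and region are placeholders, and the sketch assumes your AWS credentials are already configured.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
instance_id = "i-0123456789abcdef0"  # placeholder instance ID

# Stopping an EBS-backed instance releases its underlying compute node;
# the EBS volumes (and the data on them) persist independently.
ec2.stop_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

# Starting it again lets AWS schedule the instance onto healthy
# hardware, reattaching the same EBS volumes.
ec2.start_instances(InstanceIds=[instance_id])
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```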

Disaster recovery

Unlike HA, DR does not seek to make your services more available by avoiding the impact of a disaster, but instead allows you to recover from a disaster when it inevitably happens. The biggest differences between HA and DR from a nontechnical perspective are the amount of time it takes to recover and how stale the recovery data is -- often reflected in terms of RTO (recovery time objective) and RPO (recovery point objective). An HA event (such as a cluster node failing over) might introduce a minute's worth of service interruption, whereas recovering a backup onto a new piece of hardware could take hours or even days. DR's job is not to ensure consistent uptime, but instead to ensure that you can recover your data in the event that it's lost, inaccessible, or compromised.
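To make those two numbers concrete, here's a tiny worked example with made-up times: a nightly backup schedule puts your worst-case RPO at roughly the backup interval, while your RTO is however long the restore itself takes.

```python
from datetime import datetime, timedelta

# Hypothetical numbers for illustration only.
backup_interval = timedelta(hours=24)      # nightly full backup
last_backup = datetime(2011, 5, 2, 2, 0)   # backup finished at 2:00 a.m.
failure = datetime(2011, 5, 2, 13, 0)      # disaster strikes at 1:00 p.m.
restore_duration = timedelta(hours=4)      # time to restore onto new hardware

rpo_actual = failure - last_backup   # data lost: 11 hours
rpo_worst_case = backup_interval     # could have been up to 24 hours
rto = restore_duration               # service back after 4 hours

print(f"Data loss (RPO): {rpo_actual}, worst case {rpo_worst_case}")
print(f"Downtime (RTO): at least {rto}")
```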

As with HA, approaches to DR vary widely. The approach you'll decide to use for any given application depends on the RTO and RPO you want to achieve and how much money you have. In the case of the example SQL server, your approach might be as simple as tossing a tape drive onto the server and doing nightly backups. As long as your tapes are stored somewhere secure (preferably off-site), you're protected from most disasters that might strike. If you need a shorter RPO, you might layer on periodic transaction log backups to be shipped off to an on-site NAS or perhaps to cloud-based storage such as Amazon S3.
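That log-shipping step can be as simple as a scheduled script. Here's a minimal sketch, again using boto3, that uploads each transaction log backup to an S3 bucket; the bucket name and the local backup directory are placeholders.

```python
import boto3
from pathlib import Path

s3 = boto3.client("s3")
bucket = "example-sql-log-backups"        # placeholder bucket name
backup_dir = Path(r"D:\SQLBackups\Logs")  # placeholder backup directory

# Upload every transaction log backup in the directory. Once the object
# lands in S3, its durability protects the log even if the server (or
# the whole site) is lost.
for log_file in backup_dir.glob("*.trn"):
    s3.upload_file(str(log_file), bucket, f"tlog/{log_file.name}")
```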

The same approach could be used for virtualization hosts. However, the presence of a SAN gives you a few more options that you could layer on top of traditional backup to achieve better RPO and RTO. For example, you could implement a second SAN, this time at a different site (or a different building if you have a campus), and configure them to replicate.

However, this time you wouldn't use synchronous replication; instead, you'd use multiple layers of asynchronous replication. This is an important distinction: with synchronous replication, data corruption would immediately spread to your second SAN, whereas asynchronous replication gives you a wide array of recovery points to choose from. A DR SAN also doesn't need to be configured with the same amount of resources as the primary SAN. Instead of using lots of high-speed online disk, you could opt for fewer large-capacity nearline disks.
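The value of those multiple recovery points is easy to illustrate. This sketch (with invented timestamps) models recovery points as a list and picks the newest one taken before corruption was detected -- something a single synchronously replicated copy can never offer, since it would already be corrupted.

```python
from datetime import datetime

# Hypothetical recovery points retained by asynchronous replication.
recovery_points = [
    datetime(2011, 5, 2, 0, 0),
    datetime(2011, 5, 2, 6, 0),
    datetime(2011, 5, 2, 12, 0),
    datetime(2011, 5, 2, 18, 0),
]
corruption_detected = datetime(2011, 5, 2, 14, 30)

# Roll back to the newest copy taken *before* the corruption.
usable = [p for p in recovery_points if p < corruption_detected]
best = max(usable)  # the 12:00 point in this example
print(f"Restore from the {best} recovery point")
```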

DR in the cloud is conceptually similar to protecting on-premises equipment -- and it's precisely what many of the users who were ill-prepared for the Amazon failure didn't account for. Backing up your EC2 instance might be as simple as taking periodic snapshots of the underlying EBS disk, thus copying it onto Amazon's S3 storage service, which boasts far better data durability than EBS (though at significantly lower performance). In addition, you might configure periodic backups of the EC2 instance down to local, on-premises storage to free the DR plan of any dependence on Amazon.
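The snapshot half of that plan is a one-call affair. Here's a sketch of a periodic EBS snapshot job in boto3; the volume ID and region are placeholders, and in practice you'd run this on a schedule and prune old snapshots.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Snapshot the instance's EBS volume; the snapshot is stored durably
# in S3 behind the scenes. Volume ID is a placeholder.
snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",
    Description="Nightly DR snapshot of the EC2 instance's boot volume",
)
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot["SnapshotId"]])
print(f"Snapshot {snapshot['SnapshotId']} complete")
```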

Business continuity

BC ties together the availability goals of HA with the recoverability goals of DR. A correctly conceived BC design will allow you to recover from a full-blown disaster with very limited downtime and zero data loss. It is by far the most involved and expensive approach to take, but many enterprises have concluded that their dependence upon their data has grown to such an extent that BC is too important not to pursue. BC almost always involves some form of site redundancy -- allowing business to continue in the event that the primary data center is rendered unavailable for whatever reason.

A word of caution: Don't forget the network. Once you start talking about having BC resources located at remote sites or in the cloud, you need to have easy ways to fail over to them. Whether that means using dynamic routing to allow whole swaths of your data center to suddenly "appear" at a remote site (without addressing changes) or reconfiguring clients to access the services at the remote site, the networking component of BC is a challenge that should not be overlooked.
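On the "reconfiguring clients" side, one common tactic is simply flipping a DNS record to point at the DR site. As a sketch of that idea, here's a boto3 call against Amazon Route 53 (one DNS service among many); the hosted zone ID, record name, and DR-site IP are all placeholders.

```python
import boto3

route53 = boto3.client("route53")

# Repoint app.example.com at the DR site's address. A low TTL keeps
# client caches short so the failover propagates quickly.
route53.change_resource_record_sets(
    HostedZoneId="Z0EXAMPLE12345",  # placeholder hosted zone ID
    ChangeBatch={
        "Comment": "Fail over app.example.com to the DR site",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com.",
                "Type": "A",
                "TTL": 60,
                "ResourceRecords": [{"Value": "203.0.113.50"}],
            },
        }],
    },
)
```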

BC for that example SQL database might look pretty similar to a combination of the HA and DR approaches I've suggested here, but with the introduction of compute and replicated storage resources at a remote site. You could extend what you've already done by implementing a third SQL server at a remote site (or in the cloud) that would also receive high-frequency data replications. That would allow fast recovery of the database in the event of a complete site failure.
