The ugly truth about disaster recovery

High availability, disaster recovery, and business continuity often fail due to poor design. Here's how to do them right -- even in the cloud

In my post last week, I dug into some of the perils of moving data to the cloud without having your own plan for bringing your services back online in the event that the provider fails. Judging by some of the email responses, a lot of folks out there are shocked that Amazon could consider the loss of its customers' data to be a relatively normal event.

Some of that surprise may stem from a lack of understanding about how Amazon has designed the various services that live under the Amazon Web Services (AWS) umbrella. But I think much of it may be due to a more persistent industry-wide problem: widespread confusion around what high availability (HA), disaster recovery (DR), and business continuity (BC) really mean. These terms are thrown around all over the place, but they're frequently misused or misunderstood.

Why does this matter? Because these terms have three very important things in common: mission-critical business applications, large amounts of money, and setting expectations with high-placed and often very nontechnical business stakeholders. And let me tell you from firsthand experience, if improperly managed, these ingredients can be a recipe for disaster all by themselves.

Being very clear with yourself, management, and business stakeholders about how you're spending money on HA, DR, BC, and what that money is going to buy you is one of the keys to a happy life in IT.

Setting the stage

To compare and contrast HA, DR, and BC, it helps to have some examples to build on. To that end, let's imagine three different mission-critical services that we want to ensure are as available as possible. The first is an on-premise SQL server running the database for a line-of-business application. The second is an on-premise SAN-attached virtualization host that contains a mess of different virtual machines ranging from application servers to infrastructural servers like domain controllers and file servers. The third is a Linux-based Web server hosted on AWS.

These three examples are all dramatically different from an infrastructural standpoint, and so are the approaches used to ensure that they're up and ready for business. Although their HA, DR, and BC solutions are very similar in concept, they share very little in execution.

High availability

Simply put, HA is a means to reduce your exposure to disaster by increasing the number of failures that must occur before one strikes. HA does not and cannot ensure that disaster won't strike. You can provide all of the redundancy you want, and I can still guarantee there will be failure vectors that can skip right on past every one of them and ruin your day (data corruption, bad software, power spikes, storms, fire, and so on). That is perhaps the most important lesson to learn about HA: It only gives you a lower probability of experiencing disaster; it doesn't make you immune to one.
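To put a rough number on that idea, here's a minimal back-of-the-envelope sketch. It assumes component failures are independent, which is exactly the assumption that the failure vectors above (fire, power spikes, bad software) break -- correlated failures take out every redundant copy at once, no matter how many you have:

```python
def outage_probability(per_component_failure_prob, redundant_copies):
    """Probability that every redundant copy is down at the same moment,
    assuming failures are independent (a generous assumption)."""
    return per_component_failure_prob ** redundant_copies

# One server that is unavailable 1 percent of the time:
single = outage_probability(0.01, 1)   # 0.01, i.e. roughly 3.7 days a year

# Two redundant servers: both must be down simultaneously.
pair = outage_probability(0.01, 2)     # 0.0001, i.e. under an hour a year

print(single, pair)
```

The redundancy buys you orders of magnitude on independent failures, but the exponent does nothing for a single event that hits both copies -- which is why HA lowers the odds of disaster rather than eliminating it.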

Applying HA to my three examples is fairly straightforward. There are certainly many different ways to go about it -- all providing varying levels of high availability at equally variable cost -- but these are some common approaches.
