Moving data to the cloud? Read this first

Protecting cloud-based infrastructure from downtime and data loss requires new skills and new approaches

Today's systems have become so complex that most IT practitioners expect failure. Everything fails, and we invest the time, energy, and capital to build backups, warm sites, and all manner of redundancy so that we can stand things back up when they inevitably stumble.

Cloud-based infrastructure services have taken the lessons learned about data protection and stood them on their head. Suddenly, instead of building a comprehensive, on-premise data protection mechanism, we're tossing our data in the cloud, where the only thing we really have to show for it is a fancy SLA that says our cloud provider probably won't lose our data. It's not exactly awe-inspiring.

You only need to look at the fairly well-publicized Amazon Web Services failure from a few months ago to see the result. The AWS forums were packed full of livid EC2/EBS users who had experienced extended downtime or even lost data during the outage. Does this mean that the cloud (AWS or otherwise) is an unreliable piece of junk we should all avoid? Of course not.

What it does mean is that we have a lot to learn as we bridge the experience gap between on-premise boxes made of sheet metal and seemingly locationless service objects floating in the free space of the cloud.

Lesson No. 1: Forget the SLA

SLAs are great. Before you enter into a service contract for anything with anyone, you should subject the service-level agreement to intense scrutiny. The fine print will give you a lot of insight into how your provider will react in the event it fails to deliver on its reliability promises.

Next, forget you have an SLA. No matter what kind of paper it's printed on, an SLA can't get your data back for you if it's lost. No matter how generous the refund is, it will never, ever make up for that loss -- just as your homeowner's insurance can't replace all of your family heirlooms should your house burn down. A solid SLA really just provides motivation for the service provider to avoid screwing up, not a guarantee that it won't.

Lesson No. 2: Expect failure

No matter how good a given cloud provider's internal redundancy is, you need to plan for it to fail. No redundant system, no matter how well engineered, can survive the wrong combination of failures. This applies to off-premise, cloud-based services just as it does to traditional on-premise IT infrastructures. I have seen underdesigned, on-premise backup systems fail catastrophically, and I've seen several different cloud-based systems fail to work as advertised.

The Amazon debacle stemmed from a short period of network disruption during an upgrade that caused a localized failure, which then cascaded into serious, systemic failure. To be sure, Amazon has promised it will correct the design deficiencies that allowed the initial problem to cascade, but no one can foresee every eventuality that might cause this kind of widespread failure.
