Today's systems have become so complex that most IT practitioners expect failure. Everything fails, and we invest the time, energy, and capital to build backups, warm sites, and all manner of redundancy so that we can stand things back up when they inevitably stumble.
Cloud-based infrastructure services have taken the lessons learned about data protection and stood them on their head. Suddenly, instead of building a comprehensive, on-premises data protection mechanism, we're tossing our data in the cloud, where the only thing we really have to show for it is a fancy SLA that says our cloud provider probably won't lose our data. It's not exactly awe-inspiring.
You only need to look at the fairly well-publicized Amazon Web Services failure from a few months ago to see the result. The AWS forums were packed full of livid EC2/EBS users who had experienced extended downtime or even lost data during the outage. Does this mean that the cloud (AWS or otherwise) is an unreliable piece of junk we should all avoid? Of course not.
What it does mean is that we have a lot to learn as we bridge the experience gap between on-premises boxes made of sheet metal and seemingly locationless service objects floating in the free space of the cloud.
Lesson No. 1: Forget the SLA
Moreover, not every cloud service is designed to be failure-proof. Take Amazon's EBS (Elastic Block Store), for example -- it is stated to have an annual failure rate of 0.1 to 0.5 percent. That means if you field 1,000 EBS volumes, you can fully expect as many as five of them not to survive a year without being destroyed. Those aren't bad odds as far as disk resources go, but it's clearly an eventuality worth planning for.
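To put rough numbers on that, here is a minimal back-of-the-envelope sketch in Python. It assumes that volume failures are independent and that the stated annual failure rate applies evenly to every volume, neither of which the quoted figures guarantee. The function names are illustrative and not part of any Amazon tool.

```python
def expected_failures(volumes: int, annual_failure_rate: float) -> float:
    """Expected number of volumes lost in a year (assumes independent failures)."""
    return volumes * annual_failure_rate


def prob_at_least_one_failure(volumes: int, annual_failure_rate: float) -> float:
    """Chance that at least one volume in the fleet is lost within a year."""
    return 1.0 - (1.0 - annual_failure_rate) ** volumes


# The two ends of the stated range: 0.1 percent and 0.5 percent per year.
for afr in (0.001, 0.005):
    print(
        f"AFR {afr:.1%}: expect about {expected_failures(1000, afr):.1f} lost volumes, "
        f"P(at least one loss) = {prob_at_least_one_failure(1000, afr):.1%}"
    )
```

Under those assumptions, a 1,000-volume fleet should expect somewhere between one and five losses a year, and the chance of getting through the year with no loss at all is small even at the low end of the range.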
Lesson No. 3: Understand the infrastructure -- even if it isn't yours
Here's the rub with the cloud: Just because you aren't tasked directly with operating the cloud infrastructure that runs your services doesn't mean you don't need to develop the skills to understand how it works. In fact, quite the opposite is true. One of the main reasons so many Amazon users were so badly affected by the EBS outage is that they didn't fully understand what made Amazon's services tick and how to use them appropriately -- though whether that's a result of a failure of comprehension or documentation is open to debate.
In Amazon's case, that requires a thorough understanding of the data durability differences between Amazon EBS and Amazon S3 and what benefit locating redundant services in different availability zones might grant you. Furthermore, designing, scripting, and regularly testing a solid game plan for what you'll do when failure does strike is critically important.
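One small piece of such a game plan could be a scripted check that flags any service whose replicas all sit in a single availability zone. The sketch below is only an illustration: the inventory is hypothetical, hard-coded sample data, and a real script would need to pull the list from the provider's API or from your own configuration records.

```python
from collections import defaultdict

# Hypothetical inventory: one (service name, availability zone) pair per running replica.
INVENTORY = [
    ("web", "us-east-1a"),
    ("web", "us-east-1b"),
    ("db", "us-east-1a"),
    ("db", "us-east-1a"),
]


def single_zone_services(inventory):
    """Return the services whose replicas all live in one availability zone."""
    zones_by_service = defaultdict(set)
    for service, zone in inventory:
        zones_by_service[service].add(zone)
    return sorted(name for name, zones in zones_by_service.items() if len(zones) < 2)


print(single_zone_services(INVENTORY))  # prints ['db']
```

Running a check like this on a schedule, rather than once at design time, is one way to catch the quiet drift that leaves a supposedly redundant service exposed to a single-zone outage.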
Putting it all together