This past week saw another high-profile cloud infrastructure failure as Microsoft's Azure Compute service experienced an extended service outage. As in previous cases, such as last April's Amazon EC2 outage, this event caused another round of hand-wringing about the future of the cloud and the wisdom of using it.
Each time something like this happens, however, I'm nonplussed that anyone is actually surprised.
Yes, megascale cloud implementations such as those fielded by Amazon, Microsoft, and Rackspace are designed with many layers of overlapping redundancy to prevent this sort of problem, but why is anyone shocked that some combination of unforeseen circumstances could lead to widespread failure? Far be it from me to defend Microsoft here (I mean, leap year -- really?), but you cannot design a perfect, infallible system. It simply doesn't exist.
Much of life in IT revolves around planning for the failure of the things we build -- and how we'll cope when it happens. Working with the cloud is no different. Just because someone else is investing millions of dollars in creating an ultraredundant and hyperscalable infrastructure doesn't free you from having to construct your own game plan for keeping your services available when -- not if -- there's a failure.
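As a minimal sketch of what such a game plan can look like in code, here is a hypothetical client-side failover routine: try an ordered list of redundant endpoints and use the first one that passes a health check. The endpoint names and the health-check interface are illustrative assumptions, not any particular provider's API.

```python
# Hypothetical sketch: client-side failover across redundant service endpoints.
# The endpoint names and health-check callable are assumptions for illustration.

def first_available(endpoints, is_healthy):
    """Return the first endpoint that passes its health check, or None.

    `endpoints` is an ordered list (primary first, then fallbacks);
    `is_healthy` is any callable that probes an endpoint and returns a bool.
    """
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # Treat a failed probe the same as an unhealthy endpoint.
            continue
    return None  # Every endpoint is down: degrade gracefully and alert someone.

# Example: the primary region is down, so traffic falls back to the secondary.
regions = ["primary.example.com", "secondary.example.com"]
status = {"primary.example.com": False, "secondary.example.com": True}
print(first_available(regions, lambda ep: status[ep]))
```

The point isn't the ten lines themselves but the posture they encode: your application, not the provider, decides what happens when a region disappears.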
What's missing, though, is a reasonable and objective measure of cloud computing provider availability. All we can do now is search through the news archives to see when major failures have occurred. We don't really have a comprehensive, third-party benchmarking and grading scheme to help prospective customers evaluate which cloud service platforms have been the most reliable. The meaningless SLAs offered by cloud providers aren't going to help you either.
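To see why those SLAs are so thin, it helps to do the arithmetic on what an availability percentage actually permits. The 99.95 percent figure below is a common cloud SLA tier used purely as an example, not a quote from any particular provider's contract.

```python
# Illustrative arithmetic: downtime an availability SLA permits per period.
# The SLA percentages below are common example tiers, not any vendor's terms.

def allowed_downtime_minutes(sla_percent, period_hours=30 * 24):
    """Minutes of downtime allowed per period (default: a 30-day month)."""
    return period_hours * 60 * (1 - sla_percent / 100.0)

print(round(allowed_downtime_minutes(99.95), 1))  # ~21.6 minutes per month
print(round(allowed_downtime_minutes(99.9), 1))   # ~43.2 minutes per month
```

A multi-hour outage blows through a 99.95 percent monthly budget many times over, and the typical remedy is a modest service credit -- which is exactly why such SLAs tell you little about which platforms have actually been the most reliable.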