This past week saw another high-profile cloud infrastructure failure as Microsoft's Azure Compute service suffered an extended outage. As with previous incidents, such as last April's Amazon EC2 outage, this one set off a fresh round of hand-wringing about the future of the cloud and the wisdom of using it.
Each time something like this happens, however, I'm nonplussed that anyone is actually surprised.
Yes, megascale cloud implementations such as those fielded by Amazon, Microsoft, and Rackspace are designed with many layers of overlapping redundancy to prevent this sort of problem, but why is anyone shocked that some combination of unforeseen circumstances could lead to widespread failure? Far be it from me to defend Microsoft here (I mean, leap year -- really?), but you cannot design a perfect, infallible system. Such a system simply doesn't exist.
Much of life in IT revolves around planning for the failure of the things we build -- and for how we'll cope when it happens. Working with the cloud is no different. Just because someone else is investing millions of dollars in creating an ultraredundant and hyperscalable infrastructure doesn't free you from having to construct your own game plan for keeping your services available when -- not if -- there's a failure.