The failure behind the Amazon outage isn't just Amazon's

It's up to cloud users to figure out how to remove risk from their cloud implementations -- like they used to do within IT

When Amazon.com's outage last week -- specifically, the failure of its EBS (Elastic Block Store) subsystem -- left popular websites and services such as Reddit, Foursquare, and Hootsuite crippled or outright disabled, the blogosphere blew up with noise around the risks of using the cloud. Although a few defenders spoke up, most of these instant experts panned the cloud and Amazon.com. The story was huge, covered by the New York Times and the national business press; Amazon.com is now "enjoying" the same limelight that fell on Microsoft in the 1990s. It will be watched carefully for any weakness and rapidly kicked when issues occur.

It's the same situation we've seen since we began to use computers: They are not perfect, and from time to time, hardware and software fail in such a way that outages occur. Most cloud providers, including Amazon.com, have spent a lot of time and money to create advanced multitenant architectures and infrastructures to reduce the number and severity of outages. But to think that all potential problems have been eliminated is naive.


Some of the blame around the outage has to go to those who made Amazon.com a single point of failure for their organizations. You have to plan and create architectures that can work around the loss of major components to protect your own services, as well as make sure you live up to your own SLA requirements.

Although this incident does indeed show weakness in the Amazon.com cloud, it also highlights liabilities in those who've become overly dependent on Amazon.com. The affected companies need to create solutions that can fail over to a secondary cloud or locally hosted system -- or they will again risk a single outage taking down their core moneymaking machines. I suspect the losses around this outage will easily track into the millions of dollars. 

Never trust a single system component, be it a cloud, a network, a router, a database, or whatever. Figure out what to do when a component goes offline or fails in other ways. The typical solution is to fail over to secondary components that can operate until the primary is back online. That used to be a given in IT. Unfortunately, many organizations have put too much trust in clouds, pushing their systems out to providers on the mistaken assumption that a third party will provide the resiliency and the redundancy they require.
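
To make the failover idea concrete, here is a minimal sketch in Python. It is only an illustration, not how any particular affected site was built: the names `call_with_failover`, `primary_cloud_store`, `secondary_store`, and `AllBackendsFailed` are hypothetical, and the two "stores" are stand-in functions rather than real cloud API calls.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllBackendsFailed(Exception):
    """Raised when every backend in the chain has failed."""


def call_with_failover(backends: Sequence[Callable[[], T]]) -> T:
    """Try each backend in order and return the first successful result.

    The first entry is treated as the primary; later entries are
    secondaries that are used only if everything before them fails.
    """
    errors: List[Exception] = []
    for backend in backends:
        try:
            return backend()
        except Exception as exc:  # hypothetical: a real system would catch narrower errors
            errors.append(exc)
    raise AllBackendsFailed(errors)


# Hypothetical stand-ins for a cloud storage service and a fallback.
def primary_cloud_store() -> str:
    raise ConnectionError("primary storage unavailable")


def secondary_store() -> str:
    return "served from secondary"


result = call_with_failover([primary_cloud_store, secondary_store])
print(result)
```

The point of the sketch is the ordering: the caller never depends on one component alone, and a total failure is reported explicitly instead of surfacing as whatever error the last backend happened to raise. A production design would also need health checks, data replication between primary and secondary, and a way to fail back.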

As we've seen so dramatically, clouds have limitations, too. Don't get mad at that fact -- just deal with it. 

This article, "The failure behind the Amazon outage isn't just Amazon's," originally appeared at InfoWorld.com. Read more of David Linthicum's Cloud Computing blog and track the latest developments in cloud computing at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.

Copyright © 2011 IDG Communications, Inc.
