"Our architecture avoids using EBS as our main data storage service, and the SimpleDB, S3, and Cassandra services that we do depend upon were not affected by the outage," Netflix engineers wrote in their "Lessons Netflix Learned From the AWS Outage" blog post. Stateless services and multiple redundant hot copies of data across availability zones were key to avoiding AWS cloud fail pain.
Think you have to be a Netflix-size business to stay safe? Think again. Twilio, a company that helps developers integrate communications into their Web apps, uses Amazon's EC2 to host the core of its infrastructure -- yet April's outage had little to no impact on its stability.
"The fundamental premise of building on the cloud is assuming that the network will have glitches," says Evan Cooke, Twilio's co-founder and chief technology officer. "We built an infrastructure around the idea that a host can and will fail, so we don't rely on any single machine or single component in the core architecture itself."
Colossal cloud outage No. 2: The Sidekick shutdown
Smartphones make it easy to access your data on the go, but just because something has "smart" in its name doesn't mean it can't be dumb. Case in point: the T-Mobile Sidekick screwup, circa fall 2009.
Remember this fiasco? The Sidekick's data service, run by Microsoft subsidiary Danger, suffered a nearly week-long outage that left users without access to email, calendar info, and other personal data. Then, adding insult to injury, Microsoft confessed it had completely lost the cloud-stored bits and wouldn't be able to restore them. Evidently, the good ol' gang from Redmond had forgotten to make backups.
The technology may have evolved since then, but the lesson remains the same: When it comes to crucial data, never assume someone else is automatically protecting you. Make sure you understand your cloud provider's disaster recovery setup -- better yet, make your own arrangements to back up your important data independently.
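What might "your own arrangements" look like at its simplest? The Python sketch below copies a file to a location you control and confirms the copy matches the original by comparing SHA-256 checksums. It is an illustration only: the helper names `sha256_of` and `backup_and_verify` and the paths are assumptions, and a real setup would first export the data from the provider and store the copy somewhere that does not share the provider's fate.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir` and confirm the copy is intact.

    `backup_dir` stands in for independent storage (hypothetical here);
    it is created if it does not exist yet.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copies contents and file metadata

    if sha256_of(source) != sha256_of(target):
        # A backup that does not match the original is worse than none.
        target.unlink()
        raise OSError(f"backup of {source} failed checksum verification")
    return target


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    exported = Path("exported_contacts.csv")
    if exported.exists():
        print(backup_and_verify(exported, Path("independent_backup")))
```

The verification step is the part worth keeping even in a more elaborate setup: a backup nobody has checked is only an assumption that the data is safe.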
"The same operational rules apply even in the cloud," says Ken Godskind, vice president of monitoring products for AlertSite, a SmartBear company. "Organizations using the cloud can't just assume that because it's in the cloud, all the responsibility for business continuity planning has somehow been transferred to the provider."
Colossal cloud outage No. 3: Gmail fail
Of all cloud services, Google's Gmail presents one of the more likely threats to Microsoft's on-premises stranglehold on the enterprise. Replace your high-maintenance Exchange servers with a cheap, dependable email service backed by Postini. What's not to like?
A rash of irksome outages, the most recent of which had 150,000 Gmail users signing into their accounts only to find blank slates -- no emails, no folders, nothing that indicated they were actually looking at their own inboxes. To Google's credit, it provided regular updates and promised a quick fix. But repairs took as long as four days for some of the affected users.