While sites including Foursquare, Reddit, Quora and Hootsuite went offline, the success of photo-sharing site SmugMug shows how planning ahead can help customers survive what SmugMug CEO Don MacAskill called the "Amazonpocalypse."
SmugMug spread its systems across three availability zones and decided not to use Amazon's Elastic Block Store (EBS) service because of "unpredictable performance and sketchy durability," MacAskill wrote on his blog. The storage service played a key role in last week's failure.
If you're putting mission-critical applications in the cloud, MacAskill advises spreading them across multiple Amazon regions (East Coast and West Coast, for example) or multiple cloud providers.
Amazon's load-balancing service doesn't work across regions, so customers have to do some extra work on their own and use third-party software to make it happen, says Gartner analyst Drue Reeves. Spreading applications across multiple cloud vendors, meanwhile, is possible but difficult because of a lack of standards and interoperability.
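To make the "extra work" concrete, here is a minimal Python sketch of the kind of cross-region failover logic a customer or third-party tool might supply. It is an illustration only, not how Amazon, SmugMug or any vendor named here does it. The region names, the example.com health-check URLs and the functions `http_health_check` and `pick_region` are all hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional, Sequence, Tuple


def http_health_check(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with an HTTP 2xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection errors, timeouts, HTTP errors and malformed URLs
        # all count as "unhealthy" in this sketch.
        return False


def pick_region(
    endpoints: Sequence[Tuple[str, str]],
    is_healthy: Callable[[str], bool] = http_health_check,
) -> Optional[str]:
    """Return the first region, in priority order, whose endpoint is healthy.

    `endpoints` is a list of (region_name, health_check_url) pairs.
    Returns None when every region fails its check.
    """
    for region, url in endpoints:
        if is_healthy(url):
            return region
    return None


# Hypothetical endpoints; a real deployment would use its own URLs.
ENDPOINTS = [
    ("us-east", "https://east.example.com/health"),
    ("us-west", "https://west.example.com/health"),
]

# Simulate an East Coast outage with a stub instead of real network calls.
east_is_down = lambda url: "east" not in url
print(pick_region(ENDPOINTS, is_healthy=east_is_down))
```

In practice the chosen region would feed a DNS update or a traffic manager rather than a print statement, and the check would run on a schedule from more than one vantage point.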
Rackspace, another infrastructure-as-a-service provider, recently began offering a Cloud Load Balancers service that protects applications against the failure of a single server. But the load balancer does not spread applications across different data centers.
Josh Odom, who leads product development for Rackspace's cloud platform, notes that running an application in multiple data centers is the best way to guarantee 100% uptime, and Rackspace tries to make it easy for customers to use third-party load balancing and failover products to achieve that.
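A rough back-of-the-envelope calculation shows why a second data center helps so much. The Python sketch below assumes each site fails independently and uses made-up availability figures; neither the numbers nor the function name `combined_availability` come from Rackspace or Amazon. Under this simple model, uptime approaches 100% as sites are added but never quite reaches it.

```python
def combined_availability(per_site: float, sites: int) -> float:
    """Probability that at least one of `sites` independent sites is up.

    Assumes every site has the same availability and that failures are
    independent, which real outages do not always respect.
    """
    if not 0.0 <= per_site <= 1.0:
        raise ValueError("per_site must be between 0 and 1")
    if sites < 1:
        raise ValueError("sites must be at least 1")
    return 1.0 - (1.0 - per_site) ** sites


HOURS_PER_YEAR = 24 * 365

# Illustrative figure only: a single site that is up 99.9% of the time.
for n in (1, 2, 3):
    availability = combined_availability(0.999, n)
    downtime_hours = (1.0 - availability) * HOURS_PER_YEAR
    print(f"{n} site(s): {availability:.7%} up, "
          f"about {downtime_hours:.4f} hours down per year")
```

The independence assumption is the weak point: a shared software bug or a misconfiguration pushed to every site at once can take down all of them together.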
The biggest challenge isn't the application itself but the data, Odom says. "Any kind of database replication with relational database systems is fairly cumbersome," he says. "We're trying to lower those barriers."
Rackspace's Texas data center suffered a few power outages in 2009, forcing the company to issue service credits to customers. The company has since brought in new data center experts and performed top-to-bottom audits of the facilities, Odom says. Despite past problems, Odom says Rackspace data centers are designed to withstand "catastrophic failures" including the loss of major power sources or network capacity.
While disaster recovery planning in infrastructure as a service requires some tech expertise, not all cloud services are geared toward experts. Platform-as-a-service offerings -- such as Microsoft's Windows Azure or Google App Engine -- are designed to minimize involvement with the underlying infrastructure and give developers a relatively simple way to build and host Web applications.
But load balancing and the ability to fail over from one data center to another are still a big plus in platform-as-a-service clouds.