First, a recap of what happened last year: A few weeks after the outage, AWS released a post-mortem report detailing what caused the disruption and the steps the company took immediately afterward. Basically, human error started the chain reaction. In the wee hours of the morning of April 21, 2011, during an attempted upgrade in the East Coast region of the company's Elastic Block Store (EBS) service, a storage feature that links in with the company's Elastic Compute Cloud (EC2) offering, part of the EBS network was switched to a lower-capacity network that wasn't prepared to handle the EBS system's traffic. The EBS nodes attempted to rectify the problem themselves, causing a network traffic jam that soon spilled over into another AWS feature, the Relational Database Service (RDS), a managed database offering that relies on EBS for storage. In all, about 13 percent of the EBS nodes in the affected area were impacted by the outage, and after the four-day event, 0.07 percent of the impacted data was permanently lost.
Experts say AWS has made improvements to its system since then, but it's unclear just how substantial they are. For example, in the post-mortem report the company says it audited its change process and increased its use of automation tools for updates to reduce the risk of human error. Drue Reeves, a Gartner analyst who tracks the cloud industry and AWS, says the company has boosted its primary and secondary EBS networks to handle heavier network loads. "It's made EBS more resilient," he says. "They've taken some steps to rectify the situation to make sure this instance doesn't happen again, but that doesn't mean we won't have other outages."
The company says it has taken steps to ensure that problems in one area don't spill over and bring down other services, and Amazon says it has made it easier for customers to build fault-tolerant systems using AWS products. The company's secrecy about the architecture of its cloud operations makes it difficult to assess security vulnerabilities, though, Reeves says.
AWS promised to be more forthright about outages in the future. Hours into last April's downtime, the company had said only that a "networking event" had occurred, frustrating many customers who wanted information on what had happened and estimates of when services would be back up. A spokesperson for AWS wrote in an email that the post-mortem report details a number of changes the company has made: "This included software fixes as well as new features including EC2 Instance Status Monitoring and EBS Volume Status, which provide customers the information they need in order to understand the full health of their resources running in AWS."
The company has also stressed what customers can do themselves to protect against outages. At the crux of this is the idea of availability zones (AZs). AWS has eight global regions where customers can store data, including the U.S. East Coast region, where this outage was centered. Within each region are availability zones, which are physically separate, independent infrastructures meant to keep data highly available. In what amounted to a polite 'I told you so,' AWS makes clear in its recap of the outage that customers who backed up their data in multiple AZs were less affected. Of customers using the Amazon Relational Database Service, 45 percent of those who used a single AZ were impacted by the outage, but only 2.5 percent of those using multi-AZ deployments were, according to AWS. Since the outage, the company has sponsored a series of whitepapers and webinars advising customers on how to architect multi-AZ systems.
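To put the two RDS figures side by side, here is a minimal back-of-the-envelope sketch in Python. It uses only the percentages AWS reported; the function name `relative_impact` is made up for illustration and is not part of any AWS tool or API.

```python
# Illustrative arithmetic only: compares the two impact rates AWS reported
# for RDS customers. `relative_impact` is a hypothetical helper name.

def relative_impact(single_az_pct: float, multi_az_pct: float) -> float:
    """Return how many times more often single-AZ customers were affected."""
    if multi_az_pct <= 0:
        raise ValueError("multi_az_pct must be positive")
    return single_az_pct / multi_az_pct


SINGLE_AZ_PCT = 45.0  # percent of single-AZ RDS customers affected, per AWS
MULTI_AZ_PCT = 2.5    # percent of multi-AZ RDS customers affected, per AWS

ratio = relative_impact(SINGLE_AZ_PCT, MULTI_AZ_PCT)
print(f"Single-AZ RDS customers were about {ratio:.0f} times as likely to be affected.")
```

By this rough measure, a single-AZ RDS customer was about 18 times as likely to be hit as one running across multiple AZs. The comparison says nothing about how severe the impact was for either group.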