Lessons learned from the recent AWS S3 outage

Cross-region replication and backups can help applications survive regional cloud service outages.

Amazon S3 underpins many AWS services, including AWS Lambda, Elastic Beanstalk, and Amazon’s own Service Health Dashboard. It also serves as an object and media store for many other internet services that rely on it every day.

On February 28, 2017, AWS experienced an hours-long outage of the Amazon S3 service in the US-EAST-1 region. That set off a cascade of outages across a good chunk of the internet, including services like Docker Hub.

A human error turned out to be the root cause:

At 9:37 AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that are used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly, and a larger set of servers was removed than intended.

As it turns out, there is a common misconception about the difference between durability and availability. Durability measures how reliable the storage is and answers the question “Am I going to lose my data?” Availability, on the other hand, measures how accessible the data is, i.e. “Am I going to be able to retrieve my data?”

AWS S3 offers 99.999999999% durability within a single region. To use Amazon’s own example, if you store 10,000 objects in S3, you can expect to lose a single object, on average, once every 10 million years. Amazon S3 accomplishes this by replicating the data across multiple facilities within a region.
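The arithmetic behind that example can be checked in a few lines. This is only a sketch: the function name is made up for illustration, and it assumes the durability figure is an annual, per-object number and that object losses are independent.

```python
from fractions import Fraction

# Annual per-object durability quoted above (eleven nines), kept as an exact
# fraction so that floating-point rounding does not blur such a tiny number.
DURABILITY = Fraction("0.99999999999")


def years_per_lost_object(object_count, durability=DURABILITY):
    """Average number of years between single-object losses.

    Assumes losses are independent and the durability figure is annual.
    """
    expected_losses_per_year = object_count * (1 - durability)
    return 1 / expected_losses_per_year


print(years_per_lost_object(10_000))
```

With 10,000 objects the expected loss is one ten-millionth of an object per year, which is where the 10-million-year figure comes from.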

Standard S3 availability of objects, on the other hand, is 99.99% per year within a region. That means that in any given 12-month period you should expect a total of about 52 minutes and 33 seconds of not being able to access your data.
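The downtime figure follows directly from the availability percentage. Here is a minimal sketch of the conversion, assuming a 365-day year; the helper name is hypothetical.

```python
from datetime import timedelta
from fractions import Fraction

ONE_MICROSECOND = timedelta(microseconds=1)


def allowed_downtime(availability_percent, period=timedelta(days=365)):
    """Total downtime implied by an availability percentage over a period."""
    unavailable_fraction = 1 - Fraction(availability_percent) / 100
    period_us = period // ONE_MICROSECOND  # whole microseconds in the period
    return timedelta(microseconds=round(unavailable_fraction * period_us))


print(allowed_downtime("99.99"))
```

The same function shows how quickly the budget shrinks or grows: each extra nine cuts the allowed downtime by a factor of ten.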

AWS offers both IaaS and PaaS services. At the IaaS level, AWS customers have full control over the virtual servers and networks. They can configure any software and service they desire, and they manage it on their own. Any outage of that software is the responsibility of the customer.

At the PaaS level, AWS offers fully managed platform services such as object storage, databases, queues and so on. The client delegates the responsibility for availability and durability of these services to the managed service provider -- AWS in this case. AWS platform services that are consumed through AWS’s proprietary APIs are particularly vulnerable to a regional outage caused by human error at AWS.

Human error can cause an outage anywhere -- on-premises, in the cloud, managed, or self-hosted. Consider the recent Delta computer outage as an example of an entire self-hosted system going down. Delegating the responsibility for managing a platform service to a cloud provider doesn’t change the fact that human error can bring it down -- but it does amplify the impact. Whereas the Delta outage only impacted Delta, the AWS S3 outage impacted a good chunk of the internet.

Fortunately, AWS S3 offers ample tools for reducing the impact of an outage. Let’s consider just a few.

S3 cross-region replication

Data stored in a particular S3 region is replicated across multiple availability zones and can survive an outage in any one of them. It can’t, however, survive an outage of an entire region, such as the one that happened on February 28th. Replicating S3 objects to a bucket in another geographic region adds the redundancy needed to ride out a regional outage.
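As a rough illustration, cross-region replication can be switched on with a single bucket-level configuration. The sketch below uses boto3's put_bucket_replication call; the bucket names, the IAM role ARN and the rule ID are placeholders, and it assumes versioning is already enabled on both buckets and that the role allows S3 to replicate on your behalf.

```python
def build_replication_config(role_arn, destination_bucket):
    """Replication configuration that copies every new object to another bucket."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [
            {
                "ID": "replicate-everything",  # placeholder rule name
                "Prefix": "",                  # empty prefix matches all keys
                "Status": "Enabled",
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


def enable_replication(source_bucket, role_arn, destination_bucket):
    """Apply the configuration; needs boto3, credentials and versioned buckets."""
    import boto3  # imported here so the builder above works without boto3

    boto3.client("s3").put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration=build_replication_config(role_arn, destination_bucket),
    )
```

Replication is asynchronous and applies to objects written after the rule is enabled, so the application still has to know how to read from the replica bucket when the primary region is unavailable.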

Backups

Cross-region replication can help increase availability. Backups to Amazon Glacier can contribute to increased durability. Conveniently, AWS offers an automatic mechanism to back up objects in S3 to Glacier.
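One way to set up the automatic S3-to-Glacier mechanism is a bucket lifecycle rule. The following is a sketch only, using boto3's put_bucket_lifecycle_configuration call; the rule ID, the 30-day threshold and the bucket name are assumptions chosen for illustration.

```python
def build_glacier_lifecycle(days_until_archive=30, prefix=""):
    """Lifecycle configuration that moves matching objects to Glacier."""
    return {
        "Rules": [
            {
                "ID": "archive-to-glacier",    # placeholder rule name
                "Filter": {"Prefix": prefix},  # empty prefix matches all keys
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days_until_archive, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


def apply_glacier_lifecycle(bucket, days_until_archive=30):
    """Apply the rule to a bucket; needs boto3 and AWS credentials."""
    import boto3  # imported here so the builder above works without boto3

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=build_glacier_lifecycle(days_until_archive),
    )
```

A transition rule changes the storage class of the object it matches, so it is worth deciding which buckets or prefixes should be archived this way and how quickly they may need to be restored.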

Consider content distribution with CloudFront

If your S3 objects are frequently accessed, it may make sense to configure Amazon CloudFront to serve objects from S3. CloudFront caches objects at edge locations close to the users who request them, so already-cached content can keep being served and may help alleviate the effects of an S3 outage in some use cases.

Final thoughts

Managed platform services are the cornerstone of cloud services. Using one like S3 can reduce DevOps costs and help bring applications to market faster. While AWS has been extremely reliable over the years, Amazon has experienced self-inflicted outages in the past. The recent S3 outage is no exception. Some combination of cross-region replication, backups and content distribution should reduce the impact of such outages.

Copyright © 2017 IDG Communications, Inc.