For cloud customers willing to pony up a little extra cash, Amazon has an enticing proposition: Spread your application across multiple availability zones for a near-guarantee that it won't suffer from downtime.
Customers who build applications in just one availability zone are more likely to suffer outages. But what happens when multiple availability zones go dark at the same time? We found out today when an outage forced websites like Foursquare, Reddit, Quora, and Hootsuite offline.
"We can confirm connectivity errors impacting EC2 instances and increased latencies impacting EBS (Elastic Block Storage) volumes in multiple availability zones in the US-EAST-1 region," Amazon said Thursday on its service health dashboard.
The US-EAST-1 region, based in northern Virginia, is one of several Amazon regions around the world. There's another one in northern California. Amazon started reporting troubles at 4:41 a.m. Eastern time. By 1:26 p.m., Amazon said it was "now seeing significantly reduced failures and latencies," but that problems were still ongoing. Amazon blamed a "networking event" that "triggered a large amount of re-mirroring" of storage volumes, creating a capacity shortage.
Each region contains multiple availability zones -- but little information about each one is known, according to Gartner analyst Drue Reeves. There are four availability zones within the Virginia region, Reeves says. But are they in different data centers? How far apart are they? How is data replicated across zones? Reeves says Amazon hasn't been transparent about these questions. Not knowing the answers makes it difficult for customers to know which methods of building high availability into applications will be most effective.
"Amazon has said for years that they run multiple availability zones within a region to prevent the outage of an entire region," Reeves said. "But yet here we are, and we have an outage inside EC2 for an entire region."
An Amazon spokesperson hasn't yet responded to a request for comment.
Perhaps tellingly, Amazon's service-level commitment provides 99.95 percent availability for each region -- but not for each availability zone. This is good enough for many customers but well below the "five nines" (99.999 percent) standard of high availability.
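To put those percentages in concrete terms, the short Python sketch below converts an availability figure into the downtime it allows over a year. It is an illustration only: it assumes availability is measured over a 365-day year, which may not match how Amazon's agreement actually calculates it, and the function name is made up for this example.

```python
# Convert an availability percentage into the downtime it permits per year.
# Assumption: a 365-day measurement period (8,760 hours), ignoring leap years.
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the hours per year a service may be down at this availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


# Amazon's regional commitment cited in the article.
sla_hours = downtime_hours_per_year(99.95)

# The "five nines" standard, expressed in minutes for readability.
five_nines_minutes = downtime_hours_per_year(99.999) * 60

print(f"99.95%  -> about {sla_hours:.2f} hours of downtime per year")
print(f"99.999% -> about {five_nines_minutes:.2f} minutes of downtime per year")
```

Under that assumption, 99.95 percent leaves room for roughly 4.38 hours of downtime a year, while five nines allows only about 5.26 minutes.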
In describing the availability zones on the EC2 website, Amazon says they are "distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region."