Amazon has posted an essay-length explanation of the cloud outage that took some of the Web's most popular services offline last week. In summary, it appears that human error during a system upgrade meant a redundant backup network for EBS (Elastic Block Store) accidentally received all the network traffic for the U.S. East Region, overloading it and jamming up the system.
At the end of a long battle to restore services, Amazon says it managed to recover most data, but 0.07 percent "could not be restored for customers in a consistent state". A rather miserly 10-day usage credit is being given to affected customers, although they should check their AWS (Amazon Web Services) control panel to see if they qualify. No doubt several users are also consulting the AWS terms and conditions right now, if not lawyers.
A software bug played a part, too. Although unlikely to occur in normal EBS usage, it became a substantial problem because of the sheer volume of failures occurring. Amazon also says its monitoring systems were not "fine-grained enough" to spot issues that arose while other, louder alarm bells were ringing.
Amazon calls the outage a "re-mirroring storm." EBS is essentially the storage component of EC2 (Elastic Compute Cloud), which lets users hire computing capacity in Amazon's cloud service.
EBS works via two networks: a primary one, and a secondary network that's slower and used for backup and intercommunication. Both are composed of clusters of nodes, and each node acts as a separate storage unit.
There are always two copies of a node, meant to preserve data integrity; this is called re-mirroring. Crucially, if a node is unable to find a partner node to back up to, it gets stuck and keeps searching until it finds a replacement. Similarly, new nodes must also find a partner to become valid, and get stuck until they succeed.
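In rough terms, that blocking behavior can be sketched as a toy model. Everything here -- `Node`, `remirror`, the capacity bookkeeping -- is illustrative and invented for this article, not Amazon's actual implementation:

```python
import time

class Node:
    """Toy stand-in for an EBS storage node with finite mirror capacity."""
    def __init__(self, free_slots=1):
        self.free_slots = free_slots   # room available to host a mirror copy
        self.partner = None            # the node holding our second copy
        self.io_enabled = True

    def has_capacity(self):
        return self.free_slots > 0

def remirror(node, cluster, max_tries=50):
    """A node that loses its partner blocks all I/O until it finds a new one.
    Returns True once re-mirrored, False if it gave up after max_tries."""
    node.io_enabled = False                    # data access stops while unmirrored
    for _ in range(max_tries):
        for candidate in cluster:
            if candidate is not node and candidate.has_capacity():
                candidate.free_slots -= 1      # claim space for the mirror copy
                node.partner = candidate
                node.io_enabled = True         # mirrored again; I/O resumes
                return True
        time.sleep(0.001)                      # normally resolves in milliseconds
    return False                               # cluster full: node stays stuck
```

With spare capacity the loop exits almost instantly; with a full cluster -- as when the secondary network filled up -- the node simply stays stuck with I/O disabled, which is the behavior that snowballed into the outage.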
It appears that during a routine system upgrade, all network traffic for the U.S. East Region was accidentally sent to the secondary network. Being slower and of lower capacity, the secondary network couldn't handle this traffic. The error was realized and the changes rolled back, but by that point the secondary network had been largely filled -- leaving some nodes on the primary network unable to re-mirror successfully. When unable to re-mirror, a node stops all data access until it's sorted out a backup, a process that ordinarily takes milliseconds but -- it would transpire -- would now take days, as Amazon engineers fought to fix the system.
Because of the re-mirroring storm, it also became difficult to create new nodes, something that happens routinely in everyday EC2 usage. In fact, so many node-creation requests piled up unserviced that the EBS control system itself became partially unavailable.
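The compounding effect can be illustrated with a toy queueing model. The function, its parameters, and the numbers below are invented for illustration; the point is only that once retried requests exceed service capacity, the backlog grows without bound:

```python
def control_plane_backlog(steps, capacity, arrivals_per_step):
    """Toy model of a request queue: anything the control plane can't
    service in a step is retried, so it carries over to the next step.
    Returns the backlog remaining after each step."""
    backlog = 0
    history = []
    for _ in range(steps):
        backlog += arrivals_per_step           # new plus retried requests
        backlog -= min(backlog, capacity)      # service what we can
        history.append(backlog)
    return history

# Normal load -- arrivals fit within capacity, so no backlog ever forms:
print(control_plane_backlog(5, capacity=10, arrivals_per_step=8))   # [0, 0, 0, 0, 0]
# Storm -- stuck nodes keep re-requesting, and the backlog compounds:
print(control_plane_backlog(5, capacity=10, arrivals_per_step=15))  # [5, 10, 15, 20, 25]
```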
Amazon engineers then turned off the ability to create new nodes, essentially putting the brakes on EBS (and therefore EC2 -- this is probably the moment at which many websites and services went offline). Things began to improve, but that's when a software bug struck: when many EBS nodes close their requests for re-mirroring at the same time, they fail. The issue had never reared its head before because there had never been a situation in which so many nodes were closing requests simultaneously.
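Amazon hasn't published the bug itself, but the symptom it describes -- code that works fine serially yet fails when many operations complete at once -- is characteristic of a check-then-act race. The following is a deliberately simplified, purely hypothetical illustration of that bug class; none of these names or structures come from Amazon:

```python
class MirrorRequestTable:
    """Hypothetical tracker of open re-mirroring requests, written with a
    latent check-then-act race: the membership test and the removal are
    separate steps, so concurrent closers can interleave between them."""
    def __init__(self):
        self.open_requests = set()

    def submit(self, req_id):
        self.open_requests.add(req_id)

    # The two halves of an unsynchronized close, split out so the
    # dangerous interleaving can be shown deterministically below.
    def check(self, req_id):
        return req_id in self.open_requests

    def remove(self, req_id):
        self.open_requests.remove(req_id)   # KeyError if already removed

table = MirrorRequestTable()

# Serial close -- check, then remove -- always works.
table.submit("vol-1")
if table.check("vol-1"):
    table.remove("vol-1")

# Two concurrent closers interleave: both pass the check before either
# removes, so the second removal blows up.
table.submit("vol-2")
closer_a_saw_it = table.check("vol-2")
closer_b_saw_it = table.check("vol-2")   # B checks before A has removed
table.remove("vol-2")                    # A's remove succeeds
try:
    table.remove("vol-2")                # B's remove fails
    crashed = False
except KeyError:
    crashed = True
```

The fix for this class of bug is to make the check and the act a single atomic step (a lock, or an idempotent removal) -- which matches Amazon's account of patching the fault only after the storm exposed it.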