Software error complicates Amazon's data center recovery

The storage snapshot management system incorrectly deleted data, but it should be recoverable

Amazon Web Services' efforts to restore service following a power outage at its Dublin data center were complicated further on Monday by an error in the EBS (Elastic Block Store) software, the company said.

As a result of the software error, the EBS snapshot management system incorrectly thought some of the blocks were no longer being used and deleted them, Amazon wrote on its Service Health Dashboard at 3:11 PM PDT.

The company has addressed the error to prevent it from recurring and also disabled all of the snapshots that contain these missing blocks. Amazon will send emails to affected customers as soon as it has a new copy of their snapshots available, which can then be used to recover the data, it said.

At 10:01 PM PDT, Amazon said that it had recovered all of the EBS volumes and EC2 (Elastic Compute Cloud) instances that it was able to verify were fully consistent at the time of the power outage.

However, the company was unable to verify whether there were any in-flight writes that did not get consistently saved to some EBS volumes. To remedy that, it has started creating recovery snapshots for the affected volumes. As they become available, the snapshots will be added to users' accounts, according to Amazon. The process will be time-consuming and may take up to 24 hours to fully complete, it said.

The availability problems started after lightning struck a transformer, sparking an explosion and fire that caused the power outage at 10:41 AM PDT on Sunday.

The last few days have not been very merry for Amazon Web Services. Besides the issues in Dublin, Amazon also suffered a brief outage in the U.S. as a result of network connectivity issues.

Send news tips and comments to mikael_ricknas@idg.com.
