Last week, Amazon Web Services, IBM, and Rackspace rebooted their clouds to deal with maintenance issues around the Xen hypervisor. AWS is built on a customized version of Xen, and Xen is also part of IBM SoftLayer’s and Rackspace’s clouds.
Not all providers have to reboot their clouds to perform upgrades or maintenance. Google and EMC VMware support live migration, which keeps internal changes invisible to users and avoids the kind of reboots the Xen providers needed.
Although I saw minimal impact from these cloud reboots, some users had fits, complaining about the outage and the providers' communications.
What can you learn from these incidents? Three lessons come to mind.
First, understand the limitations of the architectural components your provider uses -- such as Xen. Ask how often these kinds of reboots will occur, and how the provider handles transparent maintenance.
Second, make sure to consider the lines of communication between the cloud provider and your enterprise. Providers often drop the ball here. Users are often unhappy not about the reboot itself, but because they got little or no heads-up about it.
Third, be practical about your expectations of your cloud provider. Reboot issues come up, just as they do with your own internal systems. You need to plan for such disruptions.
As public cloud providers work through these issues and learn from their mistakes, they will become better at their operations. Remember that outages and other disruptions are few and far between these days, so don’t let the rare event throw you off stride.