Azure has stumbled a few times lately. In August, InfoWorld reported on service interruptions in the Microsoft cloud platform across its services, virtual machines, and hosted websites. Two weeks ago, Azure services were down for about 11 hours. Though unacceptable, it's not uncommon to have these kinds of outages.
We've seen similar outages from AWS (Amazon Web Services), such as the one last year that took out Vine, Instagram, and Amazon.com itself. But we learn from each outage, and this particular one teaches plenty. You can see that in the five types of responses the latest Azure outage has elicited.
1. The Microsoft response
The day after the outage, Microsoft explained exactly what had happened. Apparently, a performance update for Azure Storage, initially tested on Azure's table-based storage, caused problems when it reached Azure's blob-based storage, with the "storage blob front ends going into an infinite loop," preventing them from taking on further traffic. The service froze, essentially.
The change was rolled back, but the front ends had to be rebooted, and the process of getting everyone back up and running took time. Microsoft issued the standard mea culpa, promising to improve recovery and to enforce slower rollout batches to lessen issues of this sort in the future.
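For readers who want to picture what "slower rollout batches" means in practice, here is a minimal sketch in Python. It is not Microsoft's deployment system; the function name `staged_rollout`, the host names, and the `apply_update`, `is_healthy`, and `rollback` callbacks are all hypothetical stand-ins. The idea is simply to update a few front ends at a time, check them, and stop and undo the change as soon as a batch looks unhealthy, so a bad update never reaches the whole fleet.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    batch_size: int = 2,
) -> List[str]:
    """Update hosts in small batches (hypothetical helper, for illustration).

    Returns the hosts left running the update: all of them on success,
    or an empty list if a batch failed its health check and was rolled back.
    """
    updated: List[str] = []
    for start in range(0, len(hosts), batch_size):
        batch = list(hosts[start:start + batch_size])
        for host in batch:
            apply_update(host)
        updated.extend(batch)

        # Check the batch before touching any more hosts.
        if not all(is_healthy(host) for host in batch):
            # Halt the rollout and undo every host updated so far,
            # most recent first.
            for host in reversed(updated):
                rollback(host)
            return []
    return updated
```

With six front ends and a batch size of two, a fault that shows up on the fourth host stops the rollout after the second batch; the last two hosts are never touched. Smaller batches mean a slower rollout, but also a smaller blast radius when something goes wrong.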
2. The customer response
Customers, understandably, were angry. They also were a touch frustrated. The outage itself doesn't necessarily cause all the anger and frustration -- it's the lack of communication that does it, as InfoWorld's Caroline Craig detailed in her post "In a cloud outage, no one can hear you scream." Guess what? All your Twitter followers can hear you scream.
Even though Microsoft has an Azure status page, it doesn't help anyone if you're offline or if the status isn't indicating a problem.
Clearly, Microsoft and other cloud providers need a better response for customers when an outage occurs. After all, outages are "inevitable," says Microsoft's chief reliability strategist, David Bills. As Ray Suelzer commented on Microsoft's Azure blog, "We understand that [outages] happen, but the human side of this could have been handled much better." I agree.
3. The cloud naysayer response
Any time we see such an outage from any cloud vendor, inevitably someone comments, "The outage raises serious questions about whether public cloud platforms are ready for mission-critical workloads."
For dev and test purposes (which are not mission critical), an outage doesn't cause lost business or a damaged reputation. It's one thing to have a hiccup, perhaps even a localized outage, but when it affects North America, Europe, and parts of Asia, as it did in this recent Azure outage, it legitimately calls into question the public cloud's reliability.
It's obvious we need to see better resiliency in the service if businesses are to trust it. I'm confident Microsoft will get there, but not before we hit a few more "inevitable" bumps in the road.
4. The Manic Pixie Dream Girl response
The MPDG response takes a negative and finds a positive spin on it. One interesting MPDG comment I read about the Azure outage was that the outcry over it indicates Azure does indeed have customers who are heavily invested in the service being up and running.
Ars Technica's Peter Bright was the MPDG in question here, and he wrote, "While complaints about downtime perhaps aren't the best advertising a service provider could hope for, it's much better than not having complaints because nobody's using your service in the first place." It's an interesting, albeit spin-doctored, thought.
5. My response: Be judiciously cautious
I've said it before and I'll say it again: Cloud services go down. It is, as Microsoft's Bills says, "inevitable." Should you thus pause before jumping with both feet into the Azure infrastructure or services for your organization? Absolutely!
There are pros and cons to using something new or different. One con to being an early adopter is that you become a guinea pig that lives through the ups and downs until sufficient dependability is achieved. Being an early adopter isn't always rainbows and sunshine, so keep that in mind going in.
Azure is a maturing platform that many enterprises have put their trust in. You may be one of them. If so, it's your responsibility to help it improve by letting Microsoft know, clearly, when you are unhappy with the service or its communication.
Improve it will. That, too, is "inevitable."