If this is the way Microsoft's going to handle Office 365 outages, we're in for some interesting times.
On Dec. 30, one of the largest SQL Server databases on the planet started having problems. The database in question just happens to belong to Microsoft. And the way the company reacted to the problems should raise red flags for anyone considering a move to the Microsoft cloud.
According to a blog post by Chris Jones, a Microsoft vice president in Windows Live Engineering, the Hotmail servers had a problem with load balancing, resulting in 17,355 email accounts losing all of their data. It took Microsoft three days to restore the data. At least, Microsoft claims it had the data restored in three days. Voluminous postings on both the Windows Live Engineering site and the Windows Live Solution Center say that some people still haven't gotten their data back.
Data loss happens in the cloud, on corporate servers, and on the desktop. But this is, arguably, Microsoft's most widely deployed cloud application, backed by Redmond's best and brightest, and it failed for 17,000 users for at least three days.
Put aside the obvious technical questions: Why were the servers performing a load-balancing act in the middle of the busiest time of year? How did the data disappear and then suddenly reappear? Why does it take three days to retrieve lost data? Can't SQL Server scale better than that? If you look at Microsoft's response to the disappearing data, you really have to wonder how the 'Softies would handle a data-destroying incident involving your company's data.
Consider: The initial problem notification, predictably, came on the Microsoft support board, the Windows Live Solution Center. Hundreds, then thousands of people reported that all of their messages were gone. The support staff handling the Solution Center must've realized they were facing a systemic problem, not a random sampling of clueless users. But instead of coming out with a definitive statement and posting it on the forum in blazing color, the support people just chased after individual reports, using cut-and-paste responses to users' cries of anguish.
Even a simple "We don't know what's going on, but here are the symptoms and we're working on it" pinned to the top of the Hotmail forum would've been a breath of fresh air.
Instead, on Jan. 3, Microsoft posted a terse official explanation: "We have identified the source of the issue have restored email access to those who were effected."
It's now six days since the initial problem surfaced and we still don't have any definitive word from Microsoft about what happened. In fact, we're still getting conflicting stories. At 4:55 p.m. on Jan. 5, the tech support staff posted this response to a series of inquiries about still-missing messages:
"I'd like you to know that we are actively working on resolving on this issue since it's already under investigation. We will post back as soon as we have the latest news on what caused this issue. Thank you for your understanding."
For heaven's sake. Microsoft's engineering team has been working on the problem for almost a week, and that's the only explanation they can give us? Three days ago, we were told that "we have identified the source of the issue," and now the support team's telling us, "we are actively working on resolving this issue"?
Granted, on the Hotmail scale, 17,000 inboxes don't amount to a hill of beans. But Microsoft's ongoing fumbles in identifying and analyzing the problem, its trouble restoring user data, its muddled explanations of what happened and how the problems were resolved, and its repeated communication gaffes with its customers certainly have me worried. How about you?
This article, "Hotmail fail: Microsoft lays an egg in the cloud," was originally published at InfoWorld.com.