Office 365 outage: This cloud has a silver lining

With all of the wailing and gnashing of teeth about Microsoft's three-hour Office 365 email outage, many have overlooked an important fact

Is it possible that Microsoft's doing something right?

If you're even slightly interested in Office 365, by now you know that Microsoft had a major service disruption on Wednesday. At about 12 p.m., Redmond time, one of Microsoft's data centers in North America went belly up, taking out the Office 365 Exchange servers. Those of you who have migrated to Office 365 had no email at all for about three hours, and then messages slowly trickled in for an hour or so after that.

At about the same time, Microsoft's CRM Online system went down, and there are many reports that SkyDrive went down as well.

The silver lining? Office 365's Lync and SharePoint kept working. But there's one more thing that seems to get lost amid all the angst: Microsoft kept its customers reasonably well informed. It fessed up.

No, there was no explanation for why the outage happened -- even now, we don't have a backstory -- and no estimate of the time to repair. There was no talk about CRM or SkyDrive. But the worst problem, Office 365's Exchange Online sleeping with the fishes, was reported on the Service Health Page as Incident Ex440.

The Office 365 support 800-number went dead. The Office 365 Service Request link on the Admin page didn't work. The Office 365 Community Forum went ballistic, with very little response from Microsoft. Based primarily on the thread on the Office 365 forum, the timeline went something like this (translated into Pacific time):

12:07 p.m. -- First post that Outlook and OWA are dead; the Support Request link on the Admin page is dead.

12:14 p.m. -- Exchange is dead. The Service Health Page shows all green; everything's OK. No response on the forum from Microsoft.

12:32 p.m. -- Incident Ex440 appears on the Exchange Online Service Health Dashboard. "We are investigating a service issue and will provide updated information when it becomes available." Incident Start Time stamped at noon.

12:33 p.m. -- Nobody seems to be able to get through to Microsoft phone support.

12:37 p.m. -- Phone support says it's a nationwide outage. No estimated repair time.

12:42 p.m. -- Incident Ex440 updated, with the same "We are investigating" message posted.

1:10 p.m. -- @Office365 tweets: "Investigating service issues. Expect more service updates will be available via the Service Health Dashboard."

1:13 p.m. -- Incident Ex440 on the Service Health Page is updated to say, "Investigating. All users are unable to access their email, and administrators are unable to manage existing accounts or provision new accounts."

1:38 p.m. -- The first post on the Office 365 thread from Microsoft: "We are aware of the service issue. Please follow the Service Health Dashboard for real time updates on this issue."

2:21 p.m. -- Incident Ex440 updated to say, "Connectivity issues to a North American datacenter have caused broad client access problems to a number of O365 Services. We are currently working to resolve this issue as soon as possible. We apologize in advance for any inconvenience this has caused our customers."

2:57 p.m. -- @Office365 tweets: "Services restoration beginning and being verified. Understand that Service Health Dashboard was intermittent. Pls try again."

2:57 p.m. -- First report that most mailboxes are back and functioning.

3:00 p.m. -- Incident Ex440 updated: "Connectivity is being restored. Service connections are being restored across all protocols and full service will be available soon. We apologize for any inconvenience this has caused our customers."

3:17 p.m. -- The moderator switches over to a forum that's only accessible to registered admins for paid Office 365 subscribers.

Shortly afterward, the Service Health Page went down again. But the Office 365 Exchange service was back up and working.

5:16 p.m. -- Incident Ex440 updated to say, "Connectivity has been restored. Service connections have been restored across all protocols and full service is available. If you experience further service problems, please contact customer support immediately. We apologize for any inconvenience this has caused our customers."

9:53 p.m. -- @Office365 tweets: "email service restored at 2:30 p.m. PDT. Network issues resolved in North American Data Center. APAC/Europe not impacted."

If there was ever any acknowledgment from Microsoft about CRM or SkyDrive, I haven't seen it.

The timeline tells an interesting tale. It took Microsoft about a half-hour to acknowledge the problem, and the first dashboard post at 12:32 p.m. didn't even begin to hint at the extent of the outage. By 1:13, though -- about 1 hour and 15 minutes after Exchange Online fried -- Microsoft divulged the full extent of the damage. After that, the Service Health Dashboard announced restoration of service accurately, when it was up.

People following @Office365 on Twitter didn't get the bad news for more than an hour. Since the Service Health Dashboard is only available to administrators -- and a few of them were, uh, preoccupied answering angry phone calls -- users with dead email symptoms didn't receive any official notification until then. And the hapless admins couldn't exactly send all of their users a quick warning message, could they?

The one shining light? The Office 365 community forum. People there were clueless, but at least they were all in the same boat.

Back to the silver lining: Microsoft's response was slow -- a nationwide email outage certainly deserves something better than a 30-minute partial notification -- but at least there was a response. The response mechanism, the Service Health Dashboard, worked most of the time. There was some ancillary notification on the community forum and even on Twitter. Compared to Microsoft's response to BPOS outages just three months ago, we're witnessing some big-time improvements.

As David Linthicum noted yesterday in his Cloud Computing blog, cloud performance isn't always stable -- demonstrably. If Microsoft can't keep its systems working, at least it can notify customers when there are problems and keep them up-to-date on the solutions. This time -- I'm tempted to say for the first time -- Microsoft's done a pretty good job of it.

This story, "Office 365 outage: This cloud has a silver lining," was originally published at InfoWorld.com. Get the first word on what the important tech news really means with the InfoWorld Tech Watch blog. For the latest developments in business technology news, follow InfoWorld.com on Twitter.

Copyright © 2011 IDG Communications, Inc.
