In IT, as in life, sometimes bad things just happen. Although we invest a ton of treasure and labor in trying to make sure that critical business operations never experience downtime, things always seem to find a way to slip through the cracks. When something does go off the rails in a brilliantly visible fashion, how you actually fixed the problem often matters a lot less than how you communicated, both while the problem was occurring and during its aftermath.
In the past few weeks, I was given a stark reminder of that fact. A service provider I deal with on a regular basis encountered what can only be described as a catastrophic network meltdown. It's unimportant what the actual problem was except that the failure wasn't easily predictable and happened in such a way that several layers of design redundancy were rendered ineffective. In that respect, I really feel for the technical folks who saw their network crumble in a way that they hadn't anticipated. If you work in IT for long enough, that'll happen to you no matter how good you are.
Although I'm ultimately willing to cut the provider a lot of slack for the failure itself (despite the fact that it was disruptive), I'm not inclined to be so charitable about how it was communicated. In fact, the lackadaisical way that the provider communicated about its outage stands to do far more damage to my opinion of the company than the failure itself.
All of us in IT risk this same outcome any time anything goes wrong. How clearly you communicate during and after an outage is often what people remember when they think of the outage. I've seen organizations accept epic meltdowns with grace when communication is good, and I've seen people nearly fired for events that resulted in no lost productivity simply because communication was bad. It's that important.
Although the incident I experienced is not the worst I've been through, it did share a few characteristics I've seen in other poorly communicated outages: