In IT, as in life, sometimes bad things just happen. Although we invest a ton of treasure and labor in trying to make sure that critical business operations never experience downtime, things always seem to find a way to slip through the cracks. When something does go off the rails in a brilliantly visible fashion, it often matters a lot less how you actually fixed the problem and more how you communicated both while the problem was occurring and during its aftermath.
Over the past few weeks, I was given a stark reminder of that fact. A service provider I deal with on a regular basis encountered what can only be described as a catastrophic network meltdown. What the actual problem was is unimportant, except that the failure wasn't easily predictable and happened in a way that rendered several layers of design redundancy ineffective. In that respect, I really feel for the technical folks who saw their network crumble in a way they hadn't anticipated. If you work in IT long enough, that'll happen to you no matter how good you are.
Although I'm ultimately willing to cut the provider a lot of slack for the failure itself (despite the fact that it was disruptive), I'm not inclined to be so charitable about how it was communicated. In fact, the lackadaisical way that the provider communicated about its outage stands to do far more damage to my opinion of the company than the failure itself.
All of us in IT risk this same outcome anytime anything goes wrong. How clearly you communicate during and after an outage is often what people remember about it. I've seen organizations accept epic meltdowns with grace when communication was good, and I've seen people nearly fired over events that caused no lost productivity at all, simply because communication was bad. It's that important.
Although this incident is not the worst I've been through, it shared a few characteristics I've seen reflected in other poorly communicated outages:
- I was only aware of the problem because of monitoring tools that I had configured to detect it -- no one from the provider ever reached out to inform me of the problem.
- When I asked about the cause of the problem, I was told it would take a few hours to put together a root-cause analysis and no other information was available.
- Despite several requests, that root-cause analysis didn't come until four days later -- during which there was absolutely no communication about the issue, including whether or not it might recur.
- Finally, when the root-cause analysis was presented, it was done haphazardly. Although it documented the issue in adequate detail, it was full of spelling and grammatical mistakes that screamed "unprofessional."
When an outage occurs on your watch, the very first thing you should do is inform stakeholders of the situation, even if they might not yet be aware that anything is wrong. Certainly, give yourself some time to classify the problem and definitely try not to draw any conclusions about what the problem is before you actually know -- that can make things worse later. But don't wait too long to communicate, and try to be the first to communicate. Nontechnical stakeholders will worry less if they think you're on top of it and are keeping them in the loop.
One of the hardest things to do in the face of a debilitating outage is to take time out of trying to fix it to tell people about it. I know what it's like. I've sat in a chair with an entire enterprise network down in shambles around me and been the only one working feverishly to fix it. The very last thing you want to do is to stop, break your train of thought, and try to tell someone (especially someone nontechnical) what's happening and what to expect.
But that's exactly what you need to do. If you leave stakeholders in the dark, they'll start to make wild assumptions about what's wrong, how long it will last, and who's to blame. Because those assumptions tend to gravitate toward the apocalyptic, it's crucially important to control them. The repercussions from not doing so can long outlast the outage you're working to correct.
That said, you need not always drop what you're doing to explain to a bunch of management types exactly how a SAN works or why the network can be down even though they spent all that money on a redundant pair of core switches. The best thing to do is find someone on your team who can act as a communications liaison when there's a real outage you need to focus on.
That person might be someone from the desktop support team or perhaps someone in your own management chain of command. Either way, it helps if they're reasonably technical and can learn what they need to know from watching what you're working on rather than interrupting you to ask. Even if they have to interrupt you, at least it's only one person doing so rather than a whole room full of them.
Once the outage has been resolved, immediately start working on a root-cause analysis. If it's going to take you a while to put together -- say, if you have to work with a vendor to get it -- be sure to communicate regularly and keep folks apprised of the status. When you write it, lay out the timeline of the outage, explain the root cause both in layman's terms for executives and in technical terms, and spell out what action you're taking to prevent it from happening again.
And though this may seem silly to say, make sure it doesn't have any typos. Even one error can make someone who doesn't understand a word of what you've written automatically assume that everything you've said is wrong or carelessly assembled. Have someone proofread it, if you need to.
No matter what you do, don't underestimate the importance of communicating proactively, fully, and professionally. From a stakeholder's perspective, that's often all they'll see or understand. Your hours of frantic, genius-level toil in the bowels of the network will rarely be seen, appreciated, or remembered in the same way as the communication that follows it.
This article, "What's worse than a system failure? What you say about it," originally appeared at InfoWorld.com. Read more of Matt Prigge's Information Overload blog and follow the latest developments in storage at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.