Improve availability of enterprise data
For those striving to avoid system downtime, change is enemy No. 1
Ask an expert about data availability and how to ensure it, and the conversation quickly turns to the subject of human error. Not that IT mistakes are the leading cause of unplanned downtime; the research firm Gartner identifies software failures as the chief culprit, and “operator error” as the second most common cause, ahead of hardware outages; building or site disasters; and metro disasters, such as storms or floods, in that order. But of all of these major causes, human error is the one that IT can really do something about.
IT folks close to the action generally agree with Gartner’s ranking, although some suggest that Gartner may even have underestimated the role of mistakes. Software failures often result from configuration errors, and sometimes they arise as the result of improper testing: an incompatibility isn’t discovered because an application was tested on a different system configuration than the one in production, for example, or performance testing didn’t give the app the workout it would get in real life.
Even many hardware failures can be laid at the feet of IT malpractice. If systems aren’t cooled properly, if they’re improperly racked, or if the procedure for starting them up and shutting them down isn’t followed correctly, equipment life is shortened and premature failures can result. Even for dumb hardware, it pays to read the manual.
But whether it’s software testing practices, hardware maintenance procedures, or the plain old boneheaded mistake lurking in the dark, the question is what to do about it.
If you’ve recently suffered from a blunder-induced outage, you might be tempted to ask, Why me? Mauricio Daher, a principal consultant with the storage services provider GlassHouse Technologies, can tell you: Not enough red tape.
In Daher’s line of work, which is helping large IT organizations prepare for disaster and recover from outages, he’s seen his fair share of glitches attributable to human error.
“Out of those,” he says, “it is mostly, ‘Gee, somebody reconfigured a LUN [logical unit number] that was actually a production LUN but they thought it was something else.’ These are simple things that I see happening again and again because of the nature of my business.”
You might think human error is an equal-opportunity affliction, but these sorts of slips just don’t happen in better-run enterprises, Daher points out. “By the time you get to a point where you can input those commands, you’ve been through so many bits of red tape that it’s impossible to make a mistake,” he says. “That type of mistake really doesn’t happen in a mature organization, because there are so many safeguards.”
Daher and GlassHouse use the CMM (Capability Maturity Model) to evaluate datacenters. Essentially, CMM is a model for process improvement that measures maturity level on a five-point scale. When Daher assesses an IT organization, he looks at whether it has standard operating procedures and SLAs in place, how it measures against those SLAs, and whether there is accountability at various points in the personnel chart.
Training, documentation, and standardization are the essential ingredients of process success. Falling short on the CMM scale typically has more to do with a lack of discipline than a shortage of skills.