CYA: 3 rules to keep a systems crisis from taking down IT

Even the most redundant of infrastructures can be brought down by a lack of readily accessible knowledge

These days, all but the smallest organizations spend mountains of money building redundancy into their infrastructures. As business depends more and more on those systems to function at even the most basic levels, the capital plowed into highly redundant disk arrays, bulletproof backups, and highly available virtualization infrastructures has become an expected cost of doing business.

However, the frenetic pace of break/fix, application rollouts, and systems upgrades often leads to the most dangerous single point of failure of all: people. That huge investment in redundant, self-healing infrastructure can be negated in one fell swoop if the one person who knows how to run some critical part of it quits, is on vacation, or even just goes out to lunch without a cellphone at the wrong time.

Often, it doesn't even take an actual service disruption to start a five-alarm political fire that reaches all the way to the executive suite. In my years as a consultant, I can't count the number of times I've been called in (sometimes at substantial expense) to help fill a knowledge gap simply because a single member of the IT staff wasn't available for whatever reason.

Looking back on those incidents, I see three items every member of the infrastructure team should have at their fingertips, whatever their role.

2. Track support contacts

Another source of huge problems when a primary admin is unavailable is the rest of the team not knowing where to escalate a problem. If a disk dies in your storage array while you're on vacation, does the rest of the team know who to call to get a new one? Is there a support contract number they need to know? What if a WAN circuit goes down? Will everyone know which telco provider services that line and who to call?

That kind of backstop sounds incredibly simple -- and it absolutely is -- but if that information isn't widely available (as it would be on, say, a printout on a bulletin board in the data center), its absence can delay the fix and sometimes even cause a general panic around a problem nobody knows how to fix.
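
One way to keep that information in a single, printable place is a small contact register. The sketch below is only an illustration: the `SupportContact` fields, the `find_contacts` and `format_sheet` helpers, and every vendor name, phone number, and contract number in it are made-up placeholders, not details from any real environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SupportContact:
    system: str       # what is covered, e.g. a storage array or WAN circuit
    vendor: str       # who to call
    phone: str        # support line
    contract_id: str  # support contract number the vendor will ask for


# Placeholder entries only; every value here is invented for illustration.
CONTACTS = [
    SupportContact("storage array", "Example Storage Co.", "555-0100", "SC-0001"),
    SupportContact("WAN circuit", "Example Telco", "555-0101", "CKT-0002"),
]


def find_contacts(keyword, contacts=CONTACTS):
    """Return every contact whose system name contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [c for c in contacts if needle in c.system.lower()]


def format_sheet(contacts=CONTACTS):
    """Render a plain-text sheet suitable for printing and posting."""
    lines = []
    for c in contacts:
        lines.append(f"{c.system}: {c.vendor}, {c.phone}, contract {c.contract_id}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_sheet())
```

Keeping the register as plain data means the same list can feed both a quick lookup (`find_contacts("wan")`) and the printout for the bulletin board, so the two never drift apart.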

3. Back up and restore

Everything I've mentioned so far boils down to having the right information easily available to everyone who might need it. But there are two hands-on activities that most of the IT team should be fully versed in doing themselves: ensuring that backups are being made and being able to restore them. Backing up and restoring are simply too important to leave to one or even two people.
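
A restore drill is easier to hand to any team member when the check at the end is mechanical. As a minimal sketch, assuming the backup has already been restored to a scratch directory by whatever tool you use, the hypothetical `verify_restore` helper below compares SHA-256 checksums of the source files against the restored copies. It is not tied to any particular backup product, and it only makes sense for data that has not changed since the backup was taken.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir, restored_dir):
    """Return relative paths that are missing from, or differ in, the restored copy."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        # A file counts as a problem if it was not restored or its contents differ.
        if not restored.is_file() or sha256_of(src) != sha256_of(restored):
            problems.append(str(rel))
    return problems
```

An empty list means every source file came back intact; anything else names the files to investigate before the backup is trusted.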
