Uh oh, the system went down: 5 rules for better troubleshooting


Troubleshooting a downed mission-critical system can be terrifying, but a slow, methodical approach can save you time

If you've been in IT for more than a few minutes, chances are you've seen it happen: A mission-critical production system falls flat on its face, and you have absolutely no idea why or how to even begin to fix it. Moments of true terror punctuating the monotony of too many project meetings, application rollouts, and systems upgrades are really what make IT interesting -- and one reason why it's not for everyone.

Troubleshooting seemingly inexplicable failures can be one of the most stressful parts of the job. Unplanned downtime of a mission-critical system can invite the harshest scrutiny from coworkers and management in even the smallest of organizations, and it only gets worse as the size of the enterprise grows and the stakes get higher. That additional pressure often leads even the best engineers to make very dumb mistakes, further compounding the problem and prolonging the downtime.

Staying cool under pressure isn't easy no matter how many times you've been tossed into the fire, but there are five easy rules you can add to your emergency troubleshooting processes to get to a resolution faster, conclusively prove the cause of the outage, and avoid making things worse.

First, keeping a time-stamped record of each step you try prevents you from going in circles and trying the same things over and over -- which happens frequently when stress levels are high. Second, if you have to involve the vendor, you'll have a comprehensive list of what you've already done so that the support folks don't have you do it all over again. Third, if you find yourself pawing through error logs, you'll be able to line up the time stamps of when you tried various fixes with the time stamps in the logs. Without that, you'll often be forced to retry the troubleshooting steps so that you can isolate the log entries they generate -- costing you more time in the end.
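
The third point above, lining up your own time stamps against the ones in the logs, lends itself to a small script. What follows is only a rough Python sketch of the idea, not a tool the article prescribes. The `Action` record, the `log_lines_near` helper, the 60-second window, and the sample log format (an ISO 8601 time stamp at the start of each line) are all assumptions made for illustration; real logs will need their own parsing.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Action:
    """One entry in a hypothetical troubleshooting journal."""
    when: datetime
    note: str


def log_lines_near(action, log_lines, window_seconds=60):
    """Return log lines stamped within window_seconds after the action.

    Assumes each line starts with an ISO 8601 time stamp followed by a
    space, e.g. "2024-01-01T10:00:05 INFO ...". That format is an
    assumption for this sketch, not something every log uses.
    """
    end = action.when + timedelta(seconds=window_seconds)
    matches = []
    for line in log_lines:
        stamp, _, _ = line.partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines without a parsable time stamp
        if action.when <= when <= end:
            matches.append(line)
    return matches


# Made-up journal entries and log lines, purely for demonstration.
journal = [
    Action(datetime(2024, 1, 1, 10, 0, 0), "restarted app service"),
    Action(datetime(2024, 1, 1, 10, 5, 0), "failed over to standby database"),
]
sample_log = [
    "2024-01-01T09:59:30 WARN connection pool exhausted",
    "2024-01-01T10:00:05 INFO service starting",
    "2024-01-01T10:00:40 ERROR cannot reach database",
    "2024-01-01T10:05:10 INFO database connection restored",
]

for action in journal:
    print(action.when.isoformat(), action.note)
    for line in log_lines_near(action, sample_log):
        print("    " + line)
```

Even a plain text file with a time and a one-line note per step gives you the same benefit; the script only saves you from scanning the logs by eye.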

3. Research carefully

If your back is really up against the wall, you'll inevitably find yourself grasping at straws when researching the problem (in other words, Googling). Unless you have an incredibly specific error on your hands, chances are you'll find several people posting that they've experienced a problem similar to the one you're stuck on.

The most important thing to do here is be very critical when you review those apparently close fits. In many cases, you'll discover that, although the symptom is the same, the circumstances are entirely different. I've seen massive amounts of time wasted in chasing the implementation of a fix for a completely unrelated problem -- a situation that could have been avoided by more careful review of the problem description.

4. Share what you know
