Uh oh, the system went down: 5 rules for better troubleshooting

Troubleshooting a downed mission-critical system can be terrifying, but a slow, methodical approach can save you time

If you've been in IT for more than a few minutes, chances are you've seen it happen: A mission-critical production system falls flat on its face, and you have absolutely no idea why or how to even begin to fix it. Moments of true terror punctuating the monotony of too many project meetings, application rollouts, and systems upgrades are really what make IT interesting -- and one reason why it's not for everyone.

Troubleshooting seemingly inexplicable failures can be one of the most stressful parts of the job. Unplanned downtime of a mission-critical system can invite the harshest scrutiny from coworkers and management in even the smallest of organizations, and it only gets worse as the size of the enterprise grows and the stakes get higher. That additional pressure often leads even the best engineers to make very dumb mistakes, further compounding the problem and prolonging the downtime.

Staying cool under pressure isn't easy no matter how many times you've been tossed into the fire, but there are five easy rules you can add to your emergency troubleshooting processes to get to a resolution faster, conclusively prove the cause of the outage, and avoid making things worse.
