Last week I talked about IT ninjas -- and how they earn those credentials by having a sort of sixth sense about computing problems. The ability to sniff out problems quickly and accurately in complex data center infrastructures comes from boatloads of experience and intrinsic knowledge. It can't generally be taught. You don't find anyone offering certifications in supernatural troubleshooting.
Nonetheless, heavy-duty troubleshooters tend to follow some common, unwritten rules. Here are six I use in my own practice. Note that they apply to most -- but not all -- situations.
1. Never modify the interface on a server or network device you're currently connected to
This may sound like a no-brainer, but it's amazing how often someone modifies the properties of the very network interface they're using to communicate with the device, a practice that has a high failure rate. At times it may be the only option, but if there's a way to avoid this pitfall, take it. Configure a secondary IP on an interface if you have to, or connect through another device or subnet, a serial console, a KVM, whatever. This matters most when the device is in a remote office without on-site IT staff.
Occasionally, when I'm feeling relatively lazy, I'll write a script to change the IP on a Linux box, do a ping test, and revert the change if something goes wrong. But that's cheating, sort of.
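For illustration, here is a minimal Python sketch of that change, test, and revert pattern. It is not the author's actual script: the function names, the interface name, and the addresses are all hypothetical, and it assumes a Linux box with the iproute2 `ip` command and `ping` available, run as root. It also assumes the script is started detached from the session (under `nohup`, for example), so that it keeps running if the connection drops mid-change.

```python
import subprocess
import time
from typing import Callable


def change_with_rollback(apply: Callable[[], None],
                         verify: Callable[[], bool],
                         revert: Callable[[], None],
                         attempts: int = 3,
                         delay: float = 1.0) -> bool:
    """Apply a change, verify it, and revert unless verification succeeds."""
    ok = False
    try:
        apply()
        for _ in range(attempts):
            if verify():
                ok = True
                break
            time.sleep(delay)
    finally:
        # Runs on failed verification and on any exception raised above.
        if not ok:
            revert()
    return ok


def run(*cmd: str, check: bool = True) -> None:
    """Run a command; raise on a non-zero exit status when check is True."""
    subprocess.run(cmd, check=check)


def ping(host: str) -> bool:
    """Send one ICMP echo request with a two-second timeout."""
    result = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def swap_ip(dev: str, old_cidr: str, new_cidr: str, ping_target: str) -> bool:
    """Hypothetical helper: replace old_cidr with new_cidr on dev.

    Assumption: only the address is handled here. Removing an address may
    also drop routes that depend on it, which a real script must restore.
    """
    def apply() -> None:
        run("ip", "addr", "del", old_cidr, "dev", dev)
        run("ip", "addr", "add", new_cidr, "dev", dev)

    def revert() -> None:
        # Best effort: either step may fail if apply() stopped halfway.
        run("ip", "addr", "del", new_cidr, "dev", dev, check=False)
        run("ip", "addr", "add", old_cidr, "dev", dev, check=False)

    return change_with_rollback(apply, lambda: ping(ping_target), revert)
```

A call might look like `swap_ip("eth0", "192.0.2.10/24", "192.0.2.20/24", "192.0.2.1")`, using documentation-range addresses as placeholders. Keeping the rollback logic separate from the `ip` commands means the same wrapper can guard other risky changes, and it can be exercised with harmless stand-in functions before it touches a real interface.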
2. Always have a way to get back to where you started
Whenever possible, leave yourself a way to get back to the state you started from, even if that state is the original problem. That might mean imaging a failing disk before working on it, backing up an entire directory structure in case it holds files you aren't aware of but will need later, or simply pulling one disk of a RAID1 array on a physical server before you mess with a borked operating system. Naturally, this comes easier in virtual machine environments, where you can simply take a snapshot.
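As one small example of the directory-backup case, the sketch below archives a whole tree into a timestamped tarball before any work begins. The function names and paths are hypothetical, and it assumes the destination sits outside the tree being archived and has enough free space. It uses only Python's standard `tarfile` module and is no substitute for a disk image or a VM snapshot.

```python
import tarfile
import time
from pathlib import Path


def backup_tree(src: str, dest_dir: str) -> Path:
    """Archive the directory src into a timestamped .tar.gz under dest_dir."""
    source = Path(src).resolve()
    if not source.is_dir():
        raise NotADirectoryError(str(source))
    target_dir = Path(dest_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = target_dir / f"{source.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Adds everything under source, including hidden files, with
        # member paths stored relative to the directory's own name.
        tar.add(source, arcname=source.name)
    return archive


def archived_names(archive: Path) -> list[str]:
    """List the member paths in the archive as a quick sanity check."""
    with tarfile.open(archive, "r:gz") as tar:
        return tar.getnames()
```

Calling `backup_tree("/etc/myapp", "/root/backups")` (both paths are placeholders) returns the path of the new archive, and checking `archived_names()` against what you expect to be there is a cheap way to confirm the safety net exists before you start changing things.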








