Last week I talked about IT ninjas -- and how they earn those credentials by having a sort of sixth sense about computing problems. The ability to sniff out problems quickly and accurately in complex data center infrastructures comes from boatloads of experience and intrinsic knowledge. It can't generally be taught. You don't find anyone offering certifications in supernatural troubleshooting.
Nonetheless, heavy-duty troubleshooters tend to follow some common, unwritten rules. Here are six I use in my own practice. Note that they apply to most -- but not all -- situations.
1. Never modify the interface on a server or network device you're currently connected to
While this may sound like a no-brainer, it's amazing how often someone modifies the properties of the network interface they're using to communicate with the device, a practice that has a high failure rate. At times, it may be the only option, but if there's a way to eliminate this potential pitfall, do it. Configure a secondary IP on an interface if you have to -- connect through another device or subnet, serial console, KVM, whatever. This is especially true if the device is in a remote office without on-site IT staff.
Occasionally, when I'm feeling relatively lazy, I'll write a script to change the IP on a Linux box, do a ping test, and revert the change if something goes wrong. But that's cheating, sort of.
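That change-test-revert loop generalizes beyond IP addresses. Here's a minimal POSIX-shell sketch of the pattern -- `try_change` is my own name for the helper, and the commented `ip`/`ping` invocation is an illustrative Linux example (placeholder addresses, requires root), not a drop-in script:

```shell
# try_change: apply a change, verify it worked, and revert automatically
# if verification fails. Each argument is a complete shell command.
try_change() {
    # $1 = command that applies the change
    # $2 = command that verifies things still work
    # $3 = command that reverts the change
    sh -c "$1"
    if sh -c "$2"; then
        return 0          # change applied and verified
    fi
    sh -c "$3"            # verification failed -- roll back
    return 1
}

# Hypothetical use when re-addressing a Linux box (example values):
# try_change "ip addr replace 192.0.2.20/24 dev eth0" \
#            "ping -c 3 -W 2 192.0.2.1" \
#            "ip addr replace 192.0.2.10/24 dev eth0"
```

The same helper works for any risky one-liner where you can script both a quick sanity check and an undo.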
2. Always have a way to get back to where you started
Whenever possible, leave yourself a way to get back to the original state, whether that means imaging a failing disk before working on it, backing up an entire directory structure in case it contains files you'll need later, or simply pulling one disk of a RAID1 array on a physical server before you mess with a borked operating system. Naturally, this is easier in virtual machine environments, where you can simply take a snapshot.
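For the directory case, the habit can be as small as a function you keep in your profile. A sketch, assuming a POSIX shell with `tar` and `gzip` available; `backup_tree` is a hypothetical helper name:

```shell
# backup_tree: archive a directory tree with a timestamp before touching it.
backup_tree() {
    # $1 = directory to preserve, $2 = where to put the archive
    stamp=$(date +%Y%m%d-%H%M%S)
    name=$(basename "$1")
    # -C keeps the archive paths relative to the parent directory
    tar -czf "$2/${name}-${stamp}.tar.gz" -C "$(dirname "$1")" "$name"
}

# Hypothetical use before a risky change:
# backup_tree /etc/apache2 /root/pre-change
```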
3. Document, document, document
Of all these rules, this one may be the least followed. To be sure, documenting a problem and a resolution when you're in the middle of a chaotic situation may not be practical. That said, always hold a postmortem on the problem when the dust settles and go over the steps taken and the path to the solution. Write it down. Keep it safe somewhere, preferably on a wiki hosted on your intranet -- and backed up to several other places.
4. There's no magic in IT, but there is luck
As Thomas Jefferson is reputed to have said, "I find that the harder I work, the more luck I seem to have." The same is true in IT. The more time you spend researching aspects of your infrastructure -- noting the normal operating conditions of routers, switches, servers, and whatnot -- the more in tune with it you become. That homework lets you sniff out problems in their very early stages and move far more quickly when the game's afoot. There are also plenty of ways to manufacture luck in IT. For example, use tools that automate network device configuration backups; that way, when a switch loses its mind, you can have it back up in minutes, not hours.
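As a sketch of that kind of manufactured luck, a nightly cron job can pull device configs somewhere safe. Everything here is an assumption to adapt for your gear -- the `show running-config` command, SSH access to the switches, the device names, and the directory layout:

```shell
# Where dated config copies land; override to taste.
BACKUP_DIR=${BACKUP_DIR:-/var/backups/netconfig}

fetch_config() {
    # Default fetch: ask the device for its running config over SSH.
    # Vendor-specific -- override this function to suit your equipment.
    ssh "$1" show running-config
}

backup_configs() {
    # Save one dated config file per device named on the command line.
    mkdir -p "$BACKUP_DIR"
    for dev in "$@"; do
        fetch_config "$dev" > "$BACKUP_DIR/$dev-$(date +%F).cfg"
    done
}

# Typical use from cron (hypothetical device names):
# backup_configs core-sw1 core-sw2 edge-rtr1
```

When a switch dies, restoring is then a matter of pasting back yesterday's file rather than reconstructing VLANs from memory.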
5. Make a backup of every configuration file before you modify it
This rule tends to apply only to Unix servers and network devices where configuration files exist for nearly all aspects of the device configuration. Before you go mucking around with sensitive configurations, save a copy to flash on a switch and maybe one to a TFTP host. On Unix systems, simply cp something.conf to something.conf.orig.
In a pinch, reverting to a prior known-good state is as simple as copying the file back and restarting the service. This generally isn't possible on Windows, due to the registry and Windows' proclivity to complicate simple concepts. Even so, you can sometimes export a portion of the registry before messing with it so that it can be reapplied if all hell breaks loose. Note: As with all matters regarding the Windows registry, you take the life of the server in your hands when you make changes.
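A pair of hypothetical helpers makes the habit hard to skip. Note that `stash` refuses to overwrite an existing `.orig`, so the first known-good copy survives repeated rounds of editing. Assumes a POSIX shell:

```shell
stash() {
    # Keep a pristine copy of $1 as $1.orig before editing.
    # Refuses to clobber an earlier backup, preserving the first
    # known-good version across repeated edits.
    [ -e "$1.orig" ] || cp -p "$1" "$1.orig"
}

unstash() {
    # Put the known-good copy back; restart the service yourself.
    cp -p "$1.orig" "$1"
}

# Hypothetical use:
# stash /etc/ssh/sshd_config     (then edit away)
# unstash /etc/ssh/sshd_config   (if it all goes sideways)
```

On Windows, the rough analogue is running `reg export` on the relevant key before you touch it.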
6. Monitor, monitor, monitor
An ounce of prevention is worth a month of work weekends. You should monitor every aspect of your data center, beginning with the temperatures of the room, the racks, and the servers -- plus server process checks, uptime checks, ad infinitum. You should also implement centralized syslogging for all network devices, as well as set up trending and graphing tools to monitor bandwidth utilization, temperatures, disk partition use, and other data points. All of these monitors should alert you by any means necessary when reasonable thresholds are exceeded.
When a database is corrupted because a partition filled up, an email or SMS sent an hour beforehand would have saved untold hours of work and downtime. There's no reason to put off monitoring your data center to the hilt.
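The partition-filling example is the easiest one to automate. A minimal sketch, assuming a POSIX shell with `df -P`; `check_disk` is an illustrative name, and in production you'd page or email rather than echo:

```shell
check_disk() {
    # $1 = mount point, $2 = percent threshold.
    # Prints a warning when usage meets or exceeds the threshold.
    pct=$(df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
    if [ "$pct" -ge "$2" ]; then
        echo "ALERT: $1 is at ${pct}% (threshold $2%)"
    fi
}

# Hypothetical use from cron, piped to your alerting mechanism:
# check_disk /var/lib/mysql 85
```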
These aren't just rules to follow -- they're rules that should be ingrained in your daily IT life. They're core concepts to many in the IT field, but to others, they're mythical -- kinda like ninjas.