Networking issues are generally more forgiving than those involving servers or storage. Usually the culprit is a downed circuit, mangled routes, or a failed switch. The outages can cause big problems, but once the source is found and the issue is fixed, everything pops back to normal quite quickly, with little chance of lasting damage. That's not to say that network troubleshooting is any easier -- it certainly is not. But you can play fast and loose when troubleshooting the network because the effects of your efforts are almost always visible immediately.
That's not the case with servers and storage, where most troubleshooting decisions require intense deliberation. That's because choosing the blue pill over the red pill might fix the issue at hand, but might also mean several hours of work restoring data from backups or restoring physical or virtual servers in full. Furthermore, the effects of your efforts are not necessarily immediately visible. It may take hours to know if issuing a certain command on a SAN will elicit the desired result, or if standing up a snapshot and working on that will get a critical component back online without too much data loss. Simply defining "too much data loss" is a gambler's game.
As time wears on during an emergency, we become more cavalier in our testing and decisions. Initially we may have been reluctant to reboot a virtualization host that has lost all contact with the cluster yet is still hosting functional VMs; after an hour or two of picking at why it lost its mind, we may finally accept there is no other choice. We then declare unplanned downtime, power down as many of those VMs as we can, cut our losses, and pull the trigger.
Fixing more than just symptoms
As I mentioned last week, many of the most intractable, brain-bending problems are the result of bugs somewhere within infrastructure components that we cannot see or even diagnose without extensive, time-consuming forensics work and, usually, a high-tier engineer from the vendor. When you have hundreds or thousands of people sitting idle while you poke through a critical system, those are luxuries that are expensive indeed. Once a prospective fix is found and appears to solve the problem, it's usually implemented immediately, and fingers are crossed in hopes that it holds and the problem does not recur.
Those are exactly the kinds of problems that are likely to occur again, because they were never actually fixed; only the symptoms were addressed. If there's time and budget for a lab, these issues can sometimes be recreated under less damaging conditions, and the source of the problem can be uncovered and fixed. Otherwise, all you can do is add it to the long list of big problems that happen once or twice and that you hope never happen again.
Such is life in IT.
This story, "The true grit of IT troubleshooting," was originally published at InfoWorld.com. Read more of Paul Venezia's The Deep End blog at InfoWorld.com.