I once worked with an organization that had a mature, load-balanced farm of server systems for its corporate directory. Users
started reporting that, infrequently, they had to authenticate twice, or that they had to resend mail to internal staff. Our
management console showed the directory service online and healthy. There was no pattern to users’ query failures, but calls
to the help desk were growing in frequency.
It took a one-time school administrator to send us in the right direction. “That management console,” he said, “doesn’t know
everything. Try diagnosing the problem as if the management system’s not there.” So we did it the old-fashioned way, tracking
from the bottom up instead of from the top down. Lo and behold, we found that one server in the farm had a sick hard drive
controller that was intermittently garbling read data.
Business continuity and failure-recovery strategies are based on the assumption that the most expensive failures are the most
obvious ones: One or more systems, services, or devices die. But a strategy built on that assumption is, by design, not looking
for lesser signs of trouble.
Some smaller problems compound over time or thrive simply because they aren’t being watched. Whether these little issues go
unnoticed because there’s no one left to look out for them or because they don’t seem important enough to monitor, the small
stuff can wind up costing more to repair than the big problems you fear most.
By the time one of these creeping, under-the-radar conditions trips the alarm bell, it may have left a trail of damage. In
the case of the server with the sick hard drive, the controller didn’t realize there was a problem, so its host didn’t know,
and no alert went out. We found that even if there had been an alert, it wouldn’t have been heard. The management system was configured
(or misconfigured) to listen for alerts only from the master directory server and the load balancer. It didn’t see anything
behind the load balancer.
That made the problem difficult to diagnose, which is often the case with failures that start out small. What’s the solution?
You need to adjust your administrative practices so you’ll see costly small problems coming.
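One way to look behind a load balancer, in the spirit of the bottom-up diagnosis described above, is to ask every farm member the same question directly and flag any member whose answer disagrees with the majority. The following is only a minimal sketch under stated assumptions: the host names and the `fetch` callable are hypothetical, and `fetch` stands in for whatever direct query your directory product actually supports. It is not the tool we used at the time.

```python
from collections import Counter
from typing import Callable, Dict, List


def find_outliers(
    backends: List[str],
    fetch: Callable[[str], bytes],
    rounds: int = 5,
) -> Dict[str, int]:
    """Query each backend directly and count disagreements with the majority.

    `fetch` is a caller-supplied (hypothetical) function that sends one fixed
    query straight to a single backend, bypassing the load balancer, and
    returns the raw answer. Repeating the query over several rounds gives an
    intermittent fault more chances to show itself.
    """
    disagreements = {name: 0 for name in backends}

    for _ in range(rounds):
        answers = {}
        for name in backends:
            try:
                answers[name] = fetch(name)
            except Exception:
                # A backend that cannot answer at all is recorded as None.
                answers[name] = None

        valid = [a for a in answers.values() if a is not None]
        if not valid:
            # Nobody answered this round, so every backend is suspect.
            for name in backends:
                disagreements[name] += 1
            continue

        # The most common answer is treated as the reference for this round.
        majority, _count = Counter(valid).most_common(1)[0]
        for name, answer in answers.items():
            if answer != majority:
                disagreements[name] += 1

    # Report only the backends that disagreed at least once.
    return {name: n for name, n in disagreements.items() if n > 0}
```

The majority vote only means something with three or more farm members, and it assumes the members are supposed to return identical answers to the same query. A server that shows up in the result even once is worth a closer look at the hardware level.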
Through a Foggy Windshield
Administrators routinely loosen management systems’ alarm thresholds so that they’ll send out fewer alerts. Some of the staffers
who made those adjustments to your systems are probably gone now, leaving you uncertain about disabled or misconfigured monitoring
settings. Before you do anything else, you’ll need to restore alert defaults and tune them to more realistic thresholds, which
will bury you under management alerts for a while. Is that an enjoyable process? No, and that’s why I’d get vendors to handle
as much of it as possible.
If you think that being bombarded with too much information is rough, it’s a joy compared to seeing nothing at all. A management
system that’s tuned for quiet operation is a great source of calm, but it’s a false comfort. These systems simply aren’t aware
of the status of some elements of your operation. You either need to plug these invisible assets into your management system
or cook up some other way to track their status. Choose one or the other, because it’s in these dark places that costly troubles
fester. Strolling up to a console whenever a user complains is not an effective solution.
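A first step toward finding those dark places is simply to compare what you own against what the management system is listening to. Here is a minimal sketch, assuming you can export both lists as plain host names; the function name and the sample hosts are hypothetical.

```python
from typing import Iterable, List


def unmonitored(inventory: Iterable[str], monitored: Iterable[str]) -> List[str]:
    """Return the assets in the inventory that the management system does not watch.

    Host names are compared case-insensitively, with surrounding whitespace
    ignored, because exports from different tools rarely agree on formatting.
    """
    def normalize(names: Iterable[str]) -> set:
        return {n.strip().lower() for n in names if n.strip()}

    return sorted(normalize(inventory) - normalize(monitored))


if __name__ == "__main__":
    # Hypothetical farm: only the load balancer and the master are monitored.
    owned = ["LB1", "dir-master", "dir2", "dir3"]
    watched = ["lb1", "dir-master"]
    for host in unmonitored(owned, watched):
        print("not monitored:", host)
```

Everything this prints either gets plugged into the management system or gets some other form of tracking.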
Most enterprise products are equipped for management. But not everything is made to the enterprise standard. Products designed
to adapt to small and medium businesses default to independent management. A business with two routers and eight servers is
not going to spring for a copy of OpenView. Instead, a company that size will use Telnet, X Window, or Terminal Services to
keep things tweaked. By now, I think everything can be managed from a Web browser, but every device has its own interface
style.