February 20, 2004

Business continuity means monitoring the small stuff

Maybe your systems appear to be running smoothly because you've tuned out the alarms

I once worked with an organization that had a mature, load-balanced farm of server systems for its corporate directory. Users started reporting that, infrequently, they had to authenticate twice, or that they had to resend mail to internal staff. Our management console showed the directory service online and healthy. There was no pattern to users’ query failures, but calls to the help desk were growing in frequency.

It took a former school administrator to send us in the right direction. “That management console,” he said, “doesn’t know everything. Try diagnosing the problem as if the management system’s not there.” So we did it the old-fashioned way, tracking from the bottom up instead of from the top down. Lo and behold, we found that one server in the farm had a sick hard drive controller that was intermittently garbling read data.

Business continuity and failure-recovery strategies are based on the assumption that the most expensive failures are the most obvious ones: One or more systems, services, or devices die. But, by design, a strategy built on that assumption is not looking for lesser signs of trouble.

Some smaller problems compound over time or thrive simply because they aren’t being watched. Whether these little issues go unnoticed because there’s no one left to look out for them or because they don’t seem important enough to monitor, the small stuff can wind up costing more to repair than the big problems you fear most.

By the time one of these creeping, under-the-radar conditions trips the alarm bell, it may have left a trail of damage. In the case of the server with the sick hard drive, the controller didn’t realize there was a problem, so its host didn’t know, and no alert went out. We also found that even if there had been an alert, no one would have heard it. The management system was configured (or misconfigured) to listen for alerts only from the master directory server and the load balancer. It didn’t see anything behind the load balancer.
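One way to look behind a load balancer is to send the same query to every server in the farm directly and compare the answers. The sketch below is only an illustration of that idea, not what we ran at the time. The function name, the host names, and the `fetch` callable are all hypothetical; `fetch` stands in for whatever direct query your directory supports. The check repeats for several rounds because an intermittent fault can hide on any single pass, and it needs at least three servers for a majority to mean anything.

```python
import hashlib
from collections import Counter
from typing import Callable, Dict, List


def find_divergent_backends(
    backends: List[str],
    fetch: Callable[[str], bytes],
    rounds: int = 5,
) -> Dict[str, int]:
    """Count how often each backend's reply differs from the majority reply.

    `fetch` is a caller-supplied (hypothetical) function that sends one
    fixed query straight to the named backend, bypassing the load balancer.
    """
    mismatches: Counter = Counter()
    for _ in range(rounds):
        digests = {}
        for host in backends:
            try:
                digests[host] = hashlib.sha256(fetch(host)).hexdigest()
            except Exception:
                digests[host] = None  # no answer is treated as a mismatch
        valid = [d for d in digests.values() if d is not None]
        if not valid:
            # Nothing answered this round, so every backend is suspect.
            for host in backends:
                mismatches[host] += 1
            continue
        majority, _count = Counter(valid).most_common(1)[0]
        for host, digest in digests.items():
            if digest != majority:
                mismatches[host] += 1
    return dict(mismatches)
```

A server that shows up in the result even once deserves a closer look at its hardware, whatever the management console says about the service as a whole.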

That made the problem difficult to diagnose, which is often the case with failures that start out small. What’s the solution? You need to adjust your administrative practices so you’ll see costly small problems coming.

Through a Foggy Windshield

Administrators routinely loosen management systems’ alarm thresholds so that they’ll send out fewer alerts. Some of the staffers who made those adjustments to your systems are probably gone now, leaving you uncertain about disabled or misconfigured monitoring settings. Before you do anything else, you’ll need to restore alert defaults and tune them to more realistic thresholds, which will bury you under management alerts for a while. Is that an enjoyable process? No, and that’s why I’d get vendors to handle as much of it as possible.

If you think that being bombarded with too much information is rough, it’s a joy compared to seeing nothing at all. A management system that’s tuned for quiet operation is a great source of calm, but it’s a false comfort. These systems simply aren’t aware of the status of some elements of your operation. You either need to plug these invisible assets into your management system or cook up some other way to track their status. Choose one or the other, because it’s in these dark places that costly troubles fester. Strolling up to a console whenever a user complains is not an effective solution.
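Finding the dark places can start with something as plain as comparing two lists: what you own and what your management system actually watches. The following is a minimal sketch under that assumption. The function name and host names are made up for illustration, and in practice both lists would come from your asset inventory and your management system's export.

```python
from typing import Iterable, List


def unmonitored_assets(inventory: Iterable[str], monitored: Iterable[str]) -> List[str]:
    """Return the assets in the inventory that the management system never hears from."""
    return sorted(set(inventory) - set(monitored))


# Hypothetical example: only the master directory server and the load
# balancer report in, so the farm behind the balancer is invisible.
inventory = ["dir-master", "lb-1", "dir-farm-1", "dir-farm-2", "dir-farm-3"]
monitored = ["dir-master", "lb-1"]
blind_spots = unmonitored_assets(inventory, monitored)
print(blind_spots)
```

Every name on that list either gets plugged into the management system or gets some other scheduled check.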

©1994-2009 Infoworld, Inc.