At InfoWorld CTO Forum 2002, I had the honor of moderating a roundtable discussion of a familiar topic: systems management.
Yawn, you say? Au contraire. Luminaries from IBM, Cisco, and Path Communications expressed fascinating and diverse points
of view about the challenge of managing IT assets -- systems, infrastructure components, and applications -- in a distributed
setting. For all the coverage of Web services, grid computing, and utility computing, surprisingly little attention is paid
to keeping these widely dispersed, loosely coupled federations of systems running smoothly.
We can adapt existing management tools and techniques to new processing models, but changes will eventually be required
at all layers. Because every company already has a substantial investment in hardware, software is the best place to start.
Applications should be outfitted with some awareness of their distributed environment; at the very least, they should respond
gracefully to signals from a management engine telling them to pause, stop, restart, or release their resources.
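To make the idea concrete, here is a minimal sketch of an application that obeys the four management commands named above. Everything in it is an assumption made for illustration: the ManagedApp class, the State values, the command strings, and the placeholder resource names do not come from any particular management product.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    PAUSED = "paused"
    STOPPED = "stopped"


class ManagedApp:
    """Hypothetical application wrapper that obeys management commands."""

    def __init__(self):
        self.state = State.RUNNING
        # Placeholder resource names, purely illustrative.
        self.resources = ["db-connection", "temp-file"]

    def release(self):
        # Give back everything the app holds; safe to call repeatedly.
        self.resources.clear()

    def handle(self, command):
        """Apply one management command and return the resulting state."""
        if command == "pause":
            # Only a running app can be paused; otherwise leave state alone.
            if self.state is State.RUNNING:
                self.state = State.PAUSED
        elif command == "stop":
            self.release()
            self.state = State.STOPPED
        elif command == "restart":
            # Drop old resources, reacquire fresh ones, resume running.
            self.release()
            self.resources = ["db-connection", "temp-file"]
            self.state = State.RUNNING
        elif command == "release":
            self.release()
        else:
            # Unknown commands are reported to the caller, not silently ignored.
            raise ValueError(f"unsupported command: {command}")
        return self.state
```

A management engine would call handle("pause") or handle("stop") over whatever transport it uses; the point is only that the application exposes a small, predictable set of responses.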
Applications should also report trouble in a more structured manner than dumping plain text into a log file. Apps are
first to be affected by an equipment or network failure, so in a way every application is a sensor that can be tapped to measure
the enterprise's health.
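One way to picture structured reporting is an application that emits each problem as a small machine-readable record instead of a free-form log line. The sketch below assumes JSON as the format; the function names make_event and emit, and the field names, are hypothetical and chosen only for illustration.

```python
import json
import time


def make_event(app, severity, code, detail):
    """Build a machine-readable trouble report (field names are illustrative)."""
    return {
        "app": app,
        "severity": severity,
        "code": code,
        "detail": detail,
        "timestamp": time.time(),
    }


def emit(event):
    """Write the event as one JSON object per line and return that line."""
    # One object per line is easy for a collector to parse and filter.
    line = json.dumps(event, sort_keys=True)
    print(line)
    return line
```

A call such as emit(make_event("billing", "error", "DB_UNREACHABLE", "connect timed out")) gives a management engine a code it can match on, rather than prose it has to guess at.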
If software is wired with an awareness of its operating state, the next logical step is to teach it to respond to certain
kinds of trouble without requiring human intervention. There is some debate about whether it's best to have applications control
their own destiny, submit to centralized control (the current model), or watch out for one another. The eventual goal is to
have self-diagnosing, self-healing software that only sends up a flare when it encounters a problem it can't fathom or can't
resolve on its own.
The concept of autonomic computing, largely pioneered by IBM, imbues systems and infrastructure with enough intelligence
to report on and react to failure. The name is borrowed from animals' autonomic nervous system, which automatically regulates
processes such as heart rate and respiration in response to signals from the body. We don't need to think about making our
hearts beat faster when we exercise; it just happens. In the same way, administrators shouldn't have to constantly watch and
tweak equipment.
Instead of paging an operator at 3 a.m., systems should automatically respond in the way that best preserves functionality
and protects data. At first, those responses will have to be programmed: When problem A occurs, take action B. Over time,
systems will make the transition from rules to reason. Every system, and in time every component in that system, will be able
to react quickly and intelligently to most failures.
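The programmed stage, "when problem A occurs, take action B," amounts to a lookup table with an escape hatch for anything the table doesn't cover. Here is a minimal sketch under that assumption; the problem codes, the action functions, and the respond helper are all hypothetical, and the actions merely return a description of what a real system would do.

```python
def restart_service(event):
    # Stand-in for a real restart; returns a description of the action.
    return f"restarted {event['app']}"


def fail_over(event):
    # Stand-in for switching to a standby system.
    return f"failed over {event['app']} to standby"


# Problem code -> automatic response. Codes are illustrative only.
RULES = {
    "PROCESS_HUNG": restart_service,
    "DISK_FAILED": fail_over,
}


def respond(event, rules=RULES):
    """Apply the matching rule, or escalate when no rule covers the problem."""
    action = rules.get(event["code"])
    if action is None:
        # No rule fits: this is the case that still sends up a flare.
        return f"escalate to operator: {event['code']}"
    return action(event)
```

With this table, respond({"app": "mail", "code": "PROCESS_HUNG"}) is handled automatically, while an unrecognized code falls through to a human. Moving from rules to reason would mean replacing the fixed table with something that can choose an action it was never explicitly given.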
It's unlikely that human administrators will ever be taken entirely out of the loop. But to derive the maximum benefit from
distributed technology, we must reduce the administrative noise level by giving hardware and software more power to manage
themselves.