I like to know what's going on with the stuff I manage, so I'm a big proponent of heavily monitored infrastructures. I've written more code to monitor, manage, and track IT infrastructure components than I can remember. Folks still email me about plug-ins or code I wrote years ago and have since completely forgotten.
Many of those efforts were intended to monitor fixed systems, from UPSes to PDUs, air conditioners, servers, switches, routers, firewalls -- the whole gamut of IT gear. Until recently, they shared one trait: They stayed put in the same data center running on the same hardware using the same IP addresses. These days, except for power and cooling gear, that's not necessarily a safe assumption.
While we work toward the totally modular data center, we may find that monitoring and maintaining these infrastructures isn't as simple as it used to be. When entire application stacks consisting of several servers can move from one data center to another, we need to keep track of them and adjust the monitoring accordingly. When we see shifts and spikes in application performance and network latency, we need to account for the fact that there may be no actual problem; the monitoring system may simply sit at a much greater logical distance from its target than it did before. We may suddenly need to understand that physical hosts can fall off the network and disappear altogether -- and that's not necessarily a problem.
When it is a problem, we need to know immediately. Naturally, we need to automate the systems to be able to tell the difference.
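One way to handle those latency shifts is to alert against a baseline learned per monitoring-location/target pair instead of a single fixed threshold, so a workload that moves to a distant data center doesn't trip false alarms. Here's a minimal sketch of that idea; all names, numbers, and the tolerance factor are hypothetical, not drawn from any particular monitoring product:

```python
# Hypothetical per-location latency baselines, in milliseconds.
# After a stack moves, a new baseline is recorded for the new location.
BASELINES_MS = {
    ("nyc-monitor", "app-stack-1"): 5.0,            # same data center
    ("nyc-monitor", "app-stack-1@dr-site"): 45.0,   # after a move to the DR site
}

def latency_ok(monitor, target, observed_ms, tolerance=3.0):
    """Alert only when latency exceeds the learned baseline by a factor,
    rather than comparing against one global threshold."""
    baseline = BASELINES_MS.get((monitor, target))
    if baseline is None:
        return None  # no baseline yet: learn first, don't alert
    return observed_ms <= baseline * tolerance

print(latency_ok("nyc-monitor", "app-stack-1", 4.0))   # True: within baseline
print(latency_ok("nyc-monitor", "app-stack-1", 20.0))  # False: worth an alert
```

The same 20 ms reading that pages someone for the local stack would be perfectly normal for the DR-site copy, which is exactly the distinction a fixed threshold can't make.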
This new reality is straining monitoring systems that have been around for ages. Commercial and open source alike, they weren't designed for transient, fungible infrastructure components. Their purpose was to make sure everything stayed exactly as it was in a known good state and to raise alarms if anything changed. Now we need a middle layer to determine whether a seemingly negative variation is actually perfectly normal.
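That middle layer can be as simple as a filter between the event stream and the pager, consulting a list of hosts that are expected to be offline. A minimal sketch, with hypothetical host and event names:

```python
# Hosts deliberately powered down (e.g. parked by a power-management
# feature); in practice this set would be fed by the orchestration layer.
INTENTIONALLY_OFFLINE = {"esxi-03", "esxi-07"}

def should_alert(host: str, event: str) -> bool:
    """Suppress 'host down' alerts for hosts expected to be off;
    pass every other event through unchanged."""
    if event == "host_down" and host in INTENTIONALLY_OFFLINE:
        return False  # expected state change, not a failure
    return True

# A deliberately parked host going dark pages no one...
print(should_alert("esxi-03", "host_down"))  # False
# ...but the same event from a host that should be up still does.
print(should_alert("esxi-01", "host_down"))  # True
```

The hard part, of course, is keeping that expected-state set accurate as the orchestration layer moves things around -- which is why the set should come from the automation itself, not from a human-maintained list.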
Case in point: VMware's DPM (Distributed Power Management). When you configure and enable DPM, vCenter determines whether the cluster's available resources exceed current utilization, then automatically powers down ESXi hosts deemed unnecessary at that point. When the load grows, DPM powers those hosts back up, and DRS (Distributed Resource Scheduler) spreads out the load. This saves on power and cooling costs in the data center while ensuring more horsepower is available as the situation demands.
When you start tying veteran monitoring systems into those hosts, however, you can run into problems unless you convince the monitoring platform that ESXi hosts are allowed to disappear -- and that the disappearing act is not a problem requiring an alert. On the other hand, if there's a problem within a running host, an alert should be generated. Note, too, that during normal DPM operation, a host might trip an alarm simply because it's in the middle of shutting down. For now, with many external monitoring systems, it's easiest to forgo host monitoring altogether and shift that burden to vCenter.
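If you do want the external monitor to keep watching hosts, one approach is to ask vCenter what it believes a host's power state is before alerting on unreachability. The vCenter lookup below is a stub standing in for a real API call (e.g. via pyVmomi); the host names and the exact decision table are hypothetical:

```python
def vcenter_power_state(host: str) -> str:
    # Stub in place of a vCenter query. In vSphere, a host parked by DPM
    # reports a standby power state rather than powered-off.
    states = {"esxi-02": "standBy", "esxi-05": "poweredOn"}
    return states.get(host, "unknown")

def classify_unreachable(host: str) -> str:
    """Decide whether an unreachable host is a real problem by checking
    the authoritative state vCenter holds for it."""
    state = vcenter_power_state(host)
    if state == "standBy":
        return "ok"           # DPM parked it; expected
    if state == "poweredOn":
        return "alert"        # vCenter thinks it's up, but we can't reach it
    return "investigate"      # unknown to vCenter: possibly decommissioned

print(classify_unreachable("esxi-02"))  # ok
print(classify_unreachable("esxi-05"))  # alert
```

The key design point is that vCenter, not the monitoring system, is the source of truth about intent: the monitor only knows the host stopped answering, while vCenter knows whether it was told to.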