Let's use the common pattern of a server (most likely one of a cluster) with a Web server fronting a Java application server running several JVMs, each of which contains a couple of Java applications:
- Software platform metrics: Simple process uptime monitoring is the most basic method, but most Web servers, app servers, and other third-party components also surface metrics about their operation via a status page or other means. These provide another data point on uptime and performance, which helps isolate issues, but you're limited to the specific metrics the software provider decided to expose.
- App container metrics: Typically, these are JVM metrics gathered via JMX or code instrumentation (or similar metrics on other platforms). These deliver excellent depth for finding application issues at runtime, but there are thousands of fine-grained metrics that require some sophistication to understand.
- Application metrics: These surface from inside the application itself using a metrics library. They're very valuable because they are custom to the exact data you want to surface, whether it's dollars sold in your online store or number of customers served -- but your developers need to write code explicitly to surface them.
- Hardware platform metrics: Here we're talking about OS-level metrics (the ever-popular CPU/memory/disk), metrics from the underlying abstraction layer, if any (for example, AWS metrics pulled using Amazon CloudWatch, virtualization layer metrics, or LXC container metrics for Docker users), and hardware metrics. They're necessary to identify resource shortfalls and provide insight into many common issues, but may or may not be representative of the service experience in the real world.
- Network metrics: These metrics are gathered by sniffing the network interface on each server. Otherwise, they're similar to the network APM technique discussed above.
- Log aggregation: All of the above parts of the system usually dump records of events and metrics into log files, which are an alternate path to gather much of the same information. Log information is often richer than pure metrics, but it's also large in volume and often slower to collect and process, which makes it less suited to rapid detection.
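To make the app container and application metrics items above more concrete, here is a minimal Java sketch, not a definitive implementation. It reads a few JVM values through the standard platform MXBeans and exposes two custom counters over JMX. The `MetricsSketch` class, the `StoreMetrics` bean, its counters, and the `com.example.store` object name are all hypothetical names chosen for illustration; a real application would more likely use a dedicated metrics library.

```java
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

class MetricsSketch {

    /** Hypothetical application metrics; the MXBean suffix makes them visible over JMX. */
    public interface StoreMetricsMXBean {
        long getOrdersServed();
        long getCentsSold();
    }

    public static class StoreMetrics implements StoreMetricsMXBean {
        private final AtomicLong ordersServed = new AtomicLong();
        private final AtomicLong centsSold = new AtomicLong();

        /** Developers call this explicitly each time an order completes. */
        public void recordOrder(long cents) {
            ordersServed.incrementAndGet();
            centsSold.addAndGet(cents);
        }

        @Override public long getOrdersServed() { return ordersServed.get(); }
        @Override public long getCentsSold() { return centsSold.get(); }
    }

    /** App container metrics: a few of the many values the JVM exposes. */
    static Map<String, Long> jvmSnapshot() {
        Map<String, Long> snapshot = new LinkedHashMap<>();
        snapshot.put("uptimeMillis", ManagementFactory.getRuntimeMXBean().getUptime());
        snapshot.put("heapUsedBytes",
                ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed());
        snapshot.put("liveThreads",
                (long) ManagementFactory.getThreadMXBean().getThreadCount());
        return snapshot;
    }

    /** Registers the application metrics so any JMX client can read them. */
    static ObjectName register(StoreMetrics metrics, String objectName) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName(objectName);
        server.registerMBean(metrics, name);
        return name;
    }

    public static void main(String[] args) throws Exception {
        StoreMetrics metrics = new StoreMetrics();
        ObjectName name = register(metrics, "com.example.store:type=StoreMetrics");
        metrics.recordOrder(1999); // amounts in cents, made up for the example
        metrics.recordOrder(500);

        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        System.out.println("ordersServed=" + server.getAttribute(name, "OrdersServed"));
        System.out.println("centsSold=" + server.getAttribute(name, "CentsSold"));
        jvmSnapshot().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The contrast matches the trade-off described in the list: the JVM values come for free but say nothing about the business, while the order counters say exactly what you want but exist only because someone wrote `recordOrder` calls into the application.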
Because each type of instrumentation has its strengths and weaknesses, users face the challenge of layering types of monitoring that together provide proactive awareness of the end-user experience, to minimize mean time to detection (MTTD), and sufficient granularity and information to minimize mean time to resolution (MTTR) after an issue arises.
In pursuit of optimum results, it's tempting to want one of each. But opting for every form of instrumentation raises cost -- including licenses, labor, complexity of management, and the additional system load from active methods -- to impractical levels. One option is an integrated offering, such as the one from CopperEgg, which wraps a mix of monitoring services into one complete portfolio at lower cost.
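As a rough illustration of the two measures, the Java sketch below averages detection and resolution times over a list of incidents. The `IncidentTimes` and `Incident` names and the sample timestamps are invented for the example. It also assumes that both intervals are measured from the moment the issue began; some teams measure MTTR from detection instead.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.List;

class IncidentTimes {

    /** Hypothetical incident record: when the issue began, was detected, and was resolved. */
    static final class Incident {
        final Instant started;
        final Instant detected;
        final Instant resolved;

        Incident(Instant started, Instant detected, Instant resolved) {
            this.started = started;
            this.detected = detected;
            this.resolved = resolved;
        }
    }

    /** Mean time to detection: average of (detected - started). */
    static Duration mttd(List<Incident> incidents) {
        if (incidents.isEmpty()) {
            return Duration.ZERO;
        }
        long totalMillis = 0;
        for (Incident incident : incidents) {
            totalMillis += Duration.between(incident.started, incident.detected).toMillis();
        }
        return Duration.ofMillis(totalMillis / incidents.size());
    }

    /** Mean time to resolution: average of (resolved - started), by assumption. */
    static Duration mttr(List<Incident> incidents) {
        if (incidents.isEmpty()) {
            return Duration.ZERO;
        }
        long totalMillis = 0;
        for (Incident incident : incidents) {
            totalMillis += Duration.between(incident.started, incident.resolved).toMillis();
        }
        return Duration.ofMillis(totalMillis / incidents.size());
    }

    public static void main(String[] args) {
        // Made-up sample data: two incidents on the same day.
        List<Incident> incidents = Arrays.asList(
                new Incident(Instant.parse("2024-01-01T10:00:00Z"),
                             Instant.parse("2024-01-01T10:04:00Z"),
                             Instant.parse("2024-01-01T10:34:00Z")),
                new Incident(Instant.parse("2024-01-01T14:00:00Z"),
                             Instant.parse("2024-01-01T14:10:00Z"),
                             Instant.parse("2024-01-01T15:00:00Z")));

        System.out.println("MTTD minutes: " + mttd(incidents).toMinutes()); // (4 + 10) / 2 = 7
        System.out.println("MTTR minutes: " + mttr(incidents).toMinutes()); // (34 + 60) / 2 = 47
    }
}
```

In this framing, monitoring that watches the end-user experience shortens the first interval, while deeper metrics and logs shorten the remaining time between detection and resolution.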
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to firstname.lastname@example.org.