As most of us in the developer and IT ops communities know by now, Docker is good. Docker and containers have brought production operations closer to development, given us more freedom in our technology choices, and ushered in microservices as the backbone of a more flexible and aggressive approach to building software, especially in cloud environments.
But as organizations adopt Docker and containerization, life can get complicated. Operationalizing Docker, more often than not, means increased complexity, an abundance of infrastructure and application data, and a commensurate need for additional monitoring and alerting on the production environment.
As Docker and containers make the leap from development into production in your organization, there are three factors to keep in mind when it comes to monitoring a containerized environment. First, monitoring Docker is not a solution unto itself. Second, you need to know which container metrics you should care about. Third, there are multiple options for collecting application metrics. Let’s dive in.
As operations, IT, and engineering organizations coalesce around the value and importance of containers, they often ask the seemingly logical question: “How do I monitor Docker in my production environment?” As it turns out, this question has it backward. Monitoring the Docker daemon, the Kubernetes master, or the Mesos scheduler isn’t especially complicated, and there are, in fact, solutions for each of these.
Running your applications in Docker containers only changes how the applications are packaged, scheduled, and orchestrated, not how they actually run. The question, properly rephrased, becomes, “How does Docker change how I monitor my applications?” As you might imagine, the answer to this question is “It depends.”
The answer will be dictated by the dependencies of your environment and your use cases and objectives. The orchestration technology you use, the Docker image philosophy you follow, and the level of observability your containerized application provides, among other considerations, will all factor into how you monitor your applications.
To begin to understand how a microservices regimen and a Dockerized environment will affect your monitoring strategy, ask yourself the following four simple questions. Note that the answers may differ for different applications, and your approach to monitoring should reflect these differences.
- Do you want to track application-specific metrics or only system-level metrics?
- Is your application placement static or dynamic (that is, do you use a static mapping of what runs where, or do you use dynamic container placement, scheduling, and bin packing)?
- If you have application-specific metrics, do you poll those metrics from your application, or are they being pushed to some external endpoint? If you poll the metrics, are they available through a TCP port you’re comfortable exposing from your container?
- Do you run lightweight, bare-bones, single-process Docker containers or heavyweight images with supervisord (or something similar)?
Getting your containers’ metrics
When it comes to gathering system-level metrics from your containers, Docker has you covered. The Docker daemon already exposes detailed metrics about CPU, memory, network, and I/O usage for running containers via the /stats endpoint of Docker’s remote API. Regardless of whether you plan on collecting application-level metrics, you should definitely first obtain the metrics from your containers. The simplest and most reliable way to gather metrics from all your containers is by running collectd on each host that has a Docker daemon, along with the docker-collectd plugin.
If you’re using Docker Swarm, the Swarm API endpoint exposes the full Docker remote API, reporting data for all of the containers executed in the swarm. This means you need only one collectd instance with the docker-collectd plugin to point at the Swarm manager’s API endpoint.
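To make the container metrics more concrete, here is a minimal Python sketch of how CPU and memory percentages can be derived from a single stats sample. It is not how the docker-collectd plugin itself is implemented. The field names (cpu_stats, precpu_stats, system_cpu_usage, online_cpus, memory_stats) follow the stats response as documented for the Docker Engine API and may differ between API versions, and the SAMPLE values are invented for illustration.

```python
from typing import Any, Dict


def cpu_percent(stats: Dict[str, Any]) -> float:
    """Container CPU usage between two readings, as a percentage.

    100.0 means one full core; a container using two cores reports 200.0.
    """
    cpu = stats["cpu_stats"]
    pre = stats["precpu_stats"]  # the previous reading, included in each sample
    cpu_delta = cpu["cpu_usage"]["total_usage"] - pre["cpu_usage"]["total_usage"]
    system_delta = cpu["system_cpu_usage"] - pre["system_cpu_usage"]
    if cpu_delta <= 0 or system_delta <= 0:
        return 0.0  # first sample or counter reset: nothing meaningful to report
    # Assumption: fall back to a single CPU if online_cpus is absent.
    return cpu_delta / system_delta * cpu.get("online_cpus", 1) * 100.0


def memory_percent(stats: Dict[str, Any]) -> float:
    """Memory usage as a percentage of the container's limit (page cache included)."""
    mem = stats["memory_stats"]
    limit = mem.get("limit", 0)
    if not limit:
        return 0.0
    return mem["usage"] / limit * 100.0


# Invented sample shaped like one stats reading for a single container.
SAMPLE = {
    "cpu_stats": {
        "cpu_usage": {"total_usage": 700_000_000},
        "system_cpu_usage": 20_000_000_000,
        "online_cpus": 2,
    },
    "precpu_stats": {
        "cpu_usage": {"total_usage": 200_000_000},
        "system_cpu_usage": 18_000_000_000,
    },
    "memory_stats": {"usage": 268_435_456, "limit": 1_073_741_824},
}

if __name__ == "__main__":
    print(f"cpu={cpu_percent(SAMPLE):.1f}% mem={memory_percent(SAMPLE):.1f}%")
```

Both functions work on deltas or ratios rather than raw counters, which is why a collector needs at least two readings before the CPU figure means anything.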
Once you have all of your container metrics flowing into your monitoring systems, you can then build charts and dashboards to visualize the performance of your containers and your infrastructure. Some monitoring systems will even discover these metrics for you automatically and provide curated, built-in dashboards to show your Docker infrastructure from cluster to host to container.
Collecting application metrics
What about application metrics? Collecting these is more complicated—if your applications don’t automatically push metrics to a remote endpoint, you’ll need to know what applications run where, what metrics to poll, and how to poll those metrics from your applications.
For first-party software, I strongly recommend having your application report its metrics on its own. In fact, most code instrumentation libraries already work this way. Alternatively, it should be easy to add this functionality to your codebase, but make sure that the remote endpoint is easily and (if possible) dynamically configurable.
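As a rough illustration of a dynamically configurable endpoint, the Python sketch below resolves the destination on every flush instead of once at startup. Everything named here is hypothetical: the METRICS_ENDPOINT variable, the default URL, the JSON payload shape, and the MetricsReporter class are assumptions made for the example, not the API of any particular instrumentation library.

```python
import json
import os
import time
from typing import Callable, Dict, List, Mapping

ENDPOINT_VAR = "METRICS_ENDPOINT"                   # hypothetical variable name
DEFAULT_ENDPOINT = "http://localhost:8125/metrics"  # hypothetical default


class MetricsReporter:
    """Buffers gauge readings and pushes them to a configurable endpoint."""

    def __init__(self, send: Callable[[str, str], None],
                 env: Mapping[str, str] = os.environ) -> None:
        self._send = send  # transport is injected, e.g. an HTTP POST function
        self._env = env
        self._pending: List[Dict[str, object]] = []

    def gauge(self, name: str, value: float) -> None:
        self._pending.append(
            {"metric": name, "value": value, "timestamp": int(time.time())}
        )

    def flush(self) -> int:
        """Send buffered readings; returns how many were sent."""
        if not self._pending:
            return 0
        # Resolve the endpoint at flush time so a changed setting takes
        # effect without restarting the application.
        endpoint = self._env.get(ENDPOINT_VAR, DEFAULT_ENDPOINT)
        batch, self._pending = self._pending, []
        # Note: if send raises, this batch is dropped rather than retried.
        self._send(endpoint, json.dumps(batch))
        return len(batch)
```

Injecting the send function keeps the sketch independent of any HTTP client and makes the reporter easy to exercise without a network.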
Collecting third-party software application metrics can get particularly tricky because, most of the time, the application that you want to monitor isn’t capable of pushing metrics data to an external endpoint. Therefore, you have to poll those metrics directly from the application, from JMX, or even from logs. Suffice it to say, in Dockerized environments, this can make configuring your monitoring system quite challenging, depending on whether you use some form of dynamic container scheduling.
Static container placement
Knowing the placement of your application containers, whether by configuration or by convention, makes collecting metrics from those applications easier. Starting the collection process is as simple as configuring collectd from a central location or preferably on each host. Keep in mind that you may have to expose additional TCP ports to reach the endpoint that exposes the application metrics. In some cases, such as for Elasticsearch and Zookeeper, a specific endpoint of the API is made directly available, whereas in others, such as with Kafka, you’ll need to enable and expose JMX.
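With static placement, the list of endpoints to poll can be derived directly from the placement map. The Python sketch below shows the idea under stated assumptions: the host names and the PLACEMENT and HTTP_ENDPOINTS tables are invented for illustration, and the Elasticsearch entry assumes its default HTTP port (9200) and its node statistics path. Services with no HTTP entry, such as Kafka here, are skipped because they would need JMX instead.

```python
from typing import Dict, List, Tuple

# Hypothetical static placement: which service runs on which hosts.
PLACEMENT: Dict[str, List[str]] = {
    "elasticsearch": ["es-1.internal", "es-2.internal"],
    "kafka": ["kafka-1.internal"],
}

# Services whose metrics are reachable over plain HTTP: (port, path).
HTTP_ENDPOINTS: Dict[str, Tuple[int, str]] = {
    "elasticsearch": (9200, "/_nodes/stats"),
}


def poll_targets(placement: Dict[str, List[str]],
                 endpoints: Dict[str, Tuple[int, str]]) -> List[str]:
    """Expand a static placement map into the URLs a collector should poll."""
    urls = []
    for service, hosts in placement.items():
        if service not in endpoints:
            continue  # no HTTP metrics endpoint known for this service
        port, path = endpoints[service]
        urls.extend(f"http://{host}:{port}{path}" for host in hosts)
    return sorted(urls)


if __name__ == "__main__":
    for url in poll_targets(PLACEMENT, HTTP_ENDPOINTS):
        print(url)
```

Each port that appears in such a table is one that has to be exposed from the corresponding container.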
Dynamic container scheduling
Dynamic container schedulers, like Kubernetes and Mesos/Marathon, don’t typically provide control over where your applications execute. Thus, it can be difficult to bridge the gap between metrics collection and monitoring systems, even if your applications leverage service discovery. Using serverless infrastructures or pure container hosting providers presents a similar challenge. There are three solutions to this problem, none of which is perfect, but each provides a starting point for collecting metrics from container-based applications:
- Make your metrics collection system dynamically reconfigurable so that it can react whenever your container scheduler takes action. Keep in mind that this requires a fair amount of engineering effort: you need to build a service that listens to the events your container scheduler generates as containers come and go and reconfigures your metrics collection system accordingly. For example, if you use collectd, this could mean automatically regenerating its configuration sections and restarting it as appropriate.
- Run collectd in a “sidekick” container and use the events generated by your container scheduler to automatically start and stop these sidekicks. For each application container running in your environment, a collectd container is started (with minimal configuration) to collect metrics exclusively from the application in the corresponding container. This approach multiplies the number of containers you are running but offers the most flexibility and reliability in the metrics collection process. Minimize network involvement whenever possible by executing this sidekick container with a placement constraint that forces it to run on the same physical host as the application container.
- Run collectd inside your application container so that you no longer have to deal with the dynamic nature of your application placement. When the application starts, collectd starts with it to report that application’s metrics. A minimal configuration can be tailored and run on localhost, providing the point of view of what’s inside the container. In this situation, you will need to manage the lifecycle of collectd running next to your application yourself.
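The first option, reacting to scheduler events, can be sketched in Python as a small reconciliation loop. This is only an outline under stated assumptions: the event shape (action, id, address), the TargetRegistry class, and the rendered "target ..." lines are all hypothetical and do not correspond to real collectd configuration syntax or to any scheduler's actual event format. The reload callback stands in for whatever regenerates the configuration and restarts the collector.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable


@dataclass
class TargetRegistry:
    """Tracks which application containers should currently be polled."""
    targets: Dict[str, str] = field(default_factory=dict)  # container id -> address

    def apply(self, event: dict) -> bool:
        """Apply one scheduler event; return True if the target set changed."""
        cid = event["id"]
        if event["action"] == "start":
            changed = self.targets.get(cid) != event["address"]
            self.targets[cid] = event["address"]
            return changed
        if event["action"] in ("die", "stop"):
            return self.targets.pop(cid, None) is not None
        return False  # ignore event types we do not care about

    def render(self) -> str:
        """Render a placeholder configuration, one line per target."""
        return "".join(
            f"target {cid} {addr}\n" for cid, addr in sorted(self.targets.items())
        )


def reconcile(registry: TargetRegistry, events: Iterable[dict],
              reload: Callable[[str], None]) -> bool:
    """Apply a batch of events and reload the collector at most once."""
    changed = False
    for event in events:
        changed = registry.apply(event) or changed
    if changed:
        reload(registry.render())
    return changed
```

Batching events before reloading keeps a burst of container starts from triggering a restart of the collector for every single container.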
Using SignalFx to monitor Docker
At SignalFx, we’ve been running Docker containers in production since 2013. Every single application we manage, in fact, executes within a Docker container. Along the way, we’ve learned how to monitor our Docker-based infrastructure and how to gain maximum visibility into our applications, wherever and however they run.
The hosts on which these containers execute all belong to a specific service or role. Salt, our configuration management system, sets up and configures collectd on each host. We use collectd the same way we recommend our customers do: with the SignalFx collectd package, the SignalFx collectd metadata plugin, and the Docker collectd plugin.
With this setup, we get complete visibility across all of the layers of our infrastructure—from every AWS instance to every application instance we run. Metrics from our first-party applications are emitted directly into SignalFx, while metrics from our third-party applications are provided via the corresponding plugins for those applications.
Although application metrics are the primary and clearest source of information on the health of your application, it’s also useful to monitor a handful of system-level metrics. This is particularly beneficial when we pack multiple containers onto the same host. Having container metrics reported by the docker-collectd-plugin helps us set up meaningful alerting and anomaly detectors that complement our application-level anomaly detection.
In our experience, CPU and network utilization are the key indicators that something is amiss in a container; we keep an eye on these metrics as they approach 100 percent. By using alerts to identify problematic containers and applications, we can remediate these issues before an application fails. Of course, memory utilization is also a useful indicator.
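One simple way to act on utilization approaching 100 percent is to flag a container only when its most recent readings all stay above a threshold, so a single spike does not raise an alert. The Python sketch below illustrates that idea; it is not how SignalFx detectors are implemented, and the threshold of 90 percent, the window of three samples, and the container names are arbitrary values chosen for the example.

```python
from typing import Dict, List, Sequence


def sustained_breach(series: Sequence[float], threshold: float = 90.0,
                     window: int = 3) -> bool:
    """True if the last `window` readings are all at or above `threshold`."""
    recent = list(series)[-window:]
    # Require a full window so a container with too few samples is not flagged.
    return len(recent) == window and all(v >= threshold for v in recent)


def problematic_containers(utilization: Dict[str, Sequence[float]],
                           threshold: float = 90.0,
                           window: int = 3) -> List[str]:
    """Names of containers whose utilization has stayed high, sorted."""
    return sorted(
        name for name, series in utilization.items()
        if sustained_breach(series, threshold, window)
    )


if __name__ == "__main__":
    # Invented CPU utilization readings (percent), oldest first.
    readings = {
        "web": [50.0, 92.0, 95.0, 97.0],
        "db": [99.0, 40.0, 95.0, 96.0],
        "cache": [95.0, 96.0],
    }
    print(problematic_containers(readings))
```

The same check applies to network or memory utilization once those are expressed as percentages of a known limit.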
Monitoring as a service
Our team at SignalFx previously built the analytics system in use at Facebook that monitors more than 22 trillion metrics per day. SignalFx aggregates metrics across distributed services with powerful streaming analytics to alert on servicewide issues and trends in real time, versus host-specific errors well after the fact. Thus, it addresses critical application and infrastructure management challenges unanswered by traditional monitoring, APM, and logging solutions.
SignalFx was built for apps that go beyond a single instance, for modern infrastructures like AWS or Google Cloud Platform, and for devops teams using services such as Docker, Kafka, and Elasticsearch.
SignalFx helps operations and product teams of all sizes manage their cloud environments in production by providing:
- Real-time analytics. With SignalFx, you can perform computations as metrics stream from your environment and drill down to see if an event is normal, an anomaly, a part of a trend, or a threat to availability.
- Actionable alerts. Get alerts on any metrics you choose and set detectors for only relevant changes to availability and performance. This means you can eliminate alert storms and false positives for good.
- Monitoring as a service. Our cloud-based monitoring solution offers flexibility to operations of any size. Configuration is automatic as you scale, with no limitations due to hardware or maintenance requirements.
- A breadth of integrations. We provide a full catalog of configured, production-ready plugins, built-in dashboards, and an open approach to sending metrics to help you grow your monitoring workloads as your infrastructure evolves.
- Instant insight for every user. SignalFx is advanced enough for power users but approachable enough to make monitoring the basis of collaboration at every point in the product lifecycle.