The concept of observability has been around for decades, but it’s a relative newcomer to the world of IT infrastructure. So what is observability in this context? It’s the state of having all the information about the internals of a system, so that when an issue occurs you can pinpoint the problem and take the right action to resolve it.
Notice that I said state. Observability is not a tool or a set of tools — it’s a property of the system that we are managing. In this article, I will walk through how to plan and implement an observable deployment including API testing and the collection of logs, metrics, and application performance monitoring (APM) data. I’ll also direct you to a number of free, self-paced training courses that help you develop the skills needed for achieving observable systems with the Elastic Stack.
Three steps to observability
These are the three steps toward observability presented in this article:
- Plan for success
  - Collect requirements
  - Identify data sources and integrations
- Deploy Elasticsearch and Kibana
- Collect data from systems and your services
  - Logs
  - Metrics
  - Application performance management
  - API synthetic testing
Plan for success
I have been doing fault and performance management for the past twenty years. In my experience, to reliably reach a state of observability, you have to do your homework before getting started. Here’s a condensed list of a few steps I take to set up my deployments for success:
Goals: Talk to everyone and write the goals down
Talk to your stakeholders and identify the goals: “We will know if the user is having a good or bad experience using our service;” “The solution will improve root cause analysis by providing distributed traces;” “When you page me in the middle of the night you will give me the info I need to find the problem;” etc.
Data: Make a list of what data you need and who has it
Make a list of the necessary information (data and metadata) needed to support the goals. Think beyond IT information — include whatever data you need to understand what is happening. For example, if Ops is checking the Weather Channel during their workflow, then consider adding weather data to your list of required information. Snoop around the best problem solver’s desk and find out what they’re looking at during an outage (and how they like their coffee). If your organization does postmortems, take a look at the data that the people bring into the room; if it’s valuable to determine the root cause at a finger-pointing session, then it’s so much more valuable in Ops before an outage.
Fix: Think about the solution and information that can speed it up
If Ops needs a hostname, a runbook, some asset info, and a process name to fix the problem, then have that data available in your observability solution and send it over when you page them. Add the required bits of information to the list you started in the previous step.
A good starting point
At this point, you have a list of data that you need so that when an issue occurs you can pinpoint the problem and take the right action to resolve it. That list might look something like this:
Service data
- User experience data for my service
  - Response time of the application per transaction and the components that make up the application (e.g., the front end and the database)
  - Proper API functionality via synthetic testing
- Performance data for my infrastructure
  - Operating system metrics
  - Database metrics
  - Logs from servers and apps
Inbound integrations
- History of past incidents
- Runbooks
- Asset info
- Weather or other “non-IT” data
Outbound integrations
- Incident management integration for alerting
Elastic Observability
The Elastic Stack — Elasticsearch, Kibana, Beats, and Logstash; formerly known as the ELK Stack — is a set of powerful open source tools for searching, analyzing, and visualizing data in real time. The Elastic Stack is widely used to centralize logs from operational systems. Over time, Elastic has added products for metrics, APM, and uptime monitoring — this is the Elastic Observability solution.
The value of Elastic Observability is that it brings together all the types of data you need to help you make the right operational decisions and achieve a state of observability. Let’s jump into a scenario to demonstrate how to put Elastic Observability into action.
Scenario
I have a simple application to manage. It consists of a Spring Boot application running on a Linux VM in Google Cloud Platform. The application exposes two API endpoints and has a MariaDB back end. You can find the application in the Spring Guides. I have created an Elasticsearch Service deployment in Elastic Cloud and I will follow the agent install tutorials right in Kibana, the Elasticsearch analysis and management UI. The open source agents that will be used are:
- Filebeat for logs
- Metricbeat for metrics
- Heartbeat for API testing and response time monitoring
- Elastic APM Java Agent for distributed tracing of the application
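As a preview of that last item, the Elastic APM Java Agent is attached to the JVM when the application starts, and the APM setup instructions in Kibana provide the exact values for your deployment. Here is a minimal sketch of what that start command can look like; every value below is a placeholder for illustration, not a setting from my deployment:
# Attach the APM agent (the jar is available from Maven Central) at JVM startup.
# Replace the placeholders with the values shown in the Kibana APM setup instructions.
java -javaagent:/path/to/elastic-apm-agent.jar \
  -Delastic.apm.service_name=spring-demo \
  -Delastic.apm.server_urls=https://<your-apm-server-url> \
  -Delastic.apm.secret_token=<your-secret-token> \
  -Delastic.apm.application_packages=com.example \
  -jar <your-spring-boot-app>.jar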
Note: This guide is written for a specific application based on Spring Boot and MySQL. If you have something else that you want to collect logs, metrics, and APM traces from, then you should be able to modify these instructions to do what you want. When you open up Kibana you will be greeted with a long list of out-of-the-box observability integrations.
Implementation
In this article I will go over the steps to get the basics done, and then in future articles I’ll dive into best practices and some of the integrations. Let’s walk through a simple deployment.
Hosted Elasticsearch Service
To follow along in this guide, create a deployment in Elasticsearch Service on Elastic Cloud (a trial account is free). Once you sign up, watch and follow the steps in the Deploy Elasticsearch in 3 minutes or less video. A few minutes later you will have a cluster that you can use to follow along with the rest of this article. Download the password that is presented to you; you will use that to log in to Kibana and to configure the Beats. The screenshots are from version 7.8 of the Elastic Stack — your UI may look slightly different based on your version.
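The Kibana tutorials will also show you how to point each Beat at this deployment. For reference, a Beat connects to an Elasticsearch Service deployment through the cloud.id and cloud.auth settings in its YAML configuration (filebeat.yml, metricbeat.yml, and so on); this is a minimal sketch with placeholder values, not my actual credentials:
# In filebeat.yml (the same two settings work for the other Beats):
# cloud.id comes from the deployment page; cloud.auth is "username:password".
cloud.id: "my-deployment:<base64 string from the deployment page>"
cloud.auth: "elastic:<the password you downloaded>"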
Kibana
Kibana is the visualization and management tool of the Elastic Stack. Kibana will guide us through installing and configuring the Beats and Elastic APM Java Agent.
Launch Kibana from the deployment details and log in with the elastic username and password. If you forget the password, reset it and then open Kibana:
Find your way home
The instructions for everything that you need to install can be found right in your Kibana instance. Over the next few pages I will often direct you to Kibana Home; you can get there by clicking on the Elastic icon in the top left of any Kibana page.
Navigation
If you want to dock the navigation menu while you learn your way around Kibana, click on the top-left three-line icon and then the lock at the bottom left:
Add integrations
This is the list of what will be collected:
- Logs from the infrastructure and MariaDB
- Metrics from the infrastructure and MariaDB
- API test results and response time measurements (see the Heartbeat sketch after this list)
- Distributed tracing of the application including the database
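The API testing item in that list is handled by Heartbeat, and its setup follows the same Kibana-guided pattern as the other Beats. As a rough sketch, an HTTP monitor in heartbeat.yml might look like the following; the URL and schedule are assumptions for illustration, not the actual endpoints of my application:
heartbeat.monitors:
- type: http
  # Name shown in the Uptime app
  name: spring-demo-api
  # Placeholder endpoint; substitute one of your application endpoints
  hosts: ["http://localhost:8080/demo/all"]
  # How often to run the check
  schedule: '@every 10s'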
Kibana guides you through adding logs, metrics, and APM. This video shows how to add MySQL metrics, and once you know how to do that you can follow the same process to add log and APM data.
Logs from the infrastructure and MariaDB
Both MariaDB and MySQL provide logs. I am interested in the error log and the slow log. By default the slow log is not produced. To configure these logs, have a look in the MariaDB docs. For my deployment the configuration file is /etc/mysql/mariadb.conf.d/50-server.cnf. Here are the relevant parts:
# This group is only read by MariaDB servers, not by MySQL.
# If you use the same .cnf file for MySQL and MariaDB,
# you can put MariaDB-only options here
[mariadb]
slow_query_log
#
# * Logging and Replication
#
# Both location gets rotated by the cronjob.
# Be aware that this log type is a performance killer.
# As of 5.1 you can enable the log at runtime!
#general_log_file = /var/log/mysql/mysql.log
#general_log = 1
#
# Error log - should be very few entries.
#
log_error = /var/log/mysql/error.log
#
# Enable the slow query log to see queries with especially long duration
slow_query_log_file = /var/log/mysql/mariadb-slow.log
long_query_time = 0.5
log_slow_rate_limit = 1
log_slow_verbosity = query_plan
#log-queries-not-using-indexes
To enable the slow query log, uncomment the lines in the slow query section and adjust the long query time as desired (the default is 10 seconds).
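If you want to confirm that the settings took effect, you can ask MariaDB directly; this assumes the same root access over the local socket used in the test below:
$ echo "SHOW GLOBAL VARIABLES LIKE 'slow_query%';" | sudo mysql
The output should show slow_query_log set to ON and slow_query_log_file pointing at the path configured above.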
A quick test of the configuration is to force a slow query with a SELECT SLEEP():
$ sudo -- sh -c 'echo "select sleep(2);" | mysql'
sleep(2)
0
This results in a record being added to the slow log:
# Time: 200427 15:19:59
# User@Host: root[root] @ localhost []
# Thread_id: 13 Schema: QC_hit: No
# Query_time: 2.000173 Lock_time: 0.000000 Rows_sent: 1 Rows_examined: 0
# Rows_affected: 0
SET timestamp=1588000799;
select sleep(2);
Install Filebeat
Follow the directions in Kibana Home > Add log data > MySQL logs. When you are instructed to enable and configure the mysql module, refer to these details for additional information:
- The filebeat modules enable command takes a list of modules, so save some steps and add system and auditd to the list:
sudo filebeat modules enable mysql system auditd
- When you are instructed to Modify the settings in the modules.d/mysql.yml file, note that the slow log I added is not in the default location, so edit the file modules.d/mysql.yml and specify the location of the slow log as an entry in the var.paths array:
- module: mysql
# Error logs
error:
enabled: true
# Set custom paths for the log files. If left empty,
# Filebeat will choose the paths depending on your OS.
#var.paths:
# Slow logs
slowlog:
enabled: true
# Set custom paths for the log files. If left empty,
# Filebeat will choose the paths depending on your OS.
var.paths:
- /var/log/mysql/mariadb-slow.log
Run the setup command and start Filebeat as directed in Kibana > Add log data > MySQL logs. At the bottom of that page is a link to the MySQL dashboard. You should also look at the [Filebeat System] Syslog dashboard ECS and [Filebeat System] Sudo commands ECS dashboards. You can search for these in the dashboard list:
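Back to the setup and start step: for reference, on a typical Linux package install those commands look roughly like this, with an optional configuration check first (prefer the exact commands shown in Kibana for your platform and version):
# Optional: verify the configuration parses and the Elasticsearch output is reachable
sudo filebeat test config
sudo filebeat test output
# Load the index template, ingest pipelines, and Kibana dashboards
sudo filebeat setup
# Start shipping logs
sudo service filebeat start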
Metrics from the infrastructure and MariaDB
The operating system and MariaDB both expose metrics. The operating system needs no extra configuration to expose them, and MariaDB makes its metrics available on port 3306 by default; the connection is password protected once you set a password in MariaDB.
Install Metricbeat
Follow the directions in Kibana Home > Add metric data > MySQL metrics. When you are instructed to enable and configure the mysql module, refer to these details for additional information:
- The metricbeat modules enable command takes a list of modules, so save some steps and add system to the list:
sudo metricbeat modules enable mysql system
- When you are instructed to Modify the settings in the modules.d/mysql.yml file, refer to these details:
The Metricbeat module for MySQL needs to be configured with the proper hostname or IP address, port, username, and password. Here is my /etc/metricbeat/modules.d/mysql.yml:
- module: mysql
#metricsets:
# - status
# - galera_status
period: 10s
# Host DSN should be defined as "user:pass@tcp(127.0.0.1:3306)/"
hosts: ["springuser:ThePassword@tcp(roscigno-obs:3306)/"]
  # I copied the username and password used in the Spring Boot guide
# my hostname is roscigno-obs
Run the setup command and start Metricbeat as directed in Kibana > Add metric data > MySQL metrics. At the bottom of that page is a link to the MySQL dashboard:
You should also look at the Metricbeat system dashboards. You can search for these in the dashboard list:
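As with Filebeat, the Metricbeat setup and start commands on a typical Linux package install are roughly the following; again, prefer the exact commands shown in Kibana for your version:
# Load the Metricbeat index template and dashboards, then start collecting metrics
sudo metricbeat setup
sudo service metricbeat start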