Understanding SLOs for monitoring applications

Determining the performance metrics that really matter for your application can make life a lot easier for your team and express your standards clearly across the business.


To properly manage and monitor an application, you need a goal for defining where you are and how you are doing so you can adjust and improve over time. This reference point is known as a service level objective (SLO). Taking the time to define clear SLOs will make life easier for service owners as well as for the internal or external users who depend on your services. 

However, before you can define an SLO you need an objective, quantitative metric you can look at to determine performance or reliability for your application. These metrics are known as service level indicators (SLIs).

Service level indicator—SLI

A good way to determine what metrics you should use for your SLIs is to think about what directly impacts your users’ happiness in terms of your application’s performance. This could include things such as latency, availability, and accuracy of the application. On the other hand, CPU utilization would be a bad SLI because your users don’t really care about how your server’s CPU is doing, as long as it isn’t impacting their experience with your app.

Additionally, the SLIs you choose will depend on what type of application you are running. For a typical request/response application, you will probably focus on availability, request latency, and the rate of successful requests per second. For data storage, you might look at availability and the consistency of the data being served. For a data pipeline, your SLIs might be whether the expected data is returned and how long it takes for the data to be processed, especially in an eventual consistency model.
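To make this concrete, the two most common SLIs reduce to simple ratios of good events to total events. This is a minimal sketch, not InfluxData's implementation; the function names and the request and record counts are hypothetical illustrative values.

```python
# Sketch: computing two common SLIs as good-events / total-events ratios.
# All counts below are hypothetical.

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully (a request/response SLI)."""
    return successful_requests / total_requests

def freshness_sli(processed_in_time: int, total_records: int) -> float:
    """Fraction of pipeline records processed within the target window
    (a data-pipeline SLI)."""
    return processed_in_time / total_records

print(f"availability: {availability_sli(99_840, 100_000):.3%}")  # 99.840%
print(f"freshness:    {freshness_sli(9_950, 10_000):.3%}")       # 99.500%
```

Expressing every SLI this way makes the later SLO comparison uniform: each target is just a floor on the ratio.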

Service level objective—SLO

An SLO is a performance threshold measured for an SLI over a period of time. This is the bar against which the SLI is measured to determine if performance is meeting expectations. A good SLO will define the level of performance your application needs, but not any higher than necessary. This is a crucial point and will require some testing over time. If your users are fine with 99% availability, there’s no reason to make the massive investment that would be required to hit 99.999% availability.

An example SLO for latency could target the 95th percentile, which tells you the latency that 95% of requests fall under; only the slowest 5% of requests exceed it. This is far better than simple latency averages, which can be easily skewed by outliers.
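A small sketch with made-up latency samples shows how a single outlier distorts the mean while the 95th percentile stays close to what typical users experience:

```python
import statistics

# Hypothetical latencies: 95 fast requests, a few slower ones,
# and one extreme outlier that drags the average up.
latencies_ms = [100] * 95 + [120] * 4 + [5000]

mean_ms = statistics.mean(latencies_ms)
# quantiles(n=20) returns the 5th, 10th, ..., 95th percentiles;
# index 18 is the 95th percentile.
p95_ms = statistics.quantiles(latencies_ms, n=20)[18]

print(f"mean: {mean_ms} ms")  # 149.8 -- badly skewed by the one outlier
print(f"p95:  {p95_ms} ms")   # 119.0 -- close to the typical experience
```

The mean suggests requests take half again as long as almost any user actually saw, which is exactly why percentile-based SLIs are preferred.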

Another option to provide even more granularity would be to measure the total number of requests and the number of requests taking more than a reasonable threshold like one second. The percentage of requests in excess of your baseline will help identify how often your users are impatiently waiting for data to return, for a page to render, or for an action to complete.
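That threshold-based measurement is also just a ratio. A minimal sketch, with a hypothetical sample of latencies and a one-second cutoff:

```python
# What share of requests exceeded a 1-second threshold?
# The latency samples are hypothetical.
THRESHOLD_MS = 1000

latencies_ms = [120, 340, 1500, 90, 2100, 450, 980, 1010, 200, 310]

slow = sum(1 for ms in latencies_ms if ms > THRESHOLD_MS)
ratio = slow / len(latencies_ms)
print(f"{slow}/{len(latencies_ms)} requests ({ratio:.0%}) exceeded 1s")
# -> 3/10 requests (30%) exceeded 1s
```

In production you would count these in your metrics pipeline rather than in application code, but the arithmetic is the same.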

Once you have nailed down your realistic performance goal, you need to figure out the time period you will use for measurement. Two common styles exist: calendar-based windows, which measure from one set date to another, such as the start and end of a month, and rolling windows, which look back from the current date by a set number of days.
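The difference between the two styles comes down to how the window boundaries are computed. A minimal sketch, assuming daily SLI measurements keyed by date (the 28-day default and example dates are illustrative):

```python
from datetime import date, timedelta

def calendar_window(today: date) -> tuple[date, date]:
    """Calendar-based: from the 1st of the current month through today."""
    return today.replace(day=1), today

def rolling_window(today: date, days: int = 28) -> tuple[date, date]:
    """Rolling: look back a fixed number of days from today."""
    return today - timedelta(days=days - 1), today

today = date(2021, 11, 15)
print(calendar_window(today))  # (datetime.date(2021, 11, 1), datetime.date(2021, 11, 15))
print(rolling_window(today))   # (datetime.date(2021, 10, 19), datetime.date(2021, 11, 15))
```

Note the practical trade-off: a calendar window resets (and the error budget refills) on a fixed date, while a rolling window means a bad day stays in the measurement until it ages out.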

Service level agreement—SLA

A service level agreement (SLA) is simply an SLO with an added agreement between the service provider and customer that establishes some form of consequences if an SLO isn’t met. This is generally seen between two different businesses as vendor and customer, with financial consequences for violating the SLA. An SLA could also be used inside companies where certain services may depend on other services controlled by different teams for the product to function.

Why use SLOs?

So now that you’ve got a decent understanding of what service level objectives are, you might be wondering why you should take the time to create them and use them. The most obvious reason is that taking the time to figure out what really matters in terms of performance can make life a lot easier for your team and express your standards clearly across the business. There are thousands of different ways you can track the metrics being generated by your applications, but if you break it down to what actually has a noticeable impact on users, you can clear away a lot of the distractions and noise.

At InfluxData, we’re all about time series data. As a result, we have large quantities of data covering myriad aspects of our systems. While there’s operational value in highly granular metrics, those metrics didn’t speak well to the customer experience and left service owners wanting more. So we took the approach of examining each microservice and its consumers, establishing reasonable success criteria and achievable goals.

The resulting outputs are consistent measurements we can apply across our entire fleet, providing insight into availability and error rate that serves as a proxy to customer experience. Not only is this beneficial for service owners as a means to achieve operational excellence and inform error budgets, but it allows for insight into our engineering organization for all levels of the business.

These were the goals behind the dashboard below for a service we operate. You’ll see that it’s easy to understand at a glance, provides valuable metrics that can be used for alerting and error budgeting, and illustrates that this service has a target of 99.9 percent availability. By providing this data throughout the company, we can accelerate the delivery of services. In turn, this leads to high-velocity “time to awesome” for customers developing their applications on top of our platform.

[Dashboard: SLO overview for an InfluxData service]

An important thing to note is that SLOs don’t have to be perfect on the first implementation. An SLO is always a work in progress that can be iterated as you get more data and learn more about user needs and expectations. Remember, the most valuable thing about implementing SLOs is the general mindset shift in monitoring your applications.

Tim Yocum is director of operations at InfluxData, where he is responsible for site reliability engineering and operations for InfluxData’s multi-cloud infrastructure. He has held leadership roles at startups and enterprises over the past 20 years, emphasizing the human factor in SRE team excellence.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Copyright © 2021 IDG Communications, Inc.