Anyone who’s been in the devops space is probably familiar with alert fatigue. At the beginning of a devops transformation, engineers set up as many monitors as they can to catch issues before they happen, or at least to understand what is happening when they do. Soon their inboxes are flooded with alerts, every alert becomes less meaningful, and no one reacts to anything, which is essentially the same as having no monitoring in the first place.
It’s a big problem, and finding the balance between capturing everything and not overloading the people responding to these issues is a key devops transformation challenge. Managers quickly have to fix the problem of engineers being woken up at three o’clock in the morning to look into something that turns out to be a false positive. They need the right tools and processes to efficiently catch meaningful events and alert people only when it’s absolutely necessary.
When launching new products or services, monitoring is often the last thing considered, if it isn’t overlooked entirely. When something blows up, nobody knows about it at first, and then everyone scrambles to fix it. Afterward they put something in place to make sure they catch it the next time. Pretty quickly the devops group accumulates hundreds or thousands of alerts. If the thresholds aren’t set right, or aren’t tweaked over time, the alerts create a lot of noise, and that’s a very common problem that can be avoided with the right planning and architecture.
One way to quantify alert fatigue is to look at the number of alerts per person, and the frequency and timing of those alerts. Devops leaders can track these on a kanban-style board to see which teams are in the green and which are in the red.
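As a rough illustration, here is a minimal Python sketch of that measurement, assuming alerts can be exported from the alerting tool as responder/timestamp records. The field names, the sample data, and the 8:00–18:00 "business hours" window are illustrative assumptions, not a standard.

```python
# Sketch: quantify alert fatigue per responder, including off-hours alerts.
from collections import Counter
from datetime import datetime

# Hypothetical export from the alerting tool.
alerts = [
    {"responder": "alice", "fired_at": datetime(2023, 5, 2, 3, 14)},
    {"responder": "alice", "fired_at": datetime(2023, 5, 2, 14, 5)},
    {"responder": "bob",   "fired_at": datetime(2023, 5, 3, 2, 40)},
]

per_person = Counter(a["responder"] for a in alerts)

# Count alerts that fired outside business hours (assumed here as 8:00-18:00).
off_hours = Counter(
    a["responder"] for a in alerts
    if not 8 <= a["fired_at"].hour < 18
)

for person, total in per_person.items():
    print(f"{person}: {total} alerts, {off_hours[person]} off-hours")
```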
Tools to eliminate alert fatigue
We are seeing a broad spectrum of monitoring tools. Clients are using open source tools such as Sensu and Nagios at the OS level, and CloudWatch for AWS applications. There are APM tools like New Relic and AppDynamics, and synthetic testing tools including Apica, Gomez, and Dynatrace. We use PagerDuty as one of the alerting mechanisms, but one of the rules and processes we set forth is that every alert should be actionable. Every time an alert fires, there are three possible outcomes: it’s an actual problem and we fix it; it fired prematurely and we should have waited some amount of time or used a different threshold before acting on it; or it isn’t telling us anything useful. In the latter two cases we either adjust the threshold or decide the alert is really doing nothing for us and turn it off. It’s a process of continuous improvement, and ultimately we wind up with meaningful alerts.
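A minimal sketch of that review loop might look like the following. The three outcomes come from the process described above; the data structure, the alert names, and the 20% threshold bump are illustrative assumptions.

```python
# Sketch: triage reviewed alerts into the three outcomes described above.
alerts_reviewed = [
    {"name": "api-5xx-rate",    "outcome": "real_problem"},
    {"name": "cpu-high",        "outcome": "premature", "threshold": 70},
    {"name": "disk-temp-spike", "outcome": "not_actionable"},
]

for alert in alerts_reviewed:
    if alert["outcome"] == "real_problem":
        print(f"{alert['name']}: keep as-is, it caught a real issue")
    elif alert["outcome"] == "premature":
        # Loosen the threshold (or add a delay) so the alert fires later.
        new_threshold = alert["threshold"] * 1.2
        print(f"{alert['name']}: raise threshold to {new_threshold:.0f}")
    else:
        print(f"{alert['name']}: turn it off, it is doing nothing for us")
```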
Sharing alerts beyond the devops team
A typical Fortune 1000-2000 company has 75 to 100 application owners who would like to see metrics on their applications. Many own multiple applications, and they generally want us to build metrics that are useful and actionable to them. Raw data on the health of individual pieces of their application is meaningless to them. They are counting on the application to drive revenue, and we need to build metrics that reassure them and tell them how the application is doing. Application owners typically want to know things like revenue, uptime, or whether they had to pay more overtime this month. We are often asked to look at conversion rates, page response time, cart abandonment, and the like. We can often accomplish this using a combination of synthetic monitoring and real user monitoring.
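As a sketch of turning monitoring data into owner-facing numbers, the snippet below derives conversion rate and cart abandonment from session counts. The hard-coded values are hypothetical; in practice they would come from real user monitoring and synthetic checks.

```python
# Sketch: owner-facing business metrics from monitoring data (assumed inputs).
sessions = 12_400           # total user sessions (from RUM)
carts_created = 1_860       # sessions that added an item to the cart
orders_completed = 1_240    # sessions that checked out
avg_page_response_ms = 840  # from synthetic checks

conversion_rate = orders_completed / sessions
cart_abandonment = 1 - (orders_completed / carts_created)

print(f"Conversion rate:    {conversion_rate:.1%}")
print(f"Cart abandonment:   {cart_abandonment:.1%}")
print(f"Page response time: {avg_page_response_ms} ms")
```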
Our methodology is to talk to the business. We present information and ask whether it is valuable, then refine it from there and provide dashboards that give them the metrics they want.
The standard cadence for presenting data to the business owners is weekly or biweekly. Owners who are especially interested in the data get live dashboards.
Using a single console
A common problem is that engineers have to look at multiple screens to monitor infrastructure and other data. Switching between different tools is cumbersome.
There are several great tools that organizations can use to ingest all these metrics into one central location. One key feature is the ability to weight the importance of the different metrics and roll them up into an overall health score for the service. A lot of clients spend time building homegrown tools to consolidate everything, but there are great solutions out there; Splunk, DataDog, and xView are some of the products that do this well.
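A minimal sketch of that idea, outside of any particular product, is a weighted roll-up of per-metric scores into one service health number. The metric names, weights, and 0–1 scoring below are illustrative assumptions; tools like Splunk and DataDog express this in their own ways.

```python
# Sketch: combine weighted per-metric scores into one service health score.
metrics = {
    # metric: (current health score 0.0-1.0, weight)
    "cpu_headroom":      (0.90, 0.2),
    "error_rate":        (0.70, 0.4),
    "p95_response_time": (0.80, 0.3),
    "queue_depth":       (1.00, 0.1),
}

total_weight = sum(weight for _, weight in metrics.values())
service_health = sum(score * weight for score, weight in metrics.values()) / total_weight

print(f"Overall service health: {service_health:.2f}")  # 0.80 with these numbers
```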
There are a lot of different ways to do it, but it’s important to get all those alerts in a central location and eliminate a lot of the noise to make sure things are meaningful.
What to measure: envisioning the alert landscape
We think of the alert landscape in layers. The base layer is system metrics: CPU, memory, network I/O, disk I/O, and so on. These typically tell us whether a system is running too close to capacity or too far under it.
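A small sketch of this base layer, using the psutil library to sample a host and flag capacity problems. The 85% and 10% thresholds are illustrative assumptions.

```python
# Sketch: sample system metrics and flag over- or under-utilized hosts.
import psutil

cpu = psutil.cpu_percent(interval=1)   # % CPU over a 1-second sample
mem = psutil.virtual_memory().percent  # % memory in use
disk = psutil.disk_usage("/").percent  # % disk in use on /

for name, value in [("cpu", cpu), ("memory", mem), ("disk", disk)]:
    if value > 85:
        print(f"{name} at {value:.0f}%: running too close to capacity")
    elif value < 10:
        print(f"{name} at {value:.0f}%: likely over-provisioned")
```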
On top of that layer, we look at service health. We want to know how particular functions of a service are operating, so we measure the service against its actual purpose: uptime, response times, and how long it takes to ingest a request and return a result.
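A minimal sketch of a service-level check follows, using the requests library. The /health endpoint and the 500 ms budget are illustrative assumptions.

```python
# Sketch: is the service up, and how long does a request take end to end?
import requests

URL = "https://example.com/health"  # hypothetical health endpoint

try:
    resp = requests.get(URL, timeout=5)
    latency_ms = resp.elapsed.total_seconds() * 1000
    if resp.status_code != 200:
        print(f"service unhealthy: HTTP {resp.status_code}")
    elif latency_ms > 500:
        print(f"service slow: {latency_ms:.0f} ms")
    else:
        print(f"service healthy: {latency_ms:.0f} ms")
except requests.RequestException as exc:
    print(f"service down: {exc}")
```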
We then take another step out and look at the application as a whole. We often use APM tools including New Relic, InfluxData, and AppDynamics, and we translate data from these tools to measure application health, including transactions and calls to the database.
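To show the kind of data this layer captures, here is a hand-rolled timing decorator that records how long a transaction and its database call take. Real APM agents do this automatically; the decorator and the checkout/load_cart functions are purely hypothetical stand-ins.

```python
# Sketch: time a transaction and its database call, APM-style.
import time
from functools import wraps

def timed(label):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                print(f"{label}: {elapsed_ms:.1f} ms")
        return wrapper
    return decorator

@timed("db.query.load_cart")
def load_cart(user_id):
    time.sleep(0.02)  # stand-in for a real database call

@timed("transaction.checkout")
def checkout(user_id):
    load_cart(user_id)
    time.sleep(0.05)  # stand-in for business logic

checkout(user_id=42)
```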
The top layer concerns the user. We want to know how our measurements actually relate to people trying to use the application.
When it’s mostly web applications, we do synthetic testing. These applications often have many users, with thousands and thousands of requests coming in, and we simulate their experience using synthetic monitoring. The data gathered can relate to speed, revenue, time on site, and so on.
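As a sketch of what a synthetic check can look like, the script below simulates a short user journey rather than probing a single endpoint. The URLs and the 2-second per-step budget are illustrative assumptions; it uses the requests library.

```python
# Sketch: synthetic check that walks a simple user journey and times each step.
import requests

STEPS = [
    ("home",     "https://shop.example.com/"),
    ("product",  "https://shop.example.com/product/123"),
    ("add_cart", "https://shop.example.com/cart/add?sku=123"),
]

session = requests.Session()
for name, url in STEPS:
    resp = session.get(url, timeout=10)
    latency_ms = resp.elapsed.total_seconds() * 1000
    ok = resp.status_code == 200 and latency_ms < 2000
    print(f"{name}: {'ok' if ok else 'FAIL'} ({latency_ms:.0f} ms)")
```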
If we are building metrics for business users, the user experience is what they care about. If it doesn’t impact the users, they’re probably not going to care.
When we are building metrics for the devops team, they are the ones being woken up in the middle of the night for all these things that pop up, so they are interested in automation that proactively catches issues before they happen.
We always look for a good balance. If we are not adding new alerts, we may miss something. But if we are not making the alerts cleaner and more relevant, we are going to wind up back in the alert fatigue situation. Ultimately, this balance is key to the success of any team supporting large-scale applications.