How AIOps improves application monitoring

Devops teams and site reliability engineers are vital to keeping applications functioning. AIOps can boost their effectiveness another notch.

IT operation teams use many tools to monitor, diagnose, and resolve system and application performance issues. In a recent survey of 1,300 IT professionals on the future of monitoring and AIOps, 42 percent report using more than 10 monitoring tools; 19 percent use more than 25 tools.

That’s a lot of technology just to keep the lights on and provide the data required to monitor, alert, research, and resolve application incidents.

Monitoring tools are not one size fits all, especially for organizations running mission-critical applications in multicloud environments. As organizations invest in mobile apps, microservices, dataops, and data science programs, new monitoring tools are being added to provide domain-specific monitoring capabilities.

AIOps platforms aim to simplify this landscape of monitoring tools. AIOps helps organizations that require high application service levels better manage the complexity of their monitoring tools and IT operational workflows. As the name suggests, AIOps brings machine learning and automation capabilities to the IT operations domain. These technologies aim to resolve incidents faster, identify operational trends that impact performance, and simplify the procedures required to resolve issues.

AIOps is still an emerging category. In the survey, 42 percent of respondents either had never heard of AIOps or thought that applying machine learning to operations was “not a thing,” and only 4 percent are using an AIOps tool in production today. Even so, there’s a solid business case for many organizations to consider it.

AIOps is driven by business need and operational complexity

More businesses today rely on applications to serve customers and run operations. That drives higher requirements and expectations on the reliability, performance, and security of the applications.

It also fuels demand for application development teams to build new applications and enhance them more frequently. The job responsibility of maintaining application service levels has also broadened during the past decade.

Once upon a time, organizations staffed the NOC (network operations center) as the front line of defense. If you ever walked into a NOC, you would likely see dozens of computer monitors with warning lights and trend visuals to help the staff pinpoint issues, ideally before an end user experienced one and opened a ticket.

Business and IT leaders began changing this model by introducing devops practices and site reliability engineers. Devops changes the IT department’s culture by establishing a collective responsibility to enable frequent deployments and better support customer and employee needs. Tools and practices such as CI/CD (continuous integration and continuous delivery) and IaC (infrastructure as code) are part of what enables more frequent deployments.  

But devops practices also require a shared operational responsibility for ensuring that applications are reliable, perform well, and are secure. That means more people in the IT organization need access to all the different monitoring tools.

Many IT organizations also hire SREs (site reliability engineers) to connect development and operations. SREs take a software engineering approach to system administration topics. In another survey that targeted SREs, respondents indicated that incident response is a massive part of their job: 49 percent said they respond to at least one incident every week.

Maturing devops practices and hiring site reliability engineers are how a growing number of IT organizations are meeting increasing operational challenges. But just expecting these teams to make sense of dozens of monitoring tools is a recipe for poor performance.

AIOps platform capabilities and technical architecture

How can AIOps improve the status quo? AIOps platforms typically have the following architecture components and capabilities:

  • A central data platform for aggregating raw logs and data from different monitoring tools.
  • Out-of-the-box integrations with the most common log formats, monitoring tools, IT service management tools, agile development tools, and other collaboration platforms.
  • Machine learning capabilities to help identify patterns in the aggregated data.
  • Consoles, dashboards, and analytics to help IT operations see and manage multiple systems from a central interface.
  • Automation capabilities that enable IT to communicate status, route issues, and autorespond to common problems.
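The first two components above amount to converting alerts from different tools into one common record. The following is a minimal sketch of that idea, not any vendor's implementation. The `Alert` fields, the two converter functions, and both payload shapes are hypothetical and chosen only for illustration; real monitoring tools each define their own formats.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Alert:
    """Common record that every incoming alert is converted into."""
    source: str          # which kind of tool raised the alert
    timestamp: datetime  # always timezone-aware, in UTC
    severity: str        # "info", "warning", or "critical"
    message: str


def from_metrics_tool(payload: dict) -> Alert:
    # Hypothetical payload: epoch seconds plus a numeric priority (1 = highest).
    severity = "critical" if payload["priority"] == 1 else "warning"
    return Alert(
        source="metrics",
        timestamp=datetime.fromtimestamp(payload["ts"], tz=timezone.utc),
        severity=severity,
        message=payload["title"],
    )


def from_log_tool(payload: dict) -> Alert:
    # Hypothetical payload: ISO 8601 time string plus a text log level.
    severity = "critical" if payload["level"].lower() == "error" else "info"
    return Alert(
        source="logs",
        timestamp=datetime.fromisoformat(payload["time"]),
        severity=severity,
        message=payload["msg"],
    )


alerts = [
    from_metrics_tool(
        {"ts": 1580000100, "priority": 1, "title": "Web server not responding"}
    ),
    from_log_tool(
        {"time": "2020-01-26T00:54:00+00:00", "level": "ERROR",
         "msg": "Java exception in checkout"}
    ),
]
```

Once every alert shares one shape and one clock, the pattern detection, dashboards, and automation described in the remaining bullets can work on a single stream instead of one stream per tool.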

What differentiates AIOps from other IT operational platforms is the ability to aggregate data easily, leverage machine learning to find problems, and use automation as a tool to resolve them. AIOps doesn’t replace the existing monitoring tools. It integrates with them so that more people in the IT department have improved visibility to problems without the complexity of learning and using multiple monitoring tools.

Similarly, AIOps platforms typically don’t replace existing IT service management, workflow, agile, and other communication tools. Instead, they act as a central platform that interfaces with those tools while alerting on and resolving an incident.

Monitoring mission-critical applications without AIOps

Imagine your e-commerce application experiences slow performance when users try to complete a purchase. The first indicator that starts to send out alerts is the shopping cart abandonment rate.

The e-commerce leader quickly opens a ticket about the issue in Cherwell’s mobile interface, but the IT team has already been alerted to the problem. As more users try to make purchases, the underlying Web servers hang and database connections stay open. Alerts from DataDog report these issues, and Splunk reports Java exceptions in the e-commerce application’s log files.

Now imagine the NOC responding to this issue. Where should they start, given the number of alerts going off at the same time? The SREs called in to assist must also investigate the different alerts from different tools. Meanwhile, the e-commerce leader is upset because no one responded to her ticket!

AIOps helps IT address issues faster and with less stress

Here’s how AIOps platforms can potentially address this issue faster and more effectively.

First, AIOps sees that multiple alerts are going off, including application alerts. It automatically alerts the SREs, and when one responds, it automatically updates Cherwell that the incident has been answered by an SRE. No one had to manually update any system to send out these communications.
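As a rough illustration of this first step, the sketch below groups alerts that arrive close together into a single incident and records a ticket note when someone acknowledges it. Everything here is an assumption made for the example: the `Incident` class, the five-minute grouping window, and the in-memory `ticket_notes` list that stands in for a call to a service desk tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    alert_times: List[int]  # seconds since an arbitrary reference point
    ticket_notes: List[str] = field(default_factory=list)
    acknowledged_by: Optional[str] = None

    def acknowledge(self, responder: str) -> None:
        # Stand-in for updating the linked service desk ticket.
        self.acknowledged_by = responder
        self.ticket_notes.append(f"Incident acknowledged by {responder}")


def group_into_incidents(alert_times, window_seconds=300):
    """Put each alert into the current incident if it arrives within
    window_seconds of that incident's latest alert; otherwise start a new one."""
    incidents = []
    for t in sorted(alert_times):
        if incidents and t - incidents[-1].alert_times[-1] <= window_seconds:
            incidents[-1].alert_times.append(t)
        else:
            incidents.append(Incident(alert_times=[t]))
    return incidents


incidents = group_into_incidents([0, 45, 90, 7200])
incidents[0].acknowledge("sre-on-call")
```

Grouping by time alone is the simplest possible correlation rule. A real platform would likely also consider which hosts, services, or applications the alerts refer to.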

Second, the alerts from Cherwell, the e-commerce platform, Splunk, and DataDog are all aggregated and time sequenced. The SRE immediately knows which alert fired first and which followed. That’s incredibly useful because the SRE can quickly see that the hanging Web servers and the open database connections both started after the Java application exceptions.
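Time sequencing is simple once the alerts sit in one place: sort by timestamp and show each alert's offset from the first. The sketch below uses made-up timestamps and generic source labels purely to illustrate the scenario; none of the values come from a real system.

```python
from datetime import datetime

# (timestamp, source, message) tuples; all values are illustrative.
alerts = [
    (datetime(2020, 1, 26, 10, 4, 30), "infrastructure monitor",
     "Web servers hanging"),
    (datetime(2020, 1, 26, 10, 5, 10), "service desk",
     "Ticket opened: checkout is slow"),
    (datetime(2020, 1, 26, 10, 3, 15), "log analytics",
     "Java exceptions in e-commerce app"),
    (datetime(2020, 1, 26, 10, 4, 45), "infrastructure monitor",
     "Database connections staying open"),
]

# Order the alerts by when they fired, regardless of which tool sent them.
timeline = sorted(alerts, key=lambda alert: alert[0])

first_time = timeline[0][0]
for when, source, message in timeline:
    offset = int((when - first_time).total_seconds())
    print(f"+{offset:>4}s  {source:<22}  {message}")
```

With this sample data, the Java exceptions appear at the top of the timeline, which matches the order of events in the scenario.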

The AIOps platform’s machine learning capabilities are fairly sophisticated, so in addition to reporting on alerts, it also highlights other outlier operating conditions. In this case, the e-commerce application has many slow outbound connections to a single IP address. No alerts or exceptions were raised for this condition, but it began before any of the other alerts fired.
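To make the idea of an outlier operating condition concrete, here is a deliberately simple sketch that flags destinations whose typical connection time is far above that of the others. It uses a median comparison rather than machine learning, so it is a stand-in for whatever models a real platform applies. The sample data, the `slow_destinations` function, and the threshold factor of 5 are all assumptions; the IP addresses come from ranges reserved for documentation.

```python
from collections import defaultdict
from statistics import median

# (destination IP, connection time in milliseconds); values are illustrative.
connections = [
    ("192.0.2.10", 40), ("192.0.2.10", 55), ("192.0.2.10", 48),
    ("198.51.100.7", 62), ("198.51.100.7", 58), ("198.51.100.7", 70),
    ("203.0.113.25", 2400), ("203.0.113.25", 3100), ("203.0.113.25", 2800),
]


def slow_destinations(samples, factor=5.0):
    """Return destinations whose median connection time exceeds
    factor times the median of all per-destination medians."""
    by_ip = defaultdict(list)
    for ip, millis in samples:
        by_ip[ip].append(millis)

    per_ip_median = {ip: median(values) for ip, values in by_ip.items()}
    baseline = median(per_ip_median.values())

    return sorted(
        ip for ip, value in per_ip_median.items() if value > factor * baseline
    )


outliers = slow_destinations(connections)
```

With this sample data, `outliers` contains only `203.0.113.25`, the one destination that is dramatically slower than the rest even though no alert was configured for it.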

It doesn’t take the SRE much longer to figure out that this is a connection to a third-party service that validates the city, state, and ZIP code of the buyer. This service is clearly having performance issues that are rippling through the entire application.

With a root cause identified, the SRE adds a high-severity defect to the e-commerce development team’s Jira backlog, alerting them to the problem. A high-severity issue signals the agile development team to interrupt its sprint and address it. It’s a quick fix to bypass the affected service, and it’s easy to test and deploy the change through the team’s Jenkins CI/CD pipeline.

The AIOps platform tracks this defect, the deployment, and the drop in all the alerts and keeps the e-commerce leader updated on the progress. Even though the SRE is monitoring the situation, the AIOps platform closes the issue automatically when all the monitors return to normal.
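The automatic close can be thought of as a rule evaluated over recent monitor results. The sketch below shows one possible rule: close only when every monitor has reported healthy for several consecutive checks. The `should_close` function, the `"ok"` status value, and the requirement of three consecutive healthy checks are assumptions for illustration. The consecutive-check requirement is a design choice intended to avoid closing an incident while a monitor is still flapping.

```python
def should_close(monitor_states, healthy_checks_required=3):
    """monitor_states maps a monitor name to its recent statuses, newest last.

    Returns True only if there is at least one monitor and every monitor's
    most recent healthy_checks_required statuses are all "ok".
    """
    if not monitor_states:
        return False
    return all(
        len(history) >= healthy_checks_required
        and all(status == "ok" for status in history[-healthy_checks_required:])
        for history in monitor_states.values()
    )


still_recovering = {
    "web servers": ["alert", "ok", "ok", "ok"],
    "database connections": ["alert", "alert", "ok", "ok"],
}
recovered = {
    "web servers": ["alert", "ok", "ok", "ok"],
    "database connections": ["alert", "ok", "ok", "ok"],
}

close_early = should_close(still_recovering)
close_now = should_close(recovered)
```

Here `close_early` is `False` because the database monitor has only two healthy checks in a row, while `close_now` is `True`.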

Implementing this scenario isn’t trivial, but neither is it science fiction with AIOps platforms.

Copyright © 2020 IDG Communications, Inc.