5 devops practices to improve application reliability

How to use monitoring and observability to resolve application performance problems before they impact users and the business.

When developers deploy a new release of an application or microservice to production, how does IT operations know whether it performs outside of defined service levels? Can they proactively recognize that there are issues and address them before they turn into business-impacting incidents?

And when incidents impact performance, stability, and reliability, can they quickly determine the root cause and resolve issues with minimal business impact? 

Taking this one step further, can IT ops automate some of the tasks used to respond to these conditions rather than having someone in IT support perform the remediation steps?

And what about the data management and analytics services that run on public and private clouds? How does IT ops receive alerts, review incident details, and resolve issues from data integrations, dataops, data lakes, etc., as well as the machine learning models and data visualizations that data scientists deploy? 

These are key questions for IT leaders deploying more applications and analytics as part of digital transformations. Furthermore, as devops teams enable more frequent deployments using CI/CD and infrastructure as code (IaC) automations, the likelihood that changes will cause disruptions increases.

What should developers, data scientists, data engineers, and IT operations do to improve reliability? Should they monitor applications or increase their observability? Are monitoring and observability two competing implementations, or can they be deployed together to improve reliability and shorten the mean time to resolve (MTTR) incidents?
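To make the MTTR metric concrete, here is a minimal sketch that averages the time between detection and resolution across a few incidents. The incident timestamps and the function name are hypothetical and exist only for illustration; real teams may define the start and end of an incident differently.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected_at, resolved_at).
incidents = [
    (datetime(2021, 3, 1, 9, 0), datetime(2021, 3, 1, 9, 45)),    # 45 minutes
    (datetime(2021, 3, 4, 14, 10), datetime(2021, 3, 4, 16, 10)),  # 120 minutes
    (datetime(2021, 3, 9, 22, 30), datetime(2021, 3, 9, 22, 45)),  # 15 minutes
]


def mean_time_to_resolve(records):
    """Average of (resolved - detected) across all incidents."""
    if not records:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in records),
        timedelta(),  # start value so timedeltas can be summed
    )
    return total / len(records)


mttr = mean_time_to_resolve(incidents)
print(mttr)  # 1:00:00
```

Tracking this number per application or value stream over time is one way to see whether monitoring and observability investments are actually shortening incidents.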

I asked several technology partners who help IT develop applications and support them in production for their perspectives on monitoring, observability, AIops, and automation. Their responses suggest five practice areas to focus on to improve operational reliability.  

Develop one source of operational truth between developers and operations

Over the last decade, IT has been trying to close the gap between developers and operations in terms of mindsets, objectives, responsibilities, and tooling. Devops culture and process changes are at the heart of this transformation, and many organizations begin this journey by implementing CI/CD pipelines and IaC.

Agreement on which methodologies, data, reports, and tools to use is a key step toward aligning application development and operations teams in support of application performance and reliability.

Mohan Kompella, vice president of product marketing at BigPanda, agrees, noting the importance of developing a single operational source of truth. “Agile developers and devops teams use their own siloed and specialized observability tools for deep-dive diagnostics and forensics to optimize app performance,” he says. “But in the process, they can lose visibility into other areas of the infrastructure, leading to finger-pointing and trial-and-error approaches to incident investigation.”

The solution? “It becomes necessary to augment the developers’ application-centric visibility with additional 360-degree visibility into the network, storage, virtualization, and other layers,” Kompella says. “This eliminates friction and lets developers resolve incidents and outages faster.”

Understand how application issues impact customers and business operations

Before diving into an overall approach to application and system reliability, it’s important to have customer needs and business operations at the front of the discussion.

Jared Blitzstein, director of engineering at Boomi, a Dell Technologies business, stresses that customer and business context are central to developing a strategy. “We have centered observability around our customers and their ability to gather insights and actions into the operation of their business,” he says. “The difference is we use monitoring to understand how our systems are behaving at a point in time, but leverage the concept of observability to understand the context and overall impact those items (and others) have on our customer’s business.”

Having a customer mindset and business metrics guides teams on implementation strategy. “Understanding the effectiveness of your technology solutions on your day-to-day business becomes the more important metric at hand,” Blitzstein continues. “Fostering a culture and platform of observability allows you to build the context of all the relevant data needed to make the right decisions at the moment.”

Improve telemetry with monitoring and observability

If you’re already monitoring your applications, what do you gain by adding observability to the mix? What is the difference between monitoring and observability? I put these questions to two experts. Richard Whitehead, chief evangelist at Moogsoft, offers this explanation:

Monitoring relies on coarse, mostly structured data types—like event records and the performance monitoring system reports—to determine what is going on within your digital infrastructure, in many cases using intrusive checks. Observability relies on highly granular, low-level telemetry to make these determinations. Observability is the logical evolution of monitoring because of two shifts: re-written applications as part of the migration to the cloud (allowing instrumentation to be added) and the rise of devops, where developers are motivated to make their code easier to operate.

And Chris Farrell, observability strategist at Instana, an IBM Company, threw some additional light on the difference:

More than just getting data about an application, observability is about understanding how different pieces of information about your application system are connected, whether metrics from performance monitoring, distributed tracing of user requests, events in your infrastructure, or even code profilers. The better the observability platform is at understanding those relationships, the more effective any analysis from that information becomes, whether within the platform or downstream being consumed by CI/CD tooling or an AIops platform.

In short, monitoring and observability share similar objectives but take different approaches. Here’s my take on when to increase application monitoring and when to invest in observability for an application or microservice.

When agile devops teams and IT operations collaborate closely on developing or modernizing cloud-native applications and microservices, they have an opportunity to establish observability standards and engineer them in during the development process. Adding observability to legacy or monolithic applications may be impractical. In that case, monitoring may be the best way to understand what is going on in production.
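Farrell's point about connecting different pieces of information can be illustrated with a small sketch. It assumes every telemetry record (trace, log, or metric) carries a shared trace ID so that related records can be grouped. The record fields, service names, and values below are hypothetical, not the schema of any particular observability platform.

```python
from collections import defaultdict

# Hypothetical telemetry records from different sources, linked by trace_id.
telemetry = [
    {"trace_id": "t-1", "source": "trace", "service": "checkout", "duration_ms": 1840},
    {"trace_id": "t-1", "source": "log", "service": "payments",
     "message": "retrying card authorization"},
    {"trace_id": "t-2", "source": "trace", "service": "checkout", "duration_ms": 120},
    {"trace_id": "t-1", "source": "metric", "service": "payments",
     "name": "db_pool_wait_ms", "value": 1500},
]


def group_by_trace(records):
    """Collect all records that share a trace ID."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["trace_id"]].append(record)
    return dict(grouped)


def slow_traces(grouped, threshold_ms):
    """Return trace IDs that contain a trace record slower than the threshold."""
    return [
        trace_id
        for trace_id, records in grouped.items()
        if any(
            r["source"] == "trace" and r["duration_ms"] > threshold_ms
            for r in records
        )
    ]


grouped = group_by_trace(telemetry)
for trace_id in slow_traces(grouped, threshold_ms=1000):
    # Print every related record so the slow request can be read in context.
    for record in grouped[trace_id]:
        print(trace_id, record["source"], record["service"])
```

A monitoring check would flag only the slow checkout request. Grouping by trace ID also surfaces the payment retries and database wait recorded during the same request.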

Automate actions to respond to monitored and observed issues

Investing in observability, monitoring, or both will improve data collection and telemetry and lead to a better understanding of application performance. Then, by centralizing that monitoring and observability data in an AIops platform, you can not only produce deeper operational insights faster but also automate responses.

Today’s IT operations teams have too much on their plate. Connecting insights to actions and leveraging automation is a critical capability for keeping up with the demand for more applications and increased reliability, says Marcus Rebelo, director of sales engineering of Americas at Resolve.

“Collect, aggregate, and analyze a wide variety of data sources to produce valuable insights and help IT teams understand what’s really going on in complex, hybrid cloud environments,” Rebelo says. But that’s not enough.

“It is critical to tie those insights to automation to transform IT operations,” Rebelo adds. “Combining automation with observability and AIops is the key to maximizing the insights’ value and handling the increasing complexity in IT environments today.”
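One simple way to picture tying insights to automation is a runbook that maps detected conditions to remediation actions and escalates anything it does not recognize. This is only a sketch: the condition names, the actions, and the alert format are hypothetical, and the actions return a description instead of calling real infrastructure.

```python
def restart_service(alert):
    """Placeholder remediation: a real action would call an orchestration API."""
    return f"restarted {alert['service']}"


def scale_out(alert):
    """Placeholder remediation: a real action would add capacity."""
    return f"scaled out {alert['service']}"


# Hypothetical runbook: detected condition -> automated action.
RUNBOOK = {
    "service_down": restart_service,
    "high_cpu": scale_out,
}


def respond(alert):
    """Run the automated action for a known condition, otherwise escalate."""
    action = RUNBOOK.get(alert["condition"])
    if action is None:
        return f"escalated {alert['service']} to on-call"
    return action(alert)


print(respond({"service": "checkout", "condition": "high_cpu"}))
print(respond({"service": "search", "condition": "disk_corruption"}))
```

Keeping an explicit escalation path for unknown conditions means automation handles the routine cases while people still see the ones that need judgment.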

Optimize monitoring and observability for value stream delivery

By connecting customer needs and business metrics on the one hand with monitoring, observability, AIops, and automation on the other, IT operations has an end-to-end strategy for ensuring a value stream’s operational reliability.

Bob Davis, chief marketing officer at Plutora, suggests that monitoring and observability are both required to support a portfolio of value streams. “Monitoring tools provide precise and deep information on a particular task, which can include watching for defects or triggers on usage or tracking the performance of something like an API, for example,” Davis says. “Observability tools look at everything and draw conclusions on what’s going on with the entire system or value stream.”

Therefore observability tools have a special role in the value stream. “With the information provided by observability tools, developers can better understand the health of an organization, boost efficiency, and improve an organization’s value delivery,” Davis notes.
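The kind of narrow monitoring check Davis describes, tracking the performance of an API, might look like the sketch below. It computes a nearest-rank 95th percentile over latency samples and compares it with a service-level threshold. The sample values and the 500 ms threshold are hypothetical.

```python
import math


def percentile(samples, percent):
    """Nearest-rank percentile: the smallest value covering `percent` of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


def breaches_slo(samples, percent, threshold_ms):
    """True when the chosen percentile latency exceeds the threshold."""
    return percentile(samples, percent) > threshold_ms


# Hypothetical API latencies in milliseconds: mostly fast, with two slow calls.
latencies_ms = [120] * 18 + [900, 950]

print(percentile(latencies_ms, 95))         # 900
print(breaches_slo(latencies_ms, 95, 500))  # True
```

A check like this says that one API is slow. It takes the broader, system-wide view attributed to observability tools to explain what that slowness means for the value stream as a whole.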

There are many tools, practices, and trade-offs to weigh, but in the end, improving application delivery and reliability requires aligning development and operations on shared objectives.

Copyright © 2021 IDG Communications, Inc.