How PagerDuty helps customer service and IT teams improve responses

PagerDuty uses machine learning to anticipate issues and dramatically accelerate responses, so problems are addressed before they impact customers

Predicting the outcome of the NCAA men’s Division I basketball tournament, an event where upsets are celebrated wildly and the outcome is notoriously difficult to foresee, is nearly as competitive as the tournament itself. For years, Warren Buffett held a contest offering a billion dollars for a perfect bracket, and nobody even came close. Speaking of unpredictability, just as fans were getting ready to make their picks for this year’s tournament, all major public sporting events were canceled. Who could have predicted that?

Even though we can’t see the future, a deep understanding of variables does enable people to make better predictions and gain an edge over the competition. Picking winners by their school mascot may work every once in a while, but an in-depth study of the best teams, coaches, and athletes is a much more effective strategy.

Likewise, customer service, devops, and IT issues are inherently unpredictable. It’s impossible for companies to know in advance when operational problems will arise, product defects will surface, or communications will go awry. Solutions driven by AI and machine learning can help teams improve their odds. These products can dramatically accelerate responses to issues, so problems are prevented or resolved before most customers encounter them.

Companies can get thousands of alerts per minute when a problem arises within their digital app or service (a broken cart on an ecommerce website, for example), a volume that is neither useful nor actionable for the humans who have to interpret it. The overwhelming amount of noise simply leads to lost signals and many more contacts between customers and service teams before underlying problems can be addressed.

Predictive solutions for customer service are built on understanding the drivers behind the signals. Quickly identifying patterns helps companies stay ahead of the curve. Machine learning tools free up a lot of cycles for response teams by cutting through the noise, rather than distracting them over and over again with alerts and information that may not be useful.

When teams use machine learning in this way, they can boil down the signals to uncover the actual incidents that are driving the unmanageable number of alerts. Instead of scrambling to put out many small fires, they can see the big picture of where the problems actually lie and be more intelligent and informed in tackling a smaller group of larger issues.

How predictive capabilities can improve service responses

Predictive processes must be conducted in real time if they are going to help companies get ahead of the issues for the majority of customers. A developing problem that threatens to impact customers leaves no time to pause for reflection or deliberation.

The higher-level need for predictive customer and IT services is in training algorithms to recognize which alerts belong to which incidents. At PagerDuty, our main goal is to help companies identify issues before they cause problems within digital systems, and predict what may go wrong in the future so companies can get ahead of it. We use machine learning to group alerts together so teams can see the full scale of the issues and know precisely how to solve them.

For example, multiple teams may each be working on individual complaints without understanding that they are all elements of a single issue. Insights from PagerDuty’s platform solve that problem and get everyone on the same page. Meanwhile, as responders are assigned specific issues to fix, the platform triages messages to each individual so they are not overwhelmed with issues outside of the one they are addressing.

This is important because most systems don’t operate in isolation, and a point failure in one place does not have the same consequences as a point failure somewhere else. When problems arise, companies use PagerDuty to help find the origin point for cascading issues to try to prevent catastrophic failure. When teams can be more predictive and preventive, they gain a higher-level view of the problems and learn where their efforts will have the most impact.

A structure that helps teams identify and solve problems quickly also gives much greater visibility to every level of the organization. Managers and directors can have better insights about how to deploy teams. Leaders who may have to explain problems or downtime to customers are likewise armed with information and a clear path forward.

How PagerDuty uses machine learning 

Giving companies greater predictive and preventive power for customer service and IT starts with grouping issues in a way that helps identify underlying causes of digital issues. That grouping begins with the supposition that if two messages have similar text, then those messages are fundamentally similar. Although this is reasonable in theory, knowing whether those messages are truly similar is a fuzzy concept.

At PagerDuty, the most impactful solution has been to apply a parser that takes messages and transforms them into coarser, less specific language. This process boils down the words in order to surface specific elements within the message.

The system locates unique identifiers like dates, times, customer IDs, or websites with IDs inside that would only be issued in the context of customer messages and reports. These identifiers are generally unimportant to the parser in terms of content. The program simply identifies that they are present within the body of the message.
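As a rough illustration of this kind of blurring, the sketch below replaces identifiers with generic placeholders so that the parser records only that an identifier is present, not what its value is. This is not PagerDuty's actual parser: the function name `mask_identifiers`, the placeholder tokens, and the regular expressions are all assumptions chosen for the example, and real alert formats would need a much richer set of patterns.

```python
import re

# Hypothetical patterns for illustration only. Order matters: URLs, dates,
# and times are masked before the generic ID pattern, which would otherwise
# swallow them.
PATTERNS = [
    ("<URL>", re.compile(r"https?://\S+")),
    ("<DATE>", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
    ("<TIME>", re.compile(r"\b\d{2}:\d{2}(?::\d{2})?\b")),
    # Any remaining token that contains at least one digit is treated as an ID.
    ("<ID>", re.compile(r"\b[A-Za-z]*\d[A-Za-z0-9-]*\b")),
]


def mask_identifiers(message: str) -> str:
    """Lowercase a message and replace identifiers with placeholders."""
    blurred = message.lower()
    for placeholder, pattern in PATTERNS:
        blurred = pattern.sub(placeholder, blurred)
    return blurred


a = mask_identifiers("Checkout failed for customer 48213 at 2020-03-12 14:05:09")
b = mask_identifiers("Checkout failed for customer 90117 at 2020-03-12 14:07:41")
print(a)       # checkout failed for customer <ID> at <DATE> <TIME>
print(a == b)  # True: the two alerts differ only in their identifiers
```

Two alerts that differ only in customer ID and timestamp become textually identical after masking, which is what makes the later grouping steps possible.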

After this overall blurring, the words and identifiers within each message can then be grouped together. This is where PagerDuty’s platform examines the incoming signals and determines the full extent to which messages share groups of words.

This step is accomplished by vectorizing, which is the process of turning each of these series of words into a representative sequence of numbers. But it is still an imperfect system. Every sentence produces a vector representation, of course, but each vector could conceivably come from several different sentences. Usually, there is enough information to determine when sentences have the same information. But PagerDuty’s software engineers still have to account for the fact that there are many ways that a vector could have been put together.
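To make the ambiguity concrete, here is a minimal sketch that assumes a simple bag-of-words vectorizer, where each position in the vector counts how often one vocabulary word appears. The article does not say which vectorization PagerDuty uses, so the helper names `build_vocabulary` and `vectorize` and the choice of word counts are assumptions for illustration.

```python
from collections import Counter


def build_vocabulary(messages):
    """Return a sorted list of every distinct word seen in the messages."""
    return sorted({word for message in messages for word in message.split()})


def vectorize(message, vocabulary):
    """Turn a message into a list of word counts, one entry per vocabulary word."""
    counts = Counter(message.split())
    return [counts[word] for word in vocabulary]


messages = ["timeout after retry", "retry after timeout", "timeout timeout"]
vocabulary = build_vocabulary(messages)  # ['after', 'retry', 'timeout']

for message in messages:
    print(message, "->", vectorize(message, vocabulary))
```

The first two messages order their words differently but produce the same vector, `[1, 1, 1]`, because word counts discard word order. That is one way a single vector can come from several different sentences.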

Once the system can identify a group of messages that have the same vectors, they are bundled together. These groups essentially have the same content. Their identifiers indicate they are full of all the same terms.
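Bundling messages with identical vectors can be sketched with a dictionary keyed by the vector. The example below assumes word-count vectors and already-masked messages; the names `vector_key` and `bundle_identical` are hypothetical, and a production system would bundle over a stream of alerts rather than a list held in memory.

```python
from collections import Counter, defaultdict


def vector_key(message):
    """Build a hashable stand-in for the message's word-count vector."""
    return tuple(sorted(Counter(message.split()).items()))


def bundle_identical(messages):
    """Group messages whose word-count vectors are exactly equal."""
    bundles = defaultdict(list)
    for message in messages:
        bundles[vector_key(message)].append(message)
    return list(bundles.values())


alerts = [
    "checkout failed for customer <ID>",
    "disk full on host <ID>",
    "checkout failed for customer <ID>",
]
for bundle in bundle_identical(alerts):
    print(len(bundle), bundle[0])
```

Here the two checkout alerts land in one bundle and the disk alert in another, so three alerts collapse into two groups.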

Turning machine data into predictions and prevention

A company usually learns that something is going wrong when it is suddenly flooded with reports and messages. Most of these will be machine-generated, some with custom templates and some even written by a person. Without some sort of grouping, teams have no way of seeing a higher-level view of the circumstances. They could build a grouping tool, but that would require a serious investment of time and effort, all while more incident reports pile up.

Likewise, because so many of the messages have different content, simply grouping messages only when they are identical doesn't do much to diminish the volume of issues. Using AI to discern similarities lets the group accumulate relevant information over time. Instead of thousands of individual problems, each represented by a report or message, grouping alerts this way surfaces just a few core issues that are the source of other problems.

At that point, the system has enabled the response teams to become both predictive and preventive. It becomes much easier to find the biggest issues and fix the underlying causes that will prevent future problems. Prioritizing a little engineering work on core issues causes the incident load to drop dramatically, all from basic AI-driven grouping.

In theory, this should be a very reliable process. Once messages are parsed, identified, and vectorized, it should be easy for the system to group them together as similar. They are all textually related and the vectors let the platform measure the strength of the relatedness.
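One common way to measure the strength of relatedness between two vectors is cosine similarity, which is 1.0 for vectors pointing in the same direction and 0.0 for vectors with no words in common. The sketch below assumes that measure, word-count vectors, a greedy grouping pass, and an arbitrary threshold of 0.7; none of these choices is confirmed by the article as PagerDuty's approach.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_similar(messages, threshold=0.7):
    """Greedily add each message to the first group whose first member
    is at least `threshold` similar to it; otherwise start a new group."""
    groups = []  # list of (representative vector, member messages)
    for message in messages:
        vector = Counter(message.split())
        for representative, members in groups:
            if cosine_similarity(vector, representative) >= threshold:
                members.append(message)
                break
        else:
            groups.append((vector, [message]))
    return [members for _, members in groups]


alerts = [
    "payment service timeout on host <ID>",
    "payment service timeout on node <ID>",
    "disk full on host <ID>",
]
for group in group_similar(alerts):
    print(group)
```

The two payment alerts share five of their six words (similarity of about 0.83) and are grouped together, while the disk alert shares only three words with them (about 0.55) and starts its own group. Comparing against only the first member of each group keeps the sketch short; it is a simplification, not a recommendation.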

In practice, of course, it’s not always so simple. The flexibility of language means that the system quite often gets it wrong. This is why PagerDuty builds ample and powerful feedback systems into our products.

Improving results with human feedback

When end users provide feedback to the system, they are giving us new data points to help perfect the process. This is usually an acknowledgment that, yes, A and B look like they should be related, but the human context of the messages reveals that they don’t have much to do with each other.

PagerDuty’s feedback system gives greater weight to cases where messages were positively correlated because they share terms, but human feedback then revealed that they are not similar. This evaluation and modification could be accomplished in software through a very large reinforcement learning system, but for the user it is a simple evaluation of whether terms and messages should or should not be together.

Customers, of course, do not need to see the nuts and bolts of how this works. The customer service and IT teams should have simple tools to provide feedback that will delineate which terms don’t match.

On a higher level, PagerDuty’s feedback systems give users broad options for merging and separating groups of terms within the alerts. This is simply grabbing items and moving them in or out of a group, in essence indicating that certain items belong with each other but another does not match.

Another less sophisticated but equally powerful product may only require literal thumbs-up and thumbs-down buttons. The user essentially approves of a match or indicates a flaw in the process.
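A very simple way to picture how merge, separate, or thumbs-up and thumbs-down feedback could feed back into grouping is a store of human verdicts that override the text-similarity score. The sketch below is only that: the `FeedbackStore` class, its method names, and the 0.7 threshold are hypothetical, and a plain override table is a deliberately simpler stand-in for the learned weighting described above.

```python
class FeedbackStore:
    """Remember human verdicts about pairs of alert templates."""

    def __init__(self):
        # Maps an unordered pair of templates to True (belong together)
        # or False (do not belong together).
        self._verdicts = {}

    def record(self, first, second, belong_together):
        """Store a thumbs-up (True) or thumbs-down (False) for a pair."""
        self._verdicts[frozenset((first, second))] = belong_together

    def should_group(self, first, second, similarity, threshold=0.7):
        """Prefer a human verdict when one exists; otherwise fall back
        to comparing the similarity score against the threshold."""
        verdict = self._verdicts.get(frozenset((first, second)))
        if verdict is not None:
            return verdict
        return similarity >= threshold


store = FeedbackStore()
print(store.should_group("A", "B", similarity=0.9))  # True: score alone

store.record("A", "B", belong_together=False)        # user separates A and B
print(store.should_group("B", "A", similarity=0.9))  # False: human verdict wins
```

Because the pair is stored as an unordered set, feedback given on A and B also applies when the system later compares B with A.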

Anything can and will happen to frustrate and disappoint customers, as anyone who has worked in customer service will tell you. Improving your odds in those unpredictable circumstances requires learning, understanding and solving problems as quickly as they emerge. The foremost integrated event intelligence and incident response solutions in the space combine machine and human telemetry by looking at both digital signals and human response behavior.

Chris Bonnell is principal data scientist at PagerDuty. He holds a Ph.D. in Mathematics from the University of Illinois at Urbana–Champaign and was once offered the assistant managership of a Blockbuster Video.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content.

Copyright © 2020 IDG Communications, Inc.