How to make on-call work for your product engineering team

When your on-call process requires its own on-call, it’s time to start automating away


Few things in software are more exciting than a successful launch. It’s incredibly rewarding to see the fruits of your labor finally lift off to much fanfare on their way to conquering the world. However, the journey doesn’t simply end with a launch.

Software has a complex development life cycle that requires maintenance long after it is in the hands of the end users. This part of the life cycle is sometimes handled through a dedicated team that specializes in post-launch support and escalates more severe issues to the core engineering team as needed. However, it has become more common for companies and teams to instead use an on-call process for continued maintenance that draws from their core engineering team to both cut costs and improve the quality of support.

The LinkedIn Flagship app has made some of these very same choices and operates an on-call team that rotates weekly among engineers across the Flagship org. A year ago, this on-call process had several issues that hobbled the team’s effectiveness, sometimes even bleeding over to product pillar teams and causing widespread malaise. Our on-call process was itself in need of a rescue by its own on-call team to keep things running smoothly.

These problems eventually gave rise to Project Star*, a collaborative effort between engineers and managers from different parts of the organization to revitalize Flagship on-call and simply make it work. In this article, I briefly cover what it took to make this effort work and the key lessons we learned along the way. The Flagship organization is home to over 600 engineers, so many of the decisions we made were informed by the scale of the problems we were facing. However, our solutions are not confined to large teams and should be effective even at much smaller scale, albeit to varying degrees of success.

Implement an automated on-call enrollment process with notification support

Perhaps the most glaring issue we faced in this project was the gap in the scheduling process, which led to constant scrambles to find engineers at the last minute.

This simply came down to two key issues: the small pool of on-call-eligible engineers, and the lack of timely communication of upcoming on-call rotations.

The first issue was addressed by implementing a shadowing process that enlisted new engineers under experienced on-call engineers, so they could be quickly coached and minted as on-call engineers themselves. These shadowing candidates were identified through a regular inspection of the recent code contribution history of the different services that were under the purview of on-call.
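As a rough illustration of that inspection step, here is a minimal Python sketch. It is not the tooling we used; the function name, the commit tuple layout, the 90-day window, and the three-commit threshold are all assumptions made for the example.

```python
from collections import Counter
from datetime import date, timedelta


def shadow_candidates(commits, eligible, today, window_days=90, min_commits=3):
    """Rank engineers who recently contributed to on-call services but are
    not yet on-call eligible. All names and thresholds here are hypothetical.

    commits:  iterable of (author, service, committed_on) tuples
    eligible: set of engineers already in the on-call pool
    """
    cutoff = today - timedelta(days=window_days)
    counts = Counter(
        author
        for author, _service, committed_on in commits
        if committed_on >= cutoff and author not in eligible
    )
    # Most active contributors first; ties broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [author for author, n in ranked if n >= min_commits]


if __name__ == "__main__":
    today = date(2018, 6, 1)
    history = [
        ("dana", "feed", date(2018, 5, 20)),
        ("dana", "feed", date(2018, 5, 21)),
        ("dana", "messaging", date(2018, 5, 22)),
        ("eli", "feed", date(2018, 5, 25)),
    ]
    print(shadow_candidates(history, eligible={"eli"}, today=today))
```

Run on a schedule, a report like this gives the on-call owners a short list of engineers to pair with experienced on-call engineers for the next rotation.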

The communication issue was handled by scheduling engineers months ahead of their on-call shifts and then notifying them early and frequently of upcoming shifts. But the key improvement that gave us the win was automating both of these process changes, which eliminated the need for any manual intervention. Given the scale of this on-call, it’s unlikely that the proposed solutions would have survived the test of time without that automation. Following these changes, we saw the percentage of engineers dissatisfied with the scheduling process and communication drop from 36 percent to just 8 percent, a massive improvement of 28 percentage points.
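To make the idea concrete, here is a minimal Python sketch of scheduling weekly shifts well in advance and working out which reminders are due on a given day. It is only an illustration under stated assumptions: the round-robin assignment, the function names, and the 28/7/1-day reminder offsets are hypothetical, not a description of our actual system.

```python
from datetime import date, timedelta
from itertools import cycle


def build_schedule(engineers, first_monday, weeks):
    """Assign one engineer per weekly shift, round-robin (an assumed policy)."""
    rotation = cycle(engineers)
    return [(first_monday + timedelta(weeks=i), next(rotation)) for i in range(weeks)]


def reminder_dates(shift_start, lead_days=(28, 7, 1)):
    """Dates on which to remind an engineer; the offsets are hypothetical."""
    return [shift_start - timedelta(days=d) for d in lead_days]


def reminders_due(schedule, today, lead_days=(28, 7, 1)):
    """Return (engineer, shift_start) pairs that need a reminder today."""
    return [
        (engineer, start)
        for start, engineer in schedule
        if today in reminder_dates(start, lead_days)
    ]


if __name__ == "__main__":
    schedule = build_schedule(["ana", "bo", "cy"], date(2018, 1, 1), weeks=26)
    for engineer, start in reminders_due(schedule, today=date(2018, 1, 8)):
        print(f"Remind {engineer}: on-call starts {start.isoformat()}")
```

A daily job calling something like `reminders_due` and handing the result to whatever notification channel the team already uses is enough to remove the manual scramble; the hard part is keeping the engineer pool and the schedule itself up to date.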

Address accountability through automated feedback

The need for automation looms large in engineering teams because the cost of repetitive tasks is high, and it only increases as the team scales. Scheduling was not the only area where automation served us well. We also had to tackle the issue of accountability that arose from a misalignment in how engineers perceived the value of their on-call contributions.

Because these engineers were generally drawn from pillar teams, there was a sense that on-call was a thankless tour of duty that would not be factored into their overall performance evaluation. To clear this up, we required the on-call manager (also drawn every week from pillar teams) to submit written feedback on the on-call performance, thus making it an important element of their regular performance evaluations.

Similarly, we wanted accountability around the overall on-call process itself, so we could periodically improve it based on the feedback we collected from our on-call teams. To make these two feedback loops as frictionless as possible, we once again relied on automation to ensure that all the right surveys were sent to the right people at exactly the right time. With these changes, there was a drastic improvement of 75 percent in engineers who felt that their on-call performance was finally being factored into their overall evaluations.
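The two feedback loops can be sketched in a few lines of Python. This is a simplified illustration, not our implementation: the function name, the dictionary fields, and the choice to send everything on the day the shift ends are assumptions.

```python
from datetime import date, timedelta


def surveys_for_shift(shift_start, engineers, manager, shift_days=7):
    """Build the survey requests for one on-call shift (hypothetical format).

    Loop 1: the on-call manager writes performance feedback about each engineer.
    Loop 2: everyone on the shift gives feedback on the on-call process itself.
    """
    send_on = shift_start + timedelta(days=shift_days)
    performance = [
        {"to": manager, "about": engineer, "kind": "performance", "send_on": send_on}
        for engineer in engineers
    ]
    process = [
        {"to": person, "about": None, "kind": "process", "send_on": send_on}
        for person in [*engineers, manager]
    ]
    return performance + process


if __name__ == "__main__":
    for survey in surveys_for_shift(date(2018, 1, 1), ["ana", "bo"], manager="mo"):
        print(survey)
```

Generating the requests from the schedule, rather than relying on someone to remember them, is what keeps both loops running week after week.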

Clarify roles and responsibilities

Another pain point was that on-call engineers did not clearly understand their roles and responsibilities. This stemmed from both insufficient guidelines and a lack of proper on-call training. The latter was mostly addressed through the newly implemented shadowing process, but the insufficient guidelines came down to a few separate issues.

First, the roles and responsibilities had to be more crisply articulated, and we invested time in putting those in a format that was easier to quickly parse and digest for on-call engineers, especially given the rotational nature of the role. Fortunately for us, this process also exposed areas in the on-call process that could not be easily addressed by on-call engineers and which needed support from partner teams, such as Tools or Productivity.

Understanding this and setting the right expectations with these partner teams was critical to the on-call team’s ability to be effective by leveraging others when needed. This led to a 12 percent drop in engineers who were dissatisfied with partner teams and ensured that over 92 percent of engineers had full clarity on their roles.

Project Star* lasted for several months. During that time, we collected initial feedback on the gaps, collaborated with different organizations, went on road shows to socialize our proposals, and iterated on our solutions until they were just right.

We took away many lessons from this journey, but the three that stand out are the incredible benefits of automating as much as possible, the importance of clearly identifying responsibilities and on-call partners, and the need for two-way accountability between the process and the engineers. No two on-call processes will be the same, but many share the same problems, so hopefully some of our learnings will come in handy if you ever have to go down the same road.

Copyright © 2018 IDG Communications, Inc.
