LinkedIn learns to do devops right

Bruno Connelly, vice president of engineering at LinkedIn, describes how transforming operations gave rise to a new, hyperscale Internet platform

Bruno Connelly is not a fan of the term devops, mainly because it means different things to different people.

In certain startups, for example, devops simply means that developers shoulder tasks once performed by operations. But at LinkedIn, where as VP of engineering Connelly has led the company's site reliability efforts for five and a half years, operations has expanded its role to become more vital than ever while providing developers with the self-service tools they need to be more productive.

You might call that devops done right. In fact, Connelly's buildout of operations holds valuable lessons for any organization that needs to scale its Internet business. For LinkedIn, that growth has been dramatic: Over the past five years, the service has ballooned from around 80 million to nearly 400 million users -- and from basic business social networking to a wide array of messaging, job seeking, and training services.

Throughout that expansion, Connelly has played a key role in creating new sets of best practices and infrastructure-related technologies. More importantly, he has helped lead a transformation of operations culture that has affected the entire company.

A shaky situation

When Connelly joined LinkedIn in 2010, both traffic and the brand were taking off -- and the site was creaking under the load. "We struggled just keeping the site up. I spent my first six months, maybe a year, at LinkedIn being awake and on a keyboard with a bunch of folks during those periods trying to get portions, if not all, of the site back up."

The team he inherited was great, he says, but there were only six or seven of them, as opposed to a couple of hundred software engineers writing code constantly. "I was hired at LinkedIn specifically to scale the product, to take us from one data center to multiple data centers, but also to lead the cultural transition of the operations team," he says.

As with many enterprise dev shops today, developers had no access to production -- nor even to nonproduction environments without chasing down ops first. “The cynical interpretation is that operations’ job was to keep developers from breaking production,” Connelly says. Essentially, new versions of the entire site were deployed every two weeks using a branch-based model. “People would try to get all their branches merged. We’d get as much together as we could. If you missed the train, you missed the train. You had to wait two weeks.”

Adding to the frustration were the site rollouts themselves, which Connelly remembers as “an eight-hour process. Everyone was on deck to get it out there.” At a certain point in that process, rollback was impossible, so problems needed to be fixed in production. At the same time, the site ops team had to maintain the nonproduction environment “just to keep that release train going, which is not a healthy thing.”

Redefining roles

Change came from the top, driven by David Henke, LinkedIn’s then-head of operations, and Kevin Scott, who was brought in from Google in 2011 to run software engineering. Connelly reported to Henke and was charged with changing the role of operations.

The first priority across the company was to stop the bleeding and get everyone to agree that site reliability trumped everything else, including new product features.

Along with that imperative came a plan to make operations “engineering focused.” Instead of being stuck in a reactive, break-fix role, operations would take charge of building the automation, instrumentation, and monitoring necessary to create a hyperscale Internet platform.

Operations people would also need to be coders, which dramatically changed hiring practices. The language of choice was Python -- for building everything from systems-level automation to a wide and varied array of homegrown monitoring and alerting tools. The title SRE (site reliability engineer) was created to reflect the new skillset.

Many of these new tools were created to enable self-service for developers. Today, not only can developers provision their own dev and test environments, but there’s also an automated process by which new applications or services can be nominated to the live site. Using the monitoring tools, developers can see how their code is performing in production -- but they need to do their part, too. As Connelly puts it:

Monitoring is not something where you talk to operations and say: “Hey, please set up monitoring on X for me.” You should instrument the hell out of your code because you know your code better than anyone else. You should take that instrumentation, have a self-service platform with APIs around it where you can get data in and out, and set up your own visualization.
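To make the idea of developer-owned instrumentation concrete, here is a minimal Python sketch. It is not LinkedIn's tooling: the `instrumented` decorator, the in-memory `METRICS` store, the `snapshot` helper, and the `lookup_profile` function are all hypothetical names invented for illustration, and a real setup would send these numbers to a monitoring service through its API rather than keep them in process memory.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process metrics store (an assumption for this sketch).
# A real platform would forward these values to a monitoring backend.
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(name):
    """Record call count, error count, and cumulative latency under `name`."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                METRICS[name]["errors"] += 1
                raise
            finally:
                # Runs on both success and failure, so every call is counted.
                METRICS[name]["calls"] += 1
                METRICS[name]["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


@instrumented("profile.lookup")
def lookup_profile(member_id):
    """Hypothetical application function owned by a developer."""
    if member_id < 0:
        raise ValueError("member_id must be non-negative")
    return {"id": member_id}


def snapshot():
    """Return a plain-dict copy of the metrics, e.g. for a dashboard to read."""
    return {name: dict(values) for name, values in METRICS.items()}
```

The point of the sketch is the division of labor: the developer who knows the code decides what to measure, and the platform only has to offer a way to get the data in and out.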

On the development side, Connelly says that Scott established an “ownership model and ownership culture.” All too often, developers build what they’re told to build and hand it off to production, at which point operations takes on all responsibility. In the ownership model, developers retain responsibility for what they’ve created -- improving code already in production as needed. Pride in software craftsmanship became an important part of the ethos at LinkedIn.

Building together

Altogether, a great deal of self-service automation has been put into place. I asked whether some engineers on the operations side feared they were automating themselves out of a job. Connelly’s answer was instructive:

My personal opinion is that is absolutely the right goal. We should be automating ourselves out of a job. In my experience, though, that never happens -- it’s an unreachable goal. That’s point one ... point two is there’s a lot of other stuff that SREs do, especially what we call embedded SREs. They are part of product teams; they are involved with the design of new applications and infrastructure from the ground up so they are contributing to the actual design. “Hey, there should be a cache here, this should fail this way ...”

Meanwhile, the monitoring, alerting, and instrumentation have grown more sophisticated. To ensure high availability, operations has written software to simulate data center failures multiple times per week and measure the effects. "We built a platform last year called Nurse, which is basically a workflow engine, where you can define a set of automated steps to do what we associate with a failure scenario," Connelly says. He says he is currently building a self-service escalation system with functionality similar to that of PagerDuty.
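The article gives no detail on how Nurse works internally, so the following Python sketch shows only the general idea of a workflow engine: an ordered set of automated steps tied to a failure scenario, run in sequence and halted at the first step that fails. Every name here (`Step`, `Workflow`, `drain_traffic`, `restart_service`, `verify_health`) is hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    """One automated action; `action` returns True on success."""
    name: str
    action: Callable[[Dict], bool]


@dataclass
class Workflow:
    """An ordered list of steps associated with one failure scenario."""
    name: str
    steps: List[Step] = field(default_factory=list)

    def run(self, context: Dict) -> List[str]:
        """Run steps in order, stopping at the first failure; return a log."""
        log = []
        for step in self.steps:
            ok = step.action(context)
            log.append(f"{step.name}: {'ok' if ok else 'failed'}")
            if not ok:
                break  # Leave remaining steps for a human to handle.
        return log


# Hypothetical remediation steps that only mutate a shared context dict.
def drain_traffic(ctx: Dict) -> bool:
    ctx["traffic"] = "drained"
    return True


def restart_service(ctx: Dict) -> bool:
    ctx["restarts"] = ctx.get("restarts", 0) + 1
    return True


def verify_health(ctx: Dict) -> bool:
    # Succeeds only if the service was restarted at least once.
    return ctx.get("restarts", 0) > 0


unhealthy_host = Workflow(
    name="unhealthy-host",
    steps=[
        Step("drain_traffic", drain_traffic),
        Step("restart_service", restart_service),
        Step("verify_health", verify_health),
    ],
)
```

Calling `unhealthy_host.run({})` executes the three steps in order and returns one log line per step, which is roughly what an on-call engineer would otherwise do by hand.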

The most important lesson from LinkedIn's journey is that the old divisions between development and operations become showstoppers at Internet scale. Developers need to be empowered through self-service tools, and operations needs a seat at the table as applications or services are being developed -- to ensure reliability and to inform the creation of appropriate tooling. Call it devops if you like, but anything less and you could find yourself on shaky ground.

Copyright © 2015 IDG Communications, Inc.