Real-world devops failures -- and how to avoid them

To deliver on the promise of devops, heed these hard-earned lessons of devops gone wrong

Everything about devops sounds great. It's a practice that emphasizes collaboration and communication between software developers and other IT staffers and management, while automating tasks such as software delivery and infrastructure updates.

With devops, the development, testing, and release of software can be accelerated and made more reliable, and that's vital for companies looking to survive in an ultracompetitive market.

There are plenty of examples of how devops works well and delivers tangible improvements for companies in a variety of industries. But sometimes it doesn't work well. Things can go wrong with devops just as they can with any other aspect of IT.

Following are some examples of devops initiatives that failed on at least some level and what the organizations involved did to address the problems or prevent them from happening again.

Lack of a project vision

IBM began what would become the company's foray into devops in 2003 -- a few years before the term was even coined -- when it launched an agile software development initiative for one of its new products. The company invested in agile, a set of principles for software development that encourages rapid and flexible response to change, because it wanted to speed up its software releases to customers.

It was a less-than-successful endeavor. "The problem with agile is it only takes you so far," says Mustafa Kapadia, North American cloud and devops service line leader for Global Business Services at IBM. "The development side was really fast but operations was slow to respond, so it didn't really matter. Customers didn't get products faster."

The company, as part of a move into devops, then decided to automate the deployment of code in addition to adhering to the agile methodology. But that didn't make the software delivery cycle faster either. IBM conducted a "value chain analysis," and found that the biggest impediment wasn't agile or automation, but the overall development and operational environment. Even with these various efforts to speed up development of the product, there was still too much lag time in the completion of the project.

Ultimately, IBM's devops debacle was due to a lack of vision by those putting these efforts into place, Kapadia says. "We needed to answer some basic questions and determine the problems we were trying to solve. That's where we failed," he says. "If you don't know how the work is actually done, you don't know which problems are worth solving. We were grasping at [imaginary] problems that came from vendor hype, not from seeing what was really slowing us down."

Once managers gained a better understanding of workflows and where processes were being slowed, they were able to make changes and get true value out of devops.

Too much accessibility -- not enough education

Back in 2006, when professional content sharing website SlideShare (now part of LinkedIn) was a small startup with fewer than 20 employees, it launched a devops model to speed processes and stay ahead of its competition.

"The [development] team was actually split between San Francisco and New Delhi, and the infrastructure was quite complicated," says Sylvain Kalache, co-founder of Holberton School, an institution that trains software engineers, who worked at SlideShare at the time.

The goals of devops were to achieve maximum efficiency within the engineering team and to spread technical knowledge as much as possible, so that if someone went on vacation or left the company, there would be limited impact.

"Working in a devops environment pushes every contributor to work and contribute to different parts of the product," Kalache says. "Having a cohesive team is super important, and this happens by making people interact and help each other."

One of the main ideas behind devops is a greater sense of ownership of work responsibilities, "and for that you need to give access to part of the infrastructure that developers do not generally have access to," Kalache says. At SlideShare, engineers had access to production servers and production databases.

A software engineer was working on a database-related project and trying out a tool that offered the ability to explore a MySQL database graphically. "He decided to reorganize the database columns' order in that tool so that the data would make more sense to him," Kalache says. "What he did not know was that it was also actually changing the columns' order in production on the actual database, locking it, which brought down"

When it happened, the person responsible did not realize that the tool was actually performing actions. It took 15 minutes of collective effort to figure out the source of the problem.

"There were two takeaways from this failure," Kalache says. "First, while devops is pushing for everyone to have an impact on any step of the product/service cycle, [it's] good practice to take a step back every time you give access to something and make sure it is actually valuable. In this specific situation of the database outage, we realized that giving access to production data was actually not useful at all and was very dangerous. The developer could have extracted the same exact value by using a staging database, but with a much more minor impact on the company."

The second takeaway is to better educate developers on the workings of infrastructure. "Many of them have never been exposed to production infrastructure," Kalache says. "Devops is based on a way of working, which obviously is more about human interaction. You can't expect everyone to naturally know 'the hidden rules.' That's why onboarding is mandatory and critical."

Insufficient devops coverage

Sometimes the failure comes from the way devops is applied to a particular project.

A company involved in lease originations for vehicles has a large number of partners scattered across the United States. Any customers that enter a partner location and want to lease vehicles will have their information and request processed through a custom application. A large part of this information has to be verified through third-party services, since this is a financial transaction and none of the financial companies involved want to be stuck holding a bad lease.

"The devops setup for this software is focused around server metrics, primarily response times and breakdowns for various requests, along with deployment statistics and automation," says Nathaniel Rowe, a software consultant who worked with the lease origination company, which he declined to identify.

"A few weeks back, we had what amounted to a total system outage due to a hole in the monitoring," Rowe says. "A necessary third-party validation service had a network outage that brought their entire infrastructure down."

This shouldn't have been a problem, Rowe says. But due to the initial subpar construction of the software -- which was offshored for a bargain rate -- all the lease submissions processes were tightly linked to the service that went down. "In a company like this, that means the money stops flowing," he says.

The issue was a lack of complete devops coverage, because of a reliance on system metrics rather than adding in active monitoring of outside resources that were necessary for operations to continue. "That was a low-visibility hole in our coverage, which was masked by the fact that 99 percent of issues are explicitly code-based problems rather than due to outside interference," Rowe says.
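
As a rough illustration of what active monitoring of an outside resource could look like, the sketch below probes a dependency on a schedule and raises an alert after several consecutive failures. It is not the company's actual setup: the class name `DependencyMonitor`, the `alert` hook, and the failure threshold are all assumptions made for the example.

```python
import socket
from typing import Callable


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False


class DependencyMonitor:
    """Hypothetical monitor for one external service the application depends on."""

    def __init__(
        self,
        name: str,
        probe: Callable[[], bool],      # any zero-argument health check
        alert: Callable[[str], None],   # assumed alerting hook (pager, chat, email)
        threshold: int = 3,             # consecutive failures before alerting
    ) -> None:
        self.name = name
        self.probe = probe
        self.alert = alert
        self.threshold = threshold
        self.failures = 0

    def check(self) -> bool:
        """Run one probe; alert once when the failure streak reaches the threshold."""
        if self.probe():
            self.failures = 0
            return True
        self.failures += 1
        if self.failures == self.threshold:
            self.alert(f"{self.name} failed {self.threshold} consecutive checks")
        return False
```

In practice a scheduler would call `check()` every minute or so, with `probe` set to something like `lambda: tcp_probe("validation.example.com", 443)`, where the host name is a placeholder. Alerting only when the streak first reaches the threshold avoids flooding the team during a long outage.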

Once the outage became known, the development team jumped in and decoupled the particular validation code and inserted procedures to bypass it, which allowed the company's partners to save the information they had entered into the system.

"We identified the root cause by contacting the service provider and receiving the information from them about what happened," Rowe says. "To safeguard against this in the future, any time a network failure like that occurs, a global setting is triggered to reroute the submission process to save successfully and notify partners that the corresponding service is down."

A major benefit of this failure was that time and money are now dedicated to patching these holes in monitoring and automatic recovery for other weak spots in the system, Rowe says.
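
The safeguard Rowe describes, a global setting that reroutes submissions so they still save and partners are told the service is down, might be sketched as follows. This is an illustration under stated assumptions rather than the consultant's code: `SubmissionRouter`, `ValidationServiceDown`, and the `validate` and `notify_partners` hooks are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class ValidationServiceDown(Exception):
    """Raised by the (hypothetical) validator when the third party is unreachable."""


@dataclass
class SubmissionRouter:
    validate: Callable[[dict], bool]         # assumed call to the third-party service
    notify_partners: Callable[[str], None]   # assumed notification hook
    bypass_validation: bool = False          # the "global setting"
    saved: List[dict] = field(default_factory=list)

    def submit(self, lease: dict) -> str:
        """Save a lease submission, validating it when the service is reachable."""
        if not self.bypass_validation:
            try:
                ok = self.validate(lease)
            except ValidationServiceDown:
                # Flip the global setting and tell partners once.
                self.bypass_validation = True
                self.notify_partners(
                    "Validation service is down; submissions are saved for later review."
                )
            else:
                status = "validated" if ok else "rejected"
                self.saved.append({**lease, "status": status})
                return status
        # Bypass path: keep what the partner entered so nothing is lost.
        self.saved.append({**lease, "status": "pending_validation"})
        return "pending_validation"
```

The key design choice is that a failure of the outside service changes where a submission goes, not whether it is saved. A fuller version would also need a way to reset the setting and validate the pending submissions once the service recovers.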

Forgetting about people and process

When Brian Dawson, now devops evangelist at CloudBees, was working as a process consultant for a vendor on a contract with a U.S. government agency several years ago, he had one of his first experiences with devops. It was not a good one.

The agency was launching an important project to build a web application. "As the vendor responsible for the ALM [application lifecycle management] process, we set out to establish tooling and processes covering definition and planning, code and commit, and build and release, all done in a collaborative, open source-inspired manner," Dawson says. 

The deployment and configuration of the supporting devops tooling was successful, Dawson says. "Unfortunately, devops cannot be implemented strictly with tools alone," he warns. "Devops requires equal attention to people or culture, process, and tools."

The project involved multiple teams on a tight, fixed deadline, leading management to seek the quick fix and focus primarily on the tools platform. "We were able to build a platform which included robust agile planning tools, a modern SCM [software configuration management], and Jenkins for continuous integration all deployed on a somewhat elastic, scalable platform."

However, the agency largely ignored the people and process portion of devops, and failed to gain the buy-in from developers and other stakeholders that was needed to build a devops strategy that would actually be put to use.

"This meant that though we had a ‘devops platform' in place, it was effectively used to support the same old legacy practices," Dawson says. "Developers deferred commits, merges, and integration; automated QA [quality assurance] and release were never fully implemented; broken builds were no big deal, and production loads in production-like environments were never tested."

When the client released the web application, it immediately experienced critical and very public failures, as it hadn't been regularly tested in a production environment or by real users. In addition, once the problems became apparent, it took the agency multiple, multi-week development cycles to fix the issues and get the site operational. The slow response times served to amplify the impact of the initial failures.

The technical issues were fixed in a few months, but fixing the root cause -- including bringing in clear owners of the project to ensure that the process and cultural facets of devops were addressed -- was multi-faceted and spanned many more months, Dawson says.

Only then was the agency "able to properly and fully implement devops on all the planes of people, process, and tools," Dawson says.

Devops no doubt offers great promise in accelerating your software delivery cycles, but it's up to you and your team to deliver on that promise with a cohesive devops culture and sound devops practices.

Copyright © 2016 IDG Communications, Inc.
