Combining operations and development: A view from the trenches

Combining development and operations teams is not for the faint of heart. Learn how others have endured the transition into an efficient devops organization.

Over the past few years, extending agile development teams into a devops delivery train has become the de facto next-gen process discussed at tech conferences across the globe. To be fair, the definitions and recipes have been well documented, and there’s zero doubt that “agile ops” will become the de facto replacement for innovation centers as cloud-based apps become ubiquitous.

In fact, the FAAMG (previously FANG) companies relish describing their ultra-efficient delivery mechanisms, with the implication that devops is easy-peasy. But beyond the textbooks and unicorns, what does it really take to make the transformation? In the following paragraphs, I discuss some real-world experiences in evolving from a highly successful agile engineering team to an agile + IT operations organization.

A seat at the table

It’s no surprise that the ops team often ends up at the tail end of a lengthy list of decisions made in advance of the packaged delivery. While an agile team might brag about an awesome continuous integration (CI) system, continuous delivery (CD) adds a new dimension in which ops can’t succeed if it receives information only after the fact. Integrating ops with a traditional development org has several facets that require negotiation and complete transparency for a successful merger. To reveal the inner workings of each group, each team needs exposure to the other’s backlog, which should lead to combined iteration planning. The result is a unified backlog covering all engineering activities. Admittedly, this process may be jarring as duplicate efforts come to light and other prioritization realities are given proper attention.

To achieve maximum efficiency, a common tool chain must also be established. Even though certain tools may have been used in isolation in the past, agreeing on shared tools so the team can scale outweighs any minor efficiencies gained by a few individuals. Tools that fall into this category are source control, automation stacks, defect tracking, and potentially IDEs, if warranted. In a similar vein, working processes will need to be normalized because code contributions from both groups will now happen across a single backlog. Common peer review processes need to be established, as well as processes for scanning code for third-party libraries, security holes, etc. Consensus building will cause some amount of organizational angst, but overcommunication along with a focus on the longer-term goals will help alleviate some friction. Again, it’s quite likely that strong opinions from opposing sides will be voiced, but these growing pains are a necessary part of the effort to streamline the code-to-production value chain.

Choosing a cloud

One of the largest decisions in moving towards an automated delivery mechanism is choosing which cloud provider will become the future foundation. A well-researched technical decision coupled with strategic leadership influence should help settle any debates about why and how; however, the decision is often colored by varying levels of organizational politics. Inertia from leveraging existing cloud services can be in direct conflict with the time-to-market demands of using an alternative cloud provider. Stakeholder requirements from all groups have to be considered, and sales staff and architects from potential cloud vendors should be consulted during the decision phase. Beyond the costs and technical comparisons, choosing a vendor that will act as a true partner is more than a tie-breaker, as it has the potential to elevate a new product beyond sheer functionality alone.

Regardless of the vendor chosen, an important lesson from the trenches is to leverage the unique abilities of the cloud vendor, including serverless technologies, deployment automation, and management. This suggestion is in complete opposition to the notion of trying to keep the software platform-agnostic with the goal of easily operating across multiple cloud vendors. While some vendor lock-in will occur by using cloud-specific functionality, the time savings from not having to self-manage clusters for generic code or maintain two divergent code sets will far outweigh that cost. Granted, the need for running across multiple cloud vendors is rare and should be carefully weighed against the best practice recommendation to start with one vendor, manage to success, and consider a second vendor only afterward.

Site reliability

Google’s Site Reliability Engineering (SRE) has now become a de facto method for leveraging the strengths of an operations team combined with a development team for the sole purpose of balancing reliability and innovation for a web-based service. Understanding which service-level indicators (SLIs) are necessary to achieve certain service-level objectives (SLOs) is critical in the early days of a service’s life, because it is often much more difficult to instrument after the fact. Determining what factors end users care about can be a delicate process, as there is a temptation to focus on too many metrics at once. Availability and user latency are often two favorites that jump to the top of any discussion, and the art of determining an appropriate SLO lies in translating user expectations into something the organization can measure and act on. To delve into the mechanics of this process further, having ops and development read Google’s free book on the topic is a must.
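To make the SLI/SLO relationship concrete, here is a minimal sketch in Python of an availability check. It assumes the SLI is simply the fraction of successful requests in a measurement window and that the SLO is a target fraction such as 99.9 percent. The names `SloReport` and `availability_report` are hypothetical, chosen for illustration only, and the request counts are made up.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # measured fraction of good requests in the window
    slo: float               # target fraction of good requests
    budget_remaining: float  # share of the error budget still unspent (negative if overspent)
    met: bool                # True when the measured SLI meets or beats the SLO


def availability_report(good: int, total: int, slo: float) -> SloReport:
    """Compare a request-based availability SLI against an SLO.

    Assumption: 'good' and 'total' are request counts for one window,
    and 'slo' is a fraction strictly between 0 and 1 (e.g. 0.999).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= good <= total:
        raise ValueError("good must be between 0 and total")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")

    sli = good / total
    allowed_bad = (1.0 - slo) * total  # failures the SLO tolerates in this window
    actual_bad = total - good
    budget_remaining = 1.0 - actual_bad / allowed_bad
    return SloReport(sli=sli, slo=slo, budget_remaining=budget_remaining, met=sli >= slo)


# Illustrative numbers only: 1,000,000 requests, 500 failures, 99.9% target.
report = availability_report(good=999_500, total=1_000_000, slo=0.999)
print(report)
```

With these illustrative numbers the service meets its target and has spent roughly half of its error budget, which is the kind of single figure both ops and development can rally around when deciding whether to ship or stabilize.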

Tracking service repair times and conducting regular retrospectives for outages enable the organization to grow with less finger-pointing and to focus more on how to prevent future problems. While Google has established a separate SRE team to act as a neutral agent across departments, this luxury may not be feasible for smaller organizations. In these cases, teams should establish a weekly cross-department meeting to discuss reliability, quality assurance, and the process improvements needed to keep the service healthy. Even removing the stigma of legacy-based titles can help the transition, including the use of “Engineering” in titles as opposed to “QA” or “Development.” Again, the tension between rapidly providing customer value and keeping the service reliable will make these conversations challenging at times, but rallying around SLOs will keep the teams focused on the bigger prize: customers.
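Tracking repair times can start very simply. The sketch below, again in Python, computes a mean time to repair from a list of outage records. It assumes each outage is recorded as a pair of detection and resolution timestamps; the function name `mean_time_to_repair` and the sample outages are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(outages: list[tuple[datetime, datetime]]) -> timedelta:
    """Average the duration of outages given as (detected_at, resolved_at) pairs."""
    if not outages:
        raise ValueError("at least one outage is required")
    total = timedelta()
    for detected_at, resolved_at in outages:
        if resolved_at < detected_at:
            raise ValueError("an outage cannot be resolved before it is detected")
        total += resolved_at - detected_at
    return total / len(outages)


# Made-up outages for illustration: one lasting 30 minutes, one lasting 90 minutes.
sample_outages = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 30)),
    (datetime(2024, 1, 19, 14, 0), datetime(2024, 1, 19, 15, 30)),
]
print(mean_time_to_repair(sample_outages))
```

A figure like this is most useful as a trend reviewed in the weekly cross-department meeting rather than as a score for any one team.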

Customers for the win

The melding of two departments that previously behaved in a throw-it-over-the-wall manner is a challenging but fruitful achievement, both in terms of sustained camaraderie and customer satisfaction. Paradigm shifts for both development and operations will result in a squad of people who exchange best practices and discover they have much more in common than they previously realized. Once an organization has chosen a cloud-based future, every decision, from how to review code all the way to which cloud vendor to choose, will require more consensus building than ever before. Leveraging practices such as SRE, which are used by the most dependable web companies in the world, will help steer teams towards predictable outcomes with the least amount of internal abrasion. Ultimately, the breakdown of internal partitions will result in a modern dev-to-ops value chain that gives users a service they can trust while improving your bottom line.

This article is published as part of the IDG Contributor Network.