Don’t let Netflix envy cloud your devops judgment

In the search for rapid deployment, think twice before following in Silicon Valley's footsteps


There is a lot of talk about devops these days, particularly how it’s a Trojan horse for slower IT shops that have struggled to digitize at a fast-enough pace. But at what cost do you place speed over upholding the integrity and soundness of your software?

I recently spoke on a panel at the MIT CIO Symposium called Running IT Like a Factory. One of my co-panelists, the CIO of a major bank, talked a lot about cloud-native companies, and how Netflix does 3,000 releases per month and Amazon does 11,000 releases per year. He also referenced the robustness of AWS and how companies like this can create a ton of value very quickly.

Netflix and Amazon both have a market cap that’s the envy of any large enterprise. Feedback from the media and analysts isn’t helping. By comparison, many enterprise executives are made to feel that the work their IT organization is doing just doesn’t measure up to what Silicon Valley startups can build with small teams of pimply teenagers in their garages.

This is unfortunate. 

Whether Amazon is releasing apps at a rate of 11 per hour or 11,000 per year, the message to IT pros is the same: Theirs goes to 11. Yours probably only goes up to three.

Somehow release rate and cycle time seem to have become the vogue metrics for application development. If your cycle time is nice and short, you’re like the cloud-native Silicon Valley geniuses. This is what we, in the measurement world, call vanity metrics. But, since Wall Street values cloud-native startups on a different scale than Fortune 100 enterprises that generate revenue, we might as well turn our app dev metrics upside down. 

How could the number of releases be a relevant measure for anything? What if I screw something up five times a day and have to re-release it with that same frequency? How can I compare that to a release of a critical application that has to be synchronized with three dozen projects across the enterprise?

These are the questions devops teams must ask themselves before they jump in feet first.
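To make the point concrete, here is a minimal sketch of how a raw release count can flatter a team that keeps re-releasing its own mistakes. Everything in it is hypothetical and for illustration only: the `Release` record, the `is_fix_for_previous` flag, and the two sample teams are assumptions, not data from any real shop.

```python
from dataclasses import dataclass


@dataclass
class Release:
    name: str
    # Hypothetical flag: True when the release only repairs the one before it.
    is_fix_for_previous: bool


def raw_release_count(releases):
    """The vanity metric: every push to production counts."""
    return len(releases)


def net_release_count(releases):
    """Count only releases that were not re-releases of a broken one."""
    return sum(1 for r in releases if not r.is_fix_for_previous)


# Team A ships one feature, breaks it, and re-releases it five times in a day.
team_a = [Release("feature-x", False)] + [
    Release(f"feature-x-fix-{i}", True) for i in range(1, 6)
]

# Team B ships one coordinated release of a critical application.
team_b = [Release("critical-app-release", False)]

print(raw_release_count(team_a), net_release_count(team_a))  # 6 1
print(raw_release_count(team_b), net_release_count(team_b))  # 1 1
```

By the raw count, Team A looks six times as productive as Team B. Once the re-releases are set aside, the two are even, and that still says nothing about how much coordination Team B's single release required.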

According to IDC research, the average annual cost of unplanned application downtime is between $1.25 billion and $2.5 billion for the Fortune 1000, and the average cost of a critical application failure is between $500,000 and $1 million per hour. Just one hour of downtime can cost you a million bucks! And these numbers don’t even account for the sheer panic and the amount of rework required of development teams to fix these critical production issues.
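As a back-of-the-envelope sketch, the hourly range quoted from IDC can be scaled to an outage of any length. The function name `downtime_cost_range` and the four-hour outage are assumptions chosen for illustration; only the $500,000 and $1 million hourly bounds come from the figures above.

```python
def downtime_cost_range(hours, low_per_hour=500_000, high_per_hour=1_000_000):
    """Return the (low, high) cost in dollars of a critical application outage.

    The default hourly bounds are the IDC range quoted in the article;
    the linear scaling by hours is a simplifying assumption.
    """
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return hours * low_per_hour, hours * high_per_hour


# Hypothetical example: a four-hour outage of a critical application.
low, high = downtime_cost_range(4)
print(f"${low:,.0f} to ${high:,.0f}")  # $2,000,000 to $4,000,000
```

Real outages rarely scale this neatly, but even the crude version shows how quickly a bad release can wipe out whatever was gained by shipping it a day sooner.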

I’ve read that it’s not uncommon for post-release help desk ticket volume to account for more than 50 percent of an organization’s annual total of tickets.

One of the biggest fallacies about cloud-native companies is that their systems actually require any significant level of resilience. Or that they have a truly significant level of software complexity. Netflix and Amazon don’t have anything near the mission-critical systems of, say, a bank. You might object that the Amazon e-commerce site is mission-critical. Sure. It’s pretty critical. But if they add a new feature and it makes you lose your order, or sends you the wrong item, then Amazon will just make good by giving you that item for free. You’ll be happy and Amazon will roll back and fix its bug. And that’s the worst case. In most cases, the site will just have yet another glitch that we won’t even notice, given the level of glitchiness we’ve come to expect from web-native apps and services. And don’t even get me started on the mission-criticality of Netflix’s systems.

If you’re a bank and you screw up a transaction, that will be noticed. If you do it enough times, it will be reported by the Wall Street Journal. Your regulators will breathe fire on you and your customers would rather hide their money in their mattresses. If you’re an airline, you’ll cause lines throughout the airport, scores of irate Twitter streams, and major operational and brand cost. Same issues for telcos, utilities, and logistics companies: the stakes are just a bit higher, I think.

The point is that we can’t blindly follow the same metrics in all scenarios, and we can’t compare apples to oranges. Cycle time and release rate metrics do tell us something of relevance, but it’s far from the whole story—even when considering throughput. And rolling out canary releases while tracking MTTR for your fast fails is nice, but it doesn’t fit every business scenario.

We’ve been seeing Netflix and Amazon both showing up at architecture and app dev events, looking for solutions to deal with their growing system complexity. It’s nice to be lightweight and greenfield. Everyone gets to be young once. But eventually—and usually it’s when you start to actually generate revenue—your technical debt catches up and things stop being so simple and carefree. This is when the real work of managing your technology landscape begins.

Our cloud-native friends are starting to come full circle. At some point, they too will look like legacy. Let’s hope they are building robustness and changeability into their code bases. Otherwise, they too will suffer from next-gen envy.

This article is published as part of the IDG Contributor Network.