Ignite 2021: Azure’s Chaos Studio goes public

Azure CTO Mark Russinovich explains how Microsoft is bringing chaos engineering tools to its Azure customers.

Back at the beginning of 2021, Azure CTO Mark Russinovich’s regular Ignite run-through of the Azure architecture gave us a first look at Chaos Studio, the platform’s fault injection tool. Building on the Chaos Monkey concept introduced by Netflix, the growing discipline of chaos engineering is focused on helping developers understand what happens to cloud-scale applications when they fail.

Now, with 2021’s second Ignite opening its digital doors, Microsoft is unveiling the first public preview of Chaos Studio as part of its push to deliver better and more resilient cloud applications. I had the opportunity to talk to Mark Russinovich in advance of the preview’s launch about Azure’s approach to chaos engineering and how he sees developers taking advantage of these technologies.

Adding chaos to Azure

Chaos engineering in Azure isn’t new. As Russinovich says, “We’ve been doing chaos engineering [in] Azure since pretty close to the start. It’s been a lot of homegrown chaos.” But as the service has grown, what began as tooling unique to specific teams has had to become something that works for everyone building on and in Azure. He says, “Over the last few years, we’ve realized, ‘hey, we should consolidate these efforts in chaos engineering into a common tool, a common framework service that we can apply across our services.’ ”

That common tool was the basis for Chaos Studio, and although it began life as an internal tool, Russinovich points out that it was always intended to become customer-facing. What customers need might not be what Microsoft needs, but the lessons they learn could help make Azure better for all its users, inside and outside Redmond. “We think, besides customers having the benefits of a service that’s operating for them, we can grow an ecosystem to have on top of this with customers. The extensibility they bring produces fault injections that we can then leverage across the ecosystem and even internally,” he says. 

Figure: Azure Chaos Studio overview. (Image: IDG)

Introducing Chaos Studio

Chaos Studio is a tool that lets developers and testers script fault injections into running systems, starting with failing virtual machines and then offering more detailed, lower-level faults, including CPU and memory stress. Faults are either agent-based, which require a Chaos Studio agent as part of a VM build (both for Windows and Linux), or service-direct. Once the agent and any prerequisites are installed, you can use Chaos Studio to choose the type of test to run and how to run it. For example, if you’re stress testing the CPU, you first define how long you want to add CPU pressure and how much pressure you want to add. 
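
As a rough sketch of the two settings mentioned above, a CPU stress test can be described by a duration and a pressure level. The class name, field names, and dictionary keys below are hypothetical, chosen for illustration; they are not Chaos Studio's actual experiment schema, which should be taken from Microsoft's documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuPressureTest:
    """Hypothetical description of a CPU stress test (not a Chaos Studio type)."""

    duration_minutes: int  # how long to apply CPU pressure
    pressure_percent: int  # how much CPU load to add, from 1 to 100

    def __post_init__(self) -> None:
        # Reject settings that would make the test meaningless.
        if self.duration_minutes <= 0:
            raise ValueError("duration_minutes must be positive")
        if not 1 <= self.pressure_percent <= 100:
            raise ValueError("pressure_percent must be between 1 and 100")

    def iso_duration(self) -> str:
        """Render the duration as an ISO 8601 string such as 'PT10M'."""
        return f"PT{self.duration_minutes}M"

    def to_dict(self) -> dict:
        """Plain-dict form, convenient for serializing to JSON."""
        return {
            "fault": "cpu-pressure",  # illustrative label only
            "duration": self.iso_duration(),
            "pressureLevel": self.pressure_percent,
        }
```

Validating the two values up front means a mistyped duration or pressure level fails before anything is injected into a running system.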

Figure: Azure Chaos Studio experiment designer detail. (Image: IDG)

When you’re running a stress test like this, you’ll need tools like Azure Monitor alongside Chaos Studio to give you visibility on what’s happening to your systems. The same is true for service-direct faults. These are used to affect Azure resources, like Cosmos DB, once you’ve linked a service to your Chaos Studio instance. Here you can set up a test to see how your application responds to, say, a cross-region failover of a key service.

One of the key aspects of a tool like Chaos Studio is its focus on an experimental approach to testing. This is essential when it comes to large-scale distributed systems where the underlying system state is unknown. Using Chaos Studio, you can validate assumptions about application behavior. For example, you may want to build a test that validates what happens when an Azure zone fails or when you lose a server hosting a set of virtual machines.

Chaos as science: using experiments

The essence of chaos engineering is building a hypothesis and then testing it in order to tease out the edge cases that can cause problems for your users. As Russinovich says, this part of building an observable, manageable distributed system “really becomes a platform to validate the behavior of the system, and it just doesn’t work without observability on the other side. If you can’t observe what the test is doing, the test is useless. So it also is testing your observability, because you’d say, ‘hey, if it loses a few VMs or more than x threshold, then an alert should fire.’ Well, if that alert doesn’t fire, that’s because your observability systems are not tuned to catch those things that you want to catch.”
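
Russinovich's alert example can be framed as a small hypothesis check. The sketch below is illustrative only: the function name, its arguments, and the verdict strings are assumptions, and in practice the alert state would come from a monitoring system rather than a boolean passed in by hand.

```python
def check_alert_hypothesis(vms_lost: int, threshold: int, alert_fired: bool) -> str:
    """Compare what the monitoring did with what the hypothesis predicted.

    Hypothesis: an alert fires if and only if more than `threshold` VMs are lost.
    """
    alert_expected = vms_lost > threshold
    if alert_expected and not alert_fired:
        # The fault was injected, but nobody would have been told about it.
        return "observability gap: alert should have fired"
    if alert_fired and not alert_expected:
        return "noisy alert: fired at or below threshold"
    return "hypothesis held"
```

An "observability gap" verdict means the experiment found a problem in the monitoring rather than in the application, which is the case Russinovich describes.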

Using an experiment-led approach to chaos treats it as a tool for continuously validating your applications. Chaos engineering may sound random, but it isn’t. You are taking an engineering-led approach to disrupting a complex system, with the intent of understanding what effects that disruption has on the system as a whole. Have you designed a shopping cart system that fails over to a new instance if the e-commerce system crashes, or will a customer lose all their shopping and have to repeat everything? You have an assumption about how your application works. Chaos Studio allows you to test everyday operations while simultaneously exploring what happens in more challenging environments.
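
The shopping cart question can be phrased as an experiment against a toy model. In this sketch every name is hypothetical and the "instances" are in-memory dictionaries; it only illustrates the shape of such a test: state an assumption, inject a fault, and check the outcome.

```python
class CartService:
    """Toy cart service with a primary and a standby instance (in memory only)."""

    def __init__(self, replicate: bool) -> None:
        self.replicate = replicate
        self.stores = {"primary": {}, "standby": {}}
        self.active = "primary"

    def add_item(self, customer: str, item: str) -> None:
        self.stores[self.active].setdefault(customer, []).append(item)
        if self.replicate:
            # Copy the write to the other instance so a failover keeps the cart.
            for name, store in self.stores.items():
                if name != self.active:
                    store.setdefault(customer, []).append(item)

    def crash_active(self) -> None:
        """Injected fault: the active instance loses its data and traffic fails over."""
        self.stores[self.active] = {}
        self.active = "standby" if self.active == "primary" else "primary"

    def cart(self, customer: str) -> list:
        return list(self.stores[self.active].get(customer, []))


def cart_survives_failover(service: CartService) -> bool:
    """Experiment: fill a cart, crash the active instance, check the cart is intact."""
    service.add_item("alice", "book")
    service.add_item("alice", "lamp")
    service.crash_active()
    return service.cart("alice") == ["book", "lamp"]
```

Running the same experiment against a replicated and a non-replicated service shows the two outcomes the paragraph contrasts: the cart either follows the customer to the new instance or is lost.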

These are what Russinovich calls “game day” events, using Chaos Studio to experiment with what-if scenarios. He describes how customers on the preview have been using the service: “Let’s say that [they have] an e-commerce application, which is globally distributed for high availability and resiliency, and an Azure region becomes inaccessible, and the application in that region fails. How does the system behave? That’s a game-day kind of experiment that they’ll run.”

This type of usage allows you to build Chaos Studio experiments into your CI/CD pipeline, using it on staging and test deployments alongside load generators before moving code into production. Here it becomes a way of validating deployments and their associated virtual infrastructures before updates are released to the public. By using Azure private VNets to host your canary builds, you can quickly deploy, test, and tear down an instance, keeping costs to a minimum.
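
The deploy, test, and tear down flow described above might be wired into a pipeline step along these lines. This is a sketch under stated assumptions: `deploy`, `run_experiment`, and `teardown` are hypothetical callables supplied by the pipeline, not Chaos Studio or Azure APIs.

```python
from typing import Callable


def chaos_gate(
    deploy: Callable[[], str],
    run_experiment: Callable[[str], bool],
    teardown: Callable[[str], None],
) -> bool:
    """Deploy a short-lived environment, run one chaos experiment, always clean up.

    Returns True when the experiment passed and the build may be promoted.
    """
    environment = deploy()
    try:
        return run_experiment(environment)
    finally:
        # Runs on success, failure, or exception, so the environment never lingers.
        teardown(environment)
```

Putting the teardown in a `finally` block is what keeps costs down in this sketch: the canary environment is removed even when the experiment itself raises an error.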

Continuous validation: the root of resilient cloud applications

There’s an interesting point to be made here about the role of continuous validation (CV) as the third leg of a tripod along with continuous integration and continuous delivery (CI/CD) as the foundation of distributed systems devops. As engineers, we’re tasked with building resilient applications in what’s, at heart, a non-deterministic environment. We’re building systems that run in dynamically self-scaling orchestrated networks of microservices, where services are shared between different applications and where concurrency and consistency make it hard to determine what is causing a problem.

Russinovich is clearly excited by the possibilities of systems like this, noting that what’s shipping with the public preview of Chaos Studio is only the beginning of something much bigger. “This is kind of a first step in a comprehensive system. It’s just going to get more and more sophisticated over time.”

On one side of our applications are observability tools that allow us to infer the state of an application from its many outputs. What Chaos Studio gives us, along with various test frameworks, is a way of controlling more of the inputs to help us understand how changes in infrastructure and services affect our code. It’s clear from my conversation with Russinovich that Microsoft has plans to take Chaos Studio further, looking to use it to test services as well as infrastructure.

As we treat cloud platform services as composable infrastructure elements, this approach makes sense, bringing concepts from security testing, like fuzzing, into API tests. We need to be able to see what happens to a system when it receives incorrect inputs just as much as we need to see what happens when an element fails. As Russinovich points out, if a system fails on Cyber Monday, there could be significant business consequences. “[If it] goes down and now I can’t process orders, that’s costing me literally millions of dollars an hour or tens of millions,” he says.

With that much business at risk, chaos engineering is increasingly important for cloud architects. As systems get more and more complex, there’s a need to understand how they fail. Without that knowledge, we can’t build the resilient tools necessary to support our businesses. By delivering a common tool for injecting faults into our systems, Microsoft is giving us much of what’s necessary to add continuous validation to our build pipelines and to our CI/CD processes. Maybe someday we’ll have CI/CD/CV, but for now, we can start to explore what system faults really do to our code.

Copyright © 2021 IDG Communications, Inc.