3 key steps for running chaos engineering experiments

Only by breaking your systems can you learn to make them better. Just be sure to break them in a controlled way.

Chaos engineering is the practice of running thoughtful, planned experiments that teach us how our systems behave in the face of failure. Given the trends around dynamic cloud environments and the rise of microservices, the web continues to grow increasingly complex alongside our dependency on these systems. Making sure failures are mitigated and proactively deterred is more important now than ever.

Even brief issues can hurt customer experience and impact a company’s bottom line. The cost of downtime is becoming a major KPI for engineering teams, and when there’s a major outage the cost can be devastating. In 2017, 98 percent of organizations surveyed by ITIC said a single hour of downtime would cost their business over $100,000. One major outage could cost a single company millions of dollars. The CEO of British Airways recently revealed that a technological failure that stranded tens of thousands of British Airways passengers in May 2017 cost the company 80 million pounds (about $102 million).

This is why companies that proactively prepare for these scenarios will be much better off than those that wait for the next incident. Below are three key steps for running effective chaos engineering experiments within your organization. Start with a single host, container, or microservice in your test environment. Then try to crash several of them. Once you’ve reached 100 percent in your test environment, start again with the smallest possible footprint in production and scale up from there.

Chaos engineering step #1: Plan an experiment

One of the most powerful questions in chaos engineering is “What could go wrong?” Start by forming a hypothesis about how a system should behave when it comes under stress. By thinking about your services and environments up front, you can better prioritize which scenarios are more likely (or more frightening) and should be tested first. By sitting down as a team and whiteboarding your services, dependencies, and data stores, you can also formulate some worst-case scenarios.

If you don’t know exactly where to start, injecting a failure or a delay into each of your dependencies is a great way to begin understanding your systems better. And by discussing the scenario with your team, you can hypothesize about the expected outcome when running live. What will be the impact on customers, on your service, or on your dependencies?
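As an illustration, a minimal delay injection for one dependency could look like the following Python sketch. The function names are hypothetical, and real chaos tooling typically injects failures at the network or host level rather than in application code; this only shows the shape of the experiment.

```python
import functools
import random
import time

def inject_latency(delay_seconds, probability=1.0):
    """Wrap a dependency call so some fraction of calls are artificially delayed."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                time.sleep(delay_seconds)  # simulate a slow downstream dependency
            return func(*args, **kwargs)
        return wrapper
    return decorator

# Hypothetical downstream call, used here for illustration only.
@inject_latency(delay_seconds=2.0, probability=0.5)
def fetch_recommendations(user_id):
    return ["item-1", "item-2"]
```

With half of all calls delayed by two seconds, you can watch whether timeouts, retries, and fallbacks behave the way your hypothesis predicted.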

Chaos engineering step #2: Contain the blast radius

Next, design the smallest possible experiment that effectively tests your system. By starting small, even if things go wrong, you won’t cause an outage. The idea is to understand how failure plays out, then scale the experiment up as trust in the system grows. At each step you should be validating assumptions at scale and building confidence in your systems.

Start by asking questions like: Does your code handle error conditions? Do you have insight into these failures? Then, as you expand, you can focus on scale: Do you protect yourself from load spikes by “backing off” on dependencies that are under water? It is good to have a key performance metric that correlates to customer success (such as orders per minute or stream starts per second). If you see an impact on these metrics, halt the experiment immediately.
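That halting rule can be sketched as a loop that escalates the experiment one step at a time and checks the customer-success metric after each step. The callback names below are hypothetical placeholders for whatever your tooling provides to escalate an attack, stop it, and read a live business metric.

```python
def run_with_halt(run_attack_step, stop_attack, get_orders_per_minute,
                  baseline, max_drop_pct=5.0, steps=10):
    """Escalate an experiment step by step, aborting if a customer-success
    metric (here, orders per minute) drops more than max_drop_pct from baseline."""
    for step in range(steps):
        run_attack_step(step)            # e.g., add one more affected host
        current = get_orders_per_minute()
        drop_pct = 100.0 * (baseline - current) / baseline
        if drop_pct > max_drop_pct:
            stop_attack()                # halt immediately on customer impact
            return False
    stop_attack()
    return True                          # completed within tolerance
```

The important property is that the abort check runs after every escalation, not just at the end, so the blast radius never grows past the point where customers are affected.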

Chaos engineering step #3: Scale or roll back

Finally, you want to measure the impact of the failure at each step. This could be the impact on latency, requests per second, or system resources. You should also survey your dashboards and alarms for unintended side effects. When the experiment completes, you should have a much better understanding of your systems’ real-world behavior.

Important to note: Always have a plan in case things go wrong. If you’re running commands by hand, be careful not to break SSH or control plane access to your instances. One of the core aspects of our chaos engineering tool, Gremlin, is safety—all of Gremlin’s attacks can be reverted immediately, allowing you to safely abort and return to steady state if things go wrong.
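That revert-no-matter-what discipline can be sketched generically. This is a pattern sketch, not Gremlin’s actual API; the start and revert callbacks are placeholders for whatever mechanism injects and removes your failure.

```python
import contextlib

@contextlib.contextmanager
def chaos_attack(start, revert):
    """Guarantee that an injected failure is reverted, even if the
    observation code raises, so the system always returns to steady state."""
    start()
    try:
        yield
    finally:
        revert()

# Usage sketch with placeholder callbacks:
# with chaos_attack(start_cpu_burn, stop_cpu_burn):
#     watch_dashboards()
```

Wrapping the experiment this way means an unexpected exception in your measurement code cannot leave the failure injected.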

After running an effective chaos engineering experiment, there are essentially two outcomes: Either you’ve verified that your system is resilient to the failure you introduced, or you’ve found a problem that you need to fix. Both of these are good outcomes! Now go have fun breaking things. 

Kolton Andrus is co-founder and CEO of Gremlin. Previously he was a chaos engineer at Netflix improving streaming reliability and operating the edge services. He designed and built F.I.T., Netflix’s failure injection service. 

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.