Let loose your Chaos Monkey

Is all of the redundancy you build into your infrastructure really worth the trouble if you aren't willing to test it in production?

No matter the size of your IT infrastructure, you've built in some form of redundancy, whether it's as simple as RAID in direct-attached storage on your servers or as complex as multiple, cascaded, geographically separated hot sites.

In the past, I've strongly advocated setting aside a portion of planned downtime windows to test that redundancy -- which just about everyone can and should do. But how would you feel about testing that redundancy smack dab in the middle of a production day? When was the last time you yanked a disk out of a RAID set or unplugged a redundant network link just to see what would happen?

If the answer is never -- why? After all, you've invested the capital in providing the redundancy. What does it say about that investment if you're unwilling to test it when it matters most?

If your gut reaction is "that's crazy -- who would do that?" look no further than Netflix. As the company moved its streaming infrastructure to Amazon Web Services a few years ago, one of the very first things it did was deploy a devious piece of software called Chaos Monkey, whose sole job is to try to break the infrastructure. While very few of us have the time or inclination to develop software to trash our internal systems, great lessons can be learned from Netflix's example.

Monkeying around

Chaos Monkey's original task was to randomly disable production services so that Netflix could evaluate how its application infrastructure would react when the unexpected happened. By setting the monkey free during times when plenty of engineering resources were available and ready to pick up the pieces if automated recovery failed, Netflix could evaluate the impact of a failure on real, live production services. This significantly enhanced the company's ability to make its systems more durable -- and to validate that its redundant infrastructure actually worked.

Since that initial implementation several years ago, Chaos Monkey has grown and diversified into an army of treacherous rascals hell-bent on introducing everything from high network and service latencies to full AWS availability zone failures. It is in no small part why Netflix weathered last April's AWS outage without suffering the kinds of downtime that other, less prepared AWS customers experienced.

Chaos on a budget

You might not have the resources to develop the likes of Chaos Monkey, but that doesn't mean you can't monkey around yourself. The key tenet of the Chaos Monkey approach is to introduce failures to a running infrastructure while it's in actual use. Naturally, you should start by testing off-hours or during a maintenance window to ensure that you definitely aren't going to cause a problem. But at some point, you need to test while real users are trying to do real business on your systems.
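A homegrown version of this idea doesn't need to be sophisticated. As a minimal sketch (not Netflix's actual tool), the script below picks a random victim from a hypothetical inventory and stops its service over SSH. The host names, service names, and the `systemctl` command are assumptions you'd replace with your own; it defaults to a dry run so you can rehearse off-hours before ever touching production.

```python
import random
import subprocess

# Hypothetical inventory of candidate targets; in a real environment this
# would come from your CMDB or your cloud provider's API.
TARGETS = [
    {"host": "web01", "service": "nginx"},
    {"host": "web02", "service": "nginx"},
    {"host": "app01", "service": "myapp"},
]

def pick_victim(targets, rng=random):
    """Choose one target at random -- the essence of the monkey."""
    return rng.choice(targets)

def break_it(victim, dry_run=True):
    """Stop the victim's service over SSH.

    Defaults to a dry run that only reports what it would do, so you can
    verify the plumbing before unleashing it on live systems.
    """
    cmd = ["ssh", victim["host"],
           "sudo", "systemctl", "stop", victim["service"]]
    if dry_run:
        return "DRY RUN: " + " ".join(cmd)
    return subprocess.run(cmd, check=True)

if __name__ == "__main__":
    victim = pick_victim(TARGETS)
    print(break_it(victim))
```

Run it on a schedule only during windows when engineers are watching, and widen the target list as your confidence grows.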

You can't truly know what users experience when a failure happens unless you actually do it. For instance, off-hours testing may tell you that dropping one of four 10 Gigabit Ethernet links to a blade chassis will result in momentary high latency to a quarter of the VMs followed by a smooth recovery -- but what does this latency mean to the users? Will they get kicked out of VDI sessions? Will an upstream hardware load balancer interpret that latency to mean your virtualized Web servers are down -- and magnify the disruption? Will your monitoring deployment even tell you that it's happened? You may have a guess, but you simply can't know until it happens.
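The link-drop scenario above is easy to script as a controlled drill. This is a hedged sketch, not a vetted tool: the interface name `eth1`, the 60-second window, and the use of `ip link` are assumptions to adapt to your environment, and `DRY_RUN=1` (the default) only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical drill: drop one redundant uplink, watch what users and
# monitoring actually see, then restore it. LINK and DURATION are
# placeholders -- substitute your own interface and observation window.
LINK="${LINK:-eth1}"
DURATION="${DURATION:-60}"
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "DRY RUN: $*"
  else
    "$@"
  fi
}

run ip link set "$LINK" down   # yank the redundant link
run sleep "$DURATION"          # observe latency, failover, and alerts
run ip link set "$LINK" up     # restore the link and confirm recovery
```

Running it first with `DRY_RUN=1` in a maintenance window, then for real against a test chassis, gives you the answers to the questions above before a failure chooses the timing for you.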

Once you know what will take place during a particular failure scenario, you can adjust your infrastructure to handle it more gracefully; maybe you can implement changes as simple as modifying a couple of settings in your network gear or lengthening timeouts on client devices. Better yet, you can go back and test again to make sure your changes truly solved the problem.

The biggest benefit to this approach is that you're producing failures when you know the parameters and are ready to deal with the fallout. This is far better than coming in cold to a failure in progress on a weekend without having any idea of the cause. As Jeff Atwood at Coding Horror notes in an account of his experiences fighting an unintentional Chaos Monkey within his infrastructure: "The best way to avoid failure is to fail constantly." I couldn't agree more.

This article, "Let loose your Chaos Monkey," originally appeared at InfoWorld.com. Read more of Matt Prigge's Information Overload blog at InfoWorld.com.

Copyright © 2012 IDG Communications, Inc.