Let loose your Chaos Monkey

Is all of the redundancy you build into your infrastructure really worth the trouble if you aren't willing to test it in production?

No matter the size of your IT infrastructure, you've built in some form of redundancy, whether it's as simple as RAID in direct-attached storage on your servers or as complex as multiple, cascaded, geographically separated hot sites.

In the past, I've strongly advocated setting aside a portion of planned downtime windows to test that redundancy -- which just about everyone can and should do. But how would you feel about testing that redundancy smack dab in the middle of a production day? When was the last time you yanked a disk out of a RAID set or unplugged a redundant network link just to see what would happen?

If the answer is never -- why? After all, you've invested the capital in providing the redundancy. What does it say about that investment if you're unwilling to test it when it matters most?

If your gut reaction is "that's crazy -- who would do that?" the answer is Netflix. As the company moved its infrastructure to Amazon Web Services a few years ago, one of the very first things it did was deploy a devious piece of software called Chaos Monkey, whose sole job is to try to break that infrastructure. While very few of us have the time or inclination to develop software to trash our internal systems, great lessons can be learned from Netflix's example.
