First, we need to thoroughly exercise the storage from every host in the cluster. Fortunately, that's extremely easy with virtualization. Build a quick Linux VM with scripts that run Bonnie++ or even dd in a loop, then clone the whole shebang as many times as necessary to put a significant load on each physical host in the cluster, hitting every planned LUN or share on the storage. With randomized sleep times, this produces a randomized workload of streaming reads and writes, or a randomized workload of random reads and writes, or whatever you like. If you really want to stress out a storage subsystem, there are few better ways to do it.
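A per-VM stress script along those lines might look like this minimal sketch -- the mount point, the `bench` user, and the pass size are all assumptions for illustration, not anything prescribed above:

```shell
#!/bin/bash
# Sketch of the per-VM storage-stress pass (paths and names are assumptions):
# hammer one mount point backed by a planned LUN or share with streaming I/O
# via dd, plus a mixed-I/O pass via Bonnie++ when it's installed.
storage_stress() {
    local target=$1 mib=${2:-1024} file
    file="$target/stress.$$"

    # Streaming write, then streaming read, of $mib MiB
    dd if=/dev/zero of="$file" bs=1M count="$mib" conv=fsync 2>/dev/null || return 1
    dd if="$file" of=/dev/null bs=1M 2>/dev/null
    rm -f "$file"

    # Mixed-I/O pass with Bonnie++ if present (-u: run as an unprivileged
    # user; "bench" is a hypothetical account); ignore its exit status and
    # keep hammering
    if command -v bonnie++ >/dev/null; then
        bonnie++ -d "$target" -u bench -q || true
    fi
}

# Clone-and-loop wrapper: each cloned VM repeats the pass with a randomized
# sleep so the clones drift out of lockstep and the load pattern stays random.
storage_stress_loop() {
    while :; do
        storage_stress "$1" "$2"
        sleep $((RANDOM % 60))
    done
}
```

Point each clone's `storage_stress_loop` at a different LUN or share and let them run; the randomized sleeps do the rest.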
Now, after watching that for a few days and noting the absence of network or storage errors, we should add to the load. Toss Netperf or a similar network stress tool on each of those test VMs, write a quick script that randomizes TCP tests of varying payload sizes and durations between all the VMs, and loop it the same way, running concurrently with the storage workload. If you want to add to the misery, throw in a few other VMs with a large number of virtual CPUs and plenty of RAM, then run CPU and RAM stress routines on them. At this point, we should be hammering the hell out of just about every aspect of the cluster, from CPU to storage, from RAM to the network. If something's going to break, this is where it will happen -- at least in theory.
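The randomized Netperf driver could be as simple as the sketch below -- the peer VM names, the size and duration ranges, and the stress-ng line for the CPU/RAM-burner VMs are all assumptions, there to show the shape of the thing:

```shell
#!/bin/bash
# Sketch of the randomized netperf driver (peer names are hypothetical).
# Each pass picks a random peer, duration, and send size, then runs a
# bulk-throughput TCP_STREAM test against it.
PEERS=(testvm01 testvm02 testvm03)   # assumed names of the other test VMs

net_stress_cmd() {
    local peer=${PEERS[RANDOM % ${#PEERS[@]}]}
    local secs=$((10 + RANDOM % 50))        # 10-59 s per test
    local msg=$((1 << (10 + RANDOM % 7)))   # 1 KiB-64 KiB send size
    echo netperf -H "$peer" -l "$secs" -t TCP_STREAM -- -m "$msg"
}

# The loop each test VM runs, concurrently with the storage workload:
net_stress_loop() {
    while :; do
        $(net_stress_cmd)            # run the generated netperf test
        sleep $((RANDOM % 30))       # random pause between tests
    done
}

# On the dedicated CPU/RAM-burner VMs, something like stress-ng keeps the
# cores and memory busy at the same time, e.g.:
#   stress-ng --cpu 8 --vm 4 --vm-bytes 4G --timeout 0
```

Swapping `TCP_STREAM` for `TCP_RR` in the mix adds request/response traffic alongside the bulk transfers, which gets you closer to a real production pattern.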
Right about at that point is where I'd start trying to break things. Pull a host's power and make sure any fail-over actions happen appropriately. Run an automated host upgrade process and watch it carefully. Yank a network cable, or shut down the relevant switch port, and make sure that bonded and fail-over network links work like they're supposed to. Also, check that all of this happens under load -- that's when it matters most.
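While you're yanking cables, it helps to have a watcher running from a machine outside the cluster so the gaps get timestamped rather than eyeballed. A rough sketch, assuming a Linux bonding setup (the host names and `bond0` are placeholders):

```shell
#!/bin/bash
# Sketch of a failover watcher (host and interface names are assumptions).
# Run from outside the cluster while pulling power or cables; any
# reachability gap gets logged with a timestamp.
watch_host() {
    local host=$1
    while :; do
        if ! ping -c 1 -W 1 "$host" >/dev/null 2>&1; then
            echo "$(date -Is) $host UNREACHABLE"
        fi
        sleep 1
    done
}

# On the host itself, confirm the bond actually failed over to the backup
# link by reading the active slave out of the bonding driver's status file.
bond_active_slave() {
    awk -F': ' '/Currently Active Slave/ {print $2}' "${1:-/proc/net/bonding/bond0}"
}
```

Checking `bond_active_slave` before and after the cable pull tells you whether the failover happened at the link level, and the watcher's log tells you how long the cluster was actually dark -- under load, both numbers matter.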
For some, this is one of the best parts of the job: coming up with ways to beat the stuffing out of fresh gear, poking for weaknesses and holes. For everyone, the benefits are indispensable. For one thing, it allows a certain peace of mind once the production workload shifts over; for another, it's vastly easier than trying to fix a big problem that was missed early on and winds up causing production outages.
So test, test, and test some more. Have some fun cooking up creative ways to stress every subsystem, every component, and ease everything into production after a reasonable breaking-in period. That light will still be on when you get there, perhaps a bit brighter and more soothing than before.
This story, "Didn't test? Then don't deploy," was originally published at InfoWorld.com. Read more of Paul Venezia's The Deep End blog at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.