7 simple rules for better systems testing

Whether you're testing DR procedures or double-checking backups, these tips help you get the most out of those tests

Generally speaking, you should never expect anything you don't test regularly to work properly. This is true across all kinds of technologies, but the need for regular testing is often overlooked. Would you expect a car you parked in a barn two years ago to start today? If it did, you'd feel lucky. IT systems are no different. You shouldn't count on a successful site failover, to take one important example, if you haven't tested or maintained the systems that make it work.

As critical as testing is, it often gets pushed aside in favor of the never-ending backlog of seemingly more urgent tasks. Forgoing testing completely is obviously dangerous, but it's also dangerous to test your systems in ways that don't meaningfully reflect how they would be used when they are really needed. Here are seven things you can do to make your testing count -- and to ensure that the confidence you have in your systems and procedures is well founded.

Testing rule No. 1: Perform real-world tests

The very first step to take is to ensure your tests are as close to real-world circumstances as possible. For example, if you're testing your ability to perform a site failover, be sure to completely isolate yourself from the primary site, just as if it had been rendered completely inaccessible. You may find that certain parts of your procedures (such as passwords or the procedures themselves!) are either located in or depend upon resources at the primary site.

The best way to do this is by staging the test at a time when the production environment can be disabled for the purpose, but few of us have user communities and management that will support that idea. Instead, you will probably need to invest some time in making absolutely sure your recovery procedures don't depend on the very infrastructure or services you're trying to recover.
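
One low-effort way to surface those hidden dependencies before the test is to sweep your runbooks for references to primary-site resources. The sketch below is purely illustrative and makes assumptions about your environment: the directory name, file extension, and the hostname and IP patterns are placeholders you'd replace with whatever actually identifies your primary site.

```
# Illustrative sketch: flag DR runbook files that reference primary-site resources.
# Directory, file pattern, and the regexes below are placeholders -- substitute the
# names, IP ranges, and hostnames that actually identify your primary site.
import re
from pathlib import Path

PRIMARY_SITE_PATTERNS = [
    r"\bdc1\b",                       # e.g., a primary data-center naming convention
    r"\b10\.1\.\d{1,3}\.\d{1,3}\b",   # e.g., an IP range that exists only at the primary site
    r"vault\.primary\.example\.com",  # e.g., a password vault hosted at the primary site
]

def find_primary_dependencies(runbook_dir):
    """Return (file, line number, text) for every line mentioning a primary-site resource."""
    hits = []
    for path in Path(runbook_dir).rglob("*.md"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if any(re.search(p, line, re.IGNORECASE) for p in PRIMARY_SITE_PATTERNS):
                hits.append((str(path), lineno, line.strip()))
    return hits

if __name__ == "__main__":
    for file, lineno, text in find_primary_dependencies("./dr-runbooks"):
        print(f"{file}:{lineno}: {text}")
```

Anything a sweep like this flags is a candidate to copy to the recovery site, or to a location reachable from it, before you pull the plug.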

Testing rule No. 2: Include the human element

Similarly, it's also crucial to involve the human element in your tests. It's one thing to make sure all the systems work, but what about the people? Do they remember what they need to do? Do they know where critical documentation is located and how to get to it? Will they recognize an emergency and react the way you hope they will?

Because most things you'll want to test are responses to unexpected events, it's often eye-opening to run the tests when the people responsible for carrying them out don't expect them. What you learn about your staff's response time and their ability to carry out procedures without guidance can end up being just as important as what you learn about the systems you're testing.

Testing rule No. 3: Observe the effect on monitoring tools

If you're lucky enough to be allowed to perform an outage-inducing test, be absolutely sure to evaluate the information that comes out of your monitoring and alerting tools. Would the data they provide be enough to guide your staff to the right conclusion? What can you tweak or monitor to make it faster and easier to determine the root cause of a major incident?

In my experience, engaging staff, identifying the problem, and deciding how to react can often consume far more time than the recovery steps themselves. The quality of your monitoring and alerting tools plays a big role in that diagnostic part of the process.

One recent outage sticks out in my mind as being a good example of both a failure to test using real-world circumstances and deficiencies in the configuration of monitoring tools. In this case, the failover of a critical piece of network hardware was typically tested by unplugging the secondary device from the primary and observing whether the secondary would properly assume the role of the primary. Because some manual intervention was necessary, the monitoring systems were also tested. During the tests, the monitoring systems detected the loss of the link and alerted properly, so operators believed everything was working well.

When a real-life failure eventually occurred, a routing adjacency between the devices was lost as the primary system encountered a software bug and crashed, but the physical link between the devices did not go down. Not only did the secondary system fail to take over as it should have, but the monitoring system, which was watching only the physical link, didn't alert operators properly. Precious time was wasted trying to figure out what had happened.
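
The lesson is to monitor what the service actually depends on -- in this case, reachability through the device -- rather than only the state of the physical link. As a purely illustrative sketch (not the monitoring setup from the story), the snippet below checks both the directly connected peer and an address on the far side of the device; the addresses are placeholders, and in practice you'd wire a check like this into whatever alerting system you already run.

```
# Illustrative sketch: alert on loss of end-to-end reachability, not just link state.
# Addresses are placeholders; "-W" is the per-reply timeout used by Linux ping.
import subprocess

CHECKS = {
    "physical link (directly connected peer)": "10.0.0.1",    # placeholder
    "path through the device (far side)":      "192.0.2.10",  # placeholder
}

def reachable(host, timeout_s=2):
    """Return True if a single ICMP echo to the host succeeds."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        capture_output=True,
    )
    return result.returncode == 0

if __name__ == "__main__":
    for name, host in CHECKS.items():
        status = "OK" if reachable(host) else "ALERT"
        print(f"{status}: {name} ({host})")
```

A check on the routing adjacency itself -- via SNMP or the device's management interface, for example -- would be even closer to the failure mode described above.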

Testing rule No. 4: Use your documentation

When you're performing testing, make sure you actually use whatever documentation you might have created -- whether textual documentation or diagrams -- to guide you through the process. All too often, documentation such as disaster recovery plans is created once so that it can be trotted out every so often for an auditor to look at, but is hardly ever examined by the people who actually need to rely on it. Unless you operate a very simple environment, your documentation should be regularly maintained and up-to-the-minute accurate. Documentation is often the first thing people reach for when an incident takes place. Make sure it's up to the task.
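
One simple habit that helps is an automated staleness check that nags you when runbooks haven't been touched in a while. The sketch below is illustrative only; the directory name and the 90-day threshold are assumptions to tune to your own environment and review cadence.

```
# Illustrative sketch: flag runbook files that haven't been updated recently.
# The directory and threshold are placeholders -- tune both to your environment.
import time
from pathlib import Path

RUNBOOK_DIR = Path("./dr-runbooks")  # placeholder location
MAX_AGE_DAYS = 90                    # placeholder review cadence

def stale_docs(directory, max_age_days):
    """Yield (file, age in days) for documents older than the threshold."""
    now = time.time()
    cutoff = now - max_age_days * 86400
    for path in directory.rglob("*"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            yield path, (now - path.stat().st_mtime) / 86400

if __name__ == "__main__":
    for path, age in stale_docs(RUNBOOK_DIR, MAX_AGE_DAYS):
        print(f"STALE ({age:.0f} days): {path}")
```

Modification time is a crude proxy for accuracy, of course, but a report like this at least forces someone to open the document and confirm it still matches reality.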

Testing rule No. 5: Involve secondary staff

Even if you don't need the documentation because you know the systems in question so well, imagine a situation in which you're not available and someone less familiar with them needs to execute the procedure. For these folks, good documentation will be crucial. It follows, then, that having team members who aren't primary for the system in question perform the testing can make an awful lot of sense. You will not only be testing the system, but also your documentation and the readiness of your team members to take over for their peers during an emergency.

Testing rule No. 6: Lessons learned

The most important part of a testing regimen is what you do after the test is done. If you uncover deficiencies in your systems, your documentation, or your team, make sure those lessons are actually captured and acted upon. After all, the reason you're testing is to find the parts of the process that don't work before you actually need them. If everything worked fine, everyone knew exactly what to do, and your documentation proved completely up to date, that's great. Most of us, though, will find that some part of the equation doesn't add up the way we'd like: part of a system will need to be fixed, a new team member will need more training, or the documentation will need to be updated.

Testing rule No. 7: Lather, rinse, repeat

After you've run through the entire testing process and identified any weaknesses, it's time to do it all over again. If your testing uncovered no problems, the next round might be months away. If you found deficiencies, though, it's important to validate your fixes by running another test on the heels of the first, to be certain you've actually solved the problems you encountered.

Whatever you do, set aside as much time as you possibly can to test the systems you've put in place. Only testing can give you real confidence that your business will stay up and running, so you'll want to make the best use of the limited time you can spend on it. Trust me: you'll thank yourself when a real failure takes place, and you'll sleep better knowing with certainty that your systems will work when you need them.

This article, "7 simple rules for better systems testing," originally appeared at InfoWorld.com. Read more of Matt Prigge's Information Overload blog and follow the latest developments in storage at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.

Copyright © 2013 IDG Communications, Inc.