Lessons from NASA: How to solve problems quickly under pressure

Make sure you have the right tools in place to quickly determine the source of unexpected events -- such tools do in fact exist

In my column last week, I explained the great importance of communicating quickly and clearly to business stakeholders both during and after an outage. That can be a challenging prospect in situations where you and your team are so caught up in fixing the problem that you don't have time to keep the rest of the organization up to date. It's even worse when you simply don't have anything to tell them yet. Although sometimes it's simply necessary, nobody likes to say, "I don't know what's wrong," when systems are down and productivity is hindered.

Reliably knowing what the source of or resolution to an unexpected outage is in its early minutes requires a lot of planning and forethought. At first, that might seem illogical. After all, can you really plan for the unplanned?

As it turns out, you can. All it requires is giving yourself the right tools to work with. Tracking a complex problem to its source almost always requires a substantial amount of information. The trick is making sure that information is available any time you need it, and that it's easy to search and correlate. To do so, you need monitoring systems in place to record it, and you need to ensure those systems are set up correctly and carefully maintained.

The best analogy I can think of for this can be found in the space program. Even if you're not a space-exploration junkie, you've probably seen stunning video footage of some kind of space launch. These days, that sort of video is increasingly being used as a sales tool to garner public support for these programs, but that's not really the reason the video exists. It's there so that engineers have an absolute wealth of information to work with in the event that an unexpected event takes place.

The piece of foam that damaged the NASA space shuttle Columbia and eventually resulted in its destruction is a fantastic (and extremely unfortunate) example of a failure to collect and interpret enough information to assess a problem after it had occurred. Following that disaster, a multitude of additional cameras and in-space examinations of other shuttle launches were used to identify similar problems and prevent future loss of life.

Although the space program and an enterprise network might not seem to have much in common, in this sense they are very similar. Most networks have some kind of monitoring system in place, even if it's only one that informs administrators when something has broken. Of course, you never want the first indication that something is wrong to be an army of users calling to tell you your system is broken (though a striking number of enterprises routinely find themselves in that position). In my book, a reactive monitoring system is basically mandatory if you're running mission-critical systems on a computer network (and who isn't?).

However, what many enterprises don't have is the equivalent of the 50 to 100 cameras that might be watching a manned spaceflight launch. Though it's true that most operating systems and applications produce large amounts of log data, this information is difficult to use in an emergency and is sometimes even unusable when assembling a root cause analysis well after the emergency is over.

Primarily, this is because event logs and the like are either entirely ephemeral (in the case of the memory buffer-based logs you find on network devices like switches and routers) or stand to be lost if a server or its storage is catastrophically damaged. Worse, many default logging configurations use log-rotation schemes that are quickly overwhelmed by the large volume of logging that often goes hand in hand with an outage.
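To illustrate that last point, a size-based retention policy can keep rotated logs around long enough to survive a noisy outage. The paths and limits below are placeholder assumptions, not a recommendation for any specific system:

```
# /etc/logrotate.d/app -- hypothetical example; paths and sizes are assumptions
/var/log/app/*.log {
    size 100M        # rotate on size, not just on a daily schedule
    rotate 30        # keep 30 generations rather than a small default
    compress         # compress rotated copies to save space
    missingok        # don't error if a log file is absent
    notifempty       # skip rotation when the file is empty
}
```

Compared to a default that keeps only a handful of daily rotations, a scheme like this is far less likely to discard the very entries you need during an incident.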

You can go a long way toward facilitating the collection of useful forensic data by setting up a centralized logging system that gathers logs of all shapes and kinds from across your entire network into a single point where they can be safely retained. However, that alone doesn't solve the whole problem. Simply having the logs may make a root cause analysis possible, but it might not help you in the first minutes of an outage unless you know exactly what to look for. If you're searching for the proverbial needle in a haystack, you need something more than just logs.
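As a minimal sketch of that idea, the snippet below forwards application log records to a central syslog collector using only the Python standard library. The host name and port are assumptions standing in for your own collector:

```python
import logging
import logging.handlers

def build_central_logger(name, host="loghost.example.com", port=514):
    """Return a logger that forwards every record to a central syslog
    collector over UDP. The host/port defaults are placeholders for
    whatever collector your environment actually runs."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # An address tuple makes SysLogHandler use a UDP datagram socket.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(name)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Point every server's applications at the same collector and you get the "single point where they can be safely retained" described above, even if the originating machine is later destroyed.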

To aid in that fight are an increasing number of tools available in the market today that let you search a vast array of logging and alerting sources, then correlate different events. Even better, there are systems that can determine what's "normal" for a network and alert you when multiple parameters exceed the thresholds the system has come to expect.

In the first instance, you might get a report that a certain user can't access a particular application while other users aren't having that problem. Using a logging tool like Splunk, you can simply punch in the user's ID or perhaps their workstation's IP address and see all the log entries from every system, network device, or application that logged it across the whole network in just seconds -- a truly powerful capability.
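To make that kind of search concrete without assuming any particular product's API, here is a toy cross-source search over a directory of centralized log files. The directory layout and function name are my own invention, not Splunk's interface:

```python
import re
from pathlib import Path

def search_logs(log_dir, pattern):
    """Scan every *.log file under a centralized log directory and
    return (filename, line) pairs whose line matches the pattern --
    a crude stand-in for a cross-source search tool."""
    rx = re.compile(pattern)
    hits = []
    for path in sorted(Path(log_dir).rglob("*.log")):
        for line in path.read_text().splitlines():
            if rx.search(line):
                hits.append((path.name, line))
    return hits
```

A real tool adds indexing, time-range filtering, and field extraction on top of this, which is what makes the same query return in seconds across terabytes rather than minutes across megabytes.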

In the second instance, you might get reports that certain virtual machines on your VMware-based virtualization infrastructure are slow at a specific time of night. With a tool like VMware's Operations Manager in place, you would be alerted when a variety of parameters (including the likes of disk latency) went outside their usually observed levels.

Tools like this can be incredibly useful because they try to statistically correlate a wide variety of data you might never think to put together on your own. They might be able to immediately determine there's high disk latency only on a specific SAN volume or only on one host. Or maybe the disk latency on a single host also correlates with extremely high CPU load and disk I/O on a single VM that's always running when the alert occurs. From experience, I can tell you that even though you can assemble the information on your own, it will take you far, far longer to do so without a tool that automates much of the footwork.
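A stripped-down sketch of that "learned normal" approach might look like the following, assuming you've collected per-metric histories yourself. The metric names and threshold are illustrative assumptions:

```python
from statistics import mean, stdev

def anomalies(history, current, z_threshold=3.0):
    """Flag metrics whose current value sits more than z_threshold
    standard deviations from their historical mean -- a toy version
    of baseline-based alerting.

    history: dict mapping metric name -> list of past samples
    current: dict mapping metric name -> latest sample
    """
    flagged = {}
    for name, samples in history.items():
        mu, sigma = mean(samples), stdev(samples)
        if sigma == 0:
            continue  # no observed variation; avoid dividing by zero
        z = (current[name] - mu) / sigma
        if abs(z) > z_threshold:
            flagged[name] = round(z, 2)
    return flagged
```

Commercial tools refine this considerably (seasonal baselines, cross-metric correlation, and so on), but the core idea is the same: let the system learn what normal looks like so that you don't have to guess.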

Although every enterprise's approach will be slightly different, take a few cues courtesy of the space program and make sure you're providing your future self with enough information to get you out of the jam you'll undoubtedly be in the next time something goes wrong.

Yes, some of the more capable log and event correlation packages can be pretty pricey -- perhaps even beyond the reach of your budget -- but simply setting up a centralized log repository costs peanuts and is well worth the effort. No matter what you do, don't wait until after a catastrophic disaster to put these measures in place.

This article, "Lessons from NASA: How to solve problems quickly under pressure," originally appeared at InfoWorld.com. Read more of Matt Prigge's Information Overload blog and follow the latest developments in storage at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.

Copyright © 2013 IDG Communications, Inc.