Lessons from NASA: How to solve problems quickly under pressure

Make sure you have the right tools in place to determine the source of unexpected events quickly -- they do in fact exist

In my column last week, I explained the great importance of communicating quickly and clearly to business stakeholders both during and after an outage. That can be a challenging prospect in situations where you and your team are so caught up in fixing the problem that you don't have time to keep the rest of the organization up to date. It's even worse when you simply don't have anything to tell them yet. Although sometimes it's simply necessary, nobody likes to say, "I don't know what's wrong," when systems are down and productivity is hindered.

Reliably knowing the source of an unexpected outage, or how to resolve it, within its first few minutes requires a lot of planning and forethought. At first, that might seem illogical. After all, can you really plan for the unplanned?

As it turns out, you can. All it requires is giving yourself the right tools to work with. Tracking a complex problem to its source almost always requires a substantial amount of information. The trick is making sure that information is available whenever you need it and is easy to search and correlate. To do so, you need monitoring systems in place to record it, and you need to ensure that those systems are set up correctly and carefully maintained.
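
To make "easy to search and correlate" concrete, here is a minimal Python sketch. It assumes each monitoring system's records have already been collected into one list of (timestamp, source, message) tuples. The function name events_before, the 10-minute window, and the sample records are all hypothetical and chosen only for illustration; they do not come from any particular monitoring product.

```python
from datetime import datetime, timedelta

def events_before(events, outage_start, window=timedelta(minutes=10)):
    """Return events from any source logged within `window` before the outage, oldest first."""
    earliest = outage_start - window
    hits = [e for e in events if earliest <= e[0] <= outage_start]
    return sorted(hits, key=lambda e: e[0])

# Hypothetical records gathered from several monitoring sources.
events = [
    (datetime(2024, 1, 1, 9, 58), "switch", "port flapping"),
    (datetime(2024, 1, 1, 8, 0), "backup", "job finished"),
    (datetime(2024, 1, 1, 9, 55), "storage", "latency high"),
]

outage = datetime(2024, 1, 1, 10, 0)
recent = events_before(events, outage)
for timestamp, source, message in recent:
    print(timestamp.isoformat(), source, message)
```

The point of the sketch is that a single timeline across sources only works if every system records timestamps consistently, which is one reason the monitoring itself has to be set up and maintained with care.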

The best analogy I can think of for this can be found in the space program. Even if you're not a space-exploration junkie, you've probably seen stunning video footage of some kind of space launch. These days, that sort of video is increasingly being used as a sales tool to garner public support for these programs, but that's not really the reason the video exists. It's there so that engineers have an absolute wealth of information to work with if something unexpected takes place.

The piece of foam that damaged the NASA space shuttle Columbia and eventually resulted in its destruction is a fantastic (and extremely unfortunate) example of a failure to collect and interpret enough information to assess a problem after it had occurred. Following that disaster, a multitude of additional cameras and in-space examinations were used on other shuttle launches to identify similar problems and prevent future loss of life.
