Less downtime, faster recovery: the new mantra for automated systems

From aviation to Wall Street, organizations need better plans and more resilient automated systems to minimize the impact of failures

If you were one of the millions of people across the globe traveling a couple of weeks ago (and even if you weren't), chances are that you saw firsthand the utter chaos created when United Airlines grounded almost 5,000 flights worldwide due to a glitch in its automated software systems.

The maelstrom was the result of a glitch in the airline's automated reservation system, which caused the FAA to issue what is known as a "ground stop" on all United flights. Passengers around the world experienced delayed departures and arrivals, waited for grounded flights, and unleashed a barrage of angry Twitter messages.

Automated software systems are designed to make things easier, and generally they do, but (very public) instances like this highlight an unavoidable reality: these complex systems can shut down because of as little as one error in millions of lines of computer code.

In the wake of "glitch Wednesday," which also saw the failure of the New York Stock Exchange and the Wall Street Journal's website, organizations need to think about their strategy for mitigating the risks associated with automation. Whether it's a reservations system for seats on thousands of commercial flights a day, the world's biggest financial market, or internal automation software for operations at an enterprise, companies need strategies to minimize the impact of failures like this and decrease recovery time.

Here are some ideas:

Worry constantly, so you don't have to worry. The best way to minimize the impact of failure? Prevent it from happening in the first place. One prevention strategy? Constant updates.

One of the issues that plagued United Airlines is that, like many large enterprises, it carries technical debt caused by aging technology infrastructure. Put more bluntly, many of its systems and technologies are old, according to James Record, a professor of aviation at Dowling College.

Rather than doing an overhaul and investing in entirely new systems (which can significantly delay the process), companies can instead redevelop existing software applications in small batches, over time. This extends the life of applications while reducing risks; by virtue of continuous improvement, technologists in the organization can identify issues early and often, reducing the need to worry about long-term problems.

Get flexible. Just as flexible muscles recover faster from an injury, flexible automated systems can significantly decrease recovery time after a failure. One of the reasons that United Airlines' problems were so widespread was that once the reservations system experienced problems, the built-in safeguards and back-ups brought the entire thing down. Although the system was down for only a little more than an hour, passengers felt the impact for several days.

The best way for companies to build flexibility into systems is to spend some time on the front end of implementation, asking the hard questions of vendors and in-house champions (including both advocates in the business and those on the IT team). By scenario planning at the point of implementation, companies can increase the flexibility of systems, and therefore the entire infrastructure.

Plan for the worst. If Sales is the Tigger of an organization and Finance is the nervous Rabbit, IT is Eeyore, the grumpy but sweet donkey of Winnie-the-Pooh's world. And for good reason: it's IT's job to anticipate technology disasters because it's generally IT's job to fix things when they go wrong.

That said, it's also true that successful disaster recovery requires cross-functional collaboration. Just as the management of an effective security policy needs buy-in and compliance from all facets of the business, preparing for and getting past a big technology failure requires action from all corners, including customer service, social media and communications, HR, and the C-suite. IT must not only have contingency plans, but also share those plans and integrate recovery processes across the whole business.

Automated systems are at their best when they are totally obvious and completely invisible. A reservation system should have a front-end that is totally user-friendly, and a back-end that is absolutely cordoned off from public view. When big glitches happen, that wiring gets exposed, and it's rarely pretty.

An ounce of prevention isn't a foolproof way to stop problems from happening, but it's better than the pound of headache tablets you'll be tempted to swallow to stop the pain that comes with preventable glitches.

Copyright © 2015 IDG Communications, Inc.
