When the data center goes down: Preparing for the big one

No matter what you do, your power will fail someday. Here are four steps you can take to prepare yourself

It's a calm, sunny weekend day without a cloud in the sky. The barbecue is lit and beers have been cracked open. Things couldn't get any better. But lurking somewhere in the power grid is a faulty component that's been hanging by a thread for weeks. And it has picked today to be its last.

Power to the data center is abruptly cut. Uninterruptible power supplies assume the load in anticipation of the backup generator startup -- a startup that never comes due to a tripped circuit breaker. A few minutes later, the data center plunges into darkness. Burgers and beers will have to wait. There's work to be done.


This is a scenario I've seen play out in strikingly similar ways about once every year. The first I can remember was in a colocation center in downtown Los Angeles near the height of the dot-com boom. The last one was only a few days ago, on the morning of July 4. In the first case, a sizable office building containing three subbasements' worth of data center gear was unceremoniously brought down, despite the presence of an enormous facilitywide battery-backup system, three mutually redundant backup generators -- each large enough to power a small town -- and path-diverse access to two separate commercial power grids.

The exact reasons behind the outages aside, it's clear that no matter how much capital you've invested in your data center infrastructure (or in a state-of-the-art colo), you're bound to lose power someday. However, very few of us actually take the trouble to test our systems' response to an unexpected power outage. In my experience, the larger the data center or organization, the less likely it is to have tested a power outage on purpose. Unfortunately, these same large organizations have the infrastructural complexity that almost guarantees continued trouble even after power is restored.

As anyone who regularly reads my blog knows, I'm a huge proponent of testing everything, both during planned downtime and in the midst of production. However, I'm not so naïve as to believe that everyone has the resources -- or the all-important managerial backing -- to do this kind of testing. Most of us simply have to prepare diligently, wait for the "big one" to hit, then deal with it the best we can. To that end, you can take a few steps ahead of a large-scale disaster -- whether or not it's power-related.

Step 1: Plan external access

Murphy's Law dictates that if you're going to have a large, data-center-wide disaster, it won't occur when you're ready and on site. In all but one instance I've seen, data center power outages occurred on a night, weekend, or holiday, when a full-strength engineering staff had to be called in from home to stand the site back up. Unless your entire staff lives two minutes from the data center, cutting that automatic RTO (recovery time objective) penalty out of the mix can be a huge benefit.

However, providing remote access to a data center that may be completely dark, and then actually making that access useful, isn't as easy as it might seem. It typically requires the implementation of a completely separate management network with its own significantly oversized battery-backed power (designed to provide runtime in hours rather than minutes) and its own dedicated Internet access. For more on that, check out one of my previous articles and another by my colleague Paul Venezia.

Step 2: Chart dependencies

The next most important measure to take is to build a dependency tree that includes all your major infrastructural components (and applications, if you can). This tree should show the order in which your systems should be returned to production. Most important, it will point out situations where you have a circular dependency.
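To make the idea concrete, here's a minimal sketch of such a dependency tree expressed as data, using Python's standard-library `graphlib` and entirely hypothetical service names. Each entry lists what must be running before that system comes up; a topological sort then yields a safe power-on order, and a cycle in the tree (the circular-dependency trap described above) surfaces as an explicit error instead of a 3 a.m. surprise.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical infrastructure map: each service names the services
# that must already be up before it can start.
dependencies = {
    "core-switch": set(),
    "san": {"core-switch"},
    "dns": {"core-switch"},
    "virtualization-hosts": {"san", "dns"},
    "database": {"virtualization-hosts"},
    "app-servers": {"database", "dns"},
}

try:
    # static_order() walks the graph from the leaves up, so the
    # resulting list is a valid return-to-production sequence.
    order = list(TopologicalSorter(dependencies).static_order())
    print("Bring up in this order:", " -> ".join(order))
except CycleError as err:
    # Raised when the tree loops back on itself -- for example, a DNS
    # server running as a VM on hosts that need DNS to boot.
    print("Circular dependency detected:", err.args[1])
```

Even if you never script it, working through your gear in this form forces the question that matters during an outage: what has to be powered on first, and does anything secretly depend on itself?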
