It's a calm, sunny weekend day without a cloud in the sky. The barbecue is lit and beers have been cracked open. Things couldn't get any better. But lurking somewhere in the power grid is a faulty component that's been hanging by a thread for weeks. And it has picked today to be its last.
Power to the data center is abruptly cut. Uninterruptible power supplies assume the load in anticipation of the backup generator startup -- a startup that never comes due to a tripped circuit breaker. A few minutes later, the data center plunges into darkness. Burgers and beers will have to wait. There's work to be done.
This is a scenario I've seen play out in strikingly similar ways about once every year. The first I can remember was in a colocation center in downtown Los Angeles near the height of the dot-com boom. The last one was only a few days ago, on the morning of July 4. In the first case, a sizable office building containing three subbasements' worth of data center gear was unceremoniously brought down, despite the presence of an enormous facilitywide battery-backup system, three mutually redundant backup generators -- each large enough to power a small town -- and path-diverse access to two separate commercial power grids.
The exact reasons behind the outages aside, it's clear that no matter how much capital you've invested in your data center infrastructure (or in a state-of-the-art colo), you're bound to lose power someday. However, very few of us actually go to the trouble of testing our systems' response to an unexpected power outage. In my experience, the larger the data center or organization, the less likely it is to have tested a power outage on purpose. Unfortunately, these same large organizations have the infrastructural complexity that almost guarantees continued trouble even after power is restored.
Circular dependencies are often the worst, most time-consuming issues to resolve in the aftermath of an outage. For example, in the data center blackout that took place last week, a critical infrastructure service depended on a database server that depended on the availability of the infrastructure service to be reached on the network. During normal operating conditions, that worked fine, but in a power-restoration scenario, that tangle had to be pulled apart and fixed -- costing an extra hour or so of downtime. Simply charting the dependencies ahead of time would have saved a lot of head-scratching and missed barbecue time.
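To make the idea of charting dependencies concrete, here is a minimal sketch of a check that could be run against a hand-maintained dependency map before an outage ever happens. The service names (`infra_service`, `database`, `web_app`) and the `find_cycle` helper are hypothetical, chosen only to mirror the tangle described above; a real inventory would come from your own documentation.

```python
from typing import Dict, List, Optional


def find_cycle(depends_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one circular dependency chain, or None if the map has none.

    depends_on maps each system to the systems it needs before it can start.
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {name: UNSEEN for name in depends_on}
    path: List[str] = []  # chain of systems currently being explored

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in depends_on.get(node, []):
            if state.get(dep, UNSEEN) == IN_PROGRESS:
                # dep is already on the current chain: we have looped back.
                return path[path.index(dep):] + [dep]
            if state.get(dep, UNSEEN) == UNSEEN:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for name in list(depends_on):
        if state[name] == UNSEEN:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


# Hypothetical map mirroring the blackout described above.
services = {
    "infra_service": ["database"],
    "database": ["infra_service"],
    "web_app": ["database"],
}
print(find_cycle(services))
```

Any chain the check returns is a loop that someone will otherwise have to untangle by hand while the data center is still dark.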
Step 3: Introduce power management hardware and software
Once you have a solid idea of the order in which your systems should be returned to production, next organize your data center equipment and software in such a way that it will automatically return to operation in the correct sequence. In some cases, this will involve implementing intelligent power distribution hardware that can power on individual outlets using predetermined delays between each step. In others, you may be scripting power-on sequences for virtual machines or simply organizing virtual machines into vApps (in VMware parlance) so that they'll start in a predetermined order.
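As a rough illustration of turning a dependency chart into a power-on sequence, the sketch below groups systems into "waves" that can be started together, using Python's standard `graphlib` module (Python 3.9 or later). The system names and the `boot_waves` helper are assumptions made for the example; the delays or triggers between waves would still have to be implemented in whatever power distribution hardware or virtualization tooling you use.

```python
from graphlib import TopologicalSorter
from typing import Dict, List


def boot_waves(depends_on: Dict[str, List[str]]) -> List[List[str]]:
    """Group systems into start-up waves.

    Every system in a wave depends only on systems in earlier waves.
    prepare() raises graphlib.CycleError if the map contains a loop.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical inventory: each system lists what must be up before it.
inventory = {
    "san": ["core_switch"],
    "hypervisor": ["san", "core_switch"],
    "database_vm": ["hypervisor"],
    "dns_vm": ["hypervisor"],
    "app_vm": ["database_vm", "dns_vm"],
}

for number, wave in enumerate(boot_waves(inventory), start=1):
    print(number, wave)
```

Each wave maps naturally onto one step of an outlet power-on delay or one start-order group of virtual machines.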
This work need not encompass the entire data center. Instead, you may decide to focus on the most basic systems and manually sort out the rest. The most common order-of-operations issue that I see involves the SAN and virtualization cluster. In most cases, virtualization hosts boot faster than enterprise-class SAN hardware -- especially large SANs. This can lead to a scenario where the virtualization hosts power up and fail to automatically restart virtual machines because storage isn't yet ready. Simply delaying the startup of those hosts until the SAN is open for business can mean the difference between a largely automatic restart and a lengthy process that requires a lot of manual intervention.
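One way to delay the hosts is a small gate script that polls the storage and only then triggers host power-on. The following is a sketch under several assumptions: the `tcp_probe` and `wait_until_ready` helpers are hypothetical, the SAN address is a placeholder, and a reachable iSCSI port (3260 is the standard default) is only a rough proxy for storage that is actually ready to serve volumes.

```python
import socket
import time
from typing import Callable


def tcp_probe(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    max_wait: float = 1800.0,
    interval: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll probe() until it succeeds or max_wait seconds have passed.

    The probe is always tried at least once. Returns True on success,
    False if the deadline passes first.
    """
    deadline = clock() + max_wait
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    # Placeholder address from the documentation range; substitute your own.
    san_ready = wait_until_ready(lambda: tcp_probe("192.0.2.10", 3260))
    if san_ready:
        print("SAN reachable: safe to power on virtualization hosts")
    else:
        print("SAN still unreachable: leave hosts off and page a human")
```

The power-on step itself is deliberately left out, since it depends entirely on the hardware in use (an intelligent power strip, a server management controller, or a person with a key). Giving up after a fixed wait rather than polling forever keeps a failed SAN from silently stalling the whole restart.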
Step 4: Learn from what you missed