Credit: Teerawut Punsorn
Unprecedented Arctic blasts cause all manner of mayhem, especially in areas where cold and snow are generally mythical. And mayhem means hard-learned lessons. Last week offered its fair share -- like what happens to rooftop cooling units that have the right glycol concentration for normal weather but lose their minds when the wind chill drops to -15 Fahrenheit. (Answer: the brutally ironic situation of a chiller frozen solid by the cold while the data center roasts, even though it's -15 outside.)
But that's a problem caused by Mother Nature. You can't prevent it, nor can you realistically forecast it. The fix (usually) is to raise the glycol concentration, but the extra glycol needs to be diluted back out in the spring to prevent problems when the weather warms up. In an installation that is eight years old and has never seen a problem like this, well, you have to take your lumps and deal with it as best you can. Mother Nature is not one to be trifled with.
Man-made problems, however, can be upsetting on another level altogether. Take that enforced, four-hour-long power shutdown notice you get for the whole building starting at 6 a.m. on a Sunday. Sure, generator backup would solve this, but as luck would have it, no generator capacity was available during the build, and there's no way to add it to the facility now. Instead, you have a monster UPS that can carry the room for nearly an hour but definitely not four.
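As a rough first-order sketch of why the math is so unforgiving: UPS battery runtime scales roughly inversely with load (real batteries actually do a bit better at light loads thanks to the Peukert effect, so this is conservative). The runtime figures below are hypothetical stand-ins for the scenario described:

```python
# First-order estimate of how much load must be shed to stretch
# UPS runtime from its current value to a target outage window.
# Model: runtime is inversely proportional to load (conservative;
# the Peukert effect means light loads run somewhat longer than this).

def required_load_fraction(current_runtime_min, target_runtime_min):
    """Fraction of the current load you can keep and still hit the target."""
    return current_runtime_min / target_runtime_min

# Hypothetical numbers: ~60 minutes of runtime today, a 240-minute outage.
keep = required_load_fraction(60, 240)
shed = 1 - keep
print(f"Keep at most {keep:.0%} of load; shed at least {shed:.0%}")
# → Keep at most 25% of load; shed at least 75%
```

In other words, to survive a four-hour outage on one hour of battery, roughly three-quarters of the room's draw has to go dark -- which is exactly why the pare-down exercise below matters.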
Of course, the one benefit this type of man-made disaster has over Mother Nature is that it's scheduled. Backhoe operators and high winds rarely forecast the exact time and date they're going to ruin your day, but electricians usually do.
Still, what to do in this instance? You don't want to take down the whole operation if you can help it. Having the entire room go dark will set a feast for gremlins, as storage arrays spin down for the first time ever and cooling systems that have been running nonstop for years go silent. Objects in motion tend to stay in motion, indeed. When those systems fire back up, it's a virtual guarantee that something will fail, and you're suddenly fighting multiple fires on a Sunday morning.
The best way to deal with this situation is to identify everything that can reasonably be powered down, and pare back the data center to the leanest it can be without going completely offline. Leave storage arrays running, but shut down as many physical servers as possible. I've written scripts that take in a list of VMs that can be stopped, shut them down gracefully, consolidate all the remaining VMs onto as few physical hosts as possible, and power down the rest. Every watt you can remove from the UPS load will give you more time on the clock, and that's the goal.
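The consolidation step can be sketched as a simple first-fit-decreasing bin-packing pass. Everything below is hypothetical -- the VM names, memory sizes, and host capacity are stand-ins, and a real script would drive the moves through the hypervisor's API (live migration, graceful guest shutdown) rather than juggling a dict:

```python
# Sketch: given a set of VMs, a list of VMs that may be stopped, and a
# per-host memory capacity, decide which VMs to shut down and how few
# hosts the survivors can be packed onto. All names/sizes are made up;
# a real version would query and drive the hypervisor.

def plan_consolidation(vms, stoppable, host_capacity_gb):
    """Return (vms_to_stop, host_placements) via first-fit decreasing."""
    vms_to_stop = [name for name in vms if name in stoppable]
    survivors = {n: gb for n, gb in vms.items() if n not in stoppable}

    # First-fit decreasing: place the biggest VMs first.
    hosts = []  # each entry: [free_gb, [vm names]]
    for name, gb in sorted(survivors.items(), key=lambda kv: -kv[1]):
        for host in hosts:
            if host[0] >= gb:
                host[0] -= gb
                host[1].append(name)
                break
        else:  # no existing host fits -- bring up (keep up) another one
            hosts.append([host_capacity_gb - gb, [name]])
    return vms_to_stop, [h[1] for h in hosts]

# Hypothetical inventory: memory footprint in GB per VM.
vms = {"build01": 16, "dev02": 8, "web01": 32, "db01": 64, "mon01": 8}
stoppable = {"build01", "dev02"}
stop, placements = plan_consolidation(vms, stoppable, host_capacity_gb=96)
print("Stop:", sorted(stop))          # → Stop: ['build01', 'dev02']
print("Hosts needed:", len(placements))  # → Hosts needed: 2
```

First-fit decreasing isn't optimal, but it's close enough for this job, and the point is the same as the article's: every host you can switch off is load removed from the UPS.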