Your data center is lying to you

No matter how much monitoring you do, false status readings can slip through -- you can't just take the data center's word for it

It was one of those days. Around nine in the morning, I suddenly had to contend with a significant IT disaster: a failed UPS in a medium-size data center. The loss of all three utility phases shifted the full load to the UPS, which held it for all of six seconds before quitting. Poof! The whole data center went down.

Power was restored less than 20 seconds later, but the damage was done. Due to a variety of issues, I was then responsible for getting that data center back on its feet from 250 miles away. Because most of the servers ran Linux, the next hour was full of rapid keystrokes, IM communications, and a gallon of coffee.


When a data center goes down and comes back up without physical intervention, it doesn't come up nicely. Storage arrays finish initializing after the servers that are trying to mount their shares, and some servers boot with no access to DNS servers that are themselves still booting and struggling with problems of their own -- it's a mess.

Luckily, there were no data corruption issues, and eventually all servers and services were returned to a normal operating state. The next day consisted of trying to figure out why a massive UPS handling a 44 percent load decided to quit after just a few seconds, but that's what postmortems are for.

The battery monitor lied
The problem is that before dropping the load, the APC Symmetra 40k UPS reported that everything was fine -- except for a pesky failed self-test. I'd noticed the self-test failure before the outage, but nothing else seemed wrong with the UPS, and the logs offered no reason for the failure.

All the monitored elements -- batteries, intelligence modules, power supplies, everything -- were green, and 100 percent battery capacity was showing on the management status page. I supposedly had 19 minutes of runtime at the current load. That 19 minutes turned into the aforementioned six seconds -- and my day was shot.
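The takeaway generalizes to any monitoring pipeline: a failed self-test should trump a page full of green gauges, because the capacity and runtime figures come from the same electronics that just failed to verify themselves. A minimal sketch of that alerting policy -- the field names and thresholds here are hypothetical, not APC's actual API:

```python
# Illustrative sketch: treat a failed UPS self-test as critical even
# when every other reported metric looks healthy. Field names and the
# 80 percent capacity threshold are assumptions for the example.

def assess_ups(status: dict) -> str:
    """Return 'CRITICAL', 'WARNING', or 'OK' for a UPS status snapshot."""
    # A failed self-test overrides green readings: the reported capacity
    # and runtime come from the same controller that failed to check itself.
    if status.get("last_self_test") == "failed":
        return "CRITICAL"
    if status.get("battery_capacity_pct", 0) < 80:
        return "WARNING"
    return "OK"

# The snapshot from that morning: everything green except the self-test.
snapshot = {
    "battery_capacity_pct": 100,
    "runtime_minutes": 19,
    "last_self_test": "failed",
}
print(assess_ups(snapshot))  # CRITICAL
```

Had the self-test result been weighted this way, the "19 minutes of runtime" figure would never have been trusted in the first place.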
