A few years ago, I worked with a relatively modern computer data center saddled with a very old air conditioning system -- a nightmare waiting to happen.
The data center was about 66 feet long by 20 feet wide and contained more than 300 physical servers, a couple of SANs, 10 hypervisor hosts, networking devices, and gateways to the Internet.
As mentioned, the data center was cooled by an out-of-the-ark air conditioning system. Despite repeated warnings and pleas to get an updated cooling system, the business managers refused. The reasons were many: monetary, political, and logistical.
It wasn't much of a surprise when the air conditioner broke down in the early hours of one hot summer morning.
I arrived at work to find a dozen or so tech managers and engineers running around with the data center doors and windows wide open and half a dozen fans spread across the floor of the room. The heat inside was stifling: The thermometer on one wall showed 100 degrees F and rising. Facilities management had been called in and air conditioning suppliers contacted.
Up until that point, all servers and devices had been humming; from a user's point of view, it had remained business as usual.
But then we had to power off all the devices to save them from heat damage. We triggered a power-cut signal from the UPS so that every server running a UPS agent would receive it and perform a clean operating system shutdown.
Out of the 300-plus servers, only 20 percent cleanly shut down. The rest were still running the operating system and responding on the network.
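The agent-side logic described above can be sketched roughly like this. This is a minimal, hypothetical sketch -- real UPS agents (apcupsd, NUT, and vendor tools) each have their own protocols and event names, so the message format and function names here are illustrative assumptions, and the default dry-run mode only reports what it would do:

```python
import subprocess

def handle_ups_message(message: str, dry_run: bool = True) -> str:
    """Decide what to do when the UPS management card broadcasts an event.

    Hypothetical event names; returns the action taken so callers can log it.
    """
    event = message.strip().upper()
    if event == "POWER_CUT":
        # A clean OS shutdown flushes caches and stops services in order,
        # unlike simply cutting power to the machine.
        if not dry_run:
            subprocess.run(["shutdown", "-h", "now"], check=True)
        return "shutdown"
    if event == "ON_BATTERY":
        # Keep running until the battery crosses a low-charge threshold.
        return "log-and-wait"
    return "ignore"
```

The 20 percent success rate in the story suggests why such an agent needs monitoring of its own: a signal path you never test is a signal path that fails when you need it.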
Half a dozen engineers were told to manually and immediately shut down the data center. Luckily, we had documented procedures for taking down the systems in the correct order to prevent data loss or corruption. This included the old first-generation SAN, which had more than 50 power on/off switches that needed to be flipped in a specific sequence -- a process that took about 15 to 20 minutes to complete correctly.
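A documented shutdown order like the one we followed can also be captured as a small runbook script, which makes the sequence testable before an emergency. This is a sketch under assumptions: the host names are invented, and the actual shutdown command (for example, an ssh call) is passed in by the caller, with a dry-run default that only prints:

```python
# Invented host names; ordered so that writers stop before their storage does.
SHUTDOWN_ORDER = [
    "app-server-01",      # applications first, so they close their files
    "db-server-01",       # then databases, so pending writes are flushed
    "hypervisor-01",      # then virtualization hosts and their guests
    "san-controller-01",  # storage last, once nothing is writing to it
]

def shutdown_all(hosts, run=None):
    """Shut hosts down strictly in the given order; stop at the first failure.

    `run` is the per-host action, e.g. lambda h: ssh(h, "shutdown -h now").
    The default is a dry run that only reports what it would do.
    """
    run = run or (lambda host: print(f"would shut down {host}"))
    completed = []
    for host in hosts:
        run(host)  # raises on failure, halting the sequence
        completed.append(host)
    return completed
```

Encoding the order in one place means the six engineers all execute the same sequence instead of six slightly different recollections of it.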
After a couple of hours of frantically shutting down operating systems and powering off servers, the data center was declared fully powered off. The temperature in the room was unbearable, with the heat of the day mixed in with the heat generated by the equipment.
It took more than 48 hours to fix the air conditioner because specific parts needed were no longer available and had to be sourced from a similar old decommissioned unit. That meant our systems were completely down for two and a half working days. Only the desktop computers and the phones were working -- no network.
As strange as it may sound, this very failure happened again before the organization invested in a secondary air conditioning unit -- not a modern one, because "the money and logistics did not allow it," but an identical old system that could be activated when the main one was down. Still, it was redundancy of a sort.
The moral: Do not build modern systems on top of old ones. Infrastructure, systems, and technologies need to move forward together and complement each other. And because business managers don't always understand this, don't cross the same bridge twice -- keep a tested disaster recovery plan updated and close at hand.
This story, "Stubborn suits cause data center chaos," was originally published at InfoWorld.com.