This story, from a few years ago, is about a relatively modern data center with a very old air conditioning system -- a nightmare waiting to happen.
The data center was about 66 feet long by 20 feet wide and contained more than 300 physical servers, a couple of SANs, 10 virtual hypervisors, networking devices, and gateways to the Internet.
As mentioned, the data center was cooled by an out-of-the-ark air conditioning system. Despite repeated warnings and pleas for an updated cooling system, the business managers refused to replace it. The reasons were many: monetary, political, and logistical.
It wasn't too much of a surprise when the air conditioner broke down one hot summer day in the early hours of the morning.
I arrived at work to find a dozen or so tech managers and engineers running around, the data center doors and windows wide open and half a dozen fans spread across the floor. As you entered, the heat was stifling: The thermometer on one wall showed 100 degrees F and rising. Facilities management had been called in and air conditioning suppliers contacted.
Up until that point, all servers and devices had been humming; from a user's point of view, it had remained business as usual.
But then we had to power off all the devices to save them from heat damage. We sent a signal from the UPS so that every server with a UPS agent installed would receive a power-cut notification and shut down its operating system cleanly.
Out of the 300-plus servers, only 20 percent shut down cleanly. The rest were still running their operating systems and responding on the network.
Half a dozen engineers were told to shut down the data center manually and immediately. Luckily, we had documented procedures for taking the systems down in the correct order to prevent data loss or corruption. This included the old first-generation SAN, whose more than 50 power switches had to be turned off in a specific order, a process that took about 15 to 20 minutes to complete correctly.