A $1 part brings down the data center

Amid a tour with senior programmers and top execs, the power goes out and it takes a troop of techs to bring it back


Back in the pitch-black computer room, we managed to find a flashlight and call our electrician (you have lots of those when you work for the electric company). The electrician arrived, and after 10 minutes of repeatedly asking us "what the hell did you guys do?" found that both 600-amp, 415-volt circuit breakers had tripped. To put this in perspective, these two circuit breakers could carry enough power to run over 60 average homes and should've been sufficient to juice our one largish room.

He reset both breakers, but the eagerly anticipated bang didn't arrive. We had to hit the Power On button on the console before the system lurched into action. We breathed a sigh of relief, which lasted until we opened the mainframe door again to see if anything was wrong. Everything looked fine -- then we closed the door and the lights went out.

We quickly agreed to keep our hands off the troublesome door and to call in an IBM tech. After resetting the circuit breakers, reactivating the power, and taping a large note that read "Don't Touch the Door," we managed to run normally for the rest of the day.

That evening, we were ready for the IBM technician when he arrived at 7 o'clock. He kept asking us what we'd messed up this time and couldn't quite believe that it all, er, hinged on a door. We asked him to close the door himself, but he insisted it couldn't be the problem. After nearly an hour, he took his head out of the mainframe and proudly told us there was nothing wrong and we must have done something.

He then closed the door. And the power went out.

He wasn't so smug after that. We got the power back on, and after a while longer, he figured out the problem.

As we understood it, the mainframe's air-cooling system was the culprit. It checked the air temperature by using a "temperature-sensitive" resistor. As the temperature went up, the resistance went down. If the resistance got too low, the machine shut itself down. The device was worth a dollar or two.

In our case, the resistor had come loose and shook when the door closed. Part of the insulation, of course on the wrong side of the resistor, had stripped back a bit, and when the resistor shook, one of its wires touched the metal frame, which caused a short circuit. Resistance went to zero. The mainframe thought the air temperature was at "a million degrees" and turned itself off, but did it so dramatically that the building power supply had a heart attack and dropped the circuit breakers.
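For the curious, here is a rough sketch of the logic as we understood it, not the mainframe's actual circuitry. It assumes the sensor behaved like a common negative-temperature-coefficient thermistor and uses the standard beta equation to turn resistance into temperature. Every number and function name below is made up for illustration. The first check mimics what bit us: low resistance means "too hot," so a dead short looks like an inferno. The second adds a plausibility floor, so a shorted sensor is reported as a fault rather than as heat.

```python
import math

# All constants are illustrative assumptions, not values from the real machine.
R0_OHMS = 10_000.0          # assumed sensor resistance at 25 degrees C
T0_KELVIN = 298.15          # 25 degrees C expressed in kelvin
BETA = 3950.0               # assumed beta constant of the thermistor
SHUTDOWN_C = 40.0           # assumed air temperature that forces a shutdown
MIN_PLAUSIBLE_OHMS = 100.0  # below this, assume a short rather than real heat


def resistance_to_celsius(r_ohms: float) -> float:
    """Beta equation for an NTC thermistor: 1/T = 1/T0 + ln(R/R0)/B."""
    inv_t = 1.0 / T0_KELVIN + math.log(r_ohms / R0_OHMS) / BETA
    return 1.0 / inv_t - 273.15


def shutdown_resistance() -> float:
    """Resistance at which the sensor reads exactly SHUTDOWN_C."""
    t_kelvin = SHUTDOWN_C + 273.15
    return R0_OHMS * math.exp(BETA * (1.0 / t_kelvin - 1.0 / T0_KELVIN))


def naive_should_shut_down(r_ohms: float) -> bool:
    """The behavior in the story: resistance too low means shut down."""
    return r_ohms <= shutdown_resistance()


def classify(r_ohms: float) -> str:
    """Same check, but a near-zero reading is treated as a broken sensor."""
    if r_ohms < MIN_PLAUSIBLE_OHMS:
        return "sensor_fault"
    if resistance_to_celsius(r_ohms) >= SHUTDOWN_C:
        return "overheat"
    return "ok"


if __name__ == "__main__":
    for ohms in (10_000.0, 5_000.0, 0.0):
        print(ohms, naive_should_shut_down(ohms), classify(ohms))
```

With these assumed values, a shorted sensor (0 ohms) trips the naive check exactly as a genuinely hot room would, while the second version flags it as a sensor problem. Whether the real machine could have told the two apart is anyone's guess.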

This tiny artifact had left everyone standing in the dark and asking if they were to blame. For us lowly operators, it was highly gratifying to see the Programmer God, the IBM tech, and even the electrician embarrassed and wondering what they had done -- and us not at fault.

Murphy was right: Things will go wrong at the worst possible time, such as during a demonstration to important people. But my favorite lesson is that a computer system that works is one where you haven't found the bug -- yet.

Send your own crazy-but-true tale of managing IT, personal bloopers, supporting users, or dealing with bureaucratic nonsense to offtherecord@infoworld.com. If we publish it, you'll receive a $50 American Express gift cheque.

This story, "A $1 part brings down the data center," was originally published at InfoWorld.com. Read more crazy-but-true stories in the anonymous Off the Record blog at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.
