IT can be hilarious.
Every once in a while, a bizarre set of circumstances will produce a situation that you might not believe unless you were there to see it. The stories of these events are the stuff of IT folklore and get retold over beer and peanuts year after year. But just because they're funny -- in hindsight -- doesn't mean they lack good object lessons. In a recent meeting with a client, I was reminded of two stories I'll never forget.
The doomed cabinet
Back when data centers were still dominated by mainframes and Intel-based servers were just starting to make inroads, a client was faced with a data center floor space problem. New systems were being added at an alarming rate, and a reorganization was needed to make room. This particular move required an oversized enclosed cabinet to be removed and its contents -- primarily networking equipment -- to be shifted to a smaller relay rack in a different part of the room.
Most of the equipment could be transported easily. The one real challenge was a single network switch, a Cisco Catalyst 2980, that essentially ran the entire data center. Throughout the 1,500-seat organization, any app that wasn't delivered through a green screen depended on this switch at one point or another to reach server resources.
One lesson is that sometimes you just need downtime. Announced downtime is always better than unexpected downtime: users can prepare themselves for the outage and plan around applications being unavailable. Sure, it might dent productivity for a while, but that's far better than an outage arriving without any warning.
I mean, imagine if the saw man hadn't had such steady hands. I shudder to think what would have happened if the cables had been hacked through by mistake or the switch had been dropped. Sourcing a replacement switch and reterminating 70-odd cables would have taken a little bit more than 30 minutes.
Another lesson is that investing in redundant infrastructure components -- in this case, a secondary core switch -- often buys you more than failover capacity when a component dies unexpectedly. It also buys you the flexibility to handle operational changes that would otherwise require downtime. That's one reason server virtualization and associated features like VMware's vMotion and Storage vMotion have become so popular: taking a virtualization host out of service for upgrades or shuffling data onto a new SAN volume simply doesn't require downtime anymore. That value can be a hard sell to management, which doesn't see it in use every day, but it's almost always money well spent.
The soggy switch
Years later, the same client had wisely taken these lessons to heart and invested in dual-redundant core switching. Yet that redundancy can't save you from everything -- especially not an HVAC technician with a blowtorch.
Not long after that, the fire department gave the all-clear. The IT folks entered the room, where the damage was plain to see. Everything was soaked -- cables, patch panels, you name it. One member of the IT team was about to call Cisco for a replacement switch, but another staffer suddenly remembered that the exact same model switch was sitting in a box on the loading dock. It had been purchased to replace a different switch elsewhere on campus and simply hadn't been deployed yet.
Fortunately, the network admin had insisted that every patch cable be labeled with the port it was supposed to attach to. Pulling the old cabling, shoving the new switch into the rack, recabling, and restoring a saved switch configuration would take only a half-hour or so.
While one member of the team ran down to retrieve the new switch, another started yanking cabling out of the 200-odd network ports. When he was about halfway through, the unthinkable happened: The switch turned back on.
Nobody had remembered to shut the power off. As it turned out, the switch had gone offline not because it had shorted out, but because the water had slowed the exhaust fans down to the point where the switch shut itself off to prevent heat damage.
After the switch finished its boot cycle, the network actually came back up for the users whose cables hadn't yet been unplugged. After recovering from the shock, the team decided to press ahead with the replacement anyway: just because the switch was up now didn't mean it would stay that way.