The unconscious admin

Just minutes out of general anesthesia, I somehow found myself rebuilding a core router configuration after a hardware failure

Sometimes, the phrase "I could do it in my sleep" is wholly accurate. A few weeks ago, I embodied that notion, though I have only a very fuzzy recollection of the event.

I had surgery on the Monday before Christmas. (Note: Do everything you can to avoid screwing up your back; it's no picnic.) The morning of the surgery, there was a chiller failure in a sizable datacenter, and by 5:45 a.m., rack inlet temperatures were nearing 80 degrees Fahrenheit. My cell phone had been going nuts with warning SMS messages, which led to a flurry of calls, emergency procedures, and whatnot. By 8 a.m., the problem had been fixed and the datacenter was once again running normally -- crisis averted, or so it seemed.

Unfortunately, this climate-control failure eventually led to the failure of a Cisco 3640 core MPLS router, roughly 15 minutes before I was brought out of general anesthesia. When I woke up, I recall asking for my phone and seeing the warning messages. The rest of this tale was relayed to me by others, since I barely remember anything about it.

I was in the recovery room, still hooked up to an IV drip. Luckily, the hospital had public Wi-Fi that extended to the recovery rooms, and a few minutes later I had my MacBook Air in my lap. I sent a few IMs to the admins on site and had them pull the emergency 3620 from the shelf, swap in the T1 controllers from the failed 3640, and boot it with a console cable connected to one of the Linux servers. Somehow, I managed to completely reconfigure the router based on configuration backups made nightly with Pancho, including some relatively involved BGP and OSPF configurations. About 10 minutes later all was well and the WAN was live.
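The mechanics of that rescue are worth sketching. Assuming a serial console attached to the Linux server and a nightly Pancho dump on disk (the device path, baud rate, hostnames, and prompts below are illustrative, not the actual values from that night), the process looks roughly like this:

```
# On the on-site Linux server: attach to the spare 3620's console port
# (/dev/ttyS0 at 9600 baud is a typical default, not a confirmed value)
screen /dev/ttyS0 9600

! On the router: paste the nightly backup into the running configuration,
! then commit it to NVRAM
Router> enable
Router# configure terminal
Router(config)# ... paste the Pancho backup here, including the BGP/OSPF stanzas ...
Router(config)# end
Router# copy running-config startup-config
```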

Apparently, during this whole process I had yet to regain full consciousness, so I was verbalizing my thoughts without a filter, and those thoughts included more than a few expletives -- which explains the dirty looks I got from a few nurses.

The original 3640 had failed completely -- it wouldn't boot or even power up. The replacement arrived a few days later and was swapped in for the spare router without incident. I later reconfigured the 3620 with a point-in-time emergency configuration to speed up the process should another failure occur. (That config would normally have been in place already, but the router had been repurposed shortly before this event and thus didn't have a viable configuration stored.)

Aside from my state of mind at the time, this was a great example of self-reliance in backing up key infrastructure components. The first piece was the availability of the backup configurations. Tools like Pancho and Rancid can save the day in situations like this. Needless to say, without those configurations this rescue would have taken a lot longer.
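For anyone who hasn't set up automated config backups yet, a minimal Rancid deployment is only a few lines. The group directory, device names, and install paths below are examples and vary by version and distribution:

```
# /var/rancid/networking/router.db -- one managed device per line
# format: name:vendor:state (older Rancid releases use ':' as the separator)
core-3640:cisco:up
edge-3620:cisco:up

# cron entry to pull configs nightly and commit any diffs to revision control
# (the rancid-run path differs between packages)
0 2 * * * /usr/libexec/rancid/rancid-run
```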

Also, regardless of any service contracts and backup configurations, having a suitable spare router on-site made all the difference. These routers can be had on eBay for very little money, and in a pinch they can take over just about any task, as this one did.

If you don't have a spare router or two kicking around, I suggest making plans to procure one. You might want to pick up a few relevant interfaces, too. Just be sure to prepare it with configurations for your various production routers and store them in flash. That way, if you ever find yourself fighting through an anesthetic fog to reconfigure a router, all you have to do is copy flash:some.cfg start and reload it.
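Staged that way, the emergency procedure comes down to two commands at the console. A rough sketch, with hypothetical hostnames, server name, and filenames:

```
! Ahead of time: stage a known-good config for each production router
! in the spare's flash (the TFTP server and filename are examples)
spare# copy tftp://backup-server/core-mpls.cfg flash:core-mpls.cfg

! During a failure: load the staged config and reboot into it
spare# copy flash:core-mpls.cfg startup-config
spare# reload
```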

But of course, the best plan would be to make sure that you schedule surgery on a day when there aren't any core infrastructure hardware failures.