I've been on a troubleshooting kick the past few weeks, both in my blog and in the real world. Simply put, I've been inundated with a plague of IT problems.
Well, these things happen. No technology or process in the world can eliminate all future outages, defective code, or random human foolishness, but you can hedge your bets. Sure, you can spend beaucoup bucks on a fully redundant infrastructure, but short of that budget-busting scenario, a few small steps can greatly simplify recovery from all sorts of problems.
Bulletproof your infrastructure tip No. 1: Keep cold spares of everything
Ideally, you've already standardized on network and server components. Sure, there may be a few odd parts here and there, but your closet switches should all be the same brand, if not the same model. Your servers should be homogeneous, or at least homogeneous within each role (such as HP ProLiant DL360s for one major infrastructure component and Dell PowerEdge R415s for another). These servers aren't that expensive, especially if they're purchased in their minimum configuration. In a pinch, you can replace a failed server with the cold spare, moving the functional parts over to the spare in a jiffy. In some cases, you'll even be able to simply swap the disks and have the new box up in no time.
For routers and switches, the same is true. With tools like RANCID to automatically download and archive switch and router configurations, you can dump the configuration of a failed router or switch to the cold spare and save the day. Firewalls work the same way. In many cases, you can even pull your cold spares from eBay auctions and get them cheap: You don't care about support on these units, so you can forgo that expense and still cover your needs. Even if you're running Cisco ASAs, you can probably find an end-of-life Cisco PIX with a similar configuration for a few hundred dollars that can at least bring critical services back up if you experience a failure.
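To make the "dump the configuration to the cold spare" step concrete, here is a minimal sketch of pulling the newest archived copy of a failed device's configuration out of the archive and staging it for the spare. It assumes the archive keeps one file per device at <archive_root>/<group>/configs/<hostname>; check your own RANCID setup before relying on that layout. The function names, the staging directory, and the .cfg suffix are all hypothetical, and getting the staged file onto the spare (console paste, TFTP, and so on) is left to you.

```python
from pathlib import Path
from typing import Optional


def find_archived_config(archive_root: Path, hostname: str) -> Optional[Path]:
    """Return the most recently modified archived config for hostname, or None."""
    # Assumed layout (verify against your own archive):
    #   <archive_root>/<group>/configs/<hostname>
    candidates = [
        group / "configs" / hostname
        for group in archive_root.iterdir()
        if (group / "configs" / hostname).is_file()
    ]
    if not candidates:
        return None
    # If the device appears in more than one group, prefer the newest copy.
    return max(candidates, key=lambda path: path.stat().st_mtime)


def stage_for_spare(archive_root: Path, hostname: str, staging_dir: Path) -> Path:
    """Copy the archived config into staging_dir so it can be loaded on the spare."""
    source = find_archived_config(archive_root, hostname)
    if source is None:
        raise FileNotFoundError(f"no archived config found for {hostname}")
    staging_dir.mkdir(parents=True, exist_ok=True)
    target = staging_dir / f"{hostname}.cfg"
    target.write_text(source.read_text())
    return target
```

Failing loudly when no archived copy exists is deliberate: you want to discover a device that was never being archived long before the day it dies.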
Naturally, you don't want to buy cold spares of big-ticket items like core switches, but if you do a little legwork, you can cover the rest without putting a major dent in your budget.
Bulletproof your infrastructure tip No. 2: Go wiki, baby
What was the serial number of that remote-office switch anyway? What version of IOS was that router running before the power supply blew? I find that a wiki is the easiest place to collect this data so that it's easy to find later. Toss CentOS on a virtual machine, install MediaWiki, and start compiling data on your infrastructure. I paste the output of sh ver (short for show version) on a Cisco device straight into a wiki page and write up a synopsis of each switch's function and responsibilities; in the event that something goes awry, I can quickly dig up those ever-so-necessary bits of information that can turn a three-hour recovery into 30 minutes.
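If pasting by hand gets old, the same idea can be roughed out in a few lines of Python: pull the hostname, IOS version, and serial number out of saved sh ver output and wrap the raw text for a wiki page. This is only a sketch. The sample output is made up and heavily trimmed, the regular expressions assume the "Version ...," and "Processor board ID ..." lines that IOS devices typically print, and the function names are hypothetical. Output differs by platform and release, so treat any field that comes back as unknown as a cue to read the raw text.

```python
import re
from typing import Dict, Optional

# Illustrative sample only: made-up, trimmed text in the general shape of
# "sh ver" output. Real output varies by platform and software release.
SAMPLE_SH_VER = """\
Cisco IOS Software, C2960 Software (C2960-LANBASEK9-M), Version 12.2(55)SE, RELEASE SOFTWARE (fc1)
sw-remote-01 uptime is 41 weeks, 2 days, 3 hours, 12 minutes
Processor board ID FOC0000X0AA
"""


def _first_match(pattern: str, text: str) -> Optional[str]:
    """Return the first capture group of the first match, or None."""
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1) if match else None


def summarize_sh_ver(text: str) -> Dict[str, Optional[str]]:
    """Extract the handful of facts you will want in a hurry."""
    return {
        "hostname": _first_match(r"^(\S+) uptime is", text),
        "ios_version": _first_match(r"Version ([^,\s]+)", text),
        "serial": _first_match(r"^Processor board ID (\S+)", text),
    }


def to_wiki_page(text: str) -> str:
    """Build wiki text: a bulleted summary, then the raw output in <pre> tags."""
    facts = summarize_sh_ver(text)
    summary = [f"* {key}: {value or 'unknown'}" for key, value in facts.items()]
    return "\n".join(summary) + "\n\n<pre>\n" + text.rstrip("\n") + "\n</pre>\n"


if __name__ == "__main__":
    print(to_wiki_page(SAMPLE_SH_VER))
```

Keeping the raw output under the summary matters: the summary is for speed, and the raw text is there for whatever the regular expressions missed.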
I don't go so far as to put passwords in wiki documents, but anything short of that is fair game: lists of serial console server ports and what they're connected to; switchport assignments and VLAN blocks for DMZ and public switches; and each server's brand, model, serial number, role, storage, and RAM configuration. If it exists in your infrastructure, it should have an entry in the wiki.
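For the per-server entries, one way to keep the pages consistent is to hold each machine as a small record and generate the wiki table from it. The sketch below assumes MediaWiki's standard table markup; the ServerRecord class, its field names, and the example values are hypothetical stand-ins for whatever you actually track. Note that there is no password field, on purpose.

```python
from dataclasses import dataclass, fields
from typing import Iterable


@dataclass
class ServerRecord:
    # Hypothetical fields mirroring the attributes listed above.
    name: str
    brand: str
    model: str
    serial: str
    role: str
    storage: str
    ram: str


def to_wikitable(records: Iterable[ServerRecord]) -> str:
    """Render the records as a MediaWiki table, one row per server."""
    headers = [field.name for field in fields(ServerRecord)]
    lines = ['{| class="wikitable"', "! " + " !! ".join(headers)]
    for record in records:
        lines.append("|-")  # row separator
        lines.append("| " + " || ".join(getattr(record, h) for h in headers))
    lines.append("|}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up example values, not a real machine.
    example = ServerRecord(
        name="web-01",
        brand="HP",
        model="ProLiant DL360",
        serial="EXAMPLE123",
        role="web front end",
        storage="2 x 146 GB",
        ram="16 GB",
    )
    print(to_wikitable([example]))
```

Paste the output into a wiki page and you get a sortable-by-eye inventory that's cheap to regenerate whenever hardware changes.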
Starting this project from scratch is a real pain, but maintaining the information on an ongoing basis is easy. The next time you have an immediate need to know the serial number of a failed remote switch, you'll have it right at your fingertips.