Stop being your own worst enemy

System downtime is frequently due to operator error. Document your work as simply as possible to avoid these self-inflicted wounds.

I've spent a great deal of the past 15 years doing two things: building new server, storage, and network infrastructures -- and fixing them when they fall flat on their faces for one reason or another. Over that time, I've seen one common theme emerge: There are very easy, seemingly unimportant things you can do when you build and maintain infrastructure that will save your bacon later when things go pear-shaped.

Most of them involve writing stuff down -- you know, documentation.

I realize if there's one word that can cause a room full of IT folks to roll their eyes, it's "documentation." Usually, when you tell IT pros they need to document what they're doing, the thought of writing a book-length, screenshot-laden tome that would allow a monkey to manage a complex system comes to mind. Nothing could be more horrifying.

We don't need no stinking documentation
But guess what? As it turns out, a monster reference manual is the least useful type of documentation to have in an emergency, because it's so verbose you can never find what you're looking for. Plus, it's a pain to update, people won't do it, and it will quickly become inaccurate. If there's one thing that's worse than having no documentation at all, it's having inaccurate documentation.

Instead, when you're building a new system or making changes to an existing one, note all of the settings you had to change from their defaults, IP addresses you used, which NICs are attached to which switch ports, and that sort of thing. Essentially you're looking for very clean, tabular documentation that contains the CliffsNotes version of what you've done -- something that would allow you to reconstruct the same system were you to have amnesia or were a data center fairy to steal all of your cabling. Spreadsheets are a great way to do this and have the benefit of being very portable.
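As a rough illustration of what that tabular record might look like, here is a minimal Python sketch that writes one row per NIC-to-switch-port connection to a CSV file any spreadsheet can open. The column names, host names, addresses, and settings are all hypothetical placeholders, not a recommended schema.

```python
import csv
import sys

# Hypothetical column set: one row per NIC-to-switch-port connection.
FIELDS = ["host", "nic", "ip_address", "switch", "switch_port", "non_default_settings"]

# Made-up sample data for illustration only.
SAMPLE_ROWS = [
    {"host": "esx01", "nic": "vmnic0", "ip_address": "10.0.10.11",
     "switch": "sw-core-a", "switch_port": "Gi1/0/1",
     "non_default_settings": "mtu=9000"},
    {"host": "esx01", "nic": "vmnic1", "ip_address": "10.0.20.11",
     "switch": "sw-core-b", "switch_port": "Gi1/0/1",
     "non_default_settings": "mtu=9000; flow_control=off"},
]


def write_inventory(rows, stream):
    """Write the inventory rows to an open text stream as CSV with a header."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Print to stdout; redirect to a file such as inventory.csv to keep it.
    write_inventory(SAMPLE_ROWS, sys.stdout)
```

A plain CSV keeps the record portable: it opens in any spreadsheet program and still reads fine in a text editor on a borrowed laptop.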

If you run a VMware vSphere environment, there are a bunch of scripts built for the vSphere Management Assistant (vMA) that can capture the lion's share of the configuration for your environment and dump it out in an easily readable HTML document. For SANs, your mileage will definitely vary, but you can always ask your SAN vendor. Very likely, there's an internal tool that can capture the details of how your device is configured -- and the vendor will probably be happy to share it with you.

Keeping the docs safe
Once you've amassed this collection of useful quick-reference documentation, you need to protect it. I've been involved in more than one disaster where a huge store of very helpful documentation and raw config data was stored on a system that directly depended upon the very things it was documenting. Putting your docs in a file share on a virtual machine stored on your SAN may be convenient, but it's also a good way to block your access if there's a serious infrastructure failure.

One easy way to solve that problem is to jam a USB flash drive in a workstation or server with good physical security and configure a scheduled task to mirror your documentation onto it on a regular basis. That way, you always have an up-to-date copy of your docs that you can fire up on a laptop even if the facility power is out.
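The mirroring step could be handled by a small script that the scheduled task (or a cron job) runs. The following is a minimal sketch, assuming Python is available on the machine holding the flash drive; the function name and the example paths in the usage note are hypothetical. It copies only new or changed files and, as a deliberate simplification, never deletes anything from the destination.

```python
import shutil
import sys
from pathlib import Path


def mirror(src: Path, dst: Path) -> list[Path]:
    """Copy new or changed files from src to dst; return the relative paths copied."""
    copied = []
    for src_file in sorted(src.rglob("*")):
        if not src_file.is_file():
            continue
        rel = src_file.relative_to(src)
        dst_file = dst / rel
        if dst_file.exists():
            s, d = src_file.stat(), dst_file.stat()
            # Treat the file as unchanged when the size matches and the
            # modification times agree within 2 seconds (FAT-formatted
            # flash drives store timestamps at 2-second resolution).
            if s.st_size == d.st_size and abs(s.st_mtime - d.st_mtime) <= 2:
                continue
        dst_file.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_file, dst_file)  # copy2 also preserves the timestamp
        copied.append(rel)
    return copied


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python mirror_docs.py <docs folder> <flash drive folder>
    for rel in mirror(Path(sys.argv[1]), Path(sys.argv[2])):
        print(f"copied {rel}")
```

For example, a nightly task might run it with a source such as `D:\docs` and a destination such as `E:\docs-mirror`. Both paths are placeholders for wherever your documentation and flash drive actually live.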

Label everything
And I mean everything. Label your servers, front and back. Label cables. Label racks. Label items to the point where you could describe to your mom over the phone which power button to press or which cable to remove. Does this take a lot of time? Yes, it absolutely does. However, it's a lot more convenient than explaining why the network went down because someone unplugged the wrong unlabeled cable or tracing every cable to where it came from if you have to replace a switch in a hurry.

Labeling doesn't just extend to hardware, either. Everything from network switches to virtualization environments will give you the ability to provide human-readable names for various objects. Make sure those names are populated and accurate any time you make changes. Whether it's confirming that a virtual machine's name matches that VM's hostname or that the storage LUN tags you can define in vSphere match the names you've given the volumes on your SAN, just do it. It can save you from committing potentially career-limiting "Please tell me I didn't just do that" mistakes down the road. Trust me -- I've seen it happen more than once.
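A name check like this is easy to automate once you have the names in hand. The sketch below assumes you have already exported each VM's display name and guest hostname by some other means (the export itself is not shown); the record keys and sample values are hypothetical. It compares the two, ignoring case and any domain suffix on the hostname.

```python
# Made-up sample records; in practice these would come from your own export.
SAMPLE_VMS = [
    {"display_name": "web01", "guest_hostname": "web01.example.com"},
    {"display_name": "DB01", "guest_hostname": "db02.example.com"},
    {"display_name": "mail01", "guest_hostname": "MAIL01"},
]


def find_name_mismatches(vms):
    """Return (display_name, guest_hostname) pairs whose names disagree."""
    mismatches = []
    for vm in vms:
        display = vm["display_name"].strip().lower()
        # Compare against the short hostname only, dropping any domain suffix.
        short_host = vm["guest_hostname"].strip().lower().split(".")[0]
        if display != short_host:
            mismatches.append((vm["display_name"], vm["guest_hostname"]))
    return mismatches


if __name__ == "__main__":
    for display, host in find_name_mismatches(SAMPLE_VMS):
        print(f"MISMATCH: VM '{display}' reports hostname '{host}'")
```

The same comparison would work for LUN tags and SAN volume names: two lists of names that are supposed to agree, and a report of the ones that don't.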
