I've spent a great deal of the past 15 years doing two things: building new server, storage, and network infrastructures -- and fixing them when they fall flat on their faces for one reason or another. Over that time, I've seen one common theme emerge: There are very easy, seemingly unimportant things you can do when you build and maintain infrastructure that will save your bacon later when things go pear-shaped.
Most of them involve writing stuff down -- you know, documentation.
I realize that if there's one word that can cause a room full of IT folks to roll their eyes, it's "documentation." Usually, when you tell IT pros they need to document what they're doing, they picture writing a book-length, screenshot-laden tome that would allow a monkey to manage a complex system. Nothing could be more horrifying.
We don't need no stinking documentation
But guess what? As it turns out, a monster reference manual is the least useful type of documentation to have in an emergency, because it's so verbose you can never find what you're looking for. Plus, it's a pain to update, so people won't do it, and it will quickly become inaccurate. If there's one thing worse than having no documentation at all, it's having inaccurate documentation.
If you run a VMware vSphere environment, there are a bunch of scripts built for the vSphere Management Assistant (vMA) that can capture the lion's share of the configuration for your environment and dump it out in an easily readable HTML document. For SANs, your mileage will definitely vary, but you can always ask your SAN vendor. Very likely, there's an internal tool that can capture the details of how your device is configured -- and the vendor will probably be happy to share it with you.
Keeping the docs safe
Once you've amassed this collection of useful quick-reference documentation, you need to protect it. I've been involved in more than one disaster where a huge store of very helpful documentation and raw config data was stored on a system that directly depended upon the very things it was documenting. Putting your docs in a file share on a virtual machine stored on your SAN may be convenient, but it's also a good way to lose access to them when there's a serious infrastructure failure.
One easy way to solve that problem is to jam a USB flash drive in a workstation or server with good physical security and configure a scheduled task to mirror your documentation onto it on a regular basis. That way, you always have an up-to-date copy of your docs that you can fire up on a laptop even if the facility power is out.
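Here's a minimal sketch of what that scheduled mirror job might look like in Python, using only the standard library. Treat it as an illustration under assumptions rather than a recommended tool: the function name `mirror_docs` is made up for this example, and the two-second timestamp tolerance is a guess meant to cope with the coarse timestamps on FAT-formatted flash drives.

```python
import shutil
from pathlib import Path


def mirror_docs(source: Path, target: Path) -> list[Path]:
    """Make target a mirror of source; return the files that were copied."""
    copied = []
    target.mkdir(parents=True, exist_ok=True)

    # Copy files that are new, or whose size or modification time differs.
    for src in sorted(source.rglob("*")):
        dst = target / src.relative_to(source)
        if src.is_dir():
            dst.mkdir(parents=True, exist_ok=True)
            continue
        if dst.is_file():
            src_stat, dst_stat = src.stat(), dst.stat()
            same_size = src_stat.st_size == dst_stat.st_size
            # Assumed tolerance: FAT file systems store times in 2-second steps.
            not_newer = src_stat.st_mtime - dst_stat.st_mtime <= 2
            if same_size and not_newer:
                continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copy2 also carries over the timestamps
        copied.append(dst)

    # Remove anything on the target that no longer exists in the source.
    # Reverse order visits files before the directories that contain them.
    for dst in sorted(target.rglob("*"), reverse=True):
        if not (source / dst.relative_to(target)).exists():
            if dst.is_dir():
                shutil.rmtree(dst)
            else:
                dst.unlink()
    return copied
```

A Windows scheduled task or a cron job could run a small wrapper that calls something like `mirror_docs(Path("D:/docs"), Path("E:/docs-mirror"))`; both paths are placeholders. Note that this is a true mirror: files deleted from the source are also deleted from the flash drive on the next run, so it is not a substitute for a backup with history.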
Label everything, and I mean everything. Label your servers, front and back. Label cables. Label racks. Label items to the point where you could describe to your mom over the phone which power button to press or which cable to remove. Does this take a lot of time? Yes, it absolutely does. However, it's a lot more convenient than explaining why the network went down because someone unplugged the wrong unlabeled cable, or tracing every cable back to its source when you have to replace a switch in a hurry.
Labeling doesn't extend just to hardware, either. Everything from network switches to virtualization environments will give you the ability to provide human-readable names for various objects. Make sure those names are populated and accurate any time you make changes. Whether it's confirming that a virtual machine's name matches that VM's hostname or that the storage LUN tags you can define in vSphere match the names you've given the volumes on your SAN, just do it. It can save you from committing potentially career-limiting "Please tell me I didn't just do that" mistakes down the road. Trust me -- I've seen it happen more than once.
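The name-matching check described above is easy to automate once you have the two sets of names side by side. The sketch below assumes you've already exported pairs of names (the label in your inventory and the name the object reports for itself) by whatever means your platform offers; the function name `find_name_mismatches` and the sample data are hypothetical, and no vendor API is called.

```python
def find_name_mismatches(inventory):
    """Return (label, actual) pairs whose short names disagree.

    The comparison ignores case, surrounding whitespace, and any domain
    suffix, so "DB01" and "db01.example.com" count as a match.
    """
    mismatches = []
    for label, actual in inventory:
        short_label = label.strip().lower().split(".")[0]
        short_actual = actual.strip().lower().split(".")[0]
        if short_label != short_actual:
            mismatches.append((label, actual))
    return mismatches


# Hypothetical sample: (VM name in the inventory, hostname reported by the guest).
vms = [
    ("web01", "web01.example.com"),
    ("db01", "DB01"),
    ("app02", "app03.example.com"),
]

for label, actual in find_name_mismatches(vms):
    print(f"Label '{label}' does not match hostname '{actual}'")
```

The same comparison works for LUN tags against SAN volume names, as long as you adjust the normalization to whatever naming convention you actually use.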