Getting out of jail free

Innocuous details can bring a network to its knees, so expect the worst and prepare for widespread outages before they happen

It's 1:30 in the morning. By some miracle, you were able to get approval for a four-hour downtime window to complete a long list of overdue patching and network maintenance. Even better, you're done a half-hour early. Life is good!

As you're about to email the third shift to let them know they can get back in ahead of schedule, you remember that one setting you always knew was wrong and meant to fix. Changing it shouldn't cause any service disruption, you tell yourself, and you've never gotten around to correcting it. Little do you know that "fix" is going to be your undoing.

It doesn't matter what it is. For me, it's been an incorrectly set spanning tree bridge priority or UPS software configured with an inadequate shutdown delay. Either way, half a second after hitting Enter or clicking Apply, your terminal freezes, pings go unanswered, and panic sets in: You've brought down the entire network, and you have no idea why. You thought you finished 30 minutes ahead of schedule, but now that half-hour may not be enough time to run around with a laptop and console cable to figure out what happened, much less fix it.

You can't avoid situations like this all the time. Bad things happen when you least expect them -- the old adage "If it ain't broke, don't fix it" applies to IT as much as it does to any other field. Nonetheless, you can build safeguards into your network that will drastically reduce the time it takes to fix problems when they arise.

Out-of-band management
Look around a modern data center and you'll find an overlooked resource in abundance: out-of-band management ports. These days, you can hardly buy any enterprise network, storage, or server hardware that lacks some kind of out-of-band management capability. These ports generally attach to dedicated processors or isolated IP stacks that remain available even if the device they're attached to falls on its face due to misconfiguration or a hardware problem -- providing an easy way to get into the system and determine if it's the cause of an outage or just a victim.
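
As a rough illustration of that cause-or-victim question, here is a minimal Python sketch that probes a device on both its production (in-band) address and its out-of-band management address, then suggests where to look next. The addresses, ports, and function names are all hypothetical placeholders, and the sketch assumes your workstation can reach the management network over a path that does not depend on the production network.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def triage(in_band_up: bool, oob_up: bool) -> str:
    """Suggest a next step from the two reachability results."""
    if in_band_up:
        return "in-band reachable: this device is probably not the problem"
    if oob_up:
        return "in-band down, out-of-band up: log in via the management port and inspect"
    return "both unreachable: check power, cabling, or the management network itself"


if __name__ == "__main__":
    # Hypothetical addresses from the documentation range 192.0.2.0/24.
    # Port 22 (SSH) and port 443 (web management) are assumptions; use
    # whatever services your devices actually expose.
    in_band = tcp_reachable("192.0.2.10", 22)
    out_of_band = tcp_reachable("192.0.2.110", 443)
    print(triage(in_band, out_of_band))
```

Run against a list of core devices, a check like this can narrow down which box to console into first, though it is no substitute for the device's own logs.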
