You break it, you build it -- better than before

To borrow from the old Chinese proverb, a crisis can be an opportunity in disguise, even in the data center


Under normal circumstances, IT's mantra is clear: Nothing should ever break. We make every effort to achieve 99.999 percent uptime. We take great pains to transition successfully and seamlessly from old infrastructure to new. We collectively write billions of lines of code to adapt data structures from one technology to another, testing every conceivable element until we can pull the trigger and hope that, well, nobody notices. However, gains can be had when that last 0.001 percent appears and we have to deal with the consequences.
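For a sense of how thin that margin really is, the nines translate directly into minutes of allowable downtime. A quick back-of-the-envelope sketch (illustrative arithmetic only, assuming a 365-day year):

```python
# Downtime budget per year for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime allowed per year at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for target in (99.9, 99.99, 99.999):
    print(f"{target}% uptime -> {downtime_minutes(target):.2f} min/yr of downtime")
```

Five nines works out to roughly five minutes of downtime a year -- which is why nobody notices until the day that budget gets blown.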

For starters, if you’re doing it right, people who've never given a thought to their corporate IT department might actually realize the trains have run on time for a long, long while, and a minor delay or problem now is hardly a blip when viewed across the vast expanse of time when all was well.

Just kidding -- most of the time they’ll conclude that “the network is down” because their internal Web app isn’t loading completely and complain to the CIO.

In all seriousness, on many occasions, propping up elderly technology is costlier than ripping and replacing it with the latest and greatest. The trick is telling the difference between old technology that is truly consuming more resources and draining your productivity versus old technology that's doing fine and can hang around for a while longer. "If it ain't broke, don't fix it" is not an absolute, especially when it ain't broke only because two people are putting in many hours every day keeping it together.

But even when IT identifies and earmarks a problematic application or service for replacement, the fact that the teetering, underlying infrastructure passes as functional can be detrimental to the design and replacement process. Unless the system is on fire, we think we can take our time designing the replacement and, thus, accept input from anyone and everyone. Too often this results in a new design that’s six months behind schedule, already over budget, and still incomplete.

Meanwhile, a few full-time admins keep the old system chugging along -- they have no other choice. When there's an actual failure, the equivocating about what color the login screen should be suddenly disappears and actual work gets done because there's no alternative.

On the other side, surprise projects appear from nowhere because a budget item was not fully understood, or a vendor pulled strings and a bunch of new hardware or software was purchased and must now be implemented. The fact that the new gear was unasked for, unnecessary, and likely a poor fit is immaterial. This is where perfectly viable and functional parts of a corporate computing infrastructure get replaced for no reason at all, usually leading to immediate problems with a long tail.

Let's say part of the corporate network is eight years old. It's gigabit at the access layer with bonded multi-gig uplinks, so it lacks 10G. At the same time, the network monitoring and trending tools show it usually runs at about 25 percent utilization across all layers back to the core, which is 10G. The hardware is functional, needing only the occasional replacement of a failed power supply, and is otherwise in solid shape.

However, when sales reps get wind of no 10G and "eight years old," they start salivating, and the conversation leaps from whether the network needs to be upgraded at all to whether it needs multiple 10G or multiple 40G uplinks to the core. At the same time, the admins who know the network scratch their heads, because their 4G uplinks are pushing only 1G sustained.

If there's room in the budget -- or better yet, the budget needs to be spent in order to justify requesting the same budget next year -- the order is placed, and several tons of solid gear get yanked for sexy new hardware that will have zero net positive impact on the network or the users, who will only notice if there's an outage during the replacement.

There were definite, clear benefits to moving from 10Mbps to 100Mbps at the core and edge, and moving from 100Mbps to 1000Mbps and even 10G in the core, but that’s where normal computing use has generally stalled. We’re busy moving apps to the cloud, implementing VDI, using wireless devices, and shrinking the computational and bandwidth footprints of users, all of which means that for the first time ever, we don’t need to perform forklift upgrades on our networks, no matter what the sales droids say.

Alas, there’s no right or wrong way to work with aging infrastructures. Each element has its own set of dependencies, politics, and champions to be navigated when the time comes to replace it. Sometimes, when that element breaks, it can eventually lead to the best of all possible outcomes: a new system that works so well, nobody notices.