There’s a fine line between bravery and idiocy, and it’s usually determined by the outcome. Such is true in conflict and in IT. One of the major benefits of experience in either is that you develop a sixth sense about when discretion truly is the better part of valor. Shakespeare may have intended this as a joke, but it rings true.
Before we undertake any major action -- pushing a major new app version to production, migrating massive data sets from one storage array to another and cutting over, or performing the intricate work required to maintain a production system without taking it offline -- we hedge our bets. Well, we should hedge our bets.
I try to imagine every possible blocking problem beforehand and work out how to deal with each one before it happens. If at all possible, I like to have a reversion method already scripted -- something that resets everything to its pre-work state, akin to pulling a ripcord. I like to leave nothing to chance.
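A minimal sketch of that ripcord idea, scaled down to a single config file rather than a whole system -- the file names and paths here are purely illustrative, not any particular tool's convention:

```shell
#!/bin/sh
# Hypothetical "ripcord" sketch: capture the known-good state before
# touching anything, so there is always a one-command path back.
set -eu

WORKDIR="$(mktemp -d)"             # stand-in for a real config directory
CONF="$WORKDIR/app.conf"
RIPCORD="$WORKDIR/app.conf.ripcord"

echo "setting=old" > "$CONF"       # pretend this is the known-good state

# Before any work: snapshot the known-good state.
cp -p "$CONF" "$RIPCORD"

# ... risky work happens here ...
echo "setting=new" > "$CONF"

# Something went sideways -- pull the ripcord and restore pre-work state.
cp -p "$RIPCORD" "$CONF"
grep -q "setting=old" "$CONF" && echo "ripcord restored known-good config"
```

The point is less the three lines of shell than the discipline: the undo path is written and tested before the risky step, not improvised during it.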
There may be a time during the work when the infrastructure is in an extremely precarious position, but I like to limit that exposure as much as possible and have a clear path back to safety. This concept is built into some code deployment methods, but it’s not as easy in IT in general.
As we all know, IT is a fickle beast, and there are eventualities that can’t be fully accounted for. A storage array intended as temporary holding space that was completely stable for months will throw a disk or two halfway through the process, becoming a major bottleneck at best or completely blowing up the migration at worst. Or an order of operations mistake will be made, and you'll find yourself painted into a corner -- the only questions being how dirty you will get trying to get out and what you will have to sacrifice along the way.
With enough experience in this world, you can see some of these possibilities before they happen. You can either bail on the planned maintenance or upgrade or quickly develop an alternate plan that evades the problem. However, if more than a few of those issues crop up, even if there’s a seemingly clear path to success, you may hear that little IT voice in the back of your head screaming that it’s a trap, and it’s better to walk away while you still can. It’s usually wise to listen to that voice.
The basic concept is that no matter what, we should never lose data or systems during any IT function. Even if everything goes completely pear-shaped, the resulting questions should center on how long it will take to recover, not if it can be recovered. Even if it requires a few extra days of preparation beforehand, there should always be a way to undo whatever work is being done. It may cost more money in the form of backup storage or systems, but it’s always worth it, even if it’s ultimately not needed.
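That "always a way back, even if it costs extra" rule can be sketched in a few lines of shell. This is a hypothetical illustration with stand-in directories, not a real migration tool -- the key step is that the safety copy is verified before any risky work begins, because an unverified backup is a hope, not an undo path:

```shell
#!/bin/sh
# Hypothetical sketch: stage a safety copy of the data before risky work,
# and trust it only after proving it matches the original.
set -eu

SRC="$(mktemp -d)"      # stand-in for the production data set
SAFETY="$(mktemp -d)"   # stand-in for the extra backup storage
printf 'critical data\n' > "$SRC/db.dump"

# Stage the safety copy before any work begins.
cp -rp "$SRC/." "$SAFETY/"

# Verify byte-for-byte before declaring the undo path ready.
if cmp -s "$SRC/db.dump" "$SAFETY/db.dump"; then
    echo "safety copy verified; safe to proceed"
else
    echo "safety copy does NOT match; stop here" >&2
    exit 1
fi
```

If the verification fails, the maintenance window gets postponed -- that's the whole trade: a few extra days or dollars up front so the worst-case question is "how long to recover," never "can we recover."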
This is where the cowboys come in. It’s in the midst of sensitive and delicate operations, when unforeseen problems appear, that a cowboy admin will push forward without a safety net and try to reach the other side. If he succeeds, everyone’s thrilled and admiring, and rounds of beer are bought at the pub. If he fails, everyone sticks around for hours or even days of constant stress and pressure until whatever can be recovered is recovered. These are situations you don’t want to be part of if you can help it, because they usually don’t end well.
It can be tricky to tell a true cowboy move from a calculated one, however, because to an observer they may look identical. I’ve made plenty of unorthodox saves in the middle of crises -- moves some might consider unusual or avant-garde -- but wherever possible with a backup plan already in place. It might be as simple as SCPing a broken management VM from one array to another to repair it and bring it up on stable storage so that further rescue migrations could proceed, or reworking iSCSI LUN masking on the fly to keep certain problem servers from overloading a failing storage array so that a fragile recovery could complete.
Full disclosure: I've had my share of cowboy moments with no safety net. I'm pretty sure most of us have.
If we lived in a perfect world, these things wouldn’t require any thought or planning at all. Big data and VM migrations, app and database rollouts and upgrades -- everything would be as easy and natural as breathing. We have made great strides in this area over the past few decades, and there may come a day when that is possible, but it’s certainly not today. There is no magic bullet; there is only Zuul.