When failover systems fail
Despite our best attempts to design for the worst, the failure of high-availability systems is shockingly common. Here's how to avoid career-ending mistakes.
If you've been in IT long enough, you've seen it happen: The crown jewel mission-critical application, built from the ground up to be highly available, goes down in flames and stays down, though a multitude of expensive backups and safeguards are in place. The reasons why this kind of "impossible" failure occurs are wide and varied, but ultimately trace back to two major factors that generally go hand in hand: complexity and plain old human error.
Complexity everywhere
The complexity of even the smallest business networks today dwarfs that of the enterprises of yesteryear. While I absolutely love server virtualization, virtual machine migration, SAN arrays, snapshots, replication, converged networks, and a whole host of other relatively new technologies, implementing them comes at a severe cost that many tend to overlook.
In the good old days, the functionality of a single application might depend upon nothing more than its own internal hardware and the network functioning properly. Today, that dependency tree is likely to include a group of centralized storage devices together with their ever-growing firmware code base, a virtualization hypervisor packed with features, and a more elaborate network architecture to support it all.
On balance, I think we should be happy about all of that -- maintaining tens or hundreds of stand-alone servers, each with its own compute and storage hardware, is fantastically wasteful and massively time-consuming. The complexity is simply a result of forward progress, but it does come at a cost.
Anytime you have to ask "what's it doing now?" you're essentially paying that bill. The solutions we deploy are far more complicated than any of us can really understand completely. If you've spent a few hours sifting through a pile of arcane log files trying to figure out why something that really should work isn't working, you know exactly what I'm talking about.
Perhaps partially as a result of that complexity, but also due to modern business's much greater reliance on technology to function, maintaining high availability has become more and more critical in IT departments of all sizes. Fifteen years ago, many businesses would see the 24-hour failure of a tier-one application as unpleasant, but not disastrous. When an outage like that occurs today, heads roll.









