This uptime imperative results in lots of system redundancy: clustering, replication, and warm sites, to name a few popular solutions. While these systems usually accomplish their goals if they are designed, implemented, and maintained properly, what they mean, in effect, is: "Our systems are too complex and might fail, so we're going to add another level of complexity to solve that." It sounds dumb when it's written that way, but that's exactly what we do. And it works -- most of the time.
The human element
I suspect that if you placed every major IT failure under the same scrutiny that the NTSB applies to airline crashes, you'd see human error listed as the sole or a contributing factor in nearly every one of them. As much as IT is about racks of equipment, cables, telecommunications lines, and software, it's more about the people who design, build, and run all that stuff.
Product design. The human error parade starts before your fancy new equipment ever shows up on your doorstep. This isn't a particularly new phenomenon. Everyone has probably dealt with equipment that died because it wasn't assembled correctly or included some bad components.
Today, though, the complexity present in the systems we use results in much more insidious types of failures. As a case in point, a well-known SAN vendor recently released new firmware for its flagship storage product. The firmware supported a number of very cool features, and it was an exciting release -- that is, until it started crashing arrays and occasionally hosing customer data.
While I don't have the inside track on exactly what the problem was, you can bet it was probably a failure to do enough regression testing. As the solutions these vendors ship become more and more complex, the challenge of testing them against all of the bizarre scenarios that customers will run them through in the field becomes dramatically more difficult. That's not to excuse the failure, but it has become the status quo to expect new software to break something. That's sad, but that's where we are.
In the good old days, if a centralized storage device made by a trusted storage company died on you, chances are a card or a drive had fried, and you'd have well-equipped support technicians rappelling out of the ceiling to fix it in short order. Today, the somewhat unsurprised offshore support tech on the other end of the phone will likely be trying to figure out which one of the unpublicized critical software bugs you've just bumped into.