IT infrastructure: Whatever happened to reliability?

As storage solutions become more advanced, software complexity increases -- and multiplies the possibilities of failure


Several days and many, many hours of engineering time later, the manufacturer had gotten the VTL back up and running with all of its data intact. Though I won't name them, the vendor did quickly realize the severity of the problem (nobody likes disappearing backups) and stepped up to the plate to fix it. But that's not the point. The point is that the more complex we allow our solutions to get, the more likely this kind of catastrophic software failure is.

I could rattle off a huge list of similar incidents I've noticed lately. There was the SAN whose buggy cache-mirroring code crashed both redundant controllers simultaneously, which is ironic because the mirrored cache exists only to ensure uptime. Or take the recent and very public debacle in which McAfee's antivirus software identified part of Windows XP as a virus (I'll resist the urge to make a joke about the potential accuracy of that conclusion).

Buckle up and back up your backup

It has gotten to the point where, no matter how spotless your history with a piece of storage gear, you may be one firmware upgrade away from introducing a crippling software bug that brings the system down in spite of all the hardware redundancy you've paid for. That's not a fact that's going to help anybody sleep at night. But what can you do about it?

First, we need to absolutely demand that problems like this get fixed, and be very public about them if they aren't. Companies won't invest more resources in software quality assurance unless their heads are on the chopping block.

Second, never trust anything -- regardless of how redundant it may look. Design a completely independent backup for your backup. Imagine what you'd do if your most critical piece of infrastructure evaporated without much warning or explanation. In the case of the VTL that I mentioned earlier, the client's backups were protected by duplicate backups that were sent to a physical tape library for off-site archiving. As a result, the client could have cleanly survived the loss of the VTL.

The bottom line is that as the data explosion continues, the complexity of the solutions we use to combat it will grow too. Storage virtualization, online deduplication, and content archiving will be some of our most powerful tools in that battle. They're also all big chunks of black-box software spaghetti, much of it written to get to market quickly and appear in a long list of features on the marketing glossy. Every one of those features has the potential to cause data loss or downtime. Don't lose sight of that as you design your next-generation storage architecture.

This story, "IT infrastructure: Whatever happened to reliability?," was originally published at Read more of Matt Prigge's Information Overload blog at
