Last week, I gave a three-hour workshop at TechMentor Orlando on the fundamentals of storage virtualization. The primary focus was a review of the concepts and technologies that make our storage (and disaster recovery) world what it is: an ever-changing, acronym-rich headache. RAID, SAN, NAS, VTL, iSCSI, and more all make up the current infrastructure. At times during the discussion, I opened the floor for audience members to share their storage and disaster recovery experiences, and I have to say that some of them were the tales of horror you might only hear around a campfire (well, a campfire of IT admins).
Jarred Fehr, a system administrator at Peachtree Business Products, said, "About three years ago, our company purchased a new DAS array to replace our aging one. We decided to buy a unit from a different vendor than we usually use because of a better cost-to-storage ratio. Unfortunately, our unit had a batch of bad drives from a well-known drive maker. After one month in production, there were multiple drive failures in one night and we lost all of our data. Even though we had backups of everything, it still took a full week to restore it all and return business to normal. Now we have redundant servers and arrays to prevent such a loss in the future." Sounds like the backup saved the day.
Rick Calmes from the Air Force Institute of Technology relayed an experience from a while back, but one that left an indelible mark on his disaster recovery mindset. He reported:
We were decommissioning an old NetApp device that was using multiple arrays of SCSI drives. We discovered that the array would physically fit into the DEC Alpha. So upon further investigation, we scrounged a SCSI card that would fit into the DEC Alpha as well. (Being PC admins, we did not realize how proprietary the DEC machines were.) So we power it all down, install all the hardware and cabling, and hit the power button.
Well, the machine goes through its POST, we wait, and then the dreaded screen: no OS found. Not sure what had happened, but the result was that we no longer had an operating system or a mail store. So we powered back down and removed all the hardware, card, and cabling, but it was too late. The damage had already been done, as we discovered when we tried to restart the Alpha again. Major oops!
Thank goodness for backups, because they saved our bacon. We reloaded the OS and Exchange from scratch, then restored the store. We were back in business. Fortunately, e-mail was not the ultrahigh priority it is currently, as it took us about two days to get everything back. Part of the slowness was the multiple reboots to patch NT and its special HAL to run on the Alpha. And the 8mm tapes weren't very fast, either.
Rick says the morals of the story were: a) you don't know what you don't know and b) backup, backup, backup!
Rick wasn't the only one with a chilling story. Moe Hoskins, a regional network manager, told a harrowing tale of loss:
When the telephone rings with a report of an error opening Outlook and its inability to locate the Exchange server, there is reason for concern. After attempts to remotely manage and ping the Exchange server fail, your concern level heightens. Upon contacting the technician on site, you learn he had discovered the server in an unresponsive state and forcibly cycled power. What you didn't know at the time was that during the unattended reboot, the Windows 2003 Exchange server, running RAID 5 with a hot spare, was missing a hotfix (831374), which allowed the chkdsk autochk process to run and corrupt the Exchange database. The corruption soon became evident after your Exchange admin spent several hours on failed attempts to recover the database.
You realize you are now facing significant downtime and while not optimal, there is no need for panic. Your nightly backup regime will finally pay off and you can simply roll back to the previous night's backup with a minimal loss of mail. But the battle has yet to begin. After inserting numerous backup tapes, you discover the data set is not complete and you cannot find a restorable database. You frantically begin examining the backup logs.
The harsh reality sets in when you discover the backup job was spanning two tapes. The morning ritual of inserting a new tape for the scheduled nightly backup without first reviewing the backup completion logs was allowing the spanning to go unnoticed. The second half of the backup was subsequently overwritten by the first half of the following backup. In a nutshell, this destructive cycle resulted in a two-week rotation of media that did not contain a restorable Exchange backup. You desperately move on to the monthly tapes and finally discover the two-tape spanning had begun four months earlier. As your heart sinks, you are now faced with more than 200 users losing their last four months of messaging information.
While this scenario may seem like a comedy of errors, it is no laughing matter. This messaging disaster did in fact occur and was enabled by the belief that fault tolerance was the primary savior and backup was only necessary for a Hail Mary situation. Databases will crash, hotfixes will be missed, equipment will fail, and human error will always play a part in data integrity. Riddled with failure points, the true source of this disaster was complacency with the backup regime.
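The fatal step in Moe's story was rotating tapes each morning without reading the previous night's completion logs, and that is a check you can automate. Here is a minimal sketch in Python; the log format, job names, and field layout are all hypothetical, not taken from any real backup product:

```python
# Hypothetical backup-log checker. The pipe-delimited log format
# (job | status | tapes used) is an assumption for illustration only.

def check_backup_log(lines):
    """Scan nightly backup-job log lines and return warnings that
    should block the morning tape swap: jobs that did not complete,
    and jobs that spanned more than one tape."""
    warnings = []
    for line in lines:
        job, status, tapes = [field.strip() for field in line.split("|")]
        if status != "COMPLETED":
            warnings.append(f"{job}: status {status}, do not rotate tapes")
        if int(tapes) > 1:
            warnings.append(f"{job}: spanned {tapes} tapes, retain all of them")
    return warnings

# Example nightly log (hypothetical):
log = [
    "exchange-full | COMPLETED | 2",
    "fileserver-full | FAILED | 1",
]
for warning in check_backup_log(log):
    print(warning)
```

A script like this, run before anyone touches the tape drive, would have flagged the two-tape spanning on night one instead of month four. The real win is the habit it enforces, not the code itself.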
What is the takeaway from all of this? "Regardless of your [disaster recovery] solution, in the end, it all comes down to having a reliable backup," said Moe Hoskins.
In a world of disaster recovery sites, high availability, and virtualization, you might easily become complacent about your backup tools. And perhaps IT is moving further into a world where backups may become a thing of the past. But not yet!
Have your own disaster recovery nightmare to relate? By all means add it to the comments section below for others to laugh (or cry) and -- I hope -- benefit from.