Hard drive nosedive exposes backup breakdown

It should've been a simple hard drive swap. Instead, IT must dive into the server’s core for a solution.

Call it trial by fire or an IT nightmare or what have you. But after a recent experience, I once again became obsessed with keeping backups current.

We have a server that is notorious for flashing an amber light to warn of potential hard drive issues. Typically, when one light here and there starts flashing, we put in the call for a replacement drive, hot-swap the drives, and go our merry way. But one experience was much different.

On that day, two drives started flashing during a busy spell. It had been on my to-do list for a few days when another IT employee – “Bob” – asked if anything needed attention, so I passed the task to him. He called in an order for the new hard drives to be delivered the next day.

A couple of days on, between trips in and out of my office, Bob gave me an update as he was leaving the office in the late afternoon: The hard drives had been swapped, one was already rebuilt, and the other was taking a while but “should be all right.”

An ominous sign

As I sat down at my desk, an employee came to me with news that they could not access the corporate shared drive. I had started looking into it when we were approached by another user, then another and another about the same problem. The puzzle pieces began to fall into place and panic slowly crept in. All signs pointed to a connection between these evidently new issues and the recently replaced hard drives.

There I was on a Thursday afternoon, two hours from leaving the office and due to go out of town the next day, faced with a big problem. I remotely accessed the server where the problems were occurring; it was the one hosting five virtual servers. The only silver lining at this point was that the heart and soul of the company (that is, our main database) was hosted on a different physical server. It was a small relief, but we’d take what comforts we could get.

As I remotely logged in, I saw an alert that the virtual disk was no longer present and realized that the two hard drives Bob had swapped had been pulled at the same time from the same array. The server's original RAID 5+0 setup predated my time at the company. Each RAID 5 set inside a RAID 5+0 can survive the loss of only one drive at a time, so with two drives gone from the same set, the server was well on its way up the creek without a paddle, to put it mildly.
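For readers who want to see why pulling two drives at once was fatal, here is a minimal sketch of the RAID 5+0 rule of thumb: the array is a stripe across several RAID 5 sets, and each set tolerates only one missing drive. The drive numbering, the set sizes, and the function name `raid50_survives` are illustrative assumptions; the article does not say how many drives the server actually had.

```python
from typing import Iterable


def raid50_survives(failed_drives: Iterable[int],
                    drives_per_set: int,
                    set_count: int) -> bool:
    """Return True if a RAID 5+0 virtual disk survives the given drive losses.

    Assumed (hypothetical) layout: drives are numbered from 0, and RAID 5
    set i holds drives i * drives_per_set .. (i + 1) * drives_per_set - 1.
    """
    total_drives = drives_per_set * set_count
    failures_per_set = [0] * set_count

    for drive in set(failed_drives):
        if not 0 <= drive < total_drives:
            raise ValueError(f"drive {drive} is not part of the array")
        failures_per_set[drive // drives_per_set] += 1

    # RAID 5 keeps one drive's worth of parity per set, so each set can
    # lose at most one drive. The RAID 0 stripe on top adds no redundancy:
    # if any single set is lost, the whole virtual disk is lost.
    return all(count <= 1 for count in failures_per_set)


# Hypothetical six-drive array: two RAID 5 sets of three drives each.
one_per_set = raid50_survives([0, 3], drives_per_set=3, set_count=2)
two_in_same_set = raid50_survives([0, 1], drives_per_set=3, set_count=2)
```

In this sketch, losing one drive in each set is survivable, while losing two drives in the same set is not. That is why swapping flagged drives one at a time, and waiting for each rebuild to finish, matters.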

A deeper problem

After initial moments of denial and hoping the server would magically boot correctly, we turned to the backups, which were supposedly set up to feed the NAS via iSCSI. We had checked the logs over time and seen the jobs completing successfully. But we couldn't verify this because the failed virtual servers included the one running our backup software.

Eventually, we realized the backups were gone. It appeared that the backups had been replicated to and stored on the same host as the original virtual servers, which obviously did no good whatsoever in this scenario.

My blood pressure shot up and I was in a full-fledged panic. But I was able to take a deep breath and note exactly what we needed to do in order to get back up and running for the morning — at least to the point that users could log in (since the domain controller was wiped out) and access the corporate data that had thankfully been relocated to the NAS a few months prior.

Bob had been called back into the office soon after the problems were reported, and we spent a long night rebuilding the domain controller, Office 365 controls, print server, and many other functions from scratch. Around 5 a.m., we had restored enough for the workers to perform basic tasks, and I headed off on my trip.

Alas, many hours of my excursion were spent on the phone with Bob, and over the next several weeks, we pieced together the missing information from the servers. We eventually dug ourselves out of the massive crater created when the virtual disk was corrupted.

It was a good time for us to revisit our core IT processes and remind ourselves of key takeaways:

  • Always check the physical location of the backups to verify their existence — not the backup log alone.
  • Know your RAID arrays and exactly what you or your client has, and proceed with caution when making changes.
  • Perform tasks such as hard drive swaps after hours in case disaster does strike.
  • Check backups — again!
  • It’s good to be a bit of a night owl working in IT, just in case.
  • Don’t stack all of your eggs (or virtual servers) in one basket. 
  • For good measure, check those backups once more!
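The first takeaway, checking the backup files where they physically live rather than trusting the job log, can be partly automated. Below is a minimal sketch, assuming the backups land as ordinary files on a mounted share. The function name `find_backup_problems`, the file names, the mount path, and the 24-hour freshness window are all hypothetical and not details from our environment.

```python
import time
from pathlib import Path
from typing import Dict, Iterable, Optional


def find_backup_problems(backup_dir: str,
                         expected_names: Iterable[str],
                         max_age_hours: float = 24.0,
                         now: Optional[float] = None) -> Dict[str, str]:
    """Inspect the backup files themselves and report what looks wrong.

    Returns a dict mapping each problem file name to "missing", "empty",
    or "stale". An empty dict means every expected file is present,
    non-empty, and newer than max_age_hours.
    """
    now = time.time() if now is None else now
    root = Path(backup_dir)
    problems: Dict[str, str] = {}

    for name in expected_names:
        path = root / name
        if not path.is_file():
            problems[name] = "missing"
            continue
        info = path.stat()
        if info.st_size == 0:
            problems[name] = "empty"
        elif now - info.st_mtime > max_age_hours * 3600:
            problems[name] = "stale"

    return problems


if __name__ == "__main__":
    # Hypothetical mount point and file names; point these at the place
    # the backups are supposed to be, not at the host being backed up.
    report = find_backup_problems("/mnt/nas/backups",
                                  ["dc01.vhdx", "print01.vhdx"])
    for name, problem in sorted(report.items()):
        print(f"{name}: {problem}")
```

A check like this only proves that files exist and are recent. It is no substitute for an occasional test restore, and it should run from a machine other than the one hosting the virtual servers.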

Copyright © 2017 IDG Communications, Inc.
