Don't Do DR on Friday the 13th

Anyone beats this and I'll have Doug send you a T-Shirt.

Friday the 13th was in May last year, and I didn't even have this blog then. But I made notes that day and stored them in my anecdote file because I knew they'd be fodder for something someday.

I don't believe in voodoo, vampires or the Loch Ness monster. But after last year, I do believe in benevolent aliens (Brian Chee), Sasquatch (Paul Venezia), and Friday the 13th.

Looking back, I should have known. Planning a disaster recovery rollout on Friday the 13th is just too tempting for Murphy, the Fates or the technoid demon Bitszeelbub who seems to hate me.

DISASTER PLANNING ON DISASTER DAY

It wasn't even that big a deal. Just head over to the client, log everyone off a couple of hours early and configure the whole network (desktops and servers) to back up to an off-site service (which shall remain nameless). Strategy was simple: desktops back up to servers using shadow copy, and servers go off-site when the usage coast is clear.
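If you want a feel for the "coast is clear" part, here's a minimal sketch of that kind of off-hours push. The quiet-window times, the sync command and the paths are all made up for illustration; the actual off-site service and its tooling aren't named here.

```python
import datetime
import subprocess

# Hypothetical names -- the real off-site service and its upload tool aren't named in this post.
OFFSITE_SYNC_CMD = ["offsite-sync", "--source", "D:/Backups", "--dest", "offsite://client-vault"]
QUIET_START = datetime.time(22, 0)  # assume the coast is clear after 10pm
QUIET_END = datetime.time(5, 0)     # and before the morning crowd shows up

def coast_is_clear(now=None):
    """True if we're inside the off-hours window when nobody is hammering the servers."""
    t = (now or datetime.datetime.now()).time()
    return t >= QUIET_START or t <= QUIET_END

def push_offsite():
    if not coast_is_clear():
        print("Users still on; skipping the off-site push for now")
        return
    # Ship the most recent server backup set to the off-site service.
    subprocess.run(OFFSITE_SYNC_CMD, check=True)
    print("Off-site push completed")

if __name__ == "__main__":
    push_offsite()
```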

We were also going to configure our monitoring software to check on their routers a few times a day (basic there/not there stuff). Then make sure that if 'not there' comes back, we run a few additional pings to see if it was just one device or everything, and have our system send the appropriate alerts to us and to the client's key geekoids. Smart enough for SMB work and easy enough to do.
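In rough terms, the check looks something like the Python sketch below. This is an illustration, not our actual monitoring package: the addresses, alert recipients, and ping flags (Unix-style here; Windows would use -n and -w) are all assumptions for the example.

```python
import subprocess

# Hypothetical inventory -- the client's real router and device addresses aren't in this post.
ROUTERS = ["203.0.113.1", "203.0.113.2"]
OTHER_DEVICES = ["203.0.113.10", "203.0.113.20"]
ALERT_RECIPIENTS = ["us@ourshop.example", "keygeek@client.example"]

def is_there(host):
    """Basic there/not-there check: one ping with a short timeout."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def send_alert(message, recipients):
    # Stand-in for whatever the monitoring package actually does (email, pager, etc.).
    for addr in recipients:
        print(f"ALERT to {addr}: {message}")

def check_site():
    down_routers = [r for r in ROUTERS if not is_there(r)]
    if not down_routers:
        return  # everything answered; nothing to do until the next scheduled check

    # 'Not there' came back -- run a few additional pings to see whether it's
    # just one device or everything that dropped off the map.
    down_others = [d for d in OTHER_DEVICES if not is_there(d)]
    if len(down_routers) == len(ROUTERS) and len(down_others) == len(OTHER_DEVICES):
        send_alert("Site unreachable -- possible outage or disaster", ALERT_RECIPIENTS)
    else:
        send_alert(f"Devices down: {down_routers + down_others}", ALERT_RECIPIENTS)

if __name__ == "__main__":
    check_site()
```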

Basic DR protection. We were going to follow this up with an off-site work location and a rental contract for the basic servers and workstations the company needed to function--that way, if our alerts came back as 'not there' and there really was a disaster, the company could call key employees to report to a new site, which would be equipped to handle them and have a restore of the previous day's system backup ready and waiting. Slicker than hobbit snot, and not too expensive if you know the right people.

THE PAIN BEGINS

Then Friday the 13th came along and drop-kicked me in the trackballs. First bit 'o bad news: The client's UPS alerts go off at 5:06AM. Orderly shutdown of the whole system. The client's geek calls us while he's driving over. When he gets there, he sees Long Island power crews are in the site behind his, repairing the neighborhood transformer that just got crushed by a semi deciding to back up where it shouldn't.

Now we've got a disaster the day before we put in a disaster recovery plan. Thank you, Lord.

Bite the bullet, reshuffle the work schedule, drive over and roll up our sleeves. Not too bad. By 11am, the power company had the power restored, which I thought was a minor miracle. I'm thinking about lunch now, not Friday the 13th.

But suddenly we can't bring up the Exchange Server, and all the other servers are showing unscheduled power shutdowns. What happened to the UPSes? Seems our on-site geek shirked his duty when it came to checking battery health on the UPS. So the UPS logic was working, but the underlying batteries were crapola and all the servers got a raw power drop anyway. Exchange was the only one that couldn't recover; something corrupted the message store, so now we've got a dead email server--the day before I would have had an off-site backup ready and waiting.

No problem, go to tape. Momma didn't raise no stupnagels. Off-site backup is an add-on, not a single solution. We ALWAYS have local tape backup, too.

HUMAN ERROR ROOT CAUSE

Unfortunately, local tape requires local humans. And the same silly freak who didn't check the batteries on the UPS apparently ignored no less than THREE WEEKS of emails from the backup server complaining of bad media. He just kept sticking new tapes in there and figuring that solved the problem. By this time, I not only believe in Friday the 13th, I also believe in summary capital punishment, and I'm about to sacrifice a live chicken to the techno demon just so I can get through the rest of this day.

Worse, you can't restore a broken Exchange store to the same server for some reason (my Exchange guy was trying to explain it to me but I couldn't hear much over the sound of my forehead repeatedly smacking itself into the wall). Now I've got to reach into my drawers and pull out a spare server. Phone calls, begging, favors promised, dignity damaged, server on its way.

An hour and a half passes while we wait the 15 minutes that it should have taken. The cell phone rings and I actually break into a cold sweat. It's server boy, whose mini-van has just rear-ended another car. Worse, he stuck the box in the back of the van and had the rear seats down, so the thing was actually airborne for a split second before it slammed into the passenger seat at about 35mph. Upshot: No Server For You!

Now I'm just mad. I drive all the way home to NJ and grab a lab server, then head back to the Island.

On a Friday.

Unless you're local, you don't know what this means for blood pressure and stress levels. I get back (about three and a half hours later) to find that the Exchange guru has decided to go home. But his cell phone rings before he even gets into his driveway, and let's just say he's either heading back to the site or looking for a way into the Witness Protection Program.

ONE LAST KICK IN THE...

FF to 11:36pm and Exchange is back up. We even managed to recover a little of the damaged mail store. The backup issue has been fixed and tested and we got another client's geek to let us borrow compatible batteries until new ones arrived for the UPS. The whole disaster thing has been pushed off until Monday, but at least the Saturday crew will be working as normal. Friday the 13th is over.

Until we go out to the parking lot to find that the Exchange guy left his lights on and now has a dead battery. And nobody has jumper cables. This is where I sit down in the car, pull out the notebook and write down my notes.

After we got the car started, my instinct was to buy everyone some beer, but no sooner were the words spoken than an eerie feeling descended upon us. We just looked at each other, shook our heads and slunk off home. Sometimes you know when you're beat.

I'm working from home today.

Copyright © 2006 IDG Communications, Inc.