The Y2K bug that wasn't

A software snafu turns a routine test into both a problem-solving adventure and an unexpected cardio workout

During the Y2K lead-up, I had the privilege of testing my company's distribution center systems for Y2K compliance. These distribution centers ran a variety of automation software and controls that routed containers throughout the warehouses, where automatic dispensing units and human pickers loaded them up with goods to be shipped to the customers. It was (and still is) a very sophisticated system that enables the business to claim a very high service ratio of getting goods to customers as ordered and on time.

Naturally, the company did not want to be caught in a Y2K fiasco, so they assembled a few workers to travel throughout the country running a series of tests on the equipment.


Since I had written much of the software and it did not deal with dates at all, I knew that this would be nothing more than an act of due diligence. Nevertheless, I understood the need for the process and gladly lent a hand to the effort.

The procedure went like this: You'd arrive at the DC at midday, while the system was not in use. If you didn't know the DC manager, you'd introduce yourself and tell him or her that the system would be back online well before the start of the production cycle, which began at 7:00 p.m. and ran until the last container shipped, sometime around 5:00 a.m. the next morning. You would then shut down the computers, set the BIOS dates to 15 minutes before the turn of the millennium, reboot, and run some functional tests. After that was done, you'd shut down the computers again, set the BIOS dates back to the current date and time, and restart the systems. Finally, you would run some additional functional tests. The entire process usually took about two hours.

One of these warehouses happened to be located near New Orleans, a place near and dear to my heart: I went to high school and college in southern Louisiana, and started my career there. I planned an itinerary that took me to three other cities before my last stop in New Orleans, followed by a few days of R&R. Planning ahead, I called my good friend -- we'll call him Jack -- and arranged to meet him at 6 p.m., after I was done at the DC.

Everything went as expected in all the other tests I ran, so I didn't anticipate anything different in New Orleans. I did my thing, and when I rebooted the computers with the Y2K date, the system failed. The DC manager looked upset.

"Don't worry, I'll reset everything and we'll resolve the issue offline and have a fix in place well before the end of the year," I told him. (Secretly, I was ecstatic. I had never encountered, nor expected to encounter, a real Y2K bug, and thought I had a live one.)

The only thing was, when I reset the system to the current date, it still failed. I started trying to diagnose the problem. The facilities manager pitched in, taking the cases off all the computers and vacuuming out the dust.

"You'd be surprised what a little dust can do to a computer," he told me.

After about three hours on the phone with a programmer who had developed some of the components, I determined that a patch had been applied the prior month. I looked in the event logs and discovered that the systems had not been rebooted since the date stamped on the patched executables. In other words, the developer had come in, applied a software update, and not tested it. Had not even rebooted the machine. The DC personnel never rebooted the machines, so this one had been running the old code for a month, with the newly installed, untested code waiting for its chance to wreak havoc. In effect, I was testing the code that had been installed a month earlier, not running a Y2K test (oh well). I ended up rebuilding the code from the source archives and installing it on the failing machine.
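The check described above, comparing the dates on the executables with the time of the last reboot, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling used at the DC: the function name `modified_since_boot`, the suffix list, and the idea of passing in the boot time by hand (for example, after reading it from the event log) are all hypothetical.

```python
from datetime import datetime
from pathlib import Path


def modified_since_boot(directory, last_boot: datetime, suffixes=(".exe", ".dll")):
    """Return files under `directory` that were modified after `last_boot`.

    Assumption: `last_boot` is a timezone-aware datetime supplied by the
    caller; this sketch does not look up the boot time itself.
    """
    boot_ts = last_boot.timestamp()
    flagged = []
    for path in sorted(Path(directory).rglob("*")):
        # Only consider program files; the suffix list is a hypothetical default.
        if path.is_file() and path.suffix.lower() in suffixes:
            # A binary newer than the last boot has never been started from a
            # clean state, so the running system may not reflect it.
            if path.stat().st_mtime > boot_ts:
                flagged.append(path)
    return flagged
```

Any file this returns was changed on disk after the machine last started, which is the situation the event logs revealed. Modification times can be preserved or altered when files are copied, so a result like this is a hint to investigate rather than proof.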

All this kept me at the DC past 6, so I called Jack, who was already at the bar, and told him I'd be late. He was OK with that.

And then at 7 p.m., the production cycle began. I discovered that the recovery process had reset all the calibration settings on the sorter, causing containers to miss their exit lanes and crash into the sides. Normally, fixing this problem is a two-person job: one person at the sorter (usually on a ladder) tells another in the computer room, via walkie-talkie, how much the containers overshot or undershot their exit lanes. But that evening I was on my own, running from room to room and climbing up and down the ladder. After what felt like the equivalent of a 5K run and a StairMaster session, I finally had things back to normal at 10 p.m. and promised to return at 8 the next morning to do a post-mortem in front of the muckety-mucks. I high-tailed it to the bar and found Jack. He had made many friends.

We ended up spending most of the night out on Bourbon St., shooting pool in a little bar on a side street. We played the locals for pitchers of beer, made the obligatory stop at the Café Du Monde for French doughnuts and coffee, and staggered back to the hotel at about 4:30 in the morning. That wakeup call at 6:30 came way too early. I dragged myself back to the DC and somehow made it through the day, until the DC manager was satisfied that I had restored his system to normal.

I don't know if I learned anything from this experience other than that at times it's best to distrust other programmers' methodology -- and how nice it is to have good friends.



Copyright © 2009 IDG Communications, Inc.
