Scenes from a disaster: An upgrade gone too far

It's easy to miscalculate how a major upgrade will fare -- especially when you forget the importance of infrastructure

Not long ago in a land not far away, a manufacturing company was poised to perform a major version upgrade of its most mission-critical application. The company relied on that application to manage the very core of its business. Without it, business could not be done. No orders could be shipped. The company wouldn't know a loyal customer from a stranger.

So the company invested serious time and effort into testing the new version. Users had been cycling through training rooms full of test workstations for months before the go-live date. The application vendor, which was on hand to witness the first large-scale deployment of the new version and fix bugs where necessary, had been an excellent partner throughout the process. The new functionality had been tested thoroughly, and the user base was very excited about it.


The go-live planning was fairly complicated. The back end of the application was running on a database cluster using some of the best servers money could buy. The database itself lived on a highly redundant storage array that had been implemented during the last application upgrade a few years earlier.

The new application version brought significant structural changes to the way data was stored in the database, which entailed a complicated conversion process that had to be performed offline. Fortunately, this process had been tested several times as production mirrors were restored from backups and converted in the test environment. In addition, several libraries on the workstations had to be upgraded. Unfortunately, once those library upgrades were complete, it would be impossible to use the old client version.

To mitigate the failback risk, the migration had been planned down to the smallest detail. It was decided that the system would be brought down after second shift on a Saturday evening. Database backups would be made and the conversion process would begin. Perhaps eight to twelve hours later, testing would be performed by a group of power users. If everything looked good at that point, policies would be pushed to the network to initiate the client upgrades. Soon after, Sunday second-shift employees would be able to get into the new live system and the migration would be complete.

Finally, it was go-live time. Like clockwork, the system came down at 7 p.m. on Saturday and the conversion process began. Early the following morning, the migration team had completed its work. The database looked like it was ready to go and upgraded application servers had been deployed. User testing kicked into gear again and everything looked excellent. Then came the order to push out the new client software -- and roughly 1,000 workstations started to receive the client updates.

The first sign that something was amiss came in the form of a help desk call from a power user on the shop floor. He was trying to update an order he was working on, but it was taking a long time to get from one screen to the next. It wasn't a big deal; it just wasn't anywhere near as snappy as he remembered it being from training. At that point, only about 150 users had received the upgrade and logged in to the system. Members of the application team looked at each other with a growing sense of dread. It was as if the temperature in the datacenter had suddenly dropped 20 degrees.
