Within the next two hours, it became clear that something had gone horribly, horribly wrong. As more users piled into the system, it became slower and slower. The first thought was that some kind of network problem had arisen, so the network administrator started digging into that. At the same time, server admins started drilling into performance logs on the application servers, and desktop techs were sent out to spot-check workstations. Investigations of the network didn't turn up anything. The application servers and client workstations were also running as expected. No one thought the problem could possibly have anything to do with the database servers; they were fantastically oversized for what they were being asked to do. But it looked more and more like that was the only place left to look.
Six hours after the new version had gone live, the senior server administrator started sifting through the logs on the database servers. CPU and memory usage were significantly higher than they had ever been before the upgrade, but well within the capabilities of the overengineered server. Finally, she looked at the performance logs for the disk array -- and it hit her like a ton of bricks. The disk array utilization was unbelievably high, so high that the array simply could not keep up with the load. Further analysis revealed that a set of complex stored procedures, all associated with the much-anticipated new functionality, were the culprits.
An emergency meeting was called nine hours after launch. Terrible performance notwithstanding, an entire shift's worth of production data had been logged into the new version. Moreover, no plan had ever been devised for downgrading the client application libraries to the old version. Failback was simply not an option. A slow application was better than shutting the plant down for two or three shifts to manually revert all of the clients to the old version.
The next week was the darkest period in the history of the company's IT department. Production schedules were severely affected, customers were unhappy, and untold productivity was lost. The political fallout was massive. In the end, an expensive new high-performance storage array was implemented and production was able to continue on the new version.
* * *
The military likes to refer to this kind of situation as a "teachable moment." In this disaster, there may be seven or eight things to learn. If your first reaction is that you should never perform an upgrade without an easily executable failback -- or never allow yourself to be a software vendor's guinea pig -- you're absolutely right. But there's also a more specific lesson.
The root cause was the failure to gauge the impact of the new software on the performance of the server and storage architecture. If comprehensive load monitoring had been performed during the testing and training phases, and the results extrapolated to simulate the load of a user base 40 times larger, it would have become immediately apparent that the storage architecture would have to be modified to support the load. Never expect a software vendor to do this for you. They don't know your environment like you do, and ultimately you're the one whose job is on the line if it's not done correctly.
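A back-of-the-envelope version of that extrapolation takes only a few lines. The sketch below scales disk I/O observed during a small pilot up to the production user count and compares it against the array's rated capacity; every number and function name here is a hypothetical illustration, not a figure from the incident.

```python
# Capacity-planning sketch: extrapolate disk load measured during a
# pilot (testing/training) to the full production user base.
# All figures are illustrative assumptions.

def extrapolate_iops(pilot_iops: float, pilot_users: int, prod_users: int) -> float:
    """Scale observed IOPS linearly with user count.

    Linear scaling is a deliberately simple (often optimistic)
    assumption; queuing and contention usually make things worse.
    """
    return pilot_iops * (prod_users / pilot_users)

def storage_utilization(projected_iops: float, array_max_iops: float) -> float:
    """Fraction of the array's rated IOPS the projected load consumes."""
    return projected_iops / array_max_iops

if __name__ == "__main__":
    # Suppose a 10-user pilot drove ~150 IOPS on the array, and
    # production will bring 400 users -- 40 times the pilot load.
    projected = extrapolate_iops(pilot_iops=150, pilot_users=10, prod_users=400)
    utilization = storage_utilization(projected, array_max_iops=4000)
    print(f"Projected: {projected:.0f} IOPS ({utilization:.0%} of array capacity)")
    # A result over 100% means the array cannot sustain the load
    # and the rollout should be blocked until storage is upgraded.
```

Even this crude linear model would have flagged the problem before go-live: any projected utilization near or above 100% is a stop sign, long before real users find it for you.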
As our storage infrastructures grow and become more complex, the number of things that can go wrong grows exponentially along with them. If you have a storage-gone-wrong story, by all means please share it in the comments or drop me an e-mail. I'd love to hear it.