At the time of this story, I worked for a large tech corporation, and we ran into an example of how, no matter how much planning is done, human error still creeps into day-to-day work.
Several administrators, who due to outsourcing were located in different countries, were asked to take on a project. One particular customer had a large database that was outdated. We supported basically everything, and everything needed an upgrade, including server hardware, operating system, cluster, and database software.
This being a critical database, all precautions would be taken, including having a current backup in place and a neat fallback plan in case anything went wrong. Instructions for the change were written down and agreed upon by all the technical parties involved. It was a pretty straightforward operation, and everything was ready for an easy, quick, and painless upgrade, with minimal downtime.
The change took place, for some reason, on a Sunday evening. First, a full backup was taken and stored on one of the file systems that were about to be moved to the new server. Then the migration started.
Everything went fine for a while: All the file systems were mounted on the new servers, which were already running a shiny new operating system and clustering software. Then the server administrator ran a script, provided by the database administrator, which was supposed to upgrade the database.
One crucial mistake was made at this point: The script was designed both to install a new instance and to upgrade an existing one. The different behavior was controlled by a single command-line switch, which would indicate an upgrade; otherwise, a new install was assumed.
That command-line option was missing from the written change instructions that the server administrator followed. As a feature, when installing a new instance, the script made sure no old files were lying around that could conflict with the install: For good measure, it issued a separate command to wipe out all of the database-related file systems before installing the binaries.
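For readers who want to picture the mechanism, here is a minimal sketch of a script whose behavior hinges on one optional switch. It is not the actual script from the story: the program name `db_setup`, the flag name `--upgrade`, and the step descriptions are all assumptions made up for illustration, and the sketch only lists the steps it would take rather than touching any files.

```python
import argparse


def plan_actions(argv):
    """Return the ordered steps this hypothetical script would take."""
    parser = argparse.ArgumentParser(prog="db_setup")
    # Hypothetical flag name: the story only says a single switch
    # selected upgrade mode.
    parser.add_argument(
        "--upgrade",
        action="store_true",
        help="upgrade an existing instance instead of installing a new one",
    )
    args = parser.parse_args(argv)

    if args.upgrade:
        return ["upgrade existing instance in place"]
    # No switch given: a new install is assumed, so old files go first.
    return ["wipe database file systems", "install binaries"]


if __name__ == "__main__":
    # Running with no arguments shows the default (new install) plan.
    for step in plan_actions([]):
        print(step)
```

Under these assumptions, leaving out one word on the command line silently selects the destructive path, which is the trap the written instructions fell into.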
The server administrator watched as the script ran for several minutes. He somehow overlooked some messages from the script about how it was removing all files from here and there, as they got buried among several screens of information. The script finished with no errors, so the database administrator proceeded to start the database.