I worked for a sales organization in a large company. In this environment the sales group had their own servers for reporting development. When reporting systems matured enough, they were turned over to the IT group, which used a formal change management methodology.
In 2007 we were reviewing systems and architecture and noted that one of our key servers was a good candidate to be replaced by upgraded hardware located in a formal datacenter under IT control. At the time it lived in a closet in a sales center -- certainly a far cry from a state-of-the-art datacenter. Replacement parts were no longer available.
This server, seven years old, had started out as a report development convenience for sales -- it pulled information from several phone management servers and kept an archive of the consolidated data. This archive strategy allowed the phone servers to purge data older than one month. Eventually more phone servers were added until they numbered at least three dozen. Yet all of their information was consolidated on this one server for convenience of reporting. With the increase in volume, system performance had begun to suffer -- the daily backup often took more than 24 hours to complete -- hence the need to upgrade.
In December 2007 the sales organization engaged the IT organization to develop a migration plan. Sales took a backup of the database and provided it to the IT group so they could look things over, size the requirements, and develop a plan for the changeover. IT was doing a number of these migrations and there were already plenty of projects in the queue ahead of us, so we mutually agreed that July 2008 would work best for the changeover. We had monthly meetings to ensure everything was on track. At the end of April 2008, IT took another backup of the server to capture any schema changes. The purchase orders had been filled out and were due for on-time delivery of the new hardware. Software licensing was all in order.
On a sultry summer day in early June 2008, the automated data pulls from this server started timing out. It was impossible to connect to the server. Other groups were experiencing the same symptom. A conference call was set up and we all dialed in. One of our team members, who worked at the site that housed the server, came on the line and said, "Hey! Guess who finally quit smoking!" We all asked who, and he said the name of the server. The group let out a collective shudder.
The motherboard had failed in a glorious fashion -- all of its smoke had escaped, leaving a hole where the circuit board had once been. The disk array was moved over to another system to check the state of the database. Apparently, as the CPU was going through its death throes, it had just enough energy to reach out and corrupt the database.
OK, just pull from the last backup. But this was basically a sales organization. And that problem with daily backups taking over 24 hours to complete? It turned out that the solution sales devised was to stop doing backups. The last full backup the business had was from six months earlier. Fortunately, IT had taken a full backup at the end of April. The phone servers still had their data, so the month of May was still available. But we had to act fast to ensure that no data was lost. Additionally, the sales organization was running blind without their operational reporting.
The following had to be completed within 48 hours:
- set up the server hardware
- load the operating system, database software, and security apps
- restore the database from the April backup
- copy data from the call center servers to the new server
- update firewall settings so the proper people, applications, and servers could connect (the IT datacenter was on a different domain)
- update the connecting systems and applications to use the new server
Given the sales group's track record of poor tech decisions and blundered execution, things did not look promising.
Have you ever heard a piano played by a young student, and then played by the master teacher? Or seen shop tools handled by a middle school student, and then handled by a master craftsman? What a difference the same tools make in the proper hands.
Our IT group had the master craftsmen needed for the task at hand. This group completed the system resurrection -- and completed it 12 hours ahead of schedule! It boiled down to their preparation, experience, communication, and professionalism.
Preparation: Fortunately we had an IT group that had been regularly migrating servers for the past six months and had the process down to a science. They knew how to quickly and efficiently move data, check data integrity, verify permissions, resolve firewall settings, etc.
Experience: They knew which areas posed the greatest risks and had developed methods and procedures to handle those areas. When permission and firewall issues were discovered, they were typically resolved in under 5 minutes.
Communication: During the conversion period there were checkpoint conference calls to see if we were on schedule and ready to proceed to the next step. Users were informed and involved to verify system functionality as early in the process as possible. It was these checkpoint calls that helped us move ahead of schedule.
Professionalism: The IT group had done this so many times that they could remain calm in a situation that others considered a crisis. This set the tone to help everyone work the problem and not panic.
To this day I am still amazed and impressed with the IT group's handling of our tech emergency.