When you work at an entity that collects sensitive, real-time data and is responsible for keeping it up to date and available to certain public institutions, you’d think a solid backup and disaster recovery plan would be high on the list of organizational priorities. In theory, yes -- but all it takes is one hotshot manager to break what didn't need to be fixed in the first place.
At this organization, branch offices were located in cities throughout several states, and at one time each office maintained its own semi-autonomous IT infrastructure. The sites had their own redundant file servers, database servers, and authentication servers, as well as on-premises IT staff.
One day a new IT director, “Julius,” showed up. He was an MBA who had saved a string of companies lots of money by virtualizing their server infrastructures. While he had lots of experience working with relatively small companies, his experience with large enterprises spread across wide geographic areas was limited.
Virtualization is of course a great way to get more efficiency from your servers and add a level of flexibility that was never available before, but unfortunately Julius ignored some fundamentals of business continuity in his infrastructure design. Having all of your eggs in one basket can make them a lot easier to carry -- but you know how that cliche works out.
Part of the problem or part of the solution?
In his first week in the new role, Julius held a meeting with all of the IT managers and laid out his grand vision for the new server infrastructure. Instead of each site having its own small server farm, they would all be centralized in the business office’s data center.
As the meeting went on, manager reactions began to follow a pattern: The greater their technical expertise, the greater their discomfort with the changes. The biggest concerns brought up: Will the virtual servers have sufficient performance to keep up with the individual sites’ needs? Is there enough bandwidth to serve all the satellite offices? And what happens if the central office’s data center becomes unavailable?
Julius brushed the questions aside with platitudes and jargon: “This is a great opportunity to synergize our infrastructure and reap the benefits of increased operational efficiencies.” Finally, with a note of frustration in his voice, he stopped the discussion and simply warned, “This is happening, so are you going to be part of the problem or part of the solution?”
Despite the managers' concern, Operation Egg Basket proceeded. Several beefy servers were purchased and set up with a common virtualization platform. One at a time, the individual sites’ servers were virtualized, except for the domain controllers, and the old equipment was decommissioned. There were some performance issues, but they were addressed by tweaking the hypervisor. There were also bandwidth issues, but QoS, traffic filtering, and bandwidth upgrades took care of them.
After about a year, the job was done, and Julius patted himself on the back for another successful virtualization rollout. For months everything seemed to work great -- until it didn’t.
First the disaster, then the recovery
Come spring of that year, a violent thunderstorm rolled through and a tornado touched down a mile away from the central business office. The electrical and telephone poles were flattened like grass under a lawn mower, taking out power and phone service throughout the area.
The data center had a giant backup generator, so the power loss was no big deal -- until someone realized that the diesel tank was almost empty. That was easily rectified by some urgent phone calls, although this was a significant detail to have overlooked.
However, the real problem was the loss of the fiber optic link to the data center. All network traffic in the company was configured to route through the central office, so the satellite offices lost access to needed services. They couldn’t even get out to the Internet because the proxy server was at the central office. Most of the VoIP telephones across the enterprise were down, as was voicemail. No file servers, no application servers, no databases, nothing.
For the better part of two days, while the phone company scrambled to get the fiber optic lines back up, the whole company remained down. Workers still had to report to their offices because lots of manual assignments needed to be done, but they were now much harder and slower to do. Very likely, a ton of work simply went undocumented. Finally, the phone company reestablished the lines, and everything started functioning again.
A silver lining
After this incident, Julius saw the writing on the wall and graciously departed for another position at another company -- probably peddling his specialty again, but hopefully a bit wiser.
A new manager who specialized in disaster recovery was brought in, and the infrastructure was overhauled once again, this time to ensure redundancy and resilience by eliminating single points of failure. A hot backup data center was brought online in case the primary went away, and the most critical systems were moved back to the individual satellite offices.
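For anyone who wants to run the same kind of check against their own network, here is a minimal Python sketch of hunting for single points of failure. It is not the tooling used in this story: the site names, the two example topologies, and the helpers `make_links`, `reachable`, and `single_points_of_failure` are all hypothetical, and the model is deliberately crude (each WAN link is treated as just another node that can fail). It simply removes one component at a time and asks whether every office can still reach at least one data center.

```python
from collections import deque


def make_links(pairs):
    """Build an undirected adjacency map from (a, b) connection pairs."""
    links = {}
    for a, b in pairs:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)
    return links


def reachable(links, start, removed=None):
    """Return every node reachable from start when `removed` is offline."""
    if start == removed:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in links.get(node, ()):
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def single_points_of_failure(links, offices, services):
    """List components whose loss cuts some office off from every service node."""
    failures = []
    for node in sorted(set(links) - set(offices)):
        for office in offices:
            seen = reachable(links, office, removed=node)
            if not any(service in seen for service in services):
                failures.append(node)
                break
    return failures


# Hypothetical topologies, loosely modeled on the story (names are made up).
OFFICES = ["office_a", "office_b"]

CENTRALIZED = make_links([
    ("office_a", "fiber_link"),
    ("office_b", "fiber_link"),
    ("fiber_link", "central_dc"),
])

REDUNDANT = make_links([
    ("office_a", "fiber_link"),
    ("office_b", "fiber_link"),
    ("fiber_link", "central_dc"),
    ("office_a", "backup_link"),
    ("office_b", "backup_link"),
    ("backup_link", "backup_dc"),
])

if __name__ == "__main__":
    print(single_points_of_failure(CENTRALIZED, OFFICES, ["central_dc"]))
    print(single_points_of_failure(REDUNDANT, OFFICES, ["central_dc", "backup_dc"]))
```

Under these assumptions, the centralized layout flags both the fiber link and the central data center, while the layout with a second data center on its own link flags nothing. A real audit would also have to account for power, fuel, DNS, proxies, and anything else every site quietly depends on.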
Ultimately, there was an upside to the fiasco. We ended up with a highly resilient infrastructure that properly utilized virtualization while maintaining the other fundamentals of business continuity. Namely: Don’t keep all your eggs in one basket!