Confidentiality, integrity, and availability (CIA) are the standard concepts encapsulating security. I often talk about the first two, but rarely the last. This column will be different, and I'll cover a few topics surrounding fault tolerance and disaster recovery. After all, if the service or server is not available for legitimate use, then you don't really have to worry about all that other security stuff.
For any high-availability service, make sure you have two or more servers in two or more locations that cannot be taken down by the same disaster. Many companies I know locate their fault-tolerant servers on opposite sides of the same city. Refer to any recent flood, hurricane, or tornado news to see how futile that idea might be. A fair-sized rolling power outage might take down an entire city. Better to place the redundant servers in different states, on different sides of the country, or in different countries altogether, if possible.
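One rough way to audit site separation is to compute the great-circle distance between every pair of locations and flag the pairs that sit too close together. The sketch below assumes you keep a table of site names and latitude/longitude pairs; the function names, the sample coordinates, and the 500 km threshold are illustrative assumptions, not a recommendation. Distance is only a proxy, since two distant sites can still share a power grid or a carrier.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def great_circle_km(a, b):
    """Haversine distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def sites_too_close(sites, min_km):
    """Return (site, site, rounded km) for every pair closer together than min_km."""
    flagged = []
    for (name1, pos1), (name2, pos2) in combinations(sorted(sites.items()), 2):
        distance = great_circle_km(pos1, pos2)
        if distance < min_km:
            flagged.append((name1, name2, round(distance)))
    return flagged

if __name__ == "__main__":
    # Hypothetical sites: two on opposite sides of one city, one far away.
    sites = {
        "east-side": (35.15, -89.85),
        "west-side": (35.15, -90.05),
        "remote": (45.52, -122.68),
    }
    for pair in sites_too_close(sites, min_km=500):
        print("Too close:", pair)
```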
Servers should have redundant drives, and for performance reasons, logs and databases should reside on separate physical drives whenever possible. If you're using RAID, go with the RAID technology that gives you the best performance bang for the available buck.
Two servers are better than one
Should you cluster your servers or use two or more independent computers to serve up the same service? Clustering implies that two or more computers share the same database, configuration, and service name. The upside of clusters: When one node goes down, the others have access to the original's data and continue to process requests without interruption. The downside: Because every node shares the same database and configuration, a single corrupted record or bad setting reaches all of them at once and can take the whole application down.
I remember the first time I spent $150,000 to cluster two servers. I made everything high-performance and high-availability, including utilizing a separate backplane channel for the clustering fail-over. I promised my CEO that we would be up 100 percent of the time. Boy, I was innocent back then. The next day, some random piece of data got corrupted, each of the participating cluster nodes dutifully duplicated the corruption, and the entire $150,000 solution went down hard. The CEO wasn't happy. There's something to be said for skipping clusters and using regular load balancing between separate servers instead.
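To make the load-balancing alternative concrete, here is a minimal sketch of round-robin selection across independent servers that skips any server marked as down. The class and method names are hypothetical, and a real load balancer would run its own health checks rather than rely on someone calling `mark_down`. The point is only that each server keeps its own state, so one failure does not drag the others with it.

```python
class RoundRobinBalancer:
    """Rotate requests across independent servers, skipping ones marked down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0  # index of the next server to try

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        """Return the next healthy server, or raise if none are left."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")

if __name__ == "__main__":
    balancer = RoundRobinBalancer(["server-a", "server-b"])  # hypothetical names
    print([balancer.pick() for _ in range(3)])
    balancer.mark_down("server-a")
    print([balancer.pick() for _ in range(2)])
```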
Of course, virtualization is changing disaster recovery in a big way. Several companies I know are using VMs to virtualize everything across two or more datacenters. If a server or an entire datacenter goes down, just move the services to the other waiting virtualized servers. Several virtualization vendors even provide software to make the whole process nearly seamless.
In the near future, cloud services might be part of the solution, but the clouds are still immature and often can't sustain the uptime you can provide yourself. That should change as service availability contracts and other revenue obligations take over.
Network and end-user systems
Are your wide-area networks redundant? Does your current bandwidth provider guarantee your company two ways into and out of your buildings and onto the WAN? Or do you use two vendors? Either way, make sure that they aren't sharing the same fiber lines or data pathways into your building. Often they do, and cut fiber lines account for a significant percentage of data interruption events.
If you have high-availability servers on the Internet, do you have defenses against distributed-denial-of-service attacks? Can your current anti-DDoS solution handle an attack of any size? Even if your anti-DDoS solution can thwart a massive attack, can your ISP handle it if the attack moves downstream, or will you be thrown out to save the rest of the clients? If you have to move your servers to get out of the way of an attack, do you have pre-arranged agreements for a fast switch-over? How long does it take to update your DNS entries? Have you tested the above?
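On the DNS question, a back-of-the-envelope estimate helps: a resolver that cached the old record just before you changed it can keep serving it for a full TTL. The sketch below adds up the delays. The function and parameter names are my own, and the model is an assumption; it ignores resolvers that disregard TTLs, so treat the result as a floor and confirm it with a real test.

```python
def worst_case_switchover_seconds(record_ttl, publish_delay, detect_delay=0):
    """Estimate how long some clients may still reach the old address.

    record_ttl:    TTL of the DNS record being changed, in seconds
    publish_delay: time to get the change live on the authoritative servers
    detect_delay:  time before anyone decides to switch (0 if already decided)
    """
    return detect_delay + publish_delay + record_ttl

def meets_target(record_ttl, publish_delay, detect_delay, target_seconds):
    """True when the estimated switch-over fits inside the recovery target."""
    return worst_case_switchover_seconds(record_ttl, publish_delay, detect_delay) <= target_seconds

if __name__ == "__main__":
    # Hypothetical numbers: one-hour TTL, five minutes to publish, two to decide.
    print(worst_case_switchover_seconds(3600, 300, 120), "seconds")
    print("Fits a 15-minute target:", meets_target(3600, 300, 120, 900))
    print("With a 5-minute TTL:", meets_target(300, 300, 120, 900))
```

Run with those numbers, the one-hour TTL alone blows a 15-minute target, which is the argument for lowering TTLs on critical records before you need to move.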
Are your end-users' systems reliable? Are you using stable drivers and testing the stability of new drivers before deploying them? I know of many companies that have caused their end-users more problems because they attempted to give them some fancy, new functionality they requested. If an end-user's system is not reliable, that user won't care much about the new features.
Consider using metrics to determine which systems are least reliable and which are most reliable, and tease out the lessons to be learned. Do you monitor free hard drive space, network utilization, and CPU utilization? Do you make sure that users are backing up their data?
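As a starting point for the free-space check, here is a minimal sketch that uses Python's standard `shutil.disk_usage` to flag volumes running low. The 10 percent threshold and the function names are assumptions for illustration, and a real monitoring setup would also track network and CPU figures and trend them over time.

```python
import shutil

def is_low_on_space(total_bytes, free_bytes, min_free_fraction):
    """True when the free share of a volume drops below min_free_fraction."""
    return total_bytes > 0 and free_bytes / total_bytes < min_free_fraction

def low_space_paths(paths, min_free_fraction=0.10):
    """Return (path, free bytes) for each path whose volume is low on space."""
    flagged = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free
        if is_low_on_space(usage.total, usage.free, min_free_fraction):
            flagged.append((path, usage.free))
    return flagged

if __name__ == "__main__":
    # "." stands in for whatever mount points or drive letters you care about.
    for path, free in low_space_paths(["."]):
        print(f"Low disk space on {path}: {free // (1024 ** 2)} MiB free")
```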
Do you test your backups? Many companies find out all too late that expensive tape backup systems and diligent, well-meaning administrators do not a reliable data recovery system make. Test, test, test.
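One simple form of backup test is to restore to a scratch location and compare checksums against the originals. The sketch below does that with SHA-256. The function names are hypothetical, and it assumes the source files have not changed since the backup ran, so it suits static data or a snapshot better than a busy live volume.

```python
import hashlib
from pathlib import Path

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_restore(source_dir, restored_dir):
    """Compare every source file with its restored copy; return the problems found."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for source_file in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        relative = source_file.relative_to(source_dir)
        restored_file = restored_dir / relative
        if not restored_file.is_file():
            problems.append((str(relative), "missing"))
        elif file_digest(source_file) != file_digest(restored_file):
            problems.append((str(relative), "content differs"))
    return problems
```

An empty list means every source file came back intact; anything else is a restore failure you would rather find today than during a real recovery.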
When alert systems fail
Will your event monitoring system be able to alert you if e-mail or the phone system is down? Many of the very high-volume worms of the late 1990s and early 2000s (think Melissa, Iloveyou, Slammer, etc.) so overpowered the network that alerting systems failed.
When the Iloveyou worm attacked, I remember my pager going off over and over again with messages of undying love while I was at an offsite location. I knew something was going on, so I tried to call on my cell phone. It didn't work. I tried to use a landline, and it was dead, too. The Internet worms were being replayed into e-mail and SMS systems, rendering them useless. It took more than an hour for the landlines to recover, and my cell phone remained unusable for most of the day.
If the big one ever hits again, don't expect the Internet to function properly. It won't. Oh, you'll know you have a problem, but you won't be able to contact anyone to find out what it is or be able to direct technicians to the scene of the crime.
What do you do in that circumstance? I'm not sure. In one instance, we used overhead PA systems and posted handwritten signs at entryways to reach employees and staff. The key is to expect the unexpected and build it into the response plan. Who calls whom? Is your contact list up to date? What are the second, third, and fourth communication methods? I suggest RFC 1149 for the last line of communications. It works in Brazilian prisons.
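The "second, third, and fourth communication methods" idea can be written down as an ordered list of channels that are tried in turn until one gets through. This is a minimal sketch under that assumption; the channel names and sender functions are placeholders for whatever your plan actually uses, and a manual channel such as the PA system would need a human in the loop.

```python
def notify_with_fallback(channels, message):
    """Try each (name, send) channel in order until one reports delivery.

    Each send function takes the message and returns True on success.
    An exception or a falsy return counts as a failed attempt.
    Returns (name of the channel that worked or None, list of attempts).
    """
    attempts = []
    for name, send in channels:
        try:
            delivered = bool(send(message))
        except Exception:
            delivered = False
        attempts.append((name, delivered))
        if delivered:
            return name, attempts
    return None, attempts

if __name__ == "__main__":
    # Hypothetical senders standing in for real integrations.
    def email_down(message):
        raise ConnectionError("mail server unreachable")

    def sms_down(message):
        return False

    def pa_system(message):
        print("PA announcement:", message)
        return True

    plan = [("email", email_down), ("sms", sms_down), ("pa", pa_system)]
    print(notify_with_fallback(plan, "Worm outbreak: unplug from the network"))
```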
This will sound strange, unless your company has dealt with a recent global outbreak (such as Conficker), but I think the lack of 10-minute, global Internet worms over the last half decade has lulled us into a false sense of security. We think we have crisis response down, but perhaps we have just been lucky. Make sure your availability (and response) plans are up to date and ready to react to whatever heads your way.