Betwixt

Unfortunately, into every server's life a little rain shall fall. Your bias may lead you to claim that your chosen platform isn't susceptible to certain problems, and that may be so, but other problems will arise, fear not. In a large single-purpose farm, such as a Web server farm, the moving parts in the OS should be few in number. These servers should do one thing and do it well; if they fail at that task, then troubleshooting the problem is likely to be counterproductive. Re-image the server and get on with other things.

For smaller infrastructures, that's a luxury.

Consider the fate of a medium-size corporation with 500 employees. The majority of the users are not pushing Office documents around, but are working in a custom Oracle application to facilitate production, ordering, and shipping. Only about 25% of these employees will actually need to store files on a fileserver, so the fileserver need not be substantial. The IT department is running a 2P 3.06GHz Xeon server with, say, 600GB of local storage and 2GB of RAM running Windows Server 2003. This server is definitely up to the task of sharing files and printers, and is backed up to tape via the network.
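As a rough back-of-the-envelope check on that sizing, the figures above can be run through a few lines of Python. This is only a sketch: it assumes, unrealistically, that all 600GB is available for user files, and the variable names are illustrative.

```python
# Figures taken from the scenario above.
employees = 500
file_user_share = 0.25      # about 25% of employees store files
local_storage_gb = 600

# Assumption: every gigabyte of local storage is usable for user files.
# In practice the OS, page file, and RAID overhead reduce this.
file_users = round(employees * file_user_share)
gb_per_user = local_storage_gb / file_users

print(f"{file_users} file users, about {gb_per_user:.1f} GB each")
```

Even with a generous allowance for overhead, a single modest server covers this load.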

It's not really financially appropriate for this infrastructure to implement redundant systems. Two of these servers are not needed for the load, and purchasing a redundant server and a SAN array isn't in the budget. With adequate backups, recovery from a total failure will be slow, but should result in complete data restoration.

The rain in this server's life isn't likely to come from disk failure, since there's a RAID5 array with a hot-spare, and a failed disk is quickly replaced. It's not likely to come from the users, as they're simply sharing files and printers. The rain is likely to come from the admins, and they will probably never know why.

Specifically, an admin installs Brightstor on the server. The plan is to back up this server to a locally-attached IDE RAID array to facilitate quicker backups and restorations, and permit archiving to tape during production, since the backup window was slipping into the workday. The admin uses a similar non-production system to test the installation beforehand, and all is well. He then installs the application on the production server. Unfortunately, 75% of the way through the installation, the installer simply stops. The window redraws, so the installer isn't completely locked up, but interaction with the installer is not possible. Windows event logs show nothing applicable to the situation, and the admin prudently decides to go to lunch.

When the admin returns from lunch, the installer is in the same state. At this point, he decides that he will have to wait until after-hours, reboot the server, and try again. Unfortunately, the server is now in an unstable state. RPC connections are fine, printer spooling and file sharing are unaffected, but the system will not properly shut down. When a shutdown is triggered, the system simply sits at the desktop, with no attempt made to quit the shell, as the installer process is still running. The admin uses the shutdown.exe command to force the shutdown, and the server reboots.

Unfortunately, the server is still unstable. Logins are fine, logouts hang. When Add/Remove Programs is run to attempt to uninstall the Brightstor application, the window never refreshes with a list of installed apps. This is the Windows Twilight Zone. Nothing can be counted on at this point. Applets may run, they may not. Bizarre errors will be referenced for which there is no known ailment, much less a cure. Throughout it all, file and print services work without issue. The admin looks at his watch, calls home, and will be late for dinner.

Unfortunately, this can happen all too easily. Especially at risk are Microsoft Terminal Servers and Citrix servers that serve applications directly to users. A seemingly benign installation of an application may overwrite files needed by another application and all hell breaks loose, or some other problem shows up from left field. Some of these problems can be forecast by Googling, or talking to the ISV, but many are simply unknowns.

The easiest way out of this predicament requires only a few seconds of time BEFORE the installation. Pull a drive.

If the server referenced above had been built with a RAID1 array for the system partition on separate disks, and a RAID5 array for the data, the admin would have had the opportunity to take an instant, unalterable snapshot of the server before any changes were made. If the admin has a cold-spare disk for the RAID1 system partition, so much the better. He can remove one of the mirrored drives, put it on a shelf with a label of "GOOD-07/28/04", put the cold-spare in the drive bay, and then proceed to make whatever changes are necessary. If all goes well, then the drive on the shelf is the new cold-spare. If the system enters the Twilight Zone, then the admin isn't late for dinner; he simply shuts down the server, pulls the two system drives out, puts the disk pulled before the change back into the box, and boots the system to the exact state the server was in before any changes. The mirror resyncs while he's driving home. Tomorrow will bring more research on why the installer failed, not on how to escape the Windows Twilight Zone on a production system. That's more productive in anyone's book.
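The pull-a-drive routine above can be sketched as a small bookkeeping model in Python. This is an illustration only, not a tool that talks to a RAID controller: the class `SystemMirror`, its methods, and the disk names are all hypothetical, and only the label format follows the example in the text.

```python
from dataclasses import dataclass, field
from datetime import date


def shelf_label(day: date, status: str = "GOOD") -> str:
    """Build a shelf label in the style used above, e.g. GOOD-07/28/04."""
    return f"{status}-{day.strftime('%m/%d/%y')}"


@dataclass
class SystemMirror:
    """Hypothetical model of a two-disk RAID1 system volume."""
    bays: list[str]                                  # disks currently in the server
    shelf: dict[str, str] = field(default_factory=dict)  # label -> shelved disk

    def pull_before_change(self, cold_spare: str, day: date) -> str:
        """Shelve one mirror member and put the cold spare in its bay."""
        pulled = self.bays.pop()
        label = shelf_label(day)
        self.shelf[label] = pulled
        self.bays.append(cold_spare)   # the mirror rebuilds onto the spare
        return label

    def roll_back(self, label: str) -> list[str]:
        """Pull both system drives and boot from the shelved disk alone."""
        suspect = self.bays            # both drives carry the failed change
        self.bays = [self.shelf.pop(label)]
        return suspect


mirror = SystemMirror(bays=["disk-A", "disk-B"])
label = mirror.pull_before_change("cold-spare", date(2004, 7, 28))
# ... the installation goes wrong, so restore the known-good disk ...
suspect = mirror.roll_back(label)
print(label, mirror.bays, suspect)
```

In this model the two drives returned by `roll_back` are the ones that saw the failed change; one of them would go back into the empty bay as the target for the resync.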

By the way, the above story is true, and I'm sure you have one or two to share yourself.