For those striving to avoid system downtime, change is enemy No. 1
Ask an expert about data availability and how to ensure it, and the conversation quickly turns to the subject of human error. Not that IT mistakes are the leading cause of unplanned downtime: the research firm Gartner identifies software failures as the chief culprit and “operator error” as the second most common cause, followed in order by hardware outages, building or site disasters, and metro disasters such as storms or floods. But of all of these major causes, human error is the one that IT can really do something about.
IT folks close to the action generally agree with Gartner’s ranking, although some suggest that Gartner may even have underestimated the role of mistakes. Software failures often result from configuration errors, and sometimes they arise from improper testing: an incompatibility isn’t discovered because an application was tested on a different system configuration than the one in production, for example, or performance testing didn’t give the app the workout it would get in real life.
Even many hardware failures can be laid at the feet of IT malpractice. If systems aren’t cooled properly, if they’re improperly racked, or if the procedure for starting them up and shutting them down isn’t followed correctly, equipment life is shortened and premature failures can result. Even for dumb hardware, it pays to read the manual.
But whether it’s software testing practices, hardware maintenance procedures, or the plain old boneheaded mistake lurking in the dark, the question is what to do about it.
If you’ve recently suffered from a blunder-induced outage, you might be tempted to ask, Why me? Mauricio Daher, a principal consultant with the storage services provider GlassHouse Technologies, can tell you: Not enough red tape.
In Daher’s line of work, which is helping large IT organizations prepare for disaster and recover from outages, he’s seen his fair share of glitches attributable to human error.
“Out of those,” he says, “it is mostly, ‘Gee, somebody reconfigured a LUN [logical unit number] that was actually a production LUN but they thought it was something else.’ These are simple things that I see happening again and again because of the nature of my business.”
You might think human error is an equal-opportunity affliction, but these sorts of slips just don’t happen in better-run enterprises, Daher points out. “By the time you get to a point where you can input those commands, you’ve been through so many bits of red tape that it’s impossible to make a mistake,” he says. “That type of mistake really doesn’t happen in a mature organization, because there are so many safeguards.”
Daher and GlassHouse use the CMM (Capability Maturity Model) to evaluate datacenters. Essentially, CMM is a model for process improvement that measures maturity level on a five-point scale. When Daher assesses an IT organization, he looks for standard operating procedures, whether it has SLAs in place, how it measures against those SLAs, and whether there is accountability at various points in the personnel chart.
Training, documentation, and standardization are the essential ingredients of process success. Falling short on the CMM scale typically has more to do with a lack of discipline than a shortage of skills.
“At one end [of the CMM scale], you might have some superstars who do a really good job of managing [the datacenter], and they’re indispensable, but unfortunately they haven’t documented fully, and if one of those guys gets hit by the proverbial bus, you’re in trouble,” Daher says. “And the other extreme is a fully documented environment where everything is automated, and if something’s not automated, there is a manual procedure in place that runs like clockwork.”
Which of those descriptions hits closest to home? Choosing a well-known standard such as ITIL (Information Technology Infrastructure Library) is helpful in that new hires already versed in it will get up to speed in your environment faster, although Daher notes that many successful datacenters had similarly rigorous practices in place years before ITIL became fashionable. The key is that your internal standards be rigorous, well documented, and drilled into everyone in the organization. And those standards should extend all the way down to simple tasks such as configuring a switch and even to the naming conventions used for your zone sets.
That last recommendation came out of Daher’s work with a large oil company, where the two administrators who managed the storage fabric used different naming conventions, and even these were inconsistent. This worked just fine on a day-to-day basis, but it would have been a showstopper if one of those admins, or worse, someone else in the IT organization, had to recover from an outage on his own.
“A lack of consistency in the documentation of such a simple thing seems minor, but it can really kill you and prolong your pain when you’re trying to do really complex things at 2 in the morning.” It all comes down to accountability, Daher says, adding, “If their boss had really been accountable for hard results, that sort of thing just wouldn’t happen.”
Ironing out the process
For Tom Ferris, manager of servers and storage for an international financial institution that prefers to remain nameless, the success of his company’s high-availability initiative depends as much on implementing standardization and controls as it does on traditional disaster-recovery planning. He says most of the problems his group experiences are due to inadequate testing, misconfiguration, or other mistakes, and they are revamping their processes to address them. “A lot of the emphasis of the high-availability program is on putting the technology in place for redundancy and fail-over capabilities and that type of thing, but in my mind that doesn’t really get you high availability,” he says. “Most of the outages that we’ve experienced, and if you look at what the analysts say, most of the outages in general, are not caused by the technology, they’re caused by people making changes.”
The high-availability program dovetails with a utility computing initiative also going on at the company, giving Ferris and his group an opportunity to change the processes for application provisioning and administration in a way that serves both. The goal is to move away from dedicated servers for each application to a shared infrastructure model, in which the application owners will purchase a set of services — compute, storage, availability, and so on — from the IT group.
Each of the IT services will be available in gold, silver, and standard service levels. Before deploying an application, the owners will need to determine how much computing resource it needs, how much storage it needs, and the level of availability it requires, all of which will determine whether the app is deployed on a stand-alone machine, into a cluster with local fail-over, or into a cluster that supports both local fail-over and fail-over to a business continuity site 30 miles away. While each service level maps to a specific standard configuration, the administrative model will be consistent across all three tiers. The consolidated infrastructure dramatically lowers hardware costs, especially for high-availability configurations and, as Ferris notes, especially if you are faced with different groups having their own separate test and dev, staging, and production servers.
“Especially when you get into high availability,” he says, “[having all of your apps running on their own servers] becomes very unwieldy. If you can take all of your Oracle databases and combine them on, let’s say, a three-node cluster, like we’re doing, you can house a lot of databases there. You don’t have to have 15 separate database servers, and based on the requirements of the application you can configure the database for the type of fail-over you need pretty easily, because you’ve already got your cluster built.”
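The service-level scheme Ferris describes can be pictured as a simple lookup from an availability tier to a standard deployment target. The sketch below is illustrative only: the article names the three tiers and the three deployment options but does not say which tier maps to which, so the pairing here is an assumption, as are the names `Availability` and `deployment_target`.

```python
from enum import Enum


class Availability(Enum):
    """Service levels named in the article."""
    STANDARD = "standard"
    SILVER = "silver"
    GOLD = "gold"


# Assumed pairing of tiers to deployment options; the article lists
# both sets but does not state which tier gets which configuration.
DEPLOYMENT_TARGETS = {
    Availability.STANDARD: "stand-alone machine",
    Availability.SILVER: "cluster with local fail-over",
    Availability.GOLD: "cluster with local fail-over and fail-over to the business continuity site",
}


def deployment_target(level: Availability) -> str:
    """Return the standard configuration for the requested service level."""
    return DEPLOYMENT_TARGETS[level]


if __name__ == "__main__":
    for level in Availability:
        print(f"{level.value}: {deployment_target(level)}")
```

Because every application is placed through the same lookup, the choice of tier changes the target configuration without changing how the application is administered.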
One key element is standardizing on configurations for production servers and ensuring that the servers in test and development match it. A central group responsible for release management will usher any new code or changes into production, making sure they are bundled up from test and development, put into staging, run through a checklist of tests, and finally promoted into production.
“In the staging and production environments, the application developers and application owners won’t have administrative access anymore,” Ferris explains. “They might not even have administrative access in test and development.” If they do, Ferris says, the environment would be closely managed to ensure that the configurations in testing match those of production servers.
The IT group uses BladeLogic to manage those configurations and control releases, and to run compliance reports to check for variance from standard configurations. The controls help prevent mistakes from impacting production servers, and the standard system images help speed up provisioning — a benefit that extends to disaster recovery.
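A compliance report of this kind boils down to comparing each server's settings against the standard configuration and listing the differences. The following is a minimal sketch of that idea in plain Python; it is not BladeLogic's interface, and the function name `find_drift` and the sample settings are hypothetical.

```python
def find_drift(baseline: dict, actual: dict) -> dict:
    """Return settings that differ from the baseline.

    Each entry maps a setting name to (expected, found). A setting that is
    absent on one side is reported with None in that position.
    """
    drift = {}
    for key in sorted(baseline.keys() | actual.keys()):
        expected = baseline.get(key)
        found = actual.get(key)
        if expected != found:
            drift[key] = (expected, found)
    return drift


if __name__ == "__main__":
    # Hypothetical standard configuration and a server that has drifted.
    standard = {"os_patch_level": "5", "ntp_server": "10.0.0.1", "max_open_files": "65536"}
    server = {"os_patch_level": "4", "ntp_server": "10.0.0.1", "debug_mode": "on"}

    for setting, (expected, found) in find_drift(standard, server).items():
        print(f"{setting}: expected {expected!r}, found {found!r}")
```

Running the same comparison against test, staging, and production servers is one way to confirm that what was tested matches what will run.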
“We’ve packaged the configuration of [our] Veritas cluster server, the baseline OS, and the Oracle database into a reusable configuration that makes it easy to rebuild the environment from scratch,” Ferris says. “You can set variables for IP addresses, so it’s easy to re-create a multitier application in a new environment.”
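The idea of a reusable configuration with variables for IP addresses can be illustrated with a small template. This is a sketch using Python's standard `string.Template`, not the packaging format Ferris's group uses; the setting names, addresses, and port are made up for the example.

```python
from string import Template

# Hypothetical configuration fragment; environment-specific values are
# left as $variables so the same template serves every environment.
CLUSTER_TEMPLATE = Template(
    "node_name=$node_name\n"
    "service_ip=$service_ip\n"
    "db_listener=$service_ip:$db_port\n"
)


def render_config(template: Template, **variables: str) -> str:
    """Fill in the template; raises KeyError if a variable is missing."""
    return template.substitute(**variables)


if __name__ == "__main__":
    # The same template rendered for a production site and a recovery site.
    print(render_config(CLUSTER_TEMPLATE, node_name="db1", service_ip="10.0.0.5", db_port="1521"))
    print(render_config(CLUSTER_TEMPLATE, node_name="db1-dr", service_ip="10.8.0.5", db_port="1521"))
```

Using `substitute` rather than `safe_substitute` is deliberate: a missing address stops the build with an error instead of producing a half-filled configuration.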
Investing in availability
In addition to providing important safeguards and making complex infrastructure easier to manage, the combination of standard configurations, standard procedures, automated provisioning tools, and a consolidated infrastructure helps to drive down the cost of high availability. Other technologies are playing a role here, too, notably clustered storage and server virtualization. (See sidebar.)
But while many of the associated costs are coming down, keeping datacenters running will always require significant investment in the people who maintain them, not to mention the time and effort poured into improving the processes by which the whole infrastructure is managed. Training, standards, and careful management of changes will only increase in importance as applications continue to become more complex and more interdependent.
You might find a good lesson in the famous case of the missing NetWare server that ran for four years after being sealed behind a wall by construction workers: The best thing you can do for a system is to leave it alone. Of course, that’s not possible for most business applications, especially in these days of rapid change. But if you can’t build a wall, you can at least start laying down some red tape.