Improve availability of enterprise data
For those striving to avoid system downtime, change is enemy No. 1
“At one end [of the CMM scale], you might have some superstars who do a really good job of managing [the datacenter], and they’re indispensable, but unfortunately they haven’t documented fully, and if one of those guys gets hit by the proverbial bus, you’re in trouble,” Daher says. “And the other extreme is a fully documented environment where everything is automated, and if something’s not automated, there is a manual procedure in place that runs like clockwork.”
Which of those descriptions hits closest to home? Choosing a well-known standard such as ITIL (Information Technology Infrastructure Library) is helpful in that new hires already versed in it will get up to speed in your environment faster, although Daher notes that many successful datacenters had similarly rigorous practices in place years before ITIL became fashionable. The key is that your internal standards be rigorous, well documented, and drilled into everyone in the organization. And those standards should extend all the way down to simple tasks such as configuring a switch and even to the naming conventions used for your zone sets.
That last recommendation came out of Daher’s work with a large oil company, in which the two administrators who managed the storage fabric used different naming conventions, and even those were applied inconsistently. This worked just fine on a day-to-day basis, but it would have been a showstopper if one of those admins, or worse, someone else in the IT organization, had needed to recover from an outage alone.
“A lack of consistency in the documentation of such a simple thing seems minor, but it can really kill you and prolong your pain when you’re trying to do really complex things at 2 in the morning.” It all comes down to accountability, Daher says, adding, “If their boss had really been accountable for hard results, that sort of thing just wouldn’t happen.”
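One way to keep a naming standard from drifting is to check names against it automatically rather than relying on memory. The Python sketch below is purely illustrative: the convention it enforces (zs_<site>_<fabric>_<purpose>), the sample names, and the function name are all assumptions made for the example, not details from Daher or the oil company. Substitute whatever standard your own organization has documented.

```python
import re

# Hypothetical convention (not from the article): zs_<site>_<fabric>_<purpose>,
# for example "zs_hou_a_backup". Replace this with your documented standard.
ZONE_SET_PATTERN = re.compile(r"zs_[a-z]{3}_[ab]_[a-z0-9]+")


def nonconforming(names):
    """Return the names that do not match the documented convention."""
    return [n for n in names if not ZONE_SET_PATTERN.fullmatch(n)]


if __name__ == "__main__":
    # Made-up sample names standing in for an export from the fabric.
    sample = ["zs_hou_a_backup", "Prod-ZoneSet1", "zs_hou_b_prod", "hou_zs_b"]
    for name in nonconforming(sample):
        print(f"nonconforming zone set name: {name}")
```

Run as part of a regular audit or a change-approval step, a check like this flags inconsistencies long before anyone has to untangle them at 2 in the morning.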
Ironing out the process
For Tom Ferris, manager of servers and storage for an international financial institution that prefers to remain nameless, the success of his company’s high-availability initiative depends as much on implementing standardization and controls as it does on traditional disaster-recovery planning. He says most of the problems his group experiences are due to inadequate testing, misconfiguration, or other mistakes, and the group is revamping its processes to address those causes. “A lot of the emphasis of the high-availability program is on putting the technology in place for redundancy and fail-over capabilities and that type of thing, but in my mind that doesn’t really get you high availability,” he says. “Most of the outages that we’ve experienced, and if you look at what the analysts say, most of the outages in general, are not caused by the technology, they’re caused by people making changes.”
The high-availability program dovetails with a utility computing initiative also going on at the company, giving Ferris and his group an opportunity to change the processes for application provisioning and administration in a way that serves both. The goal is to move away from dedicated servers for each application to a shared infrastructure model, in which the application owners will purchase a set of services — compute, storage, availability, and so on — from the IT group.