Last week I wrote a piece on why it isn't a good idea to reboot Unix boxes at the first sign of trouble without first trying to determine what actually went wrong. The post was predicated on my 20 years of wrangling Unix and Linux boxes. I thought it was an innocuous point, but it stirred up a hornet's nest.
From some of the feedback I received via email and in the ensuing massive Slashdot thread, you'd think that Unix admins reboot their boxes every morning during their first cup of coffee. Yes, I got positive comments from those who understood what I was saying. But there were plenty of others who seemed to think that rebooting is something best done early and often. (For the record, I couldn't give a rat's ass about uptime numbers -- that's completely unrelated.)
I just don't get it. What the hell has gone wrong with system administration?
One of the more telling comments I received was the idea that since the advent of virtualization, there's no point in trying to fix anything anymore. If a weird error pops up, just redeploy the original template and toss the old VM on the scrap heap. Similar ideas revolved around re-imaging laptops and desktops rather than fixing the problem. OK. Full stop. A laptop or desktop is most certainly not a server, and servers should not be treated that way. But even that's not the full reality of the situation.
I'm starting to think that current server virtualization technologies are contributing to the decline of real server administration skills. Sure, you can redeploy a template in 10 minutes or so, but that's only a "solution" if you're talking about a stateless server that gets its data from elsewhere in the infrastructure and operates in a farm. You're not going to wantonly redeploy a database server just because something went slightly wrong somewhere, are you? Sure, it's possible to build structures to support that, but unless it's a massive infrastructure where there are dozens or hundreds of database servers, what's the point?
This has always been the (often undeserved) joke about clueless Windows admins: They have a small arsenal of possible fixes, and once they've exhausted the supply, they punt and rebuild the server from scratch rather than dig deeper. On the Unix side of the house, that concept has been met with derision since the dawn of time, but as Linux has moved into the mainstream -- and the number of marginal Linux admins has grown -- those ideas are suddenly somehow considered rational.
The worst part is that these ideas are not just limited to proper reboot etiquette. There are numerous examples of poor Unix hygiene in many shops. To me, that's deeply unsettling, especially because some of the main tenets of Unix administration are structured explicitly to encourage proper maintenance of the operating system. The simplest example of this is filesystem structure: Configuration info goes in /etc, logs in /var, local files in /usr/local, libraries in /lib and /usr/lib, and so forth. Sure, you can scatter crap all over the disk, but everything is much simpler and cleaner if you color within those particular lines.
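To make that concrete, here's a minimal sketch of what coloring within those lines looks like for a hypothetical locally built service I'll call "myapp" (the name and paths are illustrative, not from any real package):

```shell
#!/bin/sh
# Hypothetical service "myapp": where each piece belongs under the
# traditional Unix filesystem layout described above.
conf="/etc/myapp/myapp.conf"      # configuration info goes in /etc
logs="/var/log/myapp"             # logs go in /var
bin="/usr/local/bin/myapp"        # locally built binaries in /usr/local
lib="/usr/local/lib/myapp"        # locally built libraries in /usr/local

# An admin (or a script) can then find everything without guesswork:
for path in "$conf" "$logs" "$bin" "$lib"; do
    echo "expected location: $path"
done
```

The payoff is exactly the point above: anyone who knows the conventions can walk onto the box cold and know where to look, instead of hunting for config files scattered across the disk.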
But if all it takes is a few clicks of a mouse in vSphere's Windows-based client to pop out a cloned server instance (ostensibly built by someone who knew what they were doing), then what does it matter? It's all very convenient and cool, right?
Wrong. If you don't understand the underpinnings, you're missing the point. Anyone can drive the car, but if it doesn't start for some reason, you're helpless. That's a problem if you're paid to know how to fix the car.
This story, "The decline and fall of system administration," was originally published at InfoWorld.com. Read more of Paul Venezia's The Deep End blog at InfoWorld.com.