This week I'm taking a break from railing against the impending demise of the Internet as we know it to chat about one of my favorite aspects of data center tech and IT in general: automation. In IT, and especially in data center architecture, construction, and maintenance, we derive satisfaction from building new things and watching them work. It isn't too far removed from building that motorized Lego car when we were eight -- we like to make stuff work, then play with it. This is true throughout all aspects of a data center, from the network to the storage and back again, and definitely with the software running on our creations. We build stuff.
Once we've built stuff, we begin refining it. We tuck in the edges here, make adjustments there, and monitor everything to make sure it's exactly as it should be. In most cases, this will ultimately involve some level of automation, which is where the scripting and development toolbox comes into play. We work up some code to make a manual task automatic, put it into production, and move on to the next item.
Ideally, we've written that code with as much error checking as possible, but I've come across more than my fair share of relatively critical automation scripts with no real error checking to speak of. Now you're manufacturing problems, as if there weren't enough organic issues to deal with.
Let's try a real-world example. We have a virtual server template that's built to allow for automatic service scaling. This template gets used to build out Web servers when load increases on a Web app. This is simple stuff -- we just need to be able to push a button (or automate it!) and have another half-dozen Web servers get spun up.
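To make that concrete, here is a minimal sketch of what that button push amounts to, assuming some provisioning API sits underneath. The Provisioner interface, the clone_from_template call, and the template name are all hypothetical stand-ins for whatever your virtualization platform or cloud SDK actually exposes.

from typing import Protocol

# Hypothetical provisioning interface; a stand-in for whatever your
# virtualization platform or cloud SDK actually provides.
class Provisioner(Protocol):
    def clone_from_template(self, template: str, name: str) -> str: ...

TEMPLATE = "web-server-template"  # assumed template name

def scale_out(client: Provisioner, count: int = 6) -> list[str]:
    """Clone the Web server template 'count' times and return the new server names."""
    return [client.clone_from_template(TEMPLATE, name=f"web-{i:02d}")
            for i in range(count)]

The interesting part isn't the loop; it's everything that has to happen on those servers once they boot.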
Let's assume that we already have the hooks in place to tweak the load balancer and add the fresh Web servers to the pile, so all we're really concerned with is making sure the application stack on the servers is stable and healthy when they boot. We write some code and stuff it in an init script that reaches out to other servers to download certain variable elements that each Web server needs before it can operate properly. This, again, is simple stuff. We can automate an rsync or scp process and pull whatever we need. We can write and test that code very quickly and easily, and it will likely work fine.
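In Python rather than shell, a quick-and-dirty version of that pull might look like the following sketch. The host name and paths are invented for illustration, and, tellingly, nothing here checks whether the copy actually worked.

import subprocess

# Pull the variable elements each Web server needs at boot.
# The host and paths are placeholders, not real infrastructure.
def fetch_app_config():
    subprocess.run([
        "rsync", "-az",
        "deploy@config-master.example.com:/srv/webapp/config/",
        "/etc/webapp/config/",
    ])
    # No return-code check, no verification that the files arrived.
    # If the pull quietly fails, the Web server boots anyway, half-configured.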
However, if we have not put enough error checking in that code, we may find that in six months, the entire app starts intermittently crashing. Perhaps a file name changed, or a server was replaced, or someone altered an authorized_keys file, or what have you. That seemingly innocuous change rippled through the infrastructure. Now when those Web servers spin up, they can't access something they need for proper operation.
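The fix costs almost nothing up front. Here is a sketch of the same pull with real error checking, again with invented host and file names, that fails loudly instead of letting a half-configured server come up as if everything were fine.

import os
import subprocess
import sys

REMOTE = "deploy@config-master.example.com:/srv/webapp/config/"  # placeholder
LOCAL = "/etc/webapp/config/"                                    # placeholder
REQUIRED_FILES = ["app.conf", "secrets.env"]                     # hypothetical

def fetch_app_config_checked():
    try:
        subprocess.run(["rsync", "-az", REMOTE, LOCAL], check=True, timeout=120)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as exc:
        sys.exit(f"config pull failed: {exc}")  # nonzero exit; don't come up broken

    # Verify that the pieces the app actually needs made it down.
    missing = [f for f in REQUIRED_FILES
               if not os.path.exists(os.path.join(LOCAL, f))]
    if missing:
        sys.exit(f"config pull incomplete, missing: {missing}")

Because the init script exits nonzero, the automation that adds the new server to the load balancer has a chance to notice and raise an alarm instead of quietly sending traffic to a broken box.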