Kludges are a scourge -- except when they aren't

You should know a real kludge when you see one, but sometimes a quick hack turns out to be the best way to go.

Loyd Case

When you work in IT, you quickly discover situations where a slapdash solution is needed right away -- because the better, long-term solution would take too long to put in place while something is crumbling to bits.

These fixes are known as kludges. They're best used sparingly, then replaced with a respectable solution as soon as possible.

A major source of pain in IT comes from having to maintain kludges that have existed far beyond their intended lifetime and have become the long-term fix by default, though their upkeep is a burden on the infrastructure as a whole.

The nature of the beast

Kludges can be found anywhere, from small scripts running on their own to code embedded deep in a larger application or framework. They don’t have to be scripts or even code. A kludge might be a symlink to a network share that was set once and forgotten, but is now depended upon by an application or service that will fail spectacularly if that share goes away -- yet nobody remembers it exists or why it's there.

Heck, some kludges are physical, such as a production server sitting in a rack without rails -- until, naturally, someone needs to pull the system it’s resting on. Then there's the miasma of power cabling behind a rack that includes a $5 switchable power strip because someone ran out of PDU outlets and was in a hurry. Trip over that last kludge (literally) and you'll take out critical services.

Another example: Say we discover a problem with a specific service on a specific server. This service dies without fanfare a few times a week, but doesn’t leave behind any useful indicators about why it croaked. It seems random, but occurs every 36 to 48 hours. The fix for this problem is to work through whatever logging might be available, profile the process, and see what it’s doing when it dies -- and start searching mailing lists or forums for an answer, assuming it’s a commercial or open source product. The kludge would be to schedule that service to be restarted every 24 hours.

To maintain the peace, the scheduled restart might be implemented while the search for the ultimate answer is under way, then removed once a suitable fix is found. But many times that second part never quite happens, and the problem falls through the cracks because the service is "working."
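One way to keep a restart kludge from quietly becoming permanent is to make it announce itself. What follows is a minimal sketch, not a recommendation for any particular service: it is written in portable shell that also runs under Bash, and the ticket number, review date, service name, and file paths are all hypothetical placeholders. It assumes a systemd host, which is why the default restart command is `systemctl restart`.

```shell
# KLUDGE: daily restart of a service that dies every 36 to 48 hours.
# This is a stopgap, not the fix. Values marked "hypothetical" are
# placeholders to replace with local ones.

KLUDGE_TICKET=${KLUDGE_TICKET:-OPS-0000}        # hypothetical ticket tracking the real fix
KLUDGE_REVIEW_BY=${KLUDGE_REVIEW_BY:-20300101}  # hypothetical review date, YYYYMMDD
RESTART_CMD=${RESTART_CMD:-systemctl restart}   # assumes systemd; override as needed

# kludge_restart SERVICE LOGFILE
# Logs why the restart exists, nags once the review date has passed,
# then restarts the service.
kludge_restart() {
    service=$1
    logfile=$2
    today=$(date +%Y%m%d)

    # YYYYMMDD dates compare correctly as plain integers.
    if [ "$today" -gt "$KLUDGE_REVIEW_BY" ]; then
        echo "$today OVERDUE: restart kludge for $service was due for review on $KLUDGE_REVIEW_BY (see $KLUDGE_TICKET)" >> "$logfile"
    fi

    echo "$today restarting $service as a stopgap (see $KLUDGE_TICKET)" >> "$logfile"
    # Intentionally unquoted so "systemctl restart" splits into two words.
    $RESTART_CMD "$service"
}

# Example crontab entry (hypothetical paths), running daily at 04:00:
# 0 4 * * * . /usr/local/lib/restart-kludge.sh && kludge_restart flaky-service /var/log/restart-kludge.log
```

The restart itself is one line. The rest exists so that whoever reads the log a year from now can see that this was meant to be temporary and where the real investigation is tracked.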

Kludge or solution?

Not all kludges are necessarily kludges. Here's what I mean: In my opinion, a kludge becomes a solution, though it may be a weak one, when it’s fully documented. Also, I’ve seen plenty of situations where something is considered a kludge, but efforts to improve on or reinvent it entirely simply lead back to the original fix -- which by definition means it isn’t a kludge anymore. It's ugly, perhaps, but no one can come up with a better option.

For example, we might have a one-off situation where a user has requested that an automatically generated file be sent to them every day. This file is an audit log on a running server. It doesn’t contain sensitive information, but it's critical for a specific project, so the requesting user needs to inspect it daily.

Assuming the application or service creating this log is not capable of emailing the log and can only dump it out to a file, would it then be a kludge to write a few lines of Bash to email the file to the user? What would the alternative be?

I suppose the application or service that created the file could have the feature written into the codebase, but that’s not going to happen soon, if ever. Should we try to copy the file to a fileshare for the user to pick up? That solution is worse, because it depends on a functional mount of the remote filesystem. Sending an email doesn’t require state on a single mount; it only requires that the mail server is operational. Should we install a Web application or expose that file via FTP or HTTP? That’s not a great idea for a variety of security and workflow reasons. Do we give the user an account on the server and teach them how to connect and navigate to the file? No, that’s a terrible idea. Then should we tell the user they’re out of luck?

No, we write a few lines of Bash that email the damned file, or if applicable, use the log emailing features in logrotate, and document it. That's what I call a solution, not a kludge.
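For the sake of illustration, those few lines might look something like the sketch below. It is portable shell that also runs under Bash, and it assumes a mailx-compatible `mail` command is installed and able to relay through the mail server. The log path, recipient address, and script location are hypothetical.

```shell
# Emails a generated file to the user who asked for it.
# Documented here so it counts as a solution: who asked, why, and
# where the file comes from belong in this header.

MAILER=${MAILER:-mail}   # mailx-compatible client, assumed to be installed

# mail_daily_log FILE RECIPIENT
mail_daily_log() {
    file=$1
    recipient=$2

    # Refuse to send nothing: a missing or empty log is worth noticing.
    if [ ! -s "$file" ]; then
        echo "mail_daily_log: $file is missing or empty" >&2
        return 1
    fi

    subject="Audit log from $(uname -n) for $(date +%Y-%m-%d)"
    # The file becomes the message body via standard input.
    $MAILER -s "$subject" "$recipient" < "$file"
}

# Example crontab entry (hypothetical paths and address), daily at 06:00:
# 0 6 * * * . /usr/local/lib/mail-daily-log.sh && mail_daily_log /var/log/app/audit.log user@example.com
```

The check for an empty file is the only part that goes beyond the bare request, and it earns its place: because cron typically mails a job's error output to the job's owner, a silent failure upstream gets reported instead of going unnoticed.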

If similar requests begin to come in for other files related to the same application, service, or process, we wind up with several overlapping solutions. At that point it’s probably time to look at refactoring and solving what is now a bigger requirement in a better way than writing several similar scripts -- but it doesn’t change the fact that the original request didn’t and shouldn’t require a massive undertaking to resolve.
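As a rough idea of what that refactoring could look like before anyone reaches for a full framework, here is a sketch of one script driven by a list instead of several near-identical copies. The list format (one "path recipient" pair per line) is my own invention for this example, the function and variable names are hypothetical, and a mailx-compatible `mail` command is assumed. Paths containing spaces are not handled.

```shell
SEND_CMD=${SEND_CMD:-mail}   # mailx-compatible client, assumed to be installed

# mail_listed_files LISTFILE
# LISTFILE holds one "path recipient" pair per line; blank lines and
# lines starting with # are skipped. Returns non-zero if any send failed.
mail_listed_files() {
    listfile=$1
    failures=0

    while read -r path recipient; do
        case $path in
            ''|'#'*) continue ;;
        esac
        # Each send reads from its own file, so it cannot consume the list.
        if [ -s "$path" ] && $SEND_CMD -s "Daily file: $path" "$recipient" < "$path"; then
            :
        else
            echo "mail_listed_files: could not send $path to $recipient" >&2
            failures=$((failures + 1))
        fi
    done < "$listfile"

    [ "$failures" -eq 0 ]
}
```

Adding a new request then means adding a line to the list instead of writing another script, which is about as far as this needs to go until the requirement grows again.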

One of the great parts of IT is that we’re always searching for a “better” way to do things. This is how we’ve built up IT for decades: We always try to improve and streamline processes and frameworks. It’s allowed us to speed up software development dramatically while generally increasing security and functionality. We’re constantly refactoring code, tidying up loose ends, and slimming down operations when and where we can. It’s useful, important, and generally for the best. But we also need to realize when we might be tipping over into diminishing returns -- when the “fix” is ultimately worse than the problem. Recognizing that ain’t always easy.

In the words of Supreme Court Justice Potter Stewart: “I shall not today attempt further to define the kinds of material I understand to be embraced within that shorthand description; and perhaps I could never succeed in intelligibly doing so. But I know it when I see it.”