Your own worst critic

How to be proactive when writing processes

One thing I used to see in consulting all the time, and that I still see now, is how little devs actually critique their own processes. Quite often, there's wasted motion everywhere and not only does nobody seem to mind, but they don't even notice it most of the time.

We have a saying in Kenpo: Never do in three motions what you can do in two, and don't do in two motions what you can do in one. And don't worry, this is open to as much interpretation in IT as it is in Kenpo. If you don't mind too much, I'm going to skip the Kenpo interpretations and jump straight into IT -- and more specifically, DBs.

What I mean is simply this: If you're writing an ETL process and not really transforming the data in any significant way, then it doesn't need to be staged. This is one of those things I see all the time. I see ETL processes that stage data in a separate staging DB regardless of whether it's actually being staged or not. And of course the reason is always "continuity." All the other tables are staged and we want to keep the same process for everything? Why? That follows a certain amateur logic, but you're doing a double-hop and increasing the time of your load for no reason. Symmetry has it place in paintings, culinary arts, body building, etc., but it's not a good reason to promote a process that has adverse effects. In IT, the end result is all that matters. And unlike in personal matters, the ends almost always justify the means. Not adhering to strict symmetry is quite often revered as thinking outside the box.

So the short answer is never to anything just because that's the way you're doing it everywhere else. Everything in IT exists for a reason and there's a lot of overlap of functionality. DBs are really bad about that. There are so many ways to move data it's not even funny and they're not all equal in all situations. For instance, you would never even consider (I hope) writing an ETL cycle that revolved around BCP. BCP isn't meant for ETL for many reasons. Mainly, with writing the file, copying the file, and then loading it, it's typically just not fast enough. So you would never use BCP as your ETL mechanism. But you would use BCP as a tool in your ETL process.

Say you had 3 billion rows in a single table that you had no mechanism for loading incrementally. You have to ship the entire table to the other DB every night. You surely wouldn't want to copy that much data across the wire. It would take forever. But you could do a parallel extract into text files, Zip them up and then load them in parallel on the other side. It's not pretty, and it's still going to take a long time, but if it's your only choice then it's your only choice. What this doesn't mean, though, is that you don't BCP every table in the load, nor do you use oledb for this table because every other table uses it. Use what works in a given situation.

My old chess coach used to make me justify every move I made. If I moved a pawn, I had to have a reason. If I took a piece, I had to have a reason -- and not just a reason, but a good one. I had to analyze the situation and I had to ensure that I was coming out ahead on the exchange, or that I wasn't leaving myself open to something a few moves down the way. DBs have the same number of gotchas. Every time you write something, you have to ask yourself if you're doing the right thing. Be your own worst critic. Justify everything you do in the code or in the process.

If you choose to drop and re-create the tables every night, then give a good reason and weigh it against the alternative of keeping the schema intact. If you do a snapshot load of a table every night instead of an incremental load, then justify it. And you have to ask yourself what the alternatives are. Will this put you ahead of the game? Everything will put you ahead in some situation or another, and behind in others. So you need to seriously think about what situations you want to protect your process from and code for those.

Think about the possible breaking points. Where could this go wrong and what kind of position will you be in once it does? Can you recover easily? Can you recover quickly? Is recovery going to be manual or automatic? I'm sick of seeing devs write code in a vacuum without any thought to anything that could happen in production. You're a professional, so start acting like it. If you don't know where to start, then get with your DBAs and ask them what they typically see in failures and if they know of a solid way to prevent them.

There's no crime in asking the question and getting the process right, but it is a crime to not ask the question and doing something stupid because it might make you feel bad that you had to ask for help. Before something hits production, anything goes. You can fight, scream, test, mess up, whine, or whatever it is you and your group has to do to make a good decision. I've been involved in some pretty heated discussions about processes myself. But what matters is when your code hits the box, will it perform? That's the bottom line and that's all that counts.

So ask yourself the questions. And don't be afraid to tackle the tough issues because chances are that any issue you avoid is the one that'll get you out of bed and you'll have to face it then anyway. And keep a list if you have to. Write down all the ways a process can fail and check off the ones you're interested in protecting this one from.

The creator of American Kenpo, Ed Parker, said (paraphrased), "Anyone can come up with an answer, but it takes a genius to ask the right questions."