How do you set up SQL queries in your packages? Let's say you have an Execute SQL task and you need to use a large table join to query the source table. There are several ways you can store the queries, and you can even combine some of them. I'll list them here, but I may forget one (or one combination); don't take this as an exclusive list.
[ Cut straight to the key news for technology development and IT management with our once-a-day summary of the top tech news. Subscribe to the InfoWorld Daily newsletter. ]
You can store the query as:
- Stored procedure
- Package variable
- .SQL file on the file system
- Directly in the Execute SQL task as a query
- Directly in the Execute SQL task as a stored procedure
- Directly in the Execute SQL task as a view
- View inside a stored procedure
- Stored procedure in a .sql file on the file system
- Stored procedure in a variable
- View in a variable
- View in a .sql file on the file system
- The variable can also be stored in a config file
I realize that some of these are splitting hairs, but there are a lot of choices to make. Let's assume this query will have 12 joins and return 70 columns, most of which will have a CASE statement or function or other manipulation applied to them. This query could easily be five pages long depending on how you format it. You definitely have to pick the storage method that will be the easiest to maintain.
The first thing you should do is decide how the code will live in relation to the database itself. Should it live inside the DB or the package? What are the factors involved in where you put the code? Well, you have to consider network traffic. If the query's huge, then do you want to push it across several calls on the network? How often will the package run for this to even be an issue? Will this package be rolled out to several servers, and if so, does it need to run with different parameters in the query? How often will the query need to be changed, and what are your change procedures? What's the security like in your database? Will you likely mess with the SP?
Those are a few factors involved in deciding where the query should live. Quite often, the choice is made to use an SP. This is perfectly valid, and some might even say it's a best practice itself. In fact, I often prefer this myself. When using an SP, it's easy for the database admins to support the query without disturbing the package. Since SSIS packages are compiled, we can't just change them on the fly. They have to be loaded into Visual Studio, then recompiled and redeployed. It can be a pain; at the very least, it's a process.
The point is that several factors will decide where you keep your queries. While you may make small changes from time to time, chances are you'll pick a standard and stick with it.
I was prompted to write this recently when I came across a best practice suggesting you store queries in package parameters. In theory, if the package has to be ported to another server, you could pass it in with slightly different query code that's specialized for that server; it's supposed to make the package execution more dynamic without having to redeploy the package. Also, it should be really easy to change the query if needed.
I found it hard to believe that this would be an actual best practice -- it just doesn't seem to be workable enough across the different scenarios to be useful that often. For starters, let's take the problem of the large query. In the above example where the query is very, very large, you certainly wouldn't want to pass this into the package at runtime as a parameter -- if you even could.
Here's a quick list of the problems I have with this best practice:
- Queries can easily be too long to reasonably pass into a parameter.
- It's too easy to easy to accidentally change something in the job.
- You can’t easily look at the package and know what params are being used. You have to look at the job too.
SSIS metadata is really touchy, and if you change something in the query, you could throw the metadata out of whack. Not having to verify the package after a change can cause it to fail when it next runs. This isn’t necessarily a problem when you’re changing a condition, but if you’re changing one of the columns returned, you could throw the metadata out of sync.
Let me get this straight: If I change the where clause, I won’t have to open the package. Correct. How about if I change one of the columns? It depends on what you do. How about if I add a function to one of the columns? Same thing, it depends. This sounds too complicated to me. Can I just store it either in the database or in the package and just open the package every time? Sure, but it's a little cumbersome, isn’t it?
And again, I’m not saying you can’t do this -- but you need to be aware of what you’re getting yourself into. This is the same with all best practices. They're developed for a reason, but the reason is never given, so you’re stuck with figuring out what it is for yourself.
I’ve heard a best practice defined as something that works 51 percent of the time. That’s a reasonable explanation, but this best practice doesn’t hold up under such scrutiny. There’s no way that 51 percent of your processes will benefit from this standard.