The other night, I gave a presentation on SSIS architecture to a local user group, and I thought I'd share the discussion here. This is part of a beginner's course on SSIS I'm teaching, where I share tips on how to design your packages in a way that's both supportable and extensible.
The discussion the other night was on what to do when you have multiple tables to load. Do you put them in a single package or in multiple packages, each with a single table load? Of course, like everything else in databases, there's no single answer that works across the board. You have to look at every scenario you write and make this decision again and again. It's not that bad, though; after a while, you get a feel for it, and it becomes much easier.
Before I get into specifics, I'd like to say a couple words on why this is necessary. I've been supporting both DTS and SSIS for many years. In the process, I've seen all types of architectures for processes, and I can say they're not all created equal.
Some are much easier to support from the DBA side and to maintain from the dev side, so you have to decide on your goals for your process. Is your goal to move the data as fast as possible with little regard to support or extensibility? Are you looking for a quick recovery should something go wrong? Maybe you want to move the entire process easily to different servers or at least part of the process to a different server.
These are some of the possibilities, but the point is that you can't architect a process until you've determined your goals. How would you know if you're successful unless you measure your progress against concrete plans?
We see this in high-availability scenarios a lot. Someone will say they need high availability for their big box and instantly put in a cluster -- probably because it sounds cool. A few months later, the data typically becomes corrupted in some way and the admins are shocked to find out that the cluster hasn't protected them. This is because they didn't bother writing down their high-availability goals and, thus, lack a way to measure the solution.
I'll add that this is the same decision process you go through for replication scenarios. How do you want to divide up your publications? It's impractical to think that you'll have 80 tables in the same publication -- if something goes wrong and you have to resync the data via snapshot, you run the risk of greatly increase its duration. If you break it up, you can resync only the affected tables.
Initially, you have to decide how to slice it. Do you want to do it by table activity (putting the most updated tables together), by relationship (PK/FK tables go together), or by size (big tables get their own publication)? You must make these calls when you do replication, as well as when you write an ETL process.
This post is the short Web version of the SSIS architecture discussion. If you want to see the full analysis on these points, you can download the PDF.
Here are the general guidelines we landed on, with some more detailed discussion on each.
Put your tables in a single package when:
- You have a lot of interdependencies between tables.
- Network, data, and servers are really stable (won't need to restart often).
- You may need to change package-level info frequently.
- You have non-SQL dependencies.
That's mostly it for single package design, but if I left out any major details, send it to me and I'll add it to the list.
Now on to multiple package design -- here are the factors that could lead you down this road:
- Network, data, servers aren't reliable and process is likely to need to be restarted often.
- Tables are more autonomous (not deciding factor on its own).
- You aren't doing any logging, or logging is set in stone and won't change.
Despite all of this discussion, you'll be glad to know that you're not limited to two options. It's more than just a decision to put your tables all together or by themselves -- that's the good news. The bad news is that your choice on how to architect your package got a lot harder. Here are, more or less, your full list of alternatives:
- You can use a single package if it's small and stable enough.
- You can put all tables in their own package.
- You can use a master package.
- You can group tables in packages.
There's a lot more discussion on this topic that would have made this post too long. If you're interested in the finer points, I urge you to download the PDF to understand the full implication of what I'm saying. Just remember to always have a plan and know what you want to get out of your process. That way, you can evaluate your process against your blueprint and see if you've accomplished your goals.
Feel free to email me with questions or comments.
Also, here's the video of the presentation if you'd like to hear the dialogue in real time.
This article, "SSIS architecture: Tips for package design," was originally published at InfoWorld.com. Read more of Sean McCown's Database Underground blog and follow the latest developments in databases and data management at InfoWorld.com.