Troubleshooting DB processes

A quick guide to troubleshooting database loads.

This is a quick guide I put together for a dev yesterday, and I'm just pasting it in here as-is for you guys. It's not inclusive, but it just shows how to follow a systematic pattern in order to determine the cause of a load failure.

This was written with an SSIS process in mind, but it applies in one form or another to just about anything DB related that fails.

The point to take away from this is to not make assumptions and to be systematic in making your determinations. Don't jump all over the place. Start at the process and work your way outward.

Anyway, here it is. This was just an e-mail, so it's pretty informal. The specific scenario is that an SSIS load took three times longer than usual.

Here is the process i would follow for load issues like this:

1. Determine that the load time is indeed outside the normal parameters. This is done by comparing it against enough history to have an accurate trend.

2. Determine that the load completed the way it's supposed to. The extended time could have happened because there was an error that caused a long transaction to rollback. That's just one possibility, but load completeness needs to be verified.

3. Examine the process logs to find the portion of the load that extended the process. This could be because of increased rows, a change in an index, etc.

3a. If the above yields no significant change in data or index processing, then check for increased load times. If a table has tripled its load time without tripling its data, then there's a definite performance issue. If the issue is system-wide, that is to say, if all the tables are taking longer, then the problem could be at the physical level. In this case, I'd look at the health of the network, disks, NIC, etc. This is the point where it's time to get someone else involved who specialized in this level of system troubleshooting.

3b. If it's merely a single table or two that's having trouble, you can rule out network or Windows because it wouldn't effect just two tables. It still may be a disk, though, if these two tables are sitting on their own disk and it's having trouble.

3c. Check the timing of the load. Are the trouble tables happening earlier or later than they usually do? Are they happening earlier or later in relation to the rest of the load? If either of these is true, you could just be seeing contention that you normally don't run into because the timing is off. There's nothing wrong with the disks or the system in this case... things are just backed up.

3d. The next question to ask yourself is why is the timing off? Did the process kick off late and run into another process's timeline? Is there something wrong with the source system that's making it feed you these two tables slower, etc.?

This isn't a complete list, and there are some cases I didn't cover, but it's a really good start. If you follow these steps EVERY TIME when you troubleshoot a process, you'll have success. I really stressed every time because it's easy to make assumptions and they're quite often wrong. Every troubleshooting incident is different and you have to collect data from the base level every time. Don't make assumptions. If you have a theory, verify it if you can. There's a reason that doctors run tests even when they know exactly what you have. It's because too many illnesses look alike.

What you want to do is eliminate things from your process out. So start with your process and go down every path there. Eliminate everything you can that I discussed above. That way, you can say with reasonable certainty that your process did what it was supposed to do. Now you can move on to factors outside your process. So you can eliminate the DB itself after that. Once you find that it wasn't your process or the DB, then move on to the next thing... disk/memory/other hardware... next move on to Windows... next the network... next the source system... and on and on... you can trace a problem back to it's source like that. Take one step at a time and be systematic.

This is the process I follow every time. I've learned too many times that you can't take anything for granted. And don't make assumptions.

Sherlock Holmes said: When you strip away the impossible, then whatever's left, no matter how improbable must be the truth -- definitely words to live by.