This is a quick guide I put together for a dev yesterday, and I'm just pasting it in here as-is for you guys. It's not inclusive, but it just shows how to follow a systematic pattern in order to determine the cause of a load failure.
This was written with an SSIS process in mind, but it applies in one form or another to just about anything DB related that fails.
The point to take away from this is to not make assumptions and to be systematic in making your determinations. Don't jump all over the place. Start at the process and work your way outward.
Anyway, here it is. This was just an e-mail, so it's pretty informal. The specific scenario is that an SSIS load took three times longer than usual.
Here is the process i would follow for load issues like this:
1. Determine that the load time is indeed outside the normal parameters. This is done by comparing it against enough history to have an accurate trend.
2. Determine that the load completed the way it's supposed to. The extended time could have happened because there was an error that caused a long transaction to rollback. That's just one possibility, but load completeness needs to be verified.
3. Examine the process logs to find the portion of the load that extended the process. This could be because of increased rows, a change in an index, etc.
3a. If the above yields no significant change in data or index processing, then check for increased load times. If a table has tripled its load time without tripling its data, then there's a definite performance issue. If the issue is system-wide, that is to say, if all the tables are taking longer, then the problem could be at the physical level. In this case, I'd look at the health of the network, disks, NIC, etc. This is the point where it's time to get someone else involved who specialized in this level of system troubleshooting.
3b. If it's merely a single table or two that's having trouble, you can rule out network or Windows because it wouldn't effect just two tables. It still may be a disk, though, if these two tables are sitting on their own disk and it's having trouble.
3c. Check the timing of the load. Are the trouble tables happening earlier or later than they usually do? Are they happening earlier or later in relation to the rest of the load? If either of these is true, you could just be seeing contention that you normally don't run into because the timing is off. There's nothing wrong with the disks or the system in this case... things are just backed up.
3d. The next question to ask yourself is why is the timing off? Did the process kick off late and run into another process's timeline? Is there something wrong with the source system that's making it feed you these two tables slower, etc.?