The secret to troubleshooting performance problems

Performance problems are among the most difficult to solve, requiring careful preparation and well-honed soft skills to untangle

In my years in IT, I've seen ridiculous actions taken in the name of improving performance. I've seen hundreds of high-end thin clients sporting more horsepower than a typical PC deployed simply to run a remote desktop client. I've seen a whole blade chassis's worth of servers deployed to do the work of a single server. I've seen video cards designed for gaming installed in desktops to make a line-of-business application work better.

In most of these cases and others like them, unquestionably wasteful decisions were made because of a pervasive fear of one of the worst types of user complaints an IT pro can hear: "It's slow." Those two words issued from the right lips into the right ears can touch off a political disaster that often ends with a pile of wasted time and money. Many times, simply being seen to take any action at all -- regardless of whether it helps -- is more valued than the frequently painstaking process of figuring out what the problem really is (and indeed whether there even is one in the first place).

The real challenge for IT pros faced with a high-profile performance complaint is to quickly and decisively determine where the problem may lie before wasteful measures that serve only to distract from the real issue are forced down our throats. That almost always requires having the right tools in place before the complaint is ever made, strong communication skills, and in the worst cases, the intellectual curiosity to dig into the weeds in search of a smoking gun.

Here's what you need to know about each of these three critical troubleshooting weapons.

Troubleshooting weapon No. 1: Preparation

One of the most critical parts of any performance-troubleshooting challenge isn't necessarily proving where a problem is, but rather where it isn't. If you can immediately proclaim upon receiving a performance complaint that the network, compute, and storage infrastructures are not to blame for any perceived slowness, you can short-circuit the knee-jerk "it's the hardware!" conclusion often made early on in the process.

Hardware tends to get the automatic blame because it's easy for pretty much everyone to understand (or think they understand). Even a completely nontechnical stakeholder will know that a 10Gbps Ethernet switch is probably somewhere around 10 times as fast as a 1Gbps Ethernet switch. When doubts crop up as to whether the network might be the cause of a performance problem, it's simple for the nontechnical to seize upon a 10-fold increase in performance as a potential solution. The same goes for all kinds of hardware: servers, storage, and network gear alike.

Being forearmed against this requires you to have thorough monitoring tools in place. If you're monitoring literally every piece of gear in your infrastructure, you'll always be able to prove whether you're really bumping into the limits of the infrastructure or whether the problem is caused by something else. Although monitoring systems take some work to get running and dialed in, I've seen time and time again how incredibly useful they are in steering a troubleshooting conversation in the right direction.

Although simply monitoring the performance metrics of the infrastructure will help, also consider configuring your monitoring systems to evaluate user-facing criteria. For example, to evaluate the performance of a database-driven line-of-business application, you'd typically monitor storage latency, storage throughput, database query response times, and network throughput. Those are all great metrics to have, but also consider scripting a process that logs into the application as a normal user would, performs a few basic functions that a typical user would perform, and logs out. If you have your monitoring system time that script, you'll have a canary in the coal mine that tells you when things go wrong in ways that other, more specific monitoring might miss.
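To make that concrete, here's a minimal sketch of what such a synthetic check might look like, written in Python with the requests library. Every specific in it (the base URL, the login fields, the workflow pages) is a hypothetical placeholder; you'd substitute the screens your own users actually touch and feed the resulting timing into whatever monitoring system you already run.

    # Minimal synthetic "user transaction" check. All URLs, form fields, and
    # credentials below are hypothetical placeholders for your own application.
    import time
    import requests

    BASE_URL = "https://app.example.internal"   # hypothetical application URL
    TIMEOUT = 30                                # seconds before a step counts as failed

    def timed_user_transaction():
        """Log in, touch a few typical screens, log out, and time the whole thing."""
        start = time.monotonic()
        with requests.Session() as session:
            # Log in the way a normal user would (form fields are assumptions).
            session.post(BASE_URL + "/login",
                         data={"username": "monitor", "password": "secret"},
                         timeout=TIMEOUT).raise_for_status()

            # Perform a few basic functions a typical user would perform.
            for path in ("/orders", "/orders/recent", "/reports/summary"):
                session.get(BASE_URL + path, timeout=TIMEOUT).raise_for_status()

            session.get(BASE_URL + "/logout", timeout=TIMEOUT)
        return time.monotonic() - start

    if __name__ == "__main__":
        elapsed = timed_user_transaction()
        # Emit a simple "metric value" line most monitoring agents can scrape.
        print("app_user_transaction_seconds %.2f" % elapsed)

Run on a schedule, that single "how long does a normal session take" number tends to catch exactly the class of slowdown that per-component graphs miss.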

Troubleshooting weapon No. 2: Communication

Many times, if you were looking over the user's shoulder when he or she encountered a problem, you'd immediately be able to identify the cause. However, when you're not there and the problem happens randomly or without enough consistency to reproduce, knowing how to ask questions and interpret the answers is absolutely vital. In fact, I'd go so far as to say that someone who knows how to ask good questions and logically parse the replies -- but knows next to nothing about technology -- may be in a better position to identify a problem than the most technically astute person who lacks these skills.

Often, asking an affected user to keep a log of exactly what happens -- and when it happens -- can really help when it comes to matching user complaints to system logs and performance charts. If you can coach the user on what to look for and how to document events accurately, you can save yourself a lot of trouble chasing the ghosts created by approximation, hearsay, or an overactive imagination.  

Getting an accurate description of the problem not long after it is first reported is perhaps the most valuable clue you can work with in the troubleshooting process. I can't tell you how many times I've helped someone chase a problem that didn't actually exist -- at least not the way in which it was described. Not only is this frustrating for the IT pros who waste their time chasing ghosts, it's also intensely frustrating for stakeholders who see no progress being made on an issue they've reported.

Troubleshooting weapon No. 3: Curiosity

Once you have a good description of exactly what the problem looks like in the field, you'll usually know what the problem is or be able to hand it to a third party such as an application vendor to solve. However, the most difficult performance problems won't be untangled that easily. The really fun ones leave no trace of their existence: Performance graphs will look well within limits, error logs will be devoid of useful errors, and hardware and software specifications will all be satisfied.

In these situations, you need to be able to think outside the box for new ways to tear into the problem. That might involve going medieval on it with tools such as a protocol analyzer (like Wireshark) or a process tracer (truss, strace, procmon, regmon, and so on). However, using those tools to good effect requires genuine curiosity about how things work under the hood. As with communication, not every IT pro has this skill.
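As one illustration of the kind of digging I mean (not a substitute for Wireshark or strace, just the sort of quick homegrown probe a curious admin might knock together), here's a small Python sketch that times each phase of a single HTTPS request separately, so you can see whether a stall lives in name resolution, the TCP connect, the TLS handshake, or the wait for the server's response. The host name is a hypothetical placeholder.

    # Quick probe: time each phase of one HTTPS request. The host below is a
    # hypothetical placeholder for the application server you care about.
    import socket
    import ssl
    import time

    HOST = "app.example.internal"
    PORT = 443

    def timed_phases(host, port):
        """Return elapsed seconds for DNS, TCP connect, TLS handshake, first byte."""
        t0 = time.monotonic()
        ip = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)[0][4][0]
        t_dns = time.monotonic()

        sock = socket.create_connection((ip, port), timeout=10)
        t_tcp = time.monotonic()

        tls = ssl.create_default_context().wrap_socket(sock, server_hostname=host)
        t_tls = time.monotonic()                  # handshake happens in wrap_socket

        request = "HEAD / HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n" % host
        tls.sendall(request.encode("ascii"))
        tls.recv(1)                               # block until the first response byte
        t_first = time.monotonic()
        tls.close()

        return {"dns": t_dns - t0,
                "tcp connect": t_tcp - t_dns,
                "tls handshake": t_tls - t_tcp,
                "first byte": t_first - t_tls}

    if __name__ == "__main__":
        for phase, seconds in timed_phases(HOST, PORT).items():
            print("%14s: %7.1f ms" % (phase, seconds * 1000))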

Putting it into practice

Although I've seen many good examples of this process play out, perhaps the best one comes courtesy of an organization that depended heavily on a Web-based line-of-business application. One morning, an executive appeared in person at the help desk with a critical issue: Users were encountering crippling performance problems, and action needed to be taken right away.

However, the exec appeared not just with a problem, but also with the solution: The company's SAN obviously was the culprit and would need to be replaced. He also had just the guys to do it sitting outside, but someone had to talk to them right away and formulate a plan to move forward with the upgrade. Before the IT director had even finished his day's first cup of coffee, he was staring down the barrel of an executive-mandated forklift storage upgrade, an apparently upset user base, and a critical application problem he had never heard of.

However, that IT director was no slouch. He had the right monitoring tools in place and was able to (very, very gingerly) demonstrate that the performance of the existing SAN was up to par: Latencies were low, none of the controllers or interfaces was being taxed, and back-end database performance was not only excellent but had been relatively unchanged for months. Faced with this evidence, the executive realized he might have skipped a few steps in the troubleshooting process and sent the storage vendor's reps, who were sitting in the lobby, packing.

A great deal of further discussion uncovered what had taken place. For the previous five or six weeks, the executive had heard generalized griping from some of his reports about the performance of the Web app. Though nobody could pin down exactly when it started, a series of menus in a common workflow had begun lagging by about 30 seconds every time a user touched them for the first time in a workday. Sometimes it would happen more than once a day.

It wasn't enough time for anyone to completely freak out, so nobody ever opened a help desk ticket. However, it was enough for people to become annoyed over time. Eventually, that annoyance made its way to the exec, who started crunching the numbers on the productivity being lost. Then, when the storage company showed up to pitch its shiny new flash-based array that could supposedly cure any application-performance woe, the exec connected the dots and jumped to a conclusion.

After sitting with users for a long time to observe the problem and playing with an HTTP protocol analyzer (a tool not unlike the timeline feature in Google Chrome's Inspect Element control), the IT director determined that the style sheets supporting the series of application screens in question referenced a graphic whose URL was hard-coded to a recently decommissioned test server. Every time the browser tried to load the image, the user's machine would idle while trying to reach the decommissioned server and eventually give up. Because the browser needed the image to render the screen, the application would appear to hang. In the end, it was not a performance problem at all -- and it was extremely easy to fix.
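For the curious, here's a rough sketch of how you might hunt for that sort of thing yourself: fetch the page, pull its stylesheets, and time a request to every absolute url(...) reference with a short timeout so that dead hosts stand out immediately. The page URL is a hypothetical placeholder, and the regular expressions are deliberately crude; this is a quick probe, not a parser.

    # Rough probe: flag slow or dead resources referenced by a page's stylesheets.
    # The page URL is a hypothetical placeholder; the regexes are intentionally crude.
    import re
    import time
    from urllib.parse import urljoin

    import requests

    PAGE_URL = "https://app.example.internal/workflow/menu"
    CSS_LINK = re.compile(r'<link[^>]+href="([^"]+\.css[^"]*)"', re.IGNORECASE)
    CSS_URL = re.compile(r"url\(['\"]?(https?://[^'\")]+)['\"]?\)")

    def check_stylesheet_resources(page_url):
        """Time every absolute url(...) reference found in the page's stylesheets."""
        page = requests.get(page_url, timeout=10).text
        for href in CSS_LINK.findall(page):
            css = requests.get(urljoin(page_url, href), timeout=10).text
            for resource in CSS_URL.findall(css):
                start = time.monotonic()
                try:
                    status = requests.head(resource, timeout=5).status_code
                except requests.RequestException as exc:
                    status = exc.__class__.__name__   # e.g. ConnectTimeout
                elapsed = time.monotonic() - start
                print("%6.2fs  %-20s  %s" % (elapsed, status, resource))

    if __name__ == "__main__":
        check_stylesheet_resources(PAGE_URL)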

Thanks to a combination of the IT department's preparation, its ability to winnow down the details and isolate the actual problem, and the curiosity and tenacity to dig in and find the root cause, the company saved itself from a massively expensive and ultimately pointless storage upgrade. Without those factors, IT would have lost control of the troubleshooting process in the first hours, been completely distracted for months, and still not solved the problem.

This article, "The secret to troubleshooting performance problems," originally appeared at InfoWorld.com. Read more of Matt Prigge's Information Overload blog and follow the latest developments in storage at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.

Copyright © 2013 IDG Communications, Inc.