Production system troubleshooting 101: it’s not always about technical knowledge

Sometimes, the ability to suspend assumptions and ego is far more critical than specific technical knowledge for solving issues in production


One of the biggest misconceptions about troubleshooting systems is that it requires deep, specific technical knowledge to locate and solve production issues. This assumption can often extend the time between the discovery and resolution of a problem. At first this may seem counterintuitive, so let’s look at some common scenarios to see which view makes the most sense.

To start with, most assumptions about broad concepts are generally wrong because they are based on the expectation that there is a single, best way of doing things every time. There are certainly times when the developer of a particular solution can look at a problem a production application is having and instantly say, “I know why that is happening.” This happens not because the developer deliberately left an issue but because most solutions have multiple, valid approaches. Some of them can have flaws that may not be immediately obvious. In some cases, all options have flaws, and it is a matter of choosing the path with the weakness that is least likely to be found in the wild. The experienced developer will unconsciously be aware of these potential problems and, when presented with the issue in production, will instantly recognize it. In most cases, these things will surface and be addressed in QA before they reach production. By the nature of production systems (where users are always more inventive than the best QA analyst), the application will encounter something that was not anticipated.

Once in production, the key to identifying the cause of the problem is to look at what is actually happening, whereas the person with deep, specific knowledge will most likely look first for what is expected to happen. Therein lies the trap. If a reasonable QA effort was put in before release, it is the unexpected that is more likely to be the issue. The easiest way to find an issue that isn’t immediately obvious is to have no expectations and instead observe what the behavior is and trace it back to its origin with no anticipation of what will be found. Finding the root cause is much more about applying a way of thinking than it is about knowing something in advance.

There is also a psychological aspect to having the original developer investigate the issue. For reasons that could fill another article (if not a whole book), the first thing the developer tends to look for is something outside their application as the cause. It is quite possible it is something from outside causing the issue. The more experience the developer has, the more likely this is the case. In troubleshooting, the goal is to fix the problem, and having any assumptions at the start can delay finding the problem, wherever it is. Yes, sometimes those intuitive assumptions are useful, so long as they are abandoned if they don’t quickly prove out.

When an issue is determined to be outside the responsibility of the person or team investigating, the mistake most often made is to hand it off to another team before clearly understanding how the external system is causing the issue. Failing to articulate irrefutable evidence of the source of the issue before passing it on to those responsible for that part of the system can result in an unproductive back and forth between developers or teams, because they too expect that the problem is not in their work.

Once the issue is identified, deep knowledge may still hinder resolution and will not always be necessary. I was recently asked to help with an issue where the production support team followed a recommendation from the cloud platform vendor support to address an issue with throttling by moving the offending process on premises in a hybrid solution. While platform support knows its platform well, with the myriad ways it can be implemented it is just not possible to always anticipate how combinations will work out. The support team followed the advice without thinking about why that process was deployed to the cloud to begin with. The change resulted in new issues because there were insufficient resources in the on-premises server. Furthermore, when validating the change, the team only looked at the cloud monitoring (where the problem originally manifested). The failure point had been moved to the on-premises system, and it was the business that reported the new manifestation of the problem (and brought me in to help).

The final solution was to manage the iterations in the process being throttled to bring it within threshold limits. This required no knowledge of the cloud platform beyond the fact that throttling was a factor, and no detailed knowledge of the specific implementation, because the logs clearly pointed to where the failure was occurring. That was the point where a counter needed to be added to stay under the threshold.
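As a rough illustration of what a counter of that kind can look like, here is a minimal Python sketch. It is not the code from the engagement described above, and every name in it (`run_throttled`, `max_calls`, `window_seconds`) is hypothetical. It assumes the platform's limit can be expressed as a maximum number of calls per fixed time window, which may not match how a given platform actually counts.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_throttled(
    items: Iterable[T],
    handler: Callable[[T], R],
    max_calls: int,
    window_seconds: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Call handler on each item, never more than max_calls per window.

    Hypothetical helper: max_calls and window_seconds stand in for
    whatever threshold the platform documents.
    """
    if max_calls < 1:
        raise ValueError("max_calls must be at least 1")

    results: List[R] = []
    window_start = clock()
    calls_in_window = 0  # the counter that keeps the loop under the limit

    for item in items:
        now = clock()
        # A new window has begun on its own, so start counting again.
        if now - window_start >= window_seconds:
            window_start = now
            calls_in_window = 0

        # Budget for this window is spent: wait for the window to end.
        if calls_in_window >= max_calls:
            remaining = window_seconds - (now - window_start)
            if remaining > 0:
                sleep(remaining)
            window_start = clock()
            calls_in_window = 0

        results.append(handler(item))
        calls_in_window += 1

    return results
```

With `max_calls=2` and `window_seconds=1.0`, for example, five items would be handled in three bursts with a pause between them. The `clock` and `sleep` parameters are there only so the pacing can be checked in a test without real waiting.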

To sum up the lesson, the ability to suspend assumptions and ego is far more critical than specific technical knowledge for solving issues in production. During development, it is common to be stuck for a while on a bug and to ask someone else to look at the problem with a fresh perspective. Carrying this practice on into production will resolve issues faster and leave more time for working on the next cool iteration.

Copyright © 2018 IDG Communications, Inc.
