Those maddening, mysterious networking problems

No matter how impossible it seems, there must be a solution -- and we have no choice but to find it

Nothing is so uniquely frustrating as to be in the middle of a technical finger-pointing session with no suitable outcome in sight. Many times, these situations will occur when you have WAN circuits or other interconnections that cross carriers or long distances, and they aren't functioning correctly.

The issue might be internal problems with network connectivity between server clusters at certain sites or even within a single site. Whatever the case, the problem is real and tangible, but nobody wants to go further than the path of least resistance in diagnosing the issue. The server guys point at the network guys who point at the carrier who points right back.

As a customer, user, or even the engineer, this is where it gets very, very frustrating.

A case in point might be a situation where the carrier runs all its tests and verifies there is connectivity to the endpoints. That might be true for simple ICMP tests, but once heavier traffic hits the pipe, there's packet loss and sluggish performance. Nevertheless, the carrier insists everything's fine. After much gnashing of teeth and many angry phone calls, the root cause is finally determined to be a flaky optic on the carrier's switch.
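A clean ping proves only that small, occasional packets get through. As a rough sketch of probing with more than one packet size, the Python below sends numbered UDP datagrams to an echo responder and reports the fraction that never come back. The helper names `udp_echo_server` and `measure_loss` are made up for illustration, and it assumes you control a responder at the far end. It works in lock-step, so it varies packet size but not rate; it is no substitute for a proper load test with a tool such as iperf.

```python
import socket
import threading


def udp_echo_server(sock, stop):
    """Echo datagrams back until `stop` is set (stand-in for a far-end responder)."""
    sock.settimeout(0.2)
    while not stop.is_set():
        try:
            data, addr = sock.recvfrom(65535)
        except socket.timeout:
            continue
        sock.sendto(data, addr)


def measure_loss(host, port, count, payload_size, timeout=0.5):
    """Send `count` numbered datagrams of `payload_size` bytes (at least 4).

    Returns the fraction of datagrams whose echo did not arrive in time.
    """
    received = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        for seq in range(count):
            # First 4 bytes carry the sequence number, the rest is padding.
            payload = seq.to_bytes(4, "big").ljust(payload_size, b"\0")
            s.sendto(payload, (host, port))
            try:
                reply, _ = s.recvfrom(65535)
            except socket.timeout:
                continue
            if reply[:4] == payload[:4]:
                received += 1
    return 1 - received / count


if __name__ == "__main__":
    # Loopback demo only; on a real circuit the responder sits at the remote end.
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))
    stop = threading.Event()
    worker = threading.Thread(target=udp_echo_server, args=(server, stop), daemon=True)
    worker.start()
    for size in (64, 512, 1400):
        loss = measure_loss("127.0.0.1", server.getsockname()[1], 20, size)
        print(f"{size:5d} bytes: {loss:.0%} loss")
    stop.set()
    worker.join()
    server.close()
```

If small payloads come back clean while larger ones go missing, that is at least a data point to put in front of the carrier.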

In another example, a data circuit to a remote office will sporadically drop for a few minutes or produce packet loss upward of 50 percent at random times, day or night, sometimes weeks apart. Every test the carrier runs comes back five by five, so they blame flaky hardware on one end or the other. The network engineers pore over everything but find nothing wrong. They swap out the hardware, yet the problems persist. They check all the demarc connections, and still the problem persists.

Only after months go by does someone finally piece together that the outages seem to occur when it's raining at the remote site. The wiring is tested from the pole to the building, and a shear in the insulation is discovered. Sure enough, the problem was caused by moisture getting in.
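Spotting that kind of pattern gets easier once the outage log is lined up against some outside record. The Python sketch below assumes you have outage timestamps from your monitoring and a list of rain intervals for the remote site from some weather source; the function name and all of the dates are invented for illustration and are not data from the incident described.

```python
from datetime import datetime, timedelta


def fraction_during_condition(outages, intervals, slack=timedelta(minutes=30)):
    """Fraction of outage timestamps that fall within `slack` of any interval."""
    if not outages:
        return 0.0
    hits = sum(
        1
        for t in outages
        if any(start - slack <= t <= end + slack for start, end in intervals)
    )
    return hits / len(outages)


# Hypothetical outage start times pulled from a monitoring system.
outages = [
    datetime(2024, 3, 2, 14, 10),
    datetime(2024, 3, 19, 3, 40),
    datetime(2024, 4, 7, 22, 5),
    datetime(2024, 4, 20, 9, 0),
]

# Hypothetical rain periods at the remote site.
rain = [
    (datetime(2024, 3, 2, 13, 0), datetime(2024, 3, 2, 16, 0)),
    (datetime(2024, 3, 19, 2, 0), datetime(2024, 3, 19, 5, 30)),
    (datetime(2024, 4, 7, 21, 0), datetime(2024, 4, 7, 23, 59)),
]

ratio = fraction_during_condition(outages, rain)
print(f"{ratio:.0%} of outages overlapped with rain")
```

A high overlap proves nothing by itself, but it tells you where to send someone with a cable tester.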

Another example might be a connection between secured networks where a service isn't available from one side or the other, but should be. The server crew tests its end and says there's no problem, while the network team swears up and down that there's no way the network is at fault. The latter can even see the packets traversing the network properly, but no reply comes back from the service side. A week of meetings and frustration passes before one of the server crew realizes the local firewall on one side was somehow enabled (but he's certain he checked it a dozen times!).
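One cheap data point in a standoff like this is how a TCP connection attempt fails. A minimal sketch in Python follows; `probe_tcp` is an illustrative name, and the interpretation in the comments is a rule of thumb rather than proof. An active refusal means something answered, while silence is what you would expect from a firewall quietly dropping packets, whether on the network or on the host itself.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Attempt a TCP connection and classify the outcome.

    "open"     - handshake completed; something is listening.
    "refused"  - a reset came back; the host is reachable, but nothing is
                 listening or something is actively rejecting the connection.
    "no reply" - nothing came back before the timeout; consistent with a
                 device or host firewall silently dropping the packets.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        return "no reply"
    except ConnectionRefusedError:
        return "refused"
    except OSError as exc:
        # Unreachable network, name resolution failure, and so on.
        return f"error: {exc}"


if __name__ == "__main__":
    # Loopback demo: a listener we control, then the same port after closing it.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    print("with listener:   ", probe_tcp("127.0.0.1", port))
    listener.close()
    print("without listener:", probe_tcp("127.0.0.1", port))
```

Run from both sides of the secured boundary, a "no reply" from one direction and "open" from the other narrows down where to look.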

Worst of all is the problem that isn't visible to anyone outside of careful lab testing. That's the purview of the firmware bug. When everything is built and configured properly, verified a hundred different ways, maybe even tested in full in a lab setting, yet still doesn't function in production, you have to start thinking about firmware bugs.

These insidious problems will drive a network engineer mad because they violate the laws of physics and computer networking. This is when you might find yourself randomly applying and removing access lists from interfaces, or enabling/disabling services or interfaces (the networking equivalent of turning it off and on again) in a last-ditch effort to provide a new data point or maybe even "fix" the problem.

Worse yet, one of those methods might actually work. You might flip an interface and see a single ICMP packet reply, then back to silence. You might flip that four more times and not see another ICMP reply, yet the fifth time, you do. There might be a bug in the application of an ACL or a route map that lets that one packet slip through during interface initialization, if the request is timed right, before a corner-case ACL parsing error starts dropping all subsequent packets. This is the stuff of nightmares.

The upshot: There really isn't a fix for any of this. There's no panacea or remedy, no quick fix or simple method to evade any of these problems. Only perseverance and constant, consistent troubleshooting will eventually suss out the root cause. It may require replacement hardware. It may require firmware updates and painstaking perusal of bug reports. It may require setting up a lab if none exists. It will probably require a heavy dose of luck, and it will hopefully result in a “eureka” moment. It will also require diligence, because there will be a root cause somewhere. Giving up isn't an option.

At its core, all networking is simple. If it breaks, it should be simple to diagnose and fix. However, we know it's not always simple, and when it's not, it's usually exceedingly challenging. Then again, what fun would it be if everything worked right the first time? (Tongue firmly in cheek.)

Oh, by the way, none of my hypothetical examples above were hypothetical. I witnessed and ultimately fixed each of them over the years. There are plenty more tales where those came from.