Network duty calls -- every 15 minutes

Tick tock, there goes the network! A techie tries to get to the bottom of an internet outage that runs like clockwork

When issues arise in any profession, those tasked with solving the problems draw upon their unique skills to come up with a solution. In IT, sometimes you use your troubleshooting knowledge, sometimes you use leadership, and sometimes you lean on entirely unexpected assets.

Years ago I was hired by a small but growing company that finally realized it needed IT help. I was tasked with creating a proper technology infrastructure to connect the handful of small offices. It was a great challenge to take on relatively early in my IT career, as I had to wear every possible hat in the one-member IT department.

Being the only individual responsible for all technology in a company comes with great rewards, but it can have its challenges -- as anybody familiar with the situation can attest. Don’t have a backup system in place, and money is tight? Build one and become the expert! Need to interconnect offices, but have never actually done so? Learn networking! The business bought new software and automatically assumes you’re an expert? Read the manual!

But there is an unexpected drawback: In the process of learning lots of things the hard way, there can be a tendency to become self-reliant to a fault. If you were able to solve the last 10 problems through hard work, research, and maybe blind luck, then it’s easy to assume the 11th problem will go the same way.

Easy fix -- or not

This was the assumption I made when confronted with an unusual issue at one office location. I got a call that the internet was down. I verified that the office couldn’t reach the internet and performed some other basic troubleshooting, but I couldn’t pinpoint what was happening. I decided to try the old method of rebooting, and I asked an employee to unplug the gateway device and plug it back in.

It worked! Case closed -- for exactly 15 minutes.

I got the call again and performed the same “fix,” which also worked for exactly 15 minutes. After another round or so of this same pattern, I determined that a pre-emptive strike would spare the on-site employee from physically resetting the device: I could reset the WAN interface remotely after 14 minutes, which would effectively reset the clock and result in only a few seconds of downtime. The location in question was customer-facing and needed to process credit card transactions, so downtime was a very big deal.

So went my day, every 14 minutes resetting the office’s internet connection. In between resets I worked hard on the task of figuring out the root cause and solving the problem for real. I’d field support calls and attend to other pressing matters, all the while running back to my computer every 14 minutes to prevent catastrophe. All day long I read logs, tested settings changes, pored over every semipertinent search result -- and pressed that damn button every 14 minutes.
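For anyone who would rather not babysit a button, the stopgap described above could in principle be scripted. What follows is only a minimal sketch: the story does not say how the remote reset was actually triggered, so `reset_wan` is a hypothetical placeholder for whatever call or command a given gateway supports, and the 14-minute interval simply mirrors the timing in the story.

```python
import time
from typing import Callable, Optional

# 14 minutes: one minute ahead of the 15-minute failure described in the story.
RESET_INTERVAL_SECONDS = 14 * 60


def run_reset_loop(
    reset_wan: Callable[[], None],
    interval: float = RESET_INTERVAL_SECONDS,
    max_resets: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Wait `interval` seconds, call `reset_wan`, and repeat.

    `reset_wan` is a hypothetical stand-in for the real reset action.
    `max_resets=None` means run until interrupted. `sleep` is injectable
    so the loop can be exercised without actually waiting.
    Returns the number of resets performed.
    """
    count = 0
    while max_resets is None or count < max_resets:
        # Each reset restarts the failure clock, so wait a full interval
        # after the previous reset rather than following a fixed schedule.
        sleep(interval)
        reset_wan()
        count += 1
    return count
```

A script like this only keeps the lights on; it does nothing to find the root cause, which is where the rest of the day went.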

At the time I was following a TV show that featured a similar “push the button at exactly the right time or the world will end” scenario, so there was a certain humor to the situation.

I continued trying to solve the problem well after my workday had ended, even stopping on my commute home to press the button remotely. After a while, I started to feel the first signs of my mental well-being unraveling and had to admit to myself that this situation might not be best solved by my usual approach of outsmarting the problem. I needed help.

Know the right connections

I was able to persuade (read: bribe with beer) a good friend who was also a really smart network engineer to have a look that evening. After a couple of brews, some Wireshark, and lots of actual networking know-how combined with my earlier findings, we eventually got a view into what was going on.

I’d figured out earlier that some replication jobs were retrying every 15 minutes, but hadn’t been able to determine why that would cause such a huge issue.
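Spotting that kind of rhythm in logs can be sketched in a few lines. This is an illustration only, not the method used that day: the function names are made up for the example, the timestamps in the usage section are invented, and the 30-second tolerance is an arbitrary assumption.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


def median_gap_seconds(timestamps: List[datetime]) -> Optional[float]:
    """Return the median gap, in seconds, between consecutive events."""
    ordered = sorted(timestamps)
    gaps = [
        (later - earlier).total_seconds()
        for earlier, later in zip(ordered, ordered[1:])
    ]
    return median(gaps) if gaps else None


def matches_period(
    timestamps: List[datetime],
    period: timedelta,
    tolerance: timedelta = timedelta(seconds=30),  # arbitrary assumption
) -> bool:
    """True if the median gap between events is within `tolerance` of `period`."""
    gap = median_gap_seconds(timestamps)
    if gap is None:
        return False
    return abs(gap - period.total_seconds()) <= tolerance.total_seconds()


if __name__ == "__main__":
    # Invented retry times, roughly 15 minutes apart with a little jitter.
    retries = [
        datetime(2024, 1, 1, 9, 0, 0),
        datetime(2024, 1, 1, 9, 15, 2),
        datetime(2024, 1, 1, 9, 29, 58),
        datetime(2024, 1, 1, 9, 45, 1),
    ]
    print(matches_period(retries, timedelta(minutes=15)))  # prints True
```

Knowing that something repeats every 15 minutes is the easy part. Explaining why it takes the connection down is another matter.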

The problem had to do with a redundant WAN interface I had set up as a failsafe in case the main line went down. It appeared that when traffic spiked, as it did each time the replication jobs retried, the gateway device would ignore the correct active/passive configuration and attempt to distribute traffic to the secondary interface; then the whole OS would crash.

We were able to work around the issue temporarily by disabling the second WAN connection. The following day when I showed the issue to the vendor, it turned out to be a previously unknown bug in the OS, and I spent the following months working with them to test a solution.

The valuable lesson I learned from the ordeal (besides not buying the cheapest possible hardware) was the importance of knowing your limits. Sometimes, even if you’re not the expert, you can figure it out and make it work given enough patience. However, not every situation is conducive to that approach. In larger organizations you can often escalate or at least get a second opinion from a colleague, but it can be harder in other environments.

Even in a situation where you think you’re the only one who can solve a problem, there is almost always somewhere to turn, whether it’s vendor support or a six-pack among friends. The trick is figuring out when and where to look for help.