We humans, and yes, even IT departments, can be our own worst enemies, especially when we ignore the basics of communication. At our company, the limited contact between developers and our network crew almost sank a project.
Too many years ago, we kicked off plans for a mainframe migration, with vague promises of eliminating the expense and complexity of "dinosaur" systems. The current iteration of the project has been under way for a few years, albeit in halting steps that keep getting pushed back. It was forecast to be complete by now, but we're only a third of the way through.
In order to shut down the mainframe, all systems have to be migrated to other platforms such as Unix, Linux, or Windows.
The project started well enough, but quickly fell behind. Contractors were hired, experts were consulted, and eventually, the decision was made to move the mainframe function to a remote location owned by a contractor and run from its data center.
A slow project lags further behind
However, the project has hit problem after problem, and in reality we won't be off our in-house mainframe for some time. One of the reasons: a disconnect between the developers and the networking team.
As network specialists dealing with switches, routers, and firewalls, our team has always tried to stay ahead of the bandwidth curve. We have dedicated bandwidth, quality-of-service policies, VPN backup, and WAN acceleration, to name a few tools we use to keep the network running.
As the project dragged on, one of the company's critical processes was migrated to another platform and went live at one location. The performance was abysmal. After the developers, contractors, and DBAs all had a poke at it, they defaulted to blaming the network. "It's slow!" they cried.
My team keeps very good history, and we showed the powers-that-be the ample bandwidth and the nearly 10:1 compression on the data transferred on the network. In addition, we demonstrated that the traffic was marked high priority for quality of service.
Apparently, the information was insufficient, so we set up packet captures to see what was transpiring between HQ host A and remote host B. Unfortunately, and unknown to us until after the fact, the developers changed the devices right around this time to try to fix the problem on their end, so we captured a few packets but nothing particularly useful.
A conference call was hastily arranged, and we were clued in to the fact they were now using two specially spun-up virtual machines for testing instead of capturing actual live data.
Fine -- we moved the tap port for the packet capture and started over. This time we got data, and it was telling. The process was a database copy but not a traditional copy. It was a send-commit-acknowledge process. We could clearly see packets arrive and a short delay, then an acknowledgement sent back.
Furthermore, we saw no response to the sending server's attempts to negotiate a larger window -- that is, to increase the amount of data that could be sent before an acknowledgement was required. The results were pretty conclusive: The process was not WAN-friendly.
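To see why a send-commit-acknowledge copy struggles over distance, it helps to run the numbers. The sketch below is a back-of-the-envelope model, not data from our captures; the segment size and round-trip times are illustrative assumptions, merely in the same ballpark as what we saw.

```python
# Model of a send-commit-acknowledge ("stop-and-wait") transfer:
# only one segment is ever in flight, so throughput is capped at
# one segment per round trip, no matter how fat the pipe is.
# All numbers below are illustrative assumptions.

def stop_and_wait_throughput(segment_bytes: float, rtt_seconds: float) -> float:
    """One segment in flight at a time: throughput = segment size / RTT."""
    return segment_bytes / rtt_seconds

SEGMENT = 1460      # typical TCP payload per packet, in bytes (assumed)
LAN_RTT = 0.0025    # ~2.5 ms between LAN-connected hosts (assumed)
WAN_RTT = 0.040     # ~40 ms to the remote data center (assumed)

lan = stop_and_wait_throughput(SEGMENT, LAN_RTT)
wan = stop_and_wait_throughput(SEGMENT, WAN_RTT)

print(f"LAN: {lan * 8 / 1e6:.2f} Mbit/s")
print(f"WAN: {wan * 8 / 1e6:.2f} Mbit/s")
print(f"Slowdown from latency alone: {lan / wan:.0f}x")
```

Note that the slowdown here is purely the ratio of the round-trip times; the available bandwidth never enters into it, which is exactly why our QoS and compression numbers couldn't save the process.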
Another conference call was put together. I explained the results, and one of the developers said, "It seems like it runs a thousand times faster at HQ than it does remotely."
My reply: "We have multiple gigabits per second of bandwidth at HQ and a few megabits at the remote site. A thousand times faster is about dead on."
The difference between LAN and WAN
Then it dawned on me: They had never tested this product across a WAN link! The 2-to-3-millisecond delay between LAN-connected hosts is not a problem, but the 40-millisecond round trip for each packet transferred remotely had turned the poor excuse for a file transfer protocol into a snail.
I mentioned this, but was met with confusion. I explained to the developer that the LAN is like a bucket brigade with one guy sitting between the two local servers. He fills the bucket at one server and dumps it immediately at the second server. With the WAN, they are still only using one guy, but he has to run all the way to the remote before he can dump the bucket and all the way back to get more.
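The bucket-brigade analogy can be pushed a little further in code. With a send window of W segments in flight (W carriers on the road at once), throughput grows with the window until the link itself becomes the bottleneck. The window sizes and the 10 Mbit/s remote link speed below are hypothetical, chosen only to show the shape of the curve.

```python
# Windowed transfer over the bucket-brigade analogy: W segments
# may be in flight per round trip, so throughput scales with the
# window until it hits the link's own capacity. Numbers are
# illustrative assumptions, not measurements.

def windowed_throughput(window_segments: int, segment_bytes: int,
                        rtt_seconds: float, link_bps: float) -> float:
    """Bits/s delivered: window's worth of data per RTT, capped by the link."""
    in_flight_bps = window_segments * segment_bytes * 8 / rtt_seconds
    return min(in_flight_bps, link_bps)

SEGMENT = 1460       # bytes per segment (assumed)
WAN_RTT = 0.040      # ~40 ms round trip (assumed)
WAN_LINK = 10e6      # hypothetical 10 Mbit/s remote link

for w in (1, 4, 16, 64):
    mbps = windowed_throughput(w, SEGMENT, WAN_RTT, WAN_LINK) / 1e6
    print(f"window = {w:3d} segments -> {mbps:.2f} Mbit/s")
```

With one carrier the WAN crawls; with enough carriers the remote link, not the latency, becomes the limit. That is essentially what the sender was asking for when it tried to grow the window and got no answer.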
We had a test WAN available, and the developers were told of it in the kick-off meetings. They either ignored or forgot it and instead continued to use the LAN for all remote-type testing, without ever running any of it as it would operate in production. Hence, they suddenly awoke to a performance problem they had never encountered before, which they attributed to a "slow network."
They never admitted that the transfer method was the problem, but they decided to pursue a different file transfer method better suited to a WAN. The very first test was 75 percent faster before anyone even attempted to tune it.
All told, the entire network team invested three days of troubleshooting time, to the exclusion of almost all other duties, only to prove that what we had told them the first time was true.