I was working for a large IT services provider whose customers included many big corporations, and we supported them remotely. As a result, we often did not know the physical location of a customer's machines, and we usually knew only the people who worked directly with us. On top of that, all work was supposed to be pre-approved through tickets, but all too often somebody would make changes without our knowledge.
Our team supported software that automated customers' crucial operations and kept them running. It had components running on several machines at once: a server component had to communicate with so-called "agent" components on other computers. Communication took place over TCP on the customer's internal network, and the network link between components was crucial for operation.
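To make that architecture concrete, here is a minimal sketch of what such a server-to-agent check over TCP might look like. Everything in it (the host names, the port, the PING/PONG exchange) is a hypothetical stand-in, not the actual product's protocol:

```python
import socket

# Hypothetical agents the server component polls; the real host names
# and port are stand-ins for the purposes of this sketch.
AGENTS = ["agent01.example.internal", "agent02.example.internal"]
AGENT_PORT = 7300  # assumed port, not the product's actual port

def poll_agent(host: str, port: int = AGENT_PORT, timeout: float = 5.0) -> bool:
    """Return True if the agent accepts a TCP connection and answers a probe."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\n")          # invented application-level probe
            return conn.recv(64).startswith(b"PONG")
    except OSError:
        # Covers DNS failure, refused connections, and timeouts alike.
        return False

if __name__ == "__main__":
    for agent in AGENTS:
        print(f"{agent}: {'up' if poll_agent(agent) else 'UNREACHABLE'}")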
One customer asked us to install a new environment with a new server and about 20 agents. The new environment was installed over the course of several days, and then we went into testing -- a phase planned to last about two months. The environment was a near replica of the production environment already in place, so we could correct any problems before they reached production.
At first everything went well. Then one day, three agents stopped responding, and whatever we did, the server component could not communicate with them. We started running standard network connectivity tests to determine whether the problem was with the agents themselves, with the machines on which they resided, or with the network link. We couldn't get any answer from the problem machines.
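The checks themselves are nothing exotic: basic reachability plus a TCP connection attempt against the agent's port. A sketch of that kind of triage might look like the following, again with hypothetical host names and port:

```python
import shutil
import socket
import subprocess

# Hypothetical unresponsive machines and an assumed agent port.
HOSTS = ["agent07.example.internal", "agent12.example.internal"]
AGENT_PORT = 7300

def icmp_ping(host: str) -> bool:
    """One ICMP echo request via the system ping command (Linux-style flags)."""
    if shutil.which("ping") is None:
        raise RuntimeError("ping not available on this system")
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Can we even complete a TCP handshake with the agent's port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for host in HOSTS:
        icmp = "ok" if icmp_ping(host) else "no reply"
        tcp = "open" if tcp_reachable(host, AGENT_PORT) else "closed/filtered"
        print(f"{host}: icmp={icmp}, tcp/{AGENT_PORT}={tcp}")
```

When both checks fail but the machines are reportedly healthy, suspicion naturally shifts to the network path between the components, which is exactly where this story went next.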
The customer was a huge company, and we soon learned that although the entire testing environment was on the same private network, the physical servers resided in several different U.S. states. We didn't know where the problem machines were located, so we called the customer's Unix team to check on them; they told us the machines were up and running and nothing was wrong with them. So it had to be a network problem.