About 20 years ago, I worked as a troubleshooter for a computer vendor. Since our customers were spread around the world, I traveled a great deal. When this particular story took place, I was on assignment in an Asian country that had purchased our computers to run its stock exchange.
One of the reasons I was there was to investigate why they were having a very high failure rate on components. The stock exchange had seen a huge growth in volume over the last year (rivaling the New York Stock Exchange). Because of the growth, they had upgraded their computer systems twice during the last year and were in the middle of another build-out to expand the computer space from one room to the whole floor.
The stock exchange's systems engineers who were in charge of the computers and maintenance gave me a tour. They were very proud of the expansion, of course, but even on a quick walkthrough, I was appalled by the lack of attention to detail.
For instance, as we made our way to the computer room, I noticed they had put up plastic sheets between the computer room and the construction area to try to minimize the dust. But there was no tape sealing the sheets to the wall. Instead, the sheets were hung with large gaps so that people could walk back and forth between the construction area and the computer room. The engineers seemed unconcerned when I pointed it out.
We began a tour of the computer room. The computers were high-power units that needed a large volume of air to keep cool. As I walked past, I touched the top of one of the computer cabinets (this was before the era of pizza box servers) and burned my hand. Opening the cabinet door, I discovered that the air filter was packed solid with dust from the construction and that there was no airflow.
Mystery solved -- I asked the systems engineers why they were not cleaning out the air filters.
Answer: "Because if we clean them out, they will just get full of dust again."
This opening day was followed by three weeks of similar incidents. The systems were running at 80 volts rather than the minimum allowed 100 volts, which accounted for the high rate of power-supply failures. At night, the engineers would just hit the master power switch to turn off the system rather than shut the system itself down, which explained some of the disk problems they were having. On the software side, I discovered that the system was badly misconfigured, which caused performance problems.
But the biggest issues were with the IT and business managers, who tended to be closed-minded and did not want to change. The business managers tended to say, "There must not be any problems." The IT managers had minimal, if any, technical background, and many of the systems engineers were very junior employees. This mix made for a culture in which it was very hard to speak up and implement positive changes.
The managers' default method of dealing with a problem was to blame the vendor. It took another eight months before they took some responsibility for their own actions. And the main reason they started taking responsibility was a series of news articles about mismanagement at the exchange.
The takeaway I had from the long experience was to start working with those at the top, not in the middle. When I started the assignment, I was working with middle managers who couldn't make policy decisions, and we didn't get anywhere. Even though working with the top IT managers was also a dead end for a while, at least I was dealing with those who could actually make the changes. Chipping away at the problems with the top managers got us further faster than working at the midlevels would have.