Another effect was slightly more mysterious: The SRAM in servers at the top of the racks had significantly more transient errors than the SRAM in the middle or at the bottom of the same racks, in both Jaguar and Cielo.
"There is a trend towards a higher rate of SRAM faults as you go up the rack," Sridharan said. "This is something we don't really have a good explanation for."
SRAM in the servers at the top of a rack had 20 percent more transient errors than SRAM in the servers on the lower levels. "This is not a huge effect, but it is a consistent one," Sridharan said.
The difference probably could not be attributed solely to cosmic rays, Sridharan said. He briefly speculated on a number of possible causes. For example, because heat rises, the servers at the top of a rack are hotter than those at the bottom, and heat is a well-known culprit in equipment failure.
A low-cost solution, such as installing heat shielding on server racks, may be worth investigating, Sridharan said.
In the study, the group also looked at DRAM faults. They examined memory from three different vendors and found that one vendor's fault rate was four times that of another. The group did not release the names of the vendors but did alert the vendor with the highest fault rate about the comparatively high rate of faults in its products.