"Implementing redundancy is not expensive. It may be high in the number of core counts that are needed, but it avoids the need for rewrites with checkpoint restarts," Fiala said. "The alternative is, of course, to simply rerun jobs until you think you have the right answer."
Fiala recommended running two backup copies of each program, for triple redundancy. Though running multiple copies of a program would initially consume more resources, over time it may actually be more efficient, because programs would not need to be rerun to check answers. Checkpointing may also be unnecessary when multiple copies are run, which would further save on system resources.
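The idea behind triple redundancy can be sketched as a majority vote over three independent runs: if at least two copies agree, that answer is accepted without a rerun. This is an illustrative sketch, not Fiala's implementation; the example values are hypothetical.

```python
# Minimal sketch of triple redundancy with majority voting over the
# results of three independent executions of the same computation.
from collections import Counter

def majority_vote(results):
    """Return the answer that at least two of three copies agree on,
    or None if all three disagree (only then is a rerun needed)."""
    value, count = Counter(results).most_common(1)[0]
    return value if count >= 2 else None

# Example: one copy hit a soft error and produced a corrupted result,
# but the two agreeing copies outvote it.
print(majority_vote([42, 42, 41]))  # -> 42
```

The vote replaces the "rerun until you think you have the right answer" approach: a single corrupted result is outvoted rather than forcing the whole job to be repeated.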
"I think the idea of doing redundancy is actually a great idea. [For] very large computations, involving hundreds of thousands of nodes, there certainly is a chance that errors will creep in," said Ethan Miller, a computer science professor at the University of California Santa Cruz, who attended the presentation. But he said the approach may not be suitable given the amount of network traffic that such redundancy might create. He suggested running all the applications on the same set of nodes, which could minimize internode traffic.
In another presentation, Ana Gainaru, a Ph.D. student at the University of Illinois at Urbana-Champaign, presented a technique for analyzing log files to predict when system failures will occur.
The work combines signal analysis with data mining. Signal analysis is used to characterize normal behavior, so when a failure occurs, it can be easily spotted. Data mining looks for correlations between separate reported failures. Other researchers have shown that multiple failures are sometimes correlated with each other, because a failure with one technology may affect performance in others, according to Gainaru. For instance, when a network card fails, it will soon hobble other system processes that rely on network communication.
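The data-mining half of the approach can be illustrated by counting how often distinct failure types appear close together in a log. This is a simplified sketch of the general idea, not Gainaru's algorithm; the log entries and event names are hypothetical.

```python
# Illustrative sketch of mining correlated failures from a log, where
# each entry is a (timestamp_in_seconds, failure_type) pair.
from collections import Counter
from itertools import combinations

def correlated_pairs(events, window=10):
    """Count pairs of distinct failure types that occur within
    `window` seconds of each other."""
    pairs = Counter()
    for (t1, a), (t2, b) in combinations(sorted(events), 2):
        if a != b and abs(t2 - t1) <= window:
            pairs[tuple(sorted((a, b)))] += 1
    return pairs

# A network-card failure followed shortly by an MPI timeout shows up
# as a correlated pair; the much later disk error does not.
log = [(0, "net_card_fail"), (6, "mpi_timeout"), (300, "disk_error")]
print(correlated_pairs(log))
```

Pairs that recur frequently across the log become candidate precursor relationships: seeing the first event then becomes a prediction that the second is likely to follow.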
The researchers found that 70 percent of correlated failures provide a window of opportunity of more than 10 seconds. In other words, once the first sign of a failure has been detected, the system may have at least 10 seconds to save its work, or move the work to another node, before a more critical failure occurs. "Failure prediction can be merged with other fault-tolerance techniques," Gainaru said.
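Exploiting that window might look something like the following: when a precursor event fires, the system tries to checkpoint within the lead time and falls back to migrating the job. This is a hypothetical sketch; the `checkpoint` and `migrate` hooks stand in for whatever fault-tolerance machinery the system actually provides.

```python
# Sketch of acting on a failure prediction within its lead-time window,
# assuming hypothetical checkpoint() and migrate() hooks.
import time

LEAD_TIME = 10  # seconds between the first warning and a critical failure

def on_failure_warning(job, checkpoint, migrate):
    """When a precursor event fires, try to checkpoint the job within
    the window; fall back to migrating it to a healthy node."""
    deadline = time.monotonic() + LEAD_TIME
    if checkpoint(job):            # fast local checkpoint succeeded
        return "checkpointed"
    if time.monotonic() < deadline:
        migrate(job)               # still time: move work to another node
        return "migrated"
    return "lost"                  # window expired before we could react

# Example with stub hooks: the checkpoint fails, so the job is migrated.
print(on_failure_warning("job-17", lambda j: False, lambda j: None))
```

This is the sense in which prediction "can be merged with other fault-tolerance techniques": the forecast itself saves nothing, but it buys time for checkpointing or migration to act.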