As supercomputers grow more powerful, they'll also grow more vulnerable to failure, thanks to the increased amount of built-in componentry. A few researchers at the recent SC12 conference, held last week in Salt Lake City, offered possible solutions to this growing problem.
Today's high-performance computing (HPC) systems can have 100,000 nodes or more -- with each node built from multiple components of memory, processors, buses and other circuitry. Statistically speaking, all these components will fail at some point, and they halt operations when they do so, said David Fiala, a Ph.D student at the North Carolina State University, during a talk at SC12.
The problem is not a new one, of course. When Lawrence Livermore National Laboratory's 600-node ASCI (Accelerated Strategic Computing Initiative) White supercomputer went online in 2001, it had a mean time between failures (MTBF) of only five hours, thanks in part to component failures. Later tuning efforts had improved ASCI White's MTBF to 55 hours, Fiala said.
But as the number of supercomputer nodes grows, so will the problem. "Something has to be done about this. It will get worse as we move to exascale," Fiala said, referring to how supercomputers of the next decade are expected to have 10 times the computational power that today's models do.
Today's techniques for dealing with system failure may not scale very well, Fiala said. He cited checkpointing, in which a running program is temporarily halted and its state is saved to disk. Should the program then crash, the system is able to restart the job from the last checkpoint.
The problem with checkpointing, according to Fiala, is that as the number of nodes grows, the amount of system overhead needed to do checkpointing grows as well -- and grows at an exponential rate. On a 100,000-node supercomputer, for example, only about 35 percent of the activity will be involved in conducting work. The rest will be taken up by checkpointing and -- should a system fail -- recovery operations, Fiala estimated.
Because of all the additional hardware needed for exascale systems, which could be built from a million or more components, system reliability will have to be improved by 100 times in order to keep to the same MTBF that today's supercomputers enjoy, Fiala said.
Fiala presented technology that he and fellow researchers developed that may help improve reliability. The technology addresses the problem of silent data corruption, when systems make undetected errors writing data to disk.
Basically, the researchers' approach consists of running multiple copies, or "clones" of a program, simultaneously and then comparing the answers. The software, called RedMPI, is run in conjunction with the Message Passing Interface (MPI), a library for splitting running applications across multiple servers so the different parts of the program can be executed in parallel.
RedMPI intercepts and copies every MPI message that an application sends, and sends copies of the message to the clone (or clones) of the program. If different clones calculate different answers, then the numbers can be recalculated on the fly, which will save time and resources from running the entire program again.