Exascale system development poses a unique set of power, memory, concurrency and resiliency challenges. Resiliency refers to the ability to keep a massive system, with millions of cores, continuously running despite component failures. "I think resiliency is going to be a great challenge and it really would be nice if the computer would stay up for more than a couple of hours," said Harrod.
The scale of the challenge is evident in the power goals. The U.S. wants an exascale system that needs no more than 20 megawatts (MW) of power. In contrast, the leading petascale systems in operation today use as much as 8 MW or more.
Although processor capability remains paramount, it is not the center of attention in exascale system design. Dave Turek, vice president of exascale systems at IBM, said the real change with exascale systems isn't around the microprocessor, especially in the era of big data. "It's really settled around the idea of data and minimizing data movement as the principal design philosophy behind what comes in the future," he said.
In today's systems, data has to travel a long way, which consumes power. Datasets "being generated are so large that it's basically impractical to write the data out to disk and bring it all back in to analyze it," said Harrod. "We need systems that have large memory capacity," he said. "If we limit the memory capacity, we limit the ability to execute the applications as they need to be run."
Exascale systems require a new programming model, and for now there isn't one. High performance computing allows scientists to model, simulate and visualize processes. The systems can run endless scenarios to test hypotheses, such as discovering how a drug may interact with a cell or how a solar cell operates. Larger systems allow scientists to expand resolution, or look at problems in finer detail, as well as add more physics to a problem.
The U.S. research effort would aim to fully utilize the potential of exascale and achieve one-billion-way concurrency. To give some perspective on that goal, researchers at Argonne National Laboratory developed a multi-petaflop simulation of the universe. Salman Habib, a physicist at the lab, said the simulation achieved a sustained 13.94 petaflops on more than 1.5 million cores, with a total concurrency of 6.3 million at four threads per core, on IBM's Sequoia system.
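The figures above can be checked with simple arithmetic. A quick back-of-envelope sketch (the variable names are illustrative; the core count is Sequoia's published figure, consistent with the "more than 1.5 million" in the text):

```python
# Back-of-envelope check of the Sequoia concurrency figures quoted above.
cores = 1_572_864        # Sequoia's published core count
threads_per_core = 4     # as quoted by Habib

concurrency = cores * threads_per_core
print(concurrency)       # about 6.3 million, matching the quoted total

# Compare with the stated exascale goal of one-billion-way concurrency.
exascale_goal = 1_000_000_000
print(exascale_goal / concurrency)  # roughly a 160x jump in parallelism
```

Even the largest simulation to date, in other words, sits more than two orders of magnitude below the billion-way concurrency target.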
The project is the largest cosmological simulation to date. "Much as we would all like to, we can't build our own universes to test various ideas about what is happening in the one real universe. Because of this inability to carry out true cosmological experiments, we run virtual experiments inside the computer and then compare the results against observations -- in this sense, large-scale computing is absolutely necessary for cosmology," said Habib.
To accomplish the task, researchers must run hundreds or thousands of virtual universes to tune their understanding. "To carry out such simulation campaigns at high fidelity requires computer power at the exascale," said Habib. "What is exciting is that by the time this power will be available, the observations and the simulations will also be keeping pace."
The total number of nodes in an exascale system will likely be in the 100,000 range, similar to large systems today. Each node, though, is becoming more parallel and powerful, said Pete Beckman, director of the Exascale Technology and Computing Institute at Argonne National Laboratory. An IBM Blue Gene/Q node, for instance, has 16 cores running 64 threads. As time goes on, the number of threads per node will increase from the hundreds to upwards of a thousand.
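Beckman's point about where the extra parallelism comes from can be sketched with the article's own numbers (a rough illustration; the round figures are assumptions, not measurements):

```python
# Today: a Blue Gene/Q node has 16 cores with 4 hardware threads each.
bgq_threads_per_node = 16 * 4            # 64 threads per node

# If node counts stay near 100,000 while per-node threads grow toward 1,000,
# the system-wide thread count grows by more than an order of magnitude.
nodes = 100_000
future_threads_per_node = 1_000

today = nodes * bgq_threads_per_node      # 6.4 million threads
future = nodes * future_threads_per_node  # 100 million threads
print(today, future)
```

Note that even a thousand threads per node across 100,000 nodes yields about 100 million hardware threads; reaching billion-way concurrency would require additional fine-grained parallelism within each thread.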