Let's say you paid $100 million for your machine, and you have all of those CPUs working hard on your problem. If one of them is slightly slower, it's like degrading the value of your machine by 50% or more. That's how we do many of the computations right now. In terms of power management, the compiler, the code, and the runtime system have to cooperate in deciding when we can speed processors up and when we can slow them down. It can't be a self-deciding component.
What you try to do is make sure all the processors run at exactly the same speed, and they always return the answer at the same speed, so you don't have any lagging, slow processor, or you try to cull [the laggards] out before they even run. Sometimes there are ways to determine that there are parts of a machine that aren't running as fast. But sometimes it's not so easy to do that.
With the size of memory that we have today, some part of your machine is likely to be correcting a single-bit error at any given moment. Single-bit errors can be detected and corrected automatically, but that still takes a few CPU cycles, which means that processor is still going to be late, just a fraction, to the computation because it had to clean up this fault. As we move to lower power, we also recognize that faults go up. The closer you are to operating at the jagged edge, the threshold of computing, the more noise there is in the system, and therefore the more faults there will be. This issue is quite a complex one.
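To make the straggler effect described above concrete, here is a minimal sketch. It is an illustration, not something from the interview. It assumes a bulk-synchronous model in which every processor must finish a step before any of them can start the next one, so a step takes as long as its slowest processor. The function names, the processor count of 1,000, and the slowdown factors are all hypothetical choices.

```python
def step_time(worker_times):
    """A synchronized step ends only when the slowest worker finishes."""
    return max(worker_times)


def efficiency(worker_times):
    """Fraction of total processor time spent computing rather than waiting."""
    total_busy = sum(worker_times)
    total_available = len(worker_times) * step_time(worker_times)
    return total_busy / total_available


n = 1000      # assumed number of processors
base = 1.0    # assumed time per step for a healthy processor

uniform = [base] * n
one_half_speed = [base] * (n - 1) + [2.0 * base]    # one worker takes twice as long
one_ten_pct_slow = [base] * (n - 1) + [1.1 * base]  # one worker is 10% slower

print(efficiency(uniform))           # 1.0
print(efficiency(one_half_speed))    # 0.5005
print(efficiency(one_ten_pct_slow))  # roughly 0.909
```

Under these assumptions, one processor running at half speed leaves the other 999 idle for half of every step, which is where a figure like 50% comes from. A smaller delay, such as a few cycles spent correcting a memory error, costs proportionally less per step but is paid by every processor that has to wait.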
What impact do you think China's new system will have, or should have on exascale development in the U.S.?
Beckman: My personal hope is it is a demonstrator of how hard work and investment in technology are important to China, and how that should be important to the U.S. as well.
It isn't just exascale. It's this notion that cutting-edge large science systems in computation drive a lot of research and a lot of industry. Our investment in this space is really key to remaining competitive and being the innovators of this space. One of the things that's interesting about China's announcement, in my opinion, is they geared up this company, Inspur, to sell these machines inside China. They are building the infrastructure to churn out these systems within China and the question is then, who is next? Will they be shipping any to India? Will they eventually have the expertise to ship these to Brazil and to other countries?
So in sum, is it correct to say that China is accomplishing multiple things here: They are getting their science together, fueling a new IT industry, and are potentially creating new exports?
Beckman: It's exactly that. They are designing their own chips. They have geared up a set of students and professors, industry, and semiconductor companies to build this infrastructure. What about the software? They are not going to download software from around the world. They are designing teams to build the software. Are they preparing to export this system? You bet. They aren't just building this in the university, they've included this company, and that company will then be able to make multiple versions of this.
This article, What China's supercomputing push means for the U.S., was originally published at Computerworld.com.
Patrick Thibodeau covers cloud computing and enterprise applications, outsourcing, government IT policies, data centers and IT workforce issues for Computerworld.