Perusing this week's big-compute news stream, as reported by insideHPC.com and other sites, I'm struck that the biggest theme for those building supercomputers for business or science is this: chip vendors are unreliable.
This ain’t news. Each of the chip manufacturers has missed schedules and shipped bad products (remember the Intel chip that couldn’t add?) in some years, and has had great years in others. In bad years they generally claim that making chips is hard, and in good years they claim to have vastly superior technology to their competitors. I think the former is true: making chips is hard. Sometimes you get lucky and ship a (mostly) working product on time, and sometimes you don’t.
AMD was much in the news over the past two weeks as word of its TLB bug circulated widely, followed by SPEC pulling all Opteron benchmark results for failing to meet the general availability requirements. AMD also apparently underestimated the power draw of the Opterons, raising its TDP estimates from 68W, 95W, and 120W for various flavors of its chips to 79W, 115W, and 137W. And we ended the week with news from The Register that Sun may delay its Rock processor into 2009, and that when it does ship it may have a significantly reduced feature set.
None of this is such a big deal for the desktop consumer. In a one-socket machine no one much cares if performance is a few percentage points off, or if the power draw is slightly higher. But for those of us building parallel computers these errors are a big deal. An 11 watt per socket error in a 1,000 or 10,000 socket machine adds up to real money, if the customer can even support the increased power requirement at all. And acquisition decisions are often made on the basis of thin margins for better power or application performance. Changes like these can invalidate purchasing decisions and send organizations back to the drawing board.
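To put a rough number on that, here's a back-of-the-envelope sketch of what an 11 W/socket TDP miss costs at cluster scale. The electricity price and full-load utilization are my own assumptions for illustration (and the figure ignores the extra cooling load, which would push it higher):

```python
# Back-of-the-envelope cost of an 11 W/socket TDP underestimate
# (e.g. 79 W actual vs. 68 W quoted) across a 10,000-socket machine.
# Price and utilization below are assumed, not from any vendor data.

WATTS_PER_SOCKET_ERROR = 11
SOCKETS = 10_000
PRICE_PER_KWH = 0.10        # assumed $/kWh
HOURS_PER_YEAR = 24 * 365   # assumes the machine runs flat out

extra_kw = WATTS_PER_SOCKET_ERROR * SOCKETS / 1000   # 110 kW of unplanned draw
extra_cost = extra_kw * HOURS_PER_YEAR * PRICE_PER_KWH

print(f"Extra draw: {extra_kw:.0f} kW")
print(f"Extra energy cost per year: ${extra_cost:,.0f}")
```

Even at a modest ten cents per kilowatt-hour, that's roughly $96,000 a year in electricity alone, before cooling, and before asking whether the facility's power distribution can absorb another 110 kW at all.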
We typically don’t buy big compute directly from the chip manufacturers, but from vendors (like Rackable, Sun, SGI, and others) who specialize in combining the servers with interconnects, software, and storage to create a total system. When these vendors commit to delivering a machine, they base the expected performance on what the manufacturers tell them. Then customers build facilities, clear out floorspace, and commit to internal service levels based on what the vendors tell them. There is ample anecdotal evidence that the chip manufacturers highly prize their role in the supercomputing market, and often deeply discount their chips to help vendors win bids.
But when these manufacturers flub the fundamentals, there is no negative feedback loop. The supercomputing vendors make the commitments, and they bear the brunt of late delivery penalties and the systems engineering costs of repairing ailing platforms. The manufacturers have been paid, and have moved on to the next win or the next chip.
This model is fundamentally broken. Other than taking a goodwill hit, which may or may not translate into damage on future sales, there is little penalty for the chip manufacturers for late delivery, underperformance, or bad infrastructure estimates. The real cost of these mistakes is borne by the system vendors, and by the HPC community itself in the form of reduced productivity and the delayed entry of new customers who could use the added power but aren’t prepared to deal with the problems.
The supercomputing and cluster system vendors need to find a business model that ties their suppliers more tightly to the delivered performance of the systems they are all, together, in the business of building.