When I wrote about password guessing using GPUs last week, I mentioned that password guessing is an embarrassingly parallel problem, right up there with 3-D rendering, face recognition, Monte Carlo simulation, particle physics event reconstruction, biological sequence searching, genetic algorithms, and weather modeling. Later that day, I spoke with Andy Keane, General Manager of the GPU Computing business at Nvidia.
While we were talking, Andy looked over my posting and basically said that I'd gotten it right: GPU computing, and specifically CUDA, is designed to process massive data parallelism over huge data sets efficiently. Embarrassingly parallel problems are exactly the ones that exhibit massive data parallelism, so they are the problems that show the best speed-ups when the processing shifts from the CPU to the GPU. Andy went even further, though, saying, "I don't believe in core-scaling parallelism. It doesn't work. It didn't work 20 years ago, and it won't work now."
Under the hood, the Nvidia GPU was designed to be data-centric, but for a graphics display the data are triangles and textures. The CUDA architecture layer makes the GPU look more processor-like and better suited to general parallel problems.
Another thing that the CUDA layer does is to hide the underlying processing hardware. If you develop a CUDA program using a $100 garden-variety consumer/gamer GeForce board, you can then take the very same program and run it on a personal supercomputer built around $1,699 Tesla C1060 boards or on a four-teraflop 1U Tesla S1070 server. You don't need to know how many cores are available; you just have to tell the executive (the CUDA runtime) about the parallelism of your data.
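That separation of concerns is visible in the code itself. A minimal sketch (a hypothetical SAXPY kernel, not an example from my conversation with Andy): the kernel describes the work for one data element, the host code describes how many elements there are, and the CUDA runtime maps the resulting blocks onto however many multiprocessors the particular GPU happens to have.

```cuda
// Kernel: one thread per array element. Nothing here mentions
// how many cores the GPU has.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n)                                      // guard the tail block
        y[i] = a * x[i] + y[i];
}

// Host side: describe the parallelism of the data, not the hardware.
void run_saxpy(int n, float a, float *d_x, float *d_y)
{
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceil(n/256)
    saxpy<<<blocks, threadsPerBlock>>>(n, a, d_x, d_y);
}
```

The same binary launch works on a GeForce with a handful of multiprocessors or a Tesla with dozens; only the scheduling changes.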
For embarrassingly parallel problems, for example digital tomography, an under-$10,000 Tesla personal supercomputer can beat a $5 million Sun CalcUA. Does that apply to other problems? It depends, of course, on whether they are inherently parallel.
From a business point of view, the Nvidia GPU Computing group is currently concentrating on oil and gas, finance, and medical imaging applications. They have 250+ customers and ISVs that they can talk about, and more that they can't discuss. They also have over 50 universities teaching CUDA, and an estimated 25,000 active CUDA developers.
Isn't it hard to write parallel code? Well, yes, but no harder than it needs to be. Here's an example:
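The picture from the original post isn't reproduced here. As a hedged stand-in, this is the kind of before-and-after comparison such examples typically show (a hypothetical vector add, not necessarily the code in the original picture): the serial loop disappears, and each CUDA thread takes one iteration.

```cuda
// Serial C version: one thread of control walks the whole array.
void vec_add_cpu(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// CUDA version: the loop is gone; each thread computes one element,
// and the guard handles the case where n isn't a multiple of the block size.
__global__ void vec_add_gpu(int n, const float *a, const float *b, float *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}
```

The parallel version is barely longer than the serial one, which is the point: for data-parallel loops, the port is mostly a matter of replacing the loop index with a thread index.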