GPU computing is about massive data parallelism

For embarrassingly parallel problems, for example digital tomography, an under-$10,000 Tesla personal supercomputer can beat a $5 million Sun CalcUA. CUDA makes the parallel programming tractable.

When I wrote about password guessing using GPUs last week, I mentioned that password guessing is an embarrassingly parallel problem, right up there with 3-D rendering, face recognition, Monte Carlo simulation, particle physics event reconstruction, biological sequence searching, genetic algorithms, and weather modeling. Later that day, I spoke with Andy Keane, General Manager of the GPU Computing business at Nvidia.

While we were talking, Andy looked over my posting and basically said that I'd gotten it: GPU computing, and specifically CUDA, is designed to efficiently process huge data sets that exhibit massive data parallelism. Embarrassingly parallel problems are exactly the ones that exhibit massive data parallelism, so they show the best speed-ups when the processing shifts from the CPU to the GPU. Andy went even further than that, though, and said, "I don't believe in core-scaling parallelism. It doesn't work. It didn't work 20 years ago, and it won't work now."

Under the hood, the Nvidia GPU was designed to be data-centric, although for a graphics display the data are triangles and textures. The CUDA architecture layer makes the GPU look more processor-like, and more appropriate for general parallel problems.

The CUDA layer also hides the underlying processing hardware. If you develop a CUDA program using a $100 garden-variety consumer/gamer GeForce board, you can take the very same program and run it on a personal supercomputer built around $1,699 Tesla C1060 boards or on a four-teraflop 1U Tesla S1070 server. You don't need to know how many cores are available: you just have to tell the executive about the parallelism of your data.

For embarrassingly parallel problems, for example digital tomography, an under-$10,000 Tesla personal supercomputer can beat a $5 million Sun CalcUA. Does that apply to other problems? It depends, of course, on whether they are inherently parallel.

From a business point of view, the Nvidia GPU Computing group is currently concentrating on oil and gas, finance, and medical imaging applications. They have 250+ customers and ISVs that they can talk about, and more that they can't discuss. They also have over 50 universities teaching CUDA, and an estimated 25,000 active CUDA developers.

Isn't it hard to write parallel code? Well, yes, but no harder than it needs to be. Here's an example:

[Figure: CUDA parallel C code]

As you can see, the for loop in the sequential code is turned into a worker that operates on one data point. The call to the sequential code is turned into a very similar parallelized call to the workers using the new <<<nblocks, 256>>> notation, which launches nblocks blocks of 256 threads each.
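In case the figure is not visible, here is a minimal sketch of that kind of conversion. It is not a copy of the figure: it assumes the routine is SAXPY (y = a*x + y), and the names saxpy_serial, saxpy_parallel, and especially the host-side helper saxpy_on_gpu are illustrative choices rather than anything taken from Nvidia's example.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Sequential version: one for loop walks over every element.
void saxpy_serial(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Parallel version: the loop body becomes a worker (a kernel) that
// handles exactly one data point, selected by its block and thread index.
__global__ void saxpy_parallel(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // the last block may contain threads past the end of the data
        y[i] = a * x[i] + y[i];
}

// Hypothetical host-side helper (an assumption, not from the article):
// copies the inputs to the GPU, launches the workers, copies y back.
// Returns true if every CUDA call succeeded.
bool saxpy_on_gpu(int n, float a, const float *x, float *y)
{
    if (n <= 0) return true;  // nothing to do
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *d_x = nullptr;
    float *d_y = nullptr;

    if (cudaMalloc((void **)&d_x, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void **)&d_y, bytes) != cudaSuccess) {
        cudaFree(d_x);
        return false;
    }

    bool ok = cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        // Describe the parallelism of the data, not the number of cores:
        // enough 256-thread blocks to cover all n elements.
        int nblocks = (n + 255) / 256;
        saxpy_parallel<<<nblocks, 256>>>(n, a, d_x, d_y);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(d_x);
    cudaFree(d_y);
    return ok;
}
```

Calling saxpy_on_gpu(n, 2.0f, x, y) from a host program should leave the same values in y as saxpy_serial(n, 2.0f, x, y) would. Note that the launch line never mentions how many cores the board has; only the data size n determines nblocks.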

When I asked about people's GPU computing experience in December, I got three answers. They're all illuminating:

John Stone wrote "Overall, developing GPU accelerated software is relatively easy to learn. My own feeling is that it is no more difficult to learn than any other type of parallel programming. Students in the classes taught at University of Illinois have picked up GPU programming rapidly, and have been successful at applying GPU techniques to a wide diversity of problems. That said, I would expect that experienced professional developers would have no difficulty learning GPU programming with CUDA. Once you learn the general concepts, it becomes pretty easy to find applications that benefit from the massive data parallelism available on these devices."

Richard Edgar wrote "I too found CUDA fairly easy to pick up, based on previous experience with MPI and OpenMP (this is not to say that the examples NVIDIA provides couldn't be clearer....). I got started by throwing together a workstation with a CUDA-capable GPU, and sitting down with the manual for a couple of afternoons.

"There are certainly some pitfalls when trying to optimise CUDA code, but a basic conversion is usually quite straightforward in my experience. One of the biggest problems can be finding enough parallelism -- having 128 independent 'work units' isn't enough all of a sudden."

And Anne C. Elster wrote "I agree with Martin Heller that parallel programming is not easy, but I also side with Richard Edgar: The Nvidia CUDA environment makes programming GPUs no harder than any other parallel programming and much much easier than programming in, say, Cg or Cell programming.

"I currently advise 11 master students and 2 PhD students, most of my master students working on GPU projects. These include projects involving CUDA working on projects related to seismic processing, medical imaging, parallel solvers, linear programming, and performance studies.

"CUDA also gives you easy access to local memory on the GPUs, which is critical for performance."

So there you have it. There's more information at Nvidia's Tesla and CUDA pages, as well as on its YouTube channel.