Parallelization is next performance horizon

Parallel execution, streaming, batching, and coprocessing will blast through architectural bottlenecks to pull next-frontier performance out of mainstream hardware

Server and workstation innovations like multi-socket and multi-core technology have steadily boosted performance density; they've increased the work that can be done in a given amount of space and on a fixed amount of power. This serves a fine purpose, but it stops short of the more rewarding objective of releasing all of the compute power stored in a watt. The year 2009 will bring landmark changes to processor and system architectures. In the x86 space, AMD and Intel have already gotten an early start with Q4 deliveries of Shanghai and Core i7. As advanced as these new architectures are, we'll see an incremental increase, not a history-making leap, in common benchmarks and production performance. The problem isn't that we lack incredible hardware. It's that software hasn't kept pace.

A history-making leap in x86 server, workstation, desktop, and notebook performance is approaching. It has nothing to do with chip manufacturing process shrink, clock speed, cache, or DDR3 memory. It's about an intelligent way to harness hardware advances for something better than working around that two-ton millstone of PC design: Every process, while active, expects to own the entire PC, so elaborate means are required to permit serial ownership of the physical or logical system. That arrangement foils efforts to unlock the potential of multi-socket, multi-core, multi-threading, and coprocessing hardware. That potential is defined as the system's ability to execute tasks in parallel, and on PCs, we're miles from the ideal that modern hardware makes possible.

CPUs do the best they can to keep multiple independent execution units inside the processor busy by rescheduling instructions such that integer, floating point, and memory access operations happen at the same time. Compilers, some better than others (raises a glass to Sun), do what they can to optimize code so that the compiled application operates efficiently. Then it's up to the operating system.

Operating systems deal CPU ownership to processes in a round-robin fashion. Multiple cores provide a brute force approach to improved performance by giving the OS multiple places to park processes that expect to have the system to themselves, but a closed, generic OS is a poor matchmaker between application requirements and CPU resources. Virtual machine managers play to this weakness by gating each guest OS's access to system hardware, effectively restricting the number of places an OS can park processes for the sake of parallel processing.
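For readers who want to see round-robin dealing in miniature, here is a rough Python sketch of a single core handing out a fixed time slice. It is a toy model, not how any real OS scheduler is written; the function name, the process names, and the time units are made up for illustration.

```python
from collections import deque


def round_robin(burst_times, quantum):
    """Simulate round-robin scheduling on one core (toy model).

    burst_times: dict mapping a process name to the CPU time it needs.
    quantum: fixed time slice granted on each turn.
    Returns a dict mapping each process name to its completion time.
    """
    ready = deque(burst_times.items())  # queue of (name, remaining time)
    clock = 0
    finished = {}
    while ready:
        name, remaining = ready.popleft()
        run = min(quantum, remaining)  # a process never runs past its slice
        clock += run
        remaining -= run
        if remaining > 0:
            ready.append((name, remaining))  # unfinished: back of the line
        else:
            finished[name] = clock
    return finished


if __name__ == "__main__":
    print(round_robin({"a": 3, "b": 5, "c": 2}, quantum=2))
```

Each process gets the core to itself for one slice and then waits its turn again, which is the serial ownership described above. Adding cores to this model would only add more queues to park processes in.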

The problem, expressed in distilled form, is that the CPU, the compiler, the operating system, and the virtual machine manager pull various strings toward a similar end, that being the efficient shared use of fixed resources. Since every actor fancies itself in charge of this goal, none is. We'll have more cores, more sockets, and live migration to put processes on idle cores regardless of location, but we should also start attacking the problem from another direction.

At a high level, a fixed, mandated, thoroughly documented, deeply instrumented framework is the top priority. A pervasive set of high-level frameworks from the OS maker serves two purposes: Its existence and documentation obviate the need to reinvent anything already functional in the framework, and low-level code within the framework can be changed without disrupting applications. Look to OS X for frameworks in action. Frameworks are system and inter-framework aware. Sweet global optimizations can be applied across the entire framework stack, and between frameworks and the OS, that aren't possible with libraries that don't dare make assumptions about their environment.

I'd like to see the relatively fixed formula for determining the amount of time a process is permitted residence on a CPU (the quantum) replaced with an adaptive approach. The longer a process remains in residence on a core, the more compile-time optimizations can come into play. Code can be optimized such that opportunities for true parallel operation are identified and exploited, but getting the greatest bang from this technique requires that the OS take some advice from the compiler about how to schedule parallelized applications. That developers must now do this by hand is a limiting factor in parallelization's use.
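To make the adaptive idea concrete, here is a speculative Python sketch of a quantum that stretches when the compiler reports heavily parallelized code and shrinks when more processes are waiting. Everything in it is an assumption for illustration: no compiler emits a CompilerHint today, and the base and maximum values are arbitrary.

```python
from dataclasses import dataclass

BASE_QUANTUM_MS = 10  # assumed default slice, illustrative only
MAX_QUANTUM_MS = 80   # assumed cap so one process cannot hog a core


@dataclass
class CompilerHint:
    """Hypothetical advice a compiler might attach to a binary."""
    parallel_fraction: float  # share of run time in parallelized regions, 0.0 to 1.0


def adaptive_quantum(hint, runnable_count):
    """Return a time slice in milliseconds for one process.

    The slice grows with the compiler's reported parallel fraction and
    is divided among the runnable processes, never dropping below the base.
    """
    fraction = min(max(hint.parallel_fraction, 0.0), 1.0)  # clamp bad hints
    stretched = BASE_QUANTUM_MS + fraction * (MAX_QUANTUM_MS - BASE_QUANTUM_MS)
    return max(BASE_QUANTUM_MS, stretched / max(runnable_count, 1))
```

The point of the sketch is the shape of the interface: the scheduler still makes the final call, but it does so with advice from the one actor that knows where the parallel regions are.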

Another approach to parallelization is to set aside cores and other resources as off limits to the OS scheduler. If you know that you're operating on a 100% compute workload, you could safely cordon off an x86 core or two, a block of GPU cores, and the nearest available memory for a coprocessor. Imagine the effect that setting aside a logical coprocessor just for pattern matching would have on server performance. It would accelerate databases, data compression, intrusion detection, and XML processing, and if it were wired into the framework that everyone uses for pattern matching, the same code would work identically whether the coprocessor were present or not. The bonus is that such logical coprocessors require no additional hardware.
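The "works identically either way" property is easy to sketch in Python. The PatternMatcher class below is hypothetical, not a real framework API, and a standard-library thread pool stands in for the reserved cores; it shows only the dispatch pattern and would not by itself wall anything off from the OS scheduler.

```python
import re
from concurrent.futures import ThreadPoolExecutor


class PatternMatcher:
    """Hypothetical framework entry point for pattern matching.

    Callers use count_matches() the same way whether or not a
    coprocessor is available; only the dispatch underneath changes.
    """

    def __init__(self, coprocessor=None):
        # coprocessor: any object with a map(func, iterable) method, or None.
        self._coprocessor = coprocessor

    def count_matches(self, pattern, documents):
        compiled = re.compile(pattern)

        def count(text):
            return len(compiled.findall(text))

        if self._coprocessor is not None:
            # Offload path: hand each document to the reserved workers.
            return list(self._coprocessor.map(count, documents))
        # Fallback path: same work, done inline on the caller's CPU.
        return [count(text) for text in documents]


if __name__ == "__main__":
    docs = ["error at 10:02, error at 10:05", "all clear", "error"]
    print(PatternMatcher().count_matches(r"error", docs))
    with ThreadPoolExecutor(max_workers=2) as pool:  # stand-in for reserved cores
        print(PatternMatcher(pool).count_matches(r"error", docs))
```

Both calls return the same counts, and the calling code never learns which path ran. That is the property a framework-level coprocessor would need in order to be adopted without rewriting applications.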

I've liked this concept so much that I've looked into making it work. Operating systems don't like being told that there are parts of a system they can't touch, and developers would have to craft coprocessor code carefully so that it touches nothing but the CPU and memory. It's effectively an embedded paradigm, one that I believe is underutilized but for which OSes and compilers are not tuned.

Fortunately, I don't have to carry my logical coprocessor beyond theory. For years, I've gotten the stinkeye from vendor reps and colleagues when I suggest that workstation and gamer-grade graphics cards belong in servers. Forget about using the display; that's a distraction. 3-D graphics cards are massively parallel machines capable of remarkable feats of general computing. Not only does their computing run independently of, and in parallel with, the rest of the system, but multiple tasks also run in parallel on the card itself. The snooty dismiss as irrelevant the magic evident in a desktop PC's ability to take users through a vast, complex, and realistic 3-D landscape with moving objects abiding by laws of physics and (sometimes) convincingly intelligent adversaries. If you took the 3-D card out of this, you'd lose a lot more than the graphics. The PC, no matter how fast, wouldn't be able to handle a fraction of the computing that has nothing to do with pixels.

That's the ideal in parallelization, and unlike my logical coprocessor concept, the hardware is already in place to tap GPUs for server-grade computing. AMD has the ability to gang four fire-breathing workstation graphics cards together through technology it calls CrossFireX. In this configuration, a 2U rack server could have access not only to sixteen AMD Shanghai Opteron cores and 64 GB of RAM, but also to hundreds of additional number-crunching cores (depending on how you count them) running at 750 MHz, with 8 GB of GDDR5 memory all to themselves. It takes software to unlock that, and we're finally turning the corner on industry consensus around a standard called OpenCL for putting GPUs to work for general computing.

That's heartening, but we still need a change in paradigm. There is enormous potential in the GPU, but similar potential can be extracted from resources that standing servers already have. For as long as parallelization has been around, it has remained a stranger to the PC. What a waste.

Copyright © 2008 IDG Communications, Inc.