When I first heard "Nehalem," it called to mind the noise that Felix Unger from "The Odd Couple" made to clear his sinuses. If Intel gets its way, its Nehalem CPU and system architecture will have a similar effect on the clogged-up market. I'm hoping it will also clear bogged-down workloads. It's not that we're suffering mightily with the likes of Intel Core 2 Duo and AMD Opteron Shanghai and Phenom II, but it's time we were rocked by something bigger than a speed bump.
Nehalem isn't strictly new, but I hung back until I could see it in a 2P platform (meaning two CPUs, or two sockets, if that's clearer) that shows it to its best advantage. An early look at such a platform is generally supplied only by the chipmaker itself, with one exception: Apple. It's the only first-tier system maker that's willing to have its high-end machines held up as exemplars of a CPU or system architecture, knowing that, in stories like this one, the architecture is given higher billing than the system itself.
2P Nehalem came to me in the guise of Apple's eight-core Mac Pro. OS X's Activity Monitor shows a pair of Nehalems as a 16-core CPU. Hyper-Threading has returned to the x86, but its role and potential are much changed since it went into rehab after the fall of Pentium 4. With a smart OS scheduler and some smart programmers, Hyper-Threading could do some real damage this time around. You may recall that with single-core CPUs, Intel claimed that Hyper-Threading was capable of boosting performance by up to 30 percent. Apple's published benchmarks show that an eight-core Nehalem, running at 2.9GHz, bests its prior 3GHz, eight-core Mac Pro. By my rough weighted averaging and using Apple's own numbers (not mine; those come next), Nehalem turns in 60 to 70 percent higher numbers.
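For anyone who wants to check where that 16 comes from, here is a minimal sketch of the arithmetic. It assumes what the paragraph above implies: two sockets, four cores per socket, and two Hyper-Threading threads per core. The function name logical_cpus is made up for illustration; os.cpu_count() is the standard Python call that reports the logical CPU count on whatever machine runs the script.

```python
import os


def logical_cpus(sockets, cores_per_socket, threads_per_core):
    """Logical CPUs the OS sees: every hardware thread counts as one."""
    return sockets * cores_per_socket * threads_per_core


if __name__ == "__main__":
    # Assumed topology for the eight-core Mac Pro described above:
    # 2 sockets x 4 cores, with Hyper-Threading presenting 2 threads per core.
    print("Expected logical CPUs:", logical_cpus(2, 4, 2))
    # What the OS reports on the machine running this script.
    print("Reported by this machine:", os.cpu_count())
```

With Hyper-Threading switched off, threads_per_core drops to 1 and the same function gives the eight physical cores.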
Taking on faith that Apple's numbers are accurate -- after taking heat for past sins, they tend to be -- I'm left to wonder where Nehalem gets that extra performance. Perhaps it draws some from Hyper-Threading. Some of it unquestionably comes from DDR3 memory, the next step up from DDR2, which is the prevailing standard. AMD criticizes DDR3's higher latency, saying that comparing fast DDR2 to DDR3 is a wash. AMD asserts this while having both DDR2 and DDR3 on its near-term roadmap. Nehalem's NUMA (Non-Uniform Memory Access) architecture, which assigns independent banks of memory to each CPU, may counterbalance latency to some extent, the way it helps counterbalance lower cycle speeds on DDR2 with Opteron. With each CPU reading and writing its own bank at the same time, aggregate throughput can rise even when per-access latency does not improve. It's my view that the Nehalem CPU's on-chip memory controllers and NUMA probably make a bigger difference than the kick up to DDR3. However the magic is done, Apple is claiming a 2.4X rise in memory throughput. I'd like to see that.
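A throughput claim like that is at least something you can poke at yourself. What follows is a rough sketch, not Apple's or Intel's method: it times a plain buffer copy in one Python process and reports gigabytes moved per second. The function name copy_throughput_gbps, the 64MB buffer, and the repeat count are all my own assumptions. A single-threaded copy like this will not exercise both NUMA banks at once, so treat the result as a floor for comparing one machine to another, not as a measure of peak memory bandwidth.

```python
import time


def copy_throughput_gbps(size_bytes=64 * 1024 * 1024, repeats=5):
    """Return the best observed copy rate in GB/s for a buffer of size_bytes."""
    src = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst = bytes(src)  # one read pass and one write pass over the buffer
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
        del dst
    # Guard against a zero reading from the timer on very small buffers.
    best = max(best, 1e-9)
    # Count the bytes read plus the bytes written.
    return (2 * size_bytes) / best / 1e9


if __name__ == "__main__":
    print("Approximate copy throughput: %.2f GB/s" % copy_throughput_gbps())
```

Run it on the old and new machines with the same buffer size and compare the ratio rather than the absolute figures; interpreter overhead and cache effects color both numbers.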
Nehalem also marks the return of the Level 3 cache that was present on some NetBurst Xeon CPUs. Level 3 cache is shared by all cores on a chip. Intel has always been a big believer in (big) cache, but I think it had pushed Core 2 Duo's shared Level 2 cache as far as it could go. Three-level cache is the right idea, and it also cuts down substantially on the number of cache probes the system does to make sure one core doesn't have a different picture of memory than another. The last of the significant improvements is Turbo Boost. That's a technology to which I need to devote more study. By Intel's pitch, Turbo Boost senses when tasks ordinarily spread across cores can be handled by fewer cores, and it can then run those cores at a boosted clock speed. Apple tells me that it'll be hard to see this in action with user-level facilities like Activity Monitor. Fortunately, Apple and Intel supply tools that allow a closer look.
Nehalem begs for that closer look, and I'm just the guy to take it. I hope you'll come along.