Intel's Nehalem simply sizzles

In a range of tests, the new quad-core Xeon processor shows huge performance gains

Intel's new Nehalem Xeon CPUs, which are being introduced in countless one- and two-socket servers and workstations today, have already generated a lot of heat. While introducing the new processors to technical journalists in February, Nick Knupffer, Intel's global communications manager, boasted that "Nehalem represents the biggest performance jump we've made since the introduction of the Pentium Pro."

This claim was met with outright skepticism by nearly everyone in the room, and certainly by me. But after running a two-socket, eight-core Nehalem system in my lab for the past few weeks, it would appear that Knupffer is right. Intel has built a better mousetrap. And it used part of AMD's blueprints to do it.


Back when AMD's Opteron was ruling the performance roost, Intel was busy gluing two separate dies into a single package and calling it a multicore CPU. Memory bandwidth lagged due to the central off-die memory controller, and while the overall performance of the processor was acceptable, it lacked the NUMA (Non-Uniform Memory Access) punch that was the Opteron's claim to fame. Nehalem is based on a NUMA architecture, much like the Opteron, and its performance is miles ahead of anything else Intel has released to date. Color me impressed.

Inside Nehalem
The Nehalem chips (Xeon 3500 series for single-socket and Xeon 5500 series for two-socket systems) feature a quad-core layout with 731 million transistors, 256KB of L2 cache per core and a shared 8MB L3 cache -- deeper and faster caching than in previous Xeons -- and better branch prediction. Essentially, Nehalem blends the strengths of Intel's legacy Xeon processors with a fundamental architectural change: the incorporation of NUMA.

With NUMA, each CPU has its own integrated memory controller, which ties DIMM ranks to a specific CPU; in the Nehalem architecture, the sockets communicate over QuickPath links running at 6.4GT (gigatransfers) per second, or 25.6GBps per link. Due to this architecture change and the nature of DDR3 RAM, the RAM clock runs at 800MHz, 1,066MHz, or 1,333MHz. If each channel is populated with a single RDIMM (Registered DIMM), the highest speed of 1,333MHz is possible; as more RAM is added to those channels, the overall speed drops to 1,066MHz or 800MHz. Even so, with 4GB RDIMMs, a dual-socket system can run 24GB of RAM at 1,333MHz using only six RDIMMs. Using the Tylersburg chip set, it's possible to bring the RAM total up to 144GB -- 72GB per CPU -- running at 800MHz.
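For the curious, those figures can be sanity-checked with a little arithmetic. The sketch below (my own, using theoretical peak numbers derived from bus widths, not measured bandwidth) shows where they come from:

```python
# Back-of-the-envelope peak bandwidth figures for Nehalem's links.
# These are theoretical maxima, not measured throughput.

def ddr3_channel_gbps(mt_per_s):
    """A DDR3 channel moves 8 bytes (64 bits) per transfer."""
    return mt_per_s * 8 / 1000  # GB/s

def qpi_gbps(gt_per_s):
    """A QuickPath link moves 2 bytes per transfer in each direction."""
    return gt_per_s * 2 * 2  # GB/s, both directions combined

# One DDR3-1333 channel is ~10.7 GB/s; three channels per socket, ~32 GB/s.
print(ddr3_channel_gbps(1333) * 3)

# A 6.4 GT/s QuickPath link works out to 25.6 GB/s (12.8 GB/s each way).
print(qpi_gbps(6.4))
```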

There's more to Nehalem than just NUMA, however. A raft of supporting players also enters into the mix, including updated Virtualization Technology extensions to assist in virtualization use cases; support for DDR3 memory, which can provide double the data rate of DDR2; and SSE 4.2 instructions, a relatively minor update aimed at accelerating text processing. The significantly increased memory bandwidth is the major update, along with the advent of QuickPath, the new processor interconnect that replaces the aged front-side bus. But these additions are quite welcome and round out the package.

One of these new features is dubbed Turbo mode. You might recall the days of 8088-class PCs that plodded along at 4.77MHz unless the "Turbo" switch bumped them up to 8MHz or 10MHz. This isn't quite the same thing. The Turbo feature in Nehalem allows the CPU cores to burst to higher clock rates if the load requires it. Turbo adds what Intel calls "bins," each representing a 133MHz boost to a core, allowing certain cores to essentially overclock themselves on an as-needed basis.

Turbo sounds slightly gimmicky, but it can assist in single- and lightly threaded workloads, as it can only be utilized on a subset of physical cores. For instance, one or two cores might be able to allocate three additional bins, but several threads running concurrently might only be able to access a single bin on each of the four cores. All of this is dependent on the thermal and power health of the CPU at the time and is dynamically adjusted.
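In other words, the effective clock is simply the base clock plus the allocated bins. A quick sketch makes the arithmetic concrete (the bin counts here are illustrative, not Intel's actual per-model tables):

```python
# Illustrative sketch of how Turbo "bins" translate to clock rates.
# Bin counts are examples only, not Intel's published per-SKU limits.

BASE_MHZ = 3200  # e.g. a 3.2GHz Xeon W5580
BIN_MHZ = 133    # each Turbo bin adds one 133MHz step

def turbo_clock(bins):
    """Effective clock for a core granted the given number of bins."""
    return BASE_MHZ + bins * BIN_MHZ

# Lightly threaded: one busy core might be granted three bins.
print(turbo_clock(3))  # 3200 + 3 * 133 = 3599 MHz
# Heavily threaded: every core busy, perhaps one bin each.
print(turbo_clock(1))  # 3200 + 133 = 3333 MHz
```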

Whoa Nehlly!
All of these features add up to a significant performance boost. How significant? In many of my tests, Nehalem runs roughly twice as fast as Intel Xeon 5300-based platforms, and 50 percent faster than Intel Xeon 5400-based systems in single-threaded operations. It's fast.

For example, in preliminary testing I used an HP ProLiant DL580 with four quad-core Intel Xeon X7350 CPUs running at 2.93GHz per core as a baseline. The Nehalem system was running two quad-core Intel Xeon W5580 CPUs at 3.2GHz per core with HyperThreading enabled.

The tests I ran were mostly single-threaded with the exception of the MySQL InnoDB database performance tests. However, the single-threaded tests were run in batches of 16 simultaneous tasks -- thus, each test pass comprised 16 identical processes for each test scenario. The tests included LAME audio encoding, gzip and bzip2 compression, and MD5 sum tests of large files. Note that the X7350 system had 16 physical cores and the Nehalem test system had only eight, represented as 16 virtual processors via HyperThreading.
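For readers who want to replicate the batch structure, a hypothetical harness along these lines captures the idea: 16 identical CPU-bound jobs launched in parallel and timed as a group. This is an illustration of the methodology, not my actual test scripts:

```python
# Hypothetical sketch of the batch methodology: run 16 identical
# single-threaded, CPU-bound jobs in parallel and time the whole pass.
import hashlib
import time
from concurrent.futures import ProcessPoolExecutor

def md5_of_blob(_):
    """Stand-in for one job (e.g. an MD5 sum of a large file)."""
    return hashlib.md5(b"\x00" * (8 * 1024 * 1024)).hexdigest()

if __name__ == "__main__":
    start = time.time()
    with ProcessPoolExecutor(max_workers=16) as pool:
        digests = list(pool.map(md5_of_blob, range(16)))
    elapsed = time.time() - start
    print(f"16 jobs in {elapsed:.1f}s, all identical: {len(set(digests)) == 1}")
```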

Averaged across all tests, the Nehalem system was roughly 60 percent faster than the X7350-based server. For instance, the time required for the X7350 system to encode 16 identical 200MB WAV files to MP3 at 224Kbps was 77 seconds. The Nehalem system completed the task in 40 seconds. The gzip tests showed the X7350 compressing the 16 resulting MP3 files in 6 seconds, while the Nehalem system completed the task in 2 seconds. For a single-thread test, I converted a 27MB MPEG-4 file to FLV (Flash Video) with MEncoder. The X7350 took 43 seconds at roughly 100 frames per second; the Nehalem took 27 seconds at roughly 163 frames per second.

The MySQL tests I ran were based on InnoDB using the mysql-bench test suite. This test runs a large number of concurrent database operations, including Select, Delete, Update, Insert, and so forth. The X7350 system completed all three tests in a total of 833 seconds, while the Nehalem system finished in 713 seconds.
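Working back from those times, the implied speedups are easy to compute (the seconds below are taken from the results above):

```python
# Speedups implied by the reported test times (X7350 system vs. Nehalem).
# Times in seconds, copied from the results described in the article.
results = {
    "LAME encode (16x 200MB WAV)": (77, 40),
    "gzip (16 MP3 files)":         (6, 2),
    "MEncoder MPEG-4 to FLV":      (43, 27),
    "MySQL InnoDB (mysql-bench)":  (833, 713),
}

for name, (x7350_s, nehalem_s) in results.items():
    print(f"{name}: {x7350_s / nehalem_s:.2f}x faster")
```

The spread is worth noting: the embarrassingly parallel compression and encoding tests gain the most, while the lock-heavy database workload gains the least.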

More per core
Without a doubt, these numbers are hugely impressive, even if they are measured against a Tigerton-era chip. A dual-socket Nehalem system handily beats a four-socket X7350 system across the board. And the tests were run with 16 concurrent single-threaded processes, so while the X7350 used one physical core per process, the Nehalem, using HyperThreading, ran two processes per physical core.

Even more impressive, while the X7350 server was equipped with a hardware RAID0 set of four 15,000-rpm SAS drives and doing nothing other than running the test scenarios, the Nehalem system ran four SATA drives in a software RAID5 array -- and serving double-duty as my workstation. At the same time the Nehalem was executing my battery of tests, it was driving a 30-inch and a 24-inch monitor off an Nvidia Quadro FX 5500, playing an MPEG movie in full-screen on the 30-inch monitor, and running more than 500 processes across four virtual desktops, including dozens of terminal sessions, Firefox browser sessions, Java applications, and streaming audio -- and it still put up these numbers.

I also had an opportunity to run a dual-socket 2.93GHz Xeon X5570 Nehalem system through a different suite of tests. This scenario comprised FPGA (field-programmable gate array) synthesis via tools like Synplicity's Synplify Pro, which are used to build and test FPGA and ASIC designs; full synthesis and mapping runs can take hours or days to complete. Prior to the introduction of the Nehalem system, one specific simulation took just over seven hours to complete when run on a dual-socket, 2.66GHz Xeon X5355. The Xeon X5570 running at 2.93GHz finished in 3.5 hours -- half the time. The potential for the raw power of the Nehalem chips to accelerate the pace of development in this arena cannot be overstated.

As far as power consumption goes, 2cpu.com's Micah Schmidt put it this way: "In identically configured Supermicro workstations, the Nehalem-based Xeon W5580 system draws an average of 70 watts less than the Harpertown-based Xeon X5492 system at idle. Coupled with the additional performance of the new processors, the performance-per-watt difference is huge."

Fasten your seatbelts
Going forward, the raw power of the Nehalem Xeon will accelerate everything it touches, from ASIC design to automobile design to weather simulations to global data models. Heavy data-intensive applications that used to take days might now take hours. Those that took hours might now take minutes. Nehalem will step up the pace with which we develop every modern technology, from cell phones to microwaves. Rendering computer-generated imagery for movies will require far less time. Fully animated movies will be cheaper to produce, and the computer-animated models will be far more realistic due to reduced overhead.

This is true of every advancement in core processing technology, but this one is bigger than most, and it comes at a time when sophisticated modeling and design calculations are becoming more of a reality than ever before. Essentially, processes and procedures that were simply too complex and time-intensive even a few months ago are now completely feasible.

Nehalem isn't just a newer, faster chip -- it's a game-changing development in microprocessor technology. It's also likely a direct result of the time just a few years ago when AMD was busy eating Intel's 64-bit lunch. One might wonder what impetus Intel would have to continue this development trend without significant competition. People run faster with a wolf nipping at their heels. Without that motivation, perhaps a leisurely stroll would be the order of the day.

We should all hope that AMD will continue to provide the push Intel needs, and will soon offer a chip whose performance compares to Nehalem's. That said, the primary reason behind Nehalem's big boost is that Intel finally integrated the memory controller into the CPU, an advantage that was once the hallmark of the Opteron -- and that trick can be played only once. Intel's next step -- shrinking Nehalem to a 32-nanometer process with Westmere -- won't have a comparable architectural windfall to draw on.

Whatever the reasons and machinations behind the development of the Nehalem chip, and regardless of what the future will bring, the raw power Nehalem represents is simply stunning.

Copyright © 2009 IDG Communications, Inc.