InfoWorld review: Intel Xeon Nehalem-EX lives large
Intel's new Nehalem-EX CPU for SMP servers brings eight cores, massive memory support, mainframe-like RAS features, and huge performance gains to large-scale workloads
The tests I ran are based on common operations found in many applications. The LAME tests convert a 152MB WAV file to MP3 at a 256Kbps bit rate. The compression tests use gzip and bzip2 to compress and uncompress a 55MB MP3 file. The MD5 tests calculate MD5 sums on 152MB files, and the MP4-to-FLV tests transcode a 24MB MP4 file to FLV. Each test is single-threaded, but I ran multiple copies at once, at increasing levels of concurrency, to stress physical and logical cores, memory bandwidth, and memory interconnects, as well as disk I/O.
On the Nehalem-EX, I ran these tests with Hyper-Threading enabled and disabled. For comparison, I'll reference the results with Hyper-Threading disabled so that the figures represent the same number of logical CPUs. All tests were run on CentOS 5.4. The reported figures were drawn from tests run from ramdisk to keep disk I/O from becoming a bottleneck.
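The review does not publish its test harness, so the following is only a minimal sketch of the general approach described above: start N copies of a single-threaded command at the same time and measure the wall-clock time until the last one finishes. The function names `run_concurrent` and `sweep`, the `LEVELS` list (limited to the concurrency levels named in the text), and the placeholder command are all assumptions for illustration. A real run would substitute the actual LAME, gzip, bzip2, MD5, or transcoding command, with its working files on a ramdisk.

```python
import subprocess
import sys
import time

# Concurrency levels named in the review text; the full set tested is not listed.
LEVELS = [8, 16, 48, 64, 96]


def run_concurrent(cmd, concurrency):
    """Start `concurrency` copies of `cmd` at once and return the
    wall-clock seconds elapsed until every copy has exited."""
    start = time.perf_counter()
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        for _ in range(concurrency)
    ]
    exit_codes = [p.wait() for p in procs]
    elapsed = time.perf_counter() - start
    if any(code != 0 for code in exit_codes):
        raise RuntimeError(f"at least one copy of {cmd!r} failed")
    return elapsed


def sweep(cmd, levels):
    """Run the same command at each concurrency level; map level -> seconds."""
    return {n: run_concurrent(cmd, n) for n in levels}


if __name__ == "__main__":
    # Placeholder workload (hypothetical): a short CPU-bound Python one-liner.
    # Small levels keep the demo quick; pass LEVELS for a sweep like the review's.
    placeholder = [sys.executable, "-c", "sum(range(200000))"]
    for level, seconds in sweep(placeholder, [2, 4]).items():
        print(f"{level:>3} concurrent processes: {seconds:.2f} s")
```

Timing the whole batch rather than each process matches the way the results are reported here, as the time for a system to complete a test at a given concurrency level.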
The results start out somewhat unimpressively. With eight concurrent processes, the four X7350 CPUs in the DL580 were evenly matched against the two Nehalem-EX CPUs in the R810 in the LAME and gzip tests, but were significantly behind in the other tests. At a concurrency level of 16, the gap widened substantially: the older system was slightly ahead of the Nehalem-EX in the LAME and gzip tests, but ran far behind in the remainder. Once the testing started to significantly oversubscribe the number of logical CPUs on each server, the Nehalem-EX pulled well into the lead and stayed there across all tests.
In fact, I ran many test passes at the 48, 64, and 96 concurrent process levels to verify the results because the performance differences were so huge. For example, at 64 concurrent processes, it took 2 minutes, 12 seconds for the two-CPU Nehalem-EX system to complete the MP4-to-FLV test. The four-CPU X7350 system took over 30 minutes to complete the same task. That's a massive performance difference. The performance delta between the two servers only grew wider as the concurrency increased. Not only was I able to ramp the Nehalem-EX up to 768 concurrent processes, but it was still running the tests about 50 percent faster than the X7350 could run 64 concurrent processes.
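To put the 64-process MP4-to-FLV figures in proportion, the two quoted times can be converted to seconds and divided. Because the X7350 time is given only as "over 30 minutes," the result is a lower bound rather than an exact ratio. The helper name `to_seconds` is an assumption for illustration.

```python
def to_seconds(minutes, seconds=0):
    """Convert a minutes/seconds pair to total seconds."""
    return minutes * 60 + seconds


nehalem_ex = to_seconds(2, 12)   # 2 minutes, 12 seconds on the two-CPU Nehalem-EX
x7350_floor = to_seconds(30)     # "over 30 minutes" on the four-CPU X7350, so a floor

# Lower bound on how many times longer the X7350 system took.
min_ratio = x7350_floor / nehalem_ex

print(f"Nehalem-EX: {nehalem_ex} s, X7350: more than {x7350_floor} s")
print(f"X7350 took at least {min_ratio:.1f}x as long")
```

On these numbers the older system needed at least roughly 13.6 times as long to finish the same job.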
This extreme performance increase has several causes. The older X7350 system might have had two additional CPUs and a 670MHz clock rate bump per core, but it only had 4MB of L3 cache compared to the 24MB L3 cache on the Nehalem-EX. The X7350 also lacked the benefit of QuickPath, and the memory bus became a bottleneck. Thus, in the heavier workload tests, the Nehalem-EX blew the X7350 out of the water, even with a reduced clock rate per core and the same number of cores. In the lighter workloads, the difference was not nearly as significant.
LAME MP3 audio conversion tests, 8 to 96 concurrent processes (times in seconds)
MP4 to FLV transcoding tests, 8 to 96 concurrent processes (times in seconds)