To paraphrase Benjamin Disraeli, there are lies, damn lies, and benchmarks. Synthetic CPU tests can provide numbers pointing one way or the other and can help unmask weaknesses and strengths in chip architectures, but what matters most is how the processors perform in production.
Last month, I used real-world, application-based benchmarks to test a Dell PowerEdge server sporting the spanking new Xeon with EM64T (Extended Memory 64 Technology), otherwise known as Intel’s answer to AMD’s Opteron. But the Intel processor was so fresh, we had to stick with 32-bit benchmarks and postpone tests of the chip’s 64-bit x86 capabilities, which are based on, but not identical to, AMD’s x86-64 standard.
Well, the wait is over. We’ve polished our 64-bit, app-based benchmarks for EM64T and let ’em rip. The results confirm what Test Center Technical Director Tom Yager asserted in early August: AMD has the first-mover advantage in the x86-64 space, and Intel has some catching up to do. Whereas the EM64T Xeon is no slouch, the Opteron reigns supreme.
Battle of the 64-bit engines
I settled on some simple, real-world benchmarks to meet the goal of measuring production performance. And because the objective was to compare the 64-bit performance of the two processors, only 64-bit code was used throughout.
I made every effort to maintain as much similarity as possible between the test systems. In Intel’s corner, I tested a Dell PowerEdge 2800 with dual 3.6GHz Xeon EM64T CPUs, 4GB of RAM, and 36GB U320 SCSI drives. On the AMD side, I tested a Newisys 2100 with dual 2.4GHz Opteron 250 CPUs, 4GB of RAM, and 36GB U320 SCSI drives. The client generation system used for the external tests was a Compaq ProLiant ML370 running Red Hat AS 3.0 with two 32-bit, 2.8GHz Xeon processors and 4GB of RAM. All systems were connected via gigabit copper on a flat network.
Linux was the obvious operating system choice for both the Xeon EM64T and Opteron systems. Although Microsoft has yet to ship a 64-bit server platform, Linux has been there for years. On both systems, the base distribution was Red Hat Advanced Server 3.0, running the latest Red Hat kernel, 2.4.21-15.EL. Red Hat’s AS 3.0 U3 for x86-64 was installed and updated with all updates available at the time of the tests.
The benchmarks themselves included MySQL 3.23.58 performance tests run with MySQL’s sql-bench tool, and static and dynamic Web serving tests via Apache 2.0.46. In a nod to those concerned with HPC performance, I also ran Linpack benchmarks via HPL (High Performance Linpack).
The test results were conclusive. In every real-world test, the Opteron 250-based Newisys server bested the EM64T Xeon server, despite the fact that the latter had a faster clock speed. For years, Intel has emphasized clock speed, implying that a 3.6GHz CPU will best a 2.4GHz CPU without fail. In these tests, the 2.4GHz Opteron system beat the 3.6GHz EM64T Xeon system across the board. Clock, it seems, isn’t everything.
Performance highs and lows
Interesting performance differences emerged in nearly all phases of testing. Let’s start with the Web serving benchmarks. Here, I ran Apache’s ab benchmarking tool against a 100.5k static page. I also ran tests on a CGI script written in Perl pulling data from a MySQL database, displaying a table containing 210 rows of data selected from a database with 3,500 rows of 10 columns. The static page tests showed the Opteron easily in the lead with a 21 percent performance delta over the EM64T, with the dynamic tests showing a 25 percent performance edge.
In real numbers, this means that the dual Opteron system served more than 700 requests per second more than the dual EM64T system on the static Web benchmark, and 10 requests per second more on the dynamic tests. These performance advantages shown by the Opteron are not to be ignored.
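The percentage deltas and the absolute gaps quoted above are two views of the same measurements, and the arithmetic connecting them can be sketched in a few lines. The function names here are hypothetical, and the implied baselines are rough back-of-the-envelope figures derived from the rounded numbers in the text, not measured results.

```python
def percent_delta(faster, slower):
    """Percentage by which `faster` exceeds `slower` (same units for both)."""
    return (faster - slower) / slower * 100.0


def implied_baseline(absolute_gap, pct_delta):
    """Slower system's rate implied by an absolute gap and a percent delta."""
    return absolute_gap / (pct_delta / 100.0)


if __name__ == "__main__":
    # Inputs are the rounded figures quoted in the text: a gap of roughly
    # 700 requests/sec at 21 percent (static) and 10 requests/sec at
    # 25 percent (dynamic). The outputs are estimates only.
    for label, gap, pct in [("static", 700, 21), ("dynamic", 10, 25)]:
        base = implied_baseline(gap, pct)
        print(f"{label}: slower system handled roughly {base:.0f} requests/sec")
```

Read this way, the 10-requests-per-second edge on the dynamic test is a larger relative lead than the 700-requests-per-second edge on the static test, because the dynamic workload runs at a far lower request rate to begin with.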
The MySQL tests also showcased the Opteron’s performance. When I ran the MySQL tests, I saw a 26 percent average performance delta, with the Opteron system finishing almost nine minutes ahead of the EM64T system. This test relies somewhat more on disk I/O than the Web tests, but the two servers were nearly even on disk I/O performance, so the numbers reflected here are good indicators.
Where the Intel chip showed some muscle was in the HPL tests. Because optimized EM64T BLAS (Basic Linear Algebra Subroutines) libraries weren’t available when I began testing, I contacted Kazushige Goto, a member of the Texas Advanced Computing Center at the University of Texas at Austin. Goto is known for his work in optimized BLAS libraries for HPC, maintaining libraries for several processors, including the Opteron, PowerPC 970, and the Xeon. Goto had started work on an optimized library for the EM64T processors, but needed time on the newer 3.6GHz CPUs, as well as dual CPU systems. With access to the EM64T system in my lab, and many e-mails, Goto has released an optimized BLAS library for the EM64T CPU (he ran the HPL tests himself on the hardware in my lab).
The Xeon EM64T system turned in high floating-point numbers -- as much as 44 percent higher than those produced by the Opteron. But that’s not the end of the story. As Goto says, “High scores on the HPL benchmark do not mean ‘high performance computing.’” The DGEMM routines in HPL can hide long cache latency, which is a problem on EM64T processors.
A potentially more important number gleaned from testing relates to CPU efficiency. The Opteron and EM64T CPUs were nearly identical in single-CPU tests, with the Opteron showing an edge with 89.9 percent of peak, and the EM64T coming close with 88.3 percent efficiency. The real story came with the dual-CPU tests. The Opteron hit 88.8 percent of peak performance, and the EM64T fell to 84.8 percent. Thus, in an HPC environment, the long latency of the EM64T’s L2 cache will be a liability; the NUMA architecture of the Opteron will be a distinct benefit.
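For readers unfamiliar with the "percent of peak" figure, here is a minimal sketch of how HPL efficiency is conventionally worked out: measured throughput divided by theoretical peak, where peak is clock speed times floating-point operations per cycle times CPU count. The function names are hypothetical, and the flops-per-cycle value is left as a parameter because it is not stated in the text. Only the four efficiency percentages come from the tests above.

```python
def theoretical_peak_gflops(clock_ghz, flops_per_cycle, n_cpus):
    """Theoretical peak in GFLOPS. flops_per_cycle is an input you must
    supply for the chip in question; it is not given in the article."""
    return clock_ghz * flops_per_cycle * n_cpus


def hpl_efficiency_pct(measured_gflops, peak_gflops):
    """Measured HPL throughput as a percentage of theoretical peak."""
    return measured_gflops / peak_gflops * 100.0


def scaling_loss_points(single_cpu_pct, dual_cpu_pct):
    """Percentage points of efficiency lost going from one CPU to two."""
    return single_cpu_pct - dual_cpu_pct


# (single-CPU, dual-CPU) efficiency percentages as reported above.
REPORTED = {
    "Opteron 250": (89.9, 88.8),
    "Xeon EM64T": (88.3, 84.8),
}

if __name__ == "__main__":
    for chip, (single, dual) in REPORTED.items():
        loss = scaling_loss_points(single, dual)
        print(f"{chip}: {loss:.1f} points lost from one CPU to two")
```

On the reported figures, the Opteron gives up about 1.1 points when the second CPU is added and the EM64T about 3.5, which is the scaling gap the paragraph above describes.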
Follow the leader
Intel is accustomed to being the leader in commodity processing. Following AMD’s lead must grate on Intel, particularly when the Itanium has been heavily marketed as the answer to ubiquitous 64-bit computing. There’s little doubt that the IA64 instruction set of the Itanium is superior to the x86-64 instruction set developed by AMD, but that’s only part of the story.
Compatibility and price generally win over performance. That’s why you won’t see a Formula One racing car next to you at the stoplight. With Intel playing both the IA64 and x86-64 fields, and AMD holding on to the first-mover advantage in x86-64, this will be an interesting fight. Intel is in the unfortunate position of marketing two opposing 64-bit processors, and capitulation to AMD with the EM64T chip will undoubtedly hurt Itanium sales, at least for the commodity server market.
Still in the race, yet not included in these tests, are the other 64-bit CPUs available today. There are quite a few, from Sun’s all-but-deceased SPARC to the PowerPC 970 processor from IBM, better known as Apple’s G5 CPU. We wouldn’t run tests against the SPARC (Sun has acknowledged that it’s moving to Opteron for servers in this class), but the G5 is fair game. Apple hasn’t refreshed the Xserve line with the new 2.5GHz G5s as of yet, but the PowerMac workstations are shipping with the new CPUs. As soon as we get a solid representation of the new G5, we’ll be running the same benchmarks, ideally on OS X as well as on Linux for the PPC.
Given the new, built-in migration path from 32-bit to 64-bit computing, expect to see 32-bit x86-based CPU prices fall into the basement -- and watch as hardware vendors relegate today’s top-end Xeon and P4 processors to their ultra-low-end servers. Meanwhile, the midrange servers will ship with x86-64-based processors.
Many vendors were sluggish to embrace AMD’s Opteron following its release, primarily due to their reluctance to disrupt relations with Intel. Recently this has changed, with such large enterprise server vendors as Hewlett-Packard and IBM actively developing and selling Opteron-based server platforms. In fact, Dell is the only major enterprise-class server vendor that has no current Opteron-based server offering.
Both Intel and AMD will be pushing hard for dominance of the new x86-64 market. AMD may be in the lead at the moment, but the race is far from over. The result will be “average” servers that handle much more than the current crop of 32-bit machines. When developers begin writing exclusively for the x86-64 instruction set, the true benefit of these chips will emerge. But with x86 compatibility a reality today, there’s little reason to wait.