Interesting performance differences emerged in nearly all phases of testing. Let’s start with the Web serving benchmarks. Here, I ran Apache’s ab benchmarking tool against a 100.5KB static page. I also ran tests on a CGI script written in Perl that pulled data from a MySQL database and displayed a table containing 210 rows of data selected from a database with 3,500 rows of 10 columns. The static page tests showed the Opteron easily in the lead with a 21 percent performance delta over the EM64T, and the dynamic tests showed a 25 percent performance edge.
In real numbers, this means that the dual Opteron system outpaced the dual EM64T system by more than 700 requests per second on the static Web benchmark, and by 10 requests per second on the dynamic tests. These performance advantages shown by the Opteron are not to be ignored.
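As a rough back-of-the-envelope check, the percentage deltas and the absolute gaps quoted above can be combined to estimate each system's throughput. The sketch below assumes each percentage is measured relative to the EM64T result and treats "more than 700" as exactly 700; the function name is purely illustrative, and the outputs are estimates rather than measured figures.

```python
def implied_throughput(gap_rps, delta_pct):
    """Estimate (slower, faster) requests/sec from an absolute gap and a
    percentage lead. Assumes the lead is expressed relative to the slower
    system, which the article does not state explicitly."""
    slower = gap_rps / (delta_pct / 100.0)
    faster = slower + gap_rps
    return slower, faster

# Static page: ~700 req/s gap at a 21 percent delta (700 is a lower bound).
static_em64t, static_opteron = implied_throughput(700, 21)

# Dynamic CGI page: 10 req/s gap at a 25 percent delta.
dynamic_em64t, dynamic_opteron = implied_throughput(10, 25)

print(f"static:  EM64T ~{static_em64t:.0f} req/s, Opteron ~{static_opteron:.0f} req/s")
print(f"dynamic: EM64T ~{dynamic_em64t:.0f} req/s, Opteron ~{dynamic_opteron:.0f} req/s")
```

Under those assumptions, the static test works out to a few thousand requests per second per system, while the database-backed CGI test runs at a few dozen, which is why a 10-request gap there still amounts to a 25 percent lead.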
The MySQL tests also showcased the Opteron’s performance. Here I saw a 26 percent average performance delta, with the Opteron system finishing almost nine minutes ahead of the EM64T system. This test relies somewhat more on disk I/O than the Web tests, but the two servers were nearly even on disk I/O performance, so the numbers reflected here are good indicators.
Where the Intel chip showed some muscle was in the HPL tests. Because optimized EM64T BLAS (Basic Linear Algebra Subprograms) libraries weren’t available when I began testing, I contacted Kazushige Goto, a member of the Texas Advanced Computing Center at the University of Texas in Austin. Goto is known for his work on optimized BLAS libraries for HPC, maintaining libraries for several processors, including the Opteron, PowerPC 970, and the Xeon. Goto had started work on an optimized library for the EM64T processors, but needed time on the newer 3.6GHz CPUs, as well as on dual-CPU systems. After getting access to the EM64T system in my lab, and after many e-mails, Goto has released an optimized BLAS library for the EM64T CPU (he ran the HPL tests himself on the hardware in my lab).
The Xeon EM64T system turned in high floating-point numbers, as much as 44 percent higher than those produced by the Opteron. But that’s not the end of the story. As Goto says, “High scores on the HPL benchmark do not mean ‘high performance computing.’” The DGEMM routines used by HPL can hide long cache latency, which is a problem on EM64T processors.
A potentially more important number gleaned from testing relates to CPU efficiency. The Opteron and EM64T CPUs were nearly identical in single-CPU tests, with the Opteron showing an edge with 89.9 percent of peak, and the EM64T coming close with 88.3 percent efficiency. The real story came with the dual-CPU tests. The Opteron hit 88.8 percent of peak performance, and the EM64T fell to 84.8 percent. Thus, in an HPC environment, the long latency of the EM64T’s L2 cache will be a liability; the NUMA architecture of the Opteron will be a distinct benefit.
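To make the efficiency figures concrete: HPL efficiency is conventionally the measured floating-point rate divided by the theoretical peak, where peak is clock speed times floating-point operations per cycle times the number of CPUs. The sketch below is illustrative only. The function names are invented for this example, the value of two double-precision operations per cycle is an assumption, and the only test data used are the percentages reported above.

```python
def percent_of_peak(measured_gflops, clock_ghz, flops_per_cycle, n_cpus):
    """Efficiency as a percentage of theoretical peak.
    flops_per_cycle is an assumption supplied by the caller
    (for example, 2 for double precision), not a figure from the article."""
    peak_gflops = clock_ghz * flops_per_cycle * n_cpus
    return 100.0 * measured_gflops / peak_gflops

def efficiency_drop(single_pct, dual_pct):
    """Percentage points of efficiency lost going from one CPU to two."""
    return round(single_pct - dual_pct, 1)

# Efficiency figures as reported in the text: (single-CPU, dual-CPU).
reported = {
    "Opteron": (89.9, 88.8),
    "EM64T": (88.3, 84.8),
}

for name, (single, dual) in reported.items():
    print(f"{name}: loses {efficiency_drop(single, dual)} points with a second CPU")

# Sanity check of the formula with a hypothetical measurement: a single
# 3.6GHz CPU sustaining its full assumed peak of 7.2 GFLOPS is 100 percent.
print(percent_of_peak(7.2, 3.6, 2, 1))
```

Seen this way, the EM64T gives up roughly three times as many efficiency points as the Opteron when the second CPU is added, which is the scaling penalty the dual-CPU results point to.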
Follow the leader
Intel is accustomed to being the leader in commodity processing. Following AMD’s lead must grate on Intel, particularly when the Itanium has been heavily marketed as the answer to ubiquitous 64-bit computing. There’s little doubt that the IA64 instruction set of the Itanium is superior to the x86-64 instruction set developed by AMD, but that’s only part of the story.