IT is rediscovering a simple but nearly forgotten principle: Throughput and capacity are everything. It hardly matters how fast the processor is if, like a Ferrari in city traffic, it bogs down every time it has to reach out to memory, peripherals, and other CPUs. A Xeon server running a static workload (that is, a predictable set of apps with stable resource requirements) is an unbeatable solution for the money. But today’s dynamic workloads -- from grid applications to real-time BI to virtualized resource partitioning -- force the 32-bit Xeon to spend as much time moving data as it does crunching data. It is optimized purely for the latter.
For dynamic workloads, IT has traditionally turned to 64-bit Unix servers with fast I/O and lots of expandability. Yet those servers’ enormous cost, trailing compute performance, and never-ending maintenance needs fueled the migration to Xeon in the first place. The ideal solution would be big-iron-like throughput and capacity without the sacrifice of Xeon-like compute performance and affordability -- something between cheap 32-bit PC servers and 64-bit IBM Power, Intel Itanium, or Sun Sparc machines. If low cost could be complemented by backward compatibility and smaller form factors, so much the better.
That sweet spot has been filled. Two emerging 64-bit platforms, one built on AMD’s Opteron and the other on IBM’s PowerPC 970FX, have stepped into the breach. Needless to say, we couldn’t wait to unpack two of the first systems based on these chips and see what 64-bit computing on a Xeon budget felt like. Our first test systems were a dual-Opteron reference server from AMD built on an MSI motherboard and a dual-processor Xserve G5 from Apple.
The value of the Opteron and PowerPC 970FX platforms reaches deeper than the processor alone. Operating systems, buses, and chipsets all play significant roles. But CPU architecture remains the primary differentiator in this new class of 64-bit systems.
AMD’s Opteron is the server chip in a lineup of 64-bit, Pentium-compatible processors that includes the desktop and mobile Athlon 64 and the groundbreaking Athlon 64 FX-53 for high-performance workstations. Of these, only Opteron supports multiprocessor configurations. Opteron’s magic is its integration of north bridge functionality -- memory and CPU communication -- into the chip itself.
The Xserve G5’s PowerPC 970FX is the latest product of the partnership among Apple, IBM, and Motorola. IBM contributed the core of its Power4 64-bit enterprise CPU to the PowerPC 970FX. Apple handled the rest of the Xserve G5’s system design, including the buses, north bridge logic, south bridge logic, and system (health monitoring) controller.
Both systems were tested as configured by the vendors, except that we bumped up both machines to 4GB of RAM. Standard features in both servers included dual Gigabit Ethernet ports, removable hard drive trays, basic VGA cards, and USB ports. Both machines came with server management software and could monitor multiple servers. The Opteron server was capable of lights-out management -- that is, remote control of power, configuration, and diagnostics while the server’s power is turned off.
You can’t break up this act
Both Opteron and PowerPC 970FX (we’ll use the simpler Apple G5 name for the duration of this article) can run any mix of 32-bit and 64-bit applications simultaneously. Of course, ideal Opteron or G5 performance is achieved when the CPU boots into 64-bit mode and all applications are compiled for 64-bit operation. But pure 64-bit operation is not yet practical for most of us. Tens of thousands of commercial 32-bit applications must be recompiled and validated.
Microsoft was unable to deliver a final release version of 64-bit Windows Server 2003 by press time. To test the Opteron, we used SuSE Enterprise Server 9.0 Linux loaded and configured by AMD. This distribution runs a 64-bit kernel and 64-bit commands, and its development tools take full advantage of Opteron. That gave us the uncommon advantage of running in a pure, or nearly pure, 64-bit environment. Opteron-enabled code has been mainstreamed into Linux and should be part of most major distributions.
The Xserve G5 ran OS X Server Version 10.3.4. This is a 32-bit operating system that puts the G5 CPU in a bridge mode that permits access to some of the G5’s 64-bit facilities from 32-bit code. Apple’s Xcode development tools, which use an Apple-enhanced GNU Compiler Collection as a back end, will optimize for G5 to such an extreme that the resulting executable won’t run on older Macs. But Apple can’t do this for the kernel without forking it into 32-bit and 64-bit releases, which is a huge task.
Both test systems booted and functioned perfectly right out of the box. We set about compiling a handful of open source tests and found, as we’d hoped, that every project we compiled for SuSE Linux on Opteron also compiled, unmodified, for OS X Server.
Our performance testing focused on throughput. Memory throughput for both systems, as measured by the memory copy portion of the STREAM benchmark, was comparable at about 2GBps. That isn’t the speed of the channel between memory and the CPUs; it’s merely the speed at which STREAM was able to complete the relocation of a block of data from one place in memory to another. As we added more parallel test processes to both machines, we saw the unavoidable reduction in throughput. The G5’s memory throughput degraded in roughly a straight line, falling by half each time the number of parallel processes was doubled. On the other hand, Opteron had an amazing capability of scaling its memory throughput under increasing load. With eight parallel STREAM processes hitting both Opteron processors, memory throughput rarely fell below 1.2GBps. That speaks to the strength of Opteron’s on-CPU memory controllers and the SuSE Enterprise Linux pure 64-bit OS. However, G5’s memory copy throughput of around 1GBps is nothing to be ashamed of.
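For readers who want a feel for what a memory copy measurement involves, here is a minimal Python sketch. It is not the STREAM benchmark and it is not the code we ran; the function name `copy_throughput_gbps`, the block size, and the repeat count are all assumptions chosen for illustration. It times the relocation of a block of data from one buffer to another and, in the spirit of STREAM’s Copy kernel, counts both the bytes read and the bytes written.

```python
import time


def copy_throughput_gbps(block_bytes=64 * 1024 * 1024, repeats=5):
    """Return the best observed copy rate in GB/s (1 GB = 1e9 bytes).

    Illustrative only: an interpreter-level copy, not the STREAM benchmark.
    """
    src = bytearray(block_bytes)
    dst = bytearray(block_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # relocate the whole block from one buffer to the other
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    best = max(best, 1e-9)  # guard against a zero reading on a coarse timer
    # Count bytes read plus bytes written, as a copy test conventionally does.
    return (2 * block_bytes) / best / 1e9


if __name__ == "__main__":
    print(f"copy throughput: {copy_throughput_gbps():.2f} GB/s")
```

As with the figures above, the number this reports is the rate at which the copy completed, not the speed of the channel between memory and the CPUs. Running several copies of the script at once gives a rough sense of how throughput behaves under parallel load.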
The G5, for its part, has a strong edge in peripheral I/O performance. Apple’s custom I/O controller moved simultaneous data requests very smoothly among devices, regardless of the compute load or the busyness of other peripherals. Opteron, by contrast, showed the pattern of I/O performance degradation under rising compute and/or I/O load that we have come to expect from PC servers. Apple has the advantage of having far less backward-compatibility baggage than AMD, and it’s obvious to us that Apple’s design priority for its architecture was peripheral throughput.
The ability to run 32-bit apps without performance loss and support for both Linux and Unix eliminate the usual penalties suffered by migrating adopters. You can replace a dual Xeon server with a dual Opteron without installed applications -- including Windows enterprise applications -- skipping a beat. Likewise, the Xserve G5 will take over for any Apple G4-based server (including previous generations of Xserve and Power Mac G4) and will run nearly all open source Unix, Linux, and BSD applications.
Both architectures make exceptional Java application servers, with Sun crafting Opteron editions of Java for Linux and Solaris. Apple’s engineers do their own Java work and have built optimized desktop and server editions of Java, along with Apple’s uniquely simple and scalable WebObjects server software.
Commercial software vendors, however, are still deciding whether entry-level 64-bit technology is a wave to ride the crest of now or paddle behind later. Opteron has already gotten a critical mass of major software sign-ons to support its 64-bit extensions to the x86 architecture. But with 100 percent Intel compatibility, vendors don’t need to rush migration, because users can continue to run their 32-bit software indefinitely.
Apple has a much tougher road ahead. It got a comparatively late start in the enterprise space. True, Apple may be able to pull developers in from other platforms (including Windows) with its free dev tools, BSD Unix base, and Cocoa client application framework. But if the company really expects software vendors to make the leap, it will need more than an attractive new 64-bit platform as bait. It must build on its nascent enterprise marketing initiative and do lots of Microsoft-style proselytizing.
Safety glasses, please
For those who like to look under the hood, a few low-level facts are relevant when evaluating the platforms. Let’s begin by spelling out some of the advantages that both of these true 64-bit platforms have over Xeon.
First, despite all that follows regarding significant architectural changes, the compatibility promised by AMD and IBM (and Apple by extension) appears to be dead on. By default, both CPUs boot as 32-bit processors with operating systems and applications that are indistinguishable from 32-bit predecessors.
In Opteron’s case, compute performance in 32-bit legacy mode (see “Protecting Your 32-Bit Investment,” page 49) is comparable to AMD’s 32-bit Athlon MP. Even in legacy mode, Opteron still has on-board memory and multiprocessor buses. It still has a very fast HyperTransport link to the south bridge I/O controller, which in turn can move data from the CPU to expansion cards very quickly. And it still has the NUMA (non-uniform memory access) architecture that streamlines multiprocessing performance and increases the RAM limit to 4GB per CPU.
So from a high altitude, it appears that 32-bit legacy mode turns an Opteron into a Xeon. But that mode switch does not markedly affect Opteron’s throughput performance.
Likewise, the G5 masquerades as its predecessor, the G4, when it boots into native 32-bit mode. And as with Opteron, the G5’s advanced, throughput-tuned I/O architecture still operates with 32-bit OSes and apps. But the G5 is a massive reworking of the PowerPC core, so there are new instructions and a revamped internal execution design that extend more of a performance kick to 32-bit applications than Opteron can. Recompiled with optimizations for the unique features of the G5, 32-bit software gets a marked boost in both compute power and I/O throughput.
There is no easy way to access more than the 4GB of RAM that all 32-bit systems can read and write directly. Xeon carves memory above the 4GB line into segments that can be paged in and out of a reserved area within that 4GB space. Opteron’s capability of attaching 4GB of RAM to each CPU creates a potentially more elegant solution: If the OS can rearrange processes to run on the CPU that has the most free RAM, very little paging is needed. But Opteron can use the paging scheme, too, and memory attached to each processor is available to all the others.
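The placement idea described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any real operating system scheduler works: the function `pick_node` and the dictionary of free memory per CPU are hypothetical names invented for this illustration. The sketch also spells out where the 4GB line comes from: a 32-bit address can name 2^32 distinct bytes.

```python
GIB = 2 ** 30
# Bytes that a 32-bit address can reach directly: 2**32 = 4 GiB.
ADDRESS_LIMIT_32BIT = 2 ** 32


def pick_node(free_bytes_by_node, needed_bytes):
    """Return the id of the CPU node with the most free RAM.

    Returns None if no node is known or the roomiest node cannot hold
    the request. Hypothetical helper for illustration only.
    """
    if not free_bytes_by_node:
        return None
    node, free = max(free_bytes_by_node.items(), key=lambda kv: kv[1])
    return node if free >= needed_bytes else None


if __name__ == "__main__":
    # Two CPUs, each with its own attached memory; CPU 1 has more room.
    free = {0: 1 * GIB, 1: 3 * GIB}
    print(ADDRESS_LIMIT_32BIT // GIB)  # 4
    print(pick_node(free, 2 * GIB))    # 1
```

When a process fits in the memory attached to the CPU it runs on, it never has to reach across to another processor’s RAM or fall back on paging, which is the appeal of the approach.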
Pulling past Xeon
Intel is right when it characterizes Opteron as having few unique advantages when it comes to increased physical memory. However, Intel would have to ditch its road map and redesign Xeon’s CPUs and bus to shift their emphasis from clock speed to throughput. Among affordable servers, including, to our knowledge, the configurations of as many as eight CPUs that Opteron handles without intervening controllers, no other CPU and bus architecture matches Opteron’s scale-in and overall throughput capabilities.
The G5 architecture, enhanced by Apple’s chipset and its OS X operating system, is quite another beast. The PowerPC 970FX processor’s I/O design is impressive, but in terms of basic design, matching or exceeding its performance is within Intel’s capability. What will keep G5 in the running is the superiority of Apple’s custom-designed chipset and IBM’s remarkable Power4 core.
Even when used as a 32-bit chip, the G5 RISC architecture is so efficient that thoroughly optimized applications can scream, especially on floating point math, at which IBM has always excelled, and on vector calculations. The G5, as implemented in the Xserve G5 platform, is more balanced with regard to computing and throughput power than Opteron, which is certain to remain the throughput champion in its price range for some time to come.
There is plenty to consider in choosing entry server architectures, but no matter how you load the scales, Opteron and G5 will be, for many reasons, better choices than Xeon.