If commercial Unix vendors weren’t already worried about Linux, they should be now. Linux has seen wide deployment in datacenters, generally as a Web server or a file server, or to handle network tasks such as DNS and DHCP, but not as a platform for running mission-critical enterprise applications. Solaris, AIX, or HP/UX typically get the nod when an application demands the highest levels of performance and scalability. The recent release of a new Linux kernel, v2.6, promises to change that.
The v2.6 kernel ushers in a new era of support for big iron with big workloads, opening the door for Linux to handle the most demanding tasks currently reserved for Solaris, AIX, or HP/UX. The new kernel not only supports greater amounts of RAM and a higher processor count; the core of device management has also changed. Previous kernels imposed limits that could constrain large systems, such as a 65,536-process limit before rollover and 256 devices per chain. The v2.6 kernel moves well beyond these limitations, and it includes support for some of the largest server architectures around.
Will the new Linux really perform in the same league as the big boys? To find out, I put the v2.6.0 kernel through several real-world performance tests, comparing its file server, database server, and Web server performance with a recent v2.4 series kernel, v2.4.23.
Linux Meets Big Iron
A primary focus of the v2.6 kernel is large server architectures. Support for up to 64GB of RAM in paged mode, the ability to address file systems larger than 2TB, and support for 64 CPUs in x86-based SMP systems bring this kernel and Linux into the more rarefied air of truly mission-critical systems. Also new is support for NUMA (Non-Uniform Memory Access) systems, a next-generation SMP architecture, and for PAE (Physical Address Extensions), which provides access to up to 64GB of RAM on 32-bit systems.
There is much more to v2.6 than just bigger numbers in processor and RAM counts, however. This kernel breaks through some of the artificial limitations that have been present in Linux from the beginning, such as the number of addressable devices and the total number of available PIDs (Process Identifiers). The v2.4 kernel supported 255 major devices with 255 minor numbers each. (For example, a volume on a SCSI disk located at /dev/sda3 has a major number of 8, since it’s a SCSI device, and a minor number of 3.) On servers with a large number of real or virtual devices, device allocation can become problematic. The v2.6 kernel addresses these issues in a big way, moving to 4,096 major devices with more than one million minor numbers per major device. For most users, these numbers are well beyond practical limits, but for enterprise systems that need to address many devices, it’s a major step.
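To see these numbers from userspace, a minimal C sketch like the one below (my own illustration, not part of the kernel or of the benchmark suite) stats a device node and prints its major and minor numbers; the /dev/sda3 default is simply the example cited above.

    /* Print the major and minor numbers of a device node.
     * Defaults to /dev/sda3, the example used in the text. */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/sysmacros.h>   /* major(), minor() */

    int main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "/dev/sda3";
        struct stat st;

        if (stat(path, &st) != 0) {
            perror("stat");
            return 1;
        }

        /* For block and character devices, st_rdev holds the device number. */
        printf("%s: major %u, minor %u\n",
               path, major(st.st_rdev), minor(st.st_rdev));
        return 0;
    }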
Also new in v2.6 is NPTL (Native POSIX Threading Library) in lieu of v2.4’s LinuxThreads. NPTL brings enterprise-class threading support to Linux, far surpassing the performance offered by LinuxThreads. As of October 2003, NPTL support was merged into the GNU C library, glibc, and Red Hat first implemented NPTL within Red Hat Linux 9 using a customized v2.4 kernel.
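If you want to confirm which threading library your system is actually using, glibc can report it directly. The short C sketch below is my own check, not part of NPTL itself; on an NPTL-enabled distribution the reported string typically begins with "NPTL", while older LinuxThreads systems report "linuxthreads".

    /* Ask glibc which pthread implementation is in use. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[64];

        if (confstr(_CS_GNU_LIBPTHREAD_VERSION, buf, sizeof(buf)) > 0)
            printf("Threading library: %s\n", buf);
        else
            printf("_CS_GNU_LIBPTHREAD_VERSION not available\n");
        return 0;
    }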
Also introduced in the v2.6 kernel is a new approach to devices. The v2.4 kernel’s devfs-based device handler has a companion in the v2.6 kernel. The newcomer, udev, is essentially an implementation of devfs functionality in userspace. Using udev, the system can follow devices as they move around on connected buses, with the device identifier remaining static. For instance, the first-seen SCSI device will remain device sda, using the serial number of the device as an identifier regardless of the order in which it’s found during a later boot. The use of udev is a significant change at the core of the kernel and the cause of some consternation among Linux kernel developers, with solid arguments on both sides. It looks like udev/sysfs will become the standard, deprecating devfs, but both are present in the v2.6 kernel and are likely to remain for some time.
Yet another significant change in the v2.6 kernel is the merging of the uClinux project into the core kernel. The uClinux project has focused on Linux kernel development for embedded devices. The main driver for this functionality is support for processors lacking MMUs (Memory Management Units), commonly found in microcontrollers for embedded systems such as fire alarm controllers or PDAs. The list of embedded controllers that v2.6 supports is quite long, including common processors manufactured by Hitachi, NEC, and Motorola. This marks a definite departure from the roots of the Linux kernel, as all prior kernels were more or less subject to the limitations of the Intel x86 architecture.
Built for Speed
Prior to the release of the v2.6 kernel, Linux performed tasks on a first-come, first-served basis; interrupting the kernel midtask to handle another process or function was not in the cards. The v2.6 kernel, however, can be pre-empted when needed: it can allocate resources to a process that requires immediate attention, then resume the interrupted task. These interruptions are measured in fractions of a second and are not generally noticeable, but they lend an overall feeling of smoothness to system performance. The v2.6 kernel does not make Linux a true real-time operating system, but it goes a long way toward ensuring that tasks are addressed and completed when required.
At the core of these enhancements is a new process scheduler. The process scheduler in the kernel divides CPU resources among system processes, and its performance directly affects system responsiveness and process latency. The v2.6 kernel’s new O(1) scheduler incorporates algorithms that can substantially increase system performance, especially for interactive tasks. The O(1) scheduler penalizes CPU-hogging processes, improves process prioritization, and provides consistent performance across all processes. Also new in v2.6 are two I/O schedulers. The default, the anticipatory scheduler, brings much improved handling of I/O scheduling, ensuring that processes get I/O time when necessary without unnecessary queuing; it attempts to anticipate process I/O requests before they are actually issued. Also present is the deadline scheduler, which assigns an expiration to requests and services them from three queues.
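To give a feel for how a constant-time pick works, here is a deliberately simplified C sketch of the priority-array-plus-bitmap idea. It is my own illustration of the concept, not the kernel’s code, which additionally juggles active and expired arrays, timeslices, and interactivity bonuses.

    /* Conceptual sketch of an O(1) run queue: one list of runnable tasks
     * per priority level, plus a bitmap marking non-empty levels, so the
     * next task is found with a fixed-size bitmap scan instead of a walk
     * over every runnable task. */
    #include <stdio.h>
    #include <string.h>

    #define NPRIO 140                       /* v2.6 uses 140 priority levels */
    #define BITS  (8 * sizeof(unsigned long))

    struct task {
        const char *name;
        struct task *next;
    };

    struct runqueue {
        unsigned long bitmap[(NPRIO + BITS - 1) / BITS];
        struct task *queue[NPRIO];          /* runnable tasks per priority */
    };

    static void enqueue(struct runqueue *rq, struct task *t, int prio)
    {
        t->next = rq->queue[prio];
        rq->queue[prio] = t;
        rq->bitmap[prio / BITS] |= 1UL << (prio % BITS);
    }

    /* Lower numbers mean higher priority, so the first set bit wins. */
    static struct task *pick_next(struct runqueue *rq)
    {
        size_t w, words = sizeof(rq->bitmap) / sizeof(rq->bitmap[0]);
        for (w = 0; w < words; w++) {
            if (rq->bitmap[w]) {
                int prio = (int)(w * BITS) + __builtin_ctzl(rq->bitmap[w]);
                return rq->queue[prio];
            }
        }
        return NULL;                        /* nothing runnable */
    }

    int main(void)
    {
        struct runqueue rq;
        struct task batch  = { "batch-job", NULL };
        struct task editor = { "editor", NULL };

        memset(&rq, 0, sizeof(rq));
        enqueue(&rq, &batch, 120);          /* lower priority */
        enqueue(&rq, &editor, 100);         /* higher priority */
        printf("next task: %s\n", pick_next(&rq)->name);
        return 0;
    }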
There has been much debate over the scheduler used in this kernel, and both are supported, selected at boot time with options passed to the kernel. The importance of scheduler performance cannot be overstressed. My tests show that the anticipatory scheduler in v2.6 surpasses the v2.4 scheduler handily; some of my tests show a tenfold performance increase. For instance, a simple read of a 500MB file during a streaming write with a 1MB block size on my Xeon-based test system took 37 seconds with v2.4.23 and 3.9 seconds with v2.6. The deadline scheduler also performs quite well, but may not be as fluid for certain workloads as the anticipatory scheduler. Either way, the new process and I/O schedulers blow v2.4’s schedulers out of the water.
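For reference, the read half of that test boils down to something like the following C sketch: pull a large file through 1MB reads and time it. The file name is purely illustrative, the concurrent streaming write described above would run in a separate process, and the read should be done against a cold cache for meaningful numbers.

    /* Time a sequential read of a large file in 1MB blocks. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define BLOCK (1024 * 1024)             /* 1MB block size, as in the test */

    int main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "testfile.dat";
        char *buf = malloc(BLOCK);
        struct timeval start, end;
        long long total = 0;
        ssize_t n;

        int fd = open(path, O_RDONLY);
        if (fd < 0 || buf == NULL) {
            perror("setup");
            return 1;
        }

        gettimeofday(&start, NULL);
        while ((n = read(fd, buf, BLOCK)) > 0)
            total += n;
        gettimeofday(&end, NULL);

        double secs = (end.tv_sec - start.tv_sec) +
                      (end.tv_usec - start.tv_usec) / 1e6;
        printf("read %lld bytes in %.1f seconds (%.2f MBps)\n",
               total, secs, total / secs / (1024.0 * 1024.0));

        close(fd);
        free(buf);
        return 0;
    }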
In addition to the new scheduler, v2.6 brings plenty of other major architectural changes. The module-handling code has been completely rewritten, requiring a new set of userspace module utilities and mkinitrd packages to function. These can be found as updates to most major Linux distributions or via download. The new module tools and in-kernel module code are much smoother than those found in v2.4, and a kernel can now be compiled without support for module unloading to help ensure the integrity of a production kernel.
Clocking the New Kernel
To test the new kernel, I opted for scenarios that would be most relevant to real-world users. Testing individual portions of the kernel, such as disk I/O and memory management, can be interesting, but what does it mean for overall system performance? To get the big picture, I selected a few tests representative of expected server workloads and used them to compare the performance of the v2.6 and v2.4 kernels.
Tests were run on three separate hardware platforms: Intel Xeon (x86), Intel Itanium (IA-64), and AMD Opteron (x86_64). The x86 tests were conducted on an IBM eServer x335 1U rack-mount server with dual 3.06GHz P4 Xeon processors and 2GB of RAM. The Itanium tests were run on an IBM eServer x450 3U rack-mount server with dual 1.5GHz Itanium2 processors and 2GB of RAM. And the Opteron tests were run on a Newisys 4300 3U rack-mount server with dual 2.2GHz Opteron 848 processors and 2GB of RAM.
In the Samba file-serving test, the v2.4 kernel pushed 38.85MBps on average on the Xeon system, and the v2.6 kernel pushed 67.30MBps -- a 73 percent improvement. The Itanium tests show similar differences between the kernels, giving v2.6 a 52 percent gain, albeit with smaller overall figures. And the Opteron system, which really showed its muscle in this test, turned in a respectable 49.37MBps on the v2.4 kernel and an impressive 72.92MBps under v2.6, an increase of roughly 48 percent.
The performance gains seen in the Samba tests are likely related to the vastly improved scheduler and I/O subsystem in the v2.6 kernel. Disk I/O and network I/O form the core of this test, and the performance improvements in the v2.6 kernel are very visible here.
Across the board, the v2.6 kernel outperformed the v2.4 kernel in the database tests, especially on the Itanium box, where it posted a speed increase of 23 percent (a 519-second lead) over the v2.4 kernel. On the Xeon platform, v2.6 showed almost a 13 percent gain (a 200-second lead) over v2.4. And on Opteron, it registered a 29 percent speed increase (a 415-second lead) over v2.4. The most impressive individual test was table inserts, showing the v2.6 kernel providing a 10 percent performance increase (with a 100-second lead) over v2.4 on Xeon, with even better results found on the Opteron and Itanium platforms.
The Web server tests also showed significant improvement. The static page test used a 21.5KB HTML page with two 25KB images served by Apache 2.0.48, measured in requests per second using Apache’s ab benchmarking tool. The Xeon tests show the v2.6 kernel outperforming v2.4 by just under 1,000 requests per second, a 40 percent increase. The Itanium tests showed v2.6 providing a 47 percent performance increase, while the Opteron tests showed a 7 percent increase. It should be noted that the Opteron system outperformed the other two servers by more than 1,000 requests per second with the v2.4 kernel, and its smaller increase may be due to network bandwidth constraints on the server. In retrospect, I believe that if I upped the network connectivity of the Newisys box with bonded Gigabit Ethernet NICs, I could push it even faster.