There is one area where the DOS era’s “I own the whole system” attitude persists, and it’s a virtualization millstone: device drivers. You may not realize how ugly the problem and present solution are, or how simple and elegant the real solution will be.
Let’s address ugly first, shall we? At boot time, each peripheral controller in an x86 system is mapped into a fixed range of memory addresses in the OS kernel’s address space, and the device driver takes sole ownership of that range. Once that one-to-one relationship between driver and controller is established, there is no place where a virtualization host can tap the traffic between devices and drivers so that devices can be shared. In fact, there is no apparent traffic to tap -- just reads and writes to memory, which software cannot intercept.
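To make the problem concrete, here is a minimal sketch of why there is nothing to intercept. The class and register names are hypothetical, and a Python object stands in for real hardware; the point is that the driver holds a direct reference to the device’s register block, and every “I/O operation” is an ordinary memory write.

```python
# Illustrative model only: the "device" is just a block of memory-mapped
# registers, and the driver writes to it directly. There is no function
# call, message, or channel in between that a hypervisor could hook.

class DiskController:
    """Stand-in for a peripheral's memory-mapped register block."""
    def __init__(self):
        self.registers = {"command": 0, "lba": 0, "status": 0}

class DiskDriver:
    """Owns the controller outright -- a fixed one-to-one binding."""
    def __init__(self, controller):
        self.regs = controller.registers  # direct reference, set once at boot

    def read_sector(self, lba):
        # Plain stores into the register block; nothing to intercept.
        self.regs["lba"] = lba
        self.regs["command"] = 0x20  # hypothetical READ command code

controller = DiskController()
driver = DiskDriver(controller)
driver.read_sector(42)
print(controller.registers["lba"])  # 42
```

Two drivers writing into the same register block would trample each other, which is why one physical controller cannot simply be handed to several guests at once.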
Today, virtualization hosts are forced to take ownership of all system devices and present dummy, emulated devices to guest OS instances. This gets the job of sharing done, but with gross overhead and limitations. For example, imagine that you’re running a server with a top-end storage array controller. When the virtualization host initializes, its device driver grabs hold of this controller. When a guest OS boots, the host makes that fancy controller look like something far simpler, such as an Adaptec PCI SCSI or Intel parallel ATA adapter. The host mimics such old, simple peripheral controllers -- and does the same for network adapters, video cards, and all the rest -- because older devices are easier to emulate, they use fewer resources, and it’s likely that every guest OS will have drivers for them.
In the worst case, which is often the common case, every disk I/O request a guest makes gets converted to a user-level system call by the host, which then trickles down through another two or three layers to get to the real device driver, and the result bubbles up to the guest after the data’s been copied from one location to another several times. Through this whole process, the guest’s driver probably has the guest OS kernel locked while it waits for a response. What’s worse, your fancy storage controller with a 256MB cache might be emulated as a controller with a 128KB buffer. So not only does every request for a block of disk data have to travel all the way up and down the whole host/guest stack, it has to be broken into much smaller bits and many more requests.
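The fragmentation penalty is easy to quantify. A sketch, using the article’s 128KB emulated buffer (the 4MB transfer size is a made-up example):

```python
# Illustrative arithmetic: how many emulated-device requests it takes to
# move one large transfer when the emulated controller's buffer is only
# 128KB. Sizes are in bytes.

EMULATED_BUFFER = 128 * 1024      # 128KB buffer on the emulated controller

def requests_needed(transfer_bytes, buffer_bytes=EMULATED_BUFFER):
    # Ceiling division: each request carries at most one buffer's worth.
    return -(-transfer_bytes // buffer_bytes)

transfer = 4 * 1024 * 1024        # a single 4MB read from the guest
print(requests_needed(transfer))  # 32 round trips through the whole stack
```

The real controller, with its 256MB cache, could have absorbed that transfer in one request; the emulated device turns it into 32 trips up and down the host/guest stack.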
And now for the elegant part. The specification for AMD’s IOMMU (I/O memory management unit) shows that the problem has a fairly simple fix: with an IOMMU, the virtualization host can create real one-to-one mappings between peripherals and drivers that are unique to each guest and managed entirely by the CPU. The CPU inserts the tap between drivers and peripherals by watching memory transfers and, through the kind of address translation that the x86 architecture has never extended to I/O, gives each guest the impression that it is dealing directly with the device.
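A sketch of the remapping an IOMMU performs. Everything here -- the table layout, the names -- is hypothetical and wildly simplified (the real unit walks page tables in hardware), but it shows the essential trick: a per-guest translation of device-visible addresses, with a fault on anything unmapped.

```python
# Toy model of IOMMU-style address translation: each guest gets its own
# mapping from guest-physical pages to host-physical pages, so a device
# doing DMA on a guest's behalf can touch only that guest's memory.

PAGE = 4096

class IOMMU:
    def __init__(self):
        self.tables = {}  # guest id -> {guest page number: host page number}

    def map(self, guest, guest_addr, host_addr):
        table = self.tables.setdefault(guest, {})
        table[guest_addr // PAGE] = host_addr // PAGE

    def translate(self, guest, guest_addr):
        table = self.tables.get(guest, {})
        page = table.get(guest_addr // PAGE)
        if page is None:
            raise PermissionError("I/O page fault: unmapped DMA address")
        return page * PAGE + guest_addr % PAGE

iommu = IOMMU()
iommu.map(guest=1, guest_addr=0x1000, host_addr=0x9000)
print(hex(iommu.translate(1, 0x1234)))  # 0x9234
```

Guest 1’s driver believes the device lives at 0x1000-region addresses; the IOMMU silently redirects every transfer to the host pages actually assigned to that guest, and any other guest (or a buggy driver) addressing unmapped memory takes a fault instead of corrupting a neighbor.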
The host still has to receive and route to guests the hardware interrupts that say “I’m finished,” but that is comparatively simple. IOMMU is useful even in nonvirtualized environments, where it sets up the tantalizing possibility of permitting highly performance-sensitive processes completely safe, near-direct access to peripherals without the overhead of a driver. And that lets me close with an important point: Virtualization needn’t be an all-or-nothing arrangement. All of the hardware and OS facilities we’re putting in place to support efficient virtualization will make nonvirtualized systems and applications more flexible, reliable, and efficient.