Sun ZFS breaks all the rules

The innovative Zettabyte File System soars to new heights in scalability, reliability, and flexibility

It’s somewhat surprising that in the past five years, file systems haven’t changed much on any platform. There are dozens of file systems available for UNIX-like operating systems -- ext3, XFS, UFS, and ReiserFS, for example -- plus Microsoft’s ubiquitous NTFS, but since the journaling revolution there has been a dearth of innovation in mainstream file systems. Until now.

[See video: Screencast: Sun's ZFS on Thumper. ]

Soon after I started working with ZFS (Zettabyte File System), one thing became clear: the file system of the next 10 years will either be ZFS or something extremely similar. The fluidity, the malleability, and the scalability of ZFS far surpass anything available now on any platform. We’re talking about a file system that can address 256 quadrillion zettabytes of storage, and that can handle a maximum file size of 16 exabytes. For reference, a zettabyte is equal to one billion terabytes. To wrap your mind around what ZFS is and what it can do, you need to toss out just about everything you know about file systems and start over.
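Those capacity figures are easier to trust once you run the arithmetic. The following Python sketch is only a sanity check, and it rests on assumptions that are not spelled out above: that the pool limit corresponds to a 128-bit address space, that the file size limit corresponds to a 64-bit byte offset, and that the prefixes are read as powers of two. The constant names are invented for the illustration.

```python
# Power-of-two multipliers. Reading the prefixes this way is an assumption;
# it is the reading under which the quoted figures come out exact.
TEBI = 2**40
EXBI = 2**60
ZEBI = 2**70
QUADRILLION_BINARY = 2**50   # "quadrillion" read as 2**50 rather than 10**15

max_file_size = 2**64        # bytes reachable with a 64-bit offset (assumed)
pool_limit = 2**128          # bytes reachable with 128-bit addressing (assumed)

# 16 exabytes per file
assert max_file_size == 16 * EXBI

# 256 quadrillion zettabytes per pool
assert pool_limit == 256 * QUADRILLION_BINARY * ZEBI

# One zettabyte is a billion terabytes in decimal units,
# and 2**30 (a little over a billion) in binary units.
assert 10**21 // 10**12 == 10**9
assert ZEBI // TEBI == 2**30
```

With decimal prefixes the pool figure is approximate rather than exact, but the order of magnitude is the same.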

[ Sun ZFS was selected for an InfoWorld Technology of the Year award. ]

Perhaps the easiest way to communicate the underlying concepts of ZFS is a comparison the Sun developers drew during the design stages of the file system back in 2001. When you add RAM to a server, you don’t partition it and allocate one DIMM to this application and another DIMM to that application; you throw all of the RAM into a pile and let the memory manager decide who gets what and when. That simple, pragmatic view forms the basis of ZFS: There are no partitions and no fixed block sizes, no file system consistency check, no RAID initialization procedure, and no inodes – there’s just a pile of disk with ZFS in between.

I worked with ZFS extensively on Sun’s 48-disk Sun Fire X4500 server (see companion review), aptly named the Thumper. In fact, without ZFS, the Thumper wouldn’t be half the solution it is. Simply addressing the sheer number of physical drives in the X4500, not to mention the logical volume sizes that are possible, is at best difficult with any other file system. With ZFS, it’s surprisingly simple.

ZFS is a CLI adventure for now; you get no luxurious GUI tools to manage the file systems. Given the focus of ZFS, that’s hardly surprising. ZFS is also very simple in practice – now that’s surprising. Creating a ZFS pool of drives can be done in one line. Creating volumes in that pool is another line. Turning a volume into an NFS share or iSCSI target can be accomplished within the same line as the volume creation, and everything is instantaneous – no waiting for RAID initialization or file system creation. Creating a 20TB pool and a few volumes on the X4500 took about 20 seconds (the time required to type in the commands) and it was ready to go. To see for yourself just how fast and easy it is to drive ZFS, see the accompanying screencast.

Under the covers

There’s far more to ZFS than is possible to cover in this space, so I’m hitting the high points. Starting with the essentials, ZFS is composed of three parts. The ZPL (ZFS POSIX Layer) runs at a high level, taking I/O requests from the OS. Below that is the DMU (Data Management Unit), which takes those requests and translates them into transaction batches. Rather than requesting data blocks and sending single write requests, ZFS batches these into object-based transactions that can be optimized before any disk activity occurs. Once this is done, the batches are handed off to the SPA (Storage Pool Allocator) to schedule and aggregate the raw I/O. The copy-on-write basis of I/O transactions, coupled with checksums performed on a per-block basis, precludes the need for journaling. The file system can recover from an abrupt power loss at any point.
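To see why batching plus copy-on-write makes a journal unnecessary, consider this toy Python model. It is a sketch of the general idea only, not of the actual ZFS code: the class `CowStore` and its methods are hypothetical names, an in-memory dictionary stands in for the disk, and the "atomic" root switch is simply a Python assignment.

```python
class CowStore:
    """Toy copy-on-write store: committed blocks are never overwritten."""

    def __init__(self):
        self.blocks = {}    # block id -> bytes; blocks are only ever added
        self.next_id = 0
        self.root = {}      # last committed map: object name -> block id
        self.pending = {}   # batched writes that have not reached "disk"

    def write(self, name, data):
        # Writes are only queued. A later write to the same name in the
        # same batch supersedes the earlier one before any I/O happens.
        self.pending[name] = data

    def commit(self):
        # Put every pending object in a fresh block, then switch the root
        # in one step. Blocks referenced by the old root stay untouched.
        new_root = dict(self.root)
        for name, data in self.pending.items():
            self.blocks[self.next_id] = data
            new_root[name] = self.next_id
            self.next_id += 1
        self.root = new_root
        self.pending = {}

    def crash(self):
        # Simulated power loss: uncommitted work is gone, but the last
        # committed root still describes a complete, consistent state.
        self.pending = {}

    def read(self, name):
        return self.blocks[self.root[name]]


store = CowStore()
store.write("report", b"version 1")
store.commit()
store.write("report", b"version 2")
store.crash()                      # power fails before the second commit
assert store.read("report") == b"version 1"
```

Because the old blocks are never modified in place, there is no half-written state to replay or repair after the crash; the store simply continues from the last committed root.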

Another good example is the way ZFS handles simple disk mirrors. In a traditional two-disk mirror, reads from the mirror are handled in a round-robin fashion to improve read performance. This means that if there’s bit rot on one disk but not on the other, there's a fifty-fifty chance that data requested by an application will be invalid. With traditional RAID configurations, this data corruption will go largely unnoticed by the underlying layers, but the application will certainly realize that there’s a problem. ZFS overcomes data corruption by checksumming each block as it’s returned from disk. If there’s a disparity between the 256-bit checksum and the block, ZFS will terminate the request and pull the block from the other member of the mirror set, matching the checksums and delivering the valid data to the application. In a subsequent operation, the bad block seen on the first disk is replaced with the valid data from the second, essentially providing a continuous file system check.
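The self-healing read path can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a description of the real implementation: `Mirror` is a hypothetical class, two dictionaries play the role of the disks, SHA-256 stands in for whatever 256-bit checksum is actually configured, and the repair is done immediately instead of in a later operation.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    # SHA-256 is an assumed stand-in for a 256-bit block checksum.
    return hashlib.sha256(data).digest()


class Mirror:
    """Toy two-way mirror that verifies and repairs blocks on read."""

    def __init__(self):
        self.disks = [{}, {}]   # two members: block id -> bytes
        self.sums = {}          # block id -> expected checksum, stored apart from the data
        self.turn = 0           # which member serves the next read first

    def write(self, block_id, data):
        for disk in self.disks:
            disk[block_id] = data
        self.sums[block_id] = checksum(data)

    def read(self, block_id):
        first = self.turn
        self.turn = 1 - self.turn            # round-robin between members
        bad_members = []
        for idx in (first, 1 - first):
            data = self.disks[idx][block_id]
            if checksum(data) == self.sums[block_id]:
                for bad in bad_members:      # heal copies that failed the check
                    self.disks[bad][block_id] = data
                return data
            bad_members.append(idx)
        raise IOError("no valid copy of block %r" % (block_id,))


mirror = Mirror()
mirror.write(7, b"payroll data")
mirror.disks[0][7] = b"payroll dat\x00"      # simulate bit rot on one member
assert mirror.read(7) == b"payroll data"     # the application sees good data
assert mirror.disks[0][7] == b"payroll data" # and the bad copy has been repaired
```

The key design point is that the expected checksum lives apart from the block it describes, so a damaged block cannot vouch for itself.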

But aren’t checksums expensive? Yes. Well, at least they used to be. In the era of multicore CPUs, dedicating a single core of a CPU to performing checksums still leaves plenty of horsepower to handle everything else. The benefits offered by this form of I/O consistency validation eclipse the performance hits on modern hardware, and judging by my performance tests, it’s certainly not an issue.

Beyond the mirror

Of course, ZFS is capable of handling many more than two drives. In fact, it’s a 128-bit file system. Thus, the total capacity addressable by ZFS exceeds the limits of earthbound storage; the power needed to run the number of drives required to reach this limit would be enough to boil the earth’s oceans. That’s serious scalability.

ZFS has a number of neat tricks for managing numerous drives. Because all disk is thrown into a single pool, adding drives to existing arrays is instantaneous, and it requires no re-initialization. During quiescent periods, ZFS will reallocate the data across all disks for better performance, even while making newly added storage immediately available, with writes crossing all drives and reads coming from the original array members.

It appears that Sun also gave careful consideration to disk workload profiling. Server file systems are commonly asked to handle multiple sequential requests to single files. At first blush, these calls may appear to be random I/O, but a closer look will often reveal they are not so random. ZFS can smooth this type of workload with intelligent read-ahead caching at the block level, resulting in significant performance gains for streaming media and for some database workloads.

Another facet of the advanced I/O scheduling in ZFS is request prioritization. When a system is I/O bound, it’s generally due to the disk not keeping up with requests, or major swap operations. Once those requests stack up, basic system interaction slows to a crawl, and there’s nothing more frustrating than trying to kill the misbehaving process with a command that takes forever to run because it needs to be fetched from the very same disk that the runaway process is thrashing. Because ZFS gives reads priority over writes, the read necessary to execute the kill command in these cases gets pushed to the front of the queue, allowing order to be restored in a timely manner.
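The effect of giving reads priority can be shown with a small priority queue. The Python below is a hedged sketch, not the ZFS scheduler: `IoQueue`, the two priority constants, and the request labels are all made up for the example, and a production scheduler would also need some guard against starving writes, which this toy omits.

```python
import heapq
from itertools import count

READ, WRITE = 0, 1   # lower number is served first


class IoQueue:
    """Toy I/O queue in which pending reads always jump ahead of writes."""

    def __init__(self):
        self._heap = []
        self._seq = count()   # tie-breaker: keeps arrival order within a class

    def submit(self, kind, label):
        heapq.heappush(self._heap, (kind, next(self._seq), label))

    def next_request(self):
        kind, _, label = heapq.heappop(self._heap)
        return label


queue = IoQueue()
for i in range(3):
    queue.submit(WRITE, "runaway-write-%d" % i)   # the thrashing process
queue.submit(READ, "load-kill-binary")            # arrives last

order = [queue.next_request() for _ in range(4)]
assert order[0] == "load-kill-binary"             # served first anyway
assert order[1:] == ["runaway-write-0", "runaway-write-1", "runaway-write-2"]
```

Even though the read arrived behind three queued writes, it is dispatched first, which is what lets the kill command load while the disk is still being hammered.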

Smooth snapshots, security

As you would expect, ZFS incorporates snapshots with simple one-line CLI commands, and it allows snapshots to be addressed in both read-only and read-write forms. Rollbacks and individual file inspection in snapshots are also easy to do. Further, ZFS has integrated rsync-like file synchronization, allowing for truly different backup methods, such as piping raw file system data across SSH connections to backup servers with enough smarts to be usable across high-latency links.
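Snapshots, writable copies, and rollbacks are cheap when existing data is never overwritten in place, and a short Python model shows why. Everything here is an assumption made for illustration: `Dataset` and its methods are hypothetical names, and a dictionary of file names to immutable byte strings stands in for the on-disk block tree.

```python
class Dataset:
    """Toy file system whose snapshots share data with the live state."""

    def __init__(self, files=None):
        self.files = dict(files or {})   # file name -> immutable bytes
        self.snapshots = {}              # snapshot name -> frozen file map

    def write(self, name, data):
        # Point the name at new data; the old bytes are left untouched.
        self.files[name] = data

    def snapshot(self, snap):
        # Copies references only, so the cost does not depend on data size.
        self.snapshots[snap] = dict(self.files)

    def rollback(self, snap):
        # Return the live state to exactly what the snapshot recorded.
        self.files = dict(self.snapshots[snap])

    def clone(self, snap):
        # A writable dataset that starts from the snapshot's contents.
        return Dataset(self.snapshots[snap])


live = Dataset()
live.write("config", b"good settings")
live.snapshot("before-upgrade")
live.write("config", b"broken settings")

scratch = live.clone("before-upgrade")        # read-write form of the snapshot
scratch.write("config", b"experimental settings")

assert live.snapshots["before-upgrade"]["config"] == b"good settings"
live.rollback("before-upgrade")
assert live.files["config"] == b"good settings"
assert scratch.files["config"] == b"experimental settings"
```

Writing to the clone never disturbs the snapshot it came from, and rolling back is just a matter of pointing the live state at the old references again.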

There’s also the not-so-small matter of ACLs, which ZFS handles with standard POSIX compliance and allow/deny inheritance. Checksumming is a boon from a security standpoint as well: Because every block has a checksum, data can’t be modified at that level without detection. Oh, and did I mention that ZFS can also sit on top of other storage elements, such as iSCSI LUNs (logical unit numbers) and swap volumes? Sun says its engineers have subjected ZFS to more than a million forced, violent crashes in the company's labs without losing data integrity or leaking a single block. I haven’t witnessed such a crash, but I have to say I believe Sun’s claims.

There are downsides to ZFS, such as the current inability to boot from a ZFS volume on Solaris, and the fact that if a snapshot is taken during a scrub operation or mirror resilvering, the process will start over. ZFS is not perfect, but existing development efforts within Sun and via the open source community are likely to overcome these hurdles in time.

Speaking of open source, at the moment several projects are under way to port ZFS beyond Solaris. There are nascent Linux and FreeBSD ports in the works as well as ZFS for Mac OS X. Leopard, the next version of Mac OS X, is said to include many capabilities that seem to directly map to ZFS features. Rumors have been flying for months, including some very convincing screenshots, but the proof will be in the final release of Mac OS X 10.5.

It’s not every day that the computer industry delivers the level of innovation found in Sun's ZFS. More and more advances in the science of IT are based on simply multiplying the status quo. ZFS breaks all the rules here, and it arrives in an amazingly well-thought-out and nicely implemented solution. This is the kind of engineering that made Sun a powerhouse. The achievement of ZFS certainly bodes well for a company that might just be pulling itself back from also-ran status and into the limelight once more.

Copyright © 2007 IDG Communications, Inc.
