Sun ZFS breaks all the rules
The innovative Zettabyte File System soars to new heights in scalability, reliability, and flexibility
There’s far more to ZFS than is possible to cover in this space, so I’m hitting the high points. Starting with the essentials, ZFS comprises three parts. The ZPL (ZFS POSIX Layer) runs at a high level, taking I/O requests from the OS. Below that is the DMU (Data Management Unit), which takes those requests and translates them into transaction batches. Rather than requesting data blocks and sending single write requests, ZFS batches these into object-based transactions that can be optimized before any disk activity occurs. Once this is done, the batches are handed off to the SPA (Storage Pool Allocator) to schedule and aggregate the raw I/O. Because I/O transactions are copy-on-write and every block is checksummed, ZFS has no need for journaling: live data is never overwritten in place, so the file system can recover from an abrupt power loss at any point.
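To make the copy-on-write idea concrete, here is a minimal sketch in Python. It is a toy model and not ZFS code: the `CowStore` class, its method names, and the `crash_before_root_switch` flag are all hypothetical, and SHA-256 stands in for whatever checksum the file system uses. It shows writes being batched, written to fresh locations, and made visible only when the root pointer is switched.

```python
import hashlib


class CowStore:
    """Toy copy-on-write store (hypothetical, for illustration only)."""

    def __init__(self):
        self.blocks = {}    # address -> bytes; existing blocks are never overwritten
        self.root = {}      # committed view: name -> (address, checksum)
        self.pending = {}   # batched writes that have not reached "disk" yet
        self.next_addr = 0

    def write(self, name, data):
        # Writes are only queued here; nothing touches the block store yet.
        self.pending[name] = data

    def commit(self, crash_before_root_switch=False):
        # Write every pending block to a brand-new address.
        new_root = dict(self.root)
        for name, data in self.pending.items():
            addr = self.next_addr
            self.next_addr += 1
            self.blocks[addr] = data
            new_root[name] = (addr, hashlib.sha256(data).hexdigest())
        self.pending = {}
        if crash_before_root_switch:
            # Simulated power loss: the new blocks exist but are unreachable,
            # and the old root still describes a consistent state.
            return
        # Switching the root is the single step that publishes the batch.
        self.root = new_root

    def read(self, name):
        addr, checksum = self.root[name]
        data = self.blocks[addr]
        if hashlib.sha256(data).hexdigest() != checksum:
            raise IOError("checksum mismatch for " + name)
        return data


store = CowStore()
store.write("a", b"version 1")
store.commit()
store.write("a", b"version 2")
store.commit(crash_before_root_switch=True)  # interrupted batch
recovered = store.read("a")                  # still the last committed data
```

Because the old blocks stay untouched until the root changes, an interrupted batch leaves nothing to replay or repair, which is the job a journal would otherwise do.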
Another good example is how ZFS handles simple disk mirrors. In a traditional two-disk mirror, reads are served from the two disks in round-robin fashion to improve read performance. This means that if there’s bit rot on one disk but not on the other, there’s a fifty-fifty chance that data requested by an application will be invalid. With traditional RAID configurations, this data corruption will go largely unnoticed by the underlying layers, but the application will certainly realize that there’s a problem. ZFS overcomes data corruption by checksumming each block as it’s returned from disk. If the block does not match its 256-bit checksum, ZFS will terminate the request and pull the block from the other member of the mirror set, verifying the checksum and delivering the valid data to the application. In a subsequent operation, the bad block seen on the first disk is replaced with the valid data from the second, essentially providing a continuous file system check.
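The self-healing read described above can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the ZFS implementation: the `Mirror` class and its fields are hypothetical, each disk is just a dictionary, and SHA-256 is used only because it is a convenient 256-bit digest in the standard library.

```python
import hashlib


class Mirror:
    """Toy two-way mirror with checksummed, self-healing reads (hypothetical)."""

    def __init__(self):
        self.disks = [{}, {}]   # each disk maps block number -> bytes
        self.checksums = {}     # kept apart from the data blocks themselves
        self.next_disk = 0      # round-robin pointer for reads

    def write(self, blkno, data):
        self.checksums[blkno] = hashlib.sha256(data).digest()
        for disk in self.disks:
            disk[blkno] = data

    def read(self, blkno):
        expected = self.checksums[blkno]
        first = self.next_disk
        self.next_disk = (self.next_disk + 1) % len(self.disks)
        # Try the round-robin choice first, then the remaining mirror members.
        order = [first] + [i for i in range(len(self.disks)) if i != first]
        bad = []
        for i in order:
            data = self.disks[i][blkno]
            if hashlib.sha256(data).digest() == expected:
                # Repair every copy that failed verification on the way here.
                for j in bad:
                    self.disks[j][blkno] = data
                return data
            bad.append(i)
        raise IOError("no valid copy of block %d" % blkno)


mirror = Mirror()
mirror.write(7, b"payload")
mirror.disks[0][7] = b"pAyload"   # simulate bit rot on the first disk
result = mirror.read(7)           # served from the second disk, first disk repaired
```

The application only ever sees data that matched its checksum, and the damaged copy is rewritten as a side effect of the read.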
But aren’t checksums expensive? Yes. Well, at least they used to be. In the era of multicore CPUs, dedicating a single core to performing checksums still leaves plenty of horsepower to handle everything else. The benefits offered by this form of I/O consistency validation outweigh the performance hit on modern hardware, and judging by my performance tests, it’s certainly not an issue.
Beyond the mirror
Of course, ZFS is capable of handling many more than two drives. In fact, it’s a 128-bit file system. Thus, the total capacity addressable by ZFS not only exceeds the limits of earthbound storage, but the power requirements for the number of drives required to reach this limit would be enough to boil the earth’s oceans. That’s serious scalability.
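For a sense of scale, the arithmetic is easy to check. The sketch below treats the 128 bits as a count of addressable bytes, which is a simplification, and the 1 TB drive size is an arbitrary assumption chosen only to make the numbers tangible.

```python
ADDRESS_BITS = 128

# Upper bound on addressable bytes if every 128-bit value names one byte.
capacity_bytes = 2 ** ADDRESS_BITS

# How many times larger that is than a 64-bit address space.
ratio_vs_64bit = capacity_bytes // 2 ** 64

# Hypothetical drive size: 1 TB (10**12 bytes), an assumption for illustration.
DRIVE_BYTES = 10 ** 12
drives_needed = capacity_bytes // DRIVE_BYTES

print(capacity_bytes)
print(ratio_vs_64bit)
print(drives_needed)
```

Under those assumptions the limit works out to roughly 3.4 x 10^38 bytes, or on the order of 10^26 one-terabyte drives.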
ZFS has a number of neat tricks for managing numerous drives. Because all disks are thrown into a single pool, adding drives to existing arrays is instantaneous and requires no re-initialization. Newly added storage is available immediately, with writes crossing all drives and reads coming from the original array members. During quiescent periods, ZFS will reallocate the data across all disks for better performance.
It appears that Sun also gave careful consideration to disk workload profiling. Server file systems are commonly asked to handle multiple sequential requests to single files. At first blush, these calls may appear to be random I/O, but a closer look will often reveal they are not so random. ZFS can smooth this type of workload with intelligent read-ahead caching at the block level, resulting in significant performance gains for streaming media and for some database workloads.
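The observation that interleaved sequential readers can look random is simple to demonstrate. The following Python sketch is not ZFS's prefetch algorithm; `StreamDetector`, `access`, and the `depth` parameter are hypothetical names for a deliberately naive detector that remembers the next block each stream is expected to request.

```python
class StreamDetector:
    """Naive detector for interleaved sequential streams (hypothetical)."""

    def __init__(self, depth=4):
        self.depth = depth      # how many blocks to read ahead on a hit
        self.expected = set()   # next block number anticipated for each stream

    def access(self, block):
        """Record a read and return the block numbers worth prefetching."""
        if block in self.expected:
            # The request continues a known stream: advance it and read ahead.
            self.expected.discard(block)
            self.expected.add(block + 1)
            return list(range(block + 1, block + 1 + self.depth))
        # Unknown block: treat it as the possible start of a new stream.
        self.expected.add(block + 1)
        return []


detector = StreamDetector(depth=2)
# Two readers walking the same file from different offsets, interleaved.
trace = [0, 100, 1, 101, 2]
prefetches = [detector.access(b) for b in trace]
```

Taken as a single sequence, the trace jumps back and forth across the file, yet from the third request onward every read continues one of two streams. A real implementation would also expire stale entries; this sketch lets the set grow without bound to stay short.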