Over a decade ago, Andrew Tridgell was quoted as saying, "In 50 years' time, I doubt anyone will have ever heard of Samba, but they'll probably be using rsync in one way or another." As one of the principal developers of both utilities, he might have known a bit about what he was talking about.
More than 10 years later, I think that forecast is fairly safe. That's not to say that Samba is going to be forgotten anytime soon, but rsync is -- or should be -- a staple in just about every infrastructure across the globe.
On its face, rsync seems simple. Give it a source and a target directory, along with a method of communication, and it will make the directories identical. However, there's much more going on under the hood than many people realize. Rsync isn't just checking that the files match on either end; it uses a delta-transfer algorithm to compare segments of each file, which significantly reduces the time and bandwidth needed to synchronize them.
Take, for example, a large file of 5GB. It's easy to run an MD5 sum on the file at each end of a synchronization path to see whether the copies differ. If the sums don't match, a naive synchronization utility would ship the entire source file over to the target, resulting in a 5GB transfer. Rsync, by contrast, runs a rolling checksum across the entire file, comparing checksums for small segments, and transfers only the blocks that don't match. This is a simplified description of what's actually going on, but it gets the point across: if only 2MB of data actually changed in that 5GB file, only about 2MB of data will be transmitted from the source to the target -- a huge savings in time and bandwidth.
In addition, rsync uses compression to further reduce bandwidth when applicable, and it defaults to using SSH on most *nix systems for security. As a result, a simple rsync command performs more work than you might think.
There are caveats, however, such as when dealing with compressed files. Even when only a small portion of the underlying data has changed, compression algorithms can substantially alter the final compressed file, forcing far more data transfer than would otherwise be required. But even here lurk rsync-friendly capabilities that don't get much fanfare, such as gzip's rsyncable mode, which confines changes in the input to minor alterations in the compressed output, allowing rsync to transfer only the necessary data. This can result in massive performance and speed gains when dealing with compressed files needing synchronization.
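A quick sketch of the flag in question, assuming a GNU gzip build that includes --rsyncable (it shipped as a distribution patch for years before landing upstream in gzip 1.7, so older builds may lack it):

```shell
# --rsyncable periodically resets the compressor's state, so a small
# change in the input perturbs only a small region of the .gz output --
# exactly what rsync's block matching needs.
mkdir -p /tmp/rsyncable-demo
seq 1 200000 > /tmp/rsyncable-demo/data.txt
gzip --rsyncable -c /tmp/rsyncable-demo/data.txt > /tmp/rsyncable-demo/data.txt.gz
```

The output remains a perfectly ordinary gzip file -- any gunzip can read it -- at the cost of a slightly worse compression ratio.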
There are other oft-overlooked elements to rsync, such as the ability to create snapshots of directories or file systems without synchronizing all of the data during every pass. Say you have a 10GB directory tree that you wish to synchronize to another server every night. You want to retain a week's worth of backups at the target, but you'd really rather not rsync 10GB of data every night, nor store 10GB of data for every day of retention. Using rsync's --link-dest parameter, you can create a single 10GB backup, then instruct rsync to copy only the changes within that hierarchy on subsequent passes. This works by having rsync create hard links to all of the files that exist, unchanged, in the reference directory.