Data deduplication -- the process of detecting and removing duplicate data from a storage medium or file system -- is one of those simple ideas that gets complex in the implementation. Duplicate data may be detected at the file, bit, or block level, and the deduplication may be done at the client (source deduplication) or on the storage device itself (target deduplication). You can also choose between inline deduplication, which removes duplicate data before writing to the storage device, and postprocess deduplication, which removes duplicate data after the data is stored. In short, you have lots of flexibility -- and many trade-offs to consider.
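The core idea -- store identical content only once, by content hash -- can be sketched in a few lines. This is a minimal, illustrative inline dedup store, not any vendor's implementation; the class and method names are invented for the example:

```python
import hashlib

class InlineDedupStore:
    """Toy inline, file-level dedup: hash content before writing,
    and store each unique payload exactly once."""

    def __init__(self):
        self.blobs = {}   # content hash -> bytes (stored once)
        self.files = {}   # file name -> content hash (just a reference)

    def write(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:      # inline: dedupe before storing
            self.blobs[digest] = data
        self.files[name] = digest         # duplicate costs only a pointer

    def read(self, name):
        return self.blobs[self.files[name]]

store = InlineDedupStore()
store.write("report.doc", b"quarterly results")
store.write("report_copy.doc", b"quarterly results")  # identical content
print(len(store.files), len(store.blobs))  # two files, one stored blob
```

A postprocess design would accept every write as-is and run the same hashing pass later in the background, trading temporary extra capacity for faster ingest.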
File deduplication is the fastest method, but also the least effective because deduplication can occur only when files are identical at the bit level. If two files differ in name only, for example, both files will be stored. Bit- and block-level deduplication take a more granular approach. Both methods peer inside the file to analyze its contents, looking for duplicate sequences of bits or blocks of data. Deduplication at the bit or block level is much more effective than file-level dedupe, but also more processing-intensive. Greater data reduction comes at the cost of performance.
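The difference in effectiveness is easy to see in a sketch of fixed-size block-level dedup. Two files that share most of their content would each be stored in full by file-level dedup; block-level dedup stores the shared blocks once and keeps per-file "recipes" of block hashes. The 4 KiB block size and function names here are illustrative assumptions:

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed block size

def dedupe_blocks(files, block_store):
    """Split each file into fixed-size blocks, hash each block, and
    store only unique blocks; return a per-file recipe of hashes."""
    recipes = {}
    for name, data in files.items():
        hashes = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            block_store.setdefault(digest, block)  # each block stored once
            hashes.append(digest)
        recipes[name] = hashes
    return recipes

# v2.bin is v1.bin plus one new block; file-level dedup would store
# both files whole, but block-level dedup shares the common blocks.
base = b"A" * (BLOCK_SIZE * 3)
files = {"v1.bin": base, "v2.bin": base + b"B" * BLOCK_SIZE}
store = {}
recipes = dedupe_blocks(files, store)
print(len(store))  # 2 unique blocks instead of 7 blocks total

# Any file can still be reconstructed from its recipe:
restored = b"".join(store[h] for h in recipes["v2.bin"])
```

Note the cost the paragraph describes: every block of every file must be hashed and looked up, which is exactly the extra processing that file-level dedup avoids.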
When applied to backups, data deduplication delivers all sorts of benefits: vastly smaller backup sets, less data to push over the WAN for off-site backups, and the ability to keep more backup data online. Deduplication is really made for backups, which, after all, generally store the very same files over and over again. But deduplication is also inching its way into primary storage. Microsoft, for example, has made data deduplication for NTFS volumes a standard feature of Windows Server 2012. A look at the intelligent features of Microsoft's solution -- ignoring encrypted and compressed files, deduping only older files, processing in the background, and so on -- illustrates the many challenges of applying deduplication in a busy production environment.
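Those production policies amount to a filter over candidate files. The sketch below captures the spirit of such a filter; the three-day age cutoff, the extension list, and the function name are illustrative choices, not Microsoft's actual defaults:

```python
import time

MIN_AGE_SECONDS = 3 * 24 * 3600                 # assumed "older files" cutoff
SKIP_EXTENSIONS = (".zip", ".gz", ".jpg", ".mp4")  # already-compressed data

def is_dedupe_candidate(name, mtime, encrypted=False, now=None):
    """Return True if a file is worth handing to a background dedup pass."""
    now = time.time() if now is None else now
    if encrypted:                               # encrypted data won't match
        return False
    if name.lower().endswith(SKIP_EXTENSIONS):  # compressed data dedupes poorly
        return False
    if now - mtime < MIN_AGE_SECONDS:           # still hot; don't touch it
        return False
    return True

now = time.time()
print(is_dedupe_candidate("report.doc", now - 7 * 24 * 3600, now=now))  # True
print(is_dedupe_candidate("fresh.doc", now, now=now))                   # False
```

The point of each rule is the same: spend dedup cycles only where duplicate blocks are likely and where the scan won't compete with live workloads.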