Each virtualized server is contained in a file. For instance, VMware uses a single VMDK (virtual machine disk) file as the virtual hard disk for the virtual machine. As you would expect, VMDK files tend to be rather large -- at least 2GB in size, and usually much larger.
One of the great features of virtual machines is that admins can stop the VM, copy the VMDK file, and back it up. Simply restart the machine and you're back online. Now what happens with all of these backup copies? That's right -- a lot of duplicated files stored on a file server. Admins keep "golden images" of working virtual servers to spawn new virtual machines -- not to mention the backup copies. Virtualization is a fantastic way to get the most out of CPU and memory, but without deduplication, virtual hard disks can actually increase network storage requirements.
Straining backup systems
How do you back up all this data? Old tape backup systems are too slow and lack the needed capacity. New high-end tape systems have the performance and capacity but are quite expensive. And no matter how good your tape drive is, Murphy's Law has a tendency to jump all over tape when it comes to restoration.
VTLs (virtual tape libraries) provide a modern alternative to tape, using hard disks in configurations that mimic standard tape drives. But at what cost? Additional spindles equal additional cost and additional power consumption. VTLs are fast and provide a reliable backup and restore destination, but if there were less data to back up, you'd have lower hardware and operating costs to begin with.
Data glut compounds the difficulty of disaster recovery, making each stage of nearline and offline storage more expensive. Keeping a copy of the backup in nearline storage makes restoration of missing or corrupt files easy. But depending on the backup set size and the number of backup sets admins want to keep handy, your nearline storage requirements can be quite substantial. The next tier, offline storage, is composed of tapes or other media copies that get thrown in a vault or sent to some other secure location. Again, if the data set is large and growing, this offline media set must expand to fit.
Many disaster recovery plans include sending the backup set to another geographical location over a WAN. Unless your company has deep pockets and can afford a very fast WAN link, it pays to keep the size of the backup set to a minimum. That goes double for restoring data. If the set is really large, restoring from an off-site backup will add downtime and frustration.
Defining data deduplication and its benefits
Simply put, deduplication is the process of detecting and removing duplicate data from a storage medium or file system. Detection of duplicate data may be performed at the file, bit, or block level, depending on the type and aggressiveness of the deduplication process.
The first time a deduplication system sees a file or a chunk of a file, it identifies that data element, typically by computing a fingerprint such as a hash, and records it. Thereafter, each subsequent identical item is removed from the system and replaced with a small placeholder. The placeholder points back to the first instance of the data chunk so that the deduped data can be reassembled when needed.
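The identify-then-point-back mechanism can be sketched in a few lines of Python. This is a toy illustration, not how any particular product works: the `DedupStore` class, its method names, the fixed chunk size, and the choice of SHA-256 as the fingerprint are all assumptions made for the example.

```python
import hashlib


class DedupStore:
    """Toy block-level deduplicating store (hypothetical, for illustration)."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size  # assumed fixed-size chunking
        self.chunks = {}  # fingerprint -> chunk bytes, each stored only once
        self.files = {}   # file name -> ordered list of fingerprints (placeholders)

    def put(self, name, data):
        """Split data into chunks and keep only chunks not seen before."""
        refs = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            # The first sighting stores the chunk; later sightings only add a reference.
            self.chunks.setdefault(fingerprint, chunk)
            refs.append(fingerprint)
        self.files[name] = refs

    def get(self, name):
        """Reassemble a file by following its placeholders back to the chunks."""
        return b"".join(self.chunks[fp] for fp in self.files[name])

    def stored_bytes(self):
        """Bytes of unique chunk data actually held by the store."""
        return sum(len(chunk) for chunk in self.chunks.values())


# Two small "files" that share two of their three chunks.
store = DedupStore(chunk_size=4)
store.put("a", b"AAAABBBBAAAA")
store.put("b", b"AAAABBBBCCCC")
```

In this sketch the two files total 24 bytes, but the store holds only the three unique 4-byte chunks, and `store.get("a")` still returns the original contents. A real system would also have to persist the index and decide how to handle chunk boundaries, neither of which is attempted here.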
This deduplication process reduces the amount of storage space needed to represent all of the indexed files in the system. For example, a file system that has 100 copies of the same document from HR in each employee's personal folder can be reduced to a single copy of the original file plus 99 tiny placeholders that point back to the original file. It's easy to see how that can vastly reduce storage requirements -- as well as why it makes much more sense to back up the deduped file system instead of the original file system.
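The HR document example maps onto file-level deduplication, which might look roughly like the following sketch. The function name `file_level_dedup`, the folder paths, and the use of a SHA-256 digest over the whole file are hypothetical choices for illustration only.

```python
import hashlib


def file_level_dedup(files):
    """Group identical files by content digest.

    files: dict mapping path -> file contents (bytes).
    Returns (unique, placeholders), where unique maps a digest to the path
    of the first instance and placeholders maps each duplicate path to the
    path of that first instance.
    """
    unique = {}
    placeholders = {}
    for path, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        if digest in unique:
            placeholders[path] = unique[digest]  # point back to the first copy
        else:
            unique[digest] = path  # first time this content is seen
    return unique, placeholders


# 100 employees, each with the same HR document in a personal folder
# (paths and contents are made up for the example).
files = {
    f"home/user{i:03d}/handbook.pdf": b"HR handbook contents"
    for i in range(100)
}
unique, placeholders = file_level_dedup(files)
```

Under these assumptions, `unique` ends up with a single entry and `placeholders` with 99, each pointing back to the copy in the first folder processed.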