The ins and outs of Windows-based deduplication

All those copies of data increase hardware, power, and IT costs. Data deduplication is the answer, and Windows shops may already have the core technology in place

"Storage is cheap!" you might say. Well, certainly disk storage has become cheaper over time, and there are all sorts of products on the market taking advantage of the price reduction with virtual tape libraries and megastorage appliances. But keep in mind the fact that more disk storage comes at a cost beyond the vendor price tag: power to support and cool, space to house, IT administrators to oversee. You have to factor in all these costs. Products that provide single-instance storage (SIS) and data deduplication can really help mitigate the expense of so-called cheap storage.

You may not realize that you have some of the necessary ingredients to take advantage of deduplication within your Microsoft server environment. But you do.

First, let's be clear on what deduplication is: Data deduplication is the process of eliminating data redundancies at the storage repository or from network traffic. You can deduplicate either at the object (file) level, which is also called "single instancing," or at the block (subfile) level, which saves much more space.
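To make the file-level versus block-level distinction concrete, here is a toy sketch (invented file names and an unrealistically small 4-byte block size, purely for illustration) showing why block-level deduplication saves more space than single instancing on the same data set:

```python
# Toy comparison of file-level single instancing vs. block-level
# deduplication. Real products fingerprint content with hashes (e.g.
# SHA-based); raw byte strings in sets are enough for this sketch.
files = {
    "report_v1.txt": b"AAAABBBBCCCC",
    "report_v2.txt": b"AAAABBBBDDDD",    # shares two blocks with v1
    "report_copy.txt": b"AAAABBBBCCCC",  # exact duplicate of v1
}
BLOCK = 4  # unrealistically small block size, for readability

# No deduplication: every copy is stored in full.
raw = sum(len(d) for d in files.values())            # 36 bytes

# File-level (single instancing): one copy per unique file content.
# The duplicate of report_v1 collapses, but v1 and v2 are "different
# files," so their shared blocks are stored twice.
file_level = sum(len(d) for d in set(files.values()))  # 24 bytes

# Block-level: one copy per unique block, across all files.
unique_blocks = set()
for data in files.values():
    for i in range(0, len(data), BLOCK):
        unique_blocks.add(data[i:i + BLOCK])
block_level = sum(len(b) for b in unique_blocks)     # 16 bytes

print(raw, file_level, block_level)  # 36 24 16
```

The exact duplicate is caught by both approaches, but only block-level deduplication also collapses the blocks that the two near-identical versions have in common.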

Data is naturally duplicated due to mass distribution or data processing needs. Most IT organizations maintain multiple copies of the same file in different repositories, along with several iterations of files in active use. In addition, backup applications produce and maintain multiple copies of files so that they are available for recovery. Backup processes have contributed greatly to the explosion of data in the datacenter.

Consider a simple scenario. An e-mail with a 10MB video is sent to 100 people. If the e-mail platform doesn't have SIS capabilities and the backup product doesn't have a deduplication feature, you are looking at backing up roughly 1GB of data (which takes space, time, and money) as opposed to a single instance of 10MB.
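The back-of-envelope math for that scenario is simple enough to spell out:

```python
# The e-mail scenario above: a 10MB attachment sent to 100 recipients,
# backed up with and without single instancing.
attachment_mb = 10
recipients = 100

without_sis = attachment_mb * recipients  # every mailbox copy is backed up
with_sis = attachment_mb                  # one stored copy plus pointers

print(f"Without SIS: {without_sis}MB (~1GB); with SIS: {with_sis}MB")
```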

But there are so many different ways to implement a deduplication product, and each vendor and product may take a different approach. Deduplication can take place in-line (meaning the data is deduplicated before being written to disk) or postprocess (meaning the data is analyzed after it has been stored to disk). It can be done at either the source or the target (the storage appliance or virtual tape library). It can be handled through the software (the OS) or the hardware. It may send your head spinning to think about the many options, but you may be happy to know that Microsoft uses some aspects of deduplication directly in some of its products.
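For a feel of what the in-line approach means in practice, here is a minimal sketch of a deduplicating write path (a hypothetical store, not any vendor's API): each incoming block is fingerprinted before it touches disk, and blocks already present are replaced by a reference. A postprocess product would instead write everything first and run a similar scan later.

```python
import hashlib

# Hypothetical in-line deduplicating store: fingerprint each block
# before writing; store only blocks not seen before.
class InlineDedupStore:
    def __init__(self):
        self.blocks = {}        # fingerprint -> block data
        self.bytes_written = 0  # physical bytes actually stored

    def write(self, block: bytes) -> str:
        fp = hashlib.sha256(block).hexdigest()
        if fp not in self.blocks:       # new data: store it
            self.blocks[fp] = block
            self.bytes_written += len(block)
        return fp                       # caller keeps this reference

store = InlineDedupStore()
refs = [store.write(b) for b in (b"alpha", b"beta", b"alpha")]
# 14 logical bytes written by callers, 9 physical bytes stored:
print(store.bytes_written)
```

The trade-off the vendors argue over is visible even here: the in-line path pays a hash lookup on every write, while postprocess defers that cost but needs full capacity up front.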

Microsoft Exchange has used SIS for years, using pointers to direct requests for a message to a single copy of the message. Microsoft introduced SIS at the file level in Windows Storage Server 2003 R2. At the block level, Microsoft delivered more space-efficient backup via Windows Home Server.

Windows Storage Server 2008 enhances the deduplication capabilities of its predecessor, using SIS-based data deduplication for the Windows File Services, which eliminates identical files on volumes. The duplicates are replaced by pointers that link to files placed in the SIS Common Store. Obviously, for this to work on the backup side, you need to have a SIS-aware backup product. And that's where Microsoft's System Center Data Protection Manager comes into play.
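The pointer-to-common-store idea can be sketched in a few lines. This is only an analogy using hard links in a temporary directory; actual Windows SIS works differently under the hood (reparse points maintained by a background "groveler" service), and the paths and file names here are invented:

```python
import hashlib
import os
import tempfile

# Analogy for SIS: identical files on a "volume" are replaced by links
# to a single copy kept in a common store.
root = tempfile.mkdtemp()
common = os.path.join(root, "SIS_Common_Store")
os.makedirs(common)

# Two identical files and one distinct file on the volume.
for name, data in [("a.doc", b"same"), ("b.doc", b"same"), ("c.doc", b"diff")]:
    with open(os.path.join(root, name), "wb") as f:
        f.write(data)

for name in ("a.doc", "b.doc", "c.doc"):
    path = os.path.join(root, name)
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    stored = os.path.join(common, digest)
    if not os.path.exists(stored):
        os.rename(path, stored)  # first copy moves into the store
    else:
        os.remove(path)          # duplicate content is dropped...
    os.link(stored, path)        # ...and replaced by a link (the "pointer")

# a.doc and b.doc now resolve to the same stored copy; the duplicate
# content exists on disk only once.
print(len(os.listdir(common)))  # 2 unique contents for 3 files
```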

Some people mistakenly believe that System Center Data Protection Manager (DPM) has deduplication capabilities and may feel they have no need for a hardware product to assist with deduplication. That's not true. DPM may use components that are dedupe-like (for example, block-level change tracking), and DPM certainly does an excellent job of fitting a large amount of data or a large number of recovery points into a small amount of storage, giving the impression that traditional SIS or deduplication must be involved. But DPM does not use the traditional compression, SIS, or deduplication features that you will find in a hardware storage platform. The best scenario is to use both DPM and a hardware deduplication product.

Given that several Microsoft server products have some form of SIS or deduplication, you may think you don't need to acquire a software- or hardware-based deduplication product. You might be right. But be sure to analyze your circumstances to see if you need to go beyond the deduplication capabilities that Microsoft offers. Think about whether you need an in-line or postprocess approach, a source- or target-based approach, a software- and/or hardware-based approach. Do the research, determine the pricing, consider the savings (including energy savings that some hardware vendors may offer through MAID [massive array of idle disks] products), and make your decisions.

I'm curious to hear from readers as to what form of deduplication product they have in place. Are you using software- or hardware-based products? What kind of data reduction or cost savings have you noticed (if any)? Or do you feel locked into a software-only product because the economy, combined with high prices attached to hardware deduplication, makes it impossible for you to do otherwise?