Although disk-to-disk backup methodologies have become incredibly popular over the past few years, the vast majority of enterprises -- large and small -- still use the same tape backups they implemented years ago. As time goes on, however, more and more old-school backup implementations will reach a breaking point where either capacity or performance can't get the job done.
When you realize that tape can't cut it any longer, you'll likely consider a disk-based backup appliance, which you can get from many vendors, such as EMC Data Domain, Exagrid, and Quantum. But when choosing the right appliance, be careful: Most buyers focus on finding the most efficient deduplication engine, but that's only one of the differences worth exploring.
The deduplication engine gets IT's attention because the whole point of implementing dedupe is to shrink the amount of storage needed to hold your backups -- both to save on physical storage costs and to gain longer on-disk retention. But capacity efficiency is a relatively small issue in practice. Most of the significant operational differences come down to when in the backup cycle deduplication takes place and how the appliance scales as your data grows.
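To see how dedupe translates into retention, consider a rough back-of-the-envelope calculation, sketched here in Python. The appliance size, dedupe ratio, and weekly backup volume are made-up numbers for illustration, not figures from any particular vendor:

```python
# Back-of-the-envelope retention math -- all figures are hypothetical.
raw_capacity_tb = 32          # usable disk in the appliance
dedupe_ratio = 10             # 10:1 is a commonly cited ballpark; your data will vary
weekly_full_backup_tb = 6     # size of one week's full backup before dedupe

logical_capacity_tb = raw_capacity_tb * dedupe_ratio
weeks_of_retention = logical_capacity_tb / weekly_full_backup_tb

print(f"Effective capacity: {logical_capacity_tb} TB of logical backups")
print(f"On-disk retention:  about {weeks_of_retention:.0f} weeks "
      f"vs. {raw_capacity_tb / weekly_full_backup_tb:.0f} weeks without dedupe")
```

A better dedupe ratio stretches those numbers further, but as the rest of this article argues, it's far from the only factor that matters day to day.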
Inline vs. postprocess deduplication
When you get down to it, there are two predominant dedupe methods in use today: inline deduplication and postprocess deduplication (also known as dedupe at rest). Each approach has significant strengths and weaknesses.
In inline deduplication, data is deduplicated as it is backed up to the appliance. This approach results in the smallest amount of back-end storage usage, because only the deduplicated data stream is written to the appliance's disk. However, it limits the speed at which data can be saved, because the deduplication processing typically requires very large amounts of processor and memory capacity. That same processing has to happen in reverse when you restore the data: Instead of simply reading the data from disk, the appliance must also "rehydrate" the data into its original un-deduplicated state. That's also processor-intensive.
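As a rough illustration of what the appliance does on every write, here's a minimal Python sketch of inline deduplication. It assumes naive fixed-size chunking and SHA-256 fingerprints, with in-memory dictionaries standing in for the appliance's chunk store and index -- real products use far more sophisticated variable-size chunking and indexing:

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunks for simplicity; appliances typically chunk more cleverly

def inline_backup(stream, chunk_store, index):
    """Deduplicate as data arrives: only previously unseen chunks are written to disk."""
    recipe = []  # ordered fingerprints describing how to rebuild the original stream
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in index:          # per-chunk hashing and lookup -- the CPU/RAM cost
            chunk_store[fingerprint] = chunk  # write only the unique chunk
            index.add(fingerprint)
        recipe.append(fingerprint)
    return recipe

def restore(recipe, chunk_store):
    """'Rehydrate' the backup by reassembling chunks in their original order."""
    return b"".join(chunk_store[fp] for fp in recipe)
```

Run the same data through inline_backup twice and the second pass writes almost nothing new -- that's where the capacity savings come from, and why every incoming chunk pays the hashing tax during the backup window.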
In postprocess deduplication, data is written to the appliance as fast as the network and disks allow. Only once the data has been transferred to backup storage does the appliance perform the deduplication. That task is still processor-intensive, but because it doesn't have to keep pace with the incoming backup stream, as inline deduplication does, backups complete faster overall. And there's usually no rehydration penalty during restores, because most postprocess appliances keep a non-deduplicated copy of the data.
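By contrast, a postprocess appliance splits the work into two phases: a fast, unprocessed write during the backup window and a separate deduplication pass afterward. Here's the same hypothetical sketch adapted to that model (again with simplified, made-up chunking; the function names are illustrative, not any vendor's API):

```python
import hashlib

CHUNK_SIZE = 4096

def postprocess_backup(stream, landing_area, backup_id):
    """Ingest at full speed: write the raw stream straight to the landing area."""
    landing_area[backup_id] = stream.read()   # no per-chunk hashing inside the backup window

def deduplicate_later(landing_area, backup_id, chunk_store, index):
    """Run after the backup finishes, e.g. from a scheduled job on the appliance."""
    data = landing_area[backup_id]
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in index:
            chunk_store[fingerprint] = chunk
            index.add(fingerprint)
        recipe.append(fingerprint)
    # Keeping landing_area[backup_id] around gives fast, rehydration-free restores;
    # deleting it reclaims the extra capacity the postprocess approach consumes.
    return recipe
```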
Of course, the postprocess approach requires significantly more storage capacity on the backup end -- nearly double, in some cases. And there's a delay in your ability to replicate the backup to a second, mirrored appliance: You have to wait until deduplication is complete, which creates a window of vulnerability and may undermine a stringent multisite backup availability SLA.
Scale-up vs. scale-out
As anyone managing storage resources today knows, data is growing at an almost unbelievable clip. Although it may sound far-fetched, the oft-quoted IDC stat that corporate data doubles every 18 months hasn't been far off the mark in my own experience. No wonder so many organizations are outgrowing their tape backup systems.