Although disk-to-disk backup methodologies have become incredibly popular over the past few years, the vast majority of enterprises -- large and small -- still use the same tape backups they implemented years ago. As time goes on, however, more and more old-school backup implementations will reach a breaking point where either capacity or performance can't get the job done.
When you realize that tape can't cut it any longer, you'll likely consider using a disk-based backup appliance, which you can get from many vendors, such as EMC Data Domain, ExaGrid, and Quantum. But when choosing the right appliance, be careful: Most buyers focus on finding the most efficient deduplication engine, but that's only one difference to explore.
The deduplication engine gets IT's attention because the whole point of implementing dedupe is to shrink the amount of storage you need to hold your backups -- both to save on physical storage costs and to gain longer on-disk retention times. But capacity efficiency is a relatively small issue in practice. Most of the significant operational differences come down to when in the backup cycle deduplication takes place and how scalability is achieved.
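To see why chasing the best dedupe ratio yields diminishing returns, consider a back-of-the-envelope calculation. The backup sizes, retention window, and dedupe ratios below are hypothetical, not vendor benchmarks:

```python
# Rough sketch: on-disk footprint of a retention window at different
# dedupe ratios. All figures are hypothetical, for illustration only.
nightly_backup_tb = 10   # size of one full nightly backup, in TB
retention_days = 30      # desired on-disk retention window

raw_tb = nightly_backup_tb * retention_days  # 300 TB without dedupe
for dedupe_ratio in (10, 15, 20):
    stored_tb = raw_tb / dedupe_ratio
    print(f"{dedupe_ratio}:1 dedupe -> {stored_tb:.0f} TB on disk "
          f"(vs. {raw_tb} TB raw)")
```

Note that the jump from 10:1 to 15:1 saves 10 TB in this scenario, while the jump from 15:1 to 20:1 saves only 5 TB more -- which is why the other operational differences tend to matter more than a few points of dedupe efficiency.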
Scale-up vs. scale-out
The scale of the data being backed up is a key factor as well. And for disk-to-disk backup appliances, there are two primary approaches to the scalability issue: scale-up and scale-out.
Users of traditional SAN implementations will be familiar with the scale-up approach, which typically pairs static controller/compute resources with a variable amount of attached storage. In these deployments, you can introduce additional capacity relatively cheaply and easily, both to lengthen retention times and to store your growing data pools.
However, you have to carefully consider the sizing of your controller resources at the outset. As with scale-up SAN implementations, you must estimate up front both the overall capacity and the performance requirements for the end of the device's expected lifetime -- which is often difficult to do accurately in today's quickly changing IT landscape. Underestimating can force large, unexpected capital investments to upgrade the controllers earlier than planned; overestimating -- arguably worse -- means overbuying at the outset and retiring the device before its full performance potential is ever exercised.
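The danger in that up-front estimate is that data growth compounds, so a modest error in the assumed growth rate balloons over a device's lifetime. A minimal sketch, using made-up starting capacity and growth figures:

```python
# Hypothetical sketch of the scale-up sizing problem: projected capacity
# need after several years of compounding data growth. The starting
# capacity and growth rates below are illustrative assumptions.
def capacity_needed(initial_tb, annual_growth, years):
    """Capacity required after `years` of compounding annual growth."""
    return initial_tb * (1 + annual_growth) ** years

initial_tb = 50
lifetime_years = 5
for growth in (0.20, 0.35, 0.50):  # assumed vs. possible actual rates
    needed = capacity_needed(initial_tb, growth, lifetime_years)
    print(f"{growth:.0%} yearly growth -> {needed:.0f} TB in year {lifetime_years}")
```

In this scenario, sizing for 20 percent annual growth when reality turns out closer to 50 percent leaves you needing roughly three times the capacity you planned for -- exactly the kind of miss that forces an early controller upgrade.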
The scale-out approach avoids some of these pitfalls, but isn't without its own problems. In scale-out implementations, controller resources are generally paired with fixed storage resources, and you scale by changing the number of devices in a group as performance and storage requirements change. This handily avoids the need to perform accurate long-term planning, since each year's backup storage investments can instead be guided by short-term requirements. It also largely avoids the risk of substantial overbuying or underbuying.
However, the fixed relationship between controller and storage resources can present a problem when you require more of one than the other. For example, you may want to provide extremely long retention for a relatively small amount of quickly changing data. Doing that with a scale-out platform might require purchasing a large amount of controller resources just to get the required storage density -- which would be much easier and cheaper to accomplish with a scale-up system. Some scale-out platforms also have scalability and management limitations that might make them inappropriate for very large enterprises dealing with truly enormous backup datasets.
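The coupling problem described above can be sketched with a simple node-count calculation: because each node bundles fixed capacity with fixed throughput, whichever dimension you exhaust first dictates how many nodes you buy. The per-node specs below are invented for illustration, not taken from any real product:

```python
# Hypothetical sketch of the controller/storage coupling in scale-out
# groups. Per-node capacity and throughput figures are made up.
import math

NODE_CAPACITY_TB = 30     # usable post-dedupe capacity per node
NODE_THROUGHPUT_TBH = 4   # backup ingest throughput per node, TB/hour

def nodes_required(capacity_tb, ingest_tbh):
    """Nodes needed: the binding constraint (capacity or throughput) wins."""
    by_capacity = math.ceil(capacity_tb / NODE_CAPACITY_TB)
    by_throughput = math.ceil(ingest_tbh / NODE_THROUGHPUT_TBH)
    return max(by_capacity, by_throughput)

# Long retention of a small, fast-changing dataset: capacity-bound.
print(nodes_required(capacity_tb=300, ingest_tbh=2))   # 10 nodes for only 2 TB/h of ingest
# Short retention of a large nightly backup: throughput-bound.
print(nodes_required(capacity_tb=60, ingest_tbh=12))   # 3 nodes, capacity mostly idle
```

In the first case you pay for ten controllers' worth of compute just to reach the storage density -- the situation where a scale-up system would be cheaper and simpler.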