No longer capable of remaining on the sidelines as a separate administrative domain, today's networked storage must be managed with a deeper awareness of business objectives.
But in an era of compliance, litigation, and highly interactive, data-dependent apps fine-tuned for maximum responsiveness, it takes more than a shift in philosophy to establish the kind of business-conscious storage environment that can deliver a true competitive advantage. It takes management tools born of the need to mitigate the downsides of the deluge of data today's enterprises face.
Enter data classification, CDP (continuous data protection), data deduplication, and tiered storage -- three recent advances and one revamped mainstay poised to hone your daily storage operations.
Seemingly unrelated, these four technologies share a common objective: alleviating the pain of enterprise data management.
Whether providing improved data protection, reducing required capacities, ensuring a more flexible infrastructure, or presenting deep insights into stored data content, they seek to better align the traditionally technical benchmarks of storage management -- capacity, performance, and so on -- with business-related metrics, such as relevance, integrity, and responsiveness. In so doing, they are fast becoming essential tools for enterprises looking to derive greater advantage from existing and future storage assets.
Data classification

Long the silent elephant in the room of storage management, data classification is finally receiving some much-deserved attention from storage vendors. Compliance and e-discovery may be among the central motivating factors for this trend, but enterprises are fast finding data-level awareness of stored content to be an essential component of any comprehensive storage management strategy.
The rise of networked storage as a separate administrative domain has resulted in numerous benefits, including consolidated management and improved scalability. Yet this strategy has led enterprises to manage their storage containers without much understanding of the data content housed therein.
As a consequence, looking at data from the storage side rather than from the application front end is a lot like entering into a gigantic warehouse full of mysterious, cursorily labeled boxes. And when it comes to protecting data off premises or responding to requests from a judge or challenger in court, not to mention surfacing what your enterprise already knows, having precious information buried deep in storage silos can prove detrimental to your bottom line.
Making sense of what is stored in those mysterious boxes is the primary objective of data classification.
Chief among the benefits of data classification is the ability to allocate data to the appropriate storage tier. Compellent's Data Progression, for example, automatically classifies blocks of data according to criteria such as age and frequency of access, then pushes them to tiers accordingly. Data Progression has the unique capability of decoupling blocks from their file wrapper, but on any other storage system, administrators can combine analysis of standard file metadata -- name, file type, date created, and so on -- with simple classification criteria to identify files that need to move elsewhere.
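A metadata-driven tiering pass of this kind is straightforward to sketch. The following is a minimal illustration, not any vendor's implementation: the tier names and the 30- and 90-day thresholds are arbitrary assumptions, and last-access time stands in for "frequency of access."

```python
import time
from pathlib import Path

def classify_by_age(root, now=None):
    """Assign files under root to hypothetical tiers by last-access age.

    Files untouched for 90+ days go to 'archive', 30-89 days to
    'nearline', everything else stays on 'primary'. Thresholds are
    illustrative only.
    """
    now = now if now is not None else time.time()
    plan = {"primary": [], "nearline": [], "archive": []}
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        age_days = (now - path.stat().st_atime) / 86400
        if age_days >= 90:
            plan["archive"].append(path)
        elif age_days >= 30:
            plan["nearline"].append(path)
        else:
            plan["primary"].append(path)
    return plan
```

A real data mover would then migrate the "archive" list to cheaper media and leave stubs or pointers behind.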
Though relatively easy to implement, that kind of functionality proves inadequate for more ambitious classification exercises. To comply with regulations such as HIPAA, to respond to FRCP (Federal Rules of Civil Procedure) e-discovery requests, or to assess risks of disclosure, companies need more comprehensive data classification tools capable of finding files that contain sensitive information such as Social Security numbers, credit card numbers, or other private personal or corporate data.
Data classification solutions of this caliber provide the applications and structure to search for those needles in companies' archive haystacks, scanning for relevant patterns and creating rules to automatically assign data to the proper containers. Implementing such tools is often a recursive exercise in which the human element must complement the results of the search and classification engines.
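The pattern-scanning core of such a tool can be sketched in a few lines. This is a deliberately naive illustration: the regular expressions below are assumptions, and a production classifier would add checksum validation (such as the Luhn test for card numbers) and contextual rules to reduce false positives.

```python
import re
from pathlib import Path

# Illustrative patterns only -- real tools use far more robust detection.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){15}\d\b"),
}

def scan_file(path):
    """Return the set of sensitive-data categories found in one file."""
    text = Path(path).read_text(errors="ignore")
    return {name for name, rx in PATTERNS.items() if rx.search(text)}

def classify_tree(root):
    """Map each category to the files that triggered it."""
    hits = {}
    for path in Path(root).rglob("*.txt"):
        for category in scan_file(path):
            hits.setdefault(category, []).append(str(path))
    return hits
```

The classification engine would feed these hits into rules -- for example, "anything matching `ssn` moves to the encrypted, retention-locked container" -- with humans reviewing the borderline results.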
Infoscape -- EMC's ambitious and still evolving data classification project -- is the cornerstone of the company's ILM (information lifecycle management) strategy. Using templates, Infoscape users can quickly identify the steps and rules needed for each classification task.
Templates, however, can help only to a point, and EMC is finding that customers may have to manage documents outside of Infoscape. "[In Infoscape], we have implemented a copy to Documentum feature," says Sheila Childs, director of marketing at EMC.
Kazeon Information Server is another comprehensive data classification solution. Michael Marchi, vice president of solutions marketing at Kazeon, contends that e-discovery, compliance, and security are driving enterprises to incorporate integrated data classification solutions into their overall storage management strategies.
First launched in 2005, Kazeon's Information Server houses content-aware indexing, data classification, search, reporting, and migration in a single appliance in an effort to meet those needs. Information Server is also offered by NetApp to manage, for example, the retention dates of files created by NetApp's data protection offerings.
Index Engines, as its name suggests, leverages indexing as a means for creating metadata that makes corporate data easily searchable. The added twist this vendor offers, however, is the ability to create online metadata from files on tape reels, a lifesaver for companies housing a multitude of media in their vaults.
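At the heart of any such product sits an inverted index: a map from terms to the documents containing them. The sketch below is a generic illustration of the concept, not Index Engines' technology; real indexers add metadata extraction, stemming, and ranking.

```python
import re
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns an inverted index {term: set(doc_ids)}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return index

def search(index, query):
    """AND-search: return the documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results
```

Once built, the index lets a legal team answer "which files mention this customer?" in seconds, even when the underlying media is offline tape.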
Despite the advances of such offerings, it would be disingenuous to paint data classification as a mature technology. That said, the technology is evolving and may in fact be the most effective means currently available for maintaining compliance, protecting sensitive data, and ensuring adequate responsiveness in the event of litigation. No other technology comes close to supplying an answer for those needs.
Continuous data protection
Once an acronym becomes popular, altering it -- even to better reflect the underlying technology -- is difficult business. CDP, the natural evolution of conventional backups, would probably benefit from a name change to something along the lines of "data recovery preparedness." After all, the objective of adopting a CDP solution is to ensure that your enterprise -- or selected business processes within -- can survive a disruption without data loss.
As many enterprises have been made painfully aware, the traditional backup paradigm has never provided an impenetrable data protection shield. In essence, conventional backup applications take a picture of selected databases or files at recurring intervals, typically each day at business closing time.
The approach, however, has severe limitations, most notably the long intervals between protective copies, which, in the event of a disruption, translate to lost data. In today's world of highly interactive applications, a prolonged window of potential data loss is fast becoming unacceptable, which is probably what prompted GlassHouse Technologies CTO James Damoulakis to title a recent BusinessWeek white paper "Best Practices: Are Backups a Waste of Time?"
Backups may not always be a waste of time, but the fact that just about every backup-software vendor has added CDP to its portfolio is probably the most unbiased acknowledgement of the importance of CDP -- and the limitations of traditional backup wares. CDP moves beyond backup's limitations by providing virtually infinite recovery points, an enormous improvement that leaves very little or no data at risk.
In the main, CDP solutions take one of two approaches: Either they use a host agent to intercept and replicate every write to disk, or they schedule frequent snapshots to create numerous volume images from which to restart in case of damage.
Less granular though easier to implement, the snapshot approach is worthwhile -- and perhaps less burdensome -- when full recoverability is not needed. For example, scheduling snapshots every 30 minutes can adequately protect an accounting system. In the event of a disruption, users can easily re-enter the last half hour of transactions after the proper files are restored from the latest volume image.
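The snapshot mechanics reduce to keeping timestamped point-in-time copies and restoring the newest one taken before the failure. The toy model below makes the trade-off concrete; it is an illustration only, using a plain dictionary where a real system would take copy-on-write volume snapshots.

```python
import time

class SnapshotStore:
    """Near-CDP sketch: periodic point-in-time copies of a data set."""

    def __init__(self):
        self.snapshots = []  # list of (timestamp, copy of data)

    def take_snapshot(self, data, ts=None):
        # A real array snapshots a volume copy-on-write; we deep-copy a dict.
        self.snapshots.append((ts if ts is not None else time.time(), dict(data)))

    def restore_before(self, ts):
        """Return the newest snapshot at or before ts.

        Anything written after that snapshot is lost -- the granularity
        cost of the snapshot approach.
        """
        candidates = [(t, d) for t, d in self.snapshots if t <= ts]
        if not candidates:
            raise LookupError("no snapshot available before that time")
        return max(candidates, key=lambda pair: pair[0])[1]
```

With 30-minute scheduling, a failure at 2:29 p.m. restores the 2:00 p.m. copy, and users re-enter the intervening transactions.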
With more storage systems offering snapshot capabilities, this quasi- or near-CDP snapshot approach is certain to become more popular. Another notable advantage is that the snapshots can be the source of traditional backup operations so that tape copies for vaulting or data exchanges with other parties can be created offline.
However tempting near-CDP may be, it is no substitute for the no-bits-left-behind approach of a true, host-agent CDP solution. Recovery is not as easy with the host-agent approach because it requires applying data changes against a known good copy of the affected file or database. But true CDP makes it possible to bring data back at the very instant preceding the damage, which is quite a departure from the a priori, fingers-crossed decision one has to make when scheduling near-CDP and backups.
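Conceptually, true CDP is a timestamped write journal replayed against a known-good base copy. The sketch below models that recovery flow with key/value writes standing in for block-level I/O; it is an illustration of the principle, not any vendor's agent.

```python
class CdpJournal:
    """True-CDP sketch: journal every write, replay to any point in time."""

    def __init__(self, base):
        self.base = dict(base)  # known good copy of the data
        self.log = []           # journaled writes: (timestamp, key, value)

    def write(self, ts, key, value):
        # In a real product, a host agent intercepts every write to disk.
        self.log.append((ts, key, value))

    def recover(self, point_in_time):
        """Rebuild state as of point_in_time by replaying the journal."""
        state = dict(self.base)
        for ts, key, value in sorted(self.log):
            if ts > point_in_time:
                break  # stop just before the damaging write
            state[key] = value
        return state
```

This is the "rewind" property: if a corrupting write lands at time 2, recovering to time 1 brings data back to the very instant preceding the damage.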
The kind of flexibility true CDP offers brings the VCR -- or TiVo -- rewind capability to mind, making it the ideal data protection safeguard for applications without a safety net, such as e-mail, database updates, word processors, and CAD/CAM. Not surprisingly, a number of lightweight CDP applications on the market protect user files, including what's stored on desktops and laptops. But the heavy lifting of data recovery for many companies revolves around e-mail and database-centered applications.
When considering a CDP solution, however, keep in mind that CDP alone does not provide application recovery. For that, you will need additional software. That said, CDP is an important first step toward ensuring that your data is safe -- and that the lifeblood of your business can easily be restored should something damaging transpire.
Data deduplication

Few technologies can claim the quick road to success that data deduplication can. Just four years ago, the technology was championed by only a few pioneers and largely ignored by major storage vendors. Today it is difficult to single out a vendor that doesn't have a data deduplication slide in its marketing materials.
In hindsight, the quick success of data deduplication is easy to explain: It's the most effective strategy for offsetting a significant portion of the data currently deluging companies. And with some enterprises doubling the amount of data they must manage every year, it's not surprising to see how data deduplication's promise to shrink data capacities by a factor of 20 to 1 would appeal to most.
To achieve that level of capacity reduction, data deduplication technologies use algorithms that essentially replace identical globs of data with pointers to a single instance. Implementations differ in how they apply those algorithms; for example, Sepaton pursues file-based byte-level comparisons, whereas Data Domain looks for equal fragments within files.
Moreover, the size of the fragments replaced can be either fixed or variable, another key differentiator among data deduplication solutions. Avamar, for example, uses a variable-size segment to identify duplicates. According to the vendor, which was recently acquired by EMC, the approach remains effective even when minor changes, such as inserting a single line in a document, could defeat comparisons between fixed-length segments. Despite such comparative claims, however, vendors' declared average deduplication ratios differ very little.
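The pointer-replacement idea is easy to see with fixed-size segments. The sketch below splits data into 4KB chunks, stores each unique chunk once under its SHA-256 digest, and keeps per-object pointer lists; the chunk size and structure are illustrative assumptions, and a variable-size (content-defined) scheme would resist the insert-one-line problem described above.

```python
import hashlib

CHUNK = 4096  # fixed segment size -- illustrative only

class DedupStore:
    """Fixed-size-segment dedup sketch: one copy per unique chunk."""

    def __init__(self):
        self.chunks = {}   # digest -> bytes (single stored instance)
        self.objects = {}  # object name -> list of digests (pointers)

    def put(self, name, data):
        refs = []
        for i in range(0, len(data), CHUNK):
            piece = data[i:i + CHUNK]
            digest = hashlib.sha256(piece).hexdigest()
            self.chunks.setdefault(digest, piece)  # store once, reuse after
            refs.append(digest)
        self.objects[name] = refs

    def get(self, name):
        """Reassemble an object by following its pointers."""
        return b"".join(self.chunks[d] for d in self.objects[name])

    def dedup_ratio(self):
        """Logical bytes referenced divided by physical bytes stored."""
        logical = sum(len(self.chunks[d])
                      for refs in self.objects.values() for d in refs)
        physical = sum(len(p) for p in self.chunks.values())
        return logical / physical if physical else 1.0
```

Storing the same nightly backup twice through such a store consumes the physical space of one copy, which is why repetitive backup streams yield the highest ratios.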
Other differences in how vendors implement the technology can have a more significant impact on the effectiveness of data deduplication in your daily operations. Adding traditional, hardware-implemented compression, for example, can further reduce data capacities by 50 percent -- a nontrivial gain that essentially doubles your dedup ratio.
Putting the dedup magic wand to work in line with your backups may seem the smart thing to do, but only if the added overhead doesn't stretch your backup windows into business hours. Because of this, some companies may benefit from a more prudent offline, out-of-band approach to deduplication.
Nevertheless, because of the unprecedented data reduction ratios offered by data deduplication, it has become an indispensable addition to VTLs (virtual tape libraries). In fact, by storing more data per gigabyte, data deduplication narrows the cost gap between tape reel and SATA storage, which makes it economically viable to keep all but the oldest data online.
In addition to enabling quick restores, keeping more data online makes it possible to conduct extensive analysis on historical data, which becomes impractical if data is scattered over hundreds of tape media.
For companies supporting numerous remote offices, which are typically staffed with business-minded rather than technically proficient personnel, deduplication can help consolidate backups. Just install compatible VTL appliances at each location and replicate over the WAN only the data deltas, or just a pointer for a duplicate segment.
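The WAN savings come from exactly the pointer trick described above: the branch sends chunk digests, and the central site pulls only the chunks it has never seen. The function below is a simplified, single-process illustration of that exchange, with a dictionary standing in for the central VTL's chunk store.

```python
import hashlib

def replicate(data, central_chunks, size=4096):
    """Ship only new chunks over the (simulated) WAN.

    central_chunks: digest -> bytes already held at the central site.
    Returns the number of payload bytes actually transferred; duplicate
    segments cross the wire as digests (pointers) only.
    """
    shipped = 0
    for i in range(0, len(data), size):
        piece = data[i:i + size]
        digest = hashlib.sha256(piece).hexdigest()
        if digest not in central_chunks:
            central_chunks[digest] = piece  # full chunk over the wire
            shipped += len(piece)
        # otherwise only the digest is sent -- a few dozen bytes
    return shipped
```

After the first night's seed transfer, a remote office whose data changes little sends mostly pointers, which is what makes WAN-based backup consolidation practical over modest links.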