No longer capable of remaining on the sidelines as a separate administrative domain, today's networked storage must be managed with a deeper awareness of business objectives.
But in an era of compliance, litigation, and highly interactive, data-dependent apps fine-tuned for maximum responsiveness, it takes more than a shift in philosophy to establish the kind of business-conscious storage environment that can deliver a true competitive advantage. It takes management tools born of the need to mitigate the downsides of the deluge of data today's enterprises face.
Enter data classification, CDP (continuous data protection), data deduplication, and tiered storage -- three recent advances and one revamped mainstay poised to hone your daily storage operations.
Seemingly unrelated, these four technologies share a common objective: alleviating the pain of enterprise data management.
Whether providing improved data protection, reducing required capacities, ensuring a more flexible infrastructure, or presenting deep insights into stored data content, they seek to better align the traditionally technical benchmarks of storage management -- capacity, performance, and so on -- with business-related metrics, such as relevance, integrity, and responsiveness. In so doing, they are fast becoming essential tools for enterprises looking to derive greater advantage from existing and future storage assets.
Data classification

Long the proverbial elephant in the room of storage management, data classification is finally receiving some much-deserved attention from storage vendors. Compliance and e-discovery may be among the central motivating factors for this trend, but enterprises are fast finding data-level awareness of stored content to be an essential component of any comprehensive storage management strategy.
The rise of networked storage as a separate administrative domain has resulted in numerous benefits, including consolidated management and improved scalability. Yet this strategy has led enterprises to manage their storage containers without much understanding of the data content housed therein.
As a consequence, looking at data from the storage side rather than from the application front end is a lot like walking into a gigantic warehouse full of mysterious, cursorily labeled boxes. And when it comes to protecting data off premises or responding to requests from a judge or opposing party in court, not to mention surfacing what your enterprise already knows, having precious information buried deep in storage silos can prove detrimental to your bottom line.
Making sense of what is stored in those mysterious boxes is the primary objective of data classification.
Chief among the benefits of data classification is the ability to allocate data to the appropriate storage tier. Compellent's Data Progression, for example, automatically classifies blocks of data according to criteria such as age and frequency of access, then pushes them to tiers accordingly. Data Progression has the unique capability of decoupling blocks from their file wrapper, but on any other storage system, administrators can combine analysis of standard file metadata -- name, file type, date created, and so on -- with simple classification criteria to identify files that need to move elsewhere.
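That kind of metadata-driven classification can be sketched in a few lines. The tier labels and the 90-day threshold below are purely illustrative policy choices, not taken from any vendor's product:

```python
import os
import time

# Hypothetical policy: files untouched for more than 90 days are
# candidates for a lower-cost tier; everything else stays on fast disk.
TIER_THRESHOLD_DAYS = 90

def classify_by_age(path, now=None):
    """Return an illustrative tier label based on the file's last-access time."""
    now = now if now is not None else time.time()
    age_days = (now - os.stat(path).st_atime) / 86400
    return "tier2-sata" if age_days > TIER_THRESHOLD_DAYS else "tier1-fc"
```

In practice such a scan would feed a migration engine rather than just print labels, but the decision logic -- metadata in, tier assignment out -- is the same.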
Relatively easy to implement, that kind of functionality proves inadequate for more ambitious classification exercises. To comply with regulations such as HIPAA, to respond to FRCP (Federal Rules of Civil Procedure) e-discovery requests, or to assess risks of disclosure, companies need more comprehensive data classification tools capable of finding files that contain sensitive information such as Social Security numbers, credit card numbers, or other private personal or corporate data.
Data classification solutions of this caliber provide the applications and structure to search for those needles in companies' archive haystacks, scanning for relevant patterns and creating rules to automatically assign data to the proper containers. Implementing such tools is often a recursive exercise in which the human element must complement the results of the search and classification engines.
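The pattern-scanning core of such engines boils down to running detectors over file content. The two regular expressions below are deliberately simplistic illustrations -- real classification products use validated detectors (Luhn checks for card numbers, contextual rules, and so on), not bare regexes:

```python
import re

# Illustrative patterns only -- not production-grade detectors.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),          # 123-45-6789
    "credit_card": re.compile(r"\b(?:\d[ -]?){15}\d\b"),  # 16 digits, optional separators
}

def classify_text(text):
    """Return the set of sensitive-data categories found in a string."""
    return {name for name, rx in PATTERNS.items() if rx.search(text)}
```

The human element the article mentions enters exactly here: false positives from patterns like these are why classification is a recursive exercise rather than a one-shot scan.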
Infoscape -- EMC's ambitious and still evolving data classification project -- is the cornerstone of the company's ILM (information lifecycle management) strategy. Using templates, Infoscape users can quickly identify the steps and rules needed for each classification task.
Templates, however, can help only to a point, and EMC is finding that customers may have to manage documents outside of Infoscape. "[In Infoscape], we have implemented a copy to Documentum feature," says Sheila Childs, director of marketing at EMC.
Kazeon Information Server is another comprehensive data classification solution. Michael Marchi, vice president of solutions marketing at Kazeon, contends that e-discovery, compliance, and security are driving enterprises to incorporate integrated data classification solutions into their overall storage management strategies.
First launched in 2005, Kazeon's Information Server houses content-aware indexing, data classification, search, reporting, and migration in a single appliance in an effort to meet those needs. Information Server is also offered by NetApp to manage, for example, the retention dates of files created by NetApp's data protection offerings.
Index Engines, as its name suggests, leverages indexing as a means for creating metadata that makes corporate data easily searchable. The added twist this vendor offers, however, is the ability to create online metadata from files on tape reels, a lifesaver for companies housing a multitude of media in their vaults.
Despite the advances of such offerings, it would be disingenuous to paint data classification as a mature technology. That said, the technology is evolving and may in fact be the most effective means currently available for maintaining compliance, protecting sensitive data, and ensuring adequate responsiveness in the event of litigation. No other technology comes close to supplying an answer for those needs.
Continuous data protection
Once an acronym becomes popular, altering it -- even to better reflect the underlying technology -- is difficult business. CDP, the natural evolution of conventional backups, would probably benefit from a name change to something along the lines of "data recovery preparedness." After all, the objective of adopting a CDP solution is to ensure that your enterprise -- or selected business processes within -- can survive a disruption without data loss.
As many enterprises have been made painfully aware, the traditional backup paradigm has never provided an impenetrable data protection shield. In essence, conventional backup applications take a picture of selected databases or files at recurring intervals, typically each day at business closing time.
The approach, however, has severe limitations, most notably the long intervals between protective copies, which, in the event of a disruption, translate to lost data. In today's world of highly interactive applications, a prolonged window of potential data loss is increasingly unacceptable, which is probably what prompted GlassHouse Technologies CTO James Damoulakis to title a recent BusinessWeek white paper "Best Practices: Are Backups a Waste of Time?"
Backups may not always be a waste of time, but the fact that just about every backup-software vendor has added CDP to its portfolio is probably the most unbiased acknowledgement of the importance of CDP -- and the limitations of traditional backup wares. CDP moves beyond backup's limitations by providing virtually infinite recovery points, an enormous improvement that leaves very little or no data at risk.
In the main, CDP solutions take one of two approaches: Either they use a host agent to intercept and replicate every write to disk, or they schedule frequent snapshots to create numerous volume images from which to restart in case of damage.
Less granular though easier to implement, the snapshot approach is worthwhile -- and perhaps less burdensome -- when full recoverability is not needed. For example, scheduling snapshots every 30 minutes can adequately protect an accounting system. In the event of a disruption, users can easily re-enter the last half hour of transactions after the proper files are restored from the latest volume image.
With more storage systems offering snapshot capabilities, this quasi- or near-CDP snapshot approach is certain to become more popular. Another notable advantage is that the snapshots can be the source of traditional backup operations so that tape copies for vaulting or data exchanges with other parties can be created offline.
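The snapshot-style near-CDP loop described above can be modeled simply. This is a file-level stand-in, assuming timestamped directory copies; real systems snapshot at the volume or block level, and the function names and 48-snapshot retention default are hypothetical:

```python
import os
import shutil
import time

def take_snapshot(src_dir, snap_root):
    """Copy src_dir into a timestamp-named snapshot directory and return its path.

    A file-level stand-in for the block-level snapshots a real array takes."""
    dest = os.path.join(snap_root, f"{time.time_ns():020d}")  # zero-padded so names sort by age
    shutil.copytree(src_dir, dest)
    return dest

def prune_snapshots(snap_root, keep=48):
    """Delete the oldest snapshots, keeping only the most recent `keep`.

    48 half-hourly snapshots covers a full day of recovery points."""
    for old in sorted(os.listdir(snap_root))[:-keep]:
        shutil.rmtree(os.path.join(snap_root, old))
```

Schedule `take_snapshot` every 30 minutes and your worst case is the half hour of re-entered transactions described above.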
However tempting near-CDP may be, it is no substitute for the no-bits-left-behind approach of a true, host-agent CDP solution. Recovery is not as easy with the host-agent approach because it requires applying data changes against a known good copy of the affected file or database. But true CDP makes it possible to bring data back at the very instant preceding the damage, which is quite a departure from the a priori, fingers-crossed decision one has to make when scheduling near-CDP and backups.
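The bookkeeping behind that rewind-to-any-instant capability can be sketched as a write journal. This toy model assumes each intercepted write is captured as a (timestamp, offset, data) record; real host agents do this interception in the I/O path and replay against a known good baseline rather than an empty image:

```python
import time

class CdpJournal:
    """Toy byte-level CDP journal: record every write, then rebuild
    the file image as it stood at any chosen instant."""

    def __init__(self):
        self.writes = []  # (timestamp, offset, data) records

    def record(self, offset, data, ts=None):
        self.writes.append((ts if ts is not None else time.time(), offset, bytes(data)))

    def image_at(self, ts):
        """Replay all writes with timestamps <= ts against an empty image."""
        image = bytearray()
        for wts, offset, data in self.writes:
            if wts > ts:
                continue  # skip writes after the chosen recovery point
            if len(image) < offset + len(data):
                image.extend(b"\x00" * (offset + len(data) - len(image)))
            image[offset:offset + len(data)] = data
        return bytes(image)
```

Picking `ts` just before the corrupting write is the programmatic equivalent of the TiVo rewind the next paragraph describes.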
The kind of flexibility true CDP offers brings the VCR -- or TiVo -- rewind capability to mind, making it the ideal data protection safeguard for applications without a safety net, such as e-mail, database updates, word processors, and CAD/CAM. Not surprisingly, a number of lightweight CDP applications on the market protect user files, including what's stored on desktops and laptops. But the heavy lifting of data recovery for many companies revolves around e-mail and database-centered applications.
When considering a CDP solution, however, keep in mind that CDP alone does not provide application recovery. For that, you will need additional software. That said, CDP is an important first step toward ensuring that your data is safe -- and that the lifeblood of your business can easily be restored should something damaging transpire.
Data deduplication

Few technologies can claim the quick road to success that data deduplication can. Just four years ago, the technology was proposed by but a few pioneers, largely ignored by major storage vendors. Today it is difficult to single out a vendor that doesn't have a data deduplication slide in its marketing materials.
In hindsight, the quick success of data deduplication is easy to explain: It's the most effective strategy for offsetting a significant portion of the data currently deluging companies. And with some enterprises doubling the amount of data they must manage every year, it's not surprising to see how data deduplication's promise to shrink data capacities by a factor of 20 to 1 would appeal to most.
To achieve that level of capacity reduction, data deduplication technologies use algorithms that essentially replace identical globs of data with pointers to a single instance. Implementations differ in how they apply those algorithms; for example, Sepaton pursues file-based byte-level comparisons, whereas Data Domain looks for equal fragments within files.
Moreover, the size of the fragments replaced can be either fixed or variable, another key differentiator among data deduplication solutions. Avamar, for example, uses a variable-size segment to identify duplicates. According to the vendor, which was recently acquired by EMC, the approach remains effective even when minor changes, such as inserting a single line in a document, could defeat comparisons between fixed-length segments. Despite such comparative claims, however, vendors' declared average deduplication ratios differ very little.
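The fixed-size flavor of the technique can be shown in miniature. This sketch is not any vendor's actual algorithm; it simply replaces identical chunks with hash "pointers" into a single-instance store, which is the common idea behind all of these products:

```python
import hashlib

def dedup_store(blobs, chunk_size=4096):
    """Fixed-size-chunk dedup: each blob becomes a list of chunk digests
    (the 'pointers'); identical chunks are stored exactly once."""
    store, recipes = {}, []
    for blob in blobs:
        recipe = []
        for i in range(0, len(blob), chunk_size):
            chunk = blob[i:i + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)  # keep a single instance
            recipe.append(digest)
        recipes.append(recipe)
    return store, recipes

def rebuild(store, recipe):
    """Reassemble a blob from its recipe of chunk digests."""
    return b"".join(store[d] for d in recipe)
```

Note the weakness Avamar's variable-size approach targets: insert one byte at the front of a blob and every fixed-size chunk boundary shifts, so none of the digests match the store anymore. Content-defined chunking picks boundaries from the data itself, so an insertion perturbs only nearby chunks.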
Other differences in how vendors implement the technology can have a more significant impact on the effectiveness of data deduplication in your daily operations. Adding traditional, hardware-implemented compression, for example, can further reduce data capacities by 50 percent -- a nontrivial gain that essentially doubles your dedup ratio.
Putting the dedup magic wand to work in line with your backups may seem the smart thing to do, but only if the added overhead doesn't extend your backup windows into business hours. Because of this, some companies may benefit from a more prudent offline, out-of-band approach to deduplication.
Nevertheless, because of the unprecedented data reduction ratios offered by data deduplication, it has become an indispensable addition to VTLs (virtual tape libraries). In fact, by storing more data per gigabyte, data deduplication narrows the cost gap between tape and SATA storage, which makes it economically viable to keep all but the oldest data online.
In addition to enabling quick restores, keeping more data online makes it possible to conduct extensive analysis on historical data, which becomes impractical if data is scattered over hundreds of tape media.
For companies supporting numerous remote offices, which are typically staffed with business-minded rather than technically proficient personnel, deduplication can help consolidate backups. Just install compatible VTL appliances at each location and replicate over the WAN only the data deltas, or just a pointer for a duplicate segment.
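The delta-replication decision at each remote office reduces to a membership test against the central site's chunk index. A hedged sketch, assuming the target advertises the set of chunk digests it already holds (function names hypothetical):

```python
import hashlib

def plan_replication(local_chunks, remote_digests):
    """For each chunk, ship the bytes only if the target lacks it;
    otherwise send just the digest as a pointer."""
    plan = []
    for chunk in local_chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest in remote_digests:
            plan.append(("pointer", digest))  # 64 hex chars cross the WAN
        else:
            plan.append(("data", chunk))      # full chunk crosses the WAN
    return plan
```

Since nightly backups at branch offices are mostly unchanged from the night before, nearly every entry ends up a pointer, which is what makes WAN consolidation feasible.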
The general advantages of data deduplication are undeniable, as it is likely the most viable means for achieving significant savings on storage infrastructure and management. Yet choosing the best solution for your enterprise requires homework. More so than with the other technologies discussed here, data deduplication should be test-driven before purchase to assess its actual impact on your company's data assets.
However challenging choosing the optimal offering might prove to be, not choosing data deduplication will probably be the worst decision you can make, as its upside will give competitors who deploy it a measurable advantage over those that don't.
Tiered storage

Tiered storage has been essential to daily IT operations since the dawn of computing. Founded on the fact that not all storage media are created equal, the concept involves migrating data to the media that best satisfies business requirements and cost objectives.
The logic behind tiered storage hasn't changed much since the Paleolithic age of computing, when managing tiers was often as easy as loading a file of punch cards to disk, running a much faster batch processing of that data, and returning that precious online space to a common pool when the processing was complete.
But the number and variety of storage systems currently available, as well as the amount of information enterprises must now manage, have made tiered storage's inherent benefits -- cost savings and increased responsiveness to business requirements -- even more desirable and perhaps easier to attain.
For example, recent advances in drive technologies have produced SATA devices that favor capacity and offer a cost per gigabyte significantly lower than that of typical high-performance FC, SCSI, or SAS (serial attached SCSI) drives. That said, high-performance drives now offer a blend of capacity and performance, and whereas SATA devices lead with capacities of as much as 1TB and growing for a single unit, high-performance drives have extended their capacities into the range of hundreds of gigabytes.
Based on such advances, storage vendors now offer an unprecedented granularity of storage arrays that range from very dense solutions based on high-capacity SATA drives to spindle-rich systems that provide fast interactive access at appreciably higher acquisition and operating costs.
By grouping homogeneous storage media in tiers, companies can store data more efficiently -- for example, maintaining frequently accessed transactional records on the fastest devices and moving older or seldom accessed files to a less expensive tier. As such, tiered storage provides obvious financial benefits, reducing the average cost of data that is parked for longer periods of time and rarely referenced, if at all.
And when it comes to seldom accessed data, the lower acquisition price of SATA systems can be reason enough to move to a tiered storage architecture. According to a recent IDG study, the cost per gigabyte of "capacity optimized" systems is less than half that of "performance optimized" systems, a ratio that seems likely to extend into the future.
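The arithmetic behind that savings claim is worth making explicit. The figures below are hypothetical, chosen only to echo the roughly 2-to-1 ratio the IDG study reports, with most capacity parked on the cheaper tier:

```python
def blended_cost_per_gb(tiers):
    """tiers: list of (capacity_gb, cost_per_gb) pairs.

    Returns the capacity-weighted average cost per gigabyte."""
    total_gb = sum(cap for cap, _ in tiers)
    total_cost = sum(cap * cost for cap, cost in tiers)
    return total_cost / total_gb

# Illustrative numbers: 100GB of performance-optimized storage at $10/GB
# alongside 400GB of capacity-optimized SATA at $4/GB.
blended = blended_cost_per_gb([(100, 10.0), (400, 4.0)])
```

Here the blended figure lands at $5.20 per gigabyte, roughly half the cost of keeping everything on the performance tier, which is the financial case for tiering in a single number.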
Though the cost gap between systems can be even greater than those worldwide averages, acquisition savings are not the only benefits of tiered storage. Purchasing dense devices, for example, can avoid or delay capital expenditures to extend the datacenter.
Although difficult to put a dollar value on, isolating critical tier-1 data from the crowd of less sensitive data is the first step in establishing a more business-conscious storage environment -- likely the most desirable aspect of employing a tiered storage strategy in the enterprise.
In fact, some vendors are now offering "tier 0" devices to create a very fast, memory-based buffer between servers and conventional, disk-based storage. Not to be confused with traditional cache memory, which is either embedded within the application server or the storage device, these tier-0 devices are SSDs (solid state drives) that are fed with or deprived of data to improve the response time of the storage system.
Xiotech, for example, recently announced SSDs for its Magnitude 3D 3000 SAN systems. Gear6, a startup recently out of stealth mode, has customers tapping its CacheFX, a RAM-based NFS accelerator.
Obviously, such implementations target a different objective than traditional tiered storage does -- namely, creating a top performing layer of storage, rather than reducing cost. However, even if more expensive, tier-0 solutions respond to the same optimization criteria that suggest moving your data from enterprise storage to high-capacity SATA drives and eventually to tape. Managing those data allocations efficiently is the new challenge that storage admins face.