How to survive the data explosion

IT organizations everywhere are racing to deal with the onslaught of petabytes. Here's how to meet the challenge

IDC estimates that enterprise data doubles every 18 months. That's an astounding statistic, but somewhat difficult to wrap your head around. A simple analogy may help.

Let's say you're an avid movie buff, and when the American Film Institute's top 100 DVD collection came out in November 1998, you were one of the first to buy it. A collection of 100 DVDs is large enough to be impressive, but small enough to browse easily and find something you want to watch. Weighing in at around 28 pounds and taking up about four feet of space on your bookcase, it fits in even the most cramped NYC loft. Best of all, "Apocalypse Now" is only a quick 30-second visual search away from your DVD player.

Now, let's apply IDC's enterprise data growth stat to your primo collection. After doubling every 18 months, in November 2015, your collection would have grown to more than 200,000 DVDs weighing over 20 tons and taking up nearly two miles of shelf space. Unless you kept the DVDs scrupulously alphabetized, finding the one you want could take hours. Your collection would have grown so massive that it would be almost useless: a ball and chain dragging behind you until you gave up and got rid of most of it.
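For readers who want to check the arithmetic, here is a minimal Python sketch of the doubling calculation. It assumes smooth compound growth between November 1998 and November 2015, and it takes the starting figures (100 discs, 28 pounds, four feet of shelf) from the example above. The function names are illustrative only.

```python
# Project a quantity that doubles every 18 months (the IDC figure quoted above).
DOUBLING_PERIOD_MONTHS = 18


def months_between(start_year, start_month, end_year, end_month):
    """Whole months elapsed between two year/month pairs."""
    return (end_year - start_year) * 12 + (end_month - start_month)


def projected(initial, elapsed_months, doubling_period=DOUBLING_PERIOD_MONTHS):
    """Value of `initial` after compounding for `elapsed_months`."""
    return initial * 2 ** (elapsed_months / doubling_period)


elapsed = months_between(1998, 11, 2015, 11)    # 17 years = 204 months
doublings = elapsed / DOUBLING_PERIOD_MONTHS     # a little over 11 doublings

dvds = projected(100, elapsed)                   # started with 100 discs
shelf_miles = projected(4, elapsed) / 5280       # started with 4 feet of shelf
weight_tons = projected(28, elapsed) / 2000      # started at 28 pounds (short tons)

print(f"{doublings:.1f} doublings")
print(f"{dvds:,.0f} DVDs, {shelf_miles:.2f} miles of shelf, {weight_tons:.0f} tons")
```

A little over eleven doublings multiplies everything by roughly 2,500, which is how a four-foot shelf turns into nearly two miles.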

That's precisely what's happening with our data. Personal, corporate, governmental — it doesn't matter. We're keeping and maintaining way more data than we can possibly ever use. The fact that an 18GB disk available in 1998 is roughly the same size, weight, and cost as a 4TB disk today only obscures the problem and makes us lazy about policing our data growth.

We may be able to store all that data, but when we lose the ability to manage and exploit it effectively, its value decreases. As a result, many businesses are spending more and more time and capital to store data that's worth less and less to the business. Data growth is unavoidable. But it must be accompanied by data management policies that ensure the data created and retained is of real and lasting value.

Of course that's easier said than done. It involves careful thought and no small effort. Tossing a new shelf of disk into a storage array is much easier than developing measures that will curb growth. But we need to if we hope to avoid becoming slaves to our own creation — a fate that is far harder to recover from than to avoid.

Paths to data containment

You may not be able to stop data growth, but at least you can slow it down. Unfortunately, no one containment method fits all. Grappling with growing mountains of data requires a creative combination of several different approaches.

Nontechnical methodologies

The data we store didn't appear out of thin air; someone created it and decided it was worth retaining. Before you attempt a technological fix, you need to influence the behavior of those who create and maintain business data. Starting at the source is almost always a successful strategy.

Controls. The storage quota is one of the oldest means of controlling data growth. Quotas are often overlooked these days, because they're seen as an imposition on the business by an overreaching IT organization. It doesn't need to be that way.

Quotas accomplish one important goal: They force users and departments to ask for more storage when they've run into a predefined limit, rather than simply allowing them to fill all available space. IT may ultimately be forced to accede to requests for additional storage space, but the simple fact that a request was necessary gives IT a chance to ask for justification.
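To make the idea concrete, here is a minimal Python sketch of a quota check for a single directory tree. It is an illustration, not a replacement for the quota features built into file servers and storage arrays. The 80 percent warning threshold and the function names are assumptions made for the example.

```python
import os


def directory_usage_bytes(root):
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def quota_status(used_bytes, quota_bytes, warn_ratio=0.8):
    """Classify usage as 'ok', 'warn', or 'over' relative to a quota.

    warn_ratio is a hypothetical early-warning threshold, not a standard value.
    """
    if used_bytes >= quota_bytes:
        return "over"
    if used_bytes >= warn_ratio * quota_bytes:
        return "warn"
    return "ok"
```

The "over" result is the point at which a department has to come and ask for more space, which is exactly the conversation a quota is meant to trigger.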

Showback. No quota implementation is complete without reports that show how users are consuming data. But capacity reports that simply list megabytes consumed and/or available fail to tell a meaningful story. Instead, tie the storage capacity consumed to the expense of maintaining it in a “showback” report. Hard costs are a great way to buttress calls for restraint.

Common pitfalls in implementing showback include inadvertently omitting portions of the storage infrastructure and failing to educate users on all the components necessary to deliver enterprise-class storage. In the former case, you might neglect to factor in the cost of storage resources deployed for disaster recovery (such as replicated warm site SANs or backup hardware and software) or even the labor necessary to manage those resources. Be scrupulous. Capture it all.
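A showback report can start as something very simple. The Python sketch below spreads a fully loaded annual cost across departments in proportion to the capacity each one consumes. Every figure and department name in it is a made-up placeholder, and a flat per-gigabyte rate is only one possible allocation model.

```python
# All figures below are made-up placeholders, not real pricing.
ANNUAL_COSTS = {
    "primary_array": 120_000,   # production storage hardware and support
    "dr_replica": 90_000,       # replicated warm site storage
    "backup": 45_000,           # backup hardware and software
    "labor": 60_000,            # staff time spent managing all of the above
}

# Hypothetical capacity consumed per department, in gigabytes.
ALLOCATED_GB = {"finance": 4_000, "engineering": 20_000, "marketing": 6_000}


def showback(annual_costs, allocated_gb):
    """Return each department's share of the total annual storage cost."""
    total_cost = sum(annual_costs.values())
    total_gb = sum(allocated_gb.values())
    cost_per_gb = total_cost / total_gb
    return {dept: round(gb * cost_per_gb, 2) for dept, gb in allocated_gb.items()}


report = showback(ANNUAL_COSTS, ALLOCATED_GB)
for dept, cost in sorted(report.items()):
    print(f"{dept:<12} ${cost:>12,.2f} per year")
```

Leaving the disaster recovery, backup, or labor lines out of the cost table lowers every department's number, which is the omission the paragraph above warns about.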

A comprehensive picture of what data storage costs risks giving users and stakeholders a nasty case of sticker shock. Some may take this seriously and curb data growth. At a time when anyone can buy a 2TB hard drive at an office supply store for about $100, careful education is required to justify the expense of storing and protecting data within an enterprise storage infrastructure.

Chargeback. Simply showing the true cost of storing data may not be enough to change behavior. Passing on the costs to departments through a chargeback mechanism creates a sense of data ownership that would be impossible to gain through other means. Note that implementing chargeback is not something IT can do on its own. It requires substantial support from stakeholders, which may not exist in all business environments.

Social engineering. There are many other less obvious ways of influencing user behavior. When searching for incentives to conserve space, don't be afraid to think outside the box. Host a “Biggest Loser” competition, where departments that cut back the most win gift certificates, or plan an Earth Day-themed storage conservation event. Even if little storage is recovered through such efforts, making data management part of the corporate consciousness can be well worth the effort.

Technical methodologies

After you've done all you can to curb data at the source, focus on the storage infrastructure itself. All kinds of new technologies are emerging to ensure data is stored in the most efficient ways possible.

Data deduplication. Data duplication is a big source of inefficiency. Deduplication identifies and eliminates redundant data, whether whole files or the blocks within them, reducing the amount of disk necessary by anywhere from 10:1 to 50:1 and beyond, depending on the level of redundancy. This can be especially valuable when used with unstructured data such as departmental file shares, where users are prone to making multiple copies of the same files.
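The core idea is easy to demonstrate at the file level. The Python sketch below walks a directory tree and groups files whose contents hash to the same SHA-256 digest. It only reports duplicate candidates and removes nothing. Storage-layer deduplication products typically work on blocks or chunks rather than whole files, so treat this as an illustration of the principle rather than of how any particular product works.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Group files under root by content; return groups with 2+ members."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_digest[file_digest(path)].append(path)
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Running find_duplicates against a departmental share gives a quick, rough picture of how much redundancy is there before any deduplication feature is switched on.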

Originating with backup systems, deduplication technology has made inroads in primary storage. Many modern OSes, including Windows Server 2012, simply include it as a baked-in feature, so using it is a no-brainer.

Data structuring. One way to deal with data growth is to impose new structure. With email, for example, archiving software can strip attachments from email databases and store them outside of the email infrastructure using single-instance storage (effectively application-level deduplication).

In other instances, simply implementing information management solutions can have an enormous impact on the creation and retention of business data. Information management systems such as SharePoint can replace difficult-to-control file shares and make it possible to curb duplication. They also capture the metadata you need to determine when data can be archived or deleted.

Lifecycle management. One of the hardest things to do in any storage infrastructure is to manage the latter part of the data lifecycle — when it should be archived or deleted. Nobody likes deleting data; there's something about it that feels unnatural. After all, effort went into creating it and nobody wants to be forced to re-create it.

Creating a methodology users can employ to determine when to archive or jettison old data requires careful consideration. It may be tempting to archive or delete any data over a given age, but that's typically a poor yardstick. To make that determination, you need metadata that describes the data's purpose. That metadata generally won't exist unless you've structured your data.
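Here is a minimal Python sketch of what a metadata-driven decision might look like. The purposes, retention periods, and field names are all hypothetical. Real rules would come from legal, compliance, and the business owners of the data.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical retention rules keyed by the data's purpose (ages in days).
RETENTION = {
    "contract": {"archive_after": 365, "delete_after": None},  # never auto-delete
    "draft": {"archive_after": 90, "delete_after": 365},
}


@dataclass
class Record:
    name: str
    purpose: str          # the metadata that age alone cannot supply
    last_modified: date


def lifecycle_action(record, today, rules=RETENTION):
    """Return 'keep', 'archive', 'delete', or 'review' for a record."""
    rule = rules.get(record.purpose)
    if rule is None:
        return "review"   # no rule for this purpose: a person decides
    age_days = (today - record.last_modified).days
    if rule["delete_after"] is not None and age_days >= rule["delete_after"]:
        return "delete"
    if age_days >= rule["archive_after"]:
        return "archive"
    return "keep"
```

Two files of the same age can get different answers here, because the decision turns on what the data is for and not just on how old it is.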

Do or die

No matter what combination of approaches you use to curb data growth, simply ignoring the problem and continuing to shovel storage at it is not a sustainable option. Implement both technical and nontechnical approaches to educate users and increase storage efficiency. You can decrease the amount of storage you need to deploy and also maintain the organization's ability to effectively leverage its data. The longer you wait, the harder it will be, so don't waste time.

This article, "How to survive the data explosion," was originally published at InfoWorld.com.

Copyright © 2014 IDG Communications, Inc.