Are users hoarding data? Here's where to put it

If you're maintaining data not accessed since the Clinton Administration, there's a place for it -- and it should ease data migration in the future

Data migration is one of those aspects of life that make us shudder with dread. At some time or another, lots of data will need to be transported from one spot to another -- because an aging storage unit must be mothballed, because storage infrastructure must be rearchitected, or because some business event requires a change in location. Pain of some kind always results.

There are ways to ease the pain, but they usually demand a homogeneous storage infrastructure that permits data to be accessed during a manually triggered or automated reorganization. If you're transporting data between locations because a workgroup is moving or an office is closing or opening, you're mostly out of luck. In most cases where physical locations are changing, you'll need some extremely fast pipes or a Subaru full of storage arrays to make it happen -- typically, the latter.


It's times like these that we treasure tools like rsync, but though rsync ensures that we accurately replicate every bit of data from one place to another, it's still a slow, tedious process. In many cases, it could -- and should -- lead to a closer inspection of the data being maintained. If we're moving the working directories of a team of developers and realize that they're using roughly 80GB per person in their homedirs, perhaps it's time to have a chat about data maintenance and the importance of pruning your home directory once in a while.

We can't be expected to continuously maintain an ever-larger storage pool. Sure, we can buy more arrays and more backup infrastructure, and we can upgrade the network to handle larger data movements, but we'd rather do that because the infrastructure actually needs it, not because we're maintaining and backing up 15 different versions of several development libraries, some of which haven't been touched since the 1990s. (Believe me, I've seen that.)

Such conversations about dumping old data tend to be difficult, especially with developers. General users can more easily adhere to reasonable quotas, and they will no doubt survive if they have to delete a few videos or quasi-legal music downloads. But developers, scientists, and other highly technical users can easily consume vast storage resources for legitimate purposes, and in most cases have a hard time parting with anything even after it has little or no use.

To tell the truth, I sometimes find it challenging to delete data that I am 99.99 percent sure I'll never need again -- or that I could simply download again if I ever did. Yet there's still the worry that as soon as it's gone, I'll wind up in a situation where I'm wasting time downloading it again. This is the reason I still have RHEL 4.5 ISOs in my lab's vast ISO collection. In fact, I took a tally, and I have 696GB in ISO images alone. Mea culpa.

I understand the reluctance of some to clean up their homedirs. Yet that doesn't help when I wait through yet another synchronization pass that could take 24 hours or more, even at gigabit speeds. The older the source storage array, the more time the process will take, because it wasn't all that long ago that general-purpose mainline storage, especially non-performance storage with SATA drives, couldn't saturate a gigabit link. Of course, that's where much of this "big data" collects.
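To put rough numbers on that wait, here is a back-of-the-envelope sketch in Python. The function name and the `efficiency` parameter are my own illustrative assumptions, not part of any tool; real rsync throughput depends on file counts, disk speed, and protocol overhead, so treat the result as a lower bound at best.

```python
TB = 10**12        # decimal terabyte, in bytes
GIGABIT = 10**9    # 1Gbps link, in bits per second


def transfer_hours(size_bytes, link_bits_per_sec, efficiency=1.0):
    """Estimate hours needed to push size_bytes across a link.

    efficiency is a hypothetical fudge factor (0 < efficiency <= 1)
    for how much of the line rate the source array can actually sustain.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = size_bytes * 8 / (link_bits_per_sec * efficiency)
    return seconds / 3600


# An 8TB share at full gigabit line rate, then at half that rate
# to mimic an older array that cannot keep the link full.
best_case = transfer_hours(8 * TB, GIGABIT)
older_array = transfer_hours(8 * TB, GIGABIT, efficiency=0.5)
print(f"best case: {best_case:.1f} hours, older array: {older_array:.1f} hours")
```

Even the ideal case for an 8TB share is most of a day, and an array that can sustain only half the line rate pushes it well past 24 hours.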

This challenge spans the IT spectrum -- and doesn't necessarily track with the size of the company. A Web-scale company with millions of users operates with a completely different set of requirements than an IT department supporting hundreds of developers or scientists. In many cases, the latter is the more challenging situation due to the way users interact with the data. Depending on the development effort, each of those developers or scientists could be creating dozens of gigabytes of data per day -- potentially even hundreds. Most of the time, they don't think about whether there's enough storage to support that work.

One of the most useful tools in the battle over unreasonable data retention is a simple script that walks through a file tree, records every directory whose atime is more than n years old, and calculates the space savings if that data were moved to a graveyard array or permanently deleted. Some of these results will be shocking, especially if the shares have been around for a while. Finding 3TB that hasn't been touched since 2007 in an 8TB array is not unusual.
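A minimal sketch of such a script might look like the following. It is one possible take, not the exact script I use, and it makes some assumptions worth stating: it checks only the immediate subdirectories of a share, it treats a directory as stale when the newest atime of any regular file beneath it is older than the cutoff, and it trusts the filesystem's atime, which is unreliable on volumes mounted with noatime or relatime. The function names are hypothetical.

```python
import os
import stat
import time

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def newest_atime_and_size(directory):
    """Return (newest atime, total bytes) for regular files under directory."""
    newest = 0.0
    total = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            try:
                st = os.lstat(os.path.join(root, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue  # ignore symlinks, sockets, and the like
            newest = max(newest, st.st_atime)
            total += st.st_size
    return newest, total


def find_stale_dirs(top, years, now=None):
    """List (path, bytes) for subdirectories of top untouched for `years`.

    Assumption: only immediate children of `top` are reported, largest first.
    """
    now = time.time() if now is None else now
    cutoff = now - years * SECONDS_PER_YEAR
    stale = []
    for entry in os.scandir(top):
        if not entry.is_dir(follow_symlinks=False):
            continue
        newest, total = newest_atime_and_size(entry.path)
        if total > 0 and newest < cutoff:
            stale.append((entry.path, total))
    return sorted(stale, key=lambda item: item[1], reverse=True)


def format_report(stale):
    """Render the stale list plus the total space that could be reclaimed."""
    lines = [f"{size / 1e9:10.2f} GB  {path}" for path, size in stale]
    reclaimable = sum(size for _path, size in stale)
    lines.append(f"Total reclaimable: {reclaimable / 1e9:.2f} GB")
    return "\n".join(lines)
```

Pointing `find_stale_dirs` at a share root with `years=5` and printing `format_report` on the result gives a list to bring to the conversation. Because the script only reads metadata with `lstat`, running it should not itself update the access times it measures.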

To mollify those who would cry over the loss of this data, no matter how old it may be, buy a small NAS with a bunch of 3TB drives. Move the old data over to it and shut it off until you need to add more graveyard data. Dollars to doughnuts that data won't ever be missed. This should satisfy the legal department, too.

Offline storage is a beautiful thing. Next time you need to migrate to new storage, you won't need to pull the weight of eons, just the bulk of a few years.

This story, "Are users hoarding data? Here's where to put it," is from Paul Venezia's The Deep End blog.