Dealing with the data explosion

The exponential expansion of enterprise data -- and what to do about it -- is the biggest problem faced by IT

Am I the only one to notice that the two big trends of the day, cloud computing and mobile tech, seem to have so little to do with the core issues that concern IT professionals?

While the guys at Gartner and Forrester dream of other things, at InfoWorld we've given a name to the most pervasive underlying trend in all of IT: the enterprise data explosion.

You've heard the basic IDC stat, which sounds like a malign inversion of Moore's Law: Data doubles every 18 months. And the explosion shows no sign of abating. New compliance regulations in the wake of the global financial meltdown will likely mandate even more data retention, while the imperative to digitize health care records in the United States will prompt a fresh set of storage requirements. With the cost of disk space at an all-time low and the vagaries of compliance laws compelling businesses to "save everything" as a brute force method to reduce risk, enterprises are adding capacity at an astounding rate.

IDC analysts predict that unstructured data will grow at twice the rate of conventional structured data held in databases. By 2010, this "dark matter," so named due to the challenge of extracting useful information from raw data, will make up the majority of all enterprise data stored.

Most of that dark matter comes in the form of security, network, and system event logs. Almost everything that happens in a business is recorded in a log file, making the search and analysis of that data an essential part of managing, securing, and auditing how a company's technology infrastructure is used. Logs are key to many forms of regulatory compliance (PCI, SOX, FISMA, HIPAA) and are a source of business intelligence just waiting to be tapped -- think Web servers and CRM systems.

A number of tools now help IT search and analyze log files, including products from AlertLogic, ArcSight, LogLogic, LogRhythm, RSA Security, Sensage, and Splunk. ArcSight and RSA also sell leading SEM (security event management) systems, which collect event log data across network and security devices and correlate network events in real time to identify security threats as they happen. SEM solutions gather vast amounts of event data and provide reporting tools for mining it.

Dark matter is only about half of all enterprise data stored. The structured stuff is ballooning, too: transaction records, e-mail archives, rich media, near-line database backups, and on and on. We all know how low-cost storage systems and virtualization are making it more economical to store this stuff. But managing and securing these huge volumes of data are becoming prohibitively difficult, and the cost of buying and maintaining new hardware without increased efficiencies cannot be sustained forever.

We are still years away from solutions that allow administrators to wrap their arms around the whole, heterogeneous storage mess and manage it from one monster control panel. Meanwhile, some interesting new options for easing the pain are emerging.

Most people have heard of one of them, thanks to the recent bidding war over Data Domain: data deduplication. Here, byte- or block-level data reduction techniques shrink the disk requirements (by 80 percent or more) for backups, snapshots, and even virtual server disk files, lowering overall data protection costs while at the same time making more data available on near-line storage. A number of companies are involved in this space, including ExaGrid, FalconStor, IBM, NEC, Quantum, Riverbed, Sepaton, and Symantec.
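
To make the block-level idea concrete, here is a minimal Python sketch. It is not any vendor's implementation: it assumes fixed 4 KB blocks and SHA-256 fingerprints, and the names deduplicate, restore, and stored_bytes are hypothetical, chosen for illustration only. Commercial products typically add variable-size chunking, compression, and on-disk indexes.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size in bytes (illustrative choice)


def deduplicate(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks and keep one copy of each unique block.

    Returns (store, recipe): store maps a SHA-256 digest to the block bytes,
    recipe is the ordered list of digests needed to rebuild the original data.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # only the first copy is stored
        recipe.append(digest)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original byte stream from the block store and the recipe."""
    return b"".join(store[digest] for digest in recipe)


def stored_bytes(store) -> int:
    """Bytes actually kept on disk after deduplication (block payload only)."""
    return sum(len(block) for block in store.values())
```

Fed a made-up backup stream of ten 4 KB blocks in which only two are distinct, the sketch keeps 8,192 of 40,960 bytes, an 80 percent reduction, and restore() returns the original stream unchanged. Real backup sets vary widely in how much redundancy they contain.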

Some of the new cloud solutions are interesting, too. The most prevalent are cloud-based hosting and backup/recovery services from the likes of SunGard or Rackspace. In addition, many of the first practical cloud-based applications have been built to store, manage, and process massive data sets, leveraging large clusters of commodity hardware and using programming frameworks (such as MapReduce and Hadoop) for reliable and scalable distributed computing.
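
For readers who have not seen the MapReduce model, the following single-process Python sketch shows its three steps (map, shuffle, reduce) applied to counting log events by severity. It does not use Hadoop or any real cluster API. The log layout (timestamp, then severity, then message) and the function names map_phase, shuffle, reduce_phase, and run_job are assumptions made for illustration; in a real framework the map and reduce steps would run in parallel across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (severity, 1) pair for one log line.

    Assumes the severity is the second whitespace-separated field;
    lines with fewer than two fields emit nothing.
    """
    fields = line.split()
    if len(fields) >= 2:
        yield fields[1], 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of log lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Calling run_job on a handful of lines such as "10:00 ERROR disk full" returns a dictionary of counts per severity. The appeal of the model is that map_phase and reduce_phase hold no shared state, which is what lets a framework spread them over commodity hardware.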

These and other technologies can be marshaled to manage the explosive growth of data -- and, in some cases, to extract new value from that data. But determining the best practices in each discipline and creating a grand strategy that drives toward an enterprise-wide solution isn't easy.

Over the next few months we'll be creating fresh material that addresses these issues. Meanwhile, we'd like to hear from you: What sort of information do you need that can help you develop appropriate storage architectures and solutions to deal with the enterprise data explosion?