Big data

Too much data, too little time

On one hand, "big data" certainly has the whiff of utopian fantasy. If businesses and governments would just look into the big piles of data they've accumulated, we're told, we'd all get more of what we want at a much lower cost. A McKinsey Global Institute report puts the value of tapping big data at $300 billion in U.S. health care alone. And health care is only one of many sectors of the U.S. economy (including retail, manufacturing, transportation, utilities, natural resources, finance and insurance, and eight more) with vast treasure troves of data to mine -- at least 200 terabytes, or the digital equivalent of the Library of Congress, for every company of 1,000 employees or more, McKinsey estimates.

Big data is not just the big rock candy mountain, but the big rock candy mountain range of the Internet age.

On the other hand, it's impossible to ignore recent breakthroughs in our ability to store and analyze all of this data. Google, which processes upward of 20 petabytes of data per day (a 2008 figure), came up with MapReduce (a programming framework that distributes the processing of large data sets across large clusters of computers) because traditional data warehouses don't scale to such heights. Facebook abandoned an Oracle data warehouse for Apache Hadoop (an open source implementation of MapReduce) for the same reason, then developed Hive (a Hadoop companion now also maintained by the Apache Software Foundation) so analysts could keep querying the data with a familiar SQL-like language.
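
To make the idea concrete, here is the canonical MapReduce example -- a word count -- as a toy sketch in plain Python. This is an illustration of the two phases only, not Google's implementation or Hadoop's actual API: a map step emits (word, 1) pairs independently for each document, so it can run on many machines at once, and a reduce step sums the pairs by key.

    # A toy illustration of the MapReduce idea in plain Python -- not
    # Google's implementation or Hadoop's API, just the two phases.
    from collections import defaultdict

    def map_phase(document):
        # Emit a (word, 1) pair for every word. Each document is
        # processed independently, so this parallelizes across machines.
        return [(word, 1) for word in document.split()]

    def reduce_phase(pairs):
        # Group the pairs by key and sum the counts for each word.
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    docs = ["big data big clusters", "big data"]
    pairs = [pair for doc in docs for pair in map_phase(doc)]
    print(reduce_phase(pairs))  # {'big': 3, 'data': 2, 'clusters': 1}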

Note: While Google and Facebook might measure big data in petabytes, you don't need to process data at Web scale for MapReduce to pay off. For example, reading one terabyte from a single disk drive today would take between two and three hours. Divide that data among 10 machines, each scanning its own tenth in parallel, and you're down to about 15 minutes. Regardless of the amount of data you have, a sufficient number of machines cuts hours of processing time to minutes.
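
The arithmetic is easy to check on the back of an envelope. Assuming a sequential read speed of roughly 100MB per second -- a typical figure for a disk drive of the era, and an assumption of this sketch rather than anything from the note itself -- a few lines of Python land in the same ballpark:

    # Back-of-envelope scan times. The 100 MB/s sequential read speed
    # is an assumed figure, typical of a circa-2012 disk drive.
    TB_IN_MB = 1000 * 1000
    READ_SPEED_MB_S = 100

    def scan_hours(terabytes, machines=1):
        # Each machine scans its own shard in parallel, so wall-clock
        # time shrinks roughly linearly with the number of machines.
        seconds = terabytes * TB_IN_MB / READ_SPEED_MB_S / machines
        return seconds / 3600

    print(round(scan_hours(1), 1))                 # ~2.8 hours on one disk
    print(round(scan_hours(1, machines=10) * 60))  # ~17 minutes across ten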

Finally, don't get the idea that MapReduce and Hadoop are wholesale replacements for SQL databases. Like many Hadoop infrastructures, Facebook's pulls data from SQL databases to inform MapReduce processing and pushes summary results into SQL databases that handle reporting. In fact, a common use case of Hadoop is simply to bring enough structure to raw data to facilitate further processing in traditional databases. As you might expect, a number of vendors are marrying SQL and MapReduce in "big data analytics" solutions that promise the best of both worlds: a ubiquitous query language we already know and the power of parallel processing.
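
As a simplified illustration of that round trip, the sketch below distills raw log lines (an invented format, for illustration only) into summary rows, then loads them into SQLite, which stands in here for the SQL reporting database. A real pipeline at Facebook's scale would run the distillation as Hadoop jobs and load a production warehouse, but the shape is the same:

    # Sketch of the raw-data-to-SQL pattern: boil unstructured log
    # lines down to summary rows, then load them into a SQL table for
    # reporting. SQLite stands in for the reporting database, and the
    # log format is invented for this example.
    import sqlite3
    from collections import Counter

    raw_logs = [
        "2012-03-01 GET /index.html 200",
        "2012-03-01 GET /checkout 500",
        "2012-03-01 GET /index.html 200",
    ]

    # The MapReduce-style step: reduce raw lines to (page, hits) pairs.
    hits = Counter(line.split()[2] for line in raw_logs)

    # Push the summary into a SQL database that handles reporting.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page_hits (page TEXT, hits INTEGER)")
    conn.executemany("INSERT INTO page_hits VALUES (?, ?)", hits.items())

    for row in conn.execute(
            "SELECT page, hits FROM page_hits ORDER BY hits DESC"):
        print(row)  # ('/index.html', 2), then ('/checkout', 1)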

Copyright © 2012 IDG Communications, Inc.