Parallel processing meets big data

If you had to point to one event that marked the arrival of the "big data" era, it would be the December 2004 publication of Google's paper on the MapReduce programming model. Google came up with a clever way to scale its processing of crawled Web pages and Web server log files across large clusters of commodity servers, and it shared the ideas with the world. Another team of developers working on Web-scale search problems poured those ideas into an open source MapReduce framework called Hadoop, and the big data industry was born.

But MapReduce has applications beyond Google and Yahoo. The MapReduce paper describes a way to simplify and accelerate the processing of any collection of raw data by breaking the programming job into discrete functions (mappers and reducers), partitioning the data across many computers (even hundreds or thousands), and executing the job across all of those computers in parallel. In practice, the programmer worries only about coding the map and reduce tasks. The Hadoop framework takes care of the messy details, from partitioning the data to rescheduling tasks in the event of hardware failure.
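The division of labor described above can be sketched on a single machine. This is a minimal illustration, not how Hadoop is implemented: a thread pool stands in for the cluster, and the helper names (partition, run_with_retry, run_in_parallel, count_lines) are hypothetical, invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_partitions):
    """Split records into num_partitions slices, dealing them out round-robin."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def run_with_retry(task, chunk, max_attempts=3):
    """Re-run a failed task; a stand-in for the framework rescheduling work."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(chunk)
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt


def run_in_parallel(task, records, num_workers=4):
    """Partition the records and run the task on every partition concurrently."""
    chunks = partition(records, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(lambda chunk: run_with_retry(task, chunk), chunks))


def count_lines(chunk):
    """The only part the programmer writes: the work to do on one partition."""
    return len(chunk)


# Assumed sample data: ten fake Web server log lines.
log_lines = [f"GET /page/{i}" for i in range(10)]
partial_counts = run_in_parallel(count_lines, log_lines, num_workers=3)
total = sum(partial_counts)
print(partial_counts, total)
```

Only count_lines is specific to the job; the partitioning, parallel execution, and retry logic are generic, which is the part a framework such as Hadoop takes off the programmer's hands.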

MapReduce solves problems in a series of steps. The map function takes key/value pairs as input and produces an intermediate set of key/value pairs as output. The reduce function takes the output of the map and merges the results for each key. By chaining together map and reduce operations, the programmer can start with a collection of documents (key = document name, value = document contents) and sift through the raw data to isolate the interesting results -- unique site visitors, ad click-throughs, tweets with the most replies, mean recorded temperature, the longest word in Shakespeare, anything your data might reveal.
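To make the map and reduce steps concrete, here is a minimal in-memory sketch that chains two jobs: the first counts words across a small set of documents, and the second finds the longest word. The run_mapreduce helper, the mapper and reducer names, and the sample documents are all assumptions made for illustration; a real framework would distribute the same steps across many machines.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run one map -> group-by-key -> reduce pass over (key, value) records."""
    groups = defaultdict(list)
    for key, value in records:
        # Map: each input pair yields zero or more intermediate pairs.
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    output = []
    for out_key in sorted(groups):
        # Reduce: merge all intermediate values that share a key.
        output.extend(reducer(out_key, groups[out_key]))
    return output


def word_count_mapper(doc_name, contents):
    for word in contents.lower().split():
        yield word, 1


def sum_reducer(word, counts):
    yield word, sum(counts)


def longest_mapper(word, count):
    # Send every word to a single key so one reducer sees them all.
    yield "longest", word


def longest_reducer(key, words):
    yield key, max(words, key=len)


# Assumed sample input: key = document name, value = document contents.
documents = [
    ("doc1.txt", "to be or not to be"),
    ("doc2.txt", "that is the question"),
]

word_counts = run_mapreduce(documents, word_count_mapper, sum_reducer)
longest = run_mapreduce(word_counts, longest_mapper, longest_reducer)
print(word_counts)
print(longest)
```

The output of the first job is itself a list of key/value pairs, which is what lets the second job consume it unchanged.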

The beauty of MapReduce is that it scales close to linearly as you add compute nodes. Because the framework tries to run each map task on the node that already stores its slice of the data, most input is read from local disk, and network bandwidth becomes much less of a bottleneck. (That's a key difference between MapReduce and the HPC or grid computing approaches to parallel processing, in which the data is accessed from shared storage.) With MapReduce, adding computers shortens processing time roughly in proportion, so a large enough cluster can work through piles of big data in a fraction of the time a single machine would need.

Copyright © 2012 IDG Communications, Inc.