Google Fellows Jeff Dean and Sanjay Ghemawat have published a paper in this month's Communications of the ACM, a publication of the Association for Computing Machinery, detailing the programming model Google uses to process more than 20 petabytes of data per day on clusters of commodity machines.
The paper is an update of a prior article on the process and includes deeper insights into the effect the model has had on operations at Google in the time since first publication.
The methodology, known as MapReduce, allows users to break computations into a map function and a reduce function, which the runtime system automatically parallelizes across large clusters, handling machine failures and optimizing network and disk use along the way.
Inspired by similar primitives in Lisp, the methodology abstracts parallelization, fault tolerance, data distribution, and load balancing into a library. More than 10,000 programs have been implemented at Google using MapReduce, which can also be used to parallelize computations for multicore processing on a single machine.
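To illustrate the programming model described above, here is a minimal single-machine sketch of the canonical word-count example in Python. The function names and the simple in-memory driver are illustrative assumptions, not Google's actual C++ library API; in the real system the grouping and reduction run distributed across a cluster.

```python
from collections import defaultdict

def map_fn(document):
    """Map phase: emit an intermediate (word, 1) pair per word occurrence."""
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce phase: combine all values emitted under the same key."""
    return (word, sum(counts))

def run_mapreduce(documents):
    # Shuffle step: group intermediate values by key before reducing.
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return dict(reduce_fn(k, v) for k, v in grouped.items())

counts = run_mapreduce(["the quick brown fox", "the lazy dog"])
# counts["the"] == 2
```

The user supplies only `map_fn` and `reduce_fn`; everything else (grouping, scheduling, fault tolerance) is the library's job, which is what lets the same two functions scale from one core to thousands of machines.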
The model has been used for large-scale graph processing, text processing, data mining, machine learning, and statistical machine translation, among other applications, Dean and Ghemawat write.
The clusters on which MapReduce jobs run consist of thousands of commodity PCs connected by Gigabit Ethernet. The Linux-based dual-processor x86 machines have between 4GB and 8GB of memory per machine, with two 160GB IDE disks directly attached. Google's homegrown GFS (Google File System) manages the data stored to the disks.
Computations are submitted to a scheduler, which maps tasks to available machines. The MapReduce library splits input files into pieces typically between 16MB and 64MB and implements a master/worker model to perform tasks across the cluster.
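The split-then-dispatch flow above can be sketched in miniature. The following Python snippet is a hypothetical scaled-down illustration: a thread pool stands in for the cluster's worker machines, the split size is bytes rather than the 16MB-64MB pieces the real library uses, and the trivial byte-counting map task is invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

SPLIT_SIZE = 16  # stands in for the 16MB-64MB file splits; scaled down here

def split_input(data, split_size=SPLIT_SIZE):
    """Master side: cut the input into fixed-size pieces."""
    return [data[i:i + split_size] for i in range(0, len(data), split_size)]

def map_task(piece):
    """Worker side: a trivial map task that counts the bytes in its piece."""
    return len(piece)

def run_job(data):
    pieces = split_input(data)
    # The master assigns each piece to an idle worker; a thread pool
    # plays the role of the cluster's worker machines here.
    with ThreadPoolExecutor(max_workers=4) as workers:
        partial_results = list(workers.map(map_task, pieces))
    # A final reduce step combines the workers' partial results.
    return sum(partial_results)

total = run_job(b"x" * 100)
# total == 100, processed as 7 pieces (6 full 16-byte splits plus a remainder)
```

Because each piece is independent, a piece assigned to a failed worker can simply be handed to another machine and re-run, which is the basis of the fault tolerance the library provides.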
MapReduce use has scaled significantly in its first four years of use at Google, with map input data topping 403 petabytes in September 2007. More than 11,000 machines were used that month to process 2.2 million jobs, with each job running on an average of 394 machines and taking an average of 395 seconds to complete.
As Dean and Ghemawat note in the paper, the most significant use of MapReduce has been a rewrite of the indexing system used for Google search. The MapReduce system has reduced computations from approximately 3,800 lines of C++ to 700 lines.