With MapReduce, developers can create programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers. The MapReduce framework is broken down into two functional areas: Map, a function that parcels out work to different nodes in the distributed cluster, and Reduce, a function that collates the work and resolves the results into a single value.
One of MapReduce's primary advantages is that it is fault-tolerant, which it accomplishes by monitoring each node in the cluster; each node is expected to report back periodically with completed work and status updates. If a node remains silent for longer than the expected interval, a master node makes note and reassigns the work to other nodes.
Building on Hadoop
In addition to many open source support tools such as Clojure and Thrift, dozens of commercial options exist as well, though many are built using Hadoop as the foundation. The PricewaterhouseCoopers Center for Technology and Innovation has published an in-depth guide to the big data building blocks and how they relate to both IT deployment and business usage.
One example is Datameer, which provides a platform to collect and read different kinds of large data stores, put them into a Hadoop framework, and then provide tools for analysis of this data. Basically, Datameer seeks to hide the complexity of Hadoop and provide analysis tools on top of it. Datameer's sweet spot is data sources that exceed 10TB, the size at which Datameer says companies begin to struggle with using traditional technologies for data analysis.
Other commercial vendors offering similar approaches to big data analytics include Appistry, Cloudera, Drawn to Scale HQ, Goto Metrics, Karmasphere, and Talend. And the three main database vendors -- IBM, Microsoft, and Oracle -- all support Hadoop interaction, though in different ways. The open source BI vendor Pentaho also supports Hadoop.
Big data fits businesses of all sizes
Big data is not just all about size; it is also about performance, whatever the dimensions of the data set. That matters for immediate analytics, such as assessing a customer's behavior on a website to better understand what support they need or product they seek, or figuring out implications of current weather and other conditions on delivery routing and scheduling.
That is where server clusters, high-performance file systems, and parallel processing come into play. In the past, those technologies were too expensive for all but the largest businesses. Today, virtualization and commodity hardware have reduced the costs significantly, making big data available to small and medium-size businesses.
Those smaller businesses also have another path to big data analytics: the cloud. Cloud services for big data are popping up, offering platforms and tools to perform analytics quickly and efficiently.