Best of Open Source Awards 2013

Bossie Awards 2013: The best open source big data tools

InfoWorld's top picks in the expanding Hadoop ecosystem, the NoSQL universe, and beyond

The best open source big data tools

MapReduce was a response to the limitations of traditional databases. Tools like Giraph, Hama, and Impala are responses to the limitations of MapReduce. These all run on Hadoop, but graph, document, column, and other NoSQL databases might also be part of the mix. Which big data tools will meet your needs? The number of options seems to be expanding faster than ever. 

Apache Hadoop

When people say "big data" or "data science," they're usually talking about a Hadoop project. Hadoop generally refers to the MapReduce framework, but the project also consists of important tools for data storage and processing. The new YARN framework, aka MapReduce 2.0, is an important step forward for Hadoop, and you can expect a big hype cycle to start shortly (if not, I'll start one!).

There aren't many Apache projects that support even one heavily capitalized startup. Hadoop supports several. Analysts estimate that Hadoop will be a ballooning market worth tens of billions of dollars per year. If you slipped into a coma during the financial crisis and just woke up, this is the biggest thing you missed.

-- Andrew C. Oliver

Apache Sqoop

When you think of big data processing, you think of Hadoop, but that doesn't mean traditional databases don't play a role. In fact, in most cases you'll still be drawing from data locked in legacy databases. That's where Apache Sqoop comes in.

Sqoop facilitates fast data transfers from relational database systems to Hadoop by leveraging concurrent connections, customizable mapping of data types, and metadata propagation. You can tailor imports (such as new data only) to HDFS, Hive, and HBase; you can export results back to relational databases as well. Sqoop manages all of the complexities inherent in the use of data connectors and mismatched data formats.
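As a rough illustration of how concurrent connections can speed up a transfer, the Python sketch below divides a table's key range into contiguous chunks, one per connection. It is not Sqoop's code or API; the function `split_ranges`, the table `orders`, and the column `id` are hypothetical names chosen for the example.

```python
def split_ranges(min_id, max_id, num_connections):
    """Divide the inclusive key range [min_id, max_id] into contiguous chunks."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_connections)
    ranges = []
    start = min_id
    for i in range(num_connections):
        # The first `extra` chunks absorb one leftover key each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more connections than keys
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Each chunk could be fetched over its own database connection.
# `orders` and `id` are made-up names for illustration.
for lo, hi in split_ranges(1, 10, 4):
    print(f"SELECT * FROM orders WHERE id BETWEEN {lo} AND {hi}")
```

Because the chunks do not overlap and together cover the full range, each connection can pull its rows independently of the others.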

-- James R. Borck

Talend Open Studio for Big Data

Talend Open Studio for Big Data lets you load files into Hadoop (via HDFS, Hive, Sqoop, and so on) without manual coding. Its graphical IDE generates native Hadoop code (supporting YARN/MapReduce 2) that leverages Hadoop's distributed environment for large-scale data transformations.

Talend's visual mapping tools allow you to build flows and test your transforms without ever getting your hands dirty with Pig. Project scheduling and job optimization tools further enhance the toolkit.

Gleaning intelligence from big piles of data starts with getting that data from one place to Hadoop, and often from Hadoop to another place. Talend Open Studio helps you swim through these migrations without getting bogged down in operational complexities.

-- James R. Borck

Apache Giraph

Apache Giraph is a graph processing system built for high scalability and high availability. The open source equivalent of Google's Pregel, Giraph is used by Facebook to analyze social graphs of users and their connections. This system circumvents the problem of using MapReduce to process graphs by implementing Pregel's more efficient Bulk Synchronous Parallel processing model. The best part: Giraph computations run as Hadoop jobs on your existing Hadoop infrastructure. You get distributed graph processing while using the same familiar tools.
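To make the Bulk Synchronous Parallel model concrete, here is a minimal single-process sketch in Python of Pregel-style supersteps. It is not Giraph's API (Giraph programs are written in Java); the function `bsp_max_value` and the toy graph are made up for illustration. Each vertex adopts the largest value it hears from a neighbor, and the computation halts when no messages remain in flight.

```python
from collections import defaultdict


def bsp_max_value(graph, values):
    """Propagate the maximum vertex value through an undirected graph.

    graph:  dict mapping vertex -> list of neighbor vertices (toy input)
    values: dict mapping vertex -> initial numeric value
    Returns (final values, number of supersteps executed).
    """
    values = dict(values)

    # Superstep 0: every vertex sends its value to all of its neighbors.
    inbox = defaultdict(list)
    for vertex, neighbors in graph.items():
        for neighbor in neighbors:
            inbox[neighbor].append(values[vertex])
    supersteps = 1

    # Later supersteps: only vertices that received messages do any work.
    while inbox:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > values[vertex]:
                values[vertex] = best
                for neighbor in graph[vertex]:
                    outbox[neighbor].append(best)
        # Barrier: messages sent in this superstep arrive in the next one.
        inbox = outbox
        supersteps += 1
    return values, supersteps


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
values = {"a": 3, "b": 6, "c": 2, "d": 1}
result, steps = bsp_max_value(graph, values)
print(result)
```

The point of the model is that a vertex only ever looks at its own state and its incoming messages, so the per-vertex work inside a superstep could be spread across many machines, with the barrier between supersteps as the only synchronization.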

-- Indika Kotakadeniya

Apache Hama

Like Giraph, Apache Hama brings Bulk Synchronous Parallel processing to the Hadoop ecosystem and runs on top of the Hadoop Distributed File System. However, whereas Giraph focuses exclusively on graph processing, Hama is a more generalized framework for performing massive matrix and graph computations. It combines the advantages of Hadoop compatibility with a more flexible programming model for tackling data-intensive scientific applications.

-- Indika Kotakadeniya

Cloudera Impala

What MapReduce does for batch processing, Cloudera Impala does for real-time SQL queries. The Impala engine sits on all the data nodes in your Hadoop cluster, listening for queries. After parsing each query and optimizing an execution plan, it coordinates parallel processing among the worker nodes in the cluster. The result is low-latency SQL queries across Hadoop with near-real-time insight into big data.
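The coordinate-and-merge pattern described above can be sketched in a few lines of Python. This is only an analogy, not Impala's implementation: threads stand in for worker nodes, and the data, `partial_sum`, and `run_query` are hypothetical names invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands for the rows held on one data node.
node_partitions = [
    [("east", 120), ("west", 80)],
    [("east", 30), ("north", 55)],
    [("west", 45), ("north", 5)],
]


def partial_sum(rows):
    """Worker step: aggregate amounts per region over local rows only."""
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0) + amount
    return totals


def run_query(partitions):
    """Coordinator step: fan out to the workers, then merge partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(partial_sum, partitions))
    merged = {}
    for partial in partials:
        for region, amount in partial.items():
            merged[region] = merged.get(region, 0) + amount
    return merged


# Roughly: SELECT region, SUM(amount) FROM sales GROUP BY region
print(run_query(node_partitions))
```

Each worker touches only its local rows, and the coordinator merges small per-region totals, which is why this style of plan can return answers without first moving the raw data.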

Because Impala uses your native Hadoop infrastructure (HDFS, HBase, Hive metadata), you get a unified platform where you can analyze all of your data without connector complexities, ETL, or expensive data warehousing. And because Impala can be tapped from any ODBC/JDBC source, it makes a great companion for BI packages like Pentaho.

-- James R. Borck


Serengeti

VMware's project aimed at bringing virtualization to big data processing, Serengeti lets you spin up Hadoop clusters dynamically on shared server infrastructure. The project leverages the Apache Hadoop Virtualization Extensions -- created and contributed by VMware -- that make Hadoop virtualization-ready.

With Serengeti, you can deploy your Hadoop cluster environments in minutes without sacrificing configuration options like node placement, HA status, or job scheduling. Further, by deploying Hadoop in multiple VMs on each host, Serengeti allows data and compute functions to be separated, improving computational scaling while maintaining local data storage.

-- James R. Borck

Apache Drill

Inspired by Google's Dremel system, Apache Drill is designed for low-latency interactive analysis of very large data sets. Drill supports multiple sources of data, including HBase, Cassandra, and MongoDB as well as traditional relational databases. With Hadoop, you get massive data throughput, but exploring an idea might take minutes or hours. With Drill, you get results fast enough to work interactively, so ideas can be rapidly explored and fruitful theories developed further.

-- Steven Nuñez


Gephi

Graph theory has applications across the board. A suspected case of insider trading can be investigated by a link analysis of the traders and employees involved. A complex IT environment can be visualized to uncover the most important connection points in the system. Developed by a consortium of academics, corporations, and individuals, Gephi is a visualization and exploration tool that supports multiple graph types and networks as large as 1 million nodes. The wiki, forums, and tutorials are extensive, and the active Gephi community has produced a large set of plug-ins, so it's likely you won't have to reinvent the wheel for common applications.

-- Steven Nuñez


Neo4j

An agile and blazing-fast graph database, Neo4j can be used in a variety of ways, including social applications, recommendation engines, fraud detection, resource authorization, and data center network management. Neo4j has continued its steady progress with both performance improvements (streaming of query results) and improved clustering/HA support.

-- Michael Scarlett


MongoDB

Perhaps the most popular NoSQL database of them all, MongoDB stores data as documents in BSON, a binary form of JSON. This allows schemas to vary across documents, giving developers unbridled freedom compared to traditional relational databases, which impose flat, rigid schemas across numerous tables. And yet MongoDB still provides much of the functionality developers expect in a relational database.
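A small Python sketch shows what varying schemas look like in practice. It uses plain dictionaries and the standard `json` module rather than a MongoDB driver, so nothing here is MongoDB's API; the `users` list, the field names, and the helpers `find` and `fields_in_use` are hypothetical.

```python
import json

# A stand-in for a collection: a list of JSON-like documents.
# Field names and values are made up for illustration.
users = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Grace", "languages": ["COBOL", "FORTRAN"]},
    {"name": "Linus", "address": {"city": "Portland", "country": "US"}},
]


def find(collection, **criteria):
    """Return documents whose top-level fields equal all given criteria."""
    return [doc for doc in collection
            if all(doc.get(key) == value for key, value in criteria.items())]


def fields_in_use(collection):
    """Collect every top-level field name that appears in any document."""
    return sorted({key for doc in collection for key in doc})


print(json.dumps(find(users, name="Grace")))
print(fields_in_use(users))
```

No document had to declare its fields up front, and a nested address or a list of languages sits alongside flat records without a separate table.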

This was a big year for MongoDB with two new releases and scores of new features, including text search and geospatial capabilities, as well as such performance improvements as concurrent index builds and a faster JavaScript engine (V8).

-- Michael Scarlett

Couchbase Server

Like other NoSQL databases, and unlike most relational databases, Couchbase Server does not require you to create a schema before data is inserted. One distinctive attribute of Couchbase Server is its built-in memcached compatibility. This feature allows developers to seamlessly transition from a memcached environment and gain data replication, durability, and zero application downtime. The 2.0 release added document database capability. The 2.1 release built on this with cross-data center replication and improved storage performance.

-- Michael Scarlett

Paradigm4 SciDB

SciDB is a distributed database system that leverages parallel processing to perform real-time analytics on streaming data. Built from the ground up to support massive scientific data sets, it eschews the rows and columns of relational databases for native array constructs that are better suited to ordered data sets such as time series and location data. Neither relational nor MapReduce, SciDB offers a unified solution that scales across large clusters without requiring Hadoop's multilayered infrastructure and data massaging obligations.

-- James R. Borck