7 top tools for taming big data
Top-flight reporting, analysis, visualization, integration, and development tools that help you harness Hadoop
Big data tools: Pentaho Business Analytics
Pentaho is another software platform that began as a report-generating engine; like JasperSoft, it is branching into big data by making it easier to absorb information from the new sources. You can hook up Pentaho's tool to many of the most popular NoSQL databases, such as MongoDB and Cassandra. Once the databases are connected, you can drag and drop the columns into views and reports as if the information came from SQL databases.
I found the classic sorting and sifting tables to be extremely useful for understanding just who was spending the most time on my website. Simply sorting by IP address in the log files revealed what the heavy users were doing.
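To make the log-sorting idea concrete, here is a minimal Python sketch of the same kind of tally done by hand. It is not Pentaho code; the sample lines, the addresses, and the function name requests_per_ip are all invented for illustration, and it assumes the client IP is the first whitespace-separated field on each line, as in the common Apache log layout.

```python
from collections import Counter

# Hypothetical sample lines; a real log would be read from a file instead.
SAMPLE_LOG = """\
203.0.113.5 - - [10/Oct/2012:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326
198.51.100.7 - - [10/Oct/2012:13:55:40 -0700] "GET /about.html HTTP/1.1" 200 1045
203.0.113.5 - - [10/Oct/2012:13:56:02 -0700] "GET /pricing.html HTTP/1.1" 200 512
203.0.113.5 - - [10/Oct/2012:13:57:11 -0700] "GET /contact.html HTTP/1.1" 200 733
"""


def requests_per_ip(log_text):
    """Count requests per client IP, busiest first.

    Assumes the IP address is the first whitespace-separated field.
    """
    counts = Counter()
    for line in log_text.splitlines():
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts.most_common()


heavy_users = requests_per_ip(SAMPLE_LOG)
for ip, hits in heavy_users:
    print(ip, hits)
```

A reporting tool does this grouping and sorting behind a drag-and-drop table; the sketch only shows how little logic the underlying question needs.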
Pentaho also provides software for drawing HDFS file data and HBase data from Hadoop clusters. One of the more intriguing tools is the graphical programming interface known as either Kettle or Pentaho Data Integration. It has a bunch of built-in modules that you can drag and drop onto a picture and then connect. Pentaho has thoroughly integrated Hadoop and the other sources into this, so you can write your code and send it out to execute on the cluster.
Big data tools: Karmasphere Studio and Analyst
Many of the big data tools did not begin life as reporting tools. Karmasphere Studio, for instance, is a set of plug-ins built on top of Eclipse. It's a specialized IDE that makes it easier to create and run Hadoop jobs.
I had a rare feeling of joy when I started configuring a Hadoop job with this developer tool. There are a number of stages in the life of a Hadoop job, and Karmasphere's tools walk you through each step, showing the partial results along the way. I guess debuggers have always made it possible for us to peer into the mechanism as it does its work, but Karmasphere Studio does something a bit better: As you set up the workflow, the tools display the state of the test data at each step. You see what the temporary data will look like as it is cut apart, analyzed, then reduced.
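The stage-by-stage view described above can be imitated in a few lines of Python. This is only a toy, in-memory word count, not Karmasphere's tooling or the Hadoop API; the function names map_phase, shuffle_phase, and reduce_phase and the two lines of test data are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(lines):
    """Cut the input apart: emit a (word, 1) pair for every word."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical test data, standing in for a small sample of the real input.
test_data = ["big data tools", "big data jobs"]

mapped = map_phase(test_data)
print("after map:    ", mapped)

shuffled = shuffle_phase(mapped)
print("after shuffle:", shuffled)

reduced = reduce_phase(shuffled)
print("after reduce: ", reduced)
```

Printing the data after each phase is the crude equivalent of watching the temporary results appear as the workflow is assembled.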
Karmasphere also distributes a tool called Karmasphere Analyst, which is designed to simplify the process of plowing through all of the data in a Hadoop cluster. It comes with many useful building blocks for programming a good Hadoop job, like subroutines for uncompressing zipped log files. Then it strings them together and parameterizes the Hive calls to produce a table of output for perusing.
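As a rough illustration of what one such building block does, the Python sketch below unpacks a ZIP archive of log files and tabulates the result. It uses only the standard library, not Karmasphere Analyst or Hive; the archive contents, the three-field log layout, and the helper names make_sample_archive and status_table are hypothetical.

```python
import io
import zipfile
from collections import Counter


def make_sample_archive():
    """Build a small in-memory ZIP holding two hypothetical log files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("day1.log", "GET /index.html 200\nGET /missing 404\n")
        archive.writestr("day2.log", "GET /index.html 200\n")
    buffer.seek(0)
    return buffer


def status_table(archive_file):
    """Decompress every member and count lines by status code.

    Assumes the status code is the last whitespace-separated field.
    """
    counts = Counter()
    with zipfile.ZipFile(archive_file) as archive:
        for name in archive.namelist():
            text = archive.read(name).decode("utf-8")
            for line in text.splitlines():
                fields = line.split()
                if fields:  # skip blank lines
                    counts[fields[-1]] += 1
    return sorted(counts.items())


table = status_table(make_sample_archive())
for status, hits in table:
    print(status, hits)
```

On a real cluster the decompression and the counting would run as separate, parameterized steps over far more data, but the shape of the job is the same: unpack, parse, aggregate, and present a table.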
Big data tools: Talend Open Studio
Talend also offers an Eclipse-based IDE for stringing together data processing jobs with Hadoop. Its tools are designed to help with data integration, data quality, and data management, all with subroutines tuned to these jobs.
Talend Studio allows you to build up your jobs by dragging and dropping little icons onto a canvas. If you want to get an RSS feed, Talend's component will fetch the RSS and add proxying if necessary. There are dozens of components for gathering information and dozens more for doing things like a "fuzzy match." Then you can output the results.
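For readers who have not met a fuzzy match before, here is a minimal sketch using Python's standard difflib module. It is not Talend's component and makes no claim about how Talend scores similarity; the function name fuzzy_match, the candidate list, and the 0.6 cutoff are assumptions chosen for the example.

```python
import difflib


def fuzzy_match(name, candidates, cutoff=0.6):
    """Return the candidate most similar to name, or None if none is close.

    Similarity is difflib's ratio between 0 and 1; cutoff is the minimum
    score a candidate needs to count as a match.
    """
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical reference list and a misspelled incoming value.
known_vendors = ["Pentaho", "Karmasphere", "Talend"]

print(fuzzy_match("Talned", known_vendors))
print(fuzzy_match("Hadoop", known_vendors))
```

The first call should pair the transposed spelling with its intended entry, while the second finds nothing close enough and returns None. Raising the cutoff makes the match stricter; lowering it accepts sloppier input at the risk of false pairings.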
Stringing together blocks visually can be simple after you get a feel for what the components actually do and don't do. This was easier for me to figure out when I started looking at the source code being assembled behind the canvas. Talend lets you see this, and I think it's an ideal compromise. Visual programming may seem like a lofty goal, but I've found that the icons can never represent the mechanisms with enough detail to make it possible to understand what's going on. I need the source code.