7 top tools for taming big data
Top-flight reporting, analysis, visualization, integration, and development tools that help you harness HadoopFollow @peterwayner
Many of the big data tools are also working with NoSQL data stores. These are more flexible than traditional relational databases, but the flexibility isn't as much of a departure from the past as Hadoop. NoSQL queries can be simpler because the database design discourages the complicated tabular structure that drives the complexity of working with SQL. The main worry is that software needs to anticipate the possibility that not every row will have some data for every column.
The biggest challenge may be dealing with the expectations built up by the major motion picture "Moneyball." All the bosses have seen it and absorbed the message that some clever statistics can turn a small-budget team into a World Series winner. Never mind that the Oakland Athletics never won the World Series during the "Moneyball" era. That's the magic of Michael Lewis' prose. The bosses are all thinking, "Perhaps if I can get some good stats, Hollywood will hire Brad Pitt to play me in the movie version."
None of the software in this collection will come close to luring Brad Pitt to ask his agent for a copy of the script for the movie version of your Hadoop job. That has to come from within you or the other humans working on the project. Understanding the data and finding the right question to ask is often much more complicated than getting your Hadoop job to run quickly. That's really saying something because these tools are only half of the job.
To get a handle for the promise of the field, I downloaded some big data tools, mixed in data, then stared at the answers for Einstein-grade insight. The information came from log files to the website that sells some of my books (wayner.org), and I was looking for some idea of what was selling and why. So I unpacked the software and asked the questions.
Big data tools: Jaspersoft BI Suite
The Jaspersoft package is one of the open source leaders for producing reports from database columns. The software is well-polished and already installed in many businesses turning SQL tables into PDFs that everyone can scrutinize at meetings.
The company is jumping on the big data train, and this means adding a software layer to connect its report generating software to the places where big data gets stored. The JasperReports Server now offers software to suck up data from many of the major storage platforms, including MongoDB, Cassandra, Redis, Riak, CouchDB, and Neo4j. Hadoop is also well-represented, with JasperReports providing a Hive connector to reach inside of HBase.
This effort feels like it is still starting up -- many pages of the documentation wiki are blank, and the tools are not fully integrated. The visual query designer, for instance, doesn't work yet with Cassandra's CQL. You get to type these queries out by hand.
Once you get the data from these sources, Jaspersoft's server will boil it down to interactive tables and graphs. The reports can be quite sophisticated interactive tools that let you drill down into various corners. You can ask for more and more details if you need them.
This is a well-developed corner of the software world, and Jaspersoft is expanding by making it easier to use these sophisticated reports with newer sources of data. Jaspersoft isn't offering particularly new ways to look at the data, just more sophisticated ways to access data stored in new locations. I found this surprisingly useful. The aggregation of my data was enough to make basic sense of who was going to the website and when they were going there.