7 top tools for taming big data
Top-flight reporting, analysis, visualization, integration, and development tools that help you harness Hadoop
Talend also maintains TalendForge, a collection of open source extensions that make it easier to work with the company's products. Most of the tools seem to be filters or libraries that link Talend's software to other major products such as Salesforce.com and SugarCRM. You can suck down information from these systems into your own projects, simplifying the integration.
Big data tools: Skytree Server
Not all of the tools are designed to make it easier to string together code with visual mechanisms. Skytree offers a bundle that performs many of the more sophisticated machine-learning algorithms. All it takes is typing the right command into a command line.
Skytree is more focused on the guts than the shiny GUI. Skytree Server is optimized to run a number of classic machine-learning algorithms on your data using an implementation the company claims can be 10,000 times faster than other packages. It can search through your data looking for clusters of mathematically similar items, then invert this to identify outliers that may be problems, opportunities, or both. The algorithms can be more precise than humans, and they can search through vast quantities of data looking for the entries that are a bit out of the ordinary. This may be fraud -- or a particularly good customer who will spend and spend.
The free version of the software offers the same algorithms as the proprietary version, but it's limited to data sets of 100,000 rows. This should be sufficient to establish whether the software is a good match.
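To make the idea of hunting for outliers concrete, here is a minimal sketch in plain Python. It is not Skytree's implementation or its command-line interface. It is only a brute-force illustration of one classic approach, which scores each row by its average distance to its nearest neighbors. The function name `knn_outlier_scores` and the customer figures are invented for the example.

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by its mean distance to its k nearest neighbors.

    Points that sit far from any cluster get high scores.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical rows: (purchases per month, average order value).
customers = [(2, 20.0), (3, 22.0), (2, 19.0), (3, 21.0), (40, 900.0)]

scores = knn_outlier_scores(customers, k=2)
outlier_index = max(range(len(customers)), key=scores.__getitem__)
print(customers[outlier_index])  # prints (40, 900.0)
```

The sketch compares every row with every other row, so it slows down quickly as the data grows. That scaling problem is the kind of thing a tuned package is meant to solve. Whether the flagged row is fraud or a big spender is still a question for a human.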
Big data tools: Tableau Desktop and Server
Tableau Desktop is a visualization tool that makes it easy to look at your data in new ways, then slice it up and look at it in a different way. You can even mix the data with other data and examine it in yet another light. The tool is optimized to give you all the columns for the data and let you mix them before stuffing them into one of the dozens of graphical templates provided.
Tableau Software started embracing Hadoop several versions ago, and now you can treat Hadoop "just like you would with any data connection." Tableau relies upon Hive to structure the queries, then tries its best to cache as much information in memory as possible so that the tool stays interactive. While many of the other reporting tools are built on a tradition of generating the reports offline, Tableau wants to offer an interactive mechanism so that you can slice and dice your data again and again. Caching helps deal with some of the latency of a Hadoop cluster.
The software is well-polished and aesthetically pleasing. I often found myself reslicing the data just to see it in yet another graph, even though there wasn't much new to be learned by switching from a pie chart to a bar graph and beyond. The software team clearly includes a number of people with some artistic talent.
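The value of caching in front of a slow cluster can be shown with a small sketch. This is not how Tableau is built internally. It assumes a stand-in function, `run_on_cluster`, that pretends to be a slow Hive query and returns made-up rows, and it wraps that function with Python's standard `functools.lru_cache` so a repeated query never reaches the cluster.

```python
from functools import lru_cache

backend_calls = 0  # counts how often the simulated cluster is hit

def run_on_cluster(query):
    """Hypothetical stand-in for a slow Hive query; returns made-up rows."""
    global backend_calls
    backend_calls += 1
    return (("east", 120), ("west", 95))

@lru_cache(maxsize=128)
def cached_query(query):
    """Answer from memory when the same query text has been seen before."""
    return run_on_cluster(query)

q = "SELECT region, COUNT(*) FROM sales GROUP BY region"
first = cached_query(q)   # goes to the simulated cluster
second = cached_query(q)  # served from the in-memory cache
print(backend_calls)      # prints 1
```

Only the first request pays the cluster's latency, which is why reslicing the same data again and again can feel instant. A real tool also has to decide when cached results are stale, which this sketch ignores.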
Big data tools: Splunk
Splunk is a bit different from the other options. It's not exactly a report-generating tool or a collection of AI routines, although it accomplishes much of that along the way. It creates an index of your data as if your data were a book or a block of text. Yes, databases also build indices, but Splunk's approach is much closer to a text search process.
This indexing is surprisingly flexible. Splunk comes already tuned to my particular application, making sense of log files, and it sucked them right up. It's also sold in a number of different solution packages, including one for monitoring a Microsoft Exchange server and another for detecting Web attacks. The index helps correlate the data in these and several other common server-side scenarios.
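The book-style index described above can be illustrated with a toy inverted index over log lines. This is a simplified sketch of the general text-search technique, not Splunk's actual index format or search language. The log lines and the helper names `build_index` and `search` are invented for the example.

```python
from collections import defaultdict

def build_index(lines):
    """Map each lowercase token to the set of line numbers containing it."""
    index = defaultdict(set)
    for line_no, line in enumerate(lines):
        for token in line.lower().split():
            index[token].add(line_no)
    return index

def search(index, *terms):
    """Return the sorted line numbers that contain every search term."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return sorted(set.intersection(*sets)) if sets else []

# Hypothetical log lines.
logs = [
    "10:00:01 ERROR disk full on web01",
    "10:00:02 INFO request served by web02",
    "10:00:03 ERROR timeout on web02",
]

index = build_index(logs)
print(search(index, "error", "web02"))  # prints [2]
```

Because the lookup goes from word to line rather than scanning every line, correlating terms such as an error level and a host name costs only a set intersection.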