7 top tools for taming big data
Top-flight reporting, analysis, visualization, integration, and development tools that help you harness Hadoop
Splunk takes text strings and searches for them in its index. You might type in the URLs of important articles or an IP address. Splunk finds them and packages them into a timeline built around the time stamps it discovers in the data. All other fields are correlated, and you can click around to drill deeper and deeper into the data set. While this is a simple process, it's quite powerful if you're looking for the right kind of needle in your data feed. If you know the right text string, Splunk will help you track it. Log files are a great application for it.
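To make the idea concrete, here is a minimal Python sketch of the same kind of search: find every log line containing a text string, then group the hits by the hour in their time stamps. It is not how Splunk works internally, and it uses none of Splunk's APIs. The sample lines, the `timeline` function, and the assumption that the log uses Apache-style bracketed time stamps are all invented for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical sample lines in Apache common log format (not real data).
SAMPLE_LOG = [
    '10.0.0.5 - - [12/Mar/2012:09:15:02 +0000] "GET /articles/big-data HTTP/1.1" 200 5120',
    '10.0.0.7 - - [12/Mar/2012:09:47:30 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.5 - - [12/Mar/2012:10:02:11 +0000] "GET /articles/hadoop HTTP/1.1" 200 4096',
    '10.0.0.9 - - [12/Mar/2012:10:30:45 +0000] "GET /articles/big-data HTTP/1.1" 200 5120',
    '10.0.0.5 - - [12/Mar/2012:10:59:59 +0000] "GET /about HTTP/1.1" 200 512',
]

# Assumes the time stamp is the first bracketed field on each line.
TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def timeline(lines, needle):
    """Count the lines containing `needle`, bucketed by the hour of their time stamp."""
    buckets = Counter()
    for line in lines:
        if needle not in line:
            continue  # plain substring match, the simplest possible "search"
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # skip lines with no recognizable time stamp
        stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        buckets[stamp.strftime("%Y-%m-%d %H:00")] += 1
    return sorted(buckets.items())


if __name__ == "__main__":
    for hour, hits in timeline(SAMPLE_LOG, "10.0.0.5"):
        print(hour, hits)
```

Calling `timeline(SAMPLE_LOG, "/articles/big-data")` gives the same kind of hourly breakdown for a URL instead of an IP address. A real tool adds an index so that it does not have to rescan every line on each search.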
A new Splunk tool called Shep, currently in private beta, promises bidirectional integration between Hadoop and Splunk, allowing you to exchange data between the systems and query Splunk data from Hadoop.
Bigger than big data
After wading through these products, it became clear that "big data" was much bigger than any single buzzword. It's not really fair to lump together products that largely build tables with those that attempt complicated mathematical operations. Nor is it fair to compare simpler tools that work with generic databases with those that attempt to manage larger stacks spread out over multiple machines in frameworks like Hadoop.
To make matters worse, the targets are moving. Some of the more tantalizing new companies aren't sharing their software yet. Mysterious Platfora has a button you can click to stay informed, while another enigmatic startup, Continuity, just says, "We're still in stealth, heads down and coding hard." They're surely not going to be the last new entrants in this area.
Despite the speed and sophistication of the new algorithms, I found myself liking the old classic reports the best. The Pentaho and Jaspersoft tools simply produce nice lists of the top entries, but this was all I needed. Knowing the top domains in my log file was enough.
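A "top entries" report of this kind takes only a few lines to approximate. The Python sketch below counts the host names in a list of URLs and returns the most frequent ones. It is an illustration only, not what the Pentaho or Jaspersoft tools do internally. The `top_domains` function and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical URLs standing in for the entries of a log file.
SAMPLE_URLS = [
    "http://example.com/a",
    "http://example.org/b",
    "http://example.com/c",
    "http://example.net/d",
    "http://example.org/e",
    "http://example.com/f",
]


def top_domains(urls, n=3):
    """Return the n most frequent host names as (domain, count) pairs."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname  # None when the URL has no host part
        if host:
            counts[host] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    for domain, hits in top_domains(SAMPLE_URLS, n=2):
        print(domain, hits)
```

On a data set too large for one machine, the same count would be spread across a cluster, but the report that comes out the other end is still a sorted list.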
The other algorithms are intellectually interesting, but they're harder to apply with any consistency. They can flag clusters or do fuzzy matching, but my data set didn't seem to lend itself to these analyses. Try as I might, I couldn't figure out any applications for my data that didn't seem contrived.
Others will probably feel differently. The clustering algorithms are used heavily in diverse applications such as helping people find similar products in online stores. Others use outlier detection algorithms to identify potential security threats. These all bear investigation, but the software is the least of the challenges.
Perhaps it is my lack of vision that left me clinging to the old sortable reports. In time, I may come to understand just how I might use the advanced algorithms to do more. This may be why most of these companies list consulting among their products. They will rent you one of their engineers, who is familiar with the software and the math, so you have a guide when you're digging around the data. This is a good option for every business because the needs and demands are often rather abstract and filled with wishful hand-waving.
At a recent O'Reilly Strata conference on big data, one of the best panels debated whether it was better to hire an expert on the subject being measured or an expert on using algorithms to find outliers. I'm not sure I can choose, but I think it's important to hire a person with a mandate to think deeply about the data. It's not enough to just buy some software and push a button.
This article, "7 top tools for taming big data," was originally published at InfoWorld.com.