Enterprise Hadoop: Big data processing made easier
Amazon, Cloudera, Hortonworks, IBM, and MapR mix simpler setup of Hadoop clusters with proprietary twists and trade-offs
Cloudera sells training, support, professional services, and some tools for managing your cluster. The Cloudera distribution and basic manager are free for clusters with fewer than 50 machines, while the subscription-based enterprise edition offers many more features for handling standard data formats.
The free version is quite useful for starting up a cluster and monitoring the jobs as they flow through the system. The manager takes a list of IP addresses, logs into all of them with SSH, and installs the major tools.
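The workflow described here (take a list of addresses, log in over SSH, install packages) can be sketched in a few lines of Python. This is only an illustration of the idea, not Cloudera's code: the function names, the package list, and the choice of `yum` as the package manager are assumptions that fit a CentOS-style machine.

```python
import shlex
import subprocess


def build_install_commands(hosts, packages, user="root", key_file=None):
    """Build one ssh argument list per host that installs `packages` with yum.

    Hypothetical helper: hosts, packages, and key_file come from the caller.
    """
    remote = "yum install -y " + " ".join(shlex.quote(p) for p in packages)
    commands = []
    for host in hosts:
        # BatchMode=yes makes ssh fail instead of prompting for a password,
        # so this sketch assumes key-based login is already set up.
        argv = ["ssh", "-o", "BatchMode=yes"]
        if key_file is not None:
            argv += ["-i", key_file]
        argv += [f"{user}@{host}", remote]
        commands.append(argv)
    return commands


def run_install(hosts, packages, **kwargs):
    """Run the install on every host in turn; return {host: exit_code}."""
    results = {}
    commands = build_install_commands(hosts, packages, **kwargs)
    for host, argv in zip(hosts, commands):
        results[host] = subprocess.run(argv).returncode
    return results
```

Building the commands separately from running them means you can print and review them before touching any machine.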
The automation makes it fairly easy to run the Cloudera distro, but I still had to patch a few glitches to install it on CentOS. One component wanted a certain version of zip, and the installation ground to a halt until I logged into the machines and installed it myself. At another point, the Web-based graphical user interface wouldn't work until I logged in again and installed a widget library, ExtJS. I suspect it wasn't bundled because the open source licenses probably weren't compatible.
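A glitch like the missing zip is cheaper to catch before the installer starts. The sketch below checks whether a list of command-line tools can be found on the local machine's PATH. The function name and any list of tools you pass it are assumptions for illustration, since the installer's real prerequisites aren't enumerated here, and the check looks only for presence, not for a particular version.

```python
import shutil


def missing_tools(required, which=shutil.which):
    """Return the names in `required` that cannot be found on the PATH.

    `which` is injectable so the check can be exercised on a machine
    that does not have the tools installed.
    """
    return [name for name in required if which(name) is None]
```

Calling `missing_tools(["zip", "unzip"])` on each node before the install would return an empty list when everything is in place, and the names of the absent tools otherwise.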
The logging in reminded me of a small point. The IBM installer can use a different root password for each machine. Cloudera's installer wants to use either the same root password or the same RSA key. This meant I had to log into all of the machines and change the password because I was using a stock version of CentOS to start up the rack.
The fact that I noticed this small point and remembered it says much about what is for sale here. The tools are open source and the companies are selling ease-of-use. Little delays can multiply when you're not running exactly the same code.
I think Cloudera has done a better job of making its tools work with different Linux distros. It lists Ubuntu, Suse, Red Hat, CentOS, and Debian. Although I had to do a bit of patching with CentOS, it was relatively simple.
The gap between the free and enterprise versions is wider than I usually see. The proprietary version will not only handle more than 50 machines, but it also includes plenty of monitoring, reporting, and data analysis tools.
In other words, the free version is a great way to start up a Hadoop cluster and make sure that everything is running, but you'll have to do some poking around to monitor it. The enterprise version includes more tools that automate the poking around and double-checking.
IBM InfoSphere BigInsights
IBM bundles Hadoop into something it calls InfoSphere BigInsights. The word "Hadoop" is on the main page, but the advertising copy clearly suggests that this is a product to help people who want "deep insights" into "big data." It's a tool for data analysis that just happens to use Hadoop for all of the structure.
There are two tiers: basic and enterprise. The basic edition is free, but you can buy support if you like. The enterprise edition, available through a commercial license, includes a number of extra features like BigSheets, a spreadsheet-like tool for drilling down into the data sitting in the cluster.
The collection includes all of the usual suspects and a few that aren't always mentioned -- such as Lucene. Lucene makes sense because BigInsights includes more than a few mechanisms for taking apart text. There's an entire collection of TextExtractors that will do things like search for addresses and flag certain words. The meat of the text analytics is in the enterprise edition.
IBM's literature says the BigInsights package is for Linux, but I found that it ran smoothly with only Red Hat's Enterprise distribution. The installation script would limp to the finish with a few of the others I tried, but it often reported that it failed to install entire tools like Hive or Pig. Even CentOS wasn't close enough to get much running. I think it may still be possible to get BigInsights running if you're adept with Linux and happy to poke around the log files, but it achieves labor savings only if you're running Red Hat Enterprise.
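Given how sensitive the installer is to the distribution, a quick check before starting can save a failed run. The sketch below classifies the contents of `/etc/redhat-release`, the file Red Hat-family systems use to identify themselves. The function names are hypothetical, and the matching rule reflects the usual wording of that file on Red Hat Enterprise Linux and CentOS rather than anything documented by IBM.

```python
def is_rhel(release_text):
    """Return True if the release string identifies Red Hat Enterprise Linux."""
    return release_text.strip().startswith("Red Hat Enterprise Linux")


def read_release(path="/etc/redhat-release"):
    """Read the release file, returning "" if it is missing or unreadable."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""
```

Running `is_rhel(read_release())` on each node before the install tells you whether you are on the one distribution where the script ran smoothly, or whether you should expect to spend time in the log files.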