Choosing a Hadoop distribution
There's no easy way to summarize the quickly shifting space. Each of these companies is pointed in a slightly different direction. They may all agree that the Hadoop collection of software is a great way to spread out work over a cluster, but they each have different visions of who would want to do this and, more important, how to accomplish it. The similarities are fewer than you might expect.
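What all of these vendors share is the MapReduce model at Hadoop's heart: map a function over the data, group the intermediate results by key, then reduce each group. A minimal, single-process sketch of that model (the function names here are illustrative, not Hadoop's actual API, and Hadoop distributes each phase across a cluster rather than running them in one process):

```python
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) pairs, as a word-count mapper would."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Group values by key -- the step the framework runs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data on a cluster", "a cluster spreads out the work"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["cluster"])  # appears once in each document
```

Because the map and reduce phases touch each record independently, the framework is free to run them on whichever machines hold the data, which is what lets the vendors below differentiate on management, storage, and tooling rather than on the model itself.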
The biggest differences may be in how you handle your data. The idea of making your data accessible through NFS may be one of the neatest innovations, but MapR is introducing some risk by breaking from the pack and adding its own proprietary extensions. MapR's claims for great speed and better throughput are tantalizing, but there's also the danger of bugs or mistakes appearing because of incompatibility. Just as in horror movies, bad things can happen when you split up and strike off on your own.
Amazon's system also imposes its own limitations. It's easiest if you've already decided to park your information in S3. If S3 is a good home for your data, you won't notice the constraints; if you're not so sure, you'll have to adapt.
Much will also depend on how you're using your data. I think Amazon's cloud may be the simplest way to knock off fast jobs that are run occasionally, but it's not the only choice. Both IBM and Cloudera make it relatively easy to set up and run a cluster. After doing it a few times, I found I could knit together a small cluster of Rackspace cloud machines in just a few minutes. It's not simple, but it's not too hard either.
My guess is that most folks will want to use Cloudera, IBM, or MapR for permanent clusters in permanent clouds. Although it's tempting to spin up a rack for a bit of work, it probably makes more sense to leave it up and running just to simplify the process of migrating the data. I suspect that most Hadoop work involves more data juggling than raw computation. Leaving the cluster up and running makes it possible to move the data in and out.
Another big difference is how the companies are adding features for processing different types of data. IBM includes Lucene, a text search engine for building indices. Cloudera offers intelligent log search. I think that these sorts of Hadoop add-ons will only become more common.
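The data structure behind a text search engine like Lucene is an inverted index: a map from each term to the documents that contain it, so lookups never have to scan the raw text. A toy sketch of the idea (the names and structure here are illustrative, not Lucene's API, which also handles tokenization, scoring, and on-disk storage):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, term):
    """Return the ids of documents containing the term."""
    return sorted(index.get(term.lower(), set()))

docs = {
    1: "Hadoop spreads work over a cluster",
    2: "Lucene builds a text index",
    3: "the cluster stores the index",
}
index = build_index(docs)
print(search(index, "index"))    # documents 2 and 3
print(search(index, "cluster"))  # documents 1 and 3
```

Building such an index is itself a natural MapReduce job, which is why text search is one of the first add-ons the vendors bolt onto their distributions.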
These additions will also put more stress on the open source foundation of Hadoop. In many ways, the proliferation of different approaches shows the strength of the open source approach. The commercial vendors can collaborate on the core while competing on which extra features to add to the mix.
Still, there's bound to be some tension between the companies as they add material. I'm hopeful that the spirit of cooperation will continue to be strong enough to keep everyone working together, but there's no reason to assume that it will always be so. When people get good ideas, they'll roll them into their own clusters first and they may or may not contribute them back to the original distribution.
Commercial Hadoop distributions at a glance
| Distribution | What you get |
| --- | --- |
| Amazon Elastic MapReduce | Hadoop implemented on EC2: Upload your JAR file of Hadoop jobs, and Amazon's management tools handle everything else, including bidding for idle machines. |
| Cloudera CDH, Manager, and Enterprise | CDH: Hadoop, Hive, Mahout, Oozie, Pig, ZooKeeper, Hue, and other open source tools; no proprietary tools. Cloudera Manager Free Edition: All of CDH plus basic Manager supporting up to 50 cluster nodes. Cloudera Enterprise: Combines CDH, a more sophisticated Manager supporting an unlimited number of cluster nodes, proactive monitoring, and additional data analysis tools. |
| Hortonworks Data Platform | Hadoop, Hive, Mahout, Oozie, Pig, ZooKeeper, Hue, and other open source tools; no proprietary tools. |
| IBM InfoSphere BigInsights | Basic Edition: Hadoop, Hive, Mahout, Oozie, Pig, ZooKeeper, Hue, and other open source tools; basic version of the IBM installer and data access tools. Enterprise Edition: Adds sophisticated job management tools, a data access layer that integrates with major data sources, and BigSheets, a spreadsheet-like interface for manipulating data in the cluster. |
| MapR M3 and M5 | M3: Hadoop, Hive, Mahout, Oozie, Pig, ZooKeeper, Hue, and other open source tools. M5: Adds direct NFS access, snapshots, and mirroring for "high availability." |
This article, "Enterprise Hadoop: Big data processing made easier," was originally published at InfoWorld.com.