Choosing a Hadoop
There's no easy way to summarize the quickly shifting space. Each of these companies is pointed in a slightly different direction. They may all agree that the Hadoop collection of software is a great way to spread out work over a cluster, but they each have different visions of who would want to do this and, more important, how to accomplish it. The similarities are fewer than you might expect.
The biggest differences may be in how you handle your data. The idea of making your data accessible through NFS may be one of the neatest innovations, but MapR is introducing some risk by breaking from the pack and adding its own proprietary extensions. MapR's claims for great speed and better throughput are tantalizing, but there's also the danger of bugs or mistakes appearing because of incompatibility. Just as in horror movies, bad things can happen when you split up and strike off on your own.
Amazon's system also imposes its own limitations. It's easiest if you've already decided to park your information in S3. If you've decided that S3 is a good place for your data, you won't notice the constraint. If you're not so sure, you'll have to adapt.
Much will also depend on how you're using your data. I think Amazon's cloud may be the simplest way to knock off fast jobs that are run occasionally, but it's not the only choice. Both IBM and Cloudera make it relatively easy to set up and run a cluster. After doing it a few times, I found I could knit together a small cluster of Rackspace cloud machines in just a few minutes. It's not simple, but it's not too hard either.
My guess is that most folks will want to use Cloudera, IBM, or MapR for permanent clusters in permanent clouds. Although it's tempting to spin up a rack for a bit of work, it probably makes more sense to leave the cluster up and running, if only to simplify the process of moving data in and out. I suspect that most Hadoop work involves more data juggling than raw computation.
Another big difference is how the companies are adding features for processing different types of data. IBM includes Lucene, a text search engine for building indices. Cloudera offers intelligent log search. I think that these sorts of Hadoop add-ons will only become more common.
These additions will also put more stress on the open source foundation of Hadoop. In many ways, the proliferation of different approaches shows the strength of the open source model. The commercial vendors can collaborate on the core while competing on which extra features to add to the mix.
Still, there's bound to be some tension between the companies as they add material. I'm hopeful that the spirit of cooperation will continue to be strong enough to keep everyone working together, but there's no reason to assume that it will always be so. When people get good ideas, they'll roll them into their own clusters first and they may or may not contribute them back to the original distribution.
Commercial Hadoop distributions at a glance
|Distribution|What's included|
|---|---|
|Amazon Elastic MapReduce|Hadoop implemented on EC2: Upload your JAR file of Hadoop jobs, and Amazon's management tools handle everything else, including bidding for idle machines.|
|Cloudera CDH, Manager, and Enterprise|CDH: Hadoop, Hive, Mahout, Oozie, Pig, ZooKeeper, Hue, and other open source tools; no proprietary tools. Cloudera Manager Free Edition: All of CDH plus basic Manager supporting up to 50 cluster nodes. Cloudera Enterprise: Combines CDH, a more sophisticated Manager supporting an unlimited number of cluster nodes, proactive monitoring, and additional data analysis tools.|
|Hortonworks Data Platform|Hadoop, Hive, Mahout, Oozie, Pig, ZooKeeper, Hue, and other open source tools; no proprietary tools.|
|IBM InfoSphere BigInsights|Basic Edition: Hadoop, Hive, Mahout, Oozie, Pig, ZooKeeper, Hue, and other open source tools; basic version of the IBM installer and data access tools. Enterprise Edition: Adds sophisticated job management tools, a data access layer that integrates with major data sources, and BigSheets, a spreadsheet-like interface for manipulating data in the cluster.|
|MapR M3 and M5|M3: Hadoop, Hive, Mahout, Oozie, Pig, ZooKeeper, Hue, and other open source tools. M5: Adds direct NFS access, snapshots, and mirroring for "high availability."|
This article, "Enterprise Hadoop: Big data processing made easier," was originally published at InfoWorld.com.