Enterprise Hadoop: Big data processing made easier
Amazon, Cloudera, Hortonworks, IBM, and MapR mix simpler setup of Hadoop clusters with proprietary twists and trade-offs
Amazon Elastic MapReduce
It should be no surprise that Amazon, one of the pioneers of cloud computing, offers a mechanism for spinning up Hadoop clusters on its EC2 cloud. Elastic MapReduce is tightly integrated with all of Amazon's other elastic offerings, and it sits as another tab on the Amazon Web Services main page. You store your data in S3, then fire up a job to churn through it.
The integration is nicely done. Amazon provides a Java-based Web interface that does a great job of hand-holding, taking care of many of the glitches that often occur when you're first trying software. As with the other tools here, you upload your processing logic, in the form of a JAR file, directly through the Web interface. When Elastic MapReduce wanted to store data in an S3 bucket, it flipped me over to a page for creating the bucket.
If the Web GUI is a bit too babyish, there's also a classic Web service API that's been wrapped up in software by a number of other programmers. I played a bit with a Ruby-based collection of tools that submits the jobs and starts them running. Either way, the job normally starts and ends in S3: it reads its input from a bucket and writes its results back to one.
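To give a feel for what those wrappers send to the Web service, here is a minimal sketch in Python of a job description. It only builds a dictionary and makes no network call. The field names follow the service's RunJobFlow action as I understand it, so check them against the current API reference before relying on them. The helper name, bucket paths, and instance type are hypothetical placeholders, not anything from this review.

```python
def build_job_flow_request(name, jar_s3_path, input_s3_path, output_s3_path,
                           instance_count=3):
    """Describe one JAR step that reads from S3 and writes back to S3.

    Assumption: field names mirror the RunJobFlow action; values are placeholders.
    """
    return {
        "Name": name,
        "Instances": {
            "MasterInstanceType": "m1.small",  # placeholder instance type
            "SlaveInstanceType": "m1.small",
            "InstanceCount": instance_count,
        },
        "Steps": [
            {
                "Name": "run-jar",
                "ActionOnFailure": "TERMINATE_JOB_FLOW",
                "HadoopJarStep": {
                    "Jar": jar_s3_path,
                    # The job's own arguments: where to read and where to write.
                    "Args": [input_s3_path, output_s3_path],
                },
            }
        ],
    }


# Hypothetical bucket and file names, for illustration only.
request = build_job_flow_request(
    "weekly-report",
    "s3://example-bucket/jobs/report.jar",
    "s3://example-bucket/input/",
    "s3://example-bucket/output/",
)
print(request["Steps"][0]["HadoopJarStep"]["Args"])
```

A client library would hand a structure like this to the service. The point here is only that both ends of the job are S3 locations.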
With Elastic MapReduce, Amazon is essentially offering nicer packaging on top of EC2 for those who are willing to plunge deeper into Amazon Web Services. I could have built my own cluster of machines on EC2 and used any of the Hadoop distros to spin them up, but Elastic MapReduce offers a nice set of shortcuts. Amazon has already built and integrated the infrastructure, and you just push the buttons to choose which version of Hadoop (0.18 or 0.20) you want to use. There's no need to worry about which version of Linux is running underneath.
The infrastructure is quite nice. You can choose to pay the standard price for your machines or just bid for empty machines on the spot market. This is the kind of extra feature that thrills the free-market fans, but I found it confusing. You choose your bid and take your chances. If you bid too little, you could end up waiting a long time, perhaps even forever.
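The bidding gamble can be sketched with a toy model. This assumes, as a simplification, that a request is filled in the first hour when the spot price is at or below your bid. The price history is made up for illustration, and the function name is hypothetical.

```python
from typing import Optional, Sequence


def first_fulfilled_hour(bid: float,
                         hourly_spot_prices: Sequence[float]) -> Optional[int]:
    """Return the first hour whose spot price is at or below the bid.

    Returns None when the bid never clears, which is the "waiting forever" case.
    """
    for hour, price in enumerate(hourly_spot_prices):
        if price <= bid:
            return hour
    return None


# Hypothetical spot prices in dollars per machine-hour.
prices = [0.12, 0.11, 0.09, 0.10, 0.07]

print(first_fulfilled_hour(0.10, prices))  # clears in hour 2
print(first_fulfilled_hour(0.05, prices))  # never clears: None
```

A bid of $0.10 waits two hours in this made-up history, while a bid of $0.05 never gets a machine at all.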
It should be noted that the cloud doesn't respond instantaneously. It took from 5 to 18 minutes to execute tiny jobs that would take microseconds to execute on a fully configured cluster in your own server closet. The overhead wouldn't make a difference for a big job, but it's not the same as having your own cluster waiting patiently for you to push the Start button.
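A rough calculation shows why the same startup delay matters so much for small jobs and so little for big ones. The 5- and 18-minute figures come from the runs above. The compute times are assumed for illustration, and the function name is hypothetical.

```python
def overhead_share(startup_minutes: float, compute_minutes: float) -> float:
    """Fraction of total wall-clock time spent waiting for the cluster to start."""
    return startup_minutes / (startup_minutes + compute_minutes)


# Assumed compute times: a tiny job (0.1 minutes) and a big one (600 minutes).
for compute in (0.1, 600.0):
    for startup in (5.0, 18.0):
        share = overhead_share(startup, compute)
        print(f"startup={startup} min, compute={compute} min -> {share:.1%} overhead")
```

With these assumptions, nearly all of the tiny job's elapsed time is startup, while the big job loses only a few percent.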
Taking advantage of all of these features means buying into Amazon's storage system. If you're already using S3 for your data, you'll be ready to go. If you're not, you'll have to make some decisions. Some people find that S3 is too expensive for bulk data that's rarely accessed. You're paying for all of the engineering that's been built for people who need a fairly good response time, and that price is built into the retrieval costs.
I think all of Amazon's extra features are good options for two classes of users. If you already have most of the relevant data in Amazon's cloud, Elastic MapReduce makes it easy to spin up jobs to analyze it. The piping is already well in place.
The other group would be those who don't need a cluster most of the time but want to do short, intensive calculations once a week, once a month, or once a quarter. It's not much work to create a full Hadoop cluster using the other tools in this review, but it's kind of silly to request new machines from scratch every now and again. Amazon offers a nice shortcut to uploading a Python script or a JAR file and going straight to computation.
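For readers who haven't written one, here is a minimal sketch of the kind of Python logic such a script might hold, using the classic word count. It assumes the Hadoop Streaming style of job, in which a mapper emits key/value pairs and a reducer receives them grouped by key. To stay runnable on a laptop, it sorts and groups in memory and reads a small sample list. A real streaming job would split the two functions into separate scripts that read standard input and print tab-separated lines. All names here are illustrative.

```python
from itertools import groupby
from operator import itemgetter


def map_lines(lines):
    """Mapper: emit (word, 1) for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reduce_pairs(pairs):
    """Reducer: sum the counts for each word.

    Sorting here stands in for the shuffle that Hadoop performs between phases.
    """
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


# Sample input standing in for the lines a real job would read from its input files.
sample = ["the quick brown fox", "The lazy dog"]

for word, total in reduce_pairs(map_lines(sample)):
    print(f"{word}\t{total}")
```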
Cloudera CDH, Manager, and Enterprise
Cloudera is a startup that has collected Hadoop experts from all of the major companies using Hadoop. The CTO came from Yahoo, the chief scientist from Facebook, and the CEO from Oracle. The staff is filled with the names of people who learned Hadoop by building it.