Enterprise Hadoop: Big data processing made easier
Amazon, Cloudera, Hortonworks, IBM, and MapR mix simpler setup of Hadoop clusters with proprietary twists and trade-offs
The value of this approach will depend on the seriousness of your job. If your Hadoop data is mission critical or simply has to be ready most of the time, you'll definitely be interested in the extra features for preserving the data and keeping the cluster up. But if you're processing log files and generating reports that can wait a few hours or even days, there's not much need for them. Restarting your cluster when the NameNode fails is a nuisance, but a tolerable one if you have enough slack in your schedule to begin again.
The value will also depend on the nature of your calculations. If you do many short calculations, then restarting isn't a big problem. But if your jobs last hours, days, or especially weeks, the ability to store snapshots becomes more and more valuable.
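A back-of-the-envelope sketch makes the point concrete. The Python below is a toy model, not anything MapR ships: it assumes a failure forces you to redo everything since the last snapshot (or since the start, if there are no snapshots), and the function name and the one-hour snapshot interval are invented for illustration.

```python
def worst_case_rework_hours(job_hours, snapshot_interval_hours=None):
    """Worst-case hours of work lost to a single failure (toy model).

    Assumption: without snapshots the job restarts from scratch; with
    snapshots it resumes from the most recent one.
    """
    if job_hours < 0:
        raise ValueError("job_hours must be non-negative")
    if snapshot_interval_hours is None:
        # No snapshots: a failure just before the end loses the whole run.
        return job_hours
    if snapshot_interval_hours <= 0:
        raise ValueError("snapshot_interval_hours must be positive")
    # With snapshots: at most one interval of work is lost.
    return min(job_hours, snapshot_interval_hours)


if __name__ == "__main__":
    jobs = [("10-minute job", 10 / 60), ("8-hour job", 8), ("2-week job", 14 * 24)]
    for label, hours in jobs:
        without = worst_case_rework_hours(hours)
        with_snapshots = worst_case_rework_hours(hours, snapshot_interval_hours=1)
        print(f"{label}: up to {without:.2f} h lost without snapshots, "
              f"{with_snapshots:.2f} h with hourly snapshots")
```

For the short job the two figures are identical, which is why restarting is no great hardship there. For the two-week job the gap is the entire run versus a single hour.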
There is a cost for some of these features. I couldn't install the M3 distribution on my cluster of machines in the Rackspace cloud because it requires access to "physical hard drives." In other words, the NFS code from MapR burrows fairly deeply into the file system to generate the performance gains. It can't work its magic with all of the layers of virtualization in some environments. This won't be an issue if you're using real machines with real disks, but it can be a roadblock in some of the new virtual worlds.
I ended up doing my testing with a VMware version that worked with MapR's version of Ubuntu.
I have to say that the direct access to NFS is a nice feature. Although it's always possible to move the data in and out of HDFS with the regular tools, it's much easier to integrate the system with a direct NFS link. I wouldn't be surprised if some errant tool occasionally introduces a bug because the data is not run through HDFS, but I'm guessing the occasional problems will be worth the trade-off.
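To show the difference in practice, here is a minimal Python sketch of the two routes. The first helper only builds the standard `hadoop fs -put` command line; the second treats the cluster as an ordinary directory, which is what an NFS mount allows. The mount point `/mnt/cluster-nfs` and the file names are hypothetical placeholders, not paths taken from MapR's documentation.

```python
import shutil
from pathlib import Path


def hdfs_put_command(local_path, hdfs_path):
    """Build the argument list for copying a local file into HDFS
    with the regular command-line tool (hadoop fs -put)."""
    return ["hadoop", "fs", "-put", str(local_path), str(hdfs_path)]


def copy_via_nfs(local_path, mount_root, relative_dest):
    """Copy a file onto an NFS-mounted cluster file system using plain
    file operations. mount_root is wherever the export is mounted
    (a hypothetical path in this sketch)."""
    dest = Path(mount_root) / relative_dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(local_path, dest)
    return dest


if __name__ == "__main__":
    # Route 1: the regular tools. Pass this list to subprocess.run()
    # on a machine with the Hadoop client installed.
    print(" ".join(hdfs_put_command("access.log", "/logs/access.log")))

    # Route 2: with an NFS mount, any program that can write a file can
    # write to the cluster. Left commented out because the mount point
    # below is an assumption:
    # copy_via_nfs("access.log", "/mnt/cluster-nfs", "logs/access.log")
```

The second route needs no Hadoop client at all, which is what makes integration easier. It is also why a tool that knows nothing about Hadoop can write to the cluster unchecked.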
It's clear that MapR is putting most of its effort into the code under the hood. The Web console for monitoring jobs is perfectly nice, but it lacks some of the gaudier features you'll find in other distributions. I even found myself kicking off jobs by typing "hadoop" into a command line. There's nothing missing that will get in the way of serious work, but the interface isn't as accessible for new users as some of the others.
MapR also has some interesting partnerships. Many have noted that EMC is almost certainly repackaging MapR and selling it as part of the Greenplum collection of big data analytics tools. This suggests that we're already starting to see these stacks disappearing inside of other packages.
Hortonworks Data Platform
I wanted to test the Hortonworks distribution, but it wasn't ready when I was writing. The company will concentrate on selling training and support, and it plans to avoid creating proprietary extensions.
"We are an open source company," Eric Baldeschwieler, the CEO, told me. "The only product we have is open source. We won't commit to never selling anything, but you won't see anything in the next year. We're committed to a complete open source, horizontal platform. We want people to be able to download everything they want for free. That differentiates us from everyone else in the market."
Indeed, the company employs a number of people with a deep knowledge of Hadoop gained from years at Yahoo. The company formally separated from Yahoo last year, and now it's looking for partnerships to support its work.
Hortonworks is currently running a private beta. I couldn't join it, but perhaps your company will be able to participate. In the meantime, you can grab Hadoop directly from Apache. It's guaranteed to be pretty close to what Hortonworks will be shipping, at least for the next year.