Enterprise Hadoop: Big data processing made easier

Amazon, Cloudera, Hortonworks, IBM, and MapR mix simpler setup of Hadoop clusters with proprietary twists and trade-offs

There are several nice touches in this installation script, by the way. As I plowed along looking for a good distribution, the software was careful to remember all of my inputs, so it wouldn't need to be reconfigured each time. This should be useful in a cloud where people may try to spin up a cluster, then tear it down. The software also includes a number of little features, like the ability to remember a different root password for each node; these can be quite helpful.

The center of the IBM tool is a console that helps you set up some jobs and kick them off. It's completely browser-based -- like the install script -- and you can simply upload your JAR files directly through the Web browser. You can even drill down into the HDFS file system layer and read the results without leaving the browser.

The Web GUI is a big advance over using the command line, but I easily found a number of ways that the console in the basic edition could be improved. As far as I can tell, there's no way to delete the old jobs. The information for each job includes basic details about the start and stop time, but almost everything else is just dumped as raw text. It wouldn't be too hard to parse some of this and do a nicer job displaying the log information.

The monitoring is also rudimentary. You can see that the nodes in your cluster are running and the components have started, but you don't get any cool dials or widgets that show the load or the progress. If you ask for the "details" about a component, you get a popup with some Log4J lines related to that component. A Java programmer won't bat an eye, but others might find it spare and uninviting.
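To give a sense of how little work a friendlier display might take, here is a minimal sketch of splitting a Log4J-style line into fields that a console could show in columns. It assumes the lines follow a "timestamp LEVEL logger: message" layout, which is a common Log4J pattern but not something confirmed for the IBM console; the class name LogLineParser and the sample logger name are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParser {
    // Assumed layout (not confirmed for the IBM console):
    // "yyyy-MM-dd HH:mm:ss,SSS LEVEL logger.name: message"
    private static final Pattern LINE = Pattern.compile(
        "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}) (\\w+) ([^:\\s]+): (.*)$");

    /** One parsed log line, split into displayable fields. */
    public static final class Entry {
        public final String timestamp;
        public final String level;
        public final String logger;
        public final String message;

        Entry(String timestamp, String level, String logger, String message) {
            this.timestamp = timestamp;
            this.level = level;
            this.logger = logger;
            this.message = message;
        }
    }

    /** Returns the parsed fields, or an empty Optional if the line does not fit the assumed layout. */
    public static Optional<Entry> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new Entry(m.group(1), m.group(2), m.group(3), m.group(4)));
    }

    public static void main(String[] args) {
        // Hypothetical sample line, for illustration only.
        String sample = "2012-05-01 12:00:00,123 INFO org.example.Component: Started";
        parse(sample).ifPresent(e ->
            System.out.println(e.level + " | " + e.logger + " | " + e.message));
    }
}
```

With the fields separated, a console could sort by timestamp, color the warnings, or hide everything below a chosen level. Lines that don't match the assumed layout, such as stack trace continuations, come back empty and would still need to be shown as raw text.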

There are a number of better tools in the enterprise edition. The aforementioned BigSheets, a so-called spreadsheet running on Hadoop, will let you play around with the data in the Hadoop cluster just as you would experiment with the data in Excel. There are also a number of tools for connecting your cluster with other databases and data sources throughout the enterprise. The basic edition is good for trying out a pretty standard version of Hadoop, while the enterprise edition adds a slew of features that go far beyond the open source core.

MapR M3 and M5
Whereas Cloudera is run by folks who come from Hadoop strongholds such as Yahoo, MapR's corporate team is filled with people who hail from Google, EMC, Microsoft, and Cisco, companies with plenty of experience with big data sets, even if they're not steeped in Hadoop's traditional way of working with them.

The new talent is also bringing more sophistication to the stack. The MapR distribution of Hadoop includes a better version of the file system with snapshots, mirroring, and direct NFS access if you need it. MapR also offers a more resilient architecture that won't go down if the central controller locks up. MapR calls all of this "high availability" and charges for it.

MapR comes in two flavors: M3 and M5. Is there an M4? Apparently not, but that's marketing for you. The real distinction is between the free community edition (M3) and the proprietary version with all the extra, high-availability features (M5). While some of the other companies are effectively selling tools for monitoring and reporting, MapR is selling a more sophisticated layer under the hood. In other words, whereas the others are wrapping more features around the open source Hadoop, MapR is rebuilding it.