Supersize me: Hadoop upgrade will handle even bigger data

Apache aims for the upcoming 0.23 release, due this year, to run on 6,000-node clusters

With a planned upgrade to its Hadoop distributed data processing technology, the Apache Software Foundation intends for the platform to run across much larger clusters and take on larger workloads, an Apache official said Thursday.

A key goal for the upcoming 0.23 release of Hadoop, which could eventually be called version 2 or 3, is to have it run across 6,000-node clusters; to date it has run on clusters of up to 4,000 nodes, said Arun Murthy, vice president of Apache Hadoop at the Apache Software Foundation and a founder of Hortonworks, which offers Hadoop technologies and services. Release 0.23 is currently of alpha quality and is due for a more formal release later this year.


Hadoop has become popular for mining large data sets. Plans call for Hadoop 0.23 to run across clusters of 6,000 machines, each with 16 or more cores, and to process 10,000 concurrent jobs, so users will get more work done, Murthy said in a presentation at the O'Reilly Strata conference in Santa Clara, Calif. Performance, he stressed, is something users "can never have enough of."

Other capabilities eyed for the upgrade include HDFS (Hadoop Distributed File System) federation as well as high availability for HDFS. MapReduce, Hadoop's programming model and software framework, will be improved as well. Called "Yarn," the MapReduce upgrade "is the first to take Hadoop and make it a much more general data processing system," Murthy said. Yarn is "a high-performance rewrite of MapReduce," with twice the throughput on large clusters, said Eric Baldeschwieler, Hortonworks CTO. Also, wire protocol compatibility planned for the 0.23 release will allow server and client upgrades to be done independently.
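For readers who have not worked with Hadoop, the MapReduce programming model mentioned above pairs a map function that emits key/value pairs with a reduce function that aggregates them. The following is a minimal sketch of the canonical word-count job written against the standard Hadoop MapReduce Java API; it illustrates the existing programming model rather than anything specific to the 0.23 or Yarn work, and the input and output paths are placeholders supplied on the command line. Exact API details vary somewhat between Hadoop releases.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Older releases use "new Job(conf, ...)" instead of Job.getInstance.
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework handles splitting the input, shuffling intermediate pairs to reducers, and rerunning failed tasks; Yarn's contribution in 0.23 is to separate that cluster resource management from the MapReduce programming model itself.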

Also at Strata on Thursday, MarkLogic and Hortonworks announced integration between Hortonworks Data Platform and MarkLogic's operational database platform. The integration will allow users to combine MapReduce with MarkLogic's real-time interactive analysis and indexing on a single, unified platform, MarkLogic said. The arrangement is intended to help users better accommodate big data workloads. MarkLogic will certify its Connector for Hadoop against Hortonworks Data Platform.

Copyright © 2012 IDG Communications, Inc.