Open Data Platform strives for Hadoop unity

A consortium of Hadoop vendors and users aims to define a core Hadoop and create a standardized base for future distributions

With any project as influential, widely adopted, and plain big as Hadoop, there inevitably comes a moment when those involved decide to standardize.

The Open Data Platform (ODP), a newly formed consortium of Hadoop vendors and consumers, is planning to do something about what it perceives as the "open but fragmented" nature of Hadoop, caused in part by commercial Hadoop vendors themselves.

The group's plan is to create a "tested reference core of Apache Hadoop, Apache Ambari, and related Apache source artifacts" -- a kind of standardized base for future Hadoop distributions, built directly from the open source core projects.

Sunny Madra, head of data products at Pivotal (one of the Hadoop vendors involved in the initiative), spoke about the parallels between Hadoop's current state and Linux's relationship with Unix. "The Unix ecosystem was quite fragmented; everyone had their own things going on, and you couldn't be sure if something ran in one place or the other. Then Linux comes around and standardizes that. So if you take a look at RHEL or CentOS or Oracle, you know that if you have something that runs on any one of those, it'll run on all of them."

Different Hadoop distributions remain dissimilar by design, though built from common sources. "One of the challenges of the Hadoop ecosystem today is that it's hard to certify," Madra said. "It's hard to say that if I build software and certify it to run on this company's particular distribution, it'll run anywhere else." A similar project exists for Linux; the Linux Standard Base was created to make the job of developing software for Linux less onerous by reducing unneeded differences between distributions.

The impact the initiative will have on the ways commercial distributions differentiate themselves likely won't become clear until ODP moves forward. ODP could make it possible for vendors like Pivotal and Hortonworks to differentiate their distributions more on the basis of the value-adds they provide than on how they assemble Hadoop’s underlying pieces. (In 2013, Hortonworks CEO Rob Bearden called out Hadoop fragmentation as a by-product of the distributions' mix of proprietary and open software.)

The Apache project Bigtop has attempted to address the need for a properly vetted, integration-tested Hadoop bundle, but ODP's aim is different. In a conference call, Shaun Connolly, vice president of corporate strategy for Hortonworks, described Bigtop and ODP as complementary, not competitive.

Bigtop "has historically provided a lot of the tooling for bringing together releases," Connolly said, while the ODP is meant to "[complement] that process in a downstream consumable version that can go broad across the [Hadoop] ecosystem."

Raymie Stata, CEO of Altiscale, added that ODP aims for "depth, rather than breadth," and hopes to bring "rigor" to the core used for ODP-derived distributions.

One fairly major name missing from the current list of ODP participants is Cloudera, whose Hadoop distribution is more akin to Pivotal's than Hortonworks' in the way it mixes open source savvy and business sense.

Scott Yara, president and head of products for Pivotal, could not speak about Cloudera's intentions, but did say, "We'd love to have them involved, as this is an advancement for the whole industry."