Last week's Strata+Hadoop World conference was subtitled "Make Data Work," and the sheer size of Hadoop's ecosystem ensures there are a number of ways to make that happen. Sometimes it's via a project, such as Spark, that transforms the way Hadoop is used, or via third-party integration such as Couchbase. But sometimes it's through initiatives like the Open Data Platform, which seems to divide the Hadoop community while professing to unite it.
Here are three of the more significant events from the show that illustrate the ways Hadoop is changing as an ecosystem, as a product, and as a project.
Spark marches on
Spark, the in-memory data processing system widely used with Hadoop, is fast becoming an ecosystem itself. Evidence of that surfaced during Strata when one of Spark's creators -- Matei Zaharia, now CTO of key Spark contributor Databricks -- discussed the inclusion of expanded support for the R language in Spark 1.4, due in June. R, which is used for statistical analysis, is soaring in popularity, with everyone from HP to Microsoft taking an interest.
The constantly expanding Spark toolset is also attracting growing third-party interest -- from the likes of Hadoop-on-demand provider Qubole to titans such as Intel. The latter is planning to invest in speeding up Spark on Intel hardware, mainly by leveraging Intel processor features like hardware-accelerated encryption.
Perhaps the biggest reason for Spark's growing draw is the escape hatch it, in conjunction with the YARN framework, provides from the speed and processing limits imposed by Hadoop's old-school MapReduce algorithms. Much of the criticism of Hadoop has revolved around its ties to a legacy, one-dimensional batch-processing model -- which Spark helps Hadoop users overcome.
The move away from MapReduce and toward YARN continues to bear new fruit. Hadoop vendor MapR and Mesosphere -- a developer of the Apache Mesos cluster-management system -- teamed up last week to unveil Myriad, which runs YARN jobs on clusters managed by Mesos, so that Mesos handles CPU and memory management for those jobs.
Skepticism rises over the ODP
Among all the announcements of new products and technology refinements, however, one bit of Hadoop horn-blowing inspired skepticism: the creation of the Open Data Platform (ODP) initiative, which aims to provide a vetted common-core Hadoop distribution on which vendors can base their products.
On the face of it, the initiative seems to have merit. Pivotal, one of ODP's founders, spoke about making sure ahead of time that all the pieces work well together, so those who create products for the Hadoop ecosystem don't have to go through an arduous certification process.
But Gartner analysts Nick Heudecker and Merv Adrian are not buying the idea that Hadoop distributions are fragmented enough to need that kind of horizontal unification. "This simply institutionalizes a dichotomy in favor of a few favored players," they wrote. "Who wants it? As Cloudera [a Hadoop vendor that is not part of the initiative] suggests, the paying members, and it's not clear who else."
Cloudera voiced its skepticism about the need for a unified base Hadoop distribution in explaining why it was staying out of the fray. "[If the longing for standardization in the Apache Hadoop ecosystem] were real, then you'd see a large collection of ISVs and customers leading the charge, not merely signing on. ... Cloudera's partner ecosystem includes 1,447 companies at the time of this writing. We're simply not hearing from them that they're confused about building applications on core Hadoop."
Cloudera added that "high bidders who don't understand how open source works" are more interested in solving their own strategic problems than those of Hadoop.
The Hadoop hits keep on comin'
There's been no slowing in Hadoop's uptake by third-party vendors. Hortonworks (another ODP participant) and Couchbase teamed up to offer two-way connectors between HDP and Couchbase Server, using Storm and Kafka to tie them together. Joint projects like this underscore each product's strength: Hadoop for batch processing a giant reservoir of data; Couchbase for high-speed processing of unstructured data.
Some of the additions make sense as an expansion of an existing Hadoop portfolio. (Never let it be said that Microsoft leaves an opportunity on the table, for instance; that company has added the real-time framework Storm to the technologies available through its Azure HDInsight offering.)
But others seem more inspired by a desire to be associated with a technology trend. To wit: Oracle's Big Data Platform update is intended to let Oracle customers use the company's visual analytics suite for Hadoop via its Big Data Discovery product; to stream data into Hadoop with GoldenGate for Big Data; and to connect its own SQL products to Hadoop. It's less about Hadoop than about Oracle trying to keep its product line relevant in the face of slumping revenues and mounting pressure from open source.
[Edited to include specific mention of the creators of Spark.]