Spark 1.5 shows MapReduce the exit

The in-memory batch-processing framework sheds more JVM performance bottlenecks as a major Hadoop vendor eyes Spark as a full-blown replacement for the aging MapReduce

Apache Spark, the in-memory data processing framework nominally associated with Hadoop, has hit version 1.5.

This time around, improvements include new data-processing features and a speed boost, as well as changes designed to remove performance bottlenecks stemming from Spark's dependence on the JVM.

With one major Hadoop vendor preparing to ditch MapReduce for Spark, the pressure's on to speed up both Spark's native performance and its development.

Making sparks fly faster

A key component of the Spark 1.5 feature set is Project Tungsten, an initiative to improve Spark's performance by circumventing the limits of the JVM.

Many of Spark's speed limits are by-products of the JVM's garbage collection and memory management systems. Project Tungsten, pieces of which landed in Spark 1.4, rewrites key parts of Spark to sidestep those bottlenecks entirely and enable features such as direct use of CPU cache memory to further speed up processing. Databricks, Spark's main commercial sponsor, plans to eventually leverage GPU parallelism to pick up the pace even more, but those plans remain theoretical for now.

Other improvements in Spark sit a little closer to the user -- such as new machine learning functions and faster SQL operations on DataFrames, a database-table-like abstraction for data in Spark.

Also included is broader support for cluster-management technologies like Mesos and YARN; the former lets Spark run in a wider variety of contexts than within Hadoop alone. (Spark originated outside of Hadoop, after all.)

With or without Hadoop

Still, Hadoop remains a major arena for Spark, and signs point to Spark becoming a significant component within Hadoop. Case in point: Cloudera, vendor of a key Hadoop distribution, is pushing hard to make Spark -- not the venerable MapReduce -- the default engine for Hadoop workloads by way of its One Platform Initiative.

The project is ambitious and well-intentioned, but it'll take more than improvements to Spark alone. Spark can become a seamless player in the Hadoop world, but the real blocker is rewriting existing MapReduce applications to use Spark instead.

Improvements to Spark are already under way, thanks to support from Spark's backers (which, of late, include many major tech companies). But swapping MapReduce for Spark will require Cloudera and other Hadoop vendors to provide incentives that make the switch as painless as possible.

Copyright © 2015 IDG Communications, Inc.