Hadoop needs Spark to light the way

Once, Hadoop and MapReduce were nearly synonymous, but today, Spark is the framework of choice for a new wave of big data practitioners


Hadoop has never been in more desperate need of a lift than now. Though Hadoop has been synonymous with MapReduce for years, even ardent backers like Cloudera are abandoning MapReduce for its sexier, cooler cousin, Apache Spark.  

With Spark 1.5 out now, the speedier processing engine keeps getting better, making stodgy old MapReduce look even worse.

Several years into the evolution of Hadoop, it's hard to overstate how much its future depends upon Spark's present. As Patrick McFadin, chief evangelist for Apache Cassandra at DataStax, insisted to me, "Spark doesn't need Hadoop to be successful, but the future of Hadoop depends on Spark." This is both good and bad news.

Making Hadoop work

Hadoop being synonymous with MapReduce used to be a good thing. Cloudera's Justin Kestelyn reminded me in an interview, "The terms 'MapReduce' and 'Hadoop' were interchangeable because Hadoop was just a kernel [composed of HDFS and MapReduce]." In the early days of modern big data, lumbering, batch-oriented MapReduce was good enough.

That was then. As MongoDB vice president Kelly Stirman told me recently:

The very definition of Hadoop is a work in progress. Each year, the list of technologies that comprise Hadoop expands. What started with MapReduce and HDFS is today more than 20 different projects, each with their own dependencies, release schedules, project teams, road maps, and interfaces.  

As Stirman implies, Hadoop's boiling mass of big data innovation brings both promise and peril -- promise because the constant tinkering serves to renew Hadoop's relevance, but peril because, as Stirman goes on, there doesn't seem to be a stable foundation upon which enterprises can build:

This number [of Hadoop sub-projects] will continue to increase, partially due to an expansion of the scope for Hadoop, and thus far there is a pattern of rewriting significant components rather than continuing to improve what is already there.
This complexity makes using Hadoop challenging. Developers have a lot of heavy lifting to do to build an application. Operations teams are throwing away playbooks to integrate rapidly changing components in order to build stable, performant, and secure environments.

This ecosystem complexity is one reason Hadoop adoption remains relatively light. The other is the complexity of MapReduce itself. McFadin explains:

I've run large Hadoop deployments and have written enough MapReduce to say I didn't want to do it again. Writing useful analytics with only a map and reduce command is a challenge and time consuming. Not only is the job writing slow, the framework requires a lot of servers to be performant.
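McFadin's complaint about phrasing analytics as bare map and reduce steps can be sketched in plain Python. This is a toy model of the MapReduce programming model, not the Hadoop API; the names (mapper, reducer, run_mapreduce) are illustrative. Even a word count demands explicit map, shuffle, and reduce phases:

```python
from collections import defaultdict

# Toy illustration of the MapReduce programming model (not Hadoop's API):
# every job must be expressed as a mapper emitting key-value pairs,
# a shuffle that groups values by key, and a reducer folding each group.

def mapper(line):
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    yield word, sum(counts)

def run_mapreduce(lines, mapper, reducer):
    shuffled = defaultdict(list)
    for line in lines:                        # map phase
        for key, value in mapper(line):
            shuffled[key].append(value)       # shuffle: group by key
    results = {}
    for key, values in shuffled.items():      # reduce phase
        for out_key, out_value in reducer(key, values):
            results[out_key] = out_value
    return results

counts = run_mapreduce(["to be or not to be"], mapper, reducer)
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```

Anything richer than a single aggregation -- a join, an iterative algorithm -- has to be chained as several such jobs, each one spilling to disk, which is where the time sink McFadin describes comes from.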

Small wonder, then, that industry folks have been clamoring for something more. In Spark, they seem to have found it.

A Spark of life

With the benefit of MapReduce hindsight, Spark brings a breath of fresh air to Hadoop. DataStax co-founder Jonathan Ellis informed me, "Spark is faster, easier to use, and more flexible than MapReduce."

No wonder even Cloudera is moving on from MapReduce -- part of the Hadoop project that its chief architect, Doug Cutting, co-created.

If big data is all about volume, velocity, and variety, Spark is much better equipped to handle it than (creaky, slow) batch-oriented MapReduce. This has led to huge community interest in Spark, the most actively developed Apache Software Foundation project ever.

Digging into Spark's superior utility over MapReduce, Ellis notes:

From a technical standpoint, the main win is that Spark is an optimistic framework instead of pessimistic. With MapReduce, every result in your pipeline is written to distributed storage, then read back off disk by the next stage. This means that if you have a failure part-way through, you don't need to recompute those intermediate steps and you can resume the calculation where it left off.
Spark instead records just the instructions needed to rebuild a pipeline from its inputs. If a failure does happen, it needs to start over from the beginning, but since failure mid-pipeline is relatively rare, it comes out way ahead on average.
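Ellis' optimistic-versus-pessimistic distinction can be sketched in plain Python. This is a toy contrast of the two recovery strategies, not Spark's actual implementation; the Lineage class and function names are invented for illustration. The pessimistic pipeline persists every intermediate stage, while the optimistic one records only the chain of transformations and replays it from the input if a failure occurs:

```python
# Toy contrast of the recovery strategies Ellis describes;
# "Lineage" is an illustrative name, not a Spark API.

def pessimistic_pipeline(data, stages):
    """MapReduce-style: materialize every intermediate result ("to disk")."""
    checkpoints = [data]
    for stage in stages:
        checkpoints.append([stage(x) for x in checkpoints[-1]])
    return checkpoints[-1], checkpoints   # recovery can resume mid-pipeline

class Lineage:
    """Spark-style: remember only the input plus the transformations."""
    def __init__(self, data):
        self.data, self.stages = data, []

    def map(self, fn):
        self.stages.append(fn)            # record the instruction, do no work yet
        return self

    def compute(self):
        result = self.data
        for stage in self.stages:
            result = [stage(x) for x in result]
        return result

    def recover(self):
        # Nothing intermediate was saved: replay the lineage from the input.
        return self.compute()

stages = [lambda x: x + 1, lambda x: x * 2]
slow, _ = pessimistic_pipeline([1, 2, 3], stages)
fast = Lineage([1, 2, 3]).map(stages[0]).map(stages[1]).compute()
# both yield [4, 6, 8]
```

The pessimistic version pays disk I/O on every stage so that recovery is cheap; the optimistic version skips that cost and accepts a full replay in the rare failure case -- which, as Ellis notes, comes out way ahead on average.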

The promise of Spark is that it can do all the things that Hadoop has been promising for years. As Stirman puts it, "For many people, Hadoop never lived up to all the hype, and the anticipation is that Spark brings people closer to what they hoped for."

Remembering Spark's place

Not that Spark will completely dominate big data going forward. Big data has always been bigger than Hadoop (MapReduce), as a Gartner survey of big data adoption indicates:

[Chart: Hadoop vs. big data adoption -- Gartner, Inc.]

To secure its place, Spark's promoters need to foster its potential to play well with others. Shaun Connolly, Hortonworks vice president of Corporate Strategy, told me over email, "Spark is on the rise because it's useful and embeddable with a range of technologies." Stirman urges:

Let's not overlook that Hadoop and Spark remain focused on analytical use cases. There are other technologies such as MongoDB, Postgres, Cassandra, and Redis that are focused on running the operational applications that power the business, where data is born. These databases are absolutely complementary to Hadoop and Spark, much in the same way relational databases have been to data warehouses in previous generations of data architectures.

In its quest to live up to the Hadoop hype, "Spark is trying to do a lot," as Ellis relates. But it's not clear, he goes on, that it "can be best of breed" in processing and graph databases and streaming and machine learning and the other things it's currently being developed to handle.

Hadoop's future may depend on Spark, in short, but Spark's future may depend on limiting mission creep.


Copyright © 2015 IDG Communications, Inc.