Hadoop keeps marching on, somehow

Deployments of Hadoop in production have been slower to arrive than many thought, but Hadoop job growth data shows that enterprises are keeping the faith

Photo credit: Michael Coghlan (Creative Commons BY or BY-SA)

Nothing seems to stop the Hadoop train. Despite seemingly "anemic interest," complex setup, and a crazy quilt of different projects that ostensibly comprise the unified "thing" that is Hadoop, demand for Hadoop talent marches stolidly on.


Why hasn't the market moved on to something better? Perhaps the answer, as Google Cloud solutions consultant Sandeep Parikh puts it, is that Hadoop offers a "broad framework for enabling distributed compute," one whose breadth ensures its relevance for an equally broad class of big data needs.

A mishmash of "icing"

Among Hadoop's manifold problems, perhaps the most foundational is its very definition. Wikipedia, for example, says this: "Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware."

The problem, however, is that "Hadoop" covers an ever-increasing array of subprojects. Back in the day, Ian Murdock (then CEO of Progeny Linux) told me that "Linux is Linux is Linux." While Red Hat and (Novell's) Suse might talk up how different they were, the reality is that they were very similar -- and still are.

Not so Hadoop. An enterprise buyer in the market for Hadoop will discover it's pretty different depending on whether they're looking at Cloudera, Hortonworks, or MapR, the three dominant distributions.

Plus, it's only getting worse, as Gartner analyst Merv Adrian points out, with significant consequences:

"This year, the expansion process has continued -- and it does matter. Why? Because it shows how differentiation and positioning are shaping the evolution of the commercially supported stack -- which will help mainstream buyers decide what direction to go in making choices. Commercially supported open source software is now chosen for production applications. And integration, cross-porting, backporting, and supporting an ever-increasing stack of projects that they do not 'own' or exclusively develop is a cost for distributors -- who are not charging more as they add more projects to the stack."

How this hodgepodge affects customers depends on how they view it. Parikh, for example, highlights that everything in Hadoop beyond distributed compute is essentially "icing," which "is kind of a mess with diff[erent] projects supporting diff[erent] versions, but they're all basically storing and computing."

Some of the complexity derives from a community that "keeps experimenting," as Cloudera co-founder Mike Olson describes it. Things may become more complex, but they also potentially become much more powerful. Using Spark as an example, Olson continues: "That's why the meme of 'Spark kills Hadoop' is so wrong: Spark has added new capabilities to Hadoop, making it stronger."

As such, the nuances of the icing around Hadoop may differ from distribution to distribution, but the core meaning of Hadoop remains the same: "a framework for using massive amounts of data across a distributed network," as Gartner styles it.

Even so, for newbies, Hadoop can be a maze. Yet newbies keep coming.

"Anemic interest"? Maybe not

Gartner cited lukewarm interest in Hadoop in a recent report, but its survey data may not tell the complete story. That survey data is somewhat damning, with 54 percent of those surveyed having no plans to touch Hadoop, and 26 percent currently deploying it (in production or pilot).

Part of the problem, insists Gartner analyst Nick Heudecker, comes down to overkill: "Hadoop [is] overkill for the problems the business[es surveyed] face, implying the opportunity costs of implementing Hadoop [are] too high relative to the expected benefit."

Yet jobs data for Hadoop paints a different picture.

Even compared to other hot big data technologies, Hadoop continues to see blazingly hot demand within the enterprise:

[Chart: Hadoop jobs. Source: Indeed.com]

That's absolute growth in Hadoop jobs (compared to MongoDB and Apache Cassandra). But when we look at relative job demand, Hadoop really stands out:

[Chart: Hadoop jobs, relative growth. Source: Indeed.com]

I'm not as familiar with Cassandra, but I know MongoDB well, having worked there for a few years. Tens of thousands of companies run MongoDB in production. It enjoys millions of downloads.

Yet Hadoop job growth significantly outpaces it. If interest in MongoDB and Cassandra is sky high, and Hadoop job growth is higher still, then the supposedly "anemic interest" in Hadoop is not very anemic at all.

Because it works

For those starting out, Hadoop looks like the promised land. As Ewan Leith, a data architect at RealityMine, notes, "YARN + HDFS are amazing architectural building blocks [that] let you do almost anything with distributed data." That's the hook.

But that's only the beginning, as Val Bercovici, member of the office of the CTO at NetApp, posits: We're "scratching [the] surface of [Hadoop's] potential."

That potential is buried in the messiness of Hadoop's ever-changing cast of projects. As Hadoop founder Doug Cutting told me, "It's an evolving ecosystem, with a fuzzy, changing, ill-defined edge. Get used to it. It will keep mutating."

While that complexity may bedevil some, it points to a very bright, expansive future for Hadoop. In the meantime, vendors are scrambling to smooth out the path to adoption. They happen to be smoothing in diverse directions at times, but that's a plus: in the words of Twitter's open source guru Chris Aniszczyk, "competing vendors improv[e] the technology so it doesn't [get] stale."

In short, Hadoop is complex. But that's part of its charm. Enterprises understand this and aren't dissuaded from trying it out, as the jobs data suggests.