Hadoop, we hardly knew ye

Once, Hadoop was synonymous with big data. Now, Spark and cloud services are riding high as Hadoop heads off into the sunset

Hadoop, we hardly knew ye
Credit: Thinkstock

It wasn’t long ago that Hadoop was destined to be the Next Big Thing, driving the big data movement into every enterprise. Now there are clear signs that we’ve reached “peak Hadoop,” as Ovum analyst Tony Baer styles it. But the clearest indicator of all may simply be that “Hadoop” doesn’t actually have any Hadoop left in it.

Or, as InfoWorld’s Andrew Oliver says it, “The biggest thing you need to know about Hadoop is that it isn’t Hadoop anymore.”

Nowhere is this more true than in newfangled cloud workloads, which eschew Hadoop for fancier options like Spark. Indeed, as with so much else in enterprise IT, the cloud killed Hadoop. Or perhaps Hadoop, by moving too fast, killed Hadoop. Let me explain.

Is Hadoop and the cloud a thing of the past?

The fall of Hadoop has not been total, to be sure. As Baer notes, Hadoop’s “data management capabilities are not yet being matched by Spark or other fit-for-purpose big data cloud services.” Furthermore, as Oliver describes, “Even when you’re not using Hadoop because you’re focused on in-memory, real-time analytics with Spark, you still may end up using pieces of Hadoop here and there.”

By and large, however, Hadoop is looking decidedly retro in these cloudy days. Even the Hadoop vendors seem to have moved on. Sure, Cloudera still tells the world that Cloudera Enterprise is “powered by Apache Hadoop.” But if you look at the components of its cloud architecture, it’s not Hadoop all the way down. IBM, for its part, still runs Hadoop under the hood of its BigInsights product line, but if you use its sexier new Watson Data Platform, Hadoop is missing in action.

The reason? Cloud, of course.

As such, Baer is spot on to argue, “The fact that IBM is creating a cloud-based big data collaboration hub is not necessarily a question of Spark vs. Hadoop, but cloud vs. Hadoop.” Hadoop still has brand relevance as a marketing buzzword that signifies “big data,” but its component parts (HDFS, MapReduce, and YARN) are largely cast aside for newer and speedier cloud-friendly alternatives as applications increasingly inhabit the cloud.

Change is constant, but should it be?

Which is exactly as it should be, argues Hadoop creator Doug Cutting. Though Cutting has pooh-poohed the notion that Hadoop has been replaced by Spark or has lost its relevance, he also recognizes the strength that comes from software evolution. Commenting on someone’s observation that Cloudera’s cloud stack no longer has any Hadoop components in it, Cutting tweeted: “Proof that an open source platform evolves and improves more rapidly. Entire stack replacement in a decade! Wonderful to see.”

It’s easy to overlook what a powerful statement this is. If Cutting were a typical enterprise software vendor, not only would he not embrace the implicit accusation that his Hadoop baby is ugly (requiring replacement), but also he’d do everything possible to lock customers into his product. Software vendors get away with selling Soviet-era technology all the time, even as the market sweeps past them. Customers locked into long-term contracts simply can’t or don’t want to move as quickly as the market does.

For an open source project like Hadoop, however, there is no inhibition to evolution. In fact, the opposite is true: Sometimes the biggest problem with open source is that it moves far too quickly for the market to digest.

We’ve seen this to some extent with Hadoop, ironically. A year and a half ago, Gartner called out Hadoop adoption as “fairly anemic,” despite its outsized media attention. Other big data infrastructure quickly marched past it, including Spark, MongoDB, Cassandra, Kafka, and more.

Yet there’s a concern buried in this technological progress. One of the causes of Hadoop’s market adoption anemia has been its complexity. Hadoop skills have always fallen well short of Hadoop demand. Such complexity is arguably exacerbated by the fast-paced evolution of the big data stack. Yes, some of the component parts (like Spark) are easier to use, but not if they must be combined with an ever-changing assortment of other component parts.

In this way, we might have been better off with a longer shelf life for Hadoop, as we’ve had with Linux. Yes, in Linux the modules are constantly changing. But there’s a system-level fidelity that has enabled a “Linux admin” to actually mean something over decades, whereas keeping up with the various big data projects is much more difficult. In short, rapid Hadoop evolution is both testament to its flexibility and cause for concern.

From CIO: 8 Free Online Courses to Grow Your Tech Skills