Hadoop remains the poster child for big data, yet other big data technologies keep stealing its thunder (and revenue). While it's true, as Gartner analyst Merv Adrian highlights, that the Hadoop market is "healthy, and growing, and has an enormous amount of upside adoption potential," it's equally true that "Hadoop is a very small market today."
The reason? In part, because enterprises are still trying to figure out what Hadoop is.
Nonetheless, as Hadoop pioneer and Cloudera vice president Charles Zedlewski told me in an interview, "Hadoop's core identity is that of a data management platform," and that identity has remained consistent, even if the underlying projects and subprojects that feed it change constantly.
In fact, in a surprising declaration, Zedlewski insists that rival Hadoop distributions are far more alike than competing Linux distributions. This should give enterprise IT hope.
I love you, you're perfect -- now change
One of Hadoop's biggest challenges -- and one of its biggest selling points -- is its versatility. I've sometimes compared it to Dr. Seuss' thneed from "The Lorax," "a Fine-Something-That-All-People-Need!" with vendors marketing it broadly: "You can use it for carpets. For pillows! For sheets! Or curtains! Or covers for bicycle seats!"
In the case of Hadoop, this might not be too far off the mark.
For example, strip Hadoop to its essentials, and it has always been about distributed storage and compute, with HDFS and MapReduce filling those needs. Both are absolutely essential to Hadoop's very definition.
Except when they're not -- when batch-oriented MapReduce proved insufficiently fast for many use cases, enterprises began swapping it out for Apache Spark. Was it still Hadoop? Of course, as Zedlewski told me:
Hadoop will always be a thing that acquires, stores, processes, analyzes and serves data. That's been true throughout its existence and hasn't changed much at all. Essential and non-essential components get improved, upgraded and swapped out over time but that doesn't change Hadoop's identity. Implying otherwise confuses technical design choices with users and market. The former changes all the time, the latter doesn't.
Indeed, he continues, "it would be strange to forswear the possibility that any component of the platform can be improved over time."
There are, in short, no "sacred cows" when it comes to the Hadoop technology stack. There is only the driving need to be the best data management platform for enterprises suffering from Gartner's three V's of big data: volume, velocity, and variety.
Like Linux, but different
As Zedlewski pointed out to me, this isn't much different than how Linux works: "Linux didn't stop being an operating system with the introduction of the XFS filesystem or with the migration from init to upstart." Components change all the time. Linux remains.
Today, he posits, variation between Hadoop distributions is actually less than we see in Linux land. ("There's more variation among the Red Hat, Ubuntu, and CoreOS kernels than there is among the core components of the various Hadoop distributions.") I found this a bit surprising given Hortonworks' noise earlier this year that Hadoop standardization was imperative, as it launched the Open Data Platform initiative.
While there was considerable blowback on ODP, with Gartner deriding it as "clearly for vendors, by vendors," surely there must be a germ of truth in the need for standardization?
Nope -- according to Zedlewski, not only do Hadoop distributions cohere more than their Linux peers, but they also stick together better than the relational database vendors, who all sing the same SQL tune. In the RDBMS world, "different vendors handle some datatypes differently. That's a serious pain for customers with no offsetting benefit. By contrast all the major Hadoop distributions work off a common catalog."
This despite there being no Linus Torvalds to shepherd the project (because, in part, there is no "Hadoop project"). This despite no copyleft GPL license to force everything together. Somehow it works.
Who will be the Ubuntu of Hadoop?
In the midst of our conversation on standardization, Zedlewski highlighted how the Linux distributions differ, and what this might portend for Hadoop:
I think it's really hard to diverge too substantially because distributions exist to serve users who have data management needs. So the market guides distributions to keep improving in a common direction. The divergence happens when the core platform gets asked to serve a number of competing missions. Red Hat, Ubuntu and CoreOS are interesting examples of what happens when a common platform gets dragged into three relatively different missions (enterprise server, desktop and scale-out respectively).
Today enterprises turn to Cloudera, Hortonworks, and MapR, the three dominant distributions, for similar requirements. But will that always be the case?
For example, MapR diverges from Cloudera and Hortonworks by introducing a proprietary file system (to replace HDFS), promising significant performance improvements. Cloudera offers expanded management, making it easier to manage Hadoop clusters. And Hortonworks positions itself as 100 percent open source, all the time.
Each of these is a slight variation on the Hadoop theme and may lead the distributions down different customer paths over time, though today they largely sell to the same crowd.
But it will be fascinating to watch Hadoop change in the hands of each of these vendors. The core Hadoop identity will remain fixed, but the individual components will not, across the Hadoop ecosystem and within particular distributions.
This is what makes Hadoop so exciting: its willingness to change to meet evolving market needs. But what will make the Hadoop market equally exciting is watching the vendors differentiate themselves at the marketing level, which will, in turn, shape the technology they ship.
One of them could become Hadoop's Red Hat, another its Ubuntu, and another its CoreOS. There are very different financial fortunes tied to each of those identities. Watch this space.