In a world of real-time data, why are we still so fixated on Hadoop?
Hadoop, architected around batch processing, remains the poster child for big data, though its outsized reputation still outpaces actual adoption, as 451 Research survey data shows.
Companies that have yet to deploy Hadoop in earnest may want to wait. With Apache Spark and a number of other technologies (Storm, Kafka, and so on), we seem to be veering away from batch processing with Hadoop and toward a real-time future.
Batch wasn’t the point
Cloudera’s Doug Cutting is an incredibly intelligent person and a prolific open source developer. Hadoop, Lucene, and other essential tools of the big data trade bear his imprint.
While Cutting acknowledges the importance of real-time streaming technologies, he’s not a bit apologetic about batch-oriented Hadoop, as he tells me over email:
It wasn't as though Hadoop was architected around batch because we felt batch was best. Rather, batch, MapReduce in particular, was a natural first step because it was relatively easy to implement and provided great value. Before Hadoop, there was no way to store and process petabytes on commodity hardware using open source software. Hadoop's MapReduce provided folks with a big step in capability.
It’s hard to overstate exactly how critical this commodification of big data has been for the world. It’s not as if we didn’t store and analyze huge volumes of data before Hadoop. Rather, Hadoop gave us the ability to do so cheaply.
Hadoop, in sum, democratized big data.
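Cutting's point about MapReduce is easiest to see in miniature. The toy word count below is a sketch in plain Python, not Hadoop code: the two functions stand in for the map and reduce phases, with the framework's shuffle step folded into the reduce for brevity.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key and sum the counts,
    # as the framework would do across the cluster.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(map_phase(docs)))  # 'the' -> 3, 'fox' -> 2, ...
```

The scheme is trivially parallel: any number of machines can run the map phase over disjoint slices of the data, which is what made petabyte-scale processing on commodity hardware feasible.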
A shift toward streaming data?
Democratizing big data, however, isn’t the same as making it easy. As DataStax chief evangelist Patrick McFadin told me in an interview, getting value from enterprise data isn’t as simple as many like to pretend:
We’ve all heard the questions of ROI on storing and analyzing petabytes of data. Google, Yahoo, and Facebook make it sound amazing, and sadly, enterprises are looking at how to apply that analytics hammer to all the data. First: collect all the data. Second:… Third: Profit!
In between that data collection and the profit are a series of steps that can be cumbersome. As enterprises have sought to speed up their ability to analyze data in real time, new technologies have arisen to make this possible.
McFadin identifies key elements of this new big data stack. First, he says, there’s a queuing system, with Kafka, RabbitMQ, and Kinesis the likely suspects. Then there’s a stream processing layer, which might include Storm, Spark Streaming, or Samza. For high-speed storage, companies often turn to Cassandra, HBase, MongoDB, or possibly a relational database like MySQL.
Most interesting is where batch processing still fits. As McFadin tells me, “batch is now useful for processing after the fact” -- that is, for the likes of rollups and deeper analytics. Merging batch and real time in this way has come to be known as the “Lambda architecture,” which makes three layers work harmoniously together: batch, speed, and serving.
Batch, in other words, still has life in it.
Sending batch to the dustbin of history
Not everyone agrees. Zoomdata CEO and co-founder Justin Langseth, for example, labels Lambda an “unnecessary trade-off,” telling me, “There is now end-to-end tooling that can handle data from sourcing, to transport, to storage, to analysis and visualization,” without the need for batch.
Batch, in his mind, is an anachronism, a relic of big data days gone by:
Real-time data is obviously best handled as a stream. But it’s possible to stream historical data as well, just as your DVR can stream “Gone with the Wind” or last week’s “American Idol” to your TV. This distinction is important, as we at Zoomdata believe that analyzing data as a stream adds huge scalability and flexibility benefits, regardless of whether the data is real-time or historical.
Even more than scalability and flexibility benefits, however, may be the simplicity that comes from removing batch from one’s big data processes. As Langseth argues, “This massively simplifies big data architectures if you don’t need to worry about batch windows, recovering from batch process failures, and so on.”
Can’t we all just get along?
Not so fast, argues Cutting.
Rather than a wholesale swapping out of technologies like Hadoop, Cutting sees a world in which “streaming is real, but so is [Cloudera’s] Enterprise Data Hub.” In fact, he continues, “I don't think there will be any giant shift toward streaming. Rather, streaming now joins the suite of processing options that folks have at their disposal.”
More interesting still, Cutting expects the “big bang” of big data, in which the pace of innovation has been frenetic and, frankly, unwieldy for slow-moving enterprise IT, to subside as the industry settles on a few good approaches:
I suspect that major additions to the stack like Spark will become less frequent, so that over time, we'll standardize on a set of tools that provide the range of capabilities that most folks demand from their big data applications. Hadoop ignited a Cambrian explosion of related projects, but we'll likely enter a time of more normal evolution, as use of these technologies spreads through industries.
DataStax community manager Scott Hirleman agrees: “Batch isn’t going anywhere as there will always be a place for large-scale analytics with gobs of data.” He acknowledges “a ton of interest in streaming analytics,” but insists it’s “way too early to say” how this trend will impact big data plans.
In short, streaming analytics is all about “and,” not “or.” It’s a great complement to batch-oriented systems like Hadoop, but almost certainly won’t kill them off.