In 2015, Hadoop no longer means MapReduce on HDFS. Instead, it refers to a whole ecosystem of technologies for working with “unstructured,” semi-structured, and structured data for complex processing at scale.
This also now includes streaming use cases, which can be massively parallelized or happen in “real time” (which today means many different things ... other than traditional RTOS-style “real time”). The streaming Spark crowd now likes to contrast itself with the Hadoop -- or more specifically, the YARN -- crowd.
My guest authors and I have written a lot about Spark, Storm, Tez, and now Flink. Those articles have included everything from descriptions and comparisons to news and notes. But we haven't yet answered the big question: Which Hadoop engine should you use? Now the truth can be revealed -- in four parts.
Truth No. 1: You should probably use Spark unless you're talking about streaming data (even then, Spark may work)
If you’re starting a new project, you probably want to begin with HDFS, YARN, and Spark. Mesos and Kubernetes are cool, but you want maturity -- and you’ll probably need a supported distribution like Hortonworks or Cloudera.
While you don’t need HDFS or YARN to run Spark, this is a well-trodden path. Moreover, you need a resource negotiator: maybe not if you have a couple of people running short jobs, but certainly if you run multiple jobs or plan to scale at all. Mesos and Kubernetes are rapidly maturing, but if you're deploying today and want major vendor support without having to pioneer too much, YARN will do the trick.
Why Spark? It's far faster than MapReduce and more widely used than Flink or any of the other satellite technologies. It can also handle many Storm use cases, except for “real time,” single-event analytics.
Spark also does its work in memory. Plus, Spark has mind share: The “data scientists” (aka mathematicians who write crummy Python code) will feel comfortable using Spark with the new Zeppelin or IPython Notebook. Spark lets you run code interactively via a REPL, which in turn makes debugging easier -- and that's really important in distributed environments.
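To make that concrete, here's the sort of thing you can try in the PySpark shell (a minimal sketch; the HDFS path and the “ERROR” filter are hypothetical stand-ins):

```python
# Launched via `pyspark`; the shell provides `sc`, a SparkContext.
# The input path is made up -- substitute your own HDFS file.
lines = sc.textFile("hdfs:///data/events.log")

# Filter once and cache the result in memory, so follow-up
# queries in the same session don't reread the file.
errors = lines.filter(lambda line: "ERROR" in line).cache()
print(errors.count())

# Poke at a few offending records right there in the REPL.
for record in errors.take(5):
    print(record)
```

Try that kind of iterate-and-inspect loop with hand-rolled MapReduce and you'll see why the notebook crowd gravitated to Spark.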
The only question about Spark involves Databricks (the primary developer), which has a very weak business model of selling you a cloudy version of IPython Notebook/Zeppelin that I found underwhelming. Fortunately, mass adoption by the major Hadoop vendors and a growing ecosystem should make this irrelevant.
Truth No. 2: In 2015 you may have to fall back to MapReduce or hand-written Tez
At scale-out, the problems are less technical than economic. If your working set (not “all of your data,” but the subset you’ll work with at once) won’t fit entirely into memory, Spark will “load” what it can and “overflow” the rest to disk.
That’s all well and good, but if the overflow portion is bigger than the part held in memory, most of your working set is effectively sitting on disk anyway. Guess what happens to Spark’s lauded performance? (Cue flushing noise.)
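You can see exactly where that tradeoff lives in the API: Spark lets you pick the spill behavior per RDD. A minimal sketch, assuming a PySpark job where the input path is hypothetical:

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="SpillSketch")

# Imagine this working set is larger than your executors' memory.
working_set = sc.textFile("hdfs:///data/big_working_set")

# MEMORY_AND_DISK keeps what fits in memory and spills the rest
# to disk. If most of the data lands in the spilled portion, you're
# back to doing disk I/O and the in-memory speedup evaporates.
working_set.persist(StorageLevel.MEMORY_AND_DISK)

print(working_set.count())
```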
Some things never change: If you can’t afford the memory to hold the whole job, you need to go to disk. If that's the case, then you have MapReduce and Tez as your probable first options.
If you must write this by hand (rather than have it as the engine behind something like Hive), you should probably choose MapReduce, which will make your job relatively slow but easier to write. Tez is another, potentially higher-performance option, but it's not for the fainthearted. One of Tez's authors describes it as “an assembly language,” so writing to it will be more work than your average business app developer is willing to invest. If you have to write Tez, maybe you can use Cascading or a similar tool set.
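For a sense of what “easier to write” means, here's hand-rolled MapReduce at its most minimal, via Hadoop Streaming -- a word-count sketch, assuming Python is present on the task nodes (the file names are mine):

```python
# mapper.py -- reads lines on stdin, emits one "word<TAB>1" per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)

# reducer.py -- Hadoop sorts mapper output by key, so identical words
# arrive consecutively; sum the count for each run of a word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

You'd submit the pair with the hadoop-streaming JAR (its path varies by distribution), passing -input, -output, -mapper, and -reducer. Two small programs plus a shuffle, for a word count -- and Tez sits at a lower level still.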
Truth No. 3: If the job involves “real time” streaming, you can probably use Spark, but if the job involves “low latency,” you probably need Storm
Streaming is a “new” (two-year-old) use case for Hadoop. Instead of copying big, fat blobs of data onto disk and analyzing them with Hive or your own MapReduce/Tez code, you might instead analyze events as they come in. Cue the “Internet of things” or “complex event processing” or whatever fancy talk you want to use around processing messages.
Spark doesn’t really do messages one at a time; instead, it creates “microbatches” (essentially queuing and buffering them). This is probably fine for most use cases, especially ones you’d be willing to put through the JVM. There is, however, a point where Storm is better. It's more intuitive, and by the way, it can process a single event at a time (or a microbatch, if you prefer).
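The microbatch model is visible right in the API: a Spark Streaming context is built around a fixed batch interval, and everything that arrives within that window is processed as one small batch. A sketch, where the two-second interval and the localhost socket source are my own choices:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="MicrobatchSketch")

# Every stream below is chopped into two-second microbatches;
# events are queued and buffered until each window closes.
ssc = StreamingContext(sc, 2)

# Hypothetical source: a text stream on a local socket.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```

Storm's core API, by contrast, hands your code each tuple as it arrives -- no waiting for a window to close.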
Another consideration: Storm has a head start in streaming. If you're going to stream and nothing else, you might go with Storm. It's a well-trodden path, with lots of examples and plenty of people who have done it. Life will be easier for you. If you are using Spark everywhere else and think “microbatches” are probably “real time” with sufficiently low latency, then you may decide that one API is better than two.
Truth No. 4: At the moment Flink has more promise than practical experience
Overall I like the idea of Flink. Unifying messaging and batching while enjoying smarter memory management is a nice idea. That said, I suspect Flink will continue to be a niche technology that people use when Spark or Storm doesn't work out.
For example, if you have low-latency requirements and really know your stuff, you might otherwise have to implement your own memory management or fault-tolerance algorithms; Flink, which bakes those in, would be a very practical choice here.
That said, Spark and Storm are mature and widely deployed. Unless you have a very specific need, you're not only choosing a tool, but selecting an ecosystem. Spark has several analytics tools (answers to IPython Notebook) that will integrate with others. Storm boasts many use cases, along with cut-and-paste code you can find around the Net. Flink is sort of like SWT to Java Swing back in the day.
Spark is also adopting many of the good ideas from Flink, and while Flink is a divergent path, it might one day become mainstream. That day, however, is not today.
Check out other options
Good gosh and all praise -- or, possibly, curses -- to the FSM because there are many more Hadoop engines to choose from.
Quite a few have a single corporate sponsor, which developed the solution in-house. These upstanding companies have shown what good open source community citizens they are by dumping their code on us while not explaining how, when, or why the hell to use it.
Some are niche players or would-be upstarts. That is why we have Javi Roman to catalog them all on what is quickly becoming my favorite page on the whole Internet. Generally, you'll go trail hunting right about the time one of the more popular paths fails to get you where you want to go. Bring your first aid kit.