Apache Flink: New Hadoop contender squares off against Spark

A flexible replacement for Hadoop MapReduce that supports real-time and batch processing, Flink offers advantages over Spark

The quest to replace Hadoop’s aging MapReduce is a bit like waiting for buses in Britain. You wait a really long time, then a bunch come along at once. We already have Tez and Spark in the mix, but there’s a new contender for the heart of Hadoop, and it comes from Europe: Apache Flink (German for "quick" or "nimble").

Flink sprang from Berlin’s Technical University, and it was known as Stratosphere before it joined Apache’s incubator program. It’s a replacement for Hadoop MapReduce that works in both batch and streaming modes, eliminating separate map and reduce jobs in favor of a directed graph approach that leverages in-memory storage for massive performance gains.

Some of you may have read the last paragraph and thought, "Hang on, isn’t that Apache Spark?" You'd be right; Spark and Flink have a lot in common. Here's some Scala that shows a simple word count operation in Flink:

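import org.apache.flink.api.scala._  // implicit type information for the Scala API

// `text` is assumed to be a DataSet[String], one element per line of input
case class Word(word: String, frequency: Int)
val counts = text
  .flatMap { line => line.split(" ").map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")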

Here's a Scala implementation of word count for Spark:

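import org.apache.spark.SparkContext._  // pair-RDD implicits; only required before Spark 1.3
// `text` is assumed to be an RDD[String], one element per line of input
val counts = text
  .flatMap(line => line.split(" ")).map(word => (word, 1))
  .reduceByKey { case (x, y) => x + y }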

As you can see, while there are some differences in syntactic sugar, the APIs are rather similar. I'm a fan of Flink's use of case classes over Spark's tuple-based PairRDD construction, but there's not much in it. Given that Apache Spark is now a stable technology used in many enterprises across the world, another data processing engine seems superfluous. Why should we care about Flink?

The reason Flink may be important lies in the dirty little secret at the heart of Spark Streaming, one you may have come across in a production setting: Instead of being a pure stream-processing engine, it is in fact a fast-batch operation working on a small part of incoming data during a unit of time (known in Spark documentation as "micro-batching"). For many applications, this is not an issue, but where low latency is required (such as financial systems and real-time ad auctions) every millisecond lost can lead to monetary consequences.
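To make the latency cost of micro-batching concrete, here is a minimal sketch in plain Scala. It is a toy model, not Spark or Flink code: the `Event` class, the function names, and the assumption that a batch is processed the instant its interval closes are all invented for illustration.

```scala
// Toy model only: no Spark or Flink APIs are used here.
object MicroBatchSketch {
  // Hypothetical event type: an id plus its arrival time in milliseconds.
  final case class Event(id: Int, arrivalMs: Long)

  // Micro-batching: an event is held until the batch interval it arrived in
  // closes, so its added wait is the time remaining in that interval.
  def microBatchLatencies(events: Seq[Event], batchIntervalMs: Long): Seq[Long] =
    events.map { e =>
      val batchCloseMs = ((e.arrivalMs / batchIntervalMs) + 1) * batchIntervalMs
      batchCloseMs - e.arrivalMs
    }

  // Per-event streaming: each event is handed to the operator on arrival,
  // so the added scheduling wait in this model is zero.
  def perEventLatencies(events: Seq[Event]): Seq[Long] =
    events.map(_ => 0L)
}
```

With a one-second batch interval, an event that arrives 10 ms into the interval waits 990 ms in this model before any processing starts. That wait is the overhead a pure streaming engine is designed to avoid.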

Flink flips this on its head. Whereas Spark is a batch processing framework that can approximate stream processing, Flink is primarily a stream processing framework that can look like a batch processor. Immediately you get the benefit of being able to use the same algorithms in both streaming and batch modes (exactly as you do in Spark), but you no longer have to turn to a technology like Apache Storm if you require low-latency responsiveness. You get all you need in one framework, without the overhead of programming and maintaining a separate cluster with a different API.

Also, Flink borrows from the crusty-but-still-has-a-lot-to-teach-us RDBMS to bring us an aggressive optimization engine. Similar to a SQL database's query planner, the Flink optimizer analyzes the code submitted to the cluster and produces what it thinks is the best pipeline for running on that particular setup (which may be different if the cluster is larger or smaller).
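As a rough illustration of why the chosen plan can change with cluster size, consider a toy cost-based choice between two ways of executing a join. This is not Flink's optimizer or its cost model: the strategy names, the `choose` function, and the "records shipped over the network" cost are all assumptions made up for this sketch.

```scala
// Toy planner only: Flink's real optimizer and cost model are far richer.
object PlanChoiceSketch {
  sealed trait JoinStrategy
  case object BroadcastSmallSide extends JoinStrategy
  case object RepartitionBothSides extends JoinStrategy

  // Hypothetical cost: the number of records shipped over the network.
  def choose(smallSide: Long, largeSide: Long, parallelism: Int): JoinStrategy = {
    val broadcastCost = smallSide * parallelism  // copy the small input to every worker
    val repartitionCost = smallSide + largeSide  // shuffle both inputs once
    if (broadcastCost < repartitionCost) BroadcastSmallSide else RepartitionBothSides
  }
}
```

Under these assumptions, joining 1,000 records against 1,000,000 favors broadcasting on 10 workers but repartitioning on 2,000 workers, which is the kind of setup-dependent decision a query planner makes on your behalf.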

For extra speed, it allows iterative processing to take place on the same nodes rather than having the cluster run each iteration independently. With a bit of reworking of your code to give the optimizer some hints, it can increase performance even further by performing delta iterations only on parts of your data set that are changing (in some cases offering a five-fold speed increase over Flink’s standard iterative process).
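The idea behind delta iterations can be sketched in plain Scala without any Flink APIs. The example below labels connected components by propagating the smallest vertex id, first by re-evaluating every vertex in every round (bulk), then by re-evaluating only the neighbors of vertices whose label changed in the previous round (delta). The object name, the graph representation, and the evaluation counter are assumptions for illustration, and the graph is assumed to be undirected with every vertex present as a key.

```scala
// Toy model only: plain Scala collections, not Flink's iteration operators.
object DeltaIterationSketch {
  type Graph = Map[Int, Set[Int]] // undirected adjacency sets

  // Bulk iteration: every vertex is re-evaluated each round until nothing changes.
  // Returns the final labels and the number of vertex evaluations performed.
  def bulk(graph: Graph): (Map[Int, Int], Int) = {
    var labels: Map[Int, Int] = graph.keys.map(v => v -> v).toMap
    var evaluations = 0
    var changed = true
    while (changed) {
      val current = labels
      val next = current.map { case (v, label) =>
        evaluations += 1
        v -> (graph(v).map(u => current(u)) + label).min
      }
      changed = next != current
      labels = next
    }
    (labels, evaluations)
  }

  // Delta iteration: only neighbors of vertices that changed in the previous
  // round (the workset) are re-evaluated; everything else is left alone.
  def delta(graph: Graph): (Map[Int, Int], Int) = {
    var labels: Map[Int, Int] = graph.keys.map(v => v -> v).toMap
    var evaluations = 0
    var workset: Set[Int] = graph.keySet
    while (workset.nonEmpty) {
      val current = labels
      val candidates = workset.flatMap(v => graph(v))
      val updates = candidates.flatMap { v =>
        evaluations += 1
        val best = (graph(v).map(u => current(u)) + current(v)).min
        if (best < current(v)) Some(v -> best) else None
      }
      labels = current ++ updates
      workset = updates.map(_._1)
    }
    (labels, evaluations)
  }
}
```

Both functions reach the same labels, but the delta version evaluates fewer vertices as the computation converges, because settled regions of the graph drop out of the workset.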

Flink has a few more tricks up its sleeve. It is built to be a good YARN citizen (which Spark has not quite achieved yet), and it can run existing MapReduce jobs directly on its execution engine, providing an incremental upgrade path that will be attractive to organizations already heavily invested in MapReduce and loath to start from scratch on a new platform. Flink even works on Hortonworks’ Tez runtime, where it sacrifices some performance for the scalability that Tez can provide.

In addition, Flink takes the approach that a cluster should manage itself rather than require a heavy dose of user tuning. To this end, it has its own memory management system, separate from Java’s garbage collector. While this is normally Something You Shouldn’t Do, high-performance clustered computing changes the rules somewhat. By managing memory explicitly, Flink almost eliminates the memory spikes you often see on Spark clusters. To aid in debugging, Flink supplies its equivalent of a SQL EXPLAIN command. You can easily get the cluster to dump a JSON representation of the pipelines it has constructed for your job, and a built-in HTML viewer gives you a quick overview of the optimizations Flink has performed. At times, this provides better transparency than Spark does.

But let’s not count out Spark yet. Flink is still an incubating Apache project. It has only been tested in smaller installations of up to 200 nodes and has limited production deployment at this time (although it’s said to be in testing at Spotify). Spark has a large lead when it comes to mature machine learning and graph processing libraries, although Flink’s maintainers are working on their own equivalents of MLlib and GraphX. Flink currently lacks a Python API, and most important, it does not have a REPL (read-eval-print loop), so it's less attractive to data scientists -- though again, these deficiencies have been recognized and are being remedied. I’d bet on both a REPL and Python support arriving before the end of 2015.

Flink seems to be a project that has definite promise. If you’re currently using Spark, it might be worthwhile standing up a Flink cluster for evaluation purposes (especially if you’re using Spark Streaming). However, I wonder whether all of the "next-generation MapReduce" communities (including Tez and Impala along with Spark and Flink) might be better served if there were less duplication of effort and more cooperation among the groups. Can’t we all just get along?

Copyright © 2015 IDG Communications, Inc.