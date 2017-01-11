Apache Beam unifies batch and streaming for big data

Beam offers a high-level API and programming paradigm for streaming and batch systems

InfoWorld

Apache Beam, a unified programming model for both batch and streaming data, has graduated from the Apache Incubator to become a top-level Apache project.

Aside from becoming another full-fledged widget in the ever-expanding Apache tool belt of big-data processing software, Beam addresses ease of use and dev-friendly abstraction, rather than simply offering raw speed or a wider array of included processing algorithms.

Beam us up!

Beam provides a single programming model for creating batch and stream processing jobs (the name is a hybrid of "batch" and "stream"), and it offers a layer of abstraction for dispatching to various engines used to run the jobs. The project originated at Google, where it's currently a service called GCD (Google Cloud Dataflow). Beam uses the same API as GCD, and it can use GCD as an execution engine, along with Apache Spark, Apache Flink (a stream processing engine with a highly memory-efficient design), and now Apache Apex (another stream engine for working closely with Hadoop deployments).

The Beam model involves five components: the pipeline (the pathway for data through the program); the "PCollections," or data streams themselves; the transforms, for processing data; the sources and sinks, where data is fetched and eventually sent; and the "runners," or components that allow the whole thing to be executed on an engine.
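To make the five parts concrete, here is a minimal sketch in plain Python. It is a toy stand-in and not the Beam SDK: the `Pipeline` and `LocalRunner` classes and the helper functions are hypothetical names invented for illustration, and an ordinary Python iterable plays the role of a PCollection.

```python
# Toy illustration of the five parts of the Beam model. NOT the Beam SDK:
# every class and function name here is a hypothetical stand-in.
from typing import Callable, Iterable, List, Optional

Collection = Iterable[str]                      # plays the role of a PCollection
Transform = Callable[[Collection], Collection]  # processes one collection into another


class Pipeline:
    """The pathway for data: one source, ordered transforms, one sink."""

    def __init__(self, source: Callable[[], Collection]) -> None:
        self.source = source
        self.transforms: List[Transform] = []
        self.sink: Optional[Callable[[Collection], None]] = None

    def apply(self, transform: Transform) -> "Pipeline":
        self.transforms.append(transform)
        return self

    def write_to(self, sink: Callable[[Collection], None]) -> "Pipeline":
        self.sink = sink
        return self


class LocalRunner:
    """The runner: executes a pipeline on an 'engine' (here, this process)."""

    def run(self, pipeline: Pipeline) -> None:
        if pipeline.sink is None:
            raise ValueError("pipeline has no sink")
        data = pipeline.source()            # fetch from the source
        for transform in pipeline.transforms:
            data = transform(data)          # apply each transform in order
        pipeline.sink(data)                 # send the result to the sink


def read_lines() -> Collection:             # source
    return ["beam unifies batch", "and streaming"]


def split_words(lines: Collection) -> Collection:   # transform
    for line in lines:
        yield from line.split()


def to_upper(words: Collection) -> Collection:      # transform
    return (word.upper() for word in words)


results: List[str] = []


def collect(items: Collection) -> None:     # sink
    results.extend(items)


pipeline = Pipeline(read_lines).apply(split_words).apply(to_upper).write_to(collect)
LocalRunner().run(pipeline)
print(results)  # ['BEAM', 'UNIFIES', 'BATCH', 'AND', 'STREAMING']
```

The pipeline object only describes the work. Nothing runs until a runner is handed the description, which is the separation the five-part model is built around.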

Apache says it separated concerns in this fashion so that Beam can "easily and intuitively express data processing pipelines for everything from simple batch-based data ingestion to complex event-time-based stream processing." This is in line with efforts to rework tools like Apache Spark to support stream and batch processing within the same product and with similar programming models. In theory, it's one fewer concept for prospective developers to wrap their heads around, but that presumes Beam is used in lieu of Spark or other frameworks, when it's more likely it'll be used -- at first -- to augment them.
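As a rough illustration of what "event-time-based" means, the sketch below groups records by the time they occurred rather than the order in which they arrived. It is plain Python and not Beam code: the function name, the 60-second window size, and the sample data are all assumptions made for the example, and real stream processing also involves late-data handling that is not modeled here.

```python
# Toy sketch of event-time windowing. NOT the Beam API: the function name,
# window size, and sample events are assumptions for illustration only.
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Event = Tuple[int, int]  # (event_time_in_seconds, value)


def sum_by_event_time_window(events: Iterable[Event],
                             window_seconds: int = 60) -> Dict[int, int]:
    """Sum values per fixed window, keyed by the window's start time."""
    totals: Dict[int, int] = defaultdict(int)
    for event_time, value in events:
        # Assign by when the event happened, not by when it was seen.
        window_start = event_time - (event_time % window_seconds)
        totals[window_start] += value
    return dict(totals)


# Events listed in arrival order; note that (59, 2) shows up late.
arrivals = [(5, 1), (65, 10), (59, 2), (120, 100), (61, 20)]

batch_result = sum_by_event_time_window(arrivals)         # bounded list
stream_result = sum_by_event_time_window(iter(arrivals))  # one-at-a-time iterator

print(batch_result)   # {0: 3, 60: 30, 120: 100}
print(stream_result)  # {0: 3, 60: 30, 120: 100}
```

The same function handles both the bounded list and the iterator, which hints at why one programming model can cover batch and streaming.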

Hands off

One possible drawback to Beam's approach is that while the layers of abstraction in the product make operations easier, they also put the developer at a distance from the underlying layers. A good case in point is Beam's current level of integration with Apache Spark: the Spark runner doesn't yet use Spark's more recent DataFrames system, and thus may not take advantage of the optimizations it can provide. But this isn't a conceptual flaw; it's an issue with the implementation, which can be addressed in time.

The big payoff of using Beam, as noted by Ian Pointer in his discussion of Beam in early 2016, is that it makes migrations between processing systems less of a headache. Likewise, Apache says Beam "cleanly [separates] the user's processing logic from details of the underlying engine."
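The migration argument can be sketched in a few lines of plain Python. This is not Beam code: `SequentialRunner` and `ThreadedRunner` are hypothetical stand-ins for real engines, and `word_lengths` stands in for a user's processing logic. The point is only that the logic does not change when the engine underneath it does.

```python
# Toy sketch of engine-independent logic. NOT the Beam API: both runner
# classes are hypothetical stand-ins for real execution engines.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def word_lengths(line: str) -> List[int]:
    """User logic: it knows nothing about the engine that will run it."""
    return [len(word) for word in line.split()]


class SequentialRunner:
    """Stand-in engine that processes records one after another."""

    def run(self, logic: Callable[[str], List[int]],
            lines: List[str]) -> List[List[int]]:
        return [logic(line) for line in lines]


class ThreadedRunner:
    """Stand-in engine that spreads records across a thread pool."""

    def __init__(self, workers: int = 2) -> None:
        self.workers = workers

    def run(self, logic: Callable[[str], List[int]],
            lines: List[str]) -> List[List[int]]:
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            # Executor.map returns results in input order.
            return list(pool.map(logic, lines))


lines = ["apache beam", "spark flink apex"]

sequential = SequentialRunner().run(word_lengths, lines)
threaded = ThreadedRunner().run(word_lengths, lines)

print(sequential)  # [[6, 4], [5, 5, 4]]
print(threaded)    # [[6, 4], [5, 5, 4]]
```

Swapping engines here means changing one line at the call site, while `word_lengths` stays untouched.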

Separation of concerns and ease of migration will be good to have if the ongoing rivalries and competition among the various big data processing engines continue. Granted, Apache Spark has emerged as one of the undisputed champs of the field and become a de facto standard choice. But there's always room for improvement or an entirely new streaming or processing paradigm. Beam is less about offering a specific alternative than about providing developers and data-wranglers with more breadth of choice among them.

Serdar Yegulalp is a senior writer at InfoWorld, focused on the InfoWorld Tech Watch news analysis blog and periodic reviews.
