Review: Spark lights a fire under big data processing

Apache Spark brings high-speed, in-memory analytics to Hadoop clusters, crunching large-scale data sets in minutes instead of hours

Become An Insider

Sign up now and get FREE access to hundreds of Insider articles, guides, reviews, interviews, blogs, and other premium content. Learn more.
At a Glance
  • The Apache Software Foundation Apache Spark 1.1.0

Apache Spark got its start in 2009 at UC Berkeley’s AMPLab as a way to perform in-memory analytics on large data sets. At that time, Hadoop MapReduce was focused on large-scale data pipelines that were not iterative in nature. Building analytic models on MapReduce in 2009 was a very slow process, so AMPLab designed Spark to help developers perform interactive analysis of large data sets and to run iterative workloads, such as machine-learning algorithms, that repeatedly process the same data sets in RAM.

Spark doesn’t replace Hadoop. Rather, it offers an alternative processing engine for workloads that are highly iterative. By avoiding costly writes to disk, Spark jobs often run many orders of magnitude faster than Hadoop MapReduce. By "living" inside the Hadoop cluster, Spark uses the Hadoop data layer (HDFS, HBase, and so on) for the end points of the data pipeline, reading raw data and storing final results.

Writing Spark applications

Spark, written in Scala, provides a unified abstraction layer for data processing, making it a great environment for developing data applications. Spark comes with a choice of Scala, Java, and Python language bindings that are, for the most part, equivalent except at the bleeding edge, where only Scala implementations are available.

One of the nice features in Spark is the ability to work interactively from the Scala or Python console. This means you can try out code and immediately see the results of execution. This is handy both for debugging, where you can change a value and proceed again without going through a compile step, and for data exploration, where a typical process consists of tight loops of inspect-visualize-update.

Spark’s core data structure is a resilient distributed data (RDD) set. In Spark, driver programs are written as a series of transformations of RDDs, followed by actions on them. Transformations, as the name suggests, create new RDDs from existing ones by changing them in some way, such as by filtering the data according to some criteria. Actions work on the RDDs themselves. An action might be counting the number of instances of a data type or saving RDDs to a file.

To continue reading this article register now