The rise and predominance of Apache Spark

Recent surveys and forecasts of technology adoption have consistently suggested that Apache Spark is being embraced at a rate that outperforms other big data frameworks

Leo Cheung

Initially open-sourced in 2010 and followed by its 1.0 release in 2014, Apache Spark quickly became a prominent player in the big data space. Since then, its adoption by big data companies has been rising at an eye-catching rate.

In-memory processing

In-memory processing is undoubtedly a key feature of Spark, and it is what lets the technology deliver speed that dwarfs the performance of conventional big data processing. But in-memory processing isn’t a new computing concept, and there is a long list of database and data-processing products designed around it. Redis and VoltDB are a couple of examples. Another is Apache Ignite, which supplements its in-memory processing capability with a WAL (write-ahead log) to address the performance of big data queries and ACID (atomicity, consistency, isolation, durability) transactions.

Evidently, the functionality of in-memory processing alone isn’t quite sufficient to differentiate a product from others. So, what makes Spark stand out from the rest in the highly competitive big data processing arena?

BI/OLAP at scale with speed

For starters, I believe Spark successfully captures a sweet spot that few other products do. The ever more demanding need for high-speed BI (business intelligence) analytics has, in a sense, started to blur the boundary between the OLAP (online analytical processing) and OLTP (online transaction processing) worlds.

On one hand, we have distributed computing platforms such as Hadoop providing a MapReduce programming model, in addition to its popular distributed file system (HDFS). While MapReduce is a great data processing methodology, it’s a batch process that doesn’t deliver results in a timely manner.

On the other hand, there are big data processing products addressing the needs of OLTP. Examples of products in this category include Phoenix on HBase, Apache Drill, and Ignite. Some of these products provide a query engine that emulates standard SQL’s transactional processing functionality, to varying extents, on top of key-value or column-oriented databases.

What was missing but in high demand in the big data space was a product that does batch OLAP at scale with speed. There are indeed a handful of BI analytics/OLAP products, such as Apache Kylin and Presto, and some of them manage to fill the gap with some success. But it’s Spark that has demonstrated success in addressing both speed and scale at the same time.

Nevertheless, Spark isn’t the only winner in the ‘speed + scale’ battle. Having emerged around the same time as Apache Spark, Impala (now an Apache incubator project) has also demonstrated remarkable performance in both speed and scale in its recent release. Yet it has never achieved the same level of popularity as Spark. So, something else in Spark must have made it more appealing to contemporary software engineers.

Immutable data with functional programming

Apache Spark provides APIs for three types of dataset. RDDs (resilient distributed datasets) are immutable distributed collections of data that can be manipulated using functional transformations (map, reduce, filter, etc.). DataFrames are immutable distributed collections of data in a table-like form with named columns, where each row is a generic untyped JVM object called Row. Datasets are collections of strongly typed JVM objects.

Regardless of the API you elect to use, data in Spark is immutable, and changes are applied to the data via compositional functional transformations. In a distributed computing environment, data immutability is highly desirable for concurrent access and performance at scale. In addition, this functional programming style of formulating and resolving data processing problems has been favored by many software engineers and data scientists these days.

As for MapReduce, Spark provides an API with implementations of map(), flatMap(), groupBy(), and reduce() in the style of classic functional programming languages such as Scala. These methods can be applied to datasets in a compositional fashion as a sequence of data transformations, bypassing the need to code separate mapper and reducer modules as in conventional MapReduce.
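To make the compositional style concrete, here is a minimal word-count sketch. It is plain Python using only the standard library, not Spark’s API: the list `lines` is a hypothetical stand-in for a distributed dataset, and the helper `add_pair` is a name I made up for illustration. The point is only that flatMap, map, and reduce steps chain together without separate mapper and reducer modules.

```python
from functools import reduce
from itertools import chain

# Hypothetical input: a few lines of text standing in for a distributed dataset.
lines = ["spark is fast", "spark is lazy", "hadoop is batch"]

# flatMap step: split every line into words and flatten the results.
words = chain.from_iterable(line.split() for line in lines)

# map step: pair every word with a count of 1.
pairs = map(lambda word: (word, 1), words)


def add_pair(totals, pair):
    """Fold one (word, count) pair into the running totals.

    A new dict is returned each time, so no accumulator is mutated,
    mirroring the immutable style described above.
    """
    word, count = pair
    return {**totals, word: totals.get(word, 0) + count}


# reduce step: combine all pairs into a single dict of word counts.
word_counts = reduce(add_pair, pairs, {})
print(word_counts)
```

In Spark the same shape of pipeline would run across a cluster; here everything runs in a single process, which is enough to show how the steps compose.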

Spark is “lazy”

An underlying design principle that plays a pivotal role in the operational performance of Spark is “laziness.” Spark is lazy in the sense that it holds off actual execution of transformations until it receives requests for resultant data to be returned to the driver program (i.e., the submitted application that is being serviced in an active execution context).

Such an execution strategy can significantly reduce disk and network I/O, enabling Spark to perform well at scale. For example, in a MapReduce process, rather than returning the high volume of data generated by map that is to be consumed by reduce, Spark may elect to return only the much smaller resultant data from reduce to the driver program.
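The idea of deferring work until a result is requested can be sketched in a few lines of plain Python. This is an analogy only, not how Spark is implemented: the class `LazyPipeline` and its methods are hypothetical names, and Python’s built-in lazy `map` and `filter` iterators stand in for Spark’s execution planning. Transformations are merely recorded, and nothing runs until the `sum()` action asks for a value.

```python
class LazyPipeline:
    """Hypothetical toy pipeline: records transformations, runs them on demand."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: return a new pipeline with one more recorded step.
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also only recorded, never executed here.
        return LazyPipeline(self._source, self._steps + (("filter", predicate),))

    def _iterate(self):
        # Chain the recorded steps as lazy iterators over the source.
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    def sum(self):
        # Action: this is the point where the recorded steps actually run.
        return sum(self._iterate())


calls = []  # records every time the mapped function really executes


def square(x):
    calls.append(x)
    return x * x


pipeline = LazyPipeline(range(1, 6)).map(square).filter(lambda v: v % 2 == 1)
calls_before_action = len(calls)  # still 0: nothing has executed yet
total = pipeline.sum()            # triggers execution: 1 + 9 + 25
print(calls_before_action, total, len(calls))
```

Only the single number produced by the action comes back to the caller; the intermediate squared values are never materialized as a collection.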

Cluster and programming language support

For a distributed computing framework, robust cluster management functionality is essential for scaling out horizontally. Spark has been known for its effective use of available CPU cores across thousands of server nodes. Besides the default standalone cluster mode, Spark also supports other cluster managers, including Hadoop YARN and Apache Mesos.

On programming languages, Spark supports Scala, Java, Python, and R. Both Scala and R are functional programming languages at their heart and have been increasingly adopted by the technology industry in general. Programming in Scala on Spark feels like home given that Spark itself is written in Scala, whereas R is primarily tailored for data science analytics.

Python, with its popular data science libraries like NumPy, is perhaps one of the fastest-growing programming languages, partly due to the increasing demand for data science work. Evidently, Spark’s Python API (PySpark) has been quickly adopted in volume by the big data community. Interoperable with NumPy, Spark’s machine learning library MLlib, built on top of its core engine, has helped fuel enthusiasm from the data science community.

On the other hand, Java hasn’t achieved the kind of success Python enjoys on Spark. The Java API on Spark feels like an afterthought. On a few occasions I’ve seen something rather straightforward in Scala require a lengthy workaround in Java on Spark.

Power of SQL and user-defined functions

SQL-compliant query capability is a significant part of Spark’s strength. Recent releases of the Spark API support the SQL:2003 standard. Among the most sought-after query features are window functions, which at the time of writing are not even available in some major SQL-based RDBMSs like MySQL. Window functions enable one to rank or aggregate rows of data over a sliding window of rows, which helps minimize expensive operations such as joining DataFrames.
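As a rough illustration of what a window function buys you, the sketch below ranks rows and computes a per-group total in a single pass, with no self-join. To keep it self-contained I’m using Python’s built-in `sqlite3` module as a stand-in SQL engine rather than Spark SQL, and I’m assuming the bundled SQLite is version 3.25 or later (the first release with window functions). The `sales` table and its rows are made up for the example.

```python
import sqlite3

# In-memory database standing in for a SQL engine (assumes SQLite 3.25+).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", "ann", 300),
        ("east", "bob", 500),
        ("west", "cat", 400),
        ("west", "dan", 200),
    ],
)

# RANK() and SUM() are evaluated over a window of rows sharing the same region,
# so each row keeps its detail while gaining a rank and a group total.
ranked = conn.execute(
    """
    SELECT region, rep, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rank_in_region,
           SUM(amount) OVER (PARTITION BY region) AS region_total
    FROM sales
    ORDER BY region, rank_in_region
    """
).fetchall()
conn.close()

for row in ranked:
    print(row)
```

Without window functions, getting the regional total next to each row would typically mean aggregating first and then joining the result back to the detail rows.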

Another important feature of Spark’s API is user-defined functions (UDFs), which allow one to create custom functions that apply the vast range of general-purpose functions available in the programming language to data columns. While there are only a handful of functions specific to the DataFrame API, with UDFs one can draw on virtually any method available in, say, the Scala programming language to assemble custom functions.
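The general idea of a UDF, registering an ordinary function from the host language so that queries can call it on a column, can be sketched with Python’s built-in `sqlite3` module. This is a stand-in for the concept, not Spark’s UDF API; the function `domain_of`, the `users` table, and the sample addresses are all hypothetical.

```python
import sqlite3


def domain_of(email):
    """Return the lower-cased domain of an email address, or None if malformed."""
    if email is None or "@" not in email:
        return None
    return email.rsplit("@", 1)[1].lower()


conn = sqlite3.connect(":memory:")

# Register the Python function under a SQL name; it takes one argument.
conn.create_function("domain_of", 1, domain_of)

conn.execute("CREATE TABLE users (email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?)",
    [("ann@Example.com",), ("bob@spark.apache.org",), ("not-an-email",)],
)

# The query applies ordinary Python string handling to every value in the column.
domains = [
    row[0]
    for row in conn.execute("SELECT domain_of(email) FROM users ORDER BY rowid")
]
conn.close()
print(domains)
```

The query engine knows nothing about email parsing; the custom logic lives entirely in the host language, which is what makes UDFs such a flexible escape hatch.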

Spark streaming

When data streaming is a requirement on top of building an OLAP system, the necessary integration effort can be challenging. Such integration generally requires not only bringing in a third-party streaming library, but also making sure that the two disparate APIs cooperate reliably despite the vast difference in latency between near-real-time and batch processing.

Spark provides a streaming library that offers fault-tolerant distributed streaming functionality. It performs streaming by treating small contiguous chunks of data as a sequence of RDDs, which are Spark’s core data structure. This built-in streaming capability undoubtedly alleviates the burden of having to integrate high-latency batch processing tasks with low-latency streaming routines.
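The chunking approach can be illustrated with a small plain-Python sketch. It is only an analogy for the micro-batch idea, not Spark’s streaming library: the generator `micro_batches`, the function `count_errors`, and the synthetic `events` stream are hypothetical names and data. What it shows is that once a stream is cut into small batches, ordinary batch-style logic can be reused on each chunk unchanged.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Cut a possibly unbounded stream into small contiguous batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the stream is exhausted
        yield batch


def count_errors(batch):
    """Batch-style logic: works the same on a micro-batch or a full dataset."""
    return sum(1 for event in batch if event["level"] == "ERROR")


# Hypothetical event stream: every third event is an error.
events = ({"id": i, "level": "ERROR" if i % 3 == 0 else "INFO"} for i in range(10))

# The same batch function is applied to each small chunk as it arrives.
errors_per_batch = [count_errors(batch) for batch in micro_batches(events, 4)]
print(errors_per_batch)
```

Because `count_errors` never needs to know whether its input came from a stream or a stored dataset, there is a single code path to maintain for both.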

Visualization, and beyond

Last but not least, Spark’s web-based visual tools reveal detailed information about how a data processing job is performed. Not only do the tools show you the breakdown of the tasks on individual worker nodes of the cluster, they also give details down to the life cycle of the individual execution processes (i.e., executors) allocated for the job. In addition, Spark’s visualization of a complex job flow in the form of a DAG (directed acyclic graph) offers in-depth insight into how a job is executed. It’s especially useful in troubleshooting or performance-tuning an application.

So, it isn’t just one or two things among the long list of in-memory processing speed, scalability, addressing of the BI/OLAP niche, functional programming style, data immutability, lazy execution strategy, appeal to the rising data science community, robust SQL capability and task visualization, etc. that propel Apache Spark to be a predominant frontrunner in the big data space. It’s the collective strength of the complementary features that truly makes Spark stand out from the rest.

Copyright © 2017 IDG Communications, Inc.
