Those curious about what's coming in Apache Spark 1.6 can now get hands-on experience with the latest changes to the big data processing framework.
Databricks, a major commercial contributor to Spark and provider of a cloud-hosted platform for running Spark applications, is offering Spark 1.6 in preview on its services. Those who want to try out Spark on their own can obtain Spark 1.6 pre-release code directly from Apache.
This new version packs major changes to the way Spark handles memory. Earlier editions of Spark used to subdivide available memory into two partitions, one for data and one for execution, and required users to figure out how much memory to split between the two. This might have been one of the reasons for InfoWorld contributor Ian Pointer's complaints about Spark's memory management.
In version 1.6, execution memory and storage memory can borrow from each other as needed. "For many applications," says Databricks in its blog post, "this will mean a significant increase in available memory that can be used for operators, such as joins and aggregations."
There are still some restrictions in the current implementation; while borrowed execution memory can be released if needed, borrowed storage memory is never released. For backward compatibility, 1.6 also includes a legacy, fixed-partition memory management mode.
Version 1.6 also continues 1.5's work on Project Tungsten, a major initiative to rewrite Spark's low-level handling of memory and CPU. As Spark picks up speed, both in terms of its performance and its popularity with developers, more of its performance bottlenecks are turning out to be limits of the JVM itself, such as its garbage collector.
Keeping it convenient
Another key addition to Spark 1.6 is the Dataset API for working with collections of typed objects. Previously, Spark users had a choice of two APIs for interacting with data: RDDs, essentially collections of native Java objects (useful, but slow); and DataFrames, collections of structured binary data (less safe, but faster).
Datasets meld the advantages of both: the static typing and user functions of RDDs, and the compile-time type checking and better performance of DataFrames. Datasets and DataFrames are also meant to interoperate freely; libraries that accept one kind of data can accept the other as well.
Spark's real-time data component, Spark Streaming, also gets a boost due to a change to its state-tracking API. This makes it easier for programs that store a large amount of state information when using streams (such as for user sessions), so the impact on memory only varies with the changes to that state, rather than the entire size of the state. Databricks claims this can provide as much as a tenfold improvement in some workloads, and since Spark Streaming has its share of controversies about performance, any improvement will likely be welcomed.
Other changes aren't as major, but useful: the ability to have persistent pipelines in Spark ML (so jobs can be suspended and resumed); the ability to perform Spark SQL queries on flat files in supported formats; some new machine learning algorithms; and improvements to the R and Python APIs. The last one may help open up Spark's reach; while Python remains a prime language of choice for science and math, API support for that language in Spark has traditionally lagged.