Review: Spark lights up machine learning

Spark ML brings efficient machine learning to large compute clusters and combines with TensorFlow for deep learning

As I wrote in March of this year, the Databricks service is an excellent product for data scientists. It has a full assortment of ingestion, feature selection, model building, and evaluation functions, plus great integration with data sources and excellent scalability. The Databricks service provides a superset of Spark as a cloud service. Databricks, the company, was founded by Matei Zaharia, the original developer of Spark, and others from U.C. Berkeley’s AMPLab, and it remains a major contributor to the Apache Spark project.

In this review, I’ll discuss Spark ML, the open source machine learning library for Spark. To be more accurate, Spark ML is the newer of two machine learning libraries for Spark. As of Spark 1.6, the DataFrame-based API in the Spark ML package was recommended over the RDD-based API in the Spark MLlib package for most functionality, but was incomplete. Now, as of Spark 2.0, Spark ML is primary and complete and Spark MLlib is in maintenance mode.

Spark ML features

The Spark ML library provides common machine learning algorithms for classification, regression, clustering, and collaborative filtering (but not deep neural networks), along with tools for feature extraction, transformation, dimensionality reduction, and selection, and tools for constructing, evaluating, and tuning ML pipelines. Spark ML also includes utilities for saving and loading algorithms, models, and pipelines; for data handling; and for linear algebra and statistics.
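To make the pipeline and persistence features concrete, here is a minimal sketch in Scala. It assumes Spark 2.x with the spark-mllib artifact on the classpath and a local master. The toy data, the column names, the object name, and the /tmp path are all made up for illustration. It chains a tokenizer, a hashing feature transformer, and a logistic regression classifier into one pipeline, fits it, saves the fitted model, and reloads it for scoring.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; a real job would target a cluster.
    val spark = SparkSession.builder()
      .appName("PipelineSketch")
      .master("local[*]")
      .getOrCreate()

    // Toy labeled documents (illustrative only): id, text, label.
    val training = spark.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    // Feature extraction stages followed by a classifier.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol("words")
      .setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)

    // A Pipeline is an ordered list of stages; fit() returns a PipelineModel.
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model: PipelineModel = pipeline.fit(training)

    // Persist the fitted pipeline, then load it back (hypothetical path).
    val path = "/tmp/spark-lr-pipeline-model"
    model.write.overwrite().save(path)
    val reloaded = PipelineModel.load(path)

    // Score unlabeled documents with the reloaded model.
    val unlabeled = spark.createDataFrame(Seq(
      (4L, "spark i j k"),
      (5L, "mapreduce l m n")
    )).toDF("id", "text")

    reloaded.transform(unlabeled)
      .select("id", "text", "probability", "prediction")
      .show(false)

    spark.stop()
  }
}
```

Because the whole chain is saved as one unit, the scoring side does not need to know which feature transformers were used during training.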

Spark ML is also referred to in the documentation as MLlib, which is confusing. If that bothers you, you can ignore the older Spark MLlib package and forget that I ever mentioned it.

Spark ML is written in Scala and uses the linear algebra package Breeze. Breeze depends on netlib-java for optimized numerical processing. If you’re lucky, there are machine-optimized native netlib-java binary proxies for your platform, which will make the whole library run much faster than a pure JVM implementation. On a Mac, that would be Apple’s vecLib framework, which is installed by default.
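One way to see which implementation you ended up with is to ask netlib-java directly. The snippet below is a rough check, not an official diagnostic. It assumes the Spark 2.x-era netlib-java (the com.github.fommil.netlib package) is on the classpath, for example inside spark-shell; later Spark releases may bundle a different library. The object name BlasCheck is hypothetical.

```scala
import com.github.fommil.netlib.BLAS

object BlasCheck {
  def main(args: Array[String]): Unit = {
    // getInstance() returns whichever BLAS backend netlib-java managed to load.
    val impl = BLAS.getInstance().getClass.getName
    // A class name containing "Native" suggests a native binding was found;
    // "F2jBLAS" is the pure-JVM fallback.
    println(s"BLAS implementation: $impl")
  }
}
```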
