Review: Spark lights up machine learning

Spark ML brings efficient machine learning to large compute clusters and combines with TensorFlow for deep learning


As I wrote in March of this year, the Databricks service is an excellent product for data scientists. It has a full assortment of ingestion, feature selection, model building, and evaluation functions, plus great integration with data sources and excellent scalability. The Databricks service provides a superset of Spark as a cloud service. Databricks, the company, was founded by Matei Zaharia, the original developer of Spark, and others from U.C. Berkeley’s AMPLab, and it continues to be a major contributor to the Apache Spark project.

In this review, I’ll discuss Spark ML, the open source machine learning library for Spark. More precisely, Spark ML is the newer of two machine learning libraries for Spark. As of Spark 1.6, the DataFrame-based API in the Spark ML package (spark.ml) was recommended over the RDD-based API in the Spark MLlib package (spark.mllib) for most functionality, but it was incomplete. Now, as of Spark 2.0, Spark ML is primary and complete, and Spark MLlib is in maintenance mode.

Spark ML features

The Spark ML library provides common machine learning algorithms for classification, regression, clustering, and collaborative filtering (but not deep neural networks). It also provides tools for feature extraction, transformation, dimensionality reduction, and selection, as well as tools for constructing, evaluating, and tuning ML pipelines. In addition, Spark ML includes utilities for saving and loading algorithms, models, and pipelines, for data handling, and for doing linear algebra and statistics.

Spark ML is also referred to in the documentation as MLlib, which is confusing. If that bothers you, you can ignore the older Spark MLlib package and forget that I ever mentioned it.
