Review: Spark lights up machine learning

Spark ML brings efficient machine learning to large compute clusters and combines with TensorFlow for deep learning

As I wrote in March of this year, the Databricks service is an excellent product for data scientists. It has a full assortment of ingestion, feature selection, model building, and evaluation functions, plus great integration with data sources and excellent scalability. Databricks provides a superset of Spark as a cloud service. The company was founded by the original developer of Spark, Matei Zaharia, and others from U.C. Berkeley’s AMPLab, and Databricks continues to be a major contributor to the Apache Spark project.

In this review, I’ll discuss Spark ML, the open source machine learning library for Spark. To be more accurate, Spark ML is the newer of two machine learning libraries for Spark. As of Spark 1.6, the DataFrame-based API in the Spark ML package was recommended over the RDD-based API in the Spark MLlib package for most functionality, but was incomplete. Now, as of Spark 2.0, Spark ML is primary and complete and Spark MLlib is in maintenance mode.

Spark ML features

The Spark ML library provides common machine learning algorithms such as classification, regression, clustering, and collaborative filtering (but not deep neural networks) along with tools for feature extraction, transformation, dimensionality reduction, and selection and tools for constructing, evaluating, and tuning ML pipelines. Spark ML also includes utilities for saving and loading algorithms, models, and pipelines, for data handling, and for doing linear algebra and statistics.

Spark ML is also referred to in the documentation as MLlib, which is confusing. If that bothers you, you can ignore the older Spark MLlib package and forget that I ever mentioned it.

Spark ML is written in Scala and uses the linear algebra package Breeze. Breeze depends on netlib-java for optimized numerical processing. If you’re lucky, there will be machine-optimized native netlib-java binary proxies on your platform, which make the whole library run much faster than a pure JVM implementation. On a Mac, that’s Apple’s vecLib framework, which is installed by default.

If you’re really lucky you will be able to use Databricks’ configuration of Spark clusters to use GPUs instead of the stock CPU-only Apache Spark. GPUs can potentially get you another 10-fold speed improvement for training complex machine learning models with big data, although they might actually slow down the training of simple machine learning models with small data compared to machine-optimized CPU libraries.

MLlib implements a truckload of common algorithms and models for classification and regression, to the point where a novice could become confused, but an expert would eventually find a good choice of model for the data to be analyzed. To this plethora of models, Spark ML adds the important feature of hyperparameter tuning, also known as model selection, which lets the analyst set up a parameter grid, an estimator, and an evaluator, then have either the cross-validation method (time-consuming but accurate) or the train-validation split method (faster but less accurate) find the best model for the data.

Spark ML has full APIs for Scala and Java, mostly full APIs for Python, and sketchy partial APIs for R. You can get a good feel for the coverage by counting the samples: There are 54 Java and 60 Scala ML examples, 52 Python ML examples, and only five R examples.

Installing Spark (or not)

My initial fear about installing Spark was that it would be a struggle involving multiple dependencies and builds. I was happily mistaken. All I had to do on both of my Macs was to download the current Spark binary and unpack it. I already had Java 1.8.0_05 installed, which gave me Scala 2.11.8 without additional work; I already had Python 2.7 and R 3.3.1 installed as well. I thought that Spark would balk when it didn’t detect a Hadoop installation, but it reported “WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable” and went on its merry way. Score one for prebuilt binaries with good fallbacks.

There is currently an easy way to set up a Spark cluster on Amazon EC2 using the default Spark AMI. It supposedly takes about five minutes, assuming you have already installed Spark locally and have an Amazon account with EC2 permissions.

Installing Hadoop clusters with Spark on a server farm can be complicated, depending on what cluster manager you use. Lacking a server farm, I didn’t try it; unless you do your own IT, you may never have to deal with the problem directly.

If you don’t want to install Spark but still want to use it, you can run it at Databricks. A free Databricks Community cluster gives you one node with 6GB of RAM and 0.88 core. Paid clusters can have memory optimized nodes (30GB, four cores) deployed on Amazon r3.2xlarge EC2 instances, compute optimized nodes (15GB, eight cores) deployed on Amazon c3.4xlarge EC2 instances, or the GPU nodes discussed above, deployed on Amazon g2.2xlarge (one GPU) or g2.8xlarge (four GPUs) EC2 instance types.

One advantage of using Databricks is that you can create clusters at will using any Spark version from 1.3 to the current version (2.0.1 as of this writing). This matters because some older Spark ML programs, including samples, only run correctly on specific versions of Spark and throw errors on others. You will usually find the supported Spark versions listed in the comments of the sample notebooks.

Running Spark ML

While Databricks offers a Jupyter-compatible notebook by default for running Spark, the standard Apache Spark distribution runs from the command line. There are multiple scripts in the Spark repository: ./bin/run-example for running sample code from the repository; ./bin/spark-shell to run a modified interactive Scala shell; ./bin/pyspark to run an interactive Python shell; ./bin/sparkR for running an R shell; and the generic ./bin/spark-submit for running applications. The spark-submit script in turn runs spark-class, which takes care of finding Java and the Spark .jar files, then launching the desired script.

By default, Spark generates an incredibly verbose stream of logging output even when there is little substantive output. In the screenshot below from running a simple Spark sample, see whether you can pick out the answer before reading the caption.

[Screenshot: Spark log output]

Here we are running a Scala sample to calculate Pi using Spark. The calculated value, about 10 lines up from the bottom, is only accurate to three figures.
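The SparkPi sample estimates Pi by Monte Carlo sampling: it throws random points at a unit square and counts how many land inside the quarter circle. The same estimate in plain, non-distributed Python shows why only a few digits come out accurate; the function name and sample count here are my own, not from the Spark sample.

```python
import random

def estimate_pi(n_samples: int, seed: int = 42) -> float:
    """Monte Carlo estimate of Pi: the fraction of random points in the
    unit square that fall inside the quarter circle, times 4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_samples

# The error shrinks only as 1/sqrt(n), so 100,000 samples gives
# roughly three accurate figures, matching what the Spark sample prints
print(estimate_pi(100_000))
```

Spark's version distributes the sampling loop across the cluster, but the statistics are the same: tripling the accuracy costs roughly 100 times the samples.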

Fortunately, it is possible to throttle the terminal logging output by adjusting the logging configuration in Spark’s conf directory. There’s a better way to run Spark, however: a Jupyter-type notebook hides the logging and intersperses explanatory text, code, and outputs, as shown in the two Databricks screenshots below.
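The stock distribution ships a logging template in the conf directory; copying it and raising the root logger level is the usual way to quiet the console. A sketch of the change (the WARN level is my choice; ERROR quiets things further):

```properties
# conf/log4j.properties (copied from conf/log4j.properties.template)
# Raise the root logger from INFO to WARN to quiet the console
log4j.rootCategory=WARN, console
```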

[Screenshot: Spark MLlib recommender notebook]

A Spark Scala example (Recommender Systems with ALS) as a Databricks notebook. Note that most of the logging output has been hidden.

[Screenshot: Spark MLlib Naive Bayes notebook]

A Spark Python example (Naive Bayes) as a Databricks notebook. Notice how the explanation, code, and output are interspersed.

You don’t actually need Databricks to run Jupyter Notebook with Spark. You can install Jupyter locally, and add whatever language kernels (such as Scala and R) you need on top of the default IPython kernel. If you search the web for “how to use Spark with Jupyter notebook,” you’ll find at least a dozen sets of instructions for setting things up so that you can use Jupyter with Spark, giving you a Mathematica-like workbook. Pick whichever set of instructions matches your environment most closely.
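One common recipe for the Python side uses pyspark’s own launcher variables to start Jupyter as the driver front end. The environment variable names are the standard pyspark ones; the Spark install path shown is hypothetical and depends on where you unpacked the binary.

```
# Install Jupyter, then tell pyspark to launch the notebook as its driver
pip install jupyter
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
~/spark-2.0.1-bin-hadoop2.7/bin/pyspark
```

Scala and R notebooks need a separate kernel installed on top of Jupyter, which is where the instruction sets on the web diverge the most.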

Learning Spark ML

On the Spark site you will find download instructions; instructions for running the distributed Scala, Java, Python, and R examples; a Spark API quick-start and programming guide; and other relevant documents. The examples page explains a few simple instances in Python, Scala, and Java. Additional examples can be found in the repository you downloaded, in the examples/src/main directory tree.

Databricks offers additional documentation and samples. You can start at the Welcome to Databricks page and go through the tutorial sequence on the main page, down to the notebook tutorials and getting-started videos. Once you know how to add a cluster and create a notebook, you may want to go through the visualizations section of the Databricks documentation as well as its MLlib documentation, from which the two notebook screenshots above were taken. Pay extra attention to the discussions of MLlib algorithms.

Spark meets TensorFlow

As we’ve seen, Spark ML supplies pretty much anything you’d want in basic machine learning, feature selection, pipelines, and persistence. It does a good job with classification, regression, clustering, and filtering. Given that it’s part of Spark, it has great access to databases, streams, and other data sources.

On the other hand, Spark ML is not really set up to model and train deep neural networks in the same way as Google TensorFlow, Caffe, and Microsoft Cognitive Toolkit. For that reason, Databricks has been working on combining the scalability and hyperparameter tuning of Spark with the deep neural network training and deployment of TensorFlow, as discussed in this blog post.

If you have a Hadoop installation and want to do data science with machine learning, Spark ML is an obvious way to go. If you need deep neural networks to model your data properly, then Spark ML is not the best choice — but you can combine it with one of the best choices to possibly create something better than either alone.

---

Cost: Free open source.
Platform: Spark runs on both Windows and Unix-like systems (such as Linux and macOS), with Java 7 or later, Python 2.6/3.4 or later, and R 3.1 or later. For the Scala API, Spark 2.0.1 uses Scala 2.11.

InfoWorld Scorecard: Spark ML 2.0.1
Models and algorithms (25%): 9
Ease of development (25%): 8
Documentation (20%): 8
Performance (20%): 9
Ease of deployment (10%): 8
Overall Score (100%): 8.5
At a Glance
  • Spark MLlib has a wide assortment of machine learning tools, but little in the way of deep learning.

    Pros

    • Efficient implementation of machine learning on Hadoop
    • Highly scalable
    • Very good Scala and Java APIs; good Python API
    • Handles streams

    Cons

    • R API is very basic
    • Not really set up for deep neural networks
    • Running Jupyter notebooks with Spark takes some setup, unless you use the Databricks service

Copyright © 2016 IDG Communications, Inc.