Machine learning reviews

Review: Databricks makes big data dreams come true

Cloud-based Spark machine learning and analytics platform is an excellent, full-featured product for data scientists

Become An Insider

Sign up now and get FREE access to hundreds of Insider articles, guides, reviews, interviews, blogs, and other premium content. Learn more.
At a Glance
  • Databricks with Spark 1.6

Editor's Choice

For those of you just tuning in, Spark, an open source cluster computing framework, was originally developed by Matei Zaharia at U.C. Berkeley's AMPLab in 2009, and later open-sourced and donated to the Apache Foundation. Part of the motivation for creating Spark is that MapReduce only allows a single pass through the data, while machine learning (ML) and graphing algorithms generally need to perform multiple passes.

Spark is billed as a “fast and general engine for large-scale data processing,” with a tagline of “Lightning-fast cluster computing.” In the world of big data, Spark has been attracting attention and investment because it provides a powerful in-memory data-processing component within Hadoop that deals with both real-time and batch events. In addition to Databricks, Spark has been embraced by the likes of IBM, Microsoft, Amazon, Huawei, and Yahoo.

Spark includes MLlib for distributed machine learning and GraphX for distributed graph computation.

spark ecosystem

The Spark core supports APIs in R, SQL, Python, Scala, and Java. Additional Spark modules include Spark SQL and DataFrames; Streaming; MLlib for machine learning; and GraphX for graph computation.

MLlib is of particular interest in this review. It includes a wide range of ML and statistical algorithms, all tailored for the distributed memory-based Spark architecture. MLlib implements, among other items, summary statistics, correlations, sampling, hypothesis testing, classification and regression, collaborative filtering, cluster analysis, dimensionality reduction, feature extraction and transformation functions, and optimization algorithms. In other words, it’s a fairly complete package for data scientists.

To continue reading this article register now