
Review: Databricks makes big data dreams come true

Cloud-based Spark machine learning and analytics platform is an excellent, full-featured product for data scientists.

At a Glance
  • Databricks with Spark 1.6

For those of you just tuning in, Spark is an open source cluster computing framework that was originally developed by Matei Zaharia at U.C. Berkeley's AMPLab in 2009, then open-sourced and later donated to the Apache Software Foundation. Part of the motivation for creating Spark was that MapReduce allows only a single pass through the data, while machine learning (ML) and graph algorithms generally need to make multiple passes.
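To make the multiple-pass point concrete, here is a minimal sketch in plain Python (not Spark code) of an iterative fit. The function names, the learning rate, and the three data points are all made up for illustration; the only point is that each improvement to the model requires another full pass over the same data, which is why keeping that data in memory pays off.

```python
# Illustrative only: plain Python, not Spark or MapReduce code.
# All names and numbers here are hypothetical.

def full_pass_gradient(data, w):
    """Make one full pass over the data and return the mean gradient
    of the squared error for the model y ~ w * x."""
    total = 0.0
    for x, y in data:
        total += 2.0 * (w * x - y) * x
    return total / len(data)


def fit_slope(data, passes=50, learning_rate=0.05):
    """Gradient descent: every iteration is another pass over the data."""
    w = 0.0
    for _ in range(passes):
        w -= learning_rate * full_pass_gradient(data, w)
    return w


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # made-up points on y = 2x
slope_after_one_pass = fit_slope(data, passes=1)
slope_after_many_passes = fit_slope(data, passes=50)
```

With a single pass the estimate is still far from the true slope of 2; only after repeated passes over the same points does it settle there.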

Spark is billed as a “fast and general engine for large-scale data processing,” with a tagline of “Lightning-fast cluster computing.” In the world of big data, Spark has been attracting attention and investment because it provides a powerful in-memory data-processing component within Hadoop that deals with both real-time and batch events. In addition to Databricks, Spark has been embraced by the likes of IBM, Microsoft, Amazon, Huawei, and Yahoo.

Spark includes MLlib for distributed machine learning and GraphX for distributed graph computation.

[Figure: Spark ecosystem]

The Spark core supports APIs in R, SQL, Python, Scala, and Java. Additional Spark modules include Spark SQL and DataFrames; Streaming; MLlib for machine learning; and GraphX for graph computation.

MLlib is of particular interest in this review. It includes a wide range of ML and statistical algorithms, all tailored for Spark's distributed, memory-based architecture. Among other things, MLlib implements summary statistics, correlations, sampling, hypothesis testing, classification and regression, collaborative filtering, cluster analysis, dimensionality reduction, feature extraction and transformation functions, and optimization algorithms. In other words, it’s a fairly complete package for data scientists.
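As a rough idea of what "tailored for a distributed architecture" means for even the simplest item on that list, summary statistics, here is a sketch in plain Python. It is not MLlib's API or its implementation. It assumes the data is already split into non-empty partitions (the lists below are made up), summarizes each partition independently, and then merges the partial results using the standard pairwise formula for combining means and variances.

```python
# Illustrative only: plain Python, not MLlib. Names and data are hypothetical.
from functools import reduce


def summarize(partition):
    """Return (count, mean, M2) for one non-empty partition, where M2 is
    the sum of squared deviations from the partition mean."""
    n = len(partition)
    mean = sum(partition) / n
    m2 = sum((x - mean) ** 2 for x in partition)
    return n, mean, m2


def merge(a, b):
    """Combine two partial summaries without revisiting the raw data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


# Stand-ins for data spread across three workers.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]

count, mean, m2 = reduce(merge, map(summarize, partitions))
variance = m2 / (count - 1)  # sample variance
```

Because each partial summary is small and the merge works in any order, the per-partition work can run in parallel and only the three-number summaries need to travel between machines.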
