
Review: Databricks makes big data dreams come true

Cloud-based Spark machine learning and analytics platform is an excellent, full-featured product for data scientists

At a Glance
  • Databricks with Spark 1.6

Editor's Choice

For those of you just tuning in, Spark is an open source cluster computing framework. It was originally developed by Matei Zaharia at U.C. Berkeley's AMPLab in 2009, and later open-sourced and donated to the Apache Software Foundation. Part of the motivation for creating Spark was that a MapReduce job makes only a single pass through the data, while machine learning (ML) and graph algorithms generally need to perform multiple passes.
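To make the "multiple passes" point concrete, here is a minimal sketch in plain Python (no Spark involved) of an iterative ML algorithm. It fits a one-parameter line by batch gradient descent, and every iteration has to scan the whole data set again. The function name `fit_slope`, the synthetic data, and the parameter values are all made up for illustration; none of this is taken from Spark or Databricks.

```python
# Illustrative only: plain Python, no Spark. All names here are hypothetical.

def fit_slope(points, learning_rate=0.01, passes=200):
    """Fit y = w * x by batch gradient descent.

    Each iteration scans every point once, so the data set is read
    `passes` times in total.
    """
    w = 0.0
    scans = 0
    for _ in range(passes):
        # One full pass over the data: gradient of the mean squared error.
        grad = sum(2 * (w * x - y) * x for x, y in points) / len(points)
        w -= learning_rate * grad
        scans += 1
    return w, scans


# Synthetic data lying exactly on the line y = 3x.
points = [(float(x), 3.0 * x) for x in range(1, 6)]
w, scans = fit_slope(points)
print(round(w, 3), scans)
```

If each of those scans had to be launched as a separate job that rereads its input from disk, the repeated I/O would dominate the run time. Keeping the working set in memory between passes is the kind of workload Spark was designed for.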

Spark is billed as a “fast and general engine for large-scale data processing,” with a tagline of “Lightning-fast cluster computing.” In the world of big data, Spark has been attracting attention and investment because it provides a powerful in-memory data-processing component within Hadoop that deals with both real-time and batch events. In addition to Databricks, Spark has been embraced by the likes of IBM, Microsoft, Amazon, Huawei, and Yahoo.

Spark includes MLlib for distributed machine learning and GraphX for distributed graph computation.

[Figure: The Spark ecosystem]

The Spark core supports APIs in R, SQL, Python, Scala, and Java. Additional Spark modules include Spark SQL and DataFrames; Streaming; MLlib for machine learning; and GraphX for graph computation.
