Azure Databricks: Fast analytics in the cloud with Apache Spark

Microsoft’s partnership with Databricks adds new analytics tools to Azure’s data platform

We’re living in a world of big data. The current generation of line-of-business computer systems generate terabytes of data every year, tracking sales and production through CRM and ERP. It’s a flood of data that’s only going to get bigger as we add the sensors of the industrial internet of things, and the data that’s needed to deliver even the simplest predictive-maintenance systems.

Having that data is one thing, using it as another. Big data is often unstructured, spread across many servers and databases. You need something to bring it together. That’s where big data analysis tools like Apache Spark come into play; these distributed analytical tools work across clusters of computers. Building on techniques developed for the MapReduce algorithms used by tools like Hadoop, today’s big data analysis tools go further to support more database-like behavior, working with in-memory data at scale, using loops to speed up queries, and providing a foundation for machine learning systems.

Apache Spark is fast, but Databricks is faster. Founded by the Spark team, Databricks is a cloud-optimized version of Spark that takes advantage of public cloud services to scale rapidly and uses cloud storage to host its data. It also offers tools to make it easier to explore your data, using the notebook model popularized by tools like Jupyter Notebooks.

Microsoft’s new support for Databricks on Azure—called Azure Databricks—signals a new direction of its cloud services, bringing Databricks in as a partner rather than through an acquisition.

Although you’ve always been able to install Spark or Databricks on Azure, Azure Databricks makes it a one-click experience, driving the setup process from the Azure Portal. You can host multiple analytical clusters, using autoscaling to minimize the resources in use. You can clone and edit clusters, tuning them for specific jobs or running different analyses on the same underlying data.

To continue reading this article register now