Databricks offers a glimpse of Spark 2.0

Spark has taken big data by storm. What's next for the in-memory engine of choice? Spark's primary commercial backer, Databricks, offers a clue.


Last week at Spark Summit East, Databricks dropped a few hints about where the in-memory data processing tool Spark is headed. The company is the primary commercial entity behind Spark and plays a leading role in its evolution.

Databricks' hosted Spark platform, Databricks Cloud, is available by subscription. To make it easier to get on board with Spark in its cloud, Databricks announced a free tier, the Community Edition. For now it is available only by beta invitation, but general availability is planned for the middle of this year.

Databricks clearly sees the Community Edition as an on-ramp to the for-pay version of the product, noting that it will "enable users to seamlessly transition their prototypes to production applications on the full Databricks platform."

Databricks is determined to keep Spark evolving. In a set of slides delivered at the Spark Summit keynote, Databricks CTO and Spark creator Matei Zaharia talked about the forthcoming Spark 2.0. It will feature three key changes: the next phase of Project Tungsten, which speeds up Spark by working around Java's memory-handling limitations; improvements to Spark's real-time streaming system; and the unification of Spark's structured data APIs (Datasets and DataFrames) into a single API.

One detail not mentioned, but on the minds of many Spark enthusiasts, is how Spark will further integrate with Apache Arrow, a new project that provides a columnar in-memory data format for fast access.

All of these are genuinely exciting and important projects. Tungsten, in particular, points toward an approach to speeding up other big data projects written in Java.

Currently, the company claims it has 200 paying customers and insists it will continue to focus on the Databricks platform rather than diversify into other efforts.

But Databricks is hardly the only Spark player. IBM in particular has made Spark a key part of its big data strategy by providing "Spark as a service" in its Bluemix cloud. Over the past year, Spark has replaced Hadoop as the big data engine of choice, and Databricks will face increasing competition as the project progresses to the next level.
