In the last month, several A-list names in cloud and business computing have declared interest (and made investments) in the Apache Spark data analysis project. What got them fired up?
Some of this is legitimate excitement over a promising technology with broad applications. But it's also about yet another project that can be monetized in the cloud, by wrapping it in convenience and offering it at scale.
The allure of Spark
Among the companies expressing their devotion to Spark in recent months:
- IBM. Aside from adding Spark support to its Bluemix PaaS, IBM is preparing to contribute SystemML, its technology for building machine learning algorithms, to Spark.
- Microsoft. The company is adding Spark support to Azure HDInsight, its cloud-hosted version of Hadoop.
- Amazon. Its Elastic MapReduce service will be able to run Spark apps developed not only in Scala, but also in Python and Java.
- Huawei. The Chinese networking giant recently unveiled a project called Astro that combines Spark, Spark SQL, and HBase into a single product. Spark is already used in Huawei's Hadoop-based FusionInsight product, offered as a service by way of Huawei's burgeoning cloud platform.
Spark is attractive mainly because it provides a powerful in-memory data-processing engine within Hadoop that handles both real-time and batch workloads. At Yahoo, where Hadoop originally sprang up, Spark has become a cornerstone of analytics operations.
For the above companies, Spark offers a grade-A ingredient for their cloud business, both with and without Hadoop (although typically with). With prices in a constant race to the bottom, competition between cloud vendors revolves around offering features formerly confined to the data center, but at a scale and with a degree of convenience unavailable there. (It also helps that more enterprise data is now being generated in the cloud rather than moved there.)
Lighting the next fire
Where Spark goes from here is also crucial, since many of the future directions discussed for the project could affect how Spark is deployed as a cloud resource.
IBM's contributions to Spark are in that vein. Databricks, the corporate developer of Spark, has plans of its own that could have even more radical effects. Its Project Tungsten is a major revamp of how Spark manages memory, aimed at boosting performance. That would benefit not only Spark developers, but everyone providing Spark as a service.
Ironically, the more popular Spark is in the cloud, the more directly it might threaten the business model of Databricks itself. InfoWorld's Andy Oliver profiled Databricks' Spark offering -- a sort of interactive data notebook for Spark -- and found it to be "far from the Tableau of data science" it seems intended to be. The other big contenders listed above may not have the same degree of interactivity for their Spark offerings, but they're arguably put together in ways that more directly complement actual Spark workloads.
Spark also needs to mature in other ways -- documentation, commercial support, and middleware integration, as well as a larger body of prebuilt apps written for it. Aside from that last item, those are jobs well-suited to Spark's corporate contributors and sponsors -- unless, that is, their contributions end up being little more than ensuring Spark runs well in their own clouds and for their own customers.