How much does IBM love Apache Spark? Enough to make it part of Bluemix -- and enrich it with its own contributions.
These aren't any old contributions: Among them is an IBM invention designed to make it easy to deploy machine-learning algorithms across Spark clusters.
Work smarter, not harder
This contribution to Spark, known as IBM SystemML, was originally outlined in a 2011 paper. In it, IBM researchers described the creation of a high-level language, akin to the R language for statistics and data analysis, for authoring machine learning algorithms that could run easily at scale. Those algorithms are then "compiled and optimized into a set of MapReduce jobs that can run on a cluster of machines," according to the IBM paper.
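To make the idea of "compiling" an algorithm into cluster jobs more concrete, here is a minimal sketch in plain Python. It is not SystemML code and does not reflect how IBM's compiler works. It only illustrates the general pattern under simple assumptions: a one-variable linear regression is split into a map step that summarizes each partition of the data and a reduce step that merges those summaries. The function names (`map_partition`, `combine`, `fit_line`) and the sample data are hypothetical, chosen for illustration.

```python
from functools import reduce


def map_partition(rows):
    """Map step: summarize one partition of (x, y) rows as partial sums."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def combine(a, b):
    """Reduce step: merge two partial summaries by element-wise addition."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(partitions):
    """Fit y = slope * x + intercept from data spread across partitions."""
    n, sx, sy, sxx, sxy = reduce(combine, map(map_partition, partitions))
    # Closed-form least-squares solution computed from the merged sums.
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical data: two "partitions" standing in for blocks on a cluster.
partitions = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
]
slope, intercept = fit_line(partitions)
print(slope, intercept)
```

The point of a high-level language is that the author would write only the equivalent of the last few lines of `fit_line`, and leave it to the system to decide how to split the work into map and reduce steps like the ones above.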
But since 2011, MapReduce has gradually taken a backseat to more flexible processing engines, a shift helped along by YARN, the resource manager introduced in Hadoop 2 that lets frameworks other than MapReduce share a cluster. Plus, Spark has grown in importance and utility, due in part to YARN. In light of that, it's wise for SystemML to look beyond MapReduce and become Spark-centric -- precisely what IBM has in mind here.
IBM is still vague on how it plans to accomplish this. Its press release notes that it will be working jointly with Databricks, one of the major commercial contributors to the Spark project, "to advance Spark's machine learning capabilities." Most likely this will involve IBM submitting SystemML-related patches for Spark to Databricks, and collaborating with them on the implementation.
A big blue Spark
IBM's other major Spark project is more predictable, but might be more immediately useful: Adding Spark processing as an IBM Bluemix service.
The details are still sketchy, though. In its press release, IBM describes Spark as a service on Bluemix as "[making] it possible for any app developer to quickly load data, model it, and derive the predictive artifact to use in their app."
If Spark on Bluemix follows the same model as IBM's previous as-a-service offerings, it will likely involve connecting a Spark-as-a-service instance to data stored in one of the existing Bluemix data management or big data offerings. Among the latter is BigInsights for Hadoop, IBM's version of Hadoop in Bluemix.
Most intriguing would be a decision by IBM to keep Spark as its own independent service that can be freely coupled to items in the Bluemix catalog -- or to components in a private cloud by way of IBM's hybrid cloud solutions. The easy answer would be to rev BigInsights and add Spark, but IBM would be better served by expanding its ambitions and making Spark an ingredient that can be reused in multiple contexts.