Spark sits at the heart of IBM's reinvention

With 'Spark as a service,' IBM recasts many of the company's big data options around the analytics technology

One good way to gauge the impact of a technology is by the number of products that have been rebuilt around -- or with -- it. Case in point: IBM and Apache Spark.

Previously, IBM was one of many companies adding support for Spark to its products and making major contributions to the Spark project. Now IBM is not only offering "Spark as a service" through the Bluemix PaaS, it is also re-engineering its data and analytics solutions to take advantage of Spark's processing power, ease of use, and growing roster of contributors.

New platforms for old

Rob Thomas, vice president of product development for IBM Analytics, says Spark is now part of 15 separate IBM applications, "ranging from different commerce applications to SPSS Advanced Analytics algorithms running on top of Spark." Most visible is Spark as a service, which hit beta in June and is now generally available.

The response to the service has been "enormous," said Thomas, citing the 3,000 users it attracted while still in beta.

"The reason we call Spark the analytics operating system," Thomas said, "is that it's a processing layer that can access all data. So you're not restricted to the data in Hadoop or the data in a particular warehouse."

Instead, Thomas said, Spark as a service can be used to access data inside other Bluemix options. These include IBM's Hadoop as a service and the dashDB data warehouse, as well as recently announced data services created with The Weather Company and Twitter, where data from those companies can be combined with customer information for expanded insights.

IBM has also been making major use of Spark in DataWorks, where it helps bridge remote and local data stores and perform data preparation on them. Beyond using Spark to process workloads, IBM has rewritten and consolidated significant parts of the DataWorks codebase on Spark, according to Thomas.

"DataWorks was 40 million lines of code when we launched it last year," he said, "and by replatforming that on Spark, we've reduced it to 5 million lines of code."

A fine fit

Charles King, principal analyst at Pund-IT, believes Spark is a good fit for IBM's customers. "Many of those companies have data assets whose volume and complexity are well suited to Spark," he said. "The sheer flexibility and scalability of Spark makes it a natural fit for IBM, especially given the richness of the company's software, hardware, and cloud assets."

In addition to using Spark's codebase to rebuild its products and offering it as a service to customers, IBM has shown it can produce other big-data goodies apart from Hadoop. With Spark, IBM can grow past Hadoop at little risk, since Spark can work with or without Hadoop, and existing Hadoop users will not be forced to change.

"Spark/Hadoop is not an either/or proposition," King said. "It's more about a developer's or his/her organization's performance demands, the size/type of data, and workload requirements."

IBM isn't the first to add Spark to its mix -- Amazon has made similar efforts -- but by using Spark on multiple levels throughout its big data portfolio, Big Blue is reinventing itself in a way that could encompass its entire product line.
