Apache Spark jumps on the R bandwagon

The big data processing technology looks to draw in data scientists

Apache Spark, the increasingly popular big data processing technology for iterative workloads, is about to add DataFrames and support for the R language in two upcoming upgrades.

Spark in 2015 is focusing on data science and platform interfaces, said Matei Zaharia, who started the Spark project and is currently CTO at big data service provider Databricks, which is involved in Spark development. Increasingly, the people who want to use Spark "are not just software developers but they're data scientists, maybe experts in other fields who need to run computations on large data," Zaharia said at the Strata+Hadoop World conference in San Jose, Calif., late last week.

"The most exciting thing that we're doing [in data science] is adding DataFrames to Spark," said Zaharia. Due in Spark 1.3 in a couple of weeks, DataFrames bring to Spark an API style that is common for working with data on a single machine, providing a concise way to write expressions that operate on data. Meanwhile, Spark 1.4, expected in June, will add an R interface, so that Spark backs Scala, Python, Java, and R -- the "four most-popular languages for big data today," he said.
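
To give a rough sense of the data-frame style Zaharia describes, here is a minimal Python sketch. It is a toy, single-machine illustration and not Spark's DataFrame API: the `MiniFrame` class, its `filter` and `group_avg` methods, and the sample records are all hypothetical names made up for this example.

```python
from collections import defaultdict


class MiniFrame:
    """Toy single-machine stand-in for a data frame (hypothetical, not Spark's API)."""

    def __init__(self, rows):
        # Each row is a dict mapping a column name to a value.
        self.rows = list(rows)

    def filter(self, predicate):
        # Keep only the rows for which predicate(row) is true.
        return MiniFrame(r for r in self.rows if predicate(r))

    def group_avg(self, key, value):
        # Average the `value` column within each distinct `key`.
        buckets = defaultdict(list)
        for r in self.rows:
            buckets[r[key]].append(r[value])
        return {k: sum(v) / len(v) for k, v in buckets.items()}


# Made-up sample data for illustration only.
people = MiniFrame([
    {"name": "Ana", "dept": "eng", "age": 34},
    {"name": "Bo", "dept": "eng", "age": 26},
    {"name": "Cy", "dept": "ops", "age": 41},
    {"name": "Di", "dept": "ops", "age": 19},
])

# One chained expression: filter rows, then aggregate by group.
avg_age_by_dept = people.filter(lambda r: r["age"] > 20).group_avg("dept", "age")
print(avg_age_by_dept)
```

The point of the sketch is the last expression: the filter and the grouped average read as a single chained statement over named columns rather than as hand-written loops, which is the kind of conciseness a data-frame interface aims for.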

Spark already features libraries for SQL, streaming, and advanced analytics, but the goal for the future is to create platform interfaces that extend Spark to a wide range of environments, such as NoSQL and traditional data warehouse environments, according to Zaharia.

Also in the Spark space, Databricks and Intel are collaborating to optimize Spark's real-time analytics capabilities for Intel's architecture. "We believe Spark's efficient in-memory computation within Hadoop enterprise data hub, combined with the performance of Intel Architecture, enables advanced analytics with faster real-time decisions," Michael Greene, vice president of the Intel Software and Services Group, said in a blog post on Friday.


Copyright © 2015 IDG Communications, Inc.