4 things enterprises are doing right now with Spark

A survey from Spark developers Databricks shows the big data engine growing on its own outside of Hadoop, among other eyebrow-raising findings

A study about a software product commissioned by one of that product's commercial sponsors -- take it with a shaker full of salt.

Still, Databricks' latest survey about the use of the continuously evolving Spark big data processing engine turned up enlightening insights about where and how Spark is being put to work.

The poll, conducted with 1,417 respondents in 842 organizations, touched on which industries use Spark and in what applications, as well as which Spark components got the most work. But of all the topics discussed, these four details about Spark users stood out the most.

1. They're using Spark outside of Hadoop almost half the time

Most of Spark's use cases are in conjunction with Hadoop, where it runs under the YARN framework and is used to process large amounts of data stored via one of Hadoop's storage mechanisms.

But Spark was originally developed outside of Hadoop and can run in its native clustering system if needed. As it turns out, this option is popular with the polled Spark users: 48 percent of Spark deployments, according to the survey, are run in stand-alone mode. By contrast, 40 percent run as YARN jobs in Hadoop; 11 percent use the Mesos clustering tool.

This adds weight to the notion that Spark doesn't need Hadoop to succeed -- if anything, it's the other way around. It's not clear if Mesos users are deploying Mesos "by hand" or through a general framework like Mesosphere's DCOS, but the fact that many Spark jobs aren't run inside Hadoop is plain.
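To make the three deployment options concrete, here is a minimal sketch of how a job's target cluster manager is selected when launching it with spark-submit. The helper name `spark_submit_args`, the host names, and the script name are hypothetical and used only for illustration; the `--master` flag and the `spark://`, `mesos://`, and `yarn` master URL forms come from Spark itself. Note that the bare `yarn` master applies to Spark 2.0 and later; Spark 1.x releases used `yarn-client` or `yarn-cluster` instead.

```python
# Hypothetical helper: builds the command line for spark-submit so the same
# application can target a stand-alone, Mesos, or YARN cluster.
DEFAULT_PORTS = {"standalone": 7077, "mesos": 5050}  # Spark's and Mesos' default master ports


def spark_submit_args(manager, app, host=None, port=None):
    """Return a spark-submit argument list for the given cluster manager."""
    if manager == "yarn":
        # YARN locates the cluster from the Hadoop configuration, so no host is given.
        master = "yarn"
    elif manager in DEFAULT_PORTS:
        if host is None:
            raise ValueError(f"{manager} mode needs the master host")
        scheme = "spark" if manager == "standalone" else "mesos"
        master = f"{scheme}://{host}:{port or DEFAULT_PORTS[manager]}"
    else:
        raise ValueError(f"unknown cluster manager: {manager}")
    return ["spark-submit", "--master", master, app]


# Illustrative host and script names, not taken from the survey.
standalone_cmd = spark_submit_args("standalone", "job.py", host="master.example.com")
yarn_cmd = spark_submit_args("yarn", "job.py")
mesos_cmd = spark_submit_args("mesos", "job.py", host="mesos.example.com")
```

The application code stays the same in all three cases; only the master URL changes, which is part of why running Spark without Hadoop is a low-friction choice.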

2. They're using Spark in the cloud more than half the time

Alongside the above statistics came word that 51 percent of the respondents are running Spark in a cloud environment, rather than in their data centers.

Though intriguing, the figure is hazy on specifics. The survey doesn't say, for instance, what percentage of those users are running a cloud-hosted Hadoop environment as well or whether they're running Spark directly in the cloud via platforms like IBM Bluemix. But the statistic is striking and hints at how much of the data processed with Spark is already cloud-native.

3. They're using Python for Spark more

Programming in Spark has long been associated with Scala or Java, since Spark itself is written mainly in those two languages. The downside to the JVM-powered Scala, useful as it is for the analysis done with Spark, is how little attention it gets as a language. As of September 2015, it ranked 27th in the Tiobe programming index.

Python, by contrast, is in the top five, and the use of Python as a language for Spark programming has jumped 49 percent over the past year, according to the survey. In all, 58 percent of Spark users surveyed now employ Python, while use of both Scala and Java has declined year over year. This makes sense given Python's programmer-friendly reputation and its standing as the language of choice for science-and-statistics workloads.

The R language, another math-and-stats staple, is also making a showing. In only the first year since R was introduced to Spark, it's already used by 18 percent of Spark's audience.

4. They're using Spark SQL for BI

The single most used component in Spark is Spark SQL (69 percent of respondents work with it), which runs SQL queries against data exposed by Spark applications.

This makes sense -- after all, the single biggest cited use case for Spark (68 percent) was business intelligence. SQL querying remains a common BI methodology, as it's an entrenched and relatively standard syntax for data queries. This squares with "self-service data" as one of the desiderata for big data apps and with Spark playing a role in how that need is satisfied.
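As a rough illustration of the kind of BI query involved, the sketch below runs a typical aggregate-and-rank report. It deliberately uses Python's built-in sqlite3 module as a stand-in, not Spark, so that it runs without a Spark installation. The `sales` table, its columns, and the sample rows are invented for the example and do not come from the survey. In Spark, similar query text would be handed to Spark SQL after the data is registered as a table.

```python
import sqlite3

# A typical BI-style report: total sales per region, largest first.
# Table and column names are hypothetical.
QUERY = """
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region
ORDER BY total DESC
"""


def run_report(rows):
    """Load (region, amount) rows into an in-memory table and run the report."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


# Made-up sample data for illustration only.
sample = [("east", 120.0), ("west", 80.0), ("east", 30.0), ("north", 50.0)]
report = run_report(sample)
```

The point of the sketch is the query itself: analysts who already know this syntax can apply it to data held in Spark without learning Scala, Java, or Python first.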
