Earlier this month, Cloudera announced a plan to make Spark the center of its Hadoop distribution. Astute observers will note that Cloudera has been moving in this direction for some time.
In a webinar yesterday, Doug Cutting, the creator of Hadoop and chief architect at Cloudera, offered more details about the company's decision to replace MapReduce with Spark. His presentation had broad implications for the Hadoop ecosystem. Here are the six major takeaways from that talk.
1. Hive and Pig
The linchpin of this plan is Cloudera's work to make Spark usable under Pig, Hive, and most of the other major components -- even tools like Sqoop! Whereas Hortonworks has been feverishly working on Tez integration, Cloudera has decided to go full-on with Spark.
A lot of the work Cloudera is doing is the unsexy stuff necessary to make Spark scale better. At the same time, the company noted that Spark doesn’t need to scale to as many nodes as MapReduce. In my experience, many Spark implementations require only a fraction as many nodes as MapReduce did.
During Cloudera’s webinar, there was a lot of hand waving around security. Most of what was laid out had nothing to do with Spark per se (such as perimeter security). However, it's a safe bet that in the not-so-distant future, Sentry -- Cloudera’s central security system -- will play a role. I’ve mentioned in the past that this divergence of the stacks around security is a serious lock-in threat.
At the moment, managing a Spark cluster is hard. Cloudera is planning to “reduce the number of knobs” and generally make Spark more manageable. They also plan to improve Spark’s Web UI. These efforts will be open source, although the company will also offer a nonmandatory “pane of glass” available via Cloudera Manager. Translation: Yes, it will be open source, but hey, use our Cloudera Manager!
Like the other Hadoop vendors, Cloudera is focused on Spark on YARN, whereas Databricks (the primary original sponsor of Spark) is actively marketing against YARN. In truth, a “stand-alone” Spark mode isn’t all that feasible for the multitenant jobs that Cloudera focuses on. Meanwhile, others (possibly including Databricks) are planning to push Mesos as a future replacement for YARN. Indeed, Mesos is very promising, especially in a Dockerized, automated world.
What caught my eye in particular are Cloudera’s plans to make resource allocation more dynamic based on the needs of the job. This brings us a bit closer to commoditized computing and “scaling to the workload.”
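For a sense of what “scaling to the workload” looks like in practice, here is a minimal sketch using the dynamic allocation properties that Spark itself exposes. It is not taken from the webinar; the helper names and the executor bounds (2 and 50) are made up for illustration, and only the `spark.*` property names are real Spark settings.

```python
# Sketch only: builds spark-submit arguments for dynamic executor allocation on YARN.
# The spark.* property names are standard Spark settings; the helper functions and
# the numeric bounds are hypothetical examples, not anything Cloudera presented.


def dynamic_allocation_conf(min_executors, max_executors):
    """Return Spark properties that let a job grow and shrink its executor count."""
    if min_executors < 0 or max_executors < min_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # The external shuffle service keeps shuffle files available
        # after an idle executor has been released.
        "spark.shuffle.service.enabled": "true",
    }


def to_submit_args(conf):
    """Flatten a properties dict into spark-submit command-line arguments."""
    args = ["--master", "yarn"]
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    conf = dynamic_allocation_conf(2, 50)
    print(" ".join(["spark-submit"] + to_submit_args(conf)))
```

With settings like these, a job starts small and requests more executors only while it has a backlog of tasks, then hands them back when idle, which is the behavior that makes shared, multitenant clusters easier to pack.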
5. Don’t forget Impala
Impala, Cloudera's SQL query engine, marked the one weird spot in the presentation. No mention was made of any integration with Impala (beyond the obvious fact that Spark SQL can connect to Impala or any SQL data source). This raised a lot of questions.
It also inspired a mea culpa. In my last article, I managed to make the observation that proprietary MPP systems like Netezza and Teradata would be eaten by the Hadoop ecosystem without mentioning Impala at all. Cloudera rightfully called me out on this. It's a significant oversight because Impala is an MPP system that uses parts of Hive for the SQL interface.
In short, if you are doing a long-running SQL job, Hive on Tez (or Spark) is fine. But if you are connecting Tableau, for instance, to Hadoop, you may find Impala delivers a much more pleasant user experience (it distributes the relevant part of the SQL queries to the nodes). On smaller working sets, this returns results faster. We actually use Impala at my company for most of our Tableau-Hadoop work and many of our BI-type projects.
Spark’s architecture is fully distributed. It uses large working sets for one query, whereas Impala tends to break up queries to push those parts close to the nodes with the data. Could some compromise be found so that we could have fewer boxes and less reimplementation of common stuff? While there could be a damn good technical reason nothing can be done (no obvious solution presents itself to me), I’d love fewer moving parts.
Cloudera said Spark Streaming is fine for one-second latencies, but not the subsecond latencies often demanded in streaming applications. Cloudera outlined its plans to improve Spark Streaming, including better memory management, and to offer a new SQL interface to Spark. In the Q&A, when asked about Storm, Cloudera noted Spark's advantages (for example, like MapReduce, Storm has a rather low-level API). Ignored was the fact that Storm can go subsecond -- and that Spark Streaming often buckles at scale.
Cloudera plans to improve Spark to the point that it replaces Storm entirely. The company notes that Storm's developer community is dwindling, that Twitter ditched it, and that microbatches are a common use case Storm doesn’t support well.
I wouldn’t plan to shoot Storm in the head yet; Spark has a way to go here. Cloudera has big plans, but they are not yet realized.
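To see why a microbatch engine struggles to go subsecond, consider a toy model of the latency floor. This is an idealized sketch, not Spark Streaming's actual scheduler: it assumes zero processing time and no scheduling delay, and the function name and sample timestamps are invented for illustration.

```python
# Toy model (an assumption, not Spark's scheduler): an event is only handled
# when the batch window it arrived in closes. Times are integer milliseconds.


def microbatch_latencies(arrival_ms, batch_interval_ms):
    """Return the wait each event sees before its batch fires."""
    if batch_interval_ms <= 0:
        raise ValueError("batch_interval_ms must be positive")
    latencies = []
    for t in arrival_ms:
        # Batch k covers [k * interval, (k + 1) * interval) and fires at its upper bound.
        batch_end = (t // batch_interval_ms + 1) * batch_interval_ms
        latencies.append(batch_end - t)
    return latencies


if __name__ == "__main__":
    arrivals = [0, 250, 999, 1000, 1500]  # made-up arrival times
    for interval in (1000, 100):
        waits = microbatch_latencies(arrivals, interval)
        print(f"interval={interval}ms worst={max(waits)}ms mean={sum(waits) / len(waits):.0f}ms")
```

Even in this best case, the worst-case wait equals the batch interval, so a one-second batch cannot promise subsecond results. Shrinking the interval helps on paper, but every batch carries fixed scheduling overhead, which is where a per-event engine like Storm keeps its edge.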
We have a winner
Frankly, Spark is the engine you should choose by default for your new big data projects. Yahoo, one of Hadoop’s birthplaces, has apparently reached that conclusion as well. Now Doug Cutting, Hadoop’s creator, says you should -- and his company, the largest Hadoop provider, is putting it at the center of its technology strategy. Are you getting the message?