The other night I ran into Jackie Barbetta and Don Rosenberg at our local Apache Spark meetup group. Barbetta and Rosenberg work in IBM Emerging Technologies on an open source project called EclairJS.
The project does the unthinkable: It brings the data science and distributed computing world of Spark to the JavaScript world of the “Web developer.”
Why on earth would you want to do that?
Apparently, part of the original management backing for the project came from the idea that JavaScript developers are a lot cheaper than Scala or Python developers. I cleared my throat and asked: “Um, how cheap will the very few JavaScript developers who understand machine learning and distributed computing actually be?” In reply, Rosenberg and Barbetta acknowledged that this was “management thinking.”
According to them, the real motivation is to make Spark accessible to JavaScript/Node.js developers, who might embed results in a Web page or build a custom dashboard without leaving the familiar space of JavaScript. Moreover, if you’re developing a streaming-based application, the connection between the event/stream-based Node.js and Apache Spark streaming seems obvious.
Streaming analytics aside, merely having access to a stream processing system -- more particularly, stream processing with an appropriate storage mechanism -- is a key advantage for Node.js developers. If you, the Node.js developer, recently discovered that neither MySQL nor MongoDB is the best choice for high-end data streams, Spark and HBase might turn out to be your new best friends.
How EclairJS works
If you have a JavaScript-based Web page -- created with, say, AngularJS or Angular2 -- you can write to the EclairJS API. The EclairJS API more or less wraps the Spark API in a JavaScript-developer friendly way. The idea is that a JavaScript developer who doesn’t necessarily understand Scala-isms or Python-isms can write code that tastes like JavaScript and not merely Scala in JavaScript syntax.
This component wrapper connects through the Jupyter Gateway to Apache Toree on the server side. Toree is an implementation of the iPython protocol, but it is (clearly) not limited to Python, and it has its roots in the Jupyter/iPython project. Toree is more or less an RPC wrapper for Spark.
From Toree, your calls are marshaled through EclairJS running on the Nashorn JavaScript runtime (aka JavaScript for the JVM) and compiled directly into Apache Spark code.
Figure: How EclairJS Nashorn exposes the Apache Spark programming model to JavaScript
Besides writing a mess of Spark code in the middle of your JavaScript-oriented Web page, you can use iPython Notebook and write JavaScript instead of Python. Or if you like, you can use REPL support and evaluate your Spark-ish JavaScript at the command line.
What's the catch?
Rosenberg reports they haven’t done any performance testing, and some JavaScript-isms don’t translate cleanly to Spark. For instance, if you have a variable outside of your lambda function, it won’t be distributed. Most JavaScript programmers who've never used Spark will expect their closure to have access to the var from inside the lambda. Instead, you’ll need to bind the variable or pass it into the function as an additional argument.
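To see why the closure goes missing, here is a minimal plain-JavaScript sketch. It does not use the EclairJS API. It assumes, purely for illustration, that a function reaches a worker as source text and is rebuilt there; the helper shipToWorker is hypothetical and only simulates that round trip.

```javascript
// Simulation only: shipToWorker is a hypothetical helper, not an EclairJS call.
// It mimics sending a function elsewhere as source text and rebuilding it,
// which discards whatever scope the function was originally defined in.
function shipToWorker(fn, extraArgs) {
  const source = fn.toString();
  // new Function builds the copy in global scope, so outer locals are gone.
  const rebuilt = new Function('return (' + source + ')')();
  return function (value) {
    return rebuilt.apply(null, [value].concat(extraArgs || []));
  };
}

function runDemo() {
  const threshold = 10; // lives outside the lambdas below

  // Depends on the closure over threshold.
  const closureFilter = function (x) { return x > threshold; };

  // Receives the value as an explicit extra argument instead.
  const explicitFilter = function (x, limit) { return x > limit; };

  const data = [5, 15, 25];

  // Locally, the closure works as any JavaScript developer expects.
  const local = data.filter(closureFilter);

  // Once "shipped", the rebuilt copy cannot see threshold.
  let remoteError = null;
  try {
    data.filter(shipToWorker(closureFilter));
  } catch (e) {
    remoteError = e.name;
  }

  // Passing the value along with the function survives the trip.
  const remote = data.filter(shipToWorker(explicitFilter, [threshold]));

  return { local: local, remoteError: remoteError, remote: remote };
}

const demo = runDemo();
```

The second form is the "pass it in as an additional argument" approach described above: the function carries no hidden state, so nothing is lost when it moves.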
Next, a lot of the type marshaling is limited to primitive or noncomplex types, so you may have to do your own stunts in marshaling/unmarshaling JSON. There is no “require” on the Nashorn side, only “load” -- if you dreamt there would be no learning curve and no places where you have to try real hard, you’ll be disappointed.
As Spark evolves, the likes of DataFrames and Datasets are becoming more strongly typed. This doesn’t mesh with JavaScript, but as ES2015 and later take over, this may be less of a big deal. Besides, the more slovenly JavaScript developers probably don’t have a place in this distributed computing world.
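As a rough idea of what those JSON stunts might look like, the sketch below flattens a nested record to a string (a primitive type) and rebuilds it on the other side. The helper names and the record shape are made up for illustration; none of this is part of EclairJS.

```javascript
// Hypothetical helpers for illustration; not part of EclairJS.

// Flatten a complex record into a plain string before handing it over.
function packRecord(record) {
  return JSON.stringify(record);
}

// Rebuild the record on the other side.
function unpackRecord(text) {
  const parsed = JSON.parse(text);
  // JSON has no date type, so the timestamp arrives as a string
  // and has to be revived by hand.
  parsed.seenAt = new Date(parsed.seenAt);
  return parsed;
}

const original = {
  user: 'ada',
  tags: ['spark', 'node'],
  seenAt: new Date('2016-03-01T00:00:00Z')
};

const wire = packRecord(original);   // a string, safe to pass as a primitive
const restored = unpackRecord(wire); // nested structure and Date recovered
```

The point is that anything richer than a string or number needs an explicit encode and decode step, and types JSON does not know about (dates here) need their own handling.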
Where you can find EclairJS and what it can do now
EclairJS is open source and hosted on GitHub, and there's a Google group. The first milestone implements the Spark Core API, Spark SQL, DataFrames, and Spark Streaming.
In other words, you can start playing with this right now. Barbetta and Rosenberg are working on porting MLlib and GraphX so that Node.js/JavaScript types can access machine learning techniques and graph analysis.
A small team of IBMers is doing most of the development for EclairJS, and they’d love your help. With the hotness that is Node.js and the hotness that is Spark, this could be a good place for you to distinguish yourself by contributing. Or if you were good at the maths but ended up as a Web developer and got into Node.js, maybe this is where you can show your data science chops without having to learn Scala or Python.