Recently I went to the IBM Insight conference in Las Vegas. I have to say, I find it a bit ironic to hold a conference about data and analytics in a city built on a misunderstanding of statistics -- and in a casino, no less.
Nonetheless, I'm sure a lot of IBMers are very passionate about data, analytics, machine learning, and all that. I wouldn't question IBM's vocal commitment to Spark, either.
But IBM's stated commitment to educate "more than 1 million data scientists and data engineers on Spark" is over the top. Don't get me wrong, I'm committed to training several new Spark developers myself. Spark is really, really important, but it's important like CGI was important to the evolution of the Web -- or maybe like Apache/Netscape/IIS modules were important to the evolution of the Web.
Datranets and the Daternet
As companies become more data-driven, two major movements will arise. On one hand, there will be a kind of internal data supply-chain integration. Rather than having disparate systems doing discrete processing, they'll be tied together not only for analytics, but for streams of processing. Data will run your business and form an internal "flow" or system of flows. I call this your "datranet."
But that isn't the end of it. As companies move to real time and the final evolution of B-to-B communication emerges -- and as more systems become automated with rules and business process management tools automating decision making -- networks of companies will share data along a kind of specialized Internet that I call "the Daternet."
There will be compromises along the way (please, no one talk about Virtual Private Datranets, because VPD sounds like a social disease), but at the end of the day, this is the natural evolution of, well, humanity and maybe everything else.
Spark is the C of the Daternet
Spark looks like a high-level, powerful API when compared to MapReduce, Tez, or Storm. If MapReduce was Fortran, then Spark is C. But in the future, the majority of the Daternet or your datranet will not be written directly to Spark APIs. Maybe you'll use an evolution of Cascading or some other tool yet to be defined.
Your datranet may run on top of something other than Spark -- something that more closely resembles Cloudera's Impala, for example. Quite likely the Daternet will be powered by Spark, but developers will write applications using a higher-level option built on Spark rather than writing to Spark itself.
Also, connecting disparate internal data sources to analytics is often a big storage and parallel computing problem rather than a massive in-memory analytics issue. You may need to execute over a smaller number of cores with lower latency than batch processing allows, and the job may not call for a massive streaming scale-out.
Is Spark important? Heck yeah it is. But I'm not sure 1 million data scientists and data engineers need to learn it. They may need to learn more about how distributed computing works and about functional programming constructs. They may also need to learn about streaming.
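To make "functional programming constructs" a little more concrete, here is a minimal sketch in plain Python, with no Spark involved. It counts words by mapping over lines and reducing the partial results, first within each partition and then across partitions. The function name count_words and the sample partitions are made up for illustration, and the two in-memory lists merely stand in for data that would live on separate machines.

```python
from collections import Counter
from functools import reduce


def count_words(lines):
    """Count words in one partition using map and reduce."""
    # map: turn each line into a Counter of its lowercase words
    per_line = map(lambda line: Counter(line.lower().split()), lines)
    # reduce: merge the per-line counters into one counter
    return reduce(lambda a, b: a + b, per_line, Counter())


# Hypothetical sample data: each inner list plays the role of one partition.
partitions = [
    ["spark is c", "c is fast"],
    ["spark is here"],
]

# Each partition is counted independently (this step could run in parallel).
partials = [count_words(p) for p in partitions]

# The partial counts are then merged into one final result.
total = reduce(lambda a, b: a + b, partials, Counter())

print(total["spark"], total["is"], total["c"])
```

Because merging two counters is associative, it doesn't matter how the lines are split across partitions. That is the idea worth learning, whichever framework ends up running it.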
Maybe it will be helpful to know Spark like it's helpful to know C today, but Spark does not represent, as IBM implies, some sort of end state. Spark is simply where we are today. Watson Analytics might be written against it (in part) for now.
In the future, the vast majority of the Daternet will be written using a new technology that makes Spark look more like C than, say, Java or C# or ASP or JSP or even PHP. Yes, Spark is a wonderful, essential piece of technology and I'm grateful Matei Zaharia started the project. Do we need 1 million new Spark developers? Only if you're in marketing.