Mention big-data tools like Spark and Kafka to most enterprise users, and Hadoop is the other tool that comes to mind along with them. But does it need to?
Mesosphere, the corporate backer of the Apache Mesos cluster-management project, is ginning up a big-data stack that eschews Hadoop but embraces Spark (and Kafka, and Cassandra, and the Akka event framework) for real-time processing.
Mesosphere Infinity is "a turnkey, full-stack offering optimized for big data and IoT," and its main aim is to give businesses an easily erected stack for real-time data work. It also stands as a recent example of how many of the technologies reflexively associated with the Hadoop stack don't require Hadoop to be useful.
Look, ma, no Hadoop
Matt Trifiro, chief marketing officer for Mesosphere, explained in a phone conversation how Infinity is managed by another Mesosphere creation: Mesosphere DCOS, which allows entire data centers full of applications to be stood up easily. Infinity, in turn, is for managing a relatively narrow range of applications: Spark for data processing; Kafka for real-time data ingestion; and another Apache Foundation project, Cassandra, for data storage.
While Infinity "doesn't exclude Hadoop," said Trifiro, "it doesn't require it, either. You can use [Hadoop's] HDFS as a persistent data store, and you may have Hadoop processing over data pushed into Cassandra, but in terms of real-time acquisition, you need a specialized stack."
Sparks of inspiration
Spark has drawn attention of late from a roster of A-list technology firms interested in both investing in the project and leveraging it for heavy-duty business analytics work. Still, like many other open source data tools, Spark is by itself far more "project" than "product" -- it takes nontrivial effort to use in an enterprise environment.
Trifiro claims that Infinity, with Spark and the rest of its stack, "was built from observation of what people were putting into production." Businesses were attempting to put together Spark and Kafka stacks for real-time analysis, said Trifiro, because "the demand for processing real-time data by non-Web companies is relatively new, and there's immense pressure on IT teams to do this." Standing up such a stack in its entirety has "historically required a lot of expertise," and Infinity is meant to require minimal work to get up and running.
Mesosphere plans to make Infinity's stack even easier to consume by offering it via existing cloud services. Right now, though, the only named partner for cloud-based enterprise distribution is Cisco, the same company that worked hand-in-hand with Mesosphere to build Infinity.
One possible analogy is with running applications in containers, versus using virtualization and OpenStack. Containers offer a potentially more precise solution to the problems of running applications at scale than VMs do. Likewise, Spark alone, as opposed to Spark plus Hadoop, might present a better fit for the data-processing problems faced by enterprises -- as long as deploying and managing a Spark stack doesn't put them back at square one.