A living application doesn’t need to store data any more than a living organism needs to “store” its blood. Streaming data analytics is in fact the bloodstream of modern applications.
In making that observation, I want to say: what a difference a year makes. Less than a year ago, I characterized streaming analytics as “the outlier in discussions about big data architectures.” Now, with the rapid rise of the streaming economy and the recent rapid adoption of Spark Streaming, the topic has gone mainstream (pun intended).
In fact, one might argue that streaming data architectures are as fundamental to today’s “live” cloud-oriented data services as relational data architectures were to the prior era of on-premises database computing.
That explains why, for example, there’s growing interest in such approaches as “Lambda architecture,” which refers to the need to integrate both batch and streaming data processing within a common architecture under a common development, runtime, and administration paradigm.
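To make the Lambda idea concrete, here is a minimal, self-contained Python sketch of its three layers: a batch layer that periodically recomputes a complete view over all data, a speed layer that incrementally updates a view over recent events, and a serving layer that merges the two at query time. All names here (`LambdaSketch`, `ingest`, `run_batch`) are illustrative, not taken from any real framework.

```python
# Schematic sketch of the Lambda architecture: batch view + speed view,
# merged at query time. Illustrative only -- not a real framework's API.

from collections import Counter

class LambdaSketch:
    def __init__(self):
        self.master_dataset = []      # immutable, append-only record of all events
        self.batch_view = Counter()   # rebuilt in bulk; always lags behind
        self.speed_view = Counter()   # updated per event; covers the lag window

    def ingest(self, event):
        """New events land in both the master dataset and the speed layer."""
        self.master_dataset.append(event)
        self.speed_view[event] += 1

    def run_batch(self):
        """Periodic batch job: rebuild the view from scratch, then clear
        the speed layer, whose events the batch view now covers."""
        self.batch_view = Counter(self.master_dataset)
        self.speed_view.clear()

    def query(self, key):
        """Serving layer: merge the batch and real-time views."""
        return self.batch_view[key] + self.speed_view[key]

counts = LambdaSketch()
for e in ["click", "view", "click"]:
    counts.ingest(e)
print(counts.query("click"))  # -> 2 (entirely from the speed layer)
counts.run_batch()
counts.ingest("click")
print(counts.query("click"))  # -> 3 (2 from the batch view + 1 from the speed layer)
```

The point of the pattern is exactly what the merge in `query` shows: batch and streaming results are produced under one paradigm and combined transparently for the consumer.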
The rise of streaming analytics also explains why we’re seeing a surge in attempts to categorize “live” data integration patterns under which various stream-processing architectures can operate with one another, as well as with batch, at-rest, and other “less than live” database architectures. For example, check out this recent blog post by Ashish Singh, who outlines a “canonical stream-processing architecture” built from two Apache codebases, Kafka and Spark Streaming.
Discussions about data architectures once focused on stores: databases, file systems, and other at-rest repositories. Today, the emphasis is moving toward living streams. Stores aren't going away, of course, but they’re being repositioned as feeders of metadata into and out of ever-changing streams. By contrast, the traditional discussion of streams as feeders of data into and out of stores is giving way to an emphasis on all-streaming environments that support most of the core features (such as transactionality and embedded analytic processing) that we used to associate primarily with stores.
This shift is clearly reflected in Singh’s discussion of stream-processing systems (in this case, Spark Streaming) working in conjunction with a message bus that provides distributed commit-log services (in this case, Kafka) and a high-performance metadata store (in this case, HBase). The sources and sinks of the streams are the application endpoints, which feed live data into streams and/or consume stream-processed data and results within the context of “live” processes.
None of these are new architectural concepts, particularly if you’ve worked with message-oriented middleware. What’s new is what’s missing: the old-fashioned idea of managing data “ingest” into stores, from which it’s accessed by consuming applications (despite the fact that the blog is titled “Ingest Tips”). Instead, data processing is geared to ensuring live streaming from sources to sinks. The data architecture is concerned with designing the in-motion streaming patterns best suited to a distributed application, with the at-rest stores -- HDFS, HBase, MongoDB, and so on -- an important but not a defining feature.
This architectural shift came into focus for me when I encountered a recent article by Ted Malaska about architectural patterns for near-real-time data processing with Hadoop. As I read about the four streaming patterns, I realized that on some level Hadoop itself -- in other words, HDFS or HBase -- is an interchangeable component in these patterns, one that may be replaced by various alternative stores better suited to various roles.
That realization jumped out at me as I pondered this statement:
The Apache Hadoop ecosystem has become a preferred platform for enterprises seeking to process and understand large-scale data in real time. Technologies like Apache Kafka, Apache Flume, Apache Spark, Apache Storm, and Apache Samza are increasingly pushing the envelope on what is possible. It is often tempting to bucket large-scale streaming use cases together but in reality they tend to break down into a few different architectural patterns, with different components of the ecosystem better suited for different problems.
In other words, the focus is on Hadoop’s streaming-focused “ecosystem,” not on its storage-centric platforms. That becomes clear as Malaska lays out the four principal stream-computing patterns:
- Stream ingestion that persists events at low latency to various stores (HDFS, HBase, Solr, and more)
- Near real-time processing with external context persisted and/or accessed from various stores
- Event-partitioned processing that persists relevant external context at ultra-low latency to various in-memory platforms
- Complex multilatency (real-time and mini-batch) topologies that enable stateful, in-stream interception, sessionization, aggregation, windowed computation, machine learning processing, and other functions with high transactionality and accuracy
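The last of these patterns -- stateful, windowed, in-stream computation -- can be illustrated with a hand-rolled sliding-window aggregator. This is a toy sketch, not Spark Streaming’s actual windowing API; the class and parameter names are my own.

```python
# Toy illustration of stateful, windowed in-stream computation:
# keep only events inside a sliding time window and maintain their sum.

from collections import deque

class SlidingWindowSum:
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0

    def on_event(self, ts, value):
        """Add an event, evict anything older than the window,
        and return the current windowed sum."""
        self.events.append((ts, value))
        self.total += value
        while self.events and self.events[0][0] <= ts - self.window:
            _, old = self.events.popleft()
            self.total -= old
        return self.total

w = SlidingWindowSum(window_seconds=10)
print(w.on_event(0, 4))   # -> 4
print(w.on_event(5, 1))   # -> 5
print(w.on_event(12, 2))  # -> 3 (the event at t=0 fell out of the window)
```

The essential feature is that state lives in the stream processor itself, updated event by event -- exactly the capability that used to be the exclusive province of stores.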
The author makes a strong case for using Spark Streaming for many of these living patterns. Consider his argument on its own merits. The point I want to leave you with is that many Spark applications are streaming applications, and all-streaming architectures, in which HDFS plays an important but not a defining role, are becoming common.
This point holds regardless of what specific streaming platforms you use or recommend. As my colleague Roger Rea pointed out in a recent blog post:
Spark and Streams have been targeted at different areas of big data -- Spark for data at rest (albeit in memory to gain speed) and Streams for data in motion (processing events as they happen). Integrating the two enables a wider range of access to data and a wider range of analytic applications to solve business problems.
More broadly, stream computing platforms such as these encompass low-latency, application-level processing of live data in any volume, variety, frequency, format, payload, order, or pattern. This processing may involve any or all of the usual functions: acquisition, ingest, filtering, correlation, transformation, aggregation, calculation, analysis, query, display, delivery, alerting, routing, and so forth.
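Several of the functions listed above -- filtering, transformation, aggregation, continuous delivery of results -- compose naturally as chained stages over a stream. The sketch below wires a few of them together with plain Python generators; the stage names are illustrative, not drawn from any particular stream-computing platform.

```python
# Chaining stream-processing stages (filter -> transform -> aggregate)
# with Python generators. Each stage consumes events lazily, as a
# stream processor would, rather than waiting for a finished dataset.

def filter_stage(events, predicate):
    for e in events:
        if predicate(e):
            yield e

def transform_stage(events, fn):
    for e in events:
        yield fn(e)

def running_aggregate(events):
    """Emit a running total after each event -- results delivered
    continuously rather than after the stream 'ends'."""
    total = 0
    for e in events:
        total += e
        yield total

readings = [3, -1, 7, 2, -5, 4]                        # live source stand-in
positives = filter_stage(readings, lambda x: x > 0)    # filtering
doubled = transform_stage(positives, lambda x: 2 * x)  # transformation
print(list(running_aggregate(doubled)))                # -> [6, 20, 24, 32]
```

Because every stage yields results as events arrive, swapping the list for a genuinely unbounded source changes nothing about the pipeline -- which is the whole premise of in-motion processing.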
In the final analysis, the purpose of stream computing is to drive speedier results by delivering live intelligence into live business processes. Ideally, every "at rest" big data repository -- be it EDW, Hadoop, or whatever -- can and should host live data in order to drive live decisions.