The way that big data gets big is through a constant stream of incoming data. In high-volume environments, that data arrives at incredible rates, yet still needs to be analyzed and stored.
John Hugg, software architect at VoltDB, proposes that instead of simply storing that data to be analyzed later, perhaps we've reached the point where it can be analyzed as it's ingested while still maintaining extremely high intake rates using tools such as Apache Kafka.
-- Paul Venezia
Less than a dozen years ago, it was nearly impossible to imagine analyzing petabytes of historical data using commodity hardware. Today, Hadoop clusters built from thousands of nodes are almost commonplace. Open source technologies like Hadoop reimagined how to efficiently process petabytes upon petabytes of data using commodity and virtualized hardware, making this capability available cheaply to developers everywhere. As a result, the field of big data emerged.
A similar revolution is happening with so-called fast data. First, let's define fast data. Big data often originates as data generated at incredible speeds, such as click-stream data, financial ticker data, log aggregation, or sensor data. Often these events occur thousands to tens of thousands of times per second. No wonder this type of data is commonly referred to as a "fire hose."
When we talk about fire hoses in big data, we're not measuring volume in the typical gigabytes, terabytes, and petabytes familiar to data warehouses. We're measuring volume in terms of time: the number of megabytes per second, gigabytes per hour, or terabytes per day. We're talking about velocity as well as volume, which gets at the core of the difference between big data and the data warehouse. Big data isn't just big; it's also fast.
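To make the shift from size to rate concrete, here is a minimal sketch that converts an event rate into megabytes per second, gigabytes per hour, and terabytes per day. The event rate and the per-event size are illustrative assumptions, not figures from any real system.

```python
# Hypothetical workload: both numbers below are illustrative assumptions.
EVENTS_PER_SECOND = 50_000   # within the "tens of thousands per second" range
BYTES_PER_EVENT = 200        # assumed average size of one event


def throughput(events_per_second, bytes_per_event):
    """Return sustained volume as (MB per second, GB per hour, TB per day)."""
    bytes_per_second = events_per_second * bytes_per_event
    mb_per_second = bytes_per_second / 1e6
    gb_per_hour = bytes_per_second * 3600 / 1e9
    tb_per_day = bytes_per_second * 86400 / 1e12
    return mb_per_second, gb_per_hour, tb_per_day


mb_s, gb_h, tb_d = throughput(EVENTS_PER_SECOND, BYTES_PER_EVENT)
print(f"{mb_s:.1f} MB/s, {gb_h:.1f} GB/hour, {tb_d:.3f} TB/day")
```

Under these assumptions, a modest-looking 10 MB per second adds up to roughly 0.86 TB every day, which is why rate rather than total size is the useful unit for a fire hose.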
Much of the benefit of big data disappears if fresh, fast-moving data from the fire hose is simply dumped into HDFS, an analytic RDBMS, or even flat files, because the ability to act or alert right now, as things are happening, is lost. The fire hose represents active data, immediate status, or data with ongoing purpose. The data warehouse, by contrast, is a way of looking through historical data to understand the past and predict the future.
Acting on data as it arrives has been thought of as costly and impractical, if not impossible, especially on commodity hardware. Just as with big data, the value in fast data is being unlocked by reimagined implementations: message queues and streaming systems such as open source Kafka and Storm, and databases in the form of open source NoSQL and NewSQL offerings.
Capturing value in fast data
The best way to capture the value of incoming data is to react to it the instant it arrives. If you are processing incoming data in batches, you've already lost time and, thus, the value of that data.
To process data arriving at tens of thousands to millions of events per second, you will need two technologies: First, a streaming system capable of delivering events as fast as they come in; and second, a data store capable of processing each item as fast as it arrives.
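The pairing described above can be sketched with nothing but the Python standard library. In this toy version, a `queue.Queue` stands in for the streaming system and a hypothetical `InMemoryStore` class stands in for the data store; the sensor names, the reading threshold of 100, and `ALERT_THRESHOLD` are all assumptions made up for illustration. The point is only the shape: each event updates state and can trigger a decision the moment it arrives, rather than waiting for a batch.

```python
import queue
import threading

ALERT_THRESHOLD = 3  # hypothetical: alert once a sensor reports 3 high readings


class InMemoryStore:
    """Toy data store that updates state and makes a decision per event."""

    def __init__(self):
        self.high_counts = {}  # sensor name -> number of high readings seen
        self.alerts = []       # sensors that crossed the threshold, in order

    def process(self, event):
        sensor, reading = event
        if reading > 100:  # assumed definition of a "high" reading
            count = self.high_counts.get(sensor, 0) + 1
            self.high_counts[sensor] = count
            if count == ALERT_THRESHOLD:
                self.alerts.append(sensor)  # act immediately, not in a later batch


def consume(stream, store):
    """Hand each event to the store as soon as the stream delivers it."""
    while True:
        event = stream.get()
        if event is None:  # sentinel marking the end of this demo stream
            break
        store.process(event)


stream = queue.Queue()  # stand-in for the streaming system
store = InMemoryStore()
worker = threading.Thread(target=consume, args=(stream, store))
worker.start()

for event in [("s1", 120), ("s2", 90), ("s1", 130), ("s1", 101), ("s2", 150)]:
    stream.put(event)
stream.put(None)
worker.join()

print(store.alerts)
```

A production system would replace the queue with a distributed streaming system and the dictionary with a real data store, but the division of labor stays the same: one component delivers, the other processes each item on arrival.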
Delivering the fast data
Two popular streaming systems have emerged over the past few years: Apache Storm and Apache Kafka. Storm, originally developed at BackType and open-sourced after Twitter acquired the company, can reliably process unbounded streams of data at rates of millions of messages per second. Kafka, developed by the engineering team at LinkedIn, is a high-throughput distributed message queue system. Both streaming systems address the need to process fast data. Kafka, however, stands apart.
Kafka was designed to be a message queue and to solve the perceived problems of existing technologies. It's sort of an über-queue with unlimited scalability, distributed deployments, multitenancy, and strong persistence. An organization could deploy one Kafka cluster to satisfy all of its message queueing needs. Still, at its core, Kafka delivers messages. It doesn't support processing or querying of any kind.
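The idea of a persistent queue that delivers but does not process can be illustrated with a toy model. The `ToyLog` class below is a hypothetical, single-process stand-in and not Kafka's client API: it keeps every message in an append-only list and tracks a read position per consumer, so one consumer reading a message does not remove it for another. The consumer names and messages are invented for the example.

```python
class ToyLog:
    """Toy append-only log with per-consumer read positions (not Kafka's API)."""

    def __init__(self):
        self._messages = []  # messages are retained after being read
        self._offsets = {}   # consumer name -> index of next message to read

    def append(self, message):
        """Store a message and return its position in the log."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer, max_messages=10):
        """Deliver the next unread messages for this consumer, unmodified."""
        start = self._offsets.get(consumer, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch


log = ToyLog()
for message in ["click:1", "click:2", "click:3"]:
    log.append(message)

billing_batch = log.poll("billing", max_messages=2)  # first two messages
audit_batch = log.poll("audit")                      # all three, independently
billing_rest = log.poll("billing")                   # resumes where it left off

print(billing_batch, audit_batch, billing_rest)
```

Note what the sketch leaves out: there is no filtering, aggregation, or lookup by content. Consumers receive messages in order and anything beyond delivery is left to whatever sits downstream.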