Review: Storm’s real-time processing comes at a price

The open source stream processing solution is proven reliable at scale, but difficult to learn and use

Storm, a top-level Apache project, is a Java framework that helps programmers write real-time applications that run on Hadoop clusters. Designed at Twitter, Storm excels at processing high-volume message streams to collect metrics, detect patterns, or take action when certain conditions appear in the stream. Typical Storm scenarios sit at the intersection of real time and high volume, such as analyzing financial transactions for fraud or monitoring cell-tower traffic to maintain service level agreements.

Traditionally these sorts of systems have been constructed using a network of computers connected by a message bus (such as JMS). What makes Storm different is that it combines the message-passing and processing infrastructure into a single conceptual unit known as a "topology" and runs it on a Hadoop cluster. This means that Storm clusters can take advantage of the linear scalability and fault tolerance of Hadoop, without the need to reconfigure the messaging bus when increasing capacity.

When working with teams new to Storm, I have found it helpful to approach system design from three dimensions: operations, topology, and data. These roughly map onto their corresponding dimensions in traditional enterprise applications, but translated into the Hadoop world. A Storm topology is a processing workflow analogous to a set of steps in a processing pipeline that would be managed by Oozie in a multipurpose Hadoop cluster.

The topology is the fundamental unit of deployment in Storm. It consists of two types of objects: spouts (message sources) and bolts (message processors). Spouts are available for many common data sources such as JMS, Kafka, and HBase.
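To make the spout/bolt relationship concrete, here is a minimal conceptual sketch in plain Java. This is not the actual Storm API (a real topology uses Storm's `TopologyBuilder`, `IRichSpout`, and `IRichBolt` classes and runs distributed); the interface and class names below are illustrative only, showing how tuples flow from a source into a processor:

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch, NOT the real Storm API: a spout emits tuples,
// a bolt transforms each tuple. All names here are hypothetical.
interface Spout { List<String> nextBatch(); }   // message source
interface Bolt  { String process(String tuple); } // message processor

public class MiniTopology {
    private final Spout spout;
    private final Bolt bolt;

    public MiniTopology(Spout spout, Bolt bolt) {
        this.spout = spout;
        this.bolt = bolt;
    }

    // Pull one batch from the spout and push each tuple through the bolt.
    public List<String> runOnce() {
        List<String> out = new ArrayList<>();
        for (String tuple : spout.nextBatch()) {
            out.add(bolt.process(tuple));
        }
        return out;
    }

    public static void main(String[] args) {
        // Example: flag transactions over a threshold, echoing the
        // fraud-detection scenario mentioned earlier.
        Spout source = () -> List.of("tx:100", "tx:9000");
        Bolt flagLarge = t ->
            Integer.parseInt(t.split(":")[1]) > 1000 ? t + ":FLAGGED" : t;
        System.out.println(new MiniTopology(source, flagLarge).runOnce());
        // prints [tx:100, tx:9000:FLAGGED]
    }
}
```

In real Storm, the framework handles distribution, parallelism, and fault tolerance around these two roles; the sketch only captures the dataflow idea.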

Deploying a Storm cluster

Storm clusters typically run multiple Storm applications (topologies) simultaneously. In this sense a Storm cluster is analogous to a Java application server. A developer bundles up a JAR file containing the Storm topology and all of its dependencies, then deploys it to the cluster, where it runs until terminated. Storm application developers do not need to be aware of the specific configuration of the cluster their application runs on, so they can focus on the specifics of their application.
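The deployment step described above uses Storm's `storm jar` command. A sketch of the workflow, where the JAR path, class name, and topology name are hypothetical placeholders:

```shell
# Submit the bundled topology JAR to the cluster; it runs until killed.
# (Paths and names below are illustrative, not from a real project.)
storm jar target/my-topology.jar com.example.MyTopology my-topology-name

# Later, terminate the running topology by name.
storm kill my-topology-name
```

The submitting machine only needs the Storm client installed; the cluster's Nimbus service distributes the JAR to the worker nodes.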
