I see a future with relatively little batch processing, where even long processes will run a bit at a time. As traditional storage gives way to new data architectures and streaming becomes easier, “real time” data analytics will become the new normal.
With a client-server architecture dependent on a relational database management system (RDBMS), streaming or event processing is relatively rare. You have traditional messaging products like Tibco, MQ Series, or your favorite messaging implementation. These scale well, but not massively -- and when kept in sync with your RDBMS, they scale only as well as your RDBMS.
More often than not, you end up doing your analytics on the back end. You’re in good company: When Google wrote its MapReduce paper, it was analyzing the Web in this conventional way.
Google moved on to streaming, however, and so should you. Systems based on streaming analytics require more resources initially, but make better use of those resources over time. Part of the reason is that they don't reanalyze all of history simply to get a result.
Consider any age-old business problem that you might want to analyze and make a decision about. Perhaps you're a supermarket or clothing retailer and you're creating personalized customer promos based on past purchases. The moment a purchase comes in from a given customer, you check it against the list of potential promos and add the customer to the list (or take some other action). Maybe the promo requires more than one purchase, in which case you increment the customer's count (pickle purchases) in their profile and evaluate whether it has crossed the threshold (say, four jars per month).
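Here's a minimal sketch of that per-event logic in Python. The names, the four-jar threshold, and the in-memory dictionaries are all illustrative assumptions; in a real deployment the counts would live in a shared store and the monthly reset would be handled separately.

```python
from collections import defaultdict

PROMO_THRESHOLD = 4  # hypothetical: four jars in a month triggers the promo

# Running per-customer counts and the promo list. In production this state
# would sit in a shared store (HBase, Redis, an RDBMS), not in process memory.
pickle_counts = defaultdict(int)
promo_list = set()

def on_purchase(customer_id, product):
    """Evaluate one purchase event the moment it arrives."""
    if product != "pickles":
        return
    pickle_counts[customer_id] += 1
    if pickle_counts[customer_id] >= PROMO_THRESHOLD and customer_id not in promo_list:
        promo_list.add(customer_id)
        send_promo(customer_id)  # hypothetical downstream action

def send_promo(customer_id):
    print(f"Adding customer {customer_id} to the pickle promo list")

# Example: the fourth jar this month puts the customer on the list.
for _ in range(4):
    on_purchase("c-123", "pickles")
```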
Or consider bigger problems, such as money laundering. Rather than run periodic analytics, run them in real time as you receive the data. Did the ice cream shops' December sales for some reason equal June's, and what are all these $9,999 “repair expenses” every week? Simply evaluate these on a transactional basis and avoid a large batch analytics job. You can also react in real time as they go over a threshold, possibly avoiding larger costs like asset forfeiture once the hammer comes down. In Hadoop terms, this is a fairly simple Kafka-Storm job that writes to HBase and maintains an average.
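The per-transaction logic such a job would embody could look roughly like the sketch below. This is not the Kafka-Storm-HBase topology itself, just the evaluation it would perform on each event; the account keys, the $100 margin under the reporting threshold, and the 5x-average rule are assumptions for illustration.

```python
# Maintain a running average per account and flag anything that looks like
# structuring (amounts parked just under the $10,000 reporting threshold) or
# that spikes well above the account's own running average. In the streaming
# job described above, this state would live in HBase; here it is in memory.

running_stats = {}  # account_id -> (transaction_count, running_total)

REPORTING_THRESHOLD = 10_000
SUSPICIOUS_MARGIN = 100  # hypothetical: within $100 of the threshold

def on_transaction(account_id, amount, memo):
    count, total = running_stats.get(account_id, (0, 0.0))
    count, total = count + 1, total + amount
    running_stats[account_id] = (count, total)
    average = total / count

    alerts = []
    if REPORTING_THRESHOLD - SUSPICIOUS_MARGIN <= amount < REPORTING_THRESHOLD:
        alerts.append(f"{memo!r} of ${amount:,.2f} sits just under the reporting threshold")
    if count > 10 and amount > 5 * average:
        alerts.append(f"${amount:,.2f} is more than 5x this account's running average")
    return alerts
```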
Consider what you would go through to do this on a periodic basis: You would sort through potentially gigabytes of data for each customer, do some horrid joins, and probably still have to iterate at some point. The incremental solution is elegant and transforms what would be a more complex analytical system into something very close to a transactional system.
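For contrast, here's what the periodic version of the promo check amounts to, again as a hedged sketch: rescan the entire purchase history on every run and rebuild the counts from scratch. The flat list of tuples stands in for what would really be joins across large tables.

```python
from collections import defaultdict

def nightly_promo_batch(purchase_history, month, threshold=4):
    """Recompute promo eligibility from the full history on every run."""
    counts = defaultdict(int)
    for customer_id, product, purchase_month in purchase_history:  # full rescan
        if product == "pickles" and purchase_month == month:
            counts[customer_id] += 1
    return {customer for customer, n in counts.items() if n >= threshold}
```

Every batch run pays the full cost of history; the streaming version pays only for the events that have arrived since the last one.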
Little of this is new in terms of methodology. What's new is the ability to do it at scale on commodity hardware. In the money laundering scenario, a small bank could run a job with Oracle and VB code, but what about a larger bank? The volumes demand a large amount of hardware. Even a statewide financial institution might need newer technology that runs on commodity hardware to make this work.
Streaming is an essential step in the evolution of data-driven companies. The first step is data consolidation and visualization. The next step is moving past these visualizations and letting machines kick off business workflows where real decisions are made: We’re out of pickles, we have money, pickle demand is high, the price is low, order more pickles, provide a profit projection, and measure actual performance against it.
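At its simplest, that kind of machine-initiated workflow is just an automated rule chain. The sketch below is a toy, and every threshold, price, and quantity in it is invented; the point is only that once the decision criteria are explicit, a program can trigger the order and record a projection to measure against later.

```python
def place_order(item, qty):
    print(f"Ordering {qty} units of {item}")  # hypothetical purchasing hook

def record_projection(item, profit):
    print(f"Projecting ${profit:,.2f} profit on {item}")  # compared to actuals later

def maybe_reorder_pickles(stock_jars, cash_on_hand, demand_index, unit_cost):
    """Toy decision rule: low stock, cash available, high demand, low price."""
    if stock_jars < 100 and cash_on_hand > 10_000 and demand_index > 0.8 and unit_cost < 2.50:
        order_qty = 1_000
        projected_profit = order_qty * (3.00 - unit_cost)  # assumed $3.00 shelf price
        place_order("pickles", order_qty)
        record_projection("pickles", projected_profit)
        return projected_profit
    return None

maybe_reorder_pickles(stock_jars=40, cash_on_hand=50_000, demand_index=0.9, unit_cost=2.10)
```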
By performing such actions in batches, you mirror the human flow of traditional business practices. That's because humans have trouble taking time out from everything they're doing or doing things in parallel. Computers have less reason to do everything periodically.
The next big step in business evolution is to abandon the periodic, cyclical nature of things and move toward real-time, data-driven decision making. In other words, cultivate live streams and make decisions while events are happening.