The future of big data is very, very fast

With in-memory solutions such as Spark and NoSQL databases, systems will optimize themselves so quickly that human intervention may become superfluous

There are only two certainties in big data today: It won't look like yesterday's data infrastructure, and it'll be very, very fast.

This latter trend is evident in the rise of Apache Spark and real-time analytics engines, but it's also clear from the parallel rise of real-time transactional databases (NoSQL). The former is all about lightning-fast data processing, while the latter takes care of equally fast data storage and updates.

The two together combine to "tackle workloads hitherto impossible," as Aerospike vice president Peter Goldmacher told me in an interview.

The machines take over BI

This need for speed is increasingly evident in a new breed of BI. While we normally think of BI as analysis of data by data analysts, DataStax CEO Billy Bosworth said in an interview that, increasingly, machines will take over data analytics.

"'Machine BI'," he says, "is intelligence that has to take place at the processing speed of a machine in order to make a transactional app smarter from transaction to transaction. Human intervention is not possible in this model, and therefore, not a design objective."

In such a world -- say, an online travel application -- the machine must take clickstream data in real time and translate it into relevant offers, layout, and more. There's simply no time for a human to probe the mysteries of user behavior.

As Goldmacher spins it, "IT must capture enormous data sets in order to populate Hadoop and Spark, and the capture mechanism is almost always some sort of low-cost NoSQL environment."

Hence, a real-time NoSQL database like Cassandra, MongoDB, or Aerospike responds to clicks immediately, then pushes that clickstream data into a tool like Hadoop to perform deeper analysis, which then pushes that understanding back to the NoSQL database to be acted upon.

This model keeps getting faster now that companies are swapping out Hadoop's venerable (and batch-oriented) MapReduce for real-time Spark. Indeed, the connection between a real-time analytics system and a real-time transactional system keeps getting tighter.

In this way, Bosworth suggests, "The 'learning' that occurs is put into a fast feedback loop at machine speeds to make each transaction more informed or contextual when appropriate."
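The feedback loop described above can be sketched in miniature. This is a toy illustration only, with all names invented: a fast store stands in for the NoSQL database capturing clicks, an analytics pass stands in for the Hadoop/Spark job, and the derived model is pushed back so the next transaction can act on it at machine speed.

```python
from collections import Counter, deque

class FastStore:
    """Stands in for a real-time NoSQL store (e.g., Cassandra)."""
    def __init__(self):
        self.events = deque()  # raw clickstream, captured at write speed
        self.model = {}        # analytics results pushed back for serving

    def record_click(self, user, category):
        self.events.append((user, category))

def analyze(store):
    """Stands in for the Spark/Hadoop pass: aggregate raw events
    into per-user preferences, then push the result back."""
    prefs = {}
    for user, category in store.events:
        prefs.setdefault(user, Counter())[category] += 1
    store.model = {u: c.most_common(1)[0][0] for u, c in prefs.items()}

def recommend(store, user, default="popular"):
    """The transactional app reads the pushed-back model at machine speed."""
    return store.model.get(user, default)

store = FastStore()
store.record_click("alice", "hotels")
store.record_click("alice", "hotels")
store.record_click("alice", "flights")
analyze(store)  # the tighter the loop, the fresher this model
print(recommend(store, "alice"))  # -> hotels
```

In a real deployment the `analyze` step would be a Spark job running continuously rather than a function call, which is exactly the tightening of the loop Bosworth describes.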

How fast is fast? Speaking of increasingly sophisticated graph databases like Neo4j, Neo Technology Founder and CEO Emil Eifrem suggested to me, "When you have a highly connected data set -- for example in a fraud detection system or a recommendation engine or an identity management application -- then a graph database can easily be a million times faster than a relational database."

When I pushed back on his "million times faster" claim, Eifrem responded, "It's basically 1,000X performance improvements despite a 1,000X increase in data size. A graph structure not only speeds up traversals, but also ensures constant performance regardless of the database size."
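Eifrem's constant-performance claim rests on what graph vendors call index-free adjacency: each node holds direct references to its neighbors, so a hop costs time proportional to the node's degree, not to the size of the database. The contrast can be illustrated with a toy example (this is a sketch of the idea, not of Neo4j's actual storage engine):

```python
# Edge list, as a relational "edges" table might hold it.
edges = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "c")]

# Relational-style neighbor lookup: scan (or index into) the whole
# edge table on every hop -- cost grows with the table.
def neighbors_scan(table, node):
    return [dst for src, dst in table if src == node]  # O(|edges|) per hop

# Graph-style lookup: each node carries its adjacency list directly,
# so a hop costs O(degree) regardless of total database size.
adjacency = {}
for src, dst in edges:
    adjacency.setdefault(src, []).append(dst)

def neighbors_graph(adj, node):
    return adj.get(node, [])

print(neighbors_scan(edges, "a"))      # -> ['b', 'd']
print(neighbors_graph(adjacency, "a")) # -> ['b', 'd']
```

Both lookups return the same answer; the difference is that the scan's cost grows as the data set grows, while the adjacency lookup's does not -- which is the shape of the "1,000X despite 1,000X" claim.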

This isn't about a better mousetrap, in other words. It's about creating a completely new type of application.

Battling fraud with speed

As one example, let's consider the electronic payments world. Here, correctly identifying fraud has the obvious benefit of generating more revenue for the company, but it has the less obvious benefit of customer satisfaction, which generates revenue for the company in the form of reduced churn. So how do electronic payment providers write better fraud detection algorithms?  

The primary opportunity is the ability to write a more sophisticated rules engine, one that can leverage much larger data sets without slowing the time it takes to return a result.

Online payment vendors have written complicated execution-oriented applications that take data from an analytics product, often Hadoop and/or Spark, and marry it to data coming from a live event, in this case a payment transaction. The vendor's app is a sophisticated, rules-based decision engine that decides whether or not to approve a transaction based on that customer's past purchases (their profile in Hadoop) and what kind of purchase they are making right now (online payment).

Neither of the data sets originates in a NoSQL database like Aerospike, but both reside in that database to support the application in real time.

Historical fraud algorithms are relatively simple because the database that supports the application is slow, and the more data you put in the database, the slower it goes. As such, legacy fraud algorithms leverage small data sets and ask relatively simple questions like: Is the customer current on her payments? Is she over her credit limit? Is the vendor domestic? And so on.

A modern fraud app asks these questions and more complicated ones based on a much larger data set. Simply put, a faster database enables a more sophisticated rules engine that delivers more accurate results and better profits.
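The shape of such a decision engine can be sketched as follows. Everything here is hypothetical -- the field names, thresholds, and rules are invented for illustration -- but the structure matches the article's description: per-customer aggregates precomputed in Hadoop/Spark sit in a fast store, and the live transaction is evaluated against both the simple legacy checks and the richer ones that only a fast lookup makes practical.

```python
# Stands in for per-customer profiles computed in Hadoop/Spark and
# cached in a fast NoSQL store for real-time reads.
profiles = {
    "cust-42": {
        "current_on_payments": True,
        "credit_limit": 5000.0,
        "balance": 1200.0,
        "avg_purchase": 80.0,
        "usual_countries": {"US", "CA"},
    },
}

def approve(transaction, profiles):
    p = profiles.get(transaction["customer"])
    if p is None:
        return False  # no profile: decline unknown customers
    # Legacy-style checks: small data, simple questions.
    if not p["current_on_payments"]:
        return False
    if p["balance"] + transaction["amount"] > p["credit_limit"]:
        return False
    # Modern checks: only practical when profile lookups are fast.
    if transaction["country"] not in p["usual_countries"]:
        return False
    if transaction["amount"] > 10 * p["avg_purchase"]:
        return False
    return True

print(approve({"customer": "cust-42", "amount": 150.0, "country": "US"}, profiles))  # True
print(approve({"customer": "cust-42", "amount": 150.0, "country": "BR"}, profiles))  # False
```

A real engine would run hundreds of such rules, or a trained model, but the economics are the same: each additional check is only affordable if the data behind it can be fetched within the transaction's latency budget.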

Up, up, and away!

As new tools emerge, such as the Spark Cassandra Connector, which allows an enterprise to turn to Spark to analyze data stored in Cassandra, companies will increasingly be able to glean BI from mountains of data at speed.

They'll also build new applications they simply couldn't build before.

Those applications, InfoWorld's Andy Oliver tells me, increasingly run in the cloud. At a recent NoSQL conference, he talked to a range of enterprises looking to embrace NoSQL. "None were really anxious to try and scale their RDBMS in the cloud," he stressed.

Why? Because "If you want to distribute processing, data, and achieve high availability and disaster recovery, there are many affordable and scalable NoSQL options. If you want to do that with an RDBMS, it involves a series of rather horrible trade-offs and a massive wad of cash."

Most companies, anxious to operate faster, will eschew those trade-offs and embrace the speed of Spark and NoSQL, increasingly together.