The 7 most common Hadoop and Spark projects

Think you're breaking new ground with your Hadoop project? Odds are it fits neatly into one of these seven common types of projects.

There's an old axiom that goes something like this: If you offer someone your full support and financial backing to do something different and innovative, they’ll end up doing what everyone else is doing.

So it goes with Hadoop, Spark, and Storm. Everyone thinks they're doing something special with these new big data technologies, but it doesn't take long to encounter the same patterns over and over. Specific implementations may differ somewhat, but based on my experience, here are the seven most common projects.

Project No. 1: Data consolidation

Call it an "enterprise data hub" or "data lake." The idea is you have disparate data sources, and you want to perform analysis across them. This type of project consists of getting feeds from all the sources (either real time or as a batch) and shoving them into Hadoop. Sometimes this is step one to becoming a “data-driven company”; sometimes you simply want pretty reports. Data lakes usually materialize as files on HDFS and tables in Hive or Impala. There's a bold, new world where much of this shows up in HBase -- and Phoenix, in the future, because Hive is slow.

Salespeople like to say things like “schema on read,” but in truth, to be successful, you must have a good idea of what your use cases will be (that Hive schema won’t look terribly different from what you’d do in an enterprise data warehouse). The real reason for a data lake is horizontal scalability and much lower cost than Teradata or Netezza. For "analysis," many people set up Tableau and Excel on the front end. More sophisticated companies with “real data scientists” (math geeks who write bad Python) use Zeppelin or IPython Notebook as a front end.
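To make the "schema on read" point concrete, here is a minimal plain-Python sketch (not Hive, and not any vendor's API). The feed contents, the column names, and the `read_with_schema` helper are all hypothetical. It shows that the raw data can land untyped, but the reader still has to know which columns and types the use case needs.

```python
import csv
import io
from datetime import date

# Hypothetical raw feed as it might land in the lake: plain text,
# with no schema enforced when it was written.
RAW_FEED = """order_id,customer,amount,order_date
1001,acme,250.00,2015-03-01
1002,globex,not_a_number,2015-03-02
1003,acme,75.50,2015-03-02
"""

# The schema lives with the reader: column name -> parser.
# These columns are assumptions made for this illustration.
ORDER_SCHEMA = {
    "order_id": int,
    "customer": str,
    "amount": float,
    "order_date": date.fromisoformat,
}


def read_with_schema(raw_text, schema):
    """Apply the schema while reading; return (typed_rows, rejected_rows)."""
    typed, rejected = [], []
    for row in csv.DictReader(io.StringIO(raw_text)):
        try:
            typed.append({col: parse(row[col]) for col, parse in schema.items()})
        except (KeyError, TypeError, ValueError):
            # Bad data only surfaces at read time, not at load time.
            rejected.append(row)
    return typed, rejected


orders, bad_rows = read_with_schema(RAW_FEED, ORDER_SCHEMA)

# A simple report: total order amount per customer.
total_by_customer = {}
for order in orders:
    name = order["customer"]
    total_by_customer[name] = total_by_customer.get(name, 0.0) + order["amount"]
```

Note that the malformed row is only discovered when someone reads it, which is one reason the schema ends up looking a lot like what you would have designed up front anyway.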

Project No. 2: Specialized analysis

Many data consolidation projects actually begin here, where you have a special need and pull in one data set for a system that does one kind of analysis. These tend to be incredibly domain-specific, such as liquidity risk/Monte Carlo simulations at a bank. In the past, such specialized analyses depended on antiquated, proprietary packages that couldn't scale up as the data did and frequently suffered from a limited feature set (partly because the software vendor couldn't possibly know as much about the domain as the institution immersed in it).

In the Hadoop and Spark worlds, these systems look roughly the same as data consolidation systems but often have more HBase, custom non-SQL code, and fewer data sources (if not only one). Increasingly, they're Spark-based.
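As a rough illustration of why this kind of work moves to Spark, here is a toy Monte Carlo sketch in plain Python. It is not a real liquidity-risk model: the function name, the normally distributed daily cash flow, and every parameter are assumptions made up for the example. The point is only that each trial is independent of the others, so the outer loop is the part a cluster can spread across nodes.

```python
import random


def simulate_shortfall_probability(start_cash, mean_flow, flow_stddev,
                                   days, trials, seed=0):
    """Estimate the chance that cash drops below zero within `days`.

    Toy assumption: each day's net cash flow is an independent draw
    from a normal distribution. Real models are far more involved.
    """
    rng = random.Random(seed)  # seeded so repeated runs match
    shortfalls = 0
    for _ in range(trials):          # each trial is independent
        cash = start_cash
        for _ in range(days):
            cash += rng.gauss(mean_flow, flow_stddev)
            if cash < 0:
                shortfalls += 1
                break                # this trial already failed
    return shortfalls / trials


# Hypothetical inputs, chosen only to exercise the function.
estimate = simulate_shortfall_probability(
    start_cash=1_000.0, mean_flow=5.0, flow_stddev=200.0,
    days=30, trials=2_000)
```

Because no trial depends on another, the same function could be mapped over partitions of trial seeds, which is roughly what a Spark version of such a job would do.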

Project No. 3: Hadoop as a service

Any large organization with “specialized analysis” projects (and ironically one or two “data consolidation” projects) will inevitably start feeling the “joy” (that is, pain) of managing a few differently configured Hadoop clusters, sometimes from different vendors. Next they’ll say, “Maybe we should consolidate this and pool resources,” rather than have half of their nodes sit idle half the time. They could go to the cloud, but many companies either can’t or won’t, often for security (read: internal politics and job protection) reasons. This generally means a lot of Chef recipes and now Docker container packages.

I haven’t used it yet, but BlueData appears to have the closest thing to an out-of-the-box solution here, which will also appeal to smaller organizations lacking the wherewithal to deploy Hadoop as a service.

Project No. 4: Streaming analytics

Many people would call this "streaming," but streaming analytics is rather different from streaming from devices. Often, streaming analytics is a more real-time version of what an organization did in batches. Take anti-money laundering or fraud detection: Why not do that on a per-transaction basis and catch it as it happens rather than at the end of a cycle? The same goes for inventory management or anything else.

In some cases this is a new type of transactional system that analyzes data bit by bit as you shunt it into an analytical system in parallel. Such systems manifest themselves as Spark or Storm with HBase as the usual data store. Note that streaming analytics do not replace all forms of analytics; you’ll still want to surface historical trends or look at past data for something that you never considered.
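The difference between checking at the end of a cycle and checking per transaction can be sketched in a few lines of plain Python. This is not Spark or Storm code; the `RollingTotalMonitor` class, the one-hour window, and the threshold are hypothetical, and the sketch assumes transactions for an account arrive in timestamp order.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 3600      # hypothetical look-back window
AMOUNT_THRESHOLD = 10_000  # hypothetical limit per account per window


class RollingTotalMonitor:
    """Keeps a rolling total per account and flags it as each event arrives."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._recent = defaultdict(deque)   # account -> (timestamp, amount)
        self._totals = defaultdict(float)   # account -> total inside window

    def observe(self, account, timestamp, amount):
        """Record one transaction; return True if the account is over the limit."""
        window = self._recent[account]
        window.append((timestamp, amount))
        self._totals[account] += amount
        # Drop transactions that have aged out of the window.
        while window and window[0][0] <= timestamp - self.window_seconds:
            _, old_amount = window.popleft()
            self._totals[account] -= old_amount
        return self._totals[account] > self.threshold


monitor = RollingTotalMonitor(WINDOW_SECONDS, AMOUNT_THRESHOLD)

# (account, timestamp in seconds, amount); made-up sample data.
transactions = [
    ("acct-1", 0, 4000),
    ("acct-2", 10, 500),
    ("acct-1", 1200, 4000),
    ("acct-1", 2400, 4000),   # third large transfer inside one hour
    ("acct-1", 7200, 1000),   # earlier transfers have aged out by now
]

alerts = [(acct, ts) for acct, ts, amt in transactions
          if monitor.observe(acct, ts, amt)]
```

A batch job would find the same pattern hours later; here the alert fires on the transaction that crosses the line. The per-account state held in memory is the part that typically ends up in a store such as HBase.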

Project No. 5: Complex event processing

Here we're talking about real-time event processing, where subseconds matter. While still not fast enough for ultra-low-latency (picosecond or nanosecond) applications, such as high-end trading systems, you can expect millisecond response times. Examples include real-time rating of call data records for telcos or processing of Internet of things events. Sometimes, you'll see such systems use Spark and HBase -- but generally they fall on their faces and have to be converted to Storm, which is based on the Disruptor pattern developed by the LMAX exchange.

In the past, such systems have been based on customized messaging software -- or high-performance, off-the-shelf, client-server messaging products -- but today's data volumes are too much for either. Trading volumes and the number of people with cellphones have shot up since those legacy systems were created, and medical and industrial sensors pump out too many bits. I haven’t used it yet, but the Apex project looks promising and claims to be faster than Storm.

Project No. 6: Streaming as ETL

Sometimes you want to capture streaming data and warehouse it somewhere. These projects usually coincide with No. 1 or No. 2, but add their own scope and characteristics. (Some people think they're doing No. 4 or No. 5, but they’re actually dumping to disk and analyzing the data later.) These are almost always Kafka and Storm projects. Spark is also used, but without justification, since you don’t really need in-memory analytics.
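Stripped of the cluster, the pattern is "buffer events, then land them as files for later analysis." A minimal sketch in plain Python follows. It uses no Kafka or Storm APIs; the `MicroBatchSink` class, the file naming, and the batch size are hypothetical, and a local temporary directory stands in for HDFS.

```python
import json
import os
import tempfile


class MicroBatchSink:
    """Buffers events and writes each full batch as one JSON-lines file."""

    def __init__(self, out_dir, batch_size):
        self.out_dir = out_dir
        self.batch_size = batch_size
        self._buffer = []
        self._batches_written = 0

    def write(self, event):
        """Accept one event; flush automatically when the batch is full."""
        self._buffer.append(event)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write buffered events to a new file; do nothing if the buffer is empty."""
        if not self._buffer:
            return
        name = f"part-{self._batches_written:05d}.jsonl"
        path = os.path.join(self.out_dir, name)
        with open(path, "w", encoding="utf-8") as handle:
            for event in self._buffer:
                handle.write(json.dumps(event) + "\n")
        self._buffer.clear()
        self._batches_written += 1


# A local temp directory stands in for the landing zone on HDFS.
out_dir = tempfile.mkdtemp(prefix="landing-")
sink = MicroBatchSink(out_dir, batch_size=3)

for i in range(7):  # made-up sensor readings
    sink.write({"sensor": "s1", "reading": i})
sink.flush()        # land the final partial batch

written_files = sorted(os.listdir(out_dir))
```

Nothing here analyzes the data as it flows, which is the tell that a project belongs in this category rather than in No. 4 or No. 5.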

Project No. 7: Replacing or augmenting SAS

SAS is fine; SAS is nice. SAS is also expensive and we’re not buying boxes for all you data scientists and analysts so that you can “play” with the data. Besides, you wanted to do something different than SAS could do or generate a prettier graph. Here is your nice data lake. Here is IPython Notebook (now) or Zeppelin (later). We’ll feed the results into SAS and store results from SAS here.

While I’ve seen other Hadoop, Spark, or Storm projects, these are the “normal,” everyday types. If you’re using Hadoop, you probably recognize them. I implemented some of the use cases for these systems years ago, working with other technologies.

If you’re an old-timer too scared of the “big” in big data or the “do” in Hadoop, don’t be. The more things change, the more they stay the same. You’ll find plenty of parallels between the stuff you used to deploy and the hipster technologies swirling around the Hadooposphere.

Copyright © 2015 IDG Communications, Inc.
