The siren song of Hadoop

Hadoop provides power and versatility for data scientists, at the cost of complexity.


Hadoop seems incredibly well-suited to shouldering machine-learning workloads. With HDFS you can store both structured and unstructured data across a cluster of machines, and SQL-on-Hadoop technologies like Hive make those structured data look like database tables. Execution frameworks like Spark let you distribute compute across the cluster as well. On paper, Hadoop is the perfect environment for running compute-intensive distributed machine learning algorithms across a vast amount of data.

Unfortunately, though, Hadoop seems incredibly well-suited for a lot of other things too. Streaming data? Storm and Flink! Security? Kerberos, Sentry, Ranger, and Knox! Data movement and message queues? Flume, Sqoop, and Kafka! SQL? Hive, Impala, and HAWQ! The Hadoop ecosystem has become a bag of often overlapping and competing technologies. The Cloudera vs. Hortonworks vs. MapR rivalry is responsible for some of this, as is the dynamism of the open source community.

As a technology enthusiast, I find this quite exciting. From an implementation perspective, it's a nightmare.

Innovation or insanity?

I’ve seen the pain play out with several organizations. MapReduce code is obsoleted as the organization moves towards Spark. IT is using Hive to control data access, but you can’t easily run Spark jobs against Hive tables. Kerberos makes everything confusing and difficult. To quote the most popular technical guide to Kerberos: “Just as the infamous Necronomicon is a collection of notes scrawled in blood as a warning to others, this book is: (1) Incomplete. (2) Based on experience and superstition, rather than understanding and insight. (3) Contains information that will drive the reader insane.”

At this point the cloud starts to look pretty good. Why suffer all these infrastructure headaches when Amazon has already figured everything out for you with its Hadoop-as-a-service offering, Elastic MapReduce (EMR)? Well, you're about to get caught between EMR's cost structure and Hadoop's history as a platform for aggregating vast amounts of consumer-grade storage and compute hardware. HDFS assumes you have access to lots of cheap but fallible disks and helpfully replicates your data 3x by default. EMR then helpfully charges you for all that storage. The entire Hadoop ecosystem has been architected without storage thrift in mind, yet storage costs will drive the majority of your cloud bill.
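The replication math is worth making concrete. A rough sketch of the effect, using a hypothetical per-TB price (not an actual EMR or EBS rate):

```python
# Back-of-the-envelope illustration of HDFS replication cost.
# HDFS replicates each block 3x by default, so every "logical" terabyte
# consumes three terabytes of billable raw disk on the cluster.
# The $25/TB-month figure below is invented for illustration only.

def raw_storage_tb(logical_tb, replication_factor=3):
    """Raw disk capacity consumed after HDFS replication."""
    return logical_tb * replication_factor

def monthly_storage_cost(logical_tb, price_per_tb_month, replication_factor=3):
    """Monthly bill for storing `logical_tb` of data at a flat per-TB price."""
    return raw_storage_tb(logical_tb, replication_factor) * price_per_tb_month

# 100 TB of data at a hypothetical $25 per TB-month:
print(monthly_storage_cost(100, 25))     # with default 3x replication -> 7500
print(monthly_storage_cost(100, 25, 1))  # without replication -> 2500
```

The point isn't the exact dollar figure; it's that the multiplier is baked into HDFS's design and shows up on your bill before you've run a single job.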

So as someone who’s trying to get real data science work done with Hadoop, you’re fighting several battles — a cacophony of conflicting technologies, multiple ways to accomplish the same goal, natural disconnects between IT and users, a gnarly cost structure in the cloud, and a constantly shifting technology landscape that obsoletes past work.

The solution is to put at least one layer of abstraction between your data science users and the raw Hadoop layer. With platforms that provide this abstraction, data scientists define the work they'd like to do (say, a null-value replacement operation followed by building a logistic regression model), while the platform itself chooses the right set of Hadoop technologies to accomplish those goals (say, picking among the MapReduce, Pig, and Spark execution frameworks). If a new execution framework enters the Hadoop ecosystem, you need only update the data science platform, not thousands of lines of code.
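The shape of that abstraction can be sketched in a few lines. All names here are invented for illustration; real platforms do far more, but the principle is the same: pipelines are declared in terms of operations, and a planner maps each operation onto whichever execution framework is currently preferred.

```python
# Minimal sketch of the abstraction-layer idea (all names are hypothetical).
# Users declare *what* they want done; the planner decides *how*.

# One table to update when the ecosystem shifts, e.g. MapReduce -> Spark:
PREFERRED_BACKEND = {
    "fill_nulls": "spark",
    "logistic_regression": "spark",
}

def build_plan(steps):
    """Resolve each declarative step to a (step, backend) pair,
    falling back to MapReduce for steps with no preferred engine."""
    return [(step, PREFERRED_BACKEND.get(step, "mapreduce")) for step in steps]

# The user's pipeline never mentions MapReduce, Pig, or Spark directly:
plan = build_plan(["fill_nulls", "logistic_regression"])
print(plan)  # [('fill_nulls', 'spark'), ('logistic_regression', 'spark')]
```

When the next engine arrives, only the `PREFERRED_BACKEND` mapping changes; every pipeline written against the declarative layer keeps working as-is.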

Despite the chaos, Hadoop has tremendous potential to tackle modern machine learning workflows. Just don’t let it drive you insane in the process.

This article is published as part of the IDG Contributor Network.