Know this right now about Hadoop

From core elements like HDFS and YARN to ancillary tools like Zookeeper, Flume, and Sqoop, here's your cheat sheet and cartography of the ever expanding Hadoop ecosystem.

I write a lot about Hadoop, for the obvious reason that it's the biggest thing going on right now. Last year everyone was talking about it -- and this year everyone is rolling it out.

Some are still in the installation and deployment phase. Some are in the early implementation phases. Some have already used it to tackle real and previously intractable business problems. But trust me, Hadoop is the hottest ticket in town.

Hadoop and the techniques surrounding it are a bit complicated. The complexity doesn't come from mathematics or even computing theories -- much of it is the nature of the beast. There are so many technologies and components to choose from, often with overlapping capabilities, that it's hard to decide what to use to solve which problems.

If you try and learn the entire Hadoop ecosystem at once, you'll find yourself on a fool's errand. Instead, focus on the foundation, then develop a passing familiarity with the solutions for common problems -- and get a sense of what you need.

Core/foundational elements (this you must know)

Hadoop's underpinning is HDFS, a big distributed file system. Think of it as RAID plus CIFS or NFS over the Internet: Instead of striping to disks, it stripes over multiple servers. It offers redundant, reliable, cheap storage. While Hadoop may be most famous for implementing the MapReduce algorithm, HDFS is really the Hadoop ecosystem's foundation. Yet it is a replaceable base, because there are alternative distributed file systems available.

YARN (Yet Another Resource Negotiator), which sits on top of HDFS, is a bit hard to describe because like Hadoop it isn't all one thing. At its base YARN is a cluster manager for Hadoop. Also, MapReduce is built on top of it. Yet part of the motivation for YARN is that the Hadoop ecosystem expands beyond algorithms and technologies based on MapReduce. YARN is based on these concepts:

  • MapReduce: You've heard of this. In the case of Hadoop it is both an algorithm and an API. Abstractly, the algorithm depends on the distribution of data (HDFS) and processing (YARN) and works like this. MapReduce is also a Java API provided by Hadoop that allows you to create jobs to be processed using the MapReduce algorithm.

    Here's an example of how the MapReduce algorithm works: A client named Jim calls and he's in construction. You don't remember who he is, but he knows you. Let's say it's 1955, and instead of a computer you have a giant Rolodex (notecards on a spindle) of all of your contacts organized by last name. Jim is going to get wise and realize you don't know who he is before you find a notecard "James Smith, Contractor, Steel Building Construction." If, instead, you have 10 duplicate Rolodexes and possibly half of them have A-M and the other have O-Z and there are at least 11 people in the office, you could ask your office manager (writing frantically on a pad) to find all cards with "James, Jim, or Jimmy that have 'construction' or 'contractor.'"

    Your office manager could ask 10 people in the office to start grabbing cards in their rolodex. The Office manager might also divvy up the job and ask one person to only bother with A-B, and so on. The office manager has "mapped" the job to the nodes. Now the nodes find the cards (the result) and send them to the office manager who combines the cards onto one sheet of paper and "reduces" them into the final answer (in our case possibly throwing out duplicates or providing a count). With any luck we get one card, and worst case you have a new request based on what you "cold read" from the client on the phone.
  • (Global) Resource Manager: Negotiates resources among the various applications requesting them. Its underlying scheduler is pluggable and can prioritize different guarantees such as SLA, fairness, and capacity constraints.
  • Node Managers: Node Managers are per-machine slaves that monitor and report resource utilization (CPU, memory, disk, network) back to the Resource Manager.
  • Application Masters: Each application making requests from your Hadoop cluster generally has one Application Master. In some cases you may group a few applications together and use one Master for all of them. Some tools like Pig use a single Application Master across multiple applications. The Application Master(s) compose Resource Requests to the Resource Manager and allocates (resource) Containers in collaboration with the NodeManagers. These Containers, distributed across the nodes and managed by the NodeManagers, are the actual resources allocated for the application requests.
  • Pig: Technically Pig is a tool and Pig Latin is the language but nearly everyone will use Pig to refer to the language asallingca ita igpa atinla isa illysa. While it is possible to execute jobs on Hadoop using SQL, you'll find that SQL is a relatively limited language for potentially unstructured or differently structured data. Pig feels to me like PERL, SQL and regular expressions had a love child. It isn't superhard to learn or use, and it lets you create MapReduce jobs in far fewer lines of code than using the MapReduce API. There are lots of things you can MapReduce with Pig that you can't do with SQL.
  • Hive and Impala: Hive is essentially a framework for data warehousing on top of Hadoop. Moreover, if you love your SQL and really want to do SQL on Hadoop, Hive is for you. Impala is another implementation of the same general idea and is also open source but not hosted at Apache. Why can't we all get along? Well, Hive is more or less backed by Hortonworks and Impala by Cloudera. Cloudera says Impala is faster and most third-party benchmarks seem to agree (but not the 100x Cloudera claims).  Cloudera doesn't claim that Impala is a complete replacement for Hive (and ships it as well), but that it is superior for some use cases.

Wait, there's more! EMC didn't want to be without its own answer to this, so it has Greenplum/HAWQ from its new Pivotal division. Those are most decidedly not open source. Not to be outdone, Hortonworks and others are backing Tez, which claims it will offer a thousand-fold improvement.

You should probably know at least what Hive is and some basics of how to use it. That's somewhat transferable to knowledge of Impala if you end up working for someone that uses Cloudera's distribution. I'd have my eye on Tez and not really bother learning the others unless you work somewhere that decides the vendor lock-in is really worth ditching the existing expensive proprietary data warehouse infrastructure for a new expensive proprietary data warehouse infrastructure in Pivotal.

Ecosystem (this you should know)

You're not going to be able to claim you "know Hadoop" if you're ignorant of the ecosystem around it. So familiarize yourself with the following:

  • HBase/Cassandra: Both are column family databasesbuilt on various parts of Hadoop. They are in many ways very similar, but their differences are substantial. While they compete in some areas, there are key areas ofdifferentiation. If running MapReduce jobs against your column family store is a big deal to you, then you should probably go with HBase. If you're doing time-series data -- more of an operational store -- and you need nice, pretty management tools, dashboards, and so on, then you may find Cassandra is your best buddy even if she's a bit cursed.

    Datastax and Rackspace seem to be the big backers of Cassandra. HBase seems to be supported by the major Hadoop vendors and, oddly, Microsoft. I'd learn either as the key knobs are the same, then transfer the skills. Cassandra will probably seem like a less steep learning curve if you're starting from scratch, but HBase will be more familiar if you're already heavily into Hadoop. To complicate matters, if you think you'll need more database support for contextual security you may also want to look at Accumulo.
  • Spark/Shark and Storm: HDFS is high latency. What if you want to do grand MapReduce distributed computing, but latency doesn't work for you? Well, shell out the cash for more memory and go with Spark, which integrates on top of Hadoop and HDFS but runs jobs in memory. Spark can also "stream," where you basically have long-running jobs that continuously return results as new data comes in. Shark is Hive on Spark. Storm is very similar as far as capabilities. Databricks and Cloudera seem to be backing Spark, and Hortonworks seems to backing Storm but hasSpark in a preview. I'd settle for knowing what these thing are for now unless you need to stream or work in one of the industries where low latency is a must and not a "would be nice."
  • Oozie: This is basically workflow-based job control for Hadoop. Mainframers are nodding their heads and can read on. Basically any given system has a lot of repetitive tasks, and probably most of your MapReduce jobs are not really "ad hoc" even if you conceive them of that way. That is, businesses cycle, computing cycles, and thus your jobs are repetitive and cyclic. However, jobs often depend on other jobs and events, meaning you're not going to run end-of-month reporting until the end of the month, and based on the results of that job you may need to run another job and so on. Once your organization has rolled out Hadoop and starts running anything regularly, I'd invest in learning Oozie.
  • Ambari: This is more of a tool for the Administrators, but as a developer you should know a bit about it. While rolling out Hadoop nodes using the command line is incredibly relaxing and rewarding, Ambari can automate these tasks. Moreover, while logging into each node and looking at its stats is fun for the whole family, Ambari ties it up in a nice dashboard with pretty graphs. Ambari doesn't really support Windows and I'd say Hadoop overall is still alpha-ish on Windows (which has been great for the consulting business BTW). Hortonworks includes Ambari in its distribution while Cloudera rolled its own with Cloudera Manager.

Ancillary (this you should know about)

Finally, here are the items about which you should at least be able to talk a good game.

  • Zookeeper: This is for configuration and (cluster) group membership and coordination.
  • Sqoop: This is essentially an ETL (extract, transform, load) for sucking data out of your RDBMS or pushing data back into the RDBMS. It's not a difficult tool to use or learn if you need to. It might be good to spend some time on it, because chances are this is where your data will come from for your early Hadoop installation, integration, and implementation (or proof-of-concept projects).
  • Flume: If you've used an ESB, messed with Spring Integration, or even glanced at the Enterprise Integration Patterns book, then you have some passing familiarity with Flume, an integration tool that lets you publish data to channels, aggregate data, multiplex data, filter data, and more. Unless you've rolled out your "data lake" and have a relatively high level of maturity using Hadoop in your organization, I'd work on a passing strategic knowledge of Flume for now, just enough to talk strategy.

By my informal study -- and plenty more formal by analysts -- Hadoop is the hottest growth area of the industry. While you may have a (misguided) passion for $RANDOMCOMPUTERLANGUAGE or ${api.of.theday}, if you want to be the kind of highly valued developer that every organization is trying to poach and no one wants to lose, this is the stuff you should know right about now. The curve will rise as time goes on.

In the comments I expect "the author is clearly an idiot because he left out $FAVORITEPETPROJECT" or some random detail I omitted or oversimplified. All that means is that this is a fast-growing ecosystem and the secret to winning the gold rush is get there and buy the land first.

Copyright © 2014 IDG Communications, Inc.