Know this right now about Hadoop

Here's what you must know this instant about big data darling Hadoop and its expanding universe of add-ons and subprojects

  • Spark/Shark and Storm: HDFS is high latency. What if you want to do grand MapReduce distributed computing, but latency doesn't work for you? Well, shell out the cash for more memory and go with Spark, which integrates on top of Hadoop and HDFS but runs jobs in memory. Spark can also "stream," where you basically have long-running jobs that continuously return results as new data comes in. Shark is Hive on Spark. Storm is very similar as far as capabilities go. Databricks and Cloudera seem to be backing Spark, and Hortonworks seems to be backing Storm but has Spark in a preview. I'd settle for knowing what these things are for now unless you need to stream or work in one of the industries where low latency is a must and not a "would be nice."
  • Oozie: This is basically workflow-based job control for Hadoop. Mainframers are nodding their heads and can read on. Basically any given system has a lot of repetitive tasks, and probably most of your MapReduce jobs are not really "ad hoc" even if you conceive of them that way. That is, businesses cycle, computing cycles, and thus your jobs are repetitive and cyclic. However, jobs often depend on other jobs and events, meaning you're not going to run end-of-month reporting until the end of the month, and based on the results of that job you may need to run another job and so on. Once your organization has rolled out Hadoop and starts running anything regularly, I'd invest in learning Oozie.
  • Ambari: This is more of a tool for the administrators, but as a developer you should know a bit about it. While rolling out Hadoop nodes using the command line is incredibly relaxing and rewarding, Ambari can automate these tasks. Moreover, while logging into each node and looking at its stats is fun for the whole family, Ambari ties it up in a nice dashboard with pretty graphs. Ambari doesn't really support Windows, and I'd say Hadoop overall is still alpha-ish on Windows (which has been great for the consulting business BTW). Hortonworks includes Ambari in its distribution while Cloudera rolled its own with Cloudera Manager.
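
To make the Oozie idea concrete, here is a minimal sketch in plain Python of dependency-ordered job control. It is not Oozie's API or its workflow format; the job names and the run_workflow helper are hypothetical, chosen only to mirror the end-of-month example above. All it shows is the core rule: a job runs only after the jobs it depends on have finished.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical jobs: each job maps to the set of jobs it depends on.
workflow = {
    "load_sales": set(),
    "end_of_month_report": {"load_sales"},
    "email_summary": {"end_of_month_report"},
}


def run_workflow(workflow, run_job):
    """Run every job only after all of its dependencies have run."""
    completed = []
    # static_order() yields each job after all of its predecessors.
    for job in TopologicalSorter(workflow).static_order():
        run_job(job)
        completed.append(job)
    return completed


if __name__ == "__main__":
    run_workflow(workflow, lambda job: print(f"running {job}"))
```

A real scheduler layers time triggers, retries, and branching on job results on top of this ordering, which is the part worth learning Oozie for.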

Ancillary (this you should know about)

Finally, here are the items about which you should at least be able to talk a good game.

  • ZooKeeper: This is a service for distributed configuration, (cluster) group membership, and coordination.
  • Sqoop: This is essentially an ETL (extract, transform, load) tool for sucking data out of your RDBMS or pushing data back into the RDBMS. It's not a difficult tool to use or learn if you need to. It might be good to spend some time on it, because chances are this is where your data will come from for your early Hadoop installation, integration, and implementation (or proof-of-concept projects).
  • Flume: If you've used an ESB, messed with Spring Integration, or even glanced at the Enterprise Integration Patterns book, then you have some passing familiarity with Flume, an integration tool that lets you publish data to channels, aggregate data, multiplex data, filter data, and more. Unless you've rolled out your "data lake" and have a relatively high level of maturity using Hadoop in your organization, I'd work on a passing strategic knowledge of Flume for now, just enough to talk strategy.
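
If "filter" and "multiplex" sound abstract, this small plain-Python sketch shows the pattern. It is not Flume's API or configuration syntax; route_events, the channel names, and the sample events are all assumptions made up for illustration. Events pass a filter, then each surviving event is routed to a named channel based on its contents.

```python
from collections import defaultdict


def route_events(events, keep, channel_for):
    """Drop events that fail `keep`, then route the rest to named channels."""
    channels = defaultdict(list)
    for event in events:
        if keep(event):  # filter step
            channels[channel_for(event)].append(event)  # multiplex step
    return dict(channels)


# Hypothetical log events and routing rules.
events = [
    {"level": "ERROR", "msg": "disk full"},
    {"level": "DEBUG", "msg": "heartbeat"},
    {"level": "INFO", "msg": "job started"},
]

routed = route_events(
    events,
    keep=lambda e: e["level"] != "DEBUG",
    channel_for=lambda e: "alerts" if e["level"] == "ERROR" else "archive",
)

if __name__ == "__main__":
    for name, items in routed.items():
        print(name, [e["msg"] for e in items])
```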

By my informal study -- and plenty more formal ones by analysts -- Hadoop is the hottest growth area of the industry. While you may have a (misguided) passion for $RANDOMCOMPUTERLANGUAGE or ${api.of.theday}, if you want to be the kind of highly valued developer that every organization is trying to poach and no one wants to lose, this is the stuff you should know right about now. The curve will rise as time goes on.

In the comments I expect "the author is clearly an idiot because he left out $FAVORITEPETPROJECT" or some random detail I omitted or oversimplified. All that means is that this is a fast-growing ecosystem and the secret to winning the gold rush is to get there and buy the land first.

This article, "Know this right now about Hadoop," was originally published at Keep up on the latest news in application development and read more of Andrew Oliver's Strategic Developer blog at For the latest business technology news, follow on Twitter.
