The mythical Hadoop skills gap

Oh no! Big data is failing because we can't find enough people who know the technology! Relax, they're out there -- but don't fall for the buzzwords

Image: nikoretro via Flickr

The first rule of the Hadoop skills club: Don’t talk about the Hadoop skills club. That said, plenty of talent is out there. But you're looking for it wrong.

I called it first: We’re already in the trough of disillusionment with big data in general and Hadoop in particular. Now the press is overwrought about a “Hadoop pullback,” though we on the ground have yet to experience it.

The top reason cited is lack of skills and familiarity, which in plain English means companies couldn’t get their projects off the ground because they couldn’t find enough of the right people. If you’re already in the big data business, none of this should worry you, because it means that demand exceeds supply. You can then crack open a bottle of bubbly -- you can afford it.

Inflated IT salaries create distortions in the market. At the moment, it's generally cheaper to poach people from other companies than it is to train and grow your talent. This assumes you can find the right people to poach. But hey, I can help.

There is no skills gap -- the IT recruiting industry is full of idiots

Hadoop is hard, but frankly, the idiotic matching of buzzwords to skills makes no sense. Here's the relevant experience prospective candidates should have even if they don't know Hadoop:

  • HDFS: Any distributed file system, or a reasonable understanding of how RAID works along with Linux shell skills. Frankly, you mostly need to be able to explain over and over to your EMC-loving network and storage team why you can’t put HDFS on the SAN. I’ve included a diagram to help:
    [Diagram: Hadoop HDFS]
  • Hive: Any SQL database experience along with any tool like SQL*Loader.
  • Pig: SQL experience and anything like Perl or maybe even PL/SQL. Pig has a surprising learning curve that isn’t immediately apparent. Working with the weird way it mixes assignments and structure is painful, at least at first. It won’t take months for anyone familiar with those things, but it's the kind of cavernous topic where you think you’ve mastered it, then find out something new.
  • Kafka: Any messaging technology. Kafka is dead simple. If you’ve worked with JMS, AMQ, MSMQ, MQ Series, WS-MQ, and so on, you’ll be simultaneously impressed and disappointed with Kafka’s simplicity.
  • MapReduce: Java and any distributed computing background. It didn't take me much more than reading the Wikipedia page to immediately understand the MapReduce algorithm.
  • Spark: Any functional programming experience, especially Scala or Python -- or any procedural language and calculus. You need some understanding of distributed computing and what a graph is (as in the data structure, not a chart).
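
To show how small the core MapReduce idea is, here is a minimal single-machine sketch in plain Python. It is not Hadoop's API and does nothing distributed; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample input are made up for illustration. It only shows the map, group-by-key, and reduce steps on the classic word-count problem.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold the values for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for lines read from a file.
    sample = ["big data big hype", "big talent"]
    print(word_count(sample))
```

In a real cluster the map and reduce calls would run on many machines and the shuffle would move data across the network, but a developer who can follow this sketch already has the concept.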

That is by no means an exhaustive list, but any reasonably experienced Java/Linux developer with some knowledge of functional programming, SQL, and distributed computing should be able to pick up all of that inside of a month of concentrated effort -- really. I’ve proven myself right before. Maybe if we started cataloging what you need to know rather than playing resume buzzword bingo, there wouldn’t be a technical skills gap for big data.

Big data isn’t really an IT project

The real skills gap for big data is in management -- and this is real. You need to be able to grasp the high-level concepts behind these new technologies, manage an IT rollout of them while driving organizational change, and bring in the right business and mathematical talent to make use of them. That kind of management talent is uncommon.

Until there are pre-packaged business solutions, we’ll be doing a lot of roll-your-own research projects that ask questions like: How should we analyze our sales data? How should we evaluate the risk of our liquid or semi-liquid assets? Deploying Hadoop or Spark or whatever is simple. Figuring out your business, cataloging its data assets, and applying the right algorithms to them is harder. Managing all of those activities at once requires actual talent, but it isn’t technical talent.

That said, my company still has a tough time hiring because the tech industry is full of high-salary expectations -- often without the core skills, work ethic, and background knowledge to match. We aim (on the technical side) for folks with the Java/Linux/SQL background necessary to pick up Hadoop and the rest quickly. Meanwhile, we’re also hiring folks with a mathematical and business background. That can be a tall order, but it isn’t in any way specific to big data or Hadoop.