Are you doing an ETL job or batch process in Hadoop? If so, you have steps in the load to worry about. Some of those steps can run in parallel, whereas others depend on earlier steps and shouldn’t start until those steps complete.
The company I work for is a big Kettle user. Kettle, also known as Pentaho Data Integration (PDI), is an open source tool from Pentaho. Our main use for it is to push data from A to B, transform it at step C, and then run steps D and E in parallel, but not before step C completes. In other words, we use Kettle/PDI for orchestration.
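To make that dependency concrete, here is a minimal sketch in plain Python of "C first, then D and E in parallel." It is not how Kettle/PDI implements jobs; the step names and the `run_step` and `run_pipeline` helpers are hypothetical stand-ins, and a real step would call out to Hadoop rather than append to a list.

```python
from concurrent.futures import ThreadPoolExecutor


def run_step(name, log):
    # Hypothetical stand-in for real work (a transformation, a load, ...).
    log.append(name)
    return name


def run_pipeline():
    log = []
    # Step C must finish before anything downstream starts.
    run_step("C", log)
    # D and E depend only on C, so they can run side by side.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(run_step, name, log) for name in ("D", "E")]
        for future in futures:
            future.result()  # re-raise any failure from a step
    return log


if __name__ == "__main__":
    print(run_pipeline())
```

The only guarantee here is that C is recorded first; D and E may finish in either order, which is exactly the freedom an orchestration tool exploits.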
There are other tools that do this. But Kettle was the first to support Hadoop and did it best, and it has a bazillion connectors and a whole marketplace. Kettle/PDI is still my top choice today.
Nonetheless, it's worth walking through the other choices you're likely to encounter. Some can be summarily dismissed, but others are worth considering in special cases.
New on the scene is NiFi, which is regarded as a tool for the Internet of Things. NiFi is rather promising. It's as mature as you imagine a new open source project would be (read: not). Tellingly, NiFi doesn’t support important portions of the Hadoop ecosystem, such as Hive.
For the most part, NiFi isn’t really for orchestration or batching, but for streaming transformation. You don't want one tool for drawing flows between sources and destinations with transformation for batch and a whole different tool for streaming. I predict NiFi will evolve and probably become the tool of choice ... but not today.
Technically, you could use Oozie. But as I’ve mentioned in the past, the Oozie documentation is always wrong, with the wrong schema at the top of many of its examples. The validator validates things that won’t run, and there are things that will run but it won’t validate. If you use Oozie for any multistep job, the only certainty is that getting it working will take longer than your estimate. Oh, and returning a value from one step to the next almost never works the first time.
ESBs and more
If you prefer, you can opt to use your favorite ESB. At least an ESB supports orchestration, although it also includes lots of messaging and tools you don’t need. Moreover, you’re probably going to have to write the support for Hadoop all by yourself. Also, this weird mix of client-server messaging and big data platform gives me a small headache.
Finally, you can buy a really big expensive thing! I’m looking at you, Informatica. Ironically, heavy-duty integration solutions are less mature with Hadoop than other tools. Kettle/PDI has been doing it longer. Yet if you’re doing transformation of mainframe data, Informatica has it baked in and proven, whereas Pentaho doesn’t offer commercial support for the open source mainframe integration plug-in from Legstar.
While the existing Legstar plug-in does (shockingly) work, it doesn’t support the latest version of Kettle/PDI. The older version of Kettle/PDI doesn’t seem to support the latest versions of Hadoop, so that’s fun.
Ultimately, Kettle is still the best game in town for orchestrating a batch process involving Hadoop. But I also see an opportunity for something better. Somebody is going to combine an orchestration tool for Hadoop jobs like Kettle/PDI with the right plug-ins for legacy integration. Given that commercial solutions in this space cost hundreds of thousands of dollars and beyond, how about an open source alternative?