- Application Masters: Each application submitted to your Hadoop cluster generally has its own Application Master, though some tools, such as Pig, run multiple applications under a single Application Master. The Application Master composes Resource Requests and sends them to the ResourceManager, which grants (resource) Containers; the Application Master then works with the NodeManagers to launch work in those Containers. These Containers, distributed across the nodes and managed by the NodeManagers, are the actual resources allocated to the application's requests.
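The request-and-grant flow above can be modeled in a few lines. This is a conceptual sketch only, not the real YARN API; all class and parameter names here are made up for illustration:

```python
# Conceptual sketch (NOT the real YARN API): an ApplicationMaster asks the
# ResourceManager for containers sized by memory and cores; the RM grants
# them from whatever nodes have capacity. All names are illustrative.
from dataclasses import dataclass

@dataclass
class ResourceRequest:
    memory_mb: int
    vcores: int
    count: int

class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes  # node name -> free memory in MB

    def allocate(self, request):
        """Grant containers on nodes that still have enough free memory."""
        containers = []
        for node, free in self.nodes.items():
            while len(containers) < request.count and free >= request.memory_mb:
                free -= request.memory_mb
                containers.append((node, request.memory_mb, request.vcores))
            self.nodes[node] = free
        return containers

class ApplicationMaster:
    """One per application: negotiates containers from the RM."""
    def __init__(self, rm):
        self.rm = rm

    def run(self, memory_mb, vcores, count):
        grant = self.rm.allocate(ResourceRequest(memory_mb, vcores, count))
        # In real YARN, the AM would now ask the NodeManagers to launch
        # tasks inside each granted container.
        return grant

rm = ResourceManager({"node1": 4096, "node2": 2048})
am = ApplicationMaster(rm)
containers = am.run(memory_mb=1024, vcores=1, count=4)
print(containers)
```

The point to take away is the division of labor: the Application Master decides what it needs, the ResourceManager decides where that capacity comes from, and the NodeManagers do the actual launching.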
- Pig: Technically Pig is the tool and Pig Latin is the language, but nearly everyone uses Pig to refer to the language as allingca ita igpa atinla isa illysa. While it is possible to execute jobs on Hadoop using SQL, you'll find that SQL is a relatively limited language for potentially unstructured or differently structured data. Pig feels to me like Perl, SQL, and regular expressions had a love child. It isn't super hard to learn or use, and it lets you create MapReduce jobs in far fewer lines of code than the MapReduce API requires. There are lots of things you can do with Pig that you can't do with SQL.
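To give a feel for that brevity, here's the classic word count in Pig Latin. The file name and relation names are made up for the example:

```pig
-- Word count in a handful of lines; the same job against the raw
-- MapReduce API typically runs to a hundred-plus lines of Java.
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO 'wordcount_out';
```

Pig compiles this script down into MapReduce jobs for you, which is exactly the leverage the paragraph above is describing.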
- Hive and Impala: Hive is essentially a framework for data warehousing on top of Hadoop, so if you love your SQL and really want to do SQL on Hadoop, Hive is for you. Impala is another implementation of the same general idea; it is also open source, but it is not hosted at Apache. Why can't we all get along? Well, Hive is more or less backed by Hortonworks and Impala by Cloudera. Cloudera says Impala is faster, and most third-party benchmarks seem to agree (though not with the 100x Cloudera claims). Cloudera doesn't claim that Impala is a complete replacement for Hive (it ships Hive as well), but that Impala is superior for some use cases.
Wait, there's more! EMC didn't want to be without its own answer to this, so it has Greenplum/HAWQ from its new Pivotal division. Those are most decidedly not open source. Not to be outdone, Hortonworks and others are backing Tez, which claims it will offer a thousand-fold improvement.
You should probably know at least what Hive is and some basics of how to use it. That knowledge is somewhat transferable to Impala if you end up working for someone who uses Cloudera's distribution. I'd keep an eye on Tez and not bother learning the others unless you work somewhere that decides the vendor lock-in is really worth ditching the existing expensive proprietary data warehouse infrastructure for a new expensive proprietary data warehouse infrastructure in Pivotal.
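If you want a sense of what those Hive basics look like, here is an illustrative HiveQL snippet. The table, columns, and file path are all invented for the example:

```sql
-- Illustrative HiveQL; table and path names are made up.
CREATE TABLE pageviews (url STRING, ts BIGINT, user_id STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/logs/pageviews.tsv' INTO TABLE pageviews;

-- Hive turns this query into MapReduce jobs behind the scenes.
SELECT url, COUNT(*) AS hits
FROM pageviews
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```

If you already know SQL, that should look entirely familiar, which is the whole appeal.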
Ecosystem (this you should know)
You're not going to be able to claim you "know Hadoop" if you're ignorant of the ecosystem around it. So familiarize yourself with the following:
- HBase/Cassandra: Both are column family databases commonly used with Hadoop (HBase is built directly on top of it). They are very similar in many ways, but there are key areas of differentiation. If running MapReduce jobs against your column family store is a big deal to you, then you should probably go with HBase. If you're doing time-series data -- more of an operational store -- and you need nice, pretty management tools, dashboards, and so on, then you may find Cassandra is your best buddy even if she's a bit cursed.
DataStax and Rackspace seem to be the big backers of Cassandra. HBase seems to be supported by the major Hadoop vendors and, oddly, Microsoft. I'd learn either one, as the key concepts are the same, then transfer the skills. Cassandra will probably present a less steep learning curve if you're starting from scratch, but HBase will be more familiar if you're already heavily into Hadoop. To complicate matters, if you think you'll need finer-grained security in your database (such as cell-level access control), you may also want to look at Accumulo.
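The column family data model both stores share is easy to picture as nested maps: a row key points to named column families, each of which is a sparse map of columns to values. A minimal sketch, with all names invented for illustration:

```python
# Conceptual sketch of a column-family data model (HBase/Cassandra style).
# Rows are keyed; each row holds column families; each family is a sparse
# map of column qualifier -> value. All names here are made up.
store = {}  # row key -> {column family -> {column qualifier -> value}}

def put(row, family, column, value):
    store.setdefault(row, {}).setdefault(family, {})[column] = value

def get(row, family, column):
    return store.get(row, {}).get(family, {}).get(column)

# Time-series-style usage: one row per sensor, one column per timestamp.
put("sensor-42", "readings", "2013-07-01T00:00", 21.5)
put("sensor-42", "readings", "2013-07-01T00:05", 21.7)
put("sensor-42", "meta", "location", "rack 3")

print(get("sensor-42", "meta", "location"))    # rack 3
print(sorted(store["sensor-42"]["readings"]))  # ['2013-07-01T00:00', '2013-07-01T00:05']
```

Rows can have wildly different columns from one another, which is why this model suits the "potentially unstructured or differently structured data" mentioned earlier far better than a rigid relational schema.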