Know this right now about Hadoop

Here's what you must know this instant about big data darling Hadoop and its expanding universe of add-ons and subprojects

I write a lot about Hadoop, for the obvious reason that it's the biggest thing going on right now. Last year everyone was talking about it -- and this year everyone is rolling it out.

Some are still in the installation and deployment phase. Some are in the early implementation phases. Some have already used it to tackle real and previously intractable business problems. But trust me, Hadoop is the hottest ticket in town.


Hadoop and the techniques surrounding it are a bit complicated. The complexity doesn't come from mathematics or even computing theories -- much of it is the nature of the beast. There are so many technologies and components to choose from, often with overlapping capabilities, that it's hard to decide what to use to solve which problems.

If you try to learn the entire Hadoop ecosystem at once, you'll find yourself on a fool's errand. Instead, focus on the foundation, then develop a passing familiarity with the solutions to common problems -- and get a sense of what you need.

Core/foundational elements (this you must know)

Hadoop's underpinning is HDFS, a big distributed file system. Think of it as RAID plus CIFS or NFS over the network: Instead of striping across disks, it stripes data across multiple servers. It offers redundant, reliable, cheap storage. While Hadoop may be most famous for implementing the MapReduce algorithm, HDFS is really the Hadoop ecosystem's foundation. Yet it is a replaceable base, because alternative distributed file systems are available.
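To make that concrete, here's a minimal sketch of writing and reading a file through HDFS's Java FileSystem API. The NameNode address and the file path are stand-ins for illustration; in a real deployment the Configuration object would pick up fs.defaultFS and the rest of the cluster settings from core-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumption for illustration only: a NameNode at hdfs://namenode:8020.
        // Normally this comes from core-site.xml rather than being hard-coded.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/hello.txt");  // hypothetical path

        // Write a small file; HDFS splits it into blocks and replicates them
        // across servers without the client having to manage any of that.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        fs.close();
    }
}
```

The point to notice is that the code never says which servers hold the data; HDFS decides where the blocks and their replicas live.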

YARN (Yet Another Resource Negotiator), which sits on top of HDFS, is a bit hard to describe because, like Hadoop itself, it isn't any one thing. At its base, YARN is a cluster resource manager for Hadoop, and MapReduce is built on top of it. Yet part of the motivation for YARN is to let the Hadoop ecosystem expand beyond algorithms and technologies based on MapReduce. YARN is built around these concepts:

  • MapReduce: You've heard of this. In the case of Hadoop, it is both an algorithm and an API. Abstractly, the algorithm depends on the distribution of data (HDFS) and of processing (YARN); the example below shows how it works. MapReduce is also a Java API provided by Hadoop that lets you create jobs to be processed using the MapReduce algorithm (see the sketch after this list).

    Here's an example of how the MapReduce algorithm works: A client named Jim calls, and he's in construction. You don't remember who he is, but he knows you. Let's say it's 1955, and instead of a computer you have a giant Rolodex (notecards on a spindle) of all your contacts, organized by last name. Flipping through it alone, Jim is going to get wise and realize you don't know who he is long before you find the notecard reading "James Smith, Contractor, Steel Building Construction." If, instead, you have 10 duplicate Rolodexes -- or half of them cover A-M and the other half N-Z -- and there are at least 11 people in the office, you could ask your office manager (writing frantically on a pad) to find all cards for "James, Jim, or Jimmy" that mention "construction" or "contractor."

    Your office manager could ask 10 people in the office to start grabbing cards from their Rolodexes. The office manager might also divvy up the job and ask one person to bother only with A-B, and so on. The office manager has "mapped" the job to the nodes. Now the nodes find the matching cards (the results) and send them to the office manager, who combines the cards onto one sheet of paper and "reduces" them into the final answer (in our case, possibly throwing out duplicates or providing a count). With any luck we get one card; worst case, you have a new request based on what you "cold read" from the client on the phone.
  • (Global) Resource Manager: Negotiates resources among the various applications requesting them. Its underlying scheduler is pluggable and can prioritize different guarantees, such as SLAs, fairness, and capacity constraints.
  • Node Managers: Per-machine agents, one per server in the cluster, that monitor and report resource utilization (CPU, memory, disk, network) back to the Resource Manager.
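To ground the Rolodex analogy in the actual Java API, here's a minimal sketch of a MapReduce job: the classic word count. The mapper plays the role of the office workers (each emits what it finds in its own slice of the data), and the reducer plays the office manager (it combines everything emitted for a given key). Class names and input/output paths here are illustrative, not taken from any particular project.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // "Map": each task scans its own slice of the input and emits (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // "Reduce": gather every count emitted for a word and sum them.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);  // optional local "reduce" on each node
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output are HDFS paths supplied on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

When a job like this is submitted, YARN schedules the map tasks close to the HDFS blocks that hold the input data, then runs the reducers on whichever nodes have spare capacity -- which is exactly the office-manager role described above.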