Yahoo struts its Hadoop stuff

A peek under the hood of Yahoo's Hadoop deployment illustrates how vast the ecosystem has become -- and how the company that invented it is still leading the way

We've heard a lot lately about how Hadoop has stalled. Matt Asay recently cited a Gartner report describing demand for Hadoop as "fairly anemic," while in March InfoWorld blogger James Kobielus pointed to a "puny adoption rate."

The problem is, to get real value from Hadoop, you need to know what you're doing. Some enterprises encounter problems managing Hadoop at scale; others experiment without clear objectives, resulting in initiatives that never get off the ground.

If you want to see what a massively successful Hadoop deployment looks like, you can't do much better than Yahoo -- which, after all, created Hadoop in the first place. Recently, InfoWorld spoke with a couple of senior Hadoop technologists at Yahoo to discover what lessons could be learned from the company's gargantuan Hadoop deployment.

How big is it? According to Sumeet Singh, senior director of cloud and big data platforms, Yahoo has 43,000 servers in about 20 YARN (aka MapReduce 2.0) clusters with 600 petabytes of data on HDFS to serve the company's mobile, search, advertising, personalization, media, and communications efforts. That's quite a data lake. Singh says roughly 33 million jobs per month are processed on top of this infrastructure, and each job may include tens of thousands of tasks.

Hadoop plus a whole lot more

At the same time, Yahoo has led the way in extending beyond MapReduce and HDFS to embrace other technologies in the ever-growing Hadoop family.

According to Tim Tully, VP engineering at Yahoo, the company stores six petabytes of metrics across 5,000 HBase nodes to power its Flurry mobile analytics alone. But for real-time, interactive analytics at the raw event level, Flurry uses an internally developed query language on top of Spark, Hadoop's increasingly popular in-memory cousin.

Moreover, says Tully, "We’ve hit a wall with HBase. We’ve taken it so far that we’re probably taking it as far as it should have gone and beyond, to be honest. It takes a lot of manpower to make sure it stays up."

For mobile analytics, Yahoo is in the process of replacing HBase with Druid, an open source, column-oriented, in-memory database. According to Tully:

"Druid is a real-time query engine. It memory maps your data and stores it in a columnar way, compressed, and it builds indexes on each of the columns. When you run a query on it, it can do very large, OLAP-style processing on the fly in hundreds of milliseconds instead of precalculating everything and dumping it into HBase."
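To make the idea concrete, here is a minimal Python sketch of a column-oriented layout with a per-column index, answering an aggregate at query time rather than from a precomputed table. It is a toy illustration only, not Druid's actual storage format or API; the event fields (`app`, `country`, `sessions`) and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy events; field names are illustrative, not Flurry's schema.
EVENTS = [
    {"app": "news", "country": "US", "sessions": 3},
    {"app": "mail", "country": "US", "sessions": 5},
    {"app": "news", "country": "DE", "sessions": 2},
    {"app": "news", "country": "US", "sessions": 4},
]


def to_columns(rows):
    """Pivot row-oriented events into one list per column."""
    columns = defaultdict(list)
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return dict(columns)


def build_index(column):
    """Map each distinct value to the set of row positions that hold it."""
    index = defaultdict(set)
    for position, value in enumerate(column):
        index[value].add(position)
    return dict(index)


def sum_where(columns, indexes, filters, metric):
    """Sum `metric` over rows matching every equality filter, at query time."""
    matching = set(range(len(columns[metric])))
    for name, value in filters.items():
        # Intersect row positions from each column's index; no full row scan.
        matching &= indexes[name].get(value, set())
    return sum(columns[metric][i] for i in matching)


COLUMNS = to_columns(EVENTS)
INDEXES = {name: build_index(COLUMNS[name]) for name in ("app", "country")}

total = sum_where(COLUMNS, INDEXES, {"app": "news", "country": "US"}, "sessions")
```

Because the filter is resolved by intersecting index entries and only the `sessions` column is read, any combination of filters can be answered on demand, which is the contrast Tully draws with precalculating results and loading them into HBase.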

Yahoo is already running Druid on commodity boxes packed with 192GB to 256GB of RAM. "The performance is great ... some of the tables that we have internally in Druid have billions and billions of events in them, and we’re scanning them in under a second."

The move to in-memory processing is perhaps the broadest trend in the database world, and it is even more pronounced in the Hadoop realm. Particularly with the exploratory analytics intrinsic to Flurry, you don't want to wait for a disk-bound batch job to complete before you iterate.

Building on a YARN foundation

The 600-petabyte data lake overseen by Singh, which provides Hadoop services to the entire company, is another story. The lion's share of processing is handled by YARN, which in itself was a huge improvement over earlier versions of Hadoop.

Yahoo was the first company to adopt YARN at scale "primarily because we were the ones driving the initiative," says Singh. Upgrading to YARN, an endeavor that began in 2011, enabled Yahoo to improve resource utilization by leaps and bounds. "We were hovering around 16 to 18 percent aggregate utilization across all these clusters, which now ... exhibit 70 to 80 percent average monthly utilization, which is exceptionally high."

Yet for analytics applications that demand iteration and interactivity, Singh's operation also offers Spark on every YARN cluster -- plus 3,000 servers running Storm for near-real-time event processing:

"The difference between Hadoop and Storm is that Storm is processing an event at a time versus Hadoop or Spark, which is processing a batch at a time. The batch can be large, the batch can be small, the batch can be really small, which we call a microbatch. The paradigm in Spark that processes microbatch is called Spark streaming. All that is possible in our YARN clusters. Storm is really processing an event storm and there are lots of use cases for that, particularly in advertising, which we call the feedback loop that is essential for budgeting and reporting and managing a campaign."
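The distinction Singh describes can be sketched in a few lines of plain Python. This is a conceptual illustration only and uses neither the Storm nor the Spark API; the function names and the ad-spend figures are hypothetical, and batches here are cut by event count, whereas a real microbatch engine such as Spark Streaming typically cuts them by time interval.

```python
def per_event(events, handle):
    """Event-at-a-time: invoke the handler once for every single event."""
    return [handle([event]) for event in events]


def micro_batches(events, batch_size):
    """Group a stream into consecutive batches of at most `batch_size` events."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def per_batch(events, handle, batch_size):
    """Microbatch: invoke the handler once per small group of events."""
    return [handle(batch) for batch in micro_batches(events, batch_size)]


# Hypothetical ad-spend amounts arriving as a stream.
SPEND = [5, 3, 8, 2, 7]

event_results = per_event(SPEND, sum)     # one result per event
batch_results = per_batch(SPEND, sum, 2)  # one result per microbatch
```

The per-event path produces an update as soon as each event arrives, which suits the advertising feedback loop Singh mentions, while the microbatch path trades a little latency for fewer, larger units of work.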

Singh says Yahoo is a leading contributor to the Apache Storm community. He also notes Yahoo was an early sponsor of Spark when it was being developed at UC Berkeley.

Sorting through the Hadoop ecosystem

The list of technologies in the Hadoop family keeps growing. How does Yahoo decide which to use in order to add new functionality? It's not a trivial question for a company operating at Yahoo's scale, says Singh, given the size of the teams needed to support and battle-harden each open source project. It requires a big investment in resources to ensure security, multitenancy, and scalability.

Singh cites Yahoo's dependence on what he calls "three primary vehicles": Pig, Hive/HCatalog, and Oozie. "The biggest one is Apache Pig, which actually started out in Yahoo and still dominates. I would say 50 to 60 percent of all the processing happens through jobs that are written as Pig scripts." While Pig provides the scripting capabilities, Hive provides the SQL-like abstraction and HCatalog provides the metadata repository for all applications. Oozie is a workflow scheduler, which Singh says his team uses to schedule almost 70 percent of Hadoop jobs.

Yahoo is also a heavy user of Tez, which addresses many of the inefficiencies of MapReduce, including a load of unnecessary disk writes and an arbitrarily strict paradigm. According to Singh, Tez "works like MapReduce, feels like MapReduce but it’s not. Your processing becomes much faster than MapReduce, sometimes an order of magnitude faster." Utilization also improves, as does the predictability of execution time.

Singh says that almost 60 percent of the SQL workload on Hadoop has already moved from MapReduce to Tez. "My hope is that by the end of the year we don’t really have MapReduce workloads on our clusters. It would either be Tez or Spark. Spark is also increasing in volume. It still accounts for only 1 percent of that 33 million jobs I was talking about, but I anticipate that to grow."

For years Hadoop has referred to a cluster of technologies rather than only MapReduce and HDFS. It's fascinating to see the company that invented Hadoop extend beyond those foundations and lead the way in contributing to many open source projects that augment or replace core Hadoop technology.

If that ecosystem seems complicated to the point of overwhelming, it is -- at least to those unfamiliar with the territory. Companies shouldn't enter the realm of Hadoop and its many satellites casually if they expect to get significant value. It takes dedication and expertise. If Hadoop adoption is slowing, that up-front commitment is surely a big reason why.


Copyright © 2015 IDG Communications, Inc.