What's the New York Times doing with Hadoop?

A Times software engineer talks about how Hadoop is driving business innovation at the newspaper and Web site

With Hadoop World NYC just around the corner on Oct. 2, 2009, I thought I'd share two pieces of news.

First, I've received a 25 percent discount code for readers thinking about attending Hadoop World. Hurry because the code expires on Sept. 21.


Second, check out this Q&A with New York Times software engineer and Hadoop user Derek Gottfrid. Derek's doing some very cool work with Hadoop and will be presenting at Hadoop World.

Open Sources: What got you interested in Hadoop initially and how long have you been using Hadoop?

Gottfrid: I've been working with Hadoop for the last three years. Back in 2007, the New York Times decided to make all of its public domain articles from 1851 to 1922 freely available as images scanned from the original paper: 11 million articles in PDF format. The code to generate the PDFs was fairly straightforward, but getting it to run in parallel across multiple machines was an issue. As I wrote about in detail back then, I came across the MapReduce paper from Google. That, coupled with what I had learned about Hadoop, got me started on the road to tackling this huge data challenge.
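
For readers who haven't seen the pattern Derek is describing, here's a minimal sketch of a map-only Hadoop job that fans independent work out across a cluster. This is entirely my own illustration, not the Times' actual code; the helper renderArticleToPdf and the input format (one article ID per line) are hypothetical stand-ins.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PdfBatchJob {

        public static class PdfMapper
                extends Mapper<LongWritable, Text, Text, Text> {

            @Override
            protected void map(LongWritable offset, Text articleId, Context context)
                    throws IOException, InterruptedException {
                String id = articleId.toString().trim();
                if (id.isEmpty()) {
                    return; // skip blank lines in the input manifest
                }
                // Hypothetical helper: assemble the scanned images for this
                // article into a single PDF and return where it was written.
                String pdfPath = renderArticleToPdf(id);
                context.write(new Text(id), new Text(pdfPath));
            }

            private String renderArticleToPdf(String id) {
                // Placeholder for the "fairly straightforward" single-machine
                // PDF generation; Hadoop's contribution is simply running this
                // across many machines at once.
                return "/pdfs/" + id + ".pdf";
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "article-pdf-batch");
            job.setJarByClass(PdfBatchJob.class);
            job.setMapperClass(PdfMapper.class);
            job.setNumReduceTasks(0); // map-only: each article is independent
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Because no aggregation is needed, the reduce phase is switched off entirely; Hadoop just splits the manifest of article IDs across the cluster and retries any task that fails.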

Open Sources: How do you use Hadoop at the New York Times and why has it been the best solution for what you're trying to accomplish?

Gottfrid: We continue to use Hadoop for one-time batch processing of tremendous volumes of image data at the New York Times. We've also moved up the food chain and now use Hadoop for traditional text analytics and Web mining. It's the most cost-effective solution for processing and analyzing large data sets, such as user logs.
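
The counting flavor of log analysis Derek mentions maps naturally onto MapReduce. Here's a rough sketch in the same vein as the job above, again my own illustration rather than the Times' code: the mapper emits (URL, 1) for each request line and the reducer sums the counts. The log layout (URL in the seventh space-separated field, as in Apache-style access logs) is an assumption for illustration.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PageViewCount {

        public static class LogMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text url = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String[] fields = line.toString().split(" ");
                if (fields.length > 6) {    // assumed Apache-style log layout
                    url.set(fields[6]);     // the requested URL
                    context.write(url, ONE);
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text url, Iterable<IntWritable> counts,
                                  Context context)
                    throws IOException, InterruptedException {
                int total = 0;
                for (IntWritable c : counts) {
                    total += c.get();
                }
                context.write(url, new IntWritable(total));
            }
        }
    }

Wired into a driver like the one in the previous sketch (with job.setReducerClass(SumReducer.class) and matching output types), this is the canonical word-count shape applied to Web logs, which is exactly why user-log crunching is such a natural Hadoop workload.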

Open Sources: How would you like to see Hadoop evolve? Or what are the three features you'd most like to see in Hadoop?

Gottfrid: I'd like to see the Hadoop road map clarified, and I'd like the individual subprojects to shed some of their weird interdependencies, so that we can get to a legitimate 1.0 release that solidifies the APIs.

Open Sources: What can attendees expect to learn about Hadoop from your preso at Hadoop World?

Gottfrid: In my session, which I've titled "Counting, Clustering and other Data Tricks," I'm planning to take attendees on the journey I've gone through at the New York Times, from using Hadoop for simple tasks like image processing to the more sophisticated Web analytics use cases I'm working on today.

Open Sources: What are you hoping or expecting to get out of Hadoop World?

Gottfrid: I attended the Hadoop Summit in Silicon Valley, and now I'm interested to see what people in our eastern region are doing with Hadoop. I'm always open to learning new tips and tricks to better leverage the platform.

I'll be at Hadoop World to find out how companies are using Hadoop today, and what use cases will pop up in the future.

Will you be there?

Follow me on Twitter: SavioRodrigues.

p.s.: I should state: "The postings on this site are my own and don't necessarily represent IBM's positions, strategies, or opinions."
