With Hadoop World NYC just around the corner on Oct. 2, 2009, I thought I'd share two pieces of news.
First, I've received a 25 percent discount code for readers thinking about attending Hadoop World. Hurry because the code expires on Sept. 21.
Second, check out this Q&A with Derek Gottfrid, a New York Times software engineer and Hadoop user. Derek's doing some very cool work with Hadoop and will be presenting at Hadoop World.
Open Sources: What got you interested in Hadoop initially and how long have you been using Hadoop?
Gottfrid: I've been working with Hadoop for the last three years. Back in 2007, the New York Times decided to make all the public domain articles from 1851-1922 available free of charge in the form of images scanned from the original paper. That's 11 million articles available as images in PDF format. The code to generate the PDFs was fairly straightforward, but to get it to run in parallel across multiple machines was an issue. As I wrote about in detail back then, I came across the MapReduce paper from Google. That, coupled with what I had learned about Hadoop, got me started on the road to tackle this huge data challenge.
Open Sources: How do you use Hadoop at the New York Times and why has it been the best solution for what you're trying to accomplish?
Gottfrid: We continue to use Hadoop as a one-time batch process for tremendous volumes of image data at the New York Times. We've also moved up the food chain and use Hadoop for traditional text analytics and Web mining. It's the most cost-effective solution for processing and analyzing large sets of data, such as user logs.
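For readers who haven't seen the MapReduce model in action, here is a minimal single-machine sketch of the kind of log analysis Derek describes. It is not the New York Times' code and does not use Hadoop's API; the log format (`user_id path`) and every function name are assumptions made up for illustration. It only shows the three steps a framework like Hadoop distributes across machines: map each record to key/value pairs, group the pairs by key, and reduce each group to a result.

```python
from collections import defaultdict


def map_phase(log_line):
    """Emit (path, 1) for one log line.

    Assumes a hypothetical two-field format: "user_id path".
    Lines that do not have exactly two fields are skipped.
    """
    parts = log_line.split()
    if len(parts) == 2:
        yield parts[1], 1


def shuffle(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all counts for one path into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in-process on an iterable of log lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["u1 /home", "u2 /home", "u1 /archive", "malformed"]
    print(run_job(sample))
```

Because each call to `map_phase` looks at one line and each call to `reduce_phase` looks at one key, the work can be split across many machines without the pieces needing to talk to each other, which is the property that makes the approach attractive for very large data sets.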
Open Sources: How would you like to see Hadoop evolve? Or what are the three features you'd most like to see in Hadoop?
Gottfrid: I'd like to see the Hadoop road map clarified, and I'd like the individual subprojects to get rid of some of the weird interdependencies so that we can get to a legitimate 1.0 release that solidifies the APIs.
Open Sources: What can attendees expect to learn about Hadoop from your preso at Hadoop World?