With Hadoop World NYC just around the corner on Oct. 2, 2009, I thought I'd share two pieces of news.
First, I've received a 25 percent discount code for readers thinking about attending Hadoop World. Hurry because the code expires on Sept. 21.
Second, check out this Q&A with New York Times software engineer and Hadoop user, Derek Gottfrid. Derek's doing some very cool work with Hadoop and will be presenting at Hadoop World.
Open Sources: What got you interested in Hadoop initially and how long have you been using Hadoop?
Gottfrid: I've been working with Hadoop for the last three years. Back in 2007, the New York Times decided to make all the public domain articles from 1851-1922 available free of charge in the form of images scanned from the original paper. That's 11 million articles available as images in PDF format. The code to generate the PDFs was fairly straightforward, but to get it to run in parallel across multiple machines was an issue. As I wrote about in detail back then, I came across the MapReduce paper from Google. That, coupled with what I had learned about Hadoop, got me started on the road to tackle this huge data challenge.
Open Sources: How do you use Hadoop at the New York Times and why has it been the best solution for what you're trying to accomplish?
Gottfrid: We continue to use Hadoop as a one-time batch process for tremendous volumes of image data at the New York Times. We've also moved up the food chain and use Hadoop for traditional text analytics and Web mining. It's the most cost-effective solution for processing and analyzing large sets of data, such as user logs.
Open Sources: How would you like to see Hadoop evolve? Or what are the three features you'd most like to see in Hadoop?
Gottfrid: I'd like to see the Hadoop road map clarified, as well as the individual subprojects to get rid of some of the weird interdependencies so that we can get to a legitimate 1.0 release that solidifies the APIs.
Open Sources: What can attendees expect to learn about Hadoop from your preso at Hadoop World?
