"When it comes to the heavy lifting of getting yesterday's data into our system, or plugging through gigabits-big log files, [Hadoop] is the opportune technology to bring in that data, whether it's structured, semi-structured or even unstructured," Lazzaro says.
Playing with big data
Hadoop lets enterprises store and process data they previously discarded -- log files, for example -- because it was too hard to process and didn't fit cleanly into traditional database schemas. That's the crux of so-called big data, says Matt Aslett, research manager, data management and analytics, at 451 Research. "It's about doing things with data that was previously thrown away in a way that enables new applications and new projects."
In addition to being scalable, Hadoop computing systems are flexible. Hadoop is schema-less, which lets users join and aggregate data from disparate sources for more complex analyses. New nodes can be added as needed, and Hadoop's built-in fault tolerance features allow the system to redirect work to another location if a node is lost.
"That schema-less approach, which lets you just store the data and then figure out what you want to do with it, is much more appropriate for unstructured and semi-structured data like Web log data, as well as for data that you know has value for the organization, but you may need to do some experimentation to figure out what that value is," Aslett says. "The cost of doing that in an enterprise data warehouse would just be prohibitive."
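To make the "store it first, figure it out later" idea concrete, here is a minimal sketch in plain Python. It does not use any Hadoop API; the record formats, field names and the `read_with_schema` and `activity_by_ip` helpers are all assumptions invented for illustration. Raw lines from two differently shaped sources are kept exactly as they arrived, and structure is applied only when a question is asked of them.

```python
import json
from collections import Counter

# Raw records stored exactly as they arrived; no schema was imposed at write time.
# Both formats and all field names are hypothetical.
RAW_RECORDS = [
    '203.0.113.5 GET /index.html 200',
    '{"ip": "203.0.113.5", "event": "signup"}',
    '198.51.100.7 GET /pricing.html 200',
    'corrupted line',
    '{"ip": "198.51.100.7", "event": "login"}',
    '203.0.113.5 GET /missing.html 404',
]


def read_with_schema(line):
    """Interpret one raw line at read time; return a dict, or None if unparseable."""
    line = line.strip()
    if line.startswith("{"):
        # Semi-structured source: a JSON application event.
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            return None
        return {"ip": record.get("ip"), "source": "app_event"}
    parts = line.split()
    if len(parts) == 4:
        # Web log source: "<ip> <method> <path> <status>".
        ip, _method, _path, _status = parts
        return {"ip": ip, "source": "web_log"}
    return None  # Unknown shape: skip it now, but the raw line is still stored.


def activity_by_ip(raw_records):
    """Join the two sources on IP address and count records per address."""
    counts = Counter()
    for line in raw_records:
        record = read_with_schema(line)
        if record is not None and record["ip"]:
            counts[record["ip"]] += 1
    return counts
```

Because the interpretation lives in the reader rather than in the storage layer, a different question later on (say, counting status codes instead of addresses) would need only a different reader function, not a reload of the data.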
Return Path, an email certification and reputation monitoring company, started experimenting with Hadoop in 2008, attracted by its enormous storage potential and the ability to easily scale the platform by adding servers. Return Path collects massive amounts of data from ISPs and analyzes it to, for instance, establish email sender reputations, pinpoint deliverability issues or monitor potentially harmful messages.
In the early days, signing on a new ISP or two could quadruple the company's data volume. Return Path found that it could neither keep data as long as it wanted to nor process it as fast as it wanted to, recalls CTO Andy Sautins. Over the years, he and his team tried a few custom solutions to augment the company's traditional enterprise data warehouse. "These worked fairly well but required much more time and investment in software development than made sense," Sautins says.
Hadoop was a game-changer. "It let us change the conversation around what it meant to retain data. It wasn't in terms of weeks, it was years," Sautins says. "Hadoop really helped us be able to weather the storm of retaining and processing more data."
Moving out of the shadows
Apache Hadoop includes two main subprojects: the Hadoop Distributed File System (HDFS), which provides high-throughput access to application data, and Hadoop MapReduce, a software framework for distributed processing of large data sets on compute clusters. It's augmented by a growing group of Apache projects, such as Pig, Hive and ZooKeeper, that extend its usability.
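The MapReduce programming model can be sketched in a few lines of plain Python. This is an in-process illustration only, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the log-line layout are assumptions made for the example. On a real cluster, the framework would run the map and reduce steps in parallel on many nodes and handle the grouping step itself.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit (key, 1) for each input line.

    The key is the last whitespace-separated field, assumed here to be an
    HTTP status code in a hypothetical log format.
    """
    fields = line.split()
    if fields:
        yield fields[-1], 1


def shuffle(pairs):
    """Shuffle: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling `run_job` on a list of log lines returns a dictionary that maps each status code to the number of lines carrying it. The point of splitting the work this way is that each map call sees only one line and each reduce call sees only one key, so both steps can be spread across machines without coordination between them.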
Hadoop's emergence as an enterprise platform mirrors in many ways the arrival of Linux: Deployments were preceded by shadow IT projects, or skunk works, to test the merits of the software before adopting it on a wider scale.
Adoption is growing largely through developers "who've got an ear to ground, figuring out what the other companies are doing," 451 Research's Aslett says. "It's just as we saw Linux move into enterprises through the IT department and internal projects, when the CEO/CIO didn't necessarily know that it was in there. It's exactly the same with Hadoop."