Hadoop is coming out of the shadows and into production in IT shops that are drawn to its ability to store, process and analyze extremely large volumes of data. But the relative newness of the open-source platform and a shortage of experienced Hadoop talent pose technical challenges that enterprise IT teams need to address.
Hadoop grew out of the work of Doug Cutting and Mike Cafarella, who originally developed it to support Apache Nutch, an open-source search engine. It became an Apache project when Cutting and a team of engineers at Yahoo split the distributed computing code out of the Nutch crawler to create Hadoop.
Today Hadoop powers every click at Yahoo, where the Hadoop production environment spans more than 42,000 nodes. That kind of scalability is a sweet spot of Hadoop, which is designed to handle data-intensive distributed applications spanning thousands of nodes and exabytes of data, with a high degree of fault tolerance.
Hadoop pioneers in the online world -- including eBay, Facebook, LinkedIn, Netflix, and Twitter -- paved the way for companies in other data-intensive industries such as finance, technology, telecom and government. Increasingly, IT shops are finding a place for Hadoop in their data architecture plans. The appeal, in a nutshell, is that Hadoop can enable massively parallel computing on inexpensive commodity servers. Companies can collect more data, retain it longer, and perform analyses that weren't practical in the past because of cost, complexity and a lack of tools.
At Concurrent Computer, the decision to use Hadoop was driven in large part by volume.
"Scalability was the biggest concern. With a traditional relational database, every time you want to scale or get bigger, you end up paying a premium," says Will Lazzaro, director of engineering at Concurrent, which provides video-on-demand systems and processes billions of records a day related to viewers, content consumption and platform operations.
"When it comes to the heavy lifting of getting yesterday's data into our system, or plugging through gigabits-big log files, [Hadoop] is the opportune technology to bring in that data, whether it's structured, semi-structured or even unstructured," Lazzaro says.
Playing with big data
Hadoop lets enterprises store and process data they previously discarded -- log files, for example -- because it was too hard to process and didn't fit cleanly into traditional database schemas. That's the crux of so-called big data, says Matt Aslett, research manager for data management and analytics at 451 Research. "It's about doing things with data that was previously thrown away in a way that enables new applications and new projects."
In addition to being scalable, Hadoop computing systems are flexible. Hadoop is schema-less, which lets users join and aggregate data from disparate sources for more complex analyses. New nodes can be added as needed, and Hadoop's built-in fault tolerance features allow the system to redirect work to another location if a node is lost.
"That schema-less approach, which lets you just store the data and then figure out what you want to do with it, is much more appropriate for unstructured and semi-structured data like Web log data, as well as for data that you know has value for the organization, but you may need to do some experimentation to figure out what that value is," Aslett says. "The cost of doing that in an enterprise data warehouse would just be prohibitive."
Return Path, an email certification and reputation monitoring company, started experimenting with Hadoop in 2008, attracted by its enormous storage potential and the ability to easily scale the platform by adding servers. Return Path collects massive amounts of data from ISPs and analyzes it to establish email sender reputations, pinpoint deliverability issues or monitor potentially harmful messages, for instance.
In the early days, signing on a new ISP or two could quadruple the company's data volume. Return Path found it couldn't keep data as long as it wanted, nor process it as fast as it needed, recalls CTO Andy Sautins. Over the years, he and his team tried a few custom solutions to augment the company's traditional enterprise data warehouse. "These worked fairly well but required much more time and investment in software development than made sense," Sautins says.
Hadoop was a game-changer. "It let us change the conversation around what it meant to retain data. It wasn't in terms of weeks, it was years," Sautins says. "Hadoop really helped us be able to weather the storm of retaining and processing more data."
Moving out of the shadows
Apache Hadoop includes two main subprojects: the Hadoop Distributed File System (HDFS), which provides high-throughput access to application data, and Hadoop MapReduce, a software framework for distributed processing of large data sets on compute clusters. It's augmented by a growing group of Apache projects, such as Pig, Hive and ZooKeeper, that extend its usability.
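For readers who want a concrete picture of how HDFS and MapReduce divide the work, below is a minimal sketch of a MapReduce job written against Hadoop's Java API: the canonical word count, which tallies how often each word appears in text files stored in HDFS. The class and variable names are illustrative, the input and output HDFS paths are assumed to be passed on the command line, and the code is meant to show the programming model rather than serve as production-ready work.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative word-count job; class names are placeholders, not from the article.
public class WordCount {

    // Map step: runs on the nodes holding the HDFS blocks and emits a (word, 1) pair per token.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: receives every count for a given word, no matter which node produced it, and sums them.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            total.set(sum);
            context.write(word, total);
        }
    }

    public static void main(String[] args) throws Exception {
        // Newer Hadoop releases prefer Job.getInstance(conf, "word count").
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // pre-aggregates on each node to cut shuffle traffic
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The point of the model is what the programmer does not have to write: the framework splits the input across the cluster, schedules map tasks close to the HDFS blocks they read, shuffles the intermediate pairs to the reducers, and reruns a task on another node if the one running it fails.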
Hadoop's emergence as an enterprise platform mirrors in many ways the arrival of Linux: Enterprise deployments were preceded by shadow IT projects, or skunk works, in which teams tested the software's merits before adopting it on a wider scale.
Adoption is growing largely through developers "who've got an ear to the ground, figuring out what the other companies are doing," 451 Research's Aslett says. "It's just as we saw Linux move into enterprises through the IT department and internal projects, when the CEO/CIO didn't necessarily know that it was in there. It's exactly the same with Hadoop," he says.
The emergence of vendors with commercial, enterprise-oriented Hadoop distributions -- including support, management tools and configuration assistance -- has further accelerated adoption in the enterprise realm. Key players in this arena are Cloudera, MapR Technologies, and Hortonworks, which was spun out of Yahoo last year to develop its own distribution of Hadoop.
Concurrent uses the Cloudera CDH platform. "Certainly we could take the open-source version without the Cloudera support, but we found a vendor partner that allows us to expand our solution and leverage their expertise, and really understand how the system works, not just hack it together because it's open source," Lazzaro says.
Return Path started working with MapR's commercial distribution last year, a move it made to improve stability and boost performance. "We've been able to see a roughly 2.5- to three-times increase in performance for our workloads," Sautins says. "That means we can either run things twice as fast, which is great, or we can run half the servers, which can also be very compelling."
Along with multiplying options for commercial Hadoop distributions, there are other signs the open-source platform is gathering steam. Venture capital is flowing, and new startups with management add-ons and analytic applications are appearing at a dizzying pace. It's also getting increasing attention from traditional data management players -- including IBM, Oracle, Microsoft and EMC -- eager to cash in on the action.
On the funding front, 2011 was a huge year for Hadoop vendors: Cloudera landed $40 million in Series D funding; MapR secured $20 million in Series B funding; Datameer, which makes analytics tools built on Hadoop, secured $9.25 million in its second funding round; and in September, $11 million went to DataStax, which offers a commercial version of the Apache Cassandra distributed database management system as well as a new product that couples Cassandra with Hadoop analytics.
Another sign of increasing financial investment in Hadoop-related startups is Accel Partners' launch of a $100 million big data fund earmarked for startups working in areas including data management, storage, data analytics and business intelligence. To help spend the money, Accel lined up a team of fund advisers, and the Hadoop realm is well represented by Cutting, who's now with Cloudera; Gil Elbaz, founder of Hadoop user Factual; Cloudera Chief Scientist Jeff Hammerbacher, who once led the data team at Facebook; and Facebook's Jay Parikh.
"There's already a second and third generation of startups being created to take advantage of this macro trend. We're the old guys in the room now, after doing this for three years," says Charles Zedlewski, vice president of product at Cloudera.
Choosing workloads, finding talent
Hadoop makes it easier to process big data, but it's no cure-all. One common challenge for enterprises is how to choose the most appropriate technology to handle different kinds of data.
"I think there's still a lot of confusion about what applications, what workloads, should be on Hadoop versus those that should be in a traditional enterprise data warehouse," Aslett says. "Unfortunately at this point, there aren't any easy answers for that."
Another challenge that will only grow as Hadoop heads for the mainstream is finding people to work with the technology. "There's a lack of skills, and that's definitely a challenge in terms of the continued adoption of Hadoop," Aslett says.
Major players including Cloudera, IBM, Hortonworks and MapR are all investing heavily in training programs to teach IT pros how to deploy, configure and manage Hadoop products. "They're well aware that this is actually an issue that could limit the continued adoption of Hadoop at an enterprise level," Aslett says.
"If you go out there and try to hire, it's incredibly difficult," acknowledges Omer Trajman, vice president of customer solutions at Cloudera. A more feasible approach is to look internally for candidates ripe to learn Hadoop, he suggests.
"The most successful companies aren't necessarily going out and trying to hire aggressively. They have people who have the basic skills required, individuals who have backgrounds in statistics, science, data processing, Java development and analytics," Trajman says. "It's really about looking inward into an organization, finding people who already have familiarity with the business and domain expertise, and teaching them how to use these tools."
On the positive side, as awareness of Hadoop grows, the number of IT pros learning Hadoop is growing, too.
"Every time I've talked to a recruiter for the last two years, I've asked if they have anybody with Hadoop experience. Usually the answer was 'ha-what?' Increasingly it's maturing, so you are seeing more people in the field," says Concurrent's Lazzaro.
Figuring out what kind of person is best to hire can be a challenge in itself.
"We originally thought we needed to find a hardcore Java developer," Return Path's Sautins says. But in reality, the talent that's best suited for working with Hadoop isn't necessarily a Java engineer. "It's somebody who can understand what's going on in the cluster, is interested in picking up some of these tools and figuring out how they work together, and can deal with the fact that pretty much everything in the Hadoop ecosystem is not even a 1.0 release yet," Sautins says. "That's a real skill set."
This story, "Hadoop wins over enterprise IT, spurs talent crunch," was originally published by Network World.