Hadoop software tends to be grouped alongside NoSQL databases as big data technology. Hadoop's core components consist of MapReduce, which distributes processing jobs across a Hadoop cluster, and the Hadoop Distributed File System (HDFS), which stores data across the cluster's nodes. A number of other open source projects, as well as some commercial software, round out the Hadoop ecosystem.
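The MapReduce model mentioned above can be sketched in a few lines. The word-count task, function names, and sample data below are illustrative assumptions for this article, not anything drawn from the companies profiled; real Hadoop jobs run the map and reduce steps in parallel across cluster nodes, with the framework handling the sort-and-shuffle in between.

```python
# A minimal, local simulation of the MapReduce model: map emits
# key/value pairs, the framework groups them by key, reduce aggregates.
# All names here are illustrative, not from any production deployment.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Emit a (word, 1) pair for every word, as a Hadoop mapper would."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Sum the counts per word, as a Hadoop reducer would. Hadoop sorts
    mapper output by key before reducing; sorting here simulates that
    shuffle phase."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

# On a cluster, these documents would be split across many machines.
documents = ["hadoop stores data", "hadoop processes data"]
counts = dict(reduce_phase(map_phase(documents)))
```

Running the simulation produces word totals such as two occurrences of "hadoop"; the point is that neither function changes when the input grows from two lines to two billion, which is what lets the framework scale the same logic across a cluster.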
A company's journey into this ecosystem could well begin as an informal experiment. For example, Weber says a company may have an employee interested in Hadoop who downloads the software and builds a small cluster.
Doing something more ambitious with Hadoop will typically require additional resources. In Shutterfly's case, the organization started with in-house resources, but now works with an outside contractor and plans to bring on additional help. Shutterfly also aims to harness Hadoop for website analytics. The company hopes to glean greater insight into customer transactions and the website's overall technical performance.
While Shutterfly works with the contractor on a limited basis, the company will be working with vendors like Hortonworks to start "an effort that is much more formalized," Weber says. Contractor and vendor resources will initially focus on getting the company's Hadoop project off the ground. Weber says he also aims to train a small group of in-house personnel beyond introductory Hadoop knowledge.
Monsanto, an agricultural products company based in St. Louis, also finds itself cultivating internal resources and looking for outside support. The company's geographic location away from the big IT centers on the East and West coasts creates Hadoop recruiting and hiring issues. "Being in the Midwest, that is a challenge for us," says Lori Yancey, R&D IT staffing lead at Monsanto.
The company has been evaluating Hadoop since late 2009. Last year, Monsanto decided to build out a full production cluster, notes Erich Hochmuth, R&D IT high performance analytics lead at Monsanto. He says the company has a couple of Hadoop projects underway and uses the platform "for analytics over large unstructured and semi-structured datasets."
Monsanto's Hadoop initiatives focus on using the platform to build enterprise data processing pipelines for analyzing and storing data generated from scientific instruments. Hochmuth says building these analysis pipelines in Hadoop will allow Monsanto to scale as new scientific instruments are adopted and, as a result, data volume increases. Traditional solutions, on the other hand, require IT personnel to rewrite and re-engineer the analysis pipelines to accommodate growth in data volume.
Hochmuth says Monsanto has tapped Cloudera as a source of Hadoop know-how. Cloudera will offer consulting services to get Monsanto's Hadoop projects up and running. Once Monsanto has a team using Hadoop, the next step will involve building up its in-house knowledge, Hochmuth notes. To that end, Cloudera will provide on-site training sessions for Hadoop administrators, as well as ongoing enterprise support, he adds.
Consulting, development, training address Hadoop skills shortage
Vendors targeting the Hadoop skills gap offer a mix of consulting, software development, and training services. Key players here include Hadoop distributors and specialized IT services companies.