Increasing efforts by enterprises to glean business intelligence from the massive volumes of unstructured data generated by web logs, clickstream tools, social media products and the like have led to a surge of interest in open source Hadoop technology, analysts say.
Hadoop, an Apache data management software project with roots in Google's MapReduce software framework for distributed computing, is designed to support applications that use massive amounts of unstructured and structured data.
Unlike traditional relational database management systems, Hadoop is designed to work with multiple data types and data sources. The Hadoop Distributed File System (HDFS) allows large application workloads to be broken up into smaller data blocks that are replicated and distributed across a cluster of commodity hardware for faster processing.
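To make the block-and-replica idea concrete, here is a minimal Python sketch. It is an illustration only, not Hadoop's actual code: the function names, the tiny block size, and the round-robin placement are all assumptions made for this example. Real HDFS uses much larger blocks and a rack-aware placement policy.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int = 3) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes.

    Hypothetical round-robin policy, used here only for illustration.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement: Dict[int, List[str]] = {}
    for block_id in range(num_blocks):
        # Consecutive offsets modulo the node count give distinct nodes.
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    data = b"abcdefghij"                       # stand-in for a large file
    blocks = split_into_blocks(data, block_size=4)
    nodes = ["node1", "node2", "node3", "node4"]  # hypothetical cluster
    for block_id, holders in place_replicas(len(blocks), nodes, 2).items():
        print(block_id, blocks[block_id], holders)
```

Because every block lives on more than one machine, the loss of a single node does not lose data, and different machines can work on different blocks at the same time.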
The technology is already used widely by some of the world's largest Web properties, such as Facebook, eBay, Amazon, Baidu, and Yahoo. Observers note that Yahoo has been one of the biggest contributors to Hadoop.
Increasingly, Hadoop technology is also being used by banks, advertising companies, life science firms, pharmaceutical companies and other corporate IT operations, said Stephen O'Grady, an analyst with RedMonk.
What's driving Hadoop adoption is the desire by companies to leverage massive amounts of different kinds of data to make business decisions, O'Grady said. The technology lets companies process terabytes and even petabytes of complex data relatively effectively and at substantially lower cost than conventional relational database management systems, experts say.
"The big picture is that with Hadoop you can have even a one and two person startup being able to process the same volume of data that some of the biggest companies in the world are," he said.
Hadoop user Tynt, a Web analytics firm, provides analytics services for more than 500,000 websites. Its primary offering is a service that lets content publishers get insight into how their content is being shared. On an average day Tynt collects and analyzes close to 1 terabyte of data from hundreds of millions of web interactions on the sites that it monitors.
The company switched to Hadoop about 18 months ago when its MySQL database infrastructure began collapsing under the sheer volume of data that Tynt was collecting.
"Philosophically, Hadoop is a whole different animal," said Cameron Befus, Tynt's vice president of engineering.