In a further sign that big data is beginning to hit its stride, EMC Greenplum on Monday debuted a new Apache Hadoop distribution that it hopes will soon become the de facto standard, pushing aside current leaders like Cloudera and Hortonworks.
"The bet that we're making at Greenplum is that the future of our big data business is Hadoop, so it's important that we have our own distribution,"
"Virtually every customer and prospect we talk to is doing something with Hadoop," says Josh Klahr, vice president of product management at EMC Greenplum. "It ranges from CIOs that have heard of Hadoop and spun up a Big Data team to figure out what to do with it to folks that are being a little more thoughtful about it and have figured out a use case."
"We're seeing rapid adoption," he adds. "The Hadoop business is growing 60 to 70 percent a year. We see this as a real sea change."
Since its emergence as an Apache Lucene subproject in 2006, Apache Hadoop has quickly become the preferred solution for big data applications with massive repositories of unstructured data. Hadoop has many things to recommend it: It's flexible, scalable, built on commodity hardware and fault-tolerant.
However, there are hurdles to implementing Hadoop in the enterprise. One critical hurdle is the lack of useful interfaces and high-level tooling for business intelligence (BI) and data mining. That's where Greenplum sees its opportunity, Klahr says.
Greenplum bets Hadoop is the future of big data
"The bet that we're making at Greenplum is that the future of our big data business is Hadoop, so it's important that we have our own distribution," Klahr explains. "There are some things we're adding into the distribution that aren't yet supported by Apache. We think that the Hadoop market is going to be so big that we wanted to have our own distribution."
Some elements of EMC Greenplum's distribution, Pivotal HD, may never find their way back into the Apache project. The core of Pivotal HD is a technology called HAWQ, which marries Greenplum's massively parallel processing (MPP) database technology with the Apache Hadoop framework. HAWQ is essentially a fully functional, high-performance relational database that runs inside Hadoop and speaks SQL natively.
"Our plan is to actively contribute certain elements back to Apache, but we're going to have HAWQ as a proprietary service that we won't be open sourcing," Klahr says.
Klahr says that HAWQ delivers performance improvements of 50X to 500X compared with existing SQL-like services on top of Hadoop, such as Hive.
HAWQ connects data workers, data tools to Hadoop repositories
"There's this whole group of data workers and data tools that exist in the enterprises we work with that can't easily talk to Hadoop," Klahr says. "But thousands of folks can talk SQL. We're bringing a pure SQL database engine and we're embedding it into our Hadoop distribution. It's a SQL database that you can connect any BI tool to."
With HAWQ in place, Hadoop can become a single data repository against which organizations can run both MapReduce jobs and SQL queries with ease.
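To make the "one repository, two query models" idea concrete, here is a toy sketch, with pure Python standing in for a Hadoop cluster, in which the same records are aggregated once MapReduce-style (map, shuffle, reduce by hand) and once with SQL; the dataset and names are illustrative assumptions, not from the article:

```python
import sqlite3
from collections import defaultdict

# Toy dataset standing in for records stored in HDFS.
records = [("alice", 3), ("bob", 5), ("alice", 2), ("bob", 1)]

# --- MapReduce-style aggregation ---
shuffled = defaultdict(list)
for key, value in records:          # map + shuffle: group values by key
    shuffled[key].append(value)
mr_totals = {key: sum(values) for key, values in shuffled.items()}  # reduce

# --- SQL aggregation over the same records ---
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE events (user TEXT, amount INTEGER)")
cur.executemany("INSERT INTO events VALUES (?, ?)", records)
cur.execute("SELECT user, SUM(amount) FROM events GROUP BY user")
sql_totals = dict(cur.fetchall())

# Both query models over the same data give the same answer.
assert mr_totals == sql_totals == {"alice": 5, "bob": 6}
```

In a real Pivotal HD cluster the data would be stored once in HDFS, with MapReduce jobs and HAWQ SQL queries reading the same files rather than two separate copies.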