In a further sign that big data is beginning to hit its stride, EMC Greenplum on Monday debuted a new Apache Hadoop distribution that it hopes will soon become the de facto standard, pushing aside current leaders like Cloudera and Hortonworks.
"Virtually every customer and prospect we talk to is doing something with Hadoop," says Josh Klahr, vice president of product management at EMC Greenplum. "It ranges from CIOs that have heard of Hadoop and spun up a Big Data team to figure out what to do with it to folks that are being a little more thoughtful about it and have figured out a use case."
"We're seeing rapid adoption," he adds. "The Hadoop business is growing 60 to 70 percent a year. We see this as a real sea change."
Since its emergence as an Apache Lucene subproject in 2006, Apache Hadoop has quickly become the preferred solution for big data applications with massive repositories of unstructured data. Hadoop has many things to recommend it: It's flexible, scalable, fault-tolerant and designed to run on commodity hardware.
However, there are hurdles to implementing Hadoop in the enterprise. One critical hurdle is the lack of useful interfaces and high-level tooling for business intelligence and data mining. That's where Greenplum sees its opportunity, Klahr says.
Greenplum bets Hadoop is the future of big data
"The bet that we're making at Greenplum is that the future of our big data business is Hadoop, so it's important that we have our own distribution," Klahr explains. "There are some things we're adding into the distribution that aren't yet supported by Apache. We think that the Hadoop market is going to be so big that we wanted to have our own distribution."
Some elements of EMC Greenplum's distribution, Pivotal HD, may never find their way back into the Apache project. The core of Pivotal HD is HAWQ, a technology that marries Greenplum's MPP (massively parallel processing) database technology with the Apache Hadoop framework. HAWQ is essentially a fully functional, high-performance relational database that runs in Hadoop and speaks SQL natively.
"Our plan is to actively contribute certain elements back to Apache, but we're going to have HAWQ as a proprietary service that we won't be open sourcing," Klahr says.
Klahr says that HAWQ delivers performance improvements of 50X to 500X when compared with existing SQL-like services (like Hive) on top of Hadoop.
HAWQ connects data workers, data tools to Hadoop repositories
"There's this whole group of data workers and data tools that exist in the enterprises we work with that can't easily talk to Hadoop," Klahr says. "But thousands of folks can talk SQL. We're bringing a pure SQL database engine and we're embedding it into our Hadoop distribution. It's a SQL database that you can connect any BI tool to."
With HAWQ in place, Hadoop can become a single data repository against which organizations can run both MapReduce jobs and SQL queries.
"With Pivotal HD, we can check off many of the items on our Hadoop wish-list--things like plug-in support for the ecosystem of tools, improved data management and greater elasticity in terms of the storage and compute layer," says Steven Hirsch, chief data officer and senior vice president of Global Data Services at NYSE Euronext.
"But above all," Hirsch says, "it provides true SQL query interfaces for data workers and tools--not a superficial implementation of the kind that's so common today, but a native implementation that delivers the capability of real and true SQL processing and optimization."
"Having a single Hadoop infrastructure for big data investigation and analysis changes everything," Hirsch says. "Now add all of this functionality to the fact that the SQL performance is up to 100X faster than other offerings and you have an environment that we at NYSE Euronext are extremely excited about."
EMC plans to make Pivotal HD available at the end of the first quarter as a software-only or appliance-based solution.
Thor Olavsrud covers IT Security, Big Data, Open Source, Microsoft Tools and Servers for CIO.com.
This story, "EMC Greenplum tackles big data with Hadoop distribution," was originally published by CIO.