Will 'enterprise data clouds' reinvent data warehouses?

Greenplum looks to use commoditized hardware to manage multiple warehouses, jumping into a cloud trend

Today database vendor Greenplum unveiled a "solution for enterprise data clouds." The company claims this represents a shift in how enterprise data is managed, and it's aiming to displace data warehouse appliances in large enterprises altogether. Fox Interactive Media, T-Mobile, Zions Bank, and others are already working with Greenplum to build early iterations of enterprise data clouds (EDCs).

So what the heck are EDCs? It turns out that it's just jargon for using cloud computing to create and manage multiple data warehouses on a common pool of commoditized hardware. I wanted the real scoop, so I sat down with Scott Yara, Greenplum's cofounder and president, to talk about this new offering and the future of enterprise-class data warehousing.


whurley: What's the most significant benefit you think customers will see as a result of enterprise data clouds?

Yara: There is a fundamental friction that stems from the competing needs of business analysts and IT operations. Business analysts want the flexibility and power to combine and analyze any data, and get their work done without resource delays or long process hurdles. IT operations wants to centralize and consolidate to reduce costs, streamline support, and deliver better quality of service. Until now these have been largely irreconcilable -- but no longer.

The answer is self-service -- i.e., have IT provide a platform that allows business analysts to serve themselves without IT involvement. In a self-serve environment, IT gives the business the power and control to instantly provision and deploy data warehouses. A warehouse is provisioned from an infrastructure pool that could consist of on-premise physical servers, virtual machines, or potentially even public cloud resources. The self-service provisioning layer needs to provide a Web interface to business analysts, allowing them to create and manage warehouses. Meanwhile, IT operations can focus on assembling pools of tens, hundreds, or thousands of servers to hold the warehouses.

By getting this right, each party can now focus on doing what it does best. Business analysts can spin up warehouses in minutes as projects dictate, and bring together the data they need in their own project space. IT operations can manage the infrastructure pools and self-serve provisioning platform as one infrastructure, without the need to concern themselves with the contents and usage of any particular warehouse.

The introduction of self-serve as the central tenet of Greenplum’s EDC initiative is a breath of fresh air for everyone involved and opens the door to new kinds of practices, use cases, and collaboration that wouldn’t have been possible otherwise.
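To make the self-service model Yara describes more concrete, here is a minimal sketch in Python. It is not Greenplum's API or implementation; the names `InfrastructurePool`, `Warehouse`, `provision`, and `release` are hypothetical and exist only for illustration. The sketch assumes capacity is tracked in whole terabytes: IT operations owns a single pool, and analysts carve warehouses out of it on demand.

```python
from dataclasses import dataclass, field


@dataclass
class Warehouse:
    """A warehouse carved out of the shared pool (hypothetical model)."""
    name: str
    owner: str
    capacity_tb: int


@dataclass
class InfrastructurePool:
    """Hypothetical pool of commodity capacity run by IT operations."""
    total_tb: int
    warehouses: dict = field(default_factory=dict)

    @property
    def free_tb(self) -> int:
        # Capacity not yet assigned to any warehouse.
        used = sum(w.capacity_tb for w in self.warehouses.values())
        return self.total_tb - used

    def provision(self, name: str, owner: str, capacity_tb: int) -> Warehouse:
        """Self-service step: an analyst requests a warehouse by size."""
        if capacity_tb <= 0:
            raise ValueError("capacity_tb must be positive")
        if name in self.warehouses:
            raise ValueError(f"warehouse {name!r} already exists")
        if capacity_tb > self.free_tb:
            raise RuntimeError("not enough free capacity in the pool")
        warehouse = Warehouse(name, owner, capacity_tb)
        self.warehouses[name] = warehouse
        return warehouse

    def release(self, name: str) -> None:
        """Return a warehouse's capacity to the pool."""
        del self.warehouses[name]


# Illustrative usage: IT sizes the pool once, analysts provision as needed.
pool = InfrastructurePool(total_tb=100)
pool.provision("campaign_sandbox", owner="analyst_a", capacity_tb=10)
pool.provision("cdr_analysis", owner="analyst_b", capacity_tb=25)
print(pool.free_tb)  # 65
```

The point of the design is the separation of concerns described in the interview: the pool only tracks capacity and ownership, and never needs to know what is inside any particular warehouse.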

whurley: Are there going to be issues with moving customers from proprietary data warehousing solutions onto enterprise data clouds?

Yara: Most companies have hundreds or thousands of data marts in addition to any central EDW [enterprise data warehouse]. Initially we’re focusing on making it easy to bring those smaller silos into the EDC -- i.e., just pump across the raw data without any modification to the data model -- so that companies can consolidate these and start querying them or pointing BI tools at them with minimal disruption. These marts tend to have less complex integrations with BI and ETL tools, and are a much easier place to start than a larger central store.

Even more importantly, our customers want to use EDC to bring in a range of raw data -- such as event streams, telco CDRs, and transactions -- and allow analysts to create sandboxes and start putting that data to work. So they will be able to do new analysis and will quickly be encouraged to move other silos into the EDC because of the combined value.

whurley: Can you give us an idea of the cost of creating and maintaining an enterprise data cloud?

Yara: From a software cost perspective, the EDC is not a separate product or cost item from Greenplum. There is no premium from our point of view. From a cost of software perspective, the most flexible option is to buy a subscription by capacity (in terabytes) and expand that as needed. It is no more expensive to carve up 100 terabytes into 100 warehouses than to run it as one big warehouse.

From an operational standpoint, customers building an on-premise EDC (as we’d expect most to initially) start by assembling a pool of tens, hundreds, or more commodity servers. Then from there they provision warehouses as needed.

We expect EDC to lead to more streamlined IT operation of data warehouses, better utilization, and higher service levels.

whurley: Some consider Oracle to be behind the 8-ball with Exadata being a latecomer to the market. What are your thoughts on how Greenplum's solution stacks up?

Yara: Exadata is essentially Oracle RAC with some storage enhancements in an appliance form factor. Customers have been skeptical of Exadata because they are very familiar with the limitations of Oracle’s database in high-scale data warehousing environments, and they see more of the same with Exadata. We saw some initial interest in Exadata, but we don’t generally see them in competitive situations.

More generally, EDC is in many ways a response to the inflexibility of the hardware appliance model. Fortunately for Greenplum, we didn’t build a proprietary hardware solution like Teradata or Netezza and are now the leader in high-end software-based data warehousing, as is indicated by our 6.5PB warehouse at eBay. This allows us to transition to the next evolutionary step -- flexible self-service infrastructure on commodity hardware -- in a way that these hardware vendors cannot.

whurley: What's the future for data warehousing and enterprise data clouds?

Yara: The biggest change is going to be the new kinds of practices that an EDC approach brings to data warehousing. Today we are still using 30-year-old approaches: treating the warehouse like a mainframe, as Teradata has taught the industry. The result of these unrealistic practices is organizational tension, lack of agility, and an explosion of shadow IT as the business fights to get its work done despite the constraints of the warehouse. EDC and the newer practices that are emerging are creating a new agile approach that is about getting data in the hands of analysts ASAP without extensive upfront modeling, empowering them with sandboxes and self-service, and focusing on pragmatic business results rather than process.

So will Greenplum's entry cause a data warehousing shake-up? Tough to say. There are lots of players: established appliance vendors like Teradata and Netezza are having trouble fitting their proprietary hardware into cloud infrastructure built on commoditized hardware, and newcomers like Cloudera are still in their infancy. Still, data warehousing is a target-rich environment, and Greenplum is giving it a good shot. Huge data stores at low cost are always cited as a prime application of cloud computing, and widening access to critical data warehouse analytics is a compelling idea.

Copyright © 2009 IDG Communications, Inc.
