Greenplum 6 review: Jack of all trades, master of some

Substantial revision of the open source MPP data warehouse offers high concurrency, embedded analytics, and data science capabilities

An MPP (massively parallel processing) database distributes data and queries across the nodes in a cluster of commodity servers. Greenplum’s approach to building an MPP data warehouse is unique. By building on an established open source database, PostgreSQL, its developers are able to focus engineering efforts on adding value where it counts: parallelization and the associated query planning, a columnar data store for analytics, and management capabilities.

Greenplum is owned and developed by Pivotal, with support from the open source community, and is available free under the Apache 2 license. The latest release, Greenplum 6.0, goes a long way toward re-integrating the Greenplum core with PostgreSQL, incorporating nearly six years of improvements from the PostgreSQL project. These efforts mean that, going forward, Greenplum will gain new features and enhancements for “free,” while Pivotal focuses on making these additions work well in a parallel environment.

Greenplum architecture

An MPP database uses what is known as a shared-nothing architecture. In this architecture, individual database servers (based on PostgreSQL), known as segments, each process a portion of the data and then return their results to a master host. Similar architectures are seen in other data processing systems, such as Spark and Solr. This is one of the key architectural features that allows Greenplum to integrate other parallel systems, such as machine learning or text analytics.
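
The segment-and-master flow described above can be sketched in a few lines of Python. This is a toy model for illustration only, not Greenplum code: the function names, the four-segment cluster size, the `customer` distribution key, and the CRC32 hash are all assumptions made for the example, and real segments run in parallel on separate hosts rather than in a loop.

```python
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # hypothetical cluster size, chosen for the example


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Map a distribution key to a segment index.

    CRC32 is an illustrative stand-in, not Greenplum's actual hash.
    """
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, num_segments=NUM_SEGMENTS):
    """Give each segment its own private slice of the rows (shared nothing)."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(row["customer"], num_segments)].append(row)
    return segments


def segment_partial_sum(local_rows):
    """Work done on one segment: sum amounts per region over local rows only."""
    partial = defaultdict(float)
    for row in local_rows:
        partial[row["region"]] += row["amount"]
    return dict(partial)


def master_combine(partials):
    """Work done on the master: merge the partial results from every segment."""
    total = defaultdict(float)
    for partial in partials:
        for region, amount in partial.items():
            total[region] += amount
    return dict(total)


def run_query(rows, num_segments=NUM_SEGMENTS):
    """Roughly: SELECT region, SUM(amount) ... GROUP BY region."""
    segments = distribute(rows, num_segments)
    # Sequential here for simplicity; in an MPP system these run concurrently.
    partials = [segment_partial_sum(local_rows) for local_rows in segments]
    return master_combine(partials)
```

Because each segment touches only its own rows, the per-segment step needs no coordination with the other segments, and the master only has to merge small partial results.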
