Greenplum 6 review: Jack of all trades, master of some

A substantial revision of the open source MPP data warehouse offers high concurrency, embedded analytics, and data science capabilities


An MPP (massively parallel processing) database distributes data and queries across every node in a cluster of commodity servers. Greenplum’s approach to building an MPP data warehouse is unique. By building on an established open source database, PostgreSQL, its developers are able to focus engineering effort where it counts: parallelization and the associated query planning, a columnar data store for analytics, and management capabilities.

Greenplum is owned and developed by Pivotal, with support from the open source community, and is available free under the Apache 2 license. The latest release, Greenplum 6.0, goes a long way toward re-integrating the Greenplum core with PostgreSQL, incorporating nearly six years of improvements from the PostgreSQL project. These efforts mean that, going forward, Greenplum will gain new features and enhancements for “free,” while Pivotal focuses on making these additions work well in a parallel environment.

Greenplum architecture

An MPP database uses what is known as a shared nothing architecture. In this architecture, individual database servers (based on PostgreSQL), known as segments, each process a portion of the data before returning the results to a master host. Similar architectures are seen in other data processing systems, like Spark or Solr. This is one of the key architectural features that allows Greenplum to integrate other parallel systems, like machine learning or text analytics.

Because Solr, for example, has a similar distributed architecture, Greenplum can link the individual Solr processing instances with the segment hosts to provide a more or less seamless query and analytic experience. This also means the data is processed in-place, avoiding costly movement of data across the network.


Figure 1: High-level view of the Greenplum architecture. 

Deploying Greenplum

Greenplum can be deployed several ways: in all three major clouds via their respective marketplaces, containerized, or on bare metal. As with any cluster application, the best performance is obtained on dedicated bare metal machines. I deployed a two-node cluster on Google Cloud Platform with all of the bells and whistles in just a few minutes. And I installed Greenplum locally in a VM using the pre-compiled binaries in about an hour.

The local install was necessary because Greenplum 6 is not yet available in the clouds; it’s due in November 2019. The local install also gave me an opportunity to assess the quality of the Greenplum documentation. As you might expect from a formerly proprietary product, it is excellent.

Having multiple deployment options allows companies to fine-tune their deployments to match operational requirements. For example, models can be trained on a multi-node bare metal cluster for fast model development, then deployed on a single instance of Pivotal Postgres running a REST endpoint in a container to operationalize the model.

Greenplum federated queries

Data today is everywhere—in different locations, different formats, and different “temperatures.” The Pivotal Extension Framework (PXF), introduced in Greenplum 5, grew out of the old HDFS connector into a general-purpose method of accessing external data tables in Greenplum. PXF also connects to different data formats, such as text files (e.g. web logs), foreign databases, ORC, Parquet, and HBase. New data sources can be added to PXF using a Java API.
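To give a flavor of the syntax, here is a sketch of a PXF external table over Parquet files in HDFS; the table, column, and path names are illustrative, and the profile string follows the hdfs:parquet pattern from the PXF documentation:

```sql
-- Hypothetical external table reading Parquet web logs from HDFS via PXF
CREATE EXTERNAL TABLE ext_weblogs (
    ts      timestamp,
    user_id int,
    url     text
)
LOCATION ('pxf://data/weblogs?PROFILE=hdfs:parquet')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```

Once defined, the external table can be queried and joined like any local table.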

Combining PXF with the external access capabilities brought over with PostgreSQL 9.4, Greenplum can perform federated queries across data locations, including Kafka streams, HDFS, Spark, and Amazon S3 object stores. The latter ability, querying Amazon S3 object stores, includes Amazon’s native S3 SELECT API, improving performance by filtering at the edge. 

Federated queries can be more useful than you might imagine. For example, suppose we wish to locate all individuals who:

• work at ‘IDG’
• know each other ‘directly’
• have names that sound like ‘Doug’ or ‘Steve’
• have made a phone call to each other within the past 24 hours from either Singapore or San Francisco

This kind of query might be seen in a fraud investigation or in response to a financial regulator’s information request. In a typical enterprise, this information will be spread across half a dozen or more different systems and could take a week or more to assemble. With a federated query, we can stitch all of it together into a single query and answer within an hour. In an era of heightened regulatory oversight, when many companies struggle to answer such requests on time and face fines when they are late, federated queries help a lot.
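A query along those lines might be sketched as follows, assuming a local hr.employees table, a PXF external table calls_ext over a call-detail store, and the soundex() function from the PostgreSQL fuzzystrmatch extension; all names here are hypothetical:

```sql
-- Sketch only: join local employee data with an external call-detail table
SELECT e1.name, e2.name, c.call_time, c.origin_city
FROM   hr.employees e1
JOIN   hr.employees e2 ON e1.company = 'IDG' AND e2.company = 'IDG'
                      AND e1.id <> e2.id
JOIN   calls_ext c     ON c.caller_id = e1.id AND c.callee_id = e2.id
WHERE  soundex(e1.name) IN (soundex('Doug'), soundex('Steve'))
AND    c.call_time > now() - interval '24 hours'
AND    c.origin_city IN ('Singapore', 'San Francisco');
```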

Greenplum analytics and machine learning

Greenplum’s MADlib extension, a SQL-based library for data analytics and machine learning, was initially developed by several universities and Greenplum. MADlib was designed to work with the shared-nothing parallel architecture of Greenplum. Not all machine learning algorithms can be made parallel, but for those that can, MADlib achieves more or less linear scalability with the size of the data set, while avoiding data transfers. MADlib includes a little more than 50 of the most commonly used machine learning algorithms.

One of the most useful features of MADlib is the SQL interface, enabling the citizen data scientist to add value without having to climb the learning curve of Python or R. Models can be deployed via a MADlib REST endpoint to operationalize the analytic insights. For an enterprise that has a medium level of analytic maturity and that implements champion/challenger decision management strategies, using SQL can increase the number of models under consideration without additional resources being diverted from a central team.
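As an illustration of that SQL interface, the following sketch trains and applies a logistic regression model with MADlib; the patients table and its columns are illustrative (they mirror an example in the MADlib documentation):

```sql
-- Train: fit a logistic regression model entirely in SQL
SELECT madlib.logregr_train(
    'patients',         -- source table
    'patients_model',   -- output table holding the coefficients
    'second_attack',    -- dependent variable
    'ARRAY[1, treatment, trait_anxiety]'  -- independent variables
);

-- Score: apply the fitted coefficients to rows of the source table
SELECT p.id,
       madlib.logregr_predict(m.coef,
           ARRAY[1, p.treatment, p.trait_anxiety]) AS predicted
FROM   patients p, patients_model m;
```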

For the traditional data analyst, the PivotalR connector (available on CRAN) provides a classical R language interface to MADlib by translating R code into the corresponding SQL statements on the client, then sending them over to the Greenplum cluster for execution. This avoids data transfer and allows the manipulation of large data frames that would otherwise be impossible in R because of memory constraints.


Figure 2: Machine learning model deployment with Greenplum and MADlib.

HTAP data warehouse

Hybrid transactional/analytical processing (HTAP) is a term coined by Gartner. Their definition:

Hybrid transaction/analytical processing (HTAP) is an emerging application architecture that “breaks the wall” between transaction processing and analytics. It enables more informed and “in business real time” decision-making. 

In practice this means the system’s use cases are a mix of long and short queries, as well as updates and deletes. In order to support HTAP and prevent resource starvation, Greenplum implements a form of SQL containerization called resource groups that allows resource isolation in a multi-tenanted HTAP environment. By using a resource group you can limit CPU, RAM (by group or query), and maximum concurrency. Resource groups improve performance on mixed workloads and prevent query competition for resources.
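Creating and assigning a resource group is itself plain SQL. A minimal sketch, with illustrative names and limits:

```sql
-- Cap CPU, memory, and concurrency for a reporting workload
CREATE RESOURCE GROUP rg_reporting WITH (
    CPU_RATE_LIMIT = 20,  -- percent of CPU
    MEMORY_LIMIT   = 25,  -- percent of memory
    CONCURRENCY    = 10   -- max concurrent transactions
);
ALTER ROLE reporting_user RESOURCE GROUP rg_reporting;
```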

One of the key differences between PostgreSQL and Greenplum is the query planner. Although Greenplum inherited the PostgreSQL query planner when it was forked, efficient query planning in a distributed environment is significantly different from planning on a single machine. For that reason Greenplum set out to build its own query planner, basing it on the Cascades Framework for Query Optimization. This optimizer explores the candidate query plans, assigns each a cost, and selects the lowest cost (fastest) plan for execution.

Greenplum provides a few features to help the query planner avoid data movement, like the ability to replicate dimension tables to each node in the cluster for faster local join operations and tunable data compression.
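Both features are exposed through ordinary DDL. A sketch, with hypothetical table names, of a replicated dimension table alongside a compressed, column-oriented fact table:

```sql
-- Copy the small dimension table to every segment (Greenplum 6 syntax),
-- so joins against it require no data motion across the interconnect
CREATE TABLE dim_region (region_id int, region_name text)
DISTRIBUTED REPLICATED;

-- Hash-distribute the large fact table, stored column-oriented and compressed
CREATE TABLE fact_sales (sale_id bigint, region_id int, amount numeric)
WITH (appendoptimized=true, orientation=column, compresstype=zstd)
DISTRIBUTED BY (sale_id);
```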

Semi-structured data processing is inherited from PostgreSQL and includes JSON and JSONB, XML, key-value pairs (HSTORE), and plain text. GIN (Generalized Inverted Index), also inherited from PostgreSQL, can be used to index a text column that is frequently used. For more complex text queries, GPText can be used. GPText integrates Greenplum segments with Apache Solr shards to provide natural language search queries. Because the Solr shards are on the same node, they have the same parallel architecture.
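The standard PostgreSQL indexing syntax carries over directly; here is a sketch, with illustrative names, of a GIN index over a JSONB column and a full-text GIN index over a text column:

```sql
-- Index the keys and values of a JSONB document column
CREATE INDEX idx_events_payload ON events USING gin (payload);

-- Index a frequently searched text column for full-text queries
CREATE INDEX idx_docs_body ON docs
    USING gin (to_tsvector('english', body));
```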

Greenplum performance

HTAP databases require a balancing act between large, long-running analytical queries, short ad-hoc queries, and the ACID transactions on the OLTP side of the equation. Good performance in this mixed workload scenario is important for the hybrid use case that Greenplum is aiming for. The PostgreSQL 9.4 kernel gave Greenplum 6 a host of optimizations, mostly around avoiding locks, that result in a 60-fold increase in performance over Greenplum 5 on TPC-B benchmarks.


Figure 3: TPC-B benchmark results for Greenplum 6 and Greenplum 5. 

Given that PostgreSQL has paved the way for further optimizations (and is now on version 12), we can expect further improvements in Greenplum as the kernel is upgraded again in Greenplum 7.

Greenplum Command Center

The Greenplum Command Center is part of the Pivotal offering and provides a web-based interface for monitoring and managing a Greenplum cluster (or multiple clusters). Although hard-core DBAs are unlikely to give up their command line interfaces, the Command Center is a welcome management tool for departmental level deployments that may not have access to a full-time DBA. I found it easy to navigate and well documented. Users, queries, nodes, segments, and resource groups can all be easily managed via the interface.


Figure 4: The Greenplum Command Center. 

Greenplum in the enterprise

Greenplum makes an ideal choice for a departmental standard, as it can handle mixed workloads, including predictive analytics, on a single platform. If you’re not picking software à la carte from an ELA menu, or you wish to escape A.I. “pilot purgatory,” investment in Greenplum’s HTAP approach might provide a way to increase innovative uses of machine learning and analytics at a lower price point than competing solutions.

Greenplum is also a no-brainer for enterprise-level Netezza or Teradata replacements. And while Greenplum is not quite up to wresting OLTP from the likes of Oracle Database or Microsoft SQL Server across the enterprise, it will work well for mid-size transactional systems.

Greenplum is a good example of the 80/20 rule. While it does not perform any single task as well as a built-for-purpose tool, it does most of them well enough to cover 80% of the use cases, and that’s without the organizational and operational overhead involved in stitching together multiple systems and integrating them into an analytics pipeline. This weighs heavily in its favor when considering the total cost of ownership.

Cost: Free open source under the Apache 2.0 License. 

Platforms: Available as source code; as packages for CentOS, Red Hat, Debian, and Ubuntu Linux distributions; and in the Amazon Web Services, Microsoft Azure, and Google Cloud Platform marketplaces.  

At a Glance
  • Greenplum Database is a flexible analytics solution for the majority of data workloads in the enterprise. By combining the most common analytic and data science use cases in a single environment, Greenplum reduces data movement between point solutions and minimizes the learning curve for citizen data scientists.

    Pros

    • ANSI SQL compliant
    • Easy access to external data including cloud stores
    • Good support for big data file formats (ORC, RCFile, Parquet, etc.)
    • Integrated machine learning
    • Low barrier to adoption (SQL skills)

    Cons

    • Cannot scale a cluster down
    • No cross-cluster queries
    • Resource groups depend on Linux control groups (cgroups)

Copyright © 2019 IDG Communications, Inc.