Hortonworks' Hadoop now launches from Google Cloud Platform

Support for Hadoop includes the HDP distribution and connections to Google's other data-crunching services


Fans of Hortonworks have a new cloud venue in which to deploy and run the company's edition of Hadoop: the Google Cloud Platform.

In joint announcements, Hortonworks and Google revealed that the Hortonworks Data Platform (HDP) distribution of Hadoop is now available and fully certified on Google's cloud.

This isn't the first time Hadoop has been available on the Google Cloud Platform. Previously, Google provided setup scripts and software libraries for Apache Hadoop -- but only for the core version of Hadoop, without add-ons or third-party refinements. Now HDP can be deployed on Google Cloud Platform with the same command-line tools already used for setting up Hadoop -- tools the two companies produced in collaboration.

Version 2.2 of HDP, which has been approved for Google Cloud Platform, was released last October and boasts a bevy of features designed to leverage the latest developments in Hadoop. Among them are closer integration with the Spark in-memory processing framework, new components such as Apache Kafka for handling data in real time, and many other changes.

The collaboration between Google and Hortonworks also allows other Google Cloud Platform features to be leveraged through HDP. For instance, Google Cloud Storage -- Google's object storage system -- has an HDP connector that allows analyses to be run on data in Cloud Storage without first copying that data to an HDFS volume. Google's BigQuery data analytics platform also sports an HDP connector -- one apparently designed to entice Hadoop users, since BigQuery can perform certain kinds of processing with less work.

In theory, linking BigQuery and HDP should provide Google Cloud Platform customers with a migration path between the two, but it's more likely that BigQuery will be used for processing selected jobs rather than replacing Hadoop entirely -- if only because of the far larger and more established audience for Hadoop. Dataflow, another Google Cloud offering in the same vein, superficially resembles a Hadoop replacement but is designed more as a competitor for a specific Hadoop component, Spark.

BigQuery might also have an edge over Hadoop in cost, depending on the workload -- and provided the BigQuery feature set is all people need. With Hadoop, costs are incurred for both storage and Google Compute Engine. Connections to BigQuery are billed at that service's rate: processing incurs a charge of $5 per terabyte and storage costs 2 cents per gigabyte per month, but loading data into the system, exporting data from it, and simple table reads cost nothing.
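
To make those BigQuery rates concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the two figures quoted above ($5 per terabyte processed, 2 cents per gigabyte stored per month) and treats loads, exports, and simple table reads as free. The function name and the sample workload are hypothetical, and current pricing may well differ from these figures.

```python
# Rates as quoted in the article; treat them as assumptions, not current pricing.
QUERY_RATE_PER_TB = 5.00           # dollars per terabyte processed
STORAGE_RATE_PER_GB_MONTH = 0.02   # dollars per gigabyte stored per month


def bigquery_monthly_cost(tb_processed: float, gb_stored: float) -> float:
    """Estimate one month's BigQuery bill from the two billed items.

    Loading, exporting, and simple table reads are not counted,
    since the article describes them as free.
    """
    processing = tb_processed * QUERY_RATE_PER_TB
    storage = gb_stored * STORAGE_RATE_PER_GB_MONTH
    return round(processing + storage, 2)


if __name__ == "__main__":
    # Hypothetical workload: 10 TB queried and 500 GB stored in a month.
    print(bigquery_monthly_cost(10, 500))  # 60.0
```

A comparable Hadoop estimate would have to add Compute Engine instance hours to the storage charge, which is why the comparison depends so heavily on the workload.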

It wouldn't be wise to rule out the possibility that Hadoop users on Google Cloud Platform may gravitate more toward Google's tools in the long run. Having more Hadoop jobs hosted in the cloud, data and all, may spur interest in Google's data analytics -- provided Hortonworks' offerings don't overshadow them or prove to be the better value.