3 big data platforms look beyond Hadoop

Learn how the Cloudera, Hortonworks, and MapR data platforms are evolving to meet the demands for real-time analytics and machine learning

A distributed file system, a MapReduce programming framework, and an extended family of tools for processing huge data sets on large clusters of commodity hardware, Hadoop has been synonymous with “big data” for more than a decade. But no technology can hold the spotlight forever.

While Hadoop remains an essential part of big data platforms, the major Hadoop vendors (namely Cloudera, Hortonworks, and MapR) have changed their platforms dramatically. Once-peripheral projects like Apache Spark and Apache Kafka have become the new stars, and the focus has turned to other ways to drill into data and extract insight.

Let’s take a brief tour of the three leading big data platforms, what each adds to the mix of Hadoop technologies to set it apart, and how they are evolving to embrace a new era of containers, Kubernetes, machine learning, and deep learning.

Cloudera Enterprise Data Hub

Cloudera was the first to market with a Hadoop distribution—not surprising given that its core team consisted of engineers who had leveraged Hadoop in places like Yahoo, Google, and Facebook. Hadoop co-creator Doug Cutting serves as chief architect. 

The company’s strategy with the Cloudera Enterprise Data Hub (EDH) is to “curate and extend” the open source projects in the Hadoop ecosystem to provide a commercially licensed platform, with enterprise-grade support and service included in the price. The company also offers an open source, free-to-use Hadoop distribution called CDH (Cloudera’s Distribution Including Apache Hadoop). In addition, Cloudera offers a 60-day trial edition of EDH as another way to get started.

Where to download Cloudera

Cloudera provides multiple ways to download and use CDH. VMs and Docker images can be used to run EDH locally; Cloudera Manager can be used to deploy CDH and EDH (including the trial version) on a cluster; and Cloudera Director can deploy to cloud environments, among them Amazon by way of AWS Quick Start.

Unique Cloudera features

Cloudera has centered on Apache Spark, and Spark-related projects, as the heart and soul of its distribution. Taking full advantage of the unified analytics engine, Cloudera makes use of Spark Streaming, Spark MLlib, and Spark SQL for real-time streaming data, machine learning, and SQL-style querying of data, respectively.

A significant value-add provided by Cloudera is its Cloudera Navigator software, a set of proprietary data governance, management, and optimization tools. Cloudera Navigator tracks the provenance of data in an organization for management, compliance, and auditing, provides ongoing data workload usage statistics, and recommends data placement strategies to match.

The native machine learning aspects of Cloudera EDH are limited to Spark MLlib. Native support for TensorFlow, for instance, isn’t an advertised EDH feature. However, the Cloudera Data Science Workbench product provides a user-friendly data science front end to EDH, where end users can create their own integrations between EDH and frameworks like TensorFlow.

Hortonworks Data Platform 

The Hortonworks Data Platform (HDP) is a pure open source Hadoop distribution. The product itself is free to use. Hortonworks’ enterprise customers pay for support and also receive proactive troubleshooting tools (which are themselves proprietary) to head off future issues.

Where to download Hortonworks

The Hortonworks site provides downloads for HDP in multiple formats. Automated installers can deploy HDP on a variety of local or cloud architectures, and RPMs are available for those who want to deploy manually. Earlier versions of HDP are available as Hortonworks Sandbox editions, which are pre-configured HDP environments packaged in a virtual machine for dev-and-test use.

Unique Hortonworks features

HDP 3.0, now generally available, includes automatic provisioning for cloud environments and cloud-native data storage (e.g., Amazon S3 and Google Cloud Storage); interactive SQL query functionality by way of Apache Hive; and support for GPU-based processing.

The most significant new addition involves containers. Apps in Docker containers can be run as YARN jobs, side by side with traditional Hadoop workloads. Deploying in Docker containers is a useful way to ensure that a job runs with a specific version of a language runtime. It’s also possible to run containers on Kubernetes, by way of Kubernetes on YARN, where YARN is used as the scheduler in Kubernetes.
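
As a rough illustration of pinning a runtime this way, the sketch below builds a spark-submit command that asks YARN to launch a job inside a Docker image. It assumes the cluster administrator has already enabled the Docker container runtime in YARN and that the image is one the cluster is allowed to pull. The environment variable names come from Hadoop's Docker-on-YARN support; the image name, the script name, and the helper function are hypothetical placeholders, not part of HDP.

```python
# Sketch only: assembles a spark-submit command line; it does not run it.
# The image name and script path passed in below are hypothetical.

def docker_on_yarn_submit(image, app_script):
    """Return a spark-submit argument list that asks YARN to run the
    application master and the executors inside the given Docker image."""
    runtime_env = {
        "YARN_CONTAINER_RUNTIME_TYPE": "docker",
        "YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": image,
    }
    args = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster"]
    for name, value in runtime_env.items():
        # Set each variable for both the application master and the executors.
        args += ["--conf", f"spark.yarn.appMasterEnv.{name}={value}"]
        args += ["--conf", f"spark.executorEnv.{name}={value}"]
    args.append(app_script)
    return args


command = docker_on_yarn_submit("example/python-runtime:3.6", "job.py")
print(" ".join(command))
```

Because the runtime lives in the image rather than on the cluster nodes, two jobs that need different interpreter versions can share the same cluster by pointing at different images.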

Another new feature, currently available as a technology preview, allows you to deploy TensorFlow deep learning applications in containers across an HDP cluster. It’s clearly intended to be a step towards turning HDP into an end-to-end machine intelligence platform.

MapR Converged Data Platform

MapR’s flagship product, rechristened the “MapR Converged Data Platform” in 2016, sits between Hortonworks and Cloudera in terms of its licensing. MapR has an entirely open source community distribution, which can be used freely, but also provides a for-pay enterprise edition with high availability, data snapshotting, disaster recovery, technical support, and other enterprise-grade features.

Where to download MapR

MapR offers an installer package to deploy either the community or the enterprise edition. Cloud deployments are available directly to AWS, Microsoft Azure, Google Cloud, and other cloud providers worldwide. MapR also offers a “Sandbox” edition, with virtual machine images available for VMware or VirtualBox.

Unique MapR features

MapR Converged Data Platform comprises three major components: the MapR-FS file system (essentially, transparent integration of multiple data storage paradigms into file system interfaces including Hadoop’s HDFS); a NoSQL-style document database; and an Apache Kafka-compatible event streaming engine.

This Kafka-compatible MapR Streams event streaming engine is another major differentiator for MapR, with its emphasis on online, streaming, real-time, and edge processing scenarios. A small-footprint edition of MapR called MapR Edge is designed for processing data in IoT scenarios.
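
For readers new to the Kafka model that such an engine is compatible with, the toy sketch below shows the core idea: producers append messages to a log, and each consumer group tracks its own read position, so several applications can read the same stream independently. It is a purely in-memory illustration with hypothetical names (`ToyTopic`, `publish`, `poll`), not the Kafka or MapR Streams client API.

```python
# Toy, in-memory model of a Kafka-style topic. Illustration only:
# the class and method names are invented for this sketch.
from collections import defaultdict


class ToyTopic:
    def __init__(self):
        self.log = []                    # append-only list of messages
        self.offsets = defaultdict(int)  # next offset to read, per consumer group

    def publish(self, message):
        """Append a message and return its offset in the log."""
        self.log.append(message)
        return len(self.log) - 1

    def poll(self, group, max_messages=10):
        """Return unread messages for a consumer group and advance its offset."""
        start = self.offsets[group]
        batch = self.log[start:start + max_messages]
        self.offsets[group] = start + len(batch)
        return batch


topic = ToyTopic()
for reading in ({"sensor": "a", "temp": 21.5}, {"sensor": "b", "temp": 19.0}):
    topic.publish(reading)

dashboard = topic.poll("dashboard")  # first consumer group reads both messages
alerts = topic.poll("alerts")        # second group reads the same messages independently
```

A real streaming engine adds partitioning, replication, and durable storage on top of this idea, which is what makes it suitable for the real-time and edge scenarios described above.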

MapR has made room in its platform to accommodate two recent significant trends, containers and machine learning. Docker images can be scheduled and run across a MapR cluster using Kubernetes, and MapR provides a Kubernetes volume driver that allows those containers to connect directly to MapR-FS resources.

Copyright © 2018 IDG Communications, Inc.
