For all its open source roots, Hadoop remains strongly guided by the companies that offer it in a commercial distribution.
Among them, Hortonworks is one of the biggest, and the latest version of the Hortonworks Data Platform (HDP) distribution of Hadoop comes with features that speak to the lingering issues of usability, manageability, and security for Hadoop users.
Scheduled for general availability in November, HDP 2.2 rolls up many of the changes made to Hadoop's open source components since the last release. Those changes reflect one of the biggest shifts in the Hadoop world in the past year: the move away from the legacy MapReduce framework to YARN, a new, more powerful job-processing system.
According to Jim Walker, director of product marketing at Hortonworks, the Hadoop app ecosystem's shift away from MapReduce and toward YARN in so short a timespan is reflected in how developers now avoid working with MapReduce directly. "All the abstractions, to abstract away MapReduce from people, that's all happened," Walker said in a phone conversation.
As a result, he noted, much of the excitement around Hadoop has moved toward more ambitious, higher-level work like running SQL queries against Hadoop via Stinger or using Spark to perform high-speed, in-memory data science. No surprise, then, that the two -- both now possible thanks to YARN -- feature prominently in HDP 2.2.
Hortonworks has been behind the Stinger initiative, a project designed to speed up SQL querying on Hadoop and provide it with behaviors more commonly associated with transactional databases: ACID transactions, updates and deletes, and so on. The company previously supported Spark within HDP as a technology preview. In HDP 2.2, Spark is integrated more directly into YARN.
Though Spark is speedy and enjoys the integration of SQL querying, Walker doesn't feel there's a short-term route toward replacing Hive's querying system with the superspeedy SQL engines for Spark. "Spark SQL is great for developers who want to express a SQL statement in live code," he said, "but is it a substitute for Hive? It'll take time to get to that path." Stinger itself, he noted, took a long time to bring up to speed due to the optimizations that needed to be added: "These are not simple things."
Another area where Hortonworks' rising tide may lift many other boats is its work with the newly introduced Apache Argus project, created to provide security policies throughout Hadoop. Argus is the open source version of a commercial product, XA Secure, that Hortonworks purchased earlier in the year and donated back to Apache. But the main problem with creating any security-policy solution for Hadoop is getting others to adopt it. To that end, Hortonworks has worked on integrating Argus with projects like Hive, HBase, Storm, and Knox.
The company is also fully conscious of how difficult it is to add security to a project as sprawling and multifaceted as Hadoop. "Authentication first, then enforcement," is the approach Hortonworks is taking, as Walker describes it. This top-down approach involves first integrating Hadoop with existing LDAP and Active Directory resources, so that organizations can leverage what they already have, then adding policy frameworks and enforcement to the various engines that run inside Hadoop. "We have visibility into all the layers of the Hadoop stack," Walker noted, citing that as one of the possible ways Hortonworks can encourage development and uptake of those features.
Likewise, with backup and upgrades, Hortonworks lays claim to having invested in underlying Apache projects for the good of all Hadoop users, not just its own customers. Notes in the HDP 2.2 press release state that "investments at the core of many of the projects [in Hadoop]" have made it easier to perform rolling upgrades "without taking the entire cluster down." And for backup, the Apache Falcon project in HDP 2.2 has been extended to perform cloud-based backups to either Microsoft Azure or Amazon S3 stores.