For all of Hadoop's growth and acceptance as an enterprise technology, it still lacks key features. Some of its shortcomings, such as the absence of consistent data governance methodologies, are too important to ignore any longer.
Hortonworks, a major Hadoop vendor, is attempting to address the problem by creating the Data Governance Initiative for Hadoop. The goal is to build data governance into Hadoop's design, rather than leave it as a good idea in the abstract.
Hortonworks, creator of the HDP distribution (now supported on Google Cloud Platform), has approached the problem on several fronts. On one front, there's Hadoop itself, where Hortonworks has been attempting to address the need for better data governance and security technology by contributing to related Apache Software Foundation projects. One such project, Apache Ranger, was derived from a closed-source product that Hortonworks bought and transformed for Apache; another, Apache Falcon, aids with the mechanics of managing data lifecycles (intake, purging, and so on). With these moves, Hadoop and its associated projects will have a common set of mechanisms for dealing with data security, both inside and outside of Hadoop.
On another front, Hortonworks has collaborated with companies using Hadoop in the field and at scale -- Target, Merck, Aetna, and SAS, specifically -- to implement the proposed measures. By doing much of this work out in the open, Hortonworks hopes others will come on board, and not simply because the pieces are being rolled into the projects underlying Hadoop.
A third approach, auxiliary to the other two, is to work with existing enterprise data-governance technology rather than replace it. Tim Hall, vice president of product management at Hortonworks, explained in a phone call that the point of this initiative is to give a company that uses -- or is considering -- Hadoop not a substitute for its existing tool set, but a complement to it.
"Hadoop augments your existing data architecture," Hall explained. "It allows you to modernize it and cost-effectively land massive volumes of data into it; this isn't a rip-and-replace strategy. We want to build this [initiative] with the intent of ensuring that we can provide the metadata, the policy access, et cetera, to third-party data governance tools, so that regardless of where you're trying to look at information, you get a consistent view." In this case, "consistent" refers to enforcing the policies set on the data and the access controls meant to protect it.
According to Hall, the main problem with third-party governance has to do with the tools seeing the edges of Hadoop, but being unaware of what goes on inside it. "If you have to stitch a compliance report together," Hall said, "you're like, 'Well, I sent it to my Hadoop infrastructure, but I don't know if Johnny or Suzy hacked up the data a hundred ways to Sunday.' I know it came out the other end looking like this, but what happened in the middle?"
Other vendors' approaches to solving this problem, Hall noted, work only as long as every job is written using that vendor's tooling. As he put it, the idea is to have "comprehensive visibility [from the core of Hadoop] regardless of whether their tool was used or not."
Hortonworks' short-term plan, currently under development, is to create a working prototype that implements the core functionality: a REST API, a centralized taxonomy, metadata import/export, and so on. Following that, the next push will be formally announced at the February Strata conference, with the features rolled into future releases of HDP "as they land throughout the year."
It remains to be seen whether other major Hadoop vendors will pick up this work and how they will choose to extend it. MapR, Pivotal, and Cloudera all have a stake in open source Hadoop, so any security and governance developments affecting Hadoop's core will likely become key parts of their own distributions. In turn, each would have new ways to contribute back -- and new hooks for proprietary value-adds to distinguish themselves from each other.