How Apache Ranger and Chuck Norris help secure Hadoop

The Hadoop ecosystem has always been a bag of parts, each of which had to be secured separately -- at least until Apache Ranger came to town.

Credit: Jason H. Smith

The Hadoop security project called Ranger supposedly was named in tribute to Chuck Norris in his "Walker, Texas Ranger" role. The project has its roots in XA Secure, which was acquired by Hortonworks, then renamed to Argus before settling in at the Apache Software Foundation as Ranger.

When Hadoop started, it was a set of loosely coupled parts primarily used in the back end of the big Internet companies like Yahoo. These parts were wrapped into distributions and marketed as Hadoop by the likes of MapR, Cloudera, and Hortonworks.

Such piecemeal architecture isn't unusual in the world of open source or even in the wide world of commercial software. It does, however, result in security challenges. Some will read this as "it's insecure," but that isn't necessarily the case, though it can be. The problem is twofold: How do you authenticate users to every part of this system of parts? And once you authenticate them, how do you authorize them to do only what you mean to allow them to do?

Each part of Hadoop has its own LDAP and Kerberos authentication, as well as its own means and rules of authorization (and in most cases totally separate implementations of the same). This means you get to configure Kerberos or LDAP for each individual part, then define the authorization rules in each part's separate configuration. What Apache Ranger does is provide a plug-in for each of these parts of Hadoop and a common authentication repository, and it lets you define policies in one centralized location.
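To make the idea concrete, here is a minimal Python sketch of the arrangement described above: one central place holds the rules, and a small plug-in inside each component pulls its own rules and enforces them locally. Everything in it (CentralPolicyStore, ComponentPlugin, the group and path names) is made up for illustration and is not Ranger's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class CentralPolicyStore:
    """Hypothetical stand-in for a central policy server (not Ranger's API)."""
    # Maps a component name to its list of (group, resource, action) rules.
    rules: dict = field(default_factory=dict)

    def add_rule(self, component, group, resource, action):
        self.rules.setdefault(component, []).append((group, resource, action))

    def rules_for(self, component):
        # Return a copy so plug-ins cannot mutate the central store.
        return list(self.rules.get(component, []))


class ComponentPlugin:
    """Hypothetical plug-in embedded in one component, such as HDFS or HBase."""

    def __init__(self, component, store):
        self.component = component
        self.store = store
        self.cached_rules = []

    def refresh(self):
        # Pull the latest rules for this component from the central store.
        self.cached_rules = self.store.rules_for(self.component)

    def is_allowed(self, group, resource, action):
        # Enforce locally against the cached copy of the central rules.
        return (group, resource, action) in self.cached_rules


# Rules for every component are defined once, in one place.
store = CentralPolicyStore()
store.add_rule("hdfs", "analysts", "/data/sales", "read")
store.add_rule("hbase", "analysts", "orders", "read")

hdfs_plugin = ComponentPlugin("hdfs", store)
hbase_plugin = ComponentPlugin("hbase", store)
for plugin in (hdfs_plugin, hbase_plugin):
    plugin.refresh()

print(hdfs_plugin.is_allowed("analysts", "/data/sales", "read"))
print(hdfs_plugin.is_allowed("analysts", "/data/sales", "write"))
print(hbase_plugin.is_allowed("analysts", "orders", "read"))
```

The point of the sketch is only the shape: the rules live in one store, and each component asks its own plug-in rather than reading its own separate configuration.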

Ranger is clearly a Hortonworks-sponsored project (as opposed to a Cloudera, MapR, or now Databricks project). You can tell this in part by the way it's skinned (green) and in part because of what it supports. At present, Ranger supports the following:

Except for HDFS and HBase, which are supported as part of the core of Hadoop, and Solr, these are some of the more "Hortonworksy" projects. In a modern deployment, you'll likely see other components, such as Spark or possibly Impala (from Cloudera). Nonetheless, Ranger is a great thing.

How Ranger works

In Ranger, you work with a repository for each component. Each repository is based on an underlying plug-in or agent that operates with that component.


The repository manager from Hortonworks' Ranger documentation

Each repository has a set of policies. A policy ties together the resource you are protecting (a table, folder, or column), a group (such as administrators), and what that group is allowed to do with the resource (read, write, and so on). You give each policy a name -- say, "Only the grp_nixon can read the apac_china table."


A policy creation screen from Ranger documentation
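A policy of the kind described above can be sketched in a few lines of Python. This is an illustration under my own assumptions, not Ranger's data model: the Policy class and is_permitted function are hypothetical, and the sketch assumes access is denied unless some policy explicitly grants it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy record: a name, a resource, a group, and permissions."""
    name: str
    resource: str
    group: str
    permissions: frozenset


def is_permitted(policies, user_groups, resource, action):
    # Assumption: deny by default; allow only if some policy grants this
    # action on this resource to one of the user's groups.
    return any(
        p.resource == resource
        and p.group in user_groups
        and action in p.permissions
        for p in policies
    )


policies = [
    Policy(
        name="Only the grp_nixon can read the apac_china table",
        resource="apac_china",
        group="grp_nixon",
        permissions=frozenset({"read"}),
    )
]

print(is_permitted(policies, {"grp_nixon"}, "apac_china", "read"))
print(is_permitted(policies, {"grp_nixon"}, "apac_china", "write"))
print(is_permitted(policies, {"grp_other"}, "apac_china", "read"))
```

A member of grp_nixon can read the table but not write to it, and anyone outside the group gets nothing, which is what the policy name promises.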

A GUI with a central view of who is allowed to do what brings much needed simplicity to the Hadoop ecosystem, but that's not all that Ranger offers. It also provides audit logging. Although this can't supplant all the application audit logging you could ever want, if you simply need to know who accessed what on HDFS or what policies were enforced where, it's probably exactly what you need.
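As a rough illustration of what "who accessed what" logging amounts to, the Python sketch below records one entry per access decision and then filters the entries by resource. The field names, the audited_check helper, and the in-memory list are all assumptions made for the example; Ranger's real audit records and storage will differ.

```python
from datetime import datetime, timezone

# Hypothetical in-memory audit trail; a real system would persist this.
audit_log = []


def audited_check(user, resource, action, grants):
    """Decide an access request and record the decision in the audit trail."""
    allowed = action in grants.get((user, resource), set())
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "resource": resource,
        "action": action,
        "result": "ALLOWED" if allowed else "DENIED",
    })
    return allowed


def accesses_to(resource):
    # Answer "who touched this resource, and what happened?"
    return [
        (entry["user"], entry["action"], entry["result"])
        for entry in audit_log
        if entry["resource"] == resource
    ]


# Made-up grants: alice may read one HDFS path, nobody else may do anything.
grants = {("alice", "/data/sales"): {"read"}}

audited_check("alice", "/data/sales", "read", grants)
audited_check("bob", "/data/sales", "read", grants)
audited_check("alice", "/data/hr", "read", grants)

for record in accesses_to("/data/sales"):
    print(record)
```

Note that denied requests are logged alongside allowed ones; for an audit trail, the refusals are often the more interesting half.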

In addition, Ranger can provide Key Management Services in order to work with HDFS's new TDE (transparent data encryption). So if you need end-to-end encryption and a clean way to manage the keys associated with it, Ranger is not a bad place to start.

Ranger looks ahead

I think the biggest hope for Ranger comes from its extensibility. You can create your own plug-ins for areas that are not covered.

If you were hoping this was the end of the story on Hadoop security, unfortunately it isn't: Cloudera has its own Apache project called Sentry (which MapR appears to also support) that covers much the same area. To be fair, Sentry came first, and only then did Hortonworks acquire XA Secure. That said, the documentation for Sentry is virtually nonexistent, the coverage is more constrained, and the project website is in disrepair (although activity on GitHub recently picked up).

Hadoop security has come a long way. Ranger offers a fairly comprehensive, if still a little incomplete, way to manage security across the ecosystem. The holes that persist are mainly due to vendor competition throughout the big data world. These can be filled via the extensibility of the project, but it would be nice to see more collaboration and community in the Apache world.
