Security for big data analytics is challenging. Here's why: When you can't analyze data in place, you need to copy it -- at which point all the stipulations about who can see or change which data, and under what circumstances, should be replicated, too. Today, that's nearly impossible to do.
On the Hadoop/Spark side, we have only role-based, limited access control lists (ACLs) or the Wild West. But I believe there's a way forward: Adopt the policy-based approach that has arisen in the broader security market. To explore how that could work, we need to revisit the history of access control and how it evolved to produce a policy-based model.
A three-minute history of access control
In the beginning, there were usernames and passwords to keep out everyone who might want in, despite what Richard Stallman said.
There was an inherent problem with this system. The number of user/password combinations tended to explode as new applications were written, so we ended up with a different user/password for each application. Worse, some applications asked for different passwords to reach different levels of security.
We became smarter and separated “roles” from usernames. We’d have one “user/password,” but to access the administrative functions, that user would also need an “admin” role, for example. However, each application tended to implement this on its own, so you still had a growing list of passwords to remember.
We became even smarter and created central systems that eventually became LDAP, Active Directory, and the like. These united the user/password in a core repository and established one place to look up the roles for a given user -- but this replaced one problem with another.
In an ideal world, each new application looks at the list of roles in Active Directory and maps them to application roles, so there's a clean, one-to-one relationship. In reality, most applications think of roles differently, and besides, simply because you’re an admin for one application doesn’t mean you should be an admin for another. In the end, you've replaced an explosion of user/password combinations with an explosion in the number of roles.
Which raises the question: Who ends up in charge of adding new roles? It tends to be either some IT-administrative or shared-HR function. Since there's a good chance none of the people with the menial task of adding roles will actually understand the application very well, this usually ends up being a “manager approval” or “rubber stamp,” and that isn’t, as they say, good.
Many applications still punt on the question of roles by using AD for authentication and having the application handle its own local role implementation. There's a lot to be said for this approach, because it's clearly the application administrator who knows who should have what level of access.
Meanwhile, there are clear rules that do not cleanly fit into a user/role system. At its simplest, being a banking customer doesn’t mean I can withdraw money from any account, even if I have the “canWithdraw” role. Roles often need to be associated with data, which is why we have ACLs that map to entries in our data store. For example, account 1234 has an association that identifies me as its owner and my spouse as an authorized account administrator.
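To make the role-plus-ACL idea concrete, here is a minimal Python sketch. Everything in it is hypothetical: the in-memory dictionaries stand in for a directory and a data store, and the user names are invented. The point is only that the role says what you may do, while the ACL says which record you may do it to.

```python
# Hypothetical stand-ins for a directory (roles) and a data store (ACLs).
USER_ROLES = {
    "alice": {"canWithdraw"},
    "bob": {"canWithdraw"},
    "carol": {"canWithdraw"},
}

# Per-account ACL: account id -> {user: relationship to the account}
ACCOUNT_ACL = {
    "1234": {"alice": "owner", "carol": "admin"},
}


def can_withdraw(user: str, account_id: str) -> bool:
    """Allow a withdrawal only if the user holds the role AND is on the account's ACL."""
    has_role = "canWithdraw" in USER_ROLES.get(user, set())
    on_acl = user in ACCOUNT_ACL.get(account_id, {})
    return has_role and on_acl
```

In this sketch, bob is a customer with the “canWithdraw” role, but he is not on the ACL for account 1234, so the check fails for him and passes for alice and carol.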
However, some businesses have rules that are more complicated than “is this yours?” or “what permissions do you have on this record?” Instead, they use what you might call “contextual” or “policy-based” security rules. In other words, I might have permission to withdraw money only while I’m within the continental United States. There's no way to express this in an ACL or role-based model. Instead, we've crossed over into policy-based security.
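A contextual rule like the withdrawal example can be sketched as a check that looks at the request itself, not only at the user and the record. The Python below is an illustration under stated assumptions: the WithdrawalRequest type and its fields are invented for this example, and “continental” is taken to mean the contiguous states plus DC, which is one common reading but not the only one.

```python
from dataclasses import dataclass

# Assumption: "continental" excludes Alaska and Hawaii.
NON_CONTIGUOUS_STATES = {"AK", "HI"}


@dataclass(frozen=True)
class WithdrawalRequest:
    """Hypothetical request context assembled by the application."""
    has_withdraw_role: bool   # result of the role lookup
    is_on_account_acl: bool   # result of the per-account ACL lookup
    country: str              # where the request originates, e.g. "US"
    state: str                # US state code, empty if not applicable


def policy_allows_withdrawal(req: WithdrawalRequest) -> bool:
    """Role and ACL are necessary, but the location context must also pass."""
    in_continental_us = (
        req.country == "US" and req.state not in NON_CONTIGUOUS_STATES
    )
    return req.has_withdraw_role and req.is_on_account_acl and in_continental_us
```

The same user with the same role and the same account gets a different answer depending on where the request comes from, which is exactly what a pure role or ACL model has no place to record.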
When you can do only some things sometimes
Policy-based security quite often lives in a central repository and relies on central authentication mechanisms (LDAP, Kerberos, and so on). The difference is, instead of maintaining simple roles (such as canWithdraw), each user is associated with a set of policies. When those policies are based on a set of attributes about the user, the approach is known as attribute-based access control (ABAC). The policies can be stored centrally, but they generally can't be enforced centrally, because enforcement depends on the application.
There are already standards for supporting this, derived in part from defense and other select industries. One such standard is eXtensible Access Control Markup Language (XACML), which allows you to express sets of policies. Enforcement is usually application-based, using some sort of algorithm or rule system. XACML is a pretty comprehensive standard for expression and even handles exceptions like conflicts in policy or two algorithms enforcing one policy.
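XACML resolves conflicting rules through what it calls combining algorithms, such as deny-overrides and permit-overrides. The Python below is a deliberately simplified sketch of that idea, not XACML syntax and not a real policy engine: the rule functions and attribute names are made up, and the standard's Indeterminate result is left out.

```python
from enum import Enum
from typing import Callable, Iterable, Mapping


class Decision(Enum):
    PERMIT = "Permit"
    DENY = "Deny"
    NOT_APPLICABLE = "NotApplicable"


Rule = Callable[[Mapping[str, object]], Decision]


def owner_may_withdraw(attrs: Mapping[str, object]) -> Decision:
    # Hypothetical rule: account owners are permitted.
    if attrs.get("relationship") == "owner":
        return Decision.PERMIT
    return Decision.NOT_APPLICABLE


def outside_us_denied(attrs: Mapping[str, object]) -> Decision:
    # Hypothetical rule: requests from outside the US are denied.
    if attrs.get("country") != "US":
        return Decision.DENY
    return Decision.NOT_APPLICABLE


def deny_overrides(rules: Iterable[Rule], attrs: Mapping[str, object]) -> Decision:
    """Any Deny wins; otherwise any Permit wins; otherwise NotApplicable."""
    decisions = [rule(attrs) for rule in rules]
    if Decision.DENY in decisions:
        return Decision.DENY
    if Decision.PERMIT in decisions:
        return Decision.PERMIT
    return Decision.NOT_APPLICABLE


def permit_overrides(rules: Iterable[Rule], attrs: Mapping[str, object]) -> Decision:
    """Any Permit wins; otherwise any Deny wins; otherwise NotApplicable."""
    decisions = [rule(attrs) for rule in rules]
    if Decision.PERMIT in decisions:
        return Decision.PERMIT
    if Decision.DENY in decisions:
        return Decision.DENY
    return Decision.NOT_APPLICABLE


RULES = [owner_may_withdraw, outside_us_denied]
```

For an owner making a request from France, the two rules disagree. Under deny_overrides the result is Deny, and under permit_overrides it is Permit, so the choice of combining algorithm is itself part of the policy.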
Often these ABAC-driven policies, as in the case of ACLs, are based on data rather than application function alone (you can access the schematic for the F-22 only while you’re in the United States, working for this particular company, and a citizen in good standing). One of the first steps in applying policy is often identifying and “tagging” the data to which the policy rules should apply.
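As a rough illustration of tagging, the Python sketch below attaches tags to documents and maps each tag to the user attributes it requires. The tag name, the attribute keys, and the employer “ExampleAero” are all invented for this example. A real deployment would pull tags from a metadata catalog and attributes from a directory.

```python
from typing import Dict, Iterable, Mapping

# Hypothetical policy table: tag -> attributes a user must have to read tagged data.
TAG_POLICIES: Dict[str, Dict[str, object]] = {
    "export-controlled": {
        "country": "US",
        "employer": "ExampleAero",
        "citizen_in_good_standing": True,
    },
}


def can_read(document_tags: Iterable[str], user_attrs: Mapping[str, object]) -> bool:
    """Deny if any tag on the document requires an attribute the user lacks."""
    for tag in document_tags:
        required = TAG_POLICIES.get(tag)
        if required is None:
            continue  # no policy is attached to this tag in the sketch
        for key, value in required.items():
            if user_attrs.get(key) != value:
                return False
    return True
```

Because the policy hangs off the tag and not off any one application, a copy of the schematic that lands in an analytics cluster could carry the same restriction, provided the tag travels with it.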
Why you should care about advanced security
Clearly, using ABAC-style policies and XACML is a hefty step up from RBAC. You should have the motivation to do this, if only to avoid a big, fat $100 million fine. I mean, $100 million here and $100 million there, and before long it adds up to real money.
Also, some organizations have complex rules and ownership of data. As these companies increasingly move to become data-driven and can’t analyze everything in place, but instead require centralization, they’ll need a system that goes beyond the common RBAC models of today. Moreover, to make that feasible, they’ll need tagging and libraries that allow them to apply policies expressed in something like XACML as well as the tools to manage the policy centrally while applying it locally where meaningful.
When we look at today’s big data offerings, such as Ranger and Sentry, nothing comes close to answering this call. Even solutions for RDBMS-based systems tend to be proprietary, expensive, and often incomplete. Organizations with high security requirements and complex security rules are forced to implement this on their own. Heck, data tagging tools are still in their infancy for big data systems like Hadoop.
In other words, there's a big opportunity here for the vendors who can figure it out. Clearly, the defense industry is the first customer, because it's already doing it out of necessity. As more companies create central data repositories for big data analysis, the need for policy-based security is only going to grow.