A year ago a Deutsche Bank survey of CIOs found that “CIOs are now broadly comfortable with [Hadoop] and see it as a significant part of the future data architecture.” They’re so comfortable, in fact, that many CIOs haven’t thought to question Hadoop’s built-in security, leading Gartner analyst Merv Adrian to query, “Can it be that people believe Hadoop is secure? Because it certainly is not.”
That was then, this is now, and the primary Hadoop vendors are getting serious about security. That’s the good news. The bad, however, is that they’re approaching Hadoop security in significantly different ways, which promises to turn big data’s open source poster child into a potential pitfall for vendor lock-in.
Can’t we all get along?
That’s the conclusion reached in a Gartner research note authored by Adrian. As he writes, “Hadoop security stacks emerging from three independent distributors remain immature and are not comprehensive; they are therefore likely to create incompatible, inflexible deployments and promote vendor lock-in.” This is, of course, standard operating procedure in databases or data warehouses, but it calls into question some of the benefit of building on an open source “standard” like Hadoop.
Ironically, it’s the very openness of Hadoop that creates this proprietary potential.
It starts with the inherent insecurity of Hadoop, which has come to light with recent ransomware attacks. Hadoop hasn’t traditionally come with built-in security, yet Hadoop systems “increase utilization of file system-based data that is not otherwise protected,” as Adrian explains, allowing “new vulnerabilities [to] emerge that compromise carefully crafted data security regimes.” It gets worse.
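To see just how open Hadoop is by default, consider its core configuration: stock Hadoop ships with authentication set to “simple,” meaning the cluster trusts whatever username a client claims, and service-level authorization switched off. A minimal `core-site.xml` sketch of what hardening involves (the property names are from stock Hadoop; the Kerberos setup itself requires additional keytab and principal configuration not shown here):

```xml
<!-- core-site.xml: out of the box, hadoop.security.authentication is
     "simple" (client-asserted identity, no verification) and
     hadoop.security.authorization is "false". Locking down a cluster
     starts with flipping both, roughly as below. -->
<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value> <!-- default is "simple" -->
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value> <!-- default is "false" -->
  </property>
</configuration>
```

That these basics are opt-in, rather than on by default, is exactly why file-system-level data in a Hadoop cluster can sit outside an organization's "carefully crafted data security regimes."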
Organizations are increasingly turning to Hadoop to create “data lakes.” Unlike databases, which Adrian says tend to contain “known data that conforms to predetermined policies about quality, ownership, and standards,” data lakes encourage data of indeterminate quality or provenance. Though the Hadoop community has promising projects like Apache Eagle (which uses machine intelligence to identify security threats to Hadoop clusters), it has yet to offer a unified way to lock down such data and, worse, is offering a mishmash of competing alternatives, as Adrian describes:
[T]he principal Hadoop distributors on whom enterprises rely for the effective operation of their Hadoop stack are using data security as an opportunity for product differentiation. As a result, these distributors are pursuing different paths, different open- and closed-source software projects, and targeting different points of vulnerability. The maturity of their approaches also differs, which complicates the decisions made by potential buyers and increases the likelihood of incompatibility between them and other related software from third-party vendors that may choose to support competitors' offerings instead — raising questions about lock-in by organizations who wish to tackle security concerns now.
Big data security, in short, is a big mess.
Love that lock-in
The specter of lock-in is real, but is it scary? I’ve argued before that lock-in is a fact of enterprise IT, made no better (or worse) by open source ... or cloud or any other trend in IT. Once an enterprise has invested money, people, and other resources into making a system work, it’s effectively locked in.
Still, there’s arguably more at stake when a company puts petabytes of data into a Hadoop data lake versus running an open source content management system or even an operating system. The heart of any business is its data, and getting boxed into a Hadoop vendor because an enterprise has grown dependent on its particular approach to securing Hadoop clusters seems like a big deal.
But is it really?
Oracle, after all, makes billions of dollars “locking in” customers to its very proprietary database, so much so that it had double the market share (41.6 percent) of its nearest competitor (Microsoft at 19.4 percent) as of April 2016, according to Gartner’s research. If enterprises are worried about lock-in, they have a weird way of showing it.
For me, the bigger issue isn’t lock-in, but rather that the competing approaches to Hadoop security may actually yield poorer security, at least in the short term. Enterprises that deploy more than one Hadoop stack (a common occurrence) will have to juggle conflicting security models and will almost certainly leave holes. Those that standardize on one vendor will be stuck with an incomplete security solution.
Over time, this will improve. There’s simply too much money at stake for the on-prem and cloud-based Hadoop vendors. But for the moment, enterprises should continue to worry about Hadoop security.