On the surface, data lakes seem like a smooth idea. Instead of planning and building complex integrations and carefully constructed data models for analytics, you simply copy everything by default to commodity storage running HDFS (or Gluster or the like) -- and worry about schemas and so on when you decide what job you want to run.
Yet making a full copy of your data simply to analyze it still gives people -- particularly those in authority -- pause. The infrastructure for a data lake may cost pennies on the dollar compared to a SAN, but it’s a chunk of change nonetheless, and there are many security, synchronization, and regulatory issues to consider.
That’s why I think this year’s buzzphrase will be “analyze in place” -- that is, using distributed computing technologies to do the analysis without copying all of your data onto another storage medium. To be truthful, “analyze in place” is not a 100 percent accurate description. The idea is to analyze the data “in memory” somewhere else, but in collaboration with the existing system rather than replacing it entirely.
The core challenges for analyzing in place will be the following:
- The load on the source system
- The latency or speed of getting data from the source system into memory for analytics
- The complicated security rules necessary to make all of this work
- The costs of cooperation
The chief benefit of analyzing in place is to avoid an incredible feat of social engineering. (My department, my data, my system, my zone of control, and my accountability, so no, you can’t have a full copy because I said so and I have the authority to say no and if I don’t I’ll slow-walk you to oblivion because I’ve outlasted kids like you for decades.) Also, you get security in context, simpler operations (not having another storage system to administer), and more.
There are plenty of good reasons to use a distributed file system -- and, frankly, SANs were a big fat farce foisted upon us all. However, "I just want to analyze the data I already store elsewhere" may not always be one of those reasons.
Latency and load
There’s no way around load and latency cost. If I can analyze terabytes of data in seconds, but can move only a gigabyte at a time to my Spark cluster, then the total operation time is those seconds plus the copy time. Also, you still have to pick the data out of the source system.
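A back-of-the-envelope calculation makes the point. The numbers below are illustrative assumptions, not benchmarks, but the shape of the result holds: copy time dwarfs analysis time.

```python
# Back-of-the-envelope: analysis is fast, but moving data is not.
# All numbers here are illustrative assumptions, not benchmarks.

data_gb = 2_000            # 2TB of source data
analyze_secs = 30          # time to analyze it once it's in cluster memory
transfer_gbps = 1          # effective transfer rate to the Spark cluster, GB/s

copy_secs = data_gb / transfer_gbps      # 2,000 seconds just moving the data
total_secs = copy_secs + analyze_secs    # copy time dominates the total

print(f"copy: {copy_secs:.0f}s, analyze: {analyze_secs}s, total: {total_secs:.0f}s")
```

Even with a fast cluster, the wall-clock time is governed by how quickly you can get data out of the source system.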
We use the phrase "predicate pushdown" to mean that the analytics system passes the "where" clause along to the source system, so only matching rows are retrieved. But if you don’t have an index on your RDBMS, the time to get the data to your analytics system will be roughly equal to the time it takes for your hair to fall out.
Right now, if you’re doing this in Spark, you need to balance predicate pushdown (that is, network optimization) against overloading your source system (execution costs). There is no magic: We're running a giant query against the source system and copying the result into another cluster’s memory for further analysis.
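Here's a toy sketch of the trade-off in plain Python, standing in for the real machinery (actual pushdown happens inside the JDBC connector, not in user code; the table and numbers are made up). With pushdown, the "where" clause runs at the source and only matching rows cross the network; without it, everything does:

```python
# Toy model: a source table and an analytics job that wants only one region.
# Names and numbers are illustrative; real predicate pushdown happens inside
# the JDBC connector, not in user code like this.
source_table = [{"region": "EMEA" if i % 4 == 0 else "AMER", "amount": i}
                for i in range(1_000)]

def fetch(predicate=None):
    """Simulate the source system: apply the pushed-down predicate, if any."""
    rows = source_table if predicate is None else [r for r in source_table if predicate(r)]
    return rows  # these rows are what crosses the network

# No pushdown: copy everything, then filter in the analytics cluster.
pulled = fetch()
filtered_later = [r for r in pulled if r["region"] == "EMEA"]

# Pushdown: the source applies the "where" clause; far fewer rows move.
pushed = fetch(lambda r: r["region"] == "EMEA")

print(len(pulled), len(pushed))  # 1000 rows moved vs. 250
```

The catch the article describes: that pushed-down predicate still has to execute on the source system, which pays the cost of evaluating it -- hence the balancing act.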
Sometimes to make this work you may have to shore up the source system. That may take time -- and by the way, it's not a sexy big data project. It may be a fatter server for the Oracle team. As a guy with a sales role for a consultancy, I get a headache from this because I hear “long sales cycle.” It may also be costlier than the lake approach; it will definitely be costlier in terms of labor. The deal is that in many cases, analyze in place is the only approach that can work.
Security and the federation
Handling security when you have permission to use the analytics system, permission to use a source system -- and potentially multiple source systems -- is complicated. I can see the FSM-awful Kerberos rules now! Up until now, big data projects have tended to skirt around this by getting a copy and designing a new security system around it or simply “pay no attention to the flat security model we got an exception for.”
In the brave new world of analyze in place, we’ll use terms like “federated” and “multitiered” to cover up “painful” and “complex,” but there is another word: “necessary.” Our organizational zones of control exist for a reason. I don’t know all the rules or reasons surrounding data in, say, the order management system, but the people running the order management system do.
That source system has not only the role-based access control (RBAC) but all of the policy-based rules. (Yes, you are an employee in HR with the general permission to do so, but no, you can’t look up your boss’s data or change your own salary.) When our system asks to let so-and-so look at XYZ, the source system knows enough to decide whether that's OK. Replicating that security "over there" is next to impossible. Moreover, it has to happen at runtime or the security rules could get "stale."
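A minimal sketch of why role alone isn't enough -- all the names, roles, and rules here are hypothetical, just to show the shape of a policy check that only the source system has the context to make:

```python
# Sketch: RBAC says "HR can read salaries," but policy rules add context
# the analytics side doesn't have. Every name here is hypothetical.

ROLE_GRANTS = {"hr_analyst": {"read_salary"}}

def manager_of(employee):
    # Stand-in for the source system's org-chart lookup.
    return {"alice": "bob", "bob": "carol"}.get(employee)

def can_read_salary(requester, role, subject):
    if "read_salary" not in ROLE_GRANTS.get(role, set()):
        return False                      # RBAC check: role lacks the grant
    if subject == requester:
        return False                      # policy: not your own record
    if manager_of(requester) == subject:
        return False                      # policy: not your boss's record
    return True

print(can_read_salary("alice", "hr_analyst", "carol"))  # True
print(can_read_salary("alice", "hr_analyst", "bob"))    # False: her boss
```

The org chart and policy rules live in the source system and change over time -- which is why the decision has to happen there, at runtime, rather than in a copied-over security model that goes stale.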
Cooperation is hard
Creating a data lake is easier than analyzing in place, because you only have to get the authorization for the overall project, get the executive mandate to get data from the source systems, then use that to steamroll all the little fiefdoms while you suck down their data in what is essentially root mode. There are many social reasons to do this, because often these fiefdoms build up walls where no walls are needed.
Analyze in place means working together to decide who gets what and when. It means making changes on both sides to allow the security model to work (and probably deploying more awful Kerberos things). It also means identifying technical fears versus personal fears (does this new system make me obsolete?) and addressing them. It means bringing the people in charge of the source systems up to speed.
You may have to overcome 1,000 other points of resistance instead of only one. Rather than "you can’t copy all my data" and "the boss said I could," you have 1,000 little things you can’t do or that haven’t been done yet -- and you have to work through or work around them. In other words, you’re going to have a lot of meetings.
Yet cooperation is an opportunity. Maybe some queries in other systems have been sucking down the source system from a performance perspective, and if you make a friend here, you might find a new client for your big data analytics project. Maybe someone in marketing is always asking for something they can’t get, so you have a whole new project.
You can’t always win cooperation -- the other side needs to have motivation, too. Executive sponsorship may be key, but it isn’t enough. Try steamrolling someone who has outlasted the last five waves of technical change. You may find the obstructionist on the source system side wants to learn more about this new technology. You may find that they feel threatened and need assurance there is still a place for them in this new world. You may also have to give something (maybe there's a use case for the long-discounted mainframe team; maybe your budget gets them a little extra memory or storage).
One thing many people forget is to actually ask for help. No one survives for long on merely protecting their lines of control; they must add value somewhere. Like politicians "asking for your vote," the big data project leader is "asking for your help (and your data)."
More than one way to analyze in place
To illustrate analyze in place, I’ve mostly alluded to having Spark query your database via JDBC. Don’t you wish it were that simple?
It probably isn’t. There may be pipelines of data and various transformations along the way. This is where tools like NiFi and Kettle, as well as ETL solutions like Bedrock, will probably have to rise to the task.
Sometimes the project will look a lot more like an ETL project than, say, querying via Spark -- except we won't store what we’re querying, we'll pass credentials along the way, and we'll do this at runtime. Eventually, batch and real-time will come together, and most items will be real-time. Batch will only be for finding answers to questions you didn’t think to ask before. You could consider a lot of real-time or streaming analytics systems to be analyze in place, since they subscribe to events from the source system. I suspect we’ll see more of this where we respect the source system and don’t store so many copies.
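The subscribe-to-events idea can be sketched in a few lines. This is a toy in-process pub/sub, not a real streaming system (Kafka, NiFi, and the like do this for real); everything here is illustrative:

```python
# Toy event subscription: the analytics side reacts to source-system events
# in memory instead of keeping its own copy of the data. All illustrative.

class SourceSystem:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def emit(self, event):
        # The source system stays the system of record; subscribers
        # see each event as it happens.
        for cb in self.subscribers:
            cb(event)

running_total = {"orders": 0, "revenue": 0.0}

def on_order(event):
    # Analyze in memory; no second copy of the order data is stored.
    running_total["orders"] += 1
    running_total["revenue"] += event["amount"]

source = SourceSystem()
source.subscribe(on_order)
for amount in (19.99, 5.00, 100.01):
    source.emit({"type": "order", "amount": amount})

print(running_total)
```

The point is the shape of the relationship: the analytics side keeps a running, in-memory view derived from events, while the source system remains the single authoritative store.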
The answer is always “sometimes”
Extremes don’t tend to happen in the real world. The answer probably isn’t "let’s build a data lake" or "let’s analyze in place"; it's probably a combination and purely situational. That predicate pushdown may not scale or the load on your operational system of 1,000 little queries from analysts and “self-served” people may be too great. We might want a lake there.
Yet in other places the security rules might be advanced and the fines too great to not have one tried-and-true security system and let the source system enforce the rules while we work on the data in memory. We probably need to store our results to even analyze in place somewhere, and that might look something like an EDW, a data lake, or a big bucket on HDFS.
Yes, we’ll build more data lakes this year, but between new technologies like Spark and NiFi, and with new uses for secure data becoming important, I suspect many people will insist on having their cake (data securely stored in source systems) and eating it too (analyzing it with technologies like Spark).