Loosely speaking, a data lake is the big data version of an operational data store, plus a network storage appliance, plus data processing/query engines, all combined -- typically in a Hadoop cluster augmented by database engines. The concept of the data lake is simple: “pour” every record from any source you can find into it, and make all this data available to anyone who needs access. This way, there is no "loading bias" or "transformation bias" about which data is useful and in what form. Everyone finds what they need.
At least that's the theory. But is the data lake really helping IT to get more agile, or is it actually slowing things down and making IT harder?
Yes, it is helping with agility
The first and foremost benefit of the data lake is that, because all data is poured into the lake, usually with very little latency, it is available for any usage: analytical, operational, etc. In the pre-data-lake world, if you wanted to use data from system A for your analytics, you needed to request that an ETL job be designed, developed and deployed to get records from system A into your operational data store or other target structure. And the first iteration would rarely be correct, especially if you were exploring a new analysis angle.
With the data lake, it's like querying the operational systems directly. Except that your queries won't risk crashing those systems. And you get a quasi-unified layer to access all the various data -- the second benefit. I say quasi-unified because there are still several distinct data processing frameworks and interfaces on top of Hadoop, but it's still easier than doing data joins between an ERP, a database and log files (for example).
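To make that concrete, here is a toy sketch of what the quasi-unified layer buys you. It uses an in-memory SQLite database as a stand-in for a Hadoop SQL engine such as Hive or Spark SQL; all table names, column names and values are illustrative, not from any real system. The point is that once an ERP extract, a CRM table and parsed web logs sit side by side, one ordinary SQL query can span all three -- no per-source ETL job required.

```python
import sqlite3

# Stand-in for the lake's SQL layer (illustrative schema and data).
lake = sqlite3.connect(":memory:")
lake.executescript("""
    CREATE TABLE erp_orders    (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE crm_customers (customer_id INTEGER, name TEXT);
    CREATE TABLE web_logs      (customer_id INTEGER, page TEXT);
""")
lake.executemany("INSERT INTO erp_orders VALUES (?, ?, ?)",
                 [(1, 101, 250.0), (2, 102, 75.0)])
lake.executemany("INSERT INTO crm_customers VALUES (?, ?)",
                 [(101, "Acme Corp"), (102, "Globex")])
lake.executemany("INSERT INTO web_logs VALUES (?, ?)",
                 [(101, "/pricing"), (101, "/docs"), (102, "/pricing")])

# One query reaches across what used to be three separate systems.
rows = lake.execute("""
    SELECT c.name,
           (SELECT SUM(amount) FROM erp_orders o
             WHERE o.customer_id = c.customer_id) AS revenue,
           (SELECT COUNT(*) FROM web_logs w
             WHERE w.customer_id = c.customer_id) AS page_views
    FROM crm_customers c
    ORDER BY c.name
""").fetchall()
print(rows)  # [('Acme Corp', 250.0, 2), ('Globex', 75.0, 1)]
```

In the pre-lake world, each of those subqueries would have been a separate extract job against a separate system.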
Where it makes IT harder
The data lake does however create issues that make things harder for IT.
- Processes born in the data lake are harder to operationalize. This is the flip side of agility. A number of projects using the data lake can be viewed as disposable, and will never graduate from the business or data science team that created them. But when a project shows value, it becomes important for IT to regain control of it before it becomes mission-critical. The early prototype must be hardened, secured and instrumented. Source data loading must be solidified. Because none of this is typically taken into account during the research phase, it usually means rebuilding everything under pressure, because expectations are high to see the project go live fast.
- Data governance and quality are tough to control. The data lake lacks metadata. Its real-time nature often precludes proper data quality checks from being implemented at load time. As a result, most data lakes actually look more like data swamps! Problems arise when users of the data lake are not attuned to this lack of governance and quality. They blindly trust the data made available to them. Or they make (false) assumptions about the meaning of certain data items, with no metadata reference to help.
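The practical consequence of skipping quality checks at load time is that every consumer must validate records as it reads them. Here is a minimal sketch of such a read-time check; the field names and rules are illustrative assumptions, not a standard.

```python
# Read-time validation: each consumer of the lake must do this itself,
# because nothing was enforced when the data was "poured" in.
def validate(record, required=("customer_id", "amount")):
    """Return a list of problems found in one raw lake record."""
    problems = []
    for field in required:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    amount = record.get("amount")
    if isinstance(amount, (int, float)) and amount < 0:
        problems.append("negative amount")
    return problems

raw = [
    {"customer_id": 101, "amount": 250.0},
    {"customer_id": None, "amount": 75.0},   # swamp-style record: no key
    {"customer_id": 103, "amount": -5.0},    # suspicious value
]
clean    = [r for r in raw if not validate(r)]
rejected = [(r, validate(r)) for r in raw if validate(r)]
print(len(clean), len(rejected))  # 1 2
```

Multiply this by every team consuming the lake, each with slightly different rules, and the governance problem the bullet describes becomes visible.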
- Lack of security and privacy creates real risks. Every application has different security policies and user types. Marketing folks should not have access to data about sales reps' bonuses -- and yet they need detailed data about orders and sales productivity. Factory plant managers have no business accessing HR records for their crew -- and yet they must have access to vacation requests to manage capacity. In traditional applications, these permissions and rights have been carefully weighed to prevent leaks and liability. Security management in Hadoop is way behind security in the source systems. The risk is real that the data lake gives someone access to data they should not see.
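The kind of column-level control the source systems enforce -- and a raw lake often does not -- can be sketched in a few lines. In Hadoop this is the job of tools such as Apache Ranger or Sentry; the role names and columns below are purely illustrative.

```python
# Column-level filtering sketch: each role sees only its permitted
# columns. Roles, columns and values are illustrative assumptions.
PERMITTED = {
    "marketing": {"order_id", "amount", "region"},
    "hr":        {"employee_id", "bonus", "vacation_days"},
}

def filter_row(row, role):
    """Strip any column the role is not entitled to see."""
    allowed = PERMITTED.get(role, set())
    return {k: v for k, v in row.items() if k in allowed}

# A lake record that mixes order data with compensation data.
row = {"order_id": 1, "amount": 250.0, "region": "EMEA",
       "employee_id": 7, "bonus": 9000.0}
print(filter_row(row, "marketing"))
# {'order_id': 1, 'amount': 250.0, 'region': 'EMEA'} -- bonus is stripped
```

When records land in the lake without such a layer in front of them, the per-application policies the bullet describes simply vanish.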
None of which is to say that the data lake is bad. But, like any new concept, it must be used knowingly.
This article is published as part of the IDG Contributor Network.