We've all seen the marketing hype surrounding the data lake. Data lakes are much like Michael Corleone at the end of The Godfather. Data lakes will answer all your questions and solve all your problems. However, as with Michael’s pronouncements at the end of The Godfather, there is a downside to this “offer” that marketers may think we cannot refuse. There is usually a set of stakeholders out there who are unfamiliar with Hadoop or the concept of a data lake, or who are simply not interested in changing the status quo of their organizations.
As a data architect, you are pitching a data lake the way travel websites or George Clooney movies pitch one of those mountain lakes ... lakes are cool, clear, and usually have the reflection of a snow-tipped mountain peak on their surface to show the purity of the contents within. Everyone wants to drink water from this source. However, when some people hear the concept -- data from many sources being stored without a schema for some possible future benefit -- they will think of a data swamp rather than a pristine data lake.
Data swamps are places where unknown data sits in a Hadoop cluster. You don’t know where the data came from. You don’t know how old the data is. You have no idea what you might use the data for. Heck, the first time you present this type of data to a skeptical executive who is more concerned with the status quo than with organizational change, you will hear the classic, “What’s your data source? How can you verify this information? I have different experiences....”
But before you can even get to that meeting where people start to question the data from your data lake, you need to propose, build, and populate one. Here are the top three (3) objections that I often hear to “discourage” any budding data architect from attempting to start a data lake initiative, and how you might answer those objections:
Aren’t data lakes just another silo to get in the way? Just like the name implies, data lakes provide the opportunity to put all that pure data into a single location. This lets information from those new, and often voluminous, data sources share an environment with traditional data sets and with each other. It lets data-driven organizations discover links between data sets such as mobile and social, draw new insights from the data, and potentially create new business models, much as Uber changed the personal transportation business. I would answer this objection by pointing to the advances in data integration technologies such as data virtualization and ETL/ELT/ET/ETLT, as well as the ability to share data between data management architectures. The day of “data silos” is more about “want to” than “can’t do.”
Data lakes aren’t robust enough for our needs…Hadoop isn’t even 10 years old! I would say that this objection comes from someone who is invested in the care, feeding, and maintenance of a data warehouse. The types of “needs” that this objection is attempting to address are data governance, quality, stewardship, and lineage. True, the data governance practices of data lakes lag behind those of other data architectures based on the concept of ‘schema on write,’ where you predetermine the questions before you create and populate the structure. I would answer that a data lake attempts to solve a different set of requirements. Instead of assuring the quality of the data for “regulatory quality reporting” (i.e., someone goes to jail if the numbers are wrong), data lakes are designed to allow for discovery and then the potential use for new business models. A data lake’s data quality practices are less about the syntactic quality of the data (are all the fields perfect?) and more about the semantic quality of the data (can we use this well?).
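For readers who have only worked with ‘schema on write,’ here is a minimal Python sketch of the opposite approach, where the structure is applied only when a question is asked. Everything in it is an assumption made for illustration: the sample events, the field names, and the `read_with_schema` helper are hypothetical and do not come from any particular Hadoop tool.

```python
import json

# Hypothetical raw events, stored exactly as they arrived.
# Nothing validated or reshaped them on the way in (no schema on write).
RAW_EVENTS = [
    '{"user": "a1", "event": "ride_requested", "fare": "12.50", "device": "ios"}',
    '{"user": "b2", "event": "ride_requested", "fare": 8, "promo": "SPRING"}',
    '{"user": "c3", "event": "app_opened"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time.

    `schema` maps a field name to a function that coerces its value.
    Records that lack a field, or whose value cannot be coerced,
    are skipped because they cannot answer this particular question.
    """
    for line in lines:
        record = json.loads(line)
        try:
            yield {field: cast(record[field]) for field, cast in schema.items()}
        except (KeyError, ValueError, TypeError):
            continue


# Today's question: what fares are users seeing?
fare_schema = {"user": str, "fare": float}
fares = list(read_with_schema(RAW_EVENTS, fare_schema))
print(fares)
```

The same raw events can later be read with a different schema (say, `{"event": str}`) without reloading anything, which is the “discovery” trade-off described above: you give up guarantees that every field is perfect in exchange for the ability to ask questions you did not plan for.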
Data lakes threaten the established data management structures such as the data warehouse. More often than not, I hear this one coming out of the mouth of someone who sells proprietary data warehouse storage components. And yes, some in the EDW world find the presence of the data lake to be a threat to the “single version of the truth” component of the enterprise data warehouse. However, more often than not, the data that exists within a data lake isn’t the type of curated, structured data that data warehouses are known for. I would answer that the data that exists within the data lake is more often atomic-level event data with lots of extra fields that haven’t yet proven themselves “worthy” of placement in the data warehouse. Part of this is the concept of separating the signal from the noise. Another part is that pouring potential petabytes of data into the EDW will cause two things to happen. One, the data quality people will become “concerned” (okay, have a heart attack) over the data coming into the platform. Two, the storage vendor will retire early to some golf course with the purchase agreement to handle all that information.
After you hear the objections a couple hundred times, the question then becomes: if this is the type of resistance that you encounter, and if the lake might devolve from those pure data sources high in the mountains into a swamp, is a data lake worth the time, trouble, and effort? The answer to that is "most certainly!" The advantages of the data lake outweigh the risks. The data lake is how data-driven organizations will validate and power their new businesses.
Does your organization want to be part of the future or part of the past?
This article is published as part of the IDG Contributor Network.