Data lakes: Just a swamp without data governance and catalog

Most businesses’ data lakes are merely repositories of undefined data sets from multiple sources, resulting in data swamps

The big data landscape has exploded in an incredibly short amount of time. It was only in 2013 that the term “big data” was added to the pages of the Oxford English Dictionary. Fewer than five years later, 2.5 quintillion bytes of data are being generated every day. In response to the creation of such vast amounts of raw data, many businesses recognized the need for large-scale data storage and adopted solutions such as data warehouses and data lakes without much thought.

On the surface, modern data lakes hold an ocean of possibility for organizations eager to put analytics to work. They offer a storage repository for those capitalizing on new transformative data initiatives and capturing vast amounts of data from disparate sources (including social, mobile, cloud applications, and the internet of things). Unlike the old data warehouse, the data lake holds “raw” data in its native format, including structured, semistructured, and unstructured data. The data structure and requirements are not defined until the data is needed.

One of the most common challenges organizations face with their data lakes, though, is the inability to find, understand, and trust the data they need to create business value or gain a competitive edge. That’s because the data, held in its native format, might be unintelligible or even conflicting. When a data scientist wants to access enterprise data for modeling or to deliver insights for analytics teams, that person is forced to dive into the depths of the data lake and wade through the murkiness of undefined data sets from multiple sources. As data becomes an increasingly important tool for businesses, this scenario is clearly not sustainable in the long run.

To be clear, for businesses to make effective and efficient use of the data stored in data lakes, they need to add context to that data by implementing policy-driven processes that classify and identify what information is in the lake, why it’s there, what it means, who owns it, and who is using it. This can best be accomplished through data governance integrated with a data catalog. Once this is done, the murky data lake will become crystal clear, particularly for the users who need it most.
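
As a minimal sketch of the context described above, a catalog entry can be modeled as a record that answers each of those questions. The `CatalogEntry` class, its field names, and the sample values are hypothetical, chosen only for illustration; they are not taken from any particular catalog product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CatalogEntry:
    """Hypothetical catalog record for one data set in the lake."""
    name: str          # what is in the lake
    purpose: str       # why it is there
    definition: str    # what it means in business terms
    owner: str         # who owns it
    consumers: List[str] = field(default_factory=list)  # who is using it

    def missing_context(self) -> List[str]:
        """Return the names of required fields that are still empty."""
        required = ("name", "purpose", "definition", "owner")
        return [f for f in required if not getattr(self, f).strip()]


# Illustrative values only: this entry has no owner recorded yet.
entry = CatalogEntry(
    name="web_clickstream_raw",
    purpose="Feeds churn analysis",
    definition="One row per page view on the public site",
    owner="",
    consumers=["analytics-team"],
)
print(entry.missing_context())  # ['owner']
```

A governance policy could, for example, refuse to publish a data set to users until `missing_context()` returns an empty list.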

Avoiding the data swamp

The potential of big data is virtually limitless. It can help businesses scale more efficiently, gain an advantage over their competitors, enhance customer service, and more. It may seem that the more data an organization has at its fingertips, the better. Yet that’s not necessarily the case, especially if that data is hidden in the data lake with no governance in place. A data lake without data governance will ultimately end up being a collection of disconnected data pools or information silos that simply happen to sit in one place.

Data dumped into a data lake has no business value without structure, processes, and rules around it. Ungoverned, uncataloged data leaves businesses vulnerable. Users won’t know where the data comes from, where it’s been, with whom they can share it, or if it’s certified. Regulatory and privacy compliance risks are magnified, and data definitions can change without any user’s knowledge. The data could be impossible to analyze, or could be used inappropriately, because it is inaccurate or missing context.

The impact: stakeholders won’t trust results gathered from the data. A lack of data governance transforms a data lake from a business asset to a murky business liability.

The value of a data catalog in maintaining a crystal-clear data lake

The tremendous volume and variety of big data across an enterprise make it difficult to understand the data’s origin, format, and lineage, and how it is organized, classified, and connected. Because data is dynamic, understanding all of its features is essential to its quality, usage, and context. Data governance provides systematic structure and management to data residing in the data lake, making it more accessible and meaningful.

An integrated data governance program that includes a data catalog turns a dark, gloomy data lake into a crystal-clear body of data that is consistently accessible to be consumed, analyzed, and used. Its wide audience of users can glean new insights and solve problems across their organization. A data catalog’s tagging system methodically unites all the data through the creation and implementation of a common language, which includes data and data sets, glossaries, definitions, reports, metrics, dashboards, algorithms, and models. This unifying language allows users to understand the data in business terms, while also establishing relationships and associations between data sets.

Data catalogs make it easier for users to drive innovation and achieve groundbreaking results. Users are no longer forced to play hide-and-seek in the depths of a data lake to uncover data that fits their business purpose. Intuitive data search through a data catalog enables users to find and “shop” for data in one central location using familiar business terms and filters that narrow results to isolate the right data. Similar to sites like Amazon.com, enhanced data catalogs incorporate machine learning, which learns from past user behavior, to issue recommendations on other valuable data sets for users to consider. Data catalogs even make it possible to alert users when data that’s relevant to their work is ingested in the data lake.
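
To make the idea of searching by business terms and filters concrete, here is a minimal sketch over an in-memory catalog. The `CATALOG` contents, the `search` function, and the tag and domain names are all assumptions made for illustration; real catalogs expose their own search interfaces, and the machine-learning recommendations mentioned above are not modeled here.

```python
from typing import Dict, List, Optional

# Hypothetical catalog: each entry carries business-term tags and a domain.
CATALOG: List[Dict] = [
    {"name": "crm_customers", "domain": "sales", "tags": {"customer", "contact"}},
    {"name": "web_clickstream_raw", "domain": "marketing", "tags": {"customer", "web"}},
    {"name": "gl_postings", "domain": "finance", "tags": {"ledger", "revenue"}},
]


def search(term: str, domain: Optional[str] = None) -> List[str]:
    """Return names of data sets tagged with `term`, optionally filtered by domain."""
    term = term.lower()
    return [
        e["name"]
        for e in CATALOG
        if term in e["tags"] and (domain is None or e["domain"] == domain)
    ]


print(search("customer"))                  # ['crm_customers', 'web_clickstream_raw']
print(search("customer", domain="sales"))  # ['crm_customers']
```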

A data catalog combined with governance also ensures trustworthiness of the data. A data lake with governance provides assurance that the data is accurate, reliable, and of high quality. The catalog then authenticates the data stored in the lake using structured workflows and role-based approvals of data sources. And it helps users understand the data’s journey (its source, lineage, and transformations) so they can assess its usefulness.
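
One simple way to picture lineage is as a map from each data set to the data sets it was derived from. The sketch below walks that map to list everything a report ultimately depends on. The `LINEAGE` map, the data set names, and the `upstream` function are hypothetical and stand in for whatever lineage metadata a given catalog records.

```python
from typing import Dict, List

# Hypothetical lineage map: each data set lists the data sets it was derived from.
LINEAGE: Dict[str, List[str]] = {
    "churn_report": ["customer_features"],
    "customer_features": ["crm_customers", "web_clickstream_raw"],
    "crm_customers": [],
    "web_clickstream_raw": [],
}


def upstream(dataset: str) -> List[str]:
    """Return every data set that `dataset` depends on, directly or indirectly."""
    seen: List[str] = []
    stack = list(LINEAGE.get(dataset, []))
    while stack:
        current = stack.pop()
        if current not in seen:  # skip data sets already visited
            seen.append(current)
            stack.extend(LINEAGE.get(current, []))
    return seen


print(sorted(upstream("churn_report")))
# ['crm_customers', 'customer_features', 'web_clickstream_raw']
```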

A data catalog helps data citizens (anyone within the organization who uses data to perform their job) gain control over the glut of information stuffed into their data lakes. By indexing the data and linking it to agreed-upon definitions about quality, trustworthiness, and use, a catalog helps users determine which data is fit to use—and which they should discard because it’s incomplete or irrelevant to the analysis at hand.

Whether users are looking to preview sample data or determine how new data projects might impact downstream processes and reports, a data catalog gives them the confidence that they’re using the right data and that it adheres to provider and organizational policies and regulations. Added protections allow sensitive data to be flagged within a data lake, and security protocols can prevent unauthorized users from accessing it.
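
The flagging and access control described above can be sketched as a small check that consults a sensitivity flag before granting read access. The `SENSITIVE` set, the `ROLE_GRANTS` table, the role names, and the rule itself (non-sensitive data is open to any known role, sensitive data requires an explicit grant) are assumptions for illustration, not a description of any specific product’s security model.

```python
from typing import Dict, Set

# Hypothetical sensitivity flags and per-role grants.
SENSITIVE: Set[str] = {"crm_customers"}
ROLE_GRANTS: Dict[str, Set[str]] = {
    "data_steward": {"crm_customers", "web_clickstream_raw"},
    "analyst": {"web_clickstream_raw"},
}


def can_read(role: str, dataset: str) -> bool:
    """Assumed rule: unknown roles are denied; sensitive data needs an explicit grant."""
    if role not in ROLE_GRANTS:
        return False
    if dataset in SENSITIVE:
        return dataset in ROLE_GRANTS[role]
    return True


print(can_read("analyst", "crm_customers"))       # False
print(can_read("data_steward", "crm_customers"))  # True
```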

Realizing data’s potential requires more than just collecting it in a data lake. Data must be meaningful, consistent, clear, and, most important, cataloged for the users who need it the most. Proper data governance and a first-rate data catalog will transform your data lake from a simple data repository into a dynamic tool and collaborative workspace that empowers digital transformation across your enterprise.

This article is published as part of the IDG Contributor Network.