Shopping for data: what’s fit for your purpose?

The data catalog (or data marketplace) makes finding and accessing data easier, another step toward data democratization

There’s been a shift in data strategy from defensive to offensive. Historically, data governance focused on compliance and security. It still does, but it’s expanding to address data accessibility: getting data to the people who need it so they can solve business problems, drive new revenue, create value, and even monetize their data.

But how do you accomplish this? In the past, I’d receive a report with data deemed relevant to my tasks, and if I needed more I’d ask someone who had the tools to get me the data. But the dramatic increase in data volume (much of it produced by automated devices) renders that method obsolete. Enter the data catalog.

Great expectations

With the rise of ecommerce, we’ve gone from browsing physical stores to browsing online catalogs. On Amazon, by setting a few simple search criteria, I can find everything from the novels of Murakami to All-Clad cookware. I can see what’s in stock and from whom. I can see how others rated the item, and what else they’ve purchased. This rich user experience has permanently changed how we shop.

It’s not surprising, then, that we expect that same experience when searching for data. Access to and understanding of data has become indispensable; jobs across all industries demand insights from data in some form. In my last post I addressed the importance of data democratization across organizations and in society at large. That is why what began as the data dictionary and evolved into the metadata repository is no longer adequate. Whether you call it the data catalog or the data marketplace, people now want an online location to “shop for data,” and these tools must offer certain capabilities to meet that expectation.

Key requirements

Maybe you’re considering a data catalog for your organization and are receiving emails about data marketplace products or a “data bazaar.” Are any of them worthwhile? How do you choose?

With some variation, these products all come down to how to find and access data. The first thing to consider is semantics. I’m not going to know all the permutations that developers used to identify the various attributes of the data. The semantic terms must be recognizable so they can connect me to relevant data—what business glossaries attempt to provide.
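To make the glossary idea concrete, here is a minimal sketch in Python. It assumes a hand-built glossary that maps a business term to the physical column names developers might have used, and a toy catalog of datasets. The names GLOSSARY, CATALOG, and find_datasets, and all of the sample terms and columns, are hypothetical illustrations, not part of any particular product.

```python
# Hypothetical glossary: business term -> physical names developers used.
GLOSSARY = {
    "customer id": {"cust_id", "customer_no", "cid"},
    "order date": {"ord_dt", "order_date", "purchase_ts"},
}

# Hypothetical catalog: dataset name -> its column names.
CATALOG = {
    "sales_orders": ["cust_id", "ord_dt", "amount"],
    "web_sessions": ["cid", "session_start"],
    "store_inventory": ["sku", "qty_on_hand"],
}


def find_datasets(term, glossary=GLOSSARY, catalog=CATALOG):
    """Return names of datasets with a column mapped to the business term."""
    # Normalize the search term so "Customer ID" and "customer id" match.
    physical = glossary.get(term.strip().lower(), set())
    return sorted(
        name
        for name, columns in catalog.items()
        if physical.intersection(c.lower() for c in columns)
    )


print(find_datasets("Customer ID"))
```

The point of the sketch is that the person searching only needs the business term; the glossary carries the burden of knowing every physical variation.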

But even using the proper terms, this may be no better than a search for rock music of the 1990s on Amazon—too broad to be useful. When I shop for music, I already have certain requirements in mind beyond the genre or the artist. We have requirements for data as well, though often we don’t stop to articulate what those are. They amount to data quality requirements and articulate what is fit for purpose. Incorporating this level of insight is critical to support the data governance mission.

That’s why traditional metadata content such as data lineage and data profiling is still needed to provide insight for selection, to understand the data source, its origin, and its relationship to other data sources. But assuming we have that context, is it enough? It may be nice to know that a given source received a four-star rating, but there is no guarantee that what someone else said is of any relevance to my work.

So how do you assess the available data? You need information, and not just metadata: you need context around the data so you can not only search and filter it, but assess its fitness for your purpose.

Questions … and answers

A data catalog must allow the user to ask key questions and use those to filter results. For instance:

  • Is the data complete? And not just the fields, but is the whole dataset comprehensive in relation to the business problem? (For example, if I’m looking at the impact of weather on store inventory, I may need a dataset with weather data for all delivery routes, not just my stores.)
  • Is the data consistent and valid—not as determined by the creator, but by the user?
  • Is the data understandable? If there are codes, can the user understand them?
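As a rough illustration of how such questions could drive filtering, the Python sketch below scores each catalog entry against one user’s requirements rather than against a generic rating. Everything in it is an assumption made for the example: the DatasetEntry and Requirements fields, the use of a null rate as a stand-in for validity, and the sample weather datasets.

```python
from dataclasses import dataclass


@dataclass
class DatasetEntry:
    name: str
    regions: set            # areas the dataset covers
    null_rate: float        # share of missing values, e.g. from profiling
    documented_codes: bool  # are coded values explained to the user?


@dataclass
class Requirements:
    required_regions: set
    max_null_rate: float
    need_documented_codes: bool = True


def assess(entry, req):
    """Answer the three questions for one dataset, from the user's viewpoint."""
    return {
        "complete": req.required_regions <= entry.regions,
        "valid": entry.null_rate <= req.max_null_rate,
        "understandable": entry.documented_codes or not req.need_documented_codes,
    }


def fit_for_purpose(entries, req):
    """Keep only the datasets that satisfy every requirement."""
    return [e.name for e in entries if all(assess(e, req).values())]


# Hypothetical entries for the weather-and-inventory example.
entries = [
    DatasetEntry("weather_stores_only", {"north"}, 0.01, True),
    DatasetEntry("weather_all_routes", {"north", "south", "east"}, 0.03, True),
    DatasetEntry("weather_raw_feed", {"north", "south"}, 0.20, False),
]
req = Requirements(required_regions={"north", "south"}, max_null_rate=0.05)

print(fit_for_purpose(entries, req))
```

A different user with looser requirements would get a different shortlist from the same catalog, which is the difference between fitness for purpose and a one-size-fits-all rating.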

In a 2017 survey, some 51 percent of data scientists named access to quality data as the No. 1 barrier to their work. When browsing the data catalog, being able to articulate such questions, and see how the set of available data satisfies them, is critical if we are to work through hundreds or thousands of data sources. It’s not the rating of data, but the context of how well the data fits differing requirements, that allows us to gauge whether it is useful or not.

The data catalog addresses the first barrier to data democratization: finding and accessing data. A familiar, consumer-style search capability is foundational, but the ability to apply questions to the data for a given business problem is central to reducing the time required to wade through the range of data sets and to quickly reaching those of highest value and interest. If you’re exploring a data catalog solution, ensure that it not only captures metadata but also provides business semantics, context, and a means to evaluate the data content against your requirements.

This article is published as part of the IDG Contributor Network.