Who should be responsible for your data? The knowledge scientist

Organizations that recognize the importance of clean and reliable data while elevating knowledge work will move faster along the path to true data-driven decision-making


How can you build a data-driven culture and spur digital transformation without thinking through who should be responsible for your data? Let’s do that together.

Data engineers and data scientists each occupy critical roles. Data engineers manage the data infrastructure and are in charge of designing, building, and integrating data workflows, pipelines, and the ETL process. Their goal is to provide data for data scientists’ analysis. Data scientists are those who can turn data into insights by applying statistics, machine learning, and analytical approaches. Their goal is to answer critical business questions.

Data-driven organizations require reliable, clean data to function. Without it, your AI, machine learning, and analytics are worthless. Unreliable, erroneous, and incomplete data leads to answers that can’t be trusted; hence, “garbage in, garbage out.”

Therefore, the process of wrangling and cleaning data is crucial, and it is often said to account for 80% of a data scientist’s work. Typically, it is seen as boring, annoying grunt work that people don’t want to do.

However, I think this negative view is at least partly based on a major underappreciation of the significance of such work. Data wrangling and cleaning is not simply about eliminating white spaces, replacing wrong characters, and normalizing dates. Stepping back, these tasks should be viewed in the context of two key objectives:

  1. Understanding the ecosystem of people, data, and tasks in an organization
  2. Communicating and documenting that knowledge in order to generate clean and reliable data

Yes, data wrangling and cleaning can take 80% of a data scientist’s time and energy. This does not mean that 80% is wasted. While these tasks can and should be optimized for efficiency, they are part of the vital knowledge work that should be elevated within a data-driven organization. But who should be doing it?

Who should be responsible for data?

In typical organizations, the need for reliable data is constant, but the knowledge work that creates it is ad hoc. Practices and results are not documented and shared because data scientists are usually not equipped, trained, or incentivized to do so. Indeed, in our experience, a lot of the “softer” knowledge work (like conference calls, discussions, whiteboarding sessions, documentation, long Slack chats) required to create clean and reliable data is not valued by data scientists or their managers. Making matters worse, most tools are designed and provisioned for a small set of user types and teams to the exclusion of other user types and teams. Thus, the responsibility to create and manage reliable data is siloed, scattered, or even non-existent.

I argue that data scientists should not be responsible for creating and managing reliable and clean data, because their responsibility is to turn data into insights. Instead, I call for a new role to fill this critical need: the knowledge scientist.

Who is a knowledge scientist?

A knowledge scientist is a person who builds bridges between business requirements, questions, and data. The goal of the knowledge scientist is to document knowledge by gathering information from business users, data scientists, data engineers, and their environments in order to make data more useful for AI, machine learning, business intelligence, data analytics, and more.

From a hard skills perspective, knowledge scientists should work with business users and demonstrate what they have learned by using skills and techniques such as data modeling, knowledge representation, and ontology engineering. The output is a data model that represents how the business user sees the world. Knowledge scientists should align this data model with other models derived from talking to other business users.

Furthermore, while working with data engineers, the knowledge scientist should be fluent in data access and transformation methods such as query and programming languages. They should transform the data being provided by the data engineer and map it to the business meaning provided by the business user. They should be conversant in analytical and machine learning methods.
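As a rough illustration of what mapping engineered data to business meaning might look like in practice, here is a minimal Python sketch. Everything in it is hypothetical: the column names (cust_id, ord_dt, amt), the business terms, the definitions, and the teams credited with confirming them are invented for the example and do not come from any real system. The point is only that the mapping and the documented meaning can live side by side.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class FieldMapping:
    source_column: str  # column name as delivered by the data engineer
    business_term: str  # name the business user actually uses
    definition: str     # documented meaning, agreed with the business user
    confirmed_by: str   # who confirmed the definition


# Hypothetical mappings, invented for illustration only.
MAPPINGS = [
    FieldMapping("cust_id", "customer_id",
                 "Unique identifier of a paying customer", "Sales operations"),
    FieldMapping("ord_dt", "order_date",
                 "Date the order was placed, not shipped", "Finance"),
    FieldMapping("amt", "order_total_usd",
                 "Order total in US dollars, tax included", "Finance"),
]


def to_business_record(raw_row: dict) -> dict:
    """Rename raw columns to business terms and apply basic cleaning."""
    record = {}
    for m in MAPPINGS:
        value = raw_row[m.source_column]
        if isinstance(value, str):
            value = value.strip()  # eliminate stray white space
        record[m.business_term] = value
    # Normalize types; assumes ISO dates (YYYY-MM-DD) in the raw data.
    record["order_date"] = datetime.strptime(
        record["order_date"], "%Y-%m-%d").date()
    record["order_total_usd"] = float(record["order_total_usd"])
    return record


def data_dictionary() -> dict:
    """Return the documented meaning of each business term."""
    return {m.business_term: m.definition for m in MAPPINGS}
```

Calling to_business_record on a raw row yields a record keyed by business terms, while data_dictionary exposes the agreed definitions so that the knowledge gathered from business users is shared rather than buried in a script.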

Knowledge work is people work. From a soft skills perspective, the knowledge scientist should have excellent communication skills that can be applied to both the business user and the data engineer. The knowledge scientist should be both a “people person” and a “geek.”

The knowledge science discipline has its roots in the knowledge engineering approaches of the 1980s and 1990s. In that world, skills such as knowledge acquisition, knowledge elicitation, and knowledge specification were taught and used. These are lost arts in industry today, particularly in the data science context. I believe that revisiting these approaches will be a key part of developing both the instructional curriculum and the tooling needed to support the knowledge scientist.

Organizations that recognize the central importance of clean and reliable data, and that elevate the knowledge work behind it, will be at the forefront of digital transformation and will move faster toward becoming data-driven. Who are the knowledge scientists in your organization?


Copyright © 2019 IDG Communications, Inc.
