Data is the lifeblood of AI, but how do you collect it?

AI and lots of good data go hand in hand, but it can be a challenge for companies to aggregate it

When it comes to artificial intelligence (AI), there is no such thing as data overload. In fact, it’s quite the opposite: the more data, the better. Because AI systems can process enormous amounts of data, and their accuracy generally improves as the volume of relevant data grows, the demand for data continues to grow.

Consider, for example, an AI program designed to identify why a manufacturing process is producing defective medical devices. As with any AI application, the software looks for patterns in the data using algorithms developed by data scientists. Suppose the program receives and sorts through production data from different days of the week, times of day, machines and operators. But maybe none of those factors was causing the defects, and the real cause was the rising temperature in the room, which the data did not cover. Only by providing as much data as possible, covering as many variables as possible, can companies effectively and efficiently determine the actual cause of the problem.
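
To make the defect example concrete, here is a minimal Python sketch. The records, field layout and the 25°C threshold are all hypothetical, invented for illustration. It compares defect rates across groups for each recorded factor. With this made-up data, weekday and machine show no difference at all, while room temperature separates the defective units cleanly, which is only visible because temperature was collected in the first place.

```python
from collections import defaultdict

# Hypothetical production records: (weekday, machine, room_temp_c, defective)
RECORDS = [
    ("Mon", "A", 21, False),
    ("Mon", "B", 27, True),
    ("Tue", "A", 28, True),
    ("Tue", "B", 20, False),
    ("Wed", "A", 22, False),
    ("Wed", "B", 29, True),
    ("Thu", "A", 27, True),
    ("Thu", "B", 21, False),
]


def defect_rate_by(records, key):
    """Return {group: share of defective units}, grouping records with `key`."""
    totals = defaultdict(int)
    defects = defaultdict(int)
    for rec in records:
        group = key(rec)
        totals[group] += 1
        defects[group] += rec[3]  # True counts as 1, False as 0
    return {group: defects[group] / totals[group] for group in totals}


def spread(rates):
    """Gap between the worst and best group; a larger gap hints at a stronger signal."""
    return max(rates.values()) - min(rates.values())


by_day = defect_rate_by(RECORDS, lambda r: r[0])
by_machine = defect_rate_by(RECORDS, lambda r: r[1])
# Assumed threshold: treat 25 degrees C and above as a "hot" room.
by_temp = defect_rate_by(RECORDS, lambda r: "hot" if r[2] >= 25 else "cool")

for name, rates in [("weekday", by_day), ("machine", by_machine), ("temperature", by_temp)]:
    print(f"{name}: spread in defect rate = {spread(rates):.2f}")
```

A real analysis would use far more records and a proper statistical model, but the point carries over: a factor that is never recorded can never be found.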

So, what’s the best way to get the data you need?

Your data is the crown jewel

If someone asked you for your customer and prospect data, would you hand it over? Your response would likely be, “Absolutely not.” Your data is the crown jewel of your organization. It includes valuable information on your key targets, their preferences and their motivations.

What if you are trying to conduct predictive analytics to determine the likelihood of customers purchasing goods or services in the next six months based on their historical and real-time product usage, or to find why your quarterly sales figures missed the mark? Generic data just won’t do. To get the answers to these types of company-focused questions, you really need the data that is most relevant to you—your own constantly updated data.

The challenge with that, however, is that it might not be as easy to get access to internal data as you might think. Given the tendency of organizations to keep their information siloed, it can be difficult to know what internal data is residing in different departments and systems—let alone to collect it (and this is not counting third-party cloud application data). A key first step is to conduct a data audit to find out what type of data you have and where it is located within your organization. Then you can work with the various departments and business units to gain access to it.
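
The output of a data audit can be as simple as an inventory that records what data exists, which department and system hold it, and whether you already have access. A minimal Python sketch follows; the `DataSource` fields and every entry in the inventory are hypothetical examples, not a prescribed schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    """One row of a hypothetical data-audit inventory."""
    name: str
    department: str
    system: str
    access_granted: bool


# Made-up entries for illustration only.
INVENTORY = [
    DataSource("customer_accounts", "Sales", "CRM", True),
    DataSource("support_tickets", "Support", "Helpdesk", False),
    DataSource("product_usage_events", "Engineering", "Data warehouse", True),
    DataSource("invoices", "Finance", "ERP", False),
]


def by_department(sources):
    """Group source names by the department that holds them."""
    grouped = defaultdict(list)
    for source in sources:
        grouped[source.department].append(source.name)
    return dict(grouped)


def access_requests(sources):
    """List the sources you still need to negotiate access to."""
    return sorted(s.name for s in sources if not s.access_granted)


print(by_department(INVENTORY))
print("Still need access to:", access_requests(INVENTORY))
```

Even a spreadsheet with these four columns would serve the same purpose; what matters is that the inventory exists before you start requesting access.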

Data quality and quantity go hand in hand

While your internal data may be the most appropriate for addressing a specific problem, the question becomes: do you have enough of it to solve the problem? Most often, the answer is probably not. That’s where combining internal and external data comes in. To supplement your internal information, it’s important to identify the external data that is most relevant to your company and your business challenge.

In addition to augmenting your internal data to solve a specific business problem, there are instances when external data alone would be able to address universal issues, such as determining general consumer buying patterns.

Regardless of whether you are using external data to supplement your internal data or as the primary source to answer a more common problem, there are several ways to aggregate it: through prepackaged data, public crowdsourcing and private crowds.

  • Prepackaged data. While prepackaged data can offer a quick, out-of-the-box way to collect data, it can sometimes end up taking more time or effort than you planned. Out-of-the-box sounds good in theory, but with prepackaged data, companies often need to develop APIs for integrations, write code or make other customizations.
  • Public crowdsourcing. Organizations have been turning to public crowds for years to get help with natural disasters or crises, for example, by asking the general public to search images for survivors. Similarly, the city of Boston is using crowdsourcing to get the public’s help in reporting potholes. And companies are using crowdsourcing services, such as Amazon Mechanical Turk, to distribute the work involved in collecting and preparing data, including image recognition, data normalization and algorithm training for machine learning, among other tasks.
  • Private crowds. Companies that require confidentiality agreements to work on their data, or that want more accuracy and faster turnaround than public crowds offer, are turning to private crowds of data specialists and other professionals for help with the same data collection, identification, labeling, preparation and training tasks.

Facing the challenges

Regardless of where your data comes from, aggregating data that is relevant to a business problem, analyzing it and gaining insights from it is challenging for several reasons:

  • It’s hard to figure out what data you need. Working backwards from a problem you want to solve, you need to determine the type of data you need to gather. Companies are not always sure of what they need or how to get that information.
  • You need specialized expertise to build the algorithms. Once you have the data, you have to determine the best attributes for the data model and build an algorithm that effectively answers your business question.
  • Training on data is a process that never ends. You need to constantly add new and updated information to improve an algorithm, get better insights and more accurately predict outcomes.
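
The last point, that training never ends, can be pictured as a loop that refits on everything collected so far each time a new batch of data arrives. Below is a minimal Python sketch in which a simple purchase-rate estimate stands in for a real model. The function names and the numbers are hypothetical.

```python
def fit_purchase_rate(outcomes):
    """'Train' a stand-in model: the share of customers who purchased."""
    if not outcomes:
        raise ValueError("need at least one observation")
    return sum(outcomes) / len(outcomes)


def retrain(history, new_batch):
    """Append the new batch to the history and refit on all of it."""
    updated = history + new_batch
    return updated, fit_purchase_rate(updated)


# Hypothetical outcomes: 1 = customer purchased, 0 = did not.
history = [1, 0, 1, 0]
initial_rate = fit_purchase_rate(history)

# A new batch arrives, so the estimate is refit on old and new data together.
history, updated_rate = retrain(history, [0, 0, 1, 0, 0, 0])

print(f"initial estimate: {initial_rate:.2f}, after retraining: {updated_rate:.2f}")
```

With only the first four observations the estimate is 0.50; after the new batch it drops to 0.30. A real system would retrain an actual model on a schedule or when new data lands, but the pattern of folding in fresh data and refitting is the same.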

Knowledge is power and there’s a lot of knowledge trapped in your internal data as well as external sources. To best unlock that knowledge, you have to consider the type of data you need, where to look for it, how to get it, and how to build the right data models to analyze your business questions. And just as importantly, you need to continually update your data to re-train and enhance the algorithms. There’s certainly a lot that goes into data collection, but it’s worth it. As the lifeblood of AI, data is critical to helping you get the business insights you need to move your business forward.

This article is published as part of the IDG Contributor Network.