AI: the challenge of data

While AI has been getting all the press, the elephant in the room is training data

In the last few years, AI has made breathtaking strides, driven by developments in machine learning such as deep learning. Deep learning is part of the broader field of machine learning, which is concerned with giving computers the ability to learn without being explicitly programmed, and it has had some incredible successes.

Arguably, the modern era of deep learning can be traced back to the ImageNet challenge in 2012. ImageNet is a database of millions of images categorized using nouns such as “strawberry,” “lemon,” and “dog.” In that challenge, a convolutional neural network (CNN) achieved an error rate of about 16 percent; before that, the best algorithm could manage only about 25 percent.

One of the biggest challenges of deep learning is the need for training data. Large volumes of data are needed to train networks to do even the most rudimentary things. This data must also be relatively clean if the resulting networks are to have any meaningful predictive value. For many organizations, this makes machine learning impractical. It’s not just the mechanics of creating neural networks that are challenging (although that is itself a hard task), but also organizing and structuring enough data to do something useful with it.

There is an abundance of data in the world: more than 180 zettabytes is predicted by 2025 (1 zettabyte is 10^21 bytes, a 1 followed by 21 zeros). Ninety-nine percent of the data in the world is not yet analyzed, and more than 80 percent of it is unstructured, meaning that there are plenty of opportunities and hidden gems in the data we are collecting. Sadly, however, much of this data is not in any state to be analyzed.

So, what can enterprises do?

You need to think about data differently from how you do today. Data must be thought of as a building block for information and analytics. It must be collected to answer a question or set of questions. This means that it must have the following characteristics:

  • Accuracy: It may seem obvious, but the data must be accurate.
  • Completeness: The data must be relevant, and all the data needed to answer the question must be present. An obvious example of incomplete data would be a classroom of 30 students in which the teacher calculates the average for only 15.
  • Consistency: If one database indicates that there are 30 students in a class and a second database shows 31 in the same class, this is an issue.
  • Uniqueness: If a student has different identifiers in two separate databases, this is an issue because it creates the risk that information won’t be complete or consistent.
  • Timeliness: Data can change over time, so it must be kept current, and the AI model may need to be updated as it changes.
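
To make these characteristics concrete, here is a minimal Python sketch that runs four of the checks against a toy version of the classroom example. Everything in it is an assumption made for illustration: the record layout, the 0 to 100 score range, the `quality_report` helper, and the naive matching of students by name. Timeliness is left out because it depends on timestamps that this toy data does not have.

```python
# Hypothetical records from two sources describing the same class.
gradebook = [
    {"id": "s01", "name": "Ana", "score": 88},
    {"id": "s02", "name": "Ben", "score": None},  # score never entered
    {"id": "s03", "name": "Cal", "score": 140},   # outside the 0-100 scale
]
registrar = [
    {"id": "s01", "name": "Ana"},
    {"id": "s02", "name": "Ben"},
    {"id": "x77", "name": "Cal"},                 # same student, different id
    {"id": "s04", "name": "Dee"},
]


def quality_report(gradebook, registrar, lo=0, hi=100):
    """Return simple accuracy, completeness, consistency, and uniqueness signals."""
    scores = [r["score"] for r in gradebook]
    present = [s for s in scores if s is not None]

    # Collect every identifier seen for each name across both sources.
    # Matching on name is a simplification; real data needs a better key.
    ids_by_name = {}
    for record in gradebook + registrar:
        ids_by_name.setdefault(record["name"], set()).add(record["id"])

    return {
        # Accuracy: scores that cannot be right for the assumed scale.
        "accuracy_out_of_range": sum(1 for s in present if not lo <= s <= hi),
        # Completeness: students with no score recorded.
        "completeness_missing_scores": len(scores) - len(present),
        # Consistency: do the two sources agree on the class size?
        "consistency_counts_match": len(gradebook) == len(registrar),
        # Uniqueness: students who appear under more than one identifier.
        "uniqueness_conflicting_ids": sorted(
            name for name, ids in ids_by_name.items() if len(ids) > 1
        ),
    }


report = quality_report(gradebook, registrar)
print(report)
```

On this toy data the report flags one out-of-range score, one missing score, a class-size mismatch between the two sources, and one student ("Cal") who has two different identifiers.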

Beyond the data itself, severe constraints can impede analytics and deep learning, including security and access, privacy, compliance, IP protection, and physical and virtual barriers. These constraints need to be addressed early: It doesn’t help the enterprise to have all the data if the data is inaccessible. Often, the data must be scrubbed so that no private content remains. Sometimes, agreements need to be made between parties that are sharing data, and sometimes technical work is needed to move the data to locations where it can be analyzed.

Finally, the format and structure of the data need to be considered. Recently, I was looking at currency rates from the Federal Reserve going back 40 years for a personal project when, in one of those head-slapping moments, I realized that there was a discontinuity from 1999 onward: The euro had replaced many European currencies. There was a way I could mitigate the problem, but it was deeply unsatisfying. Legacy data might be plentiful, but it may be incompatible with the problem at hand.
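
The euro discontinuity can be bridged in more than one way. One possible approach, sketched here as an assumption and not necessarily the mitigation used in that project, is to build a synthetic pre-1999 euro series from a single legacy currency using the fixed conversion rate set when the euro was introduced (1 euro = 1.95583 Deutsche Marks). The sketch assumes the legacy series is quoted in Deutsche Marks per US dollar and the euro series in US dollars per euro; the function name and all exchange-rate values are made up for illustration.

```python
# Fixed conversion rate set when the euro was introduced: 1 EUR = 1.95583 DEM.
DEM_PER_EUR = 1.95583


def synthetic_usd_per_eur(dem_per_usd):
    """Convert a Deutsche Mark per US dollar quote into US dollars per euro."""
    if dem_per_usd <= 0:
        raise ValueError("exchange rate must be positive")
    return DEM_PER_EUR / dem_per_usd


# Illustrative values only, not real Federal Reserve data.
legacy_dem_per_usd = {1997: 1.75, 1998: 1.70}   # assumed quote: DEM per USD
euro_usd_per_eur = {1999: 1.07, 2000: 0.92}     # assumed quote: USD per EUR

# Splice the converted legacy years in front of the genuine euro years.
spliced = {
    year: round(synthetic_usd_per_eur(rate), 4)
    for year, rate in legacy_dem_per_usd.items()
}
spliced.update(euro_usd_per_eur)

for year in sorted(spliced):
    print(year, spliced[year])
```

A splice like this produces one continuous series, but the pre-1999 values are only a proxy: they reflect a single legacy currency, not the whole group of currencies that the euro replaced, which is one reason such a fix can feel unsatisfying.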

The moral of the story is that we are deluged with data, but often the conditions do not allow the data to be used. Sometimes, enterprises are lucky and, with some effort, can put the data into good shape. Very often, enterprises will need to rethink how they collect data or transform it into a form that is consumable. Agreements can be made to share data or merge data sets, but completeness issues often remain.

As noted earlier, the key to success is to start with a question and then structure the training data or collect the right data to answer the question. While immense barriers remain in collecting training data, there is clearly a push by enterprises toward higher-quality data, as evidenced by the growing influence of data scientists. I am very optimistic that the corpus of high-quality training data will grow, enabling wider adoption of AI across enterprises of all sizes.

This article is published as part of the IDG Contributor Network.