In an age of fake news, is there really such a thing as fake data?

The pitfalls and benefits of using synthetic data to train AI algorithms

Deloitte Global predicts that medium and large enterprises will increase their use of machine learning in 2018, doubling the number of implementations and pilot projects underway in 2017. And, according to Deloitte, by 2020, that number will likely double again.

Machine learning is clearly on the rise among companies of all sizes and in all industries, and machine learning models depend on data to learn. Training a model requires thousands or millions of data points, which need to be labeled and cleaned. Training data is what makes apps smart, teaching them life lessons, experiences, sights, and rules that help them know how to react to different situations. What a developer of an AI app is really trying to do is simulate the experiences and knowledge that take people lifetimes to accrue.

The challenge many companies face in developing AI solutions is acquiring all the training data needed to build smart algorithms. While companies maintain data internally across different databases and files, it would be impossible for most of them to quickly gather the amount of data that is needed. Only tech-savvy, forward-thinking organizations that began storing their data years ago could even begin to try.

As a result, a new business is emerging that essentially sells synthetic data (fake data, really) that mimics the characteristics of the real deal. Companies that tout the benefits of synthetic data claim that effective algorithms can be developed using only a fraction of pure data, with the rest being created synthetically. And they claim that it drastically reduces costs and saves time. But does it deliver on these claims?

Synthetic data: buyer beware

When you don’t have enough real data, just make it up. Seems like an easy answer, right? For example, if I’m training a machine learning application to detect the number of cranes on a construction site, and I only have examples of 20 cranes, I could create new ones by changing the color of some cranes, the angle of others, and the size of still others, so that the algorithm is trained on hundreds of crane examples. While this may seem easy and harmless enough, in reality, things are not that simple. The quality of a machine learning application depends directly on the quality of the data with which it is trained.
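
As a rough illustration of the crane example, the sketch below generates variants of a single image by shifting its color, rotating it, and resizing it. It is a toy under stated assumptions, not any vendor's method: the image is a tiny nested list of RGB tuples rather than a real photo, and the helper names (`shift_color`, `rotate_90`, `scale`, `augment`) are hypothetical.

```python
import random

# Assumption for this sketch: an image is a list of rows,
# and each row is a list of (r, g, b) tuples with values 0..255.

def shift_color(image, delta):
    """Add delta to every channel, clamped to the 0..255 range."""
    return [[tuple(max(0, min(255, c + delta)) for c in px) for px in row]
            for row in image]

def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def scale(image, factor):
    """Enlarge by an integer factor by repeating pixels (nearest neighbor)."""
    return [[px for px in row for _ in range(factor)]
            for row in image for _ in range(factor)]

def augment(image, n_variants, seed=0):
    """Return n_variants randomly recolored, rotated, and resized copies."""
    rng = random.Random(seed)  # seeded so the output is repeatable
    variants = []
    for _ in range(n_variants):
        out = shift_color(image, rng.randint(-40, 40))
        for _ in range(rng.randint(0, 3)):  # 0 to 3 quarter turns
            out = rotate_90(out)
        out = scale(out, rng.choice([1, 2, 3]))
        variants.append(out)
    return variants

# A made-up 2x2 "crane" image, purely for illustration.
crane = [[(200, 180, 20), (10, 10, 10)],
         [(200, 180, 20), (200, 180, 20)]]

synthetic = augment(crane, n_variants=10)
print(len(synthetic), "synthetic variants from 1 real image")
```

Note that every variant is derived from the same original crane. The count of examples grows, but no scenario is added that the original image did not already contain.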

Data needs to work accurately and effectively in the real world. Users of synthetically derived data have to take a huge leap of faith that it will train a machine learning app to work out in the real world and that every scenario that the app will encounter has been addressed. Unfortunately, the real world doesn’t work that way. New situations are always arising that no one can really predict with any degree of accuracy. Additionally, there are unseen patterns in the data that you just can’t mimic.

Yet, while accumulating enough training data the traditional way could take months or years, synthetic data is developed in weeks or months. This is an attractive option for companies looking to swiftly deploy a machine learning app and begin realizing the business benefits immediately. In some situations where many images need to be identified quickly to eliminate manual, tedious processes, maybe it’s okay to not have a perfectly trained algorithm—maybe providing 30 percent accuracy is good enough.

But what about the mission- or life-critical situations where a bad decision by the algorithm could result in disaster or even death? Take, for example, a health care app that works to identify abnormalities in X-rays, or an autonomous vehicle trained on synthetic data. Because the app knows only what it was trained on, what happens if it was never given data that tells it how to react to real-world possibilities, such as a broken traffic light?

How do you make sure you’re getting quality data in your machine learning app?

With the use of synthetic data clearly on the rise, many AI software developers, insights-as-a-service providers and AI vendors are relying on it to get AI apps up and running and solving problems out of the gate. But when working with these firms, there are some key questions you should ask to make sure you are getting quality machine learning solutions.

Do you understand my industry and the business challenge at hand?

When working with a company developing your machine learning algorithm, it’s important that it understands the specific challenges facing your industry and the critical nature of your business. Before it can aggregate the relevant data and build an AI solution, the company needs to have an in-depth understanding of the business problem it is trying to solve.

How do you aggregate data?

It’s also important for you to know how the provider is getting the data that may be needed. Ask directly whether it uses synthetic data and, if so, what percentage of the training data is synthetic and how much is pure data. Based on this, determine whether your application can afford to make a few mistakes now and then.

What performance metrics do you use to assess the solution?

You should find out how the provider assesses the quality of the solution. Ask what measurement tools it uses to see how the algorithm operates in real-world situations. Additionally, you should determine how often it retrains the algorithm on new data.
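
One way to make this question concrete is to ask for accuracy reported separately on real and synthetic test examples. The sketch below shows that comparison under stated assumptions: the function name `accuracy_by_source` and the record layout are hypothetical, and the records are made-up placeholders, not measured results.

```python
from collections import defaultdict

def accuracy_by_source(records):
    """Compute accuracy per data source.

    records: iterable of (source, true_label, predicted_label) tuples,
    where source is a tag such as "real" or "synthetic".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for source, truth, predicted in records:
        total[source] += 1
        correct[source] += int(truth == predicted)
    return {source: correct[source] / total[source] for source in total}

# Placeholder evaluation records, invented purely for illustration.
records = [
    ("synthetic", "crane", "crane"),
    ("synthetic", "crane", "crane"),
    ("synthetic", "no_crane", "no_crane"),
    ("synthetic", "no_crane", "no_crane"),
    ("real", "crane", "crane"),
    ("real", "crane", "no_crane"),
    ("real", "no_crane", "no_crane"),
    ("real", "no_crane", "crane"),
]

scores = accuracy_by_source(records)
gap = scores["synthetic"] - scores["real"]
print(scores, "gap:", gap)
```

A large gap between the two scores would suggest that the model has learned the synthetic data well but may not carry over to real-world inputs.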

Perhaps most important, you need to assess if the benefits of using synthetic data outweigh the risks. It’s often tempting to follow the easiest path with the quickest results, but sometimes getting it right—even when the road is longer—is worth the journey.

This article is published as part of the IDG Contributor Network.