Data in, intelligence out: Smart apps are only as good as the data

Effective machine learning and predictive analytics depend on vast amounts of relevant data and continuous learning

Data is the fuel that drives machine learning and predictive analytics, and these apps are only as good as the data that goes into them. So when you design and operate cognitive solutions, you can hardly have too many data points -- although some people may disagree with me.

Some experts believe that data scientists should focus on a narrow set of data, and that collecting too much data wastes precious time when rolling out new solutions and may even be counterproductive.

But there is a good reason to collect as many data points as possible: today’s smart apps need vast amounts of data to create predictive algorithms that work -- and that ultimately make the apps even smarter.

So how does this happen? It all revolves around the scientific process and statistical mathematics. The standard process is to begin with a set of questions to explore and then develop a hypothesis that predicts what will happen. You would feed a comprehensive set of data points -- your variables -- into the system and test whether the hypothesis holds. If it does not, you would remove some variables, add others, test again, and so on.

The problem with using a limited set of data is that you don’t know in advance which variables are important, and you might be missing the one that would provide your answer. For example, if you wanted to develop an app to predict which medical devices might fail, you might look at physical characteristics, such as the width of the device or the materials used in its manufacture. If that is the only data you collect, that is all you can test for. But what if those variables aren’t relevant, and the real problem was the temperature in the room when critical components of the device were being soldered during manufacturing? By collecting as many variables as you can, you increase your chances of finding the right answer the first time.
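To make the medical device example concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the data is synthetic, the variable names (width_mm, material_grade, solder_room_temp_c) and the 28-degree cutoff are invented, and a simple correlation score stands in for whatever statistical test a real project would use. It is not a description of any actual device study.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / (var_x * var_y) ** 0.5


def make_devices(n=500, seed=0):
    """Generate synthetic device records; all values are made up."""
    rng = random.Random(seed)
    devices = []
    for _ in range(n):
        solder_room_temp_c = rng.uniform(15.0, 35.0)
        devices.append({
            "width_mm": rng.uniform(10.0, 20.0),
            "material_grade": rng.choice([1, 2, 3]),
            "solder_room_temp_c": solder_room_temp_c,
            # Assumption baked into the toy data: only a hot soldering
            # room causes failures; width and material are irrelevant.
            "failed": 1 if solder_room_temp_c > 28.0 else 0,
        })
    return devices


def rank_variables(devices, variables, target="failed"):
    """Rank variables by the strength of their correlation with the target."""
    ys = [d[target] for d in devices]
    scores = {v: abs(pearson([d[v] for d in devices], ys)) for v in variables}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


devices = make_devices()

# Limited collection: only the physical characteristics were recorded.
limited = rank_variables(devices, ["width_mm", "material_grade"])

# Broad collection: the soldering-room temperature was recorded too.
full = rank_variables(
    devices, ["width_mm", "material_grade", "solder_room_temp_c"]
)

print("limited:", limited)
print("full:   ", full)
```

With the limited set, no variable stands out, and nothing in the results hints at what is missing. Once the temperature column is included, it rises to the top of the ranking. The analysis is identical in both runs; only the data collected differs.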

Why is there resistance?

Given that the science and math bear out the need for more data points, why is there resistance to this approach? A large part of it has to do with how the data is collected. Many data scientists still collect data manually, feeding the variables into their systems by hand during the testing phase, which is labor-intensive and time-consuming. But now that solutions like Microsoft Azure automate this process, trying different algorithms and identifying which approach works best, there’s no reason to limit the amount of data.
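The general idea behind that kind of automation, trying several approaches and keeping whichever performs best on data it has not seen, can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the Azure API: the candidate "algorithms" are deliberately trivial, and the function names, the synthetic data, and the 28-degree cutoff are all hypothetical.

```python
import random


def make_rows(n=400, seed=1):
    """Synthetic (temp_c, width_mm, failed) rows; all values are made up."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        temp_c = rng.uniform(15.0, 35.0)
        width_mm = rng.uniform(10.0, 20.0)
        rows.append((temp_c, width_mm, 1 if temp_c > 28.0 else 0))
    return rows


def fit_majority(train):
    """Baseline: always predict the most common label in the training rows."""
    ones = sum(row[2] for row in train)
    majority = 1 if ones * 2 > len(train) else 0
    return lambda row: majority


def fit_threshold(column):
    """Return a fitter that learns a single cutoff on one column."""
    def fit(train):
        best_cut, best_acc = None, -1.0
        for cut in sorted({row[column] for row in train}):
            hits = sum((1 if row[column] > cut else 0) == row[2] for row in train)
            acc = hits / len(train)
            if acc > best_acc:
                best_cut, best_acc = cut, acc
        return lambda row: 1 if row[column] > best_cut else 0
    return fit


def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model(row) == row[2] for row in rows) / len(rows)


rows = make_rows()
train, holdout = rows[:300], rows[300:]

# Hypothetical candidate approaches; a real service would try far richer ones.
candidates = {
    "majority": fit_majority,
    "threshold_on_temp": fit_threshold(0),
    "threshold_on_width": fit_threshold(1),
}

# Fit every candidate on the training rows and score it on the held-out rows.
scores = {name: accuracy(fit(train), holdout) for name, fit in candidates.items()}
best = max(scores, key=scores.get)

print(scores)
print("best approach:", best)
```

Because the loop does the fitting and scoring, adding another variable or another candidate approach costs one more dictionary entry rather than another round of manual testing.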

Additionally, there may be concern that this level of automation through machine learning will kill jobs, but because of the fluidity of the marketplace, that’s not the case -- which leads to my next point.

Prepare for change

As the ancient Greek philosopher Heraclitus is credited with saying, “The only thing that is constant is change.” A machine-learning app is a living program that constantly needs to change, whether to improve the user experience, respond to evolving security threats, or adapt to changes in the environment that could cause your app to malfunction.

On the contrary, there is enough ongoing work to keep data scientists busy. Unlike software programs of the past, machine-learning-based apps are never done; they will always require more data, more testing, more refinements, and so on. So at least the data scientist’s job is secure.

The role of the data scientist

So, what should data scientists be doing? The data scientist should be forward-thinking, considering which sensors and variables should be added to capture more relevant data. Data scientists should also collect data continuously, look for variables they do not yet have, and suggest to software developers what to add.

To that end, it’s important that data scientists work closely with systems designers and engineers to ensure that they are programming existing systems to capture the data that is needed from the start.

When it comes down to it, you can never have too much good, relevant data, because you never know where the answer to your question will come from. Capture the data, test it, change the data points, test it again, repeat. It’s this continuous process that will ensure that your smart apps become smarter and your users have a better experience.

This article is published as part of the IDG Contributor Network.