Not all machine learning is created equal

'Machine learning' products could be using techniques that are far less sophisticated than those typically associated with the term

Barely a week goes by when you don't see a product or service advertised as powered by "machine learning." When a label is so broadly employed, it risks being devalued -- or used to describe something that barely qualifies as machine learning.

Call it "machine-learning washing" -- the liberal use of the label to allow a product to ride the current bandwagon.

There is little doubt that machine learning is a very real and powerful technology, and everyone from Microsoft to IBM is making hay with it. What complicates the picture is how a given product employs machine learning: through the more complex techniques we commonly associate with the label, or through a low-end approach that has more in common with basic statistics.

More than one kind of machine learning

Yann LeCun, the New York University professor recently hired by Facebook as head of AI research, described machine learning in an email as "a set of techniques that allow a computer to acquire or improve its ability to perform a task by automatically extracting knowledge from data." But the definition allows for a lot of leeway.

Gartner analyst Alexander Linden noted in an email, "to a very large extent, machine learning is a rebranding of predictive analytics and data mining." He co-authored a Gartner study, "Machine Learning Drives Digital Business," that described how machine learning is a spectrum of approaches: "The simplest types are linear regressions or scorecards; more advanced forms are decision trees and neural nets; and the most cutting-edge types today are ensemble models and deep neural nets."

The main problem: "Machine learning" is a generic description that encompasses a lot of different strategies, with products using techniques at the low end fitting the label in only the most rudimentary way. That makes it easier to promote a product that employs any of those strategies as one powered by "machine learning."

Linden and his co-authors pointed out that the techniques at the low end shouldn't be dismissed out of hand. "Despite its simplicity, linear regression and logistic regressions have been proven to be one of the most successful models in machine learning. Purists in statistics may reject this being called machine learning, but the concepts are the same, and the models often perform well."
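To make the low end of that spectrum concrete, here is a minimal sketch of a one-variable linear regression fitted by ordinary least squares in plain Python. The function name `fit_line` and the ad-spend and sales figures are hypothetical, invented purely for illustration; they do not come from the Gartner study.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: ad spend vs. sales, made up for this example.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

# "Learning" here is nothing more than estimating two numbers from the data.
slope, intercept = fit_line(ad_spend, sales)

# Use the fitted line to predict sales at an unseen spend level.
predicted = slope * 6.0 + intercept
print(slope, intercept, predicted)
```

The whole "model" is two numbers estimated from the data, yet it fits LeCun's definition: the program improves at a prediction task by extracting knowledge from examples.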

But the term remains associated with highly sophisticated techniques. LeCun noted that shifts in both the terminology and the problem space associated with machine learning have confused people trying to get a grasp on the term. As he put it:

The basic definition of ML [machine learning] hasn't changed much over time, but the set of ML techniques and the set of tasks that ML can solve has changed considerably over time. In the '60s and '70s, ML was known as "pattern recognition." In the '80s, there was a community working on "symbolic" ML, which used reasoning and logic to learn (it was essentially a failure). In the late '80s/early '90s, ML meant "neural networks." Then it moved to "kernel methods," "graphical models," "boosting," "trees," "Bayesian non-parametrics," and various other classes of methods. In the last several years, neural nets have made a huge come back under the banner of "deep learning."

"Secure because math"

Alex Pinto, chief data scientist of the MLSec Project, is likewise skeptical of how easily machine learning is misused as a sales buzzword.

In a paper presented at Black Hat USA 2014, Pinto noted the "great number of start-ups with 'cy' and 'threat' in their names that claim that their product will defend or detect more effectively than their neighbors' product 'because math.' ... Indeed, math is powerful, and large-scale machine learning is an important cornerstone of much of the systems that we use today. However, not all algorithms and techniques are born equal."

Pinto's paper goes on to say that many of the algorithms and techniques are not novel, as LeCun also pointed out. What's more, the algorithms may not be the most important part; rather, the data fed into an ML algorithm -- especially one that performs unsupervised learning -- is crucial.

"One of the biggest truths of machine learning of any sort is that your model design, that is, the features you are extracting in order to feed the prediction engine, is of greater importance than the actual algorithms that are being used," he wrote. "[I]f a data source feeds a security decision process, attackers will want to manipulate that data source to its advantage, in a practice that is not unlike clearing your traces in logs after breaking in."
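Pinto's point about features can be illustrated with a toy sketch. The example below runs the same deliberately simple algorithm, a nearest-centroid classifier, on two representations of the same data. Everything here (the function name `nearest_centroid_accuracy`, the eight points, the labels) is hypothetical and chosen only to make the effect visible; it is not taken from Pinto's paper.

```python
def nearest_centroid_accuracy(features, labels):
    """Fit a nearest-centroid classifier and score it on the same data."""
    classes = sorted(set(labels))
    centroids = {}
    for c in classes:
        rows = [f for f, y in zip(features, labels) if y == c]
        # Centroid = per-column mean of the rows belonging to class c.
        centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(f):
        # Pick the class whose centroid is closest (squared Euclidean
        # distance); on a tie, min() keeps the first class in sorted order.
        return min(
            classes,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(f, centroids[c])),
        )

    correct = sum(predict(f) == y for f, y in zip(features, labels))
    return correct / len(labels)


# Hypothetical data: class 0 sits near the origin, class 1 farther out.
points = [(1, 0), (-1, 0), (0, 1), (0, -1), (3, 0), (-3, 0), (0, 3), (0, -3)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Representation 1: raw coordinates. Both class centroids land on the
# origin, so the classifier cannot tell the classes apart.
raw = [list(p) for p in points]

# Representation 2: one engineered feature, squared distance from the origin.
engineered = [[x * x + y * y] for x, y in points]

raw_acc = nearest_centroid_accuracy(raw, labels)
eng_acc = nearest_centroid_accuracy(engineered, labels)
print(raw_acc, eng_acc)
```

The algorithm is identical in both runs; only the feature fed to it changes. On this toy data the raw coordinates leave it guessing, while the single engineered feature separates the classes cleanly.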

Copyright © 2015 IDG Communications, Inc.