Machine learning: Demystifying linear regression and feature selection

The time-tested technique for predicting numbers, and the role of domain knowledge in machine learning.

Businesspeople need to demand more from machine learning so they can connect data scientists’ work to relevant action. This requires basic machine learning literacy -- an understanding of what kinds of problems machine learning can solve and how to talk about those problems with data scientists. Linear regression and feature selection are two such foundational topics.

Linear regression is a powerful technique for predicting numbers from other data. Imagine you’ve been asked to predict basketball scores from game statistics, and you miraculously know absolutely nothing about basketball. The fact that a hoop is involved is news to you. You’ve found a dataset on stats.nba.com with a bunch of statistics (free throws made, assists, blocks, three pointers) along with the final score, and now you want to predict future scores from those stats.

Those of us who are not in your miraculous situation know that the answer is going to look a lot like points = free throws made + 2 * two pointers made + 3 * three pointers made.

And this is exactly what linear regression does. It finds a combination of features (columns in your table) and coefficients (numbers to multiply those columns by) that most closely match the dependent variable (the number you’re trying to predict) across the samples (rows in your table) on which you’re training the model. At its heart, regression is this simple -- just some multiplication and addition to get to a single predicted number.
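To make that concrete, here is a minimal sketch of the idea in Python with scikit-learn. The library choice and the toy numbers are my own for illustration, not anything from the article’s dataset: given a few made-up rows of box-score features and the points they imply, an ordinary least squares fit recovers the familiar coefficients.

```python
# A minimal sketch of linear regression on toy box-score data, using Python and
# scikit-learn. The numbers below are invented purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row is one game: [free throws made, two pointers made, three pointers made]
X = np.array([
    [20, 30,  8],
    [15, 35, 10],
    [25, 28, 12],
    [18, 33,  6],
])

# The dependent variable: points = free throws + 2 * two pointers + 3 * three pointers
y = X @ np.array([1, 2, 3])

model = LinearRegression().fit(X, y)
print(model.coef_)        # should come out close to [1, 2, 3]
print(model.intercept_)   # should come out close to 0
```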

Adrien-Marie Legendre and Carl Friedrich Gauss discovered regression independently in the early 1800s (which caused some controversy), and the technique is still widely used today. If you want to use machine learning to predict a number, linear regression is most often the best place to start. Applications for linear regression are wide-ranging, from the Altman Z-score for predicting business bankruptcy, to sales forecasting.

The primary goal of a linear regression training algorithm is to compute coefficients that make the difference between reality and the model’s predictions consistently small. Oftentimes one has a parallel goal of simplicity -- the model shouldn’t use all the available features, especially if there are dozens or hundreds.

This is accomplished with regularization, which applies a penalty to the training algorithm for non-zero coefficients. In the basketball case, for example, rebounds and steals should not directly factor into the final score (their coefficients should be zero), despite the fact that they are correlated with a higher score.
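As a hedged sketch of how that penalty behaves, the snippet below uses L1 regularization (scikit-learn’s Lasso -- my choice of regularizer for illustration, not necessarily what any particular tool uses) on simulated games where rebounds and steals are correlated with scoring but don’t directly produce points; the penalty pushes their coefficients to zero.

```python
# A sketch of L1 regularization (Lasso) zeroing out irrelevant features.
# The simulated data and column layout are invented for illustration.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_games = 200

ft    = rng.integers(10, 30, n_games)           # free throws made
two   = rng.integers(20, 40, n_games)           # two pointers made
three = rng.integers(5, 15, n_games)            # three pointers made
rebounds = two + rng.integers(-5, 5, n_games)   # correlated with scoring, but not causal
steals   = rng.integers(3, 12, n_games)

X = np.column_stack([ft, two, three, rebounds, steals])
y = ft + 2 * two + 3 * three                    # points depend only on shots made

# alpha controls how strongly non-zero coefficients are penalized.
model = Lasso(alpha=1.0).fit(X, y)
print(model.coef_)  # rebounds and steals should land at (or very near) zero;
                    # the real coefficients are shrunk slightly toward zero by the penalty
```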

But there’s a catch. What if you accidentally collect irrelevant data?

Instead of field goals and free throws, you have a table with a points column plus columns for the number of hot dogs and sodas sold at the game, and a column for how many times “MAKE SOME NOISE!!!” came pumping over the PA. Your modeling efforts are going to be fruitless.

This catch is not specific to linear regression. It applies to any machine learning model in any domain -- if the features available aren’t related to the phenomenon you’re trying to model, your modeling will at best fail and at worst produce spurious results. Garbage in, garbage out.

This fundamental (and quite reasonable) limitation of any machine learning technique is addressed by feature selection: choosing a good set of features upon which to build models. In basketball we know that there is a direct causal relationship between shots made and points. Unfortunately in business, those clear relationships are often difficult to come by. We might not know what they are, we might not be able to measure them, or they might be obfuscated by sources of randomness like measurement error.

For example, imagine you’re trying to predict the next quarter’s gross revenue for a widget manufacturing company. There are two types of widgets, A and B, that sell for $10 and $5 respectively. If you knew the final sales numbers for the two widget types, you’d have: revenue = $10 * A widgets sold + $5 * B widgets sold. Without that information, you need to fall back on less direct, noisier features, like the previous quarter’s sales, preorders, seasonal effects, and so on.

Much of the art in data science is understanding the problem domain well enough to build up a clean set of features that are likely related to what you want to model. And this argues strongly for involving business leaders and experts in the data science process. The domain insight necessary for success in machine learning resides somewhere within the business layer of your organization, probably at much higher fidelity than within the data science team itself.

Demonstration

To illustrate the interaction between feature selection and linear regression, I scraped 500 rows of game logs from stats.nba.com, placed them in a .csv file on one of our test Hadoop clusters, and built linear regression models to predict total points from three different feature sets: all the features from the logs, just the relevant features (free throws, field goals, three pointers), and only the irrelevant features -- everything other than those three.

The visual workflow to build all three models using Alpine. The data is stored on HDFS as a CSV file, and the blue nodes use Spark to train linear regression models on the separate feature sets. (Image by Alpine Data / CC BY-SA 2.0)
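For readers without Alpine or a Hadoop cluster at hand, here is a rough equivalent of the same three-model comparison in plain Python with pandas and scikit-learn. The file name and column names are assumptions for illustration, not the actual schema of the scraped logs.

```python
# A hedged sketch of the three-model comparison using pandas and scikit-learn
# instead of Alpine and Spark. The file name and column names are assumed, not
# the actual stats.nba.com schema.
import pandas as pd
from sklearn.linear_model import LinearRegression

games = pd.read_csv("game_logs.csv")   # hypothetical local copy of the scraped game logs

target = "PTS"
relevant = ["FTM", "FGM", "FG3M"]      # free throws, field goals, three pointers made
all_features = [c for c in games.columns if c != target]
irrelevant = [c for c in all_features if c not in relevant]

for name, cols in [("all features", all_features),
                   ("relevant only", relevant),
                   ("irrelevant only", irrelevant)]:
    model = LinearRegression().fit(games[cols], games[target])
    print(name, dict(zip(cols, model.coef_.round(4))), round(model.intercept_, 4))
```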

For the model trained on only the relevant features, the result was as you’d expect: free throws made + 2 * field goals made + three pointers made (the field goals made column counts both two and three pointers, so the three pointers column contributes only the extra third point).

The model trained on all features was close but not spot on: 1.0103 * free throws made + 2.0051 * field goals made + 1.0059 * three pointers made + very small contributions from the other features, plus a 0.4475 intercept (the intercept is the value the model returns when all features are zero). This fits the data almost perfectly, but as we know, all the other features aren’t relevant. Regularization could drive those small coefficients to zero and make this model identical to the previous one, at the cost of additional methodological complexity.

The model trained on only the irrelevant features was still able to predict the score of every game to within 16 points, and the bulk to within 6 points, but it was far less accurate than the previous two models. The top three features were field goal percentage, offensive rebounds, and turnovers (the last with a negative coefficient).

Linear regression is an excellent place to start when using machine learning to predict numbers. Combined with relevant features, it’s a slam dunk.
