Data scientists use statistical analysis tools to find non-obvious patterns in deep data. But they know the universe is full of spurious correlations. Big data simply intensifies the problem.
That is because, as the range of sources and the diversity of predictors continue to grow, the number of relationships that can potentially be modeled grows without practical limit. As David G. Young pointed out, “predictive variables sometimes aren’t .... We’ve all seen variable interactions that change the significance, curvature, and even the sign of an important predictor.”
Thus, if you’re looking for a particular correlation in your data, you can probably find it if you’re clever enough to combine only the right data, specify only the right variables, and analyze it using only the right algorithm. Once you’ve hit on the right combination of modeling decisions, the patterns you seek may pop out like a genie from Aladdin’s lamp.
Yet the fact that you’ve supposedly discovered this correlation doesn’t mean it actually exists in the underlying real-world domain you’re investigating. It may simply be a figment of your specific approach to modeling the data you have at hand. You may have no fraudulent intent, and you may otherwise adhere to standard data-scientific methodologies, but you may choose to go no further if it appears you’ve already struck the pay dirt insight you were seeking.
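To make the point concrete, here is a minimal sketch in Python (standard library only). It correlates a random target against a growing number of equally random predictors and reports the strongest correlation the search turns up. The function names, sample sizes, and seed are illustrative assumptions, not drawn from any particular study.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def strongest_noise_correlation(n_rows, n_predictors, seed=0):
    """Screen n_predictors pure-noise series against a pure-noise target
    and return the largest absolute correlation found."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_predictors):
        predictor = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(predictor, target)))
    return best


# Same seed in both runs, so the larger search starts from the same
# target and the same first five predictors, then keeps looking.
few = strongest_noise_correlation(n_rows=30, n_predictors=5)
many = strongest_noise_correlation(n_rows=30, n_predictors=2000)

print(f"strongest |r| among 5 noise predictors:    {few:.2f}")
print(f"strongest |r| among 2000 noise predictors: {many:.2f}")
```

Every series here is independent noise, so any correlation the search finds is spurious by construction. Even so, the strongest one tends to look more impressive the more candidate predictors you screen.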
If you’re a data scientist, the fact that you don’t realize you’re looking at non-existent statistical patterns may simply stem from the fact that you’re human. Confirmation bias is a vulnerability to which everybody tends to fall prey from time to time. Even the most brilliant statistical analysts can make honest mistakes of math and logic.
As Nobel laureate economist Daniel Kahneman stated in his book "Thinking, Fast and Slow," humans, educated and otherwise, are innately tuned to “see patterns where none exists.” Conversely, we are also frequently unable to see the deep statistical, probabilistic patterns that really exist in the world around us, especially when the patterns feel counterintuitive.
If you’re a data scientist who prides yourself on your data-driven exploration, spurious correlations are a dangerous honeypot. The pressure to accept these false patterns as gospel may stem from an unconscious desire to validate the cherished assumptions upon which your career and employer’s business model depend. You don’t want to think that you’re doing shoddy data science that rubber-stamps a convenient fiction. But it can be hard to justify deeper investigations into data that could risk uncovering inconvenient truths.
How can data scientists reduce the likelihood that, when exploring big data, they might inadvertently accept spurious statistical correlations as facts? Here are some useful methodologies in that regard:
- Ensemble learning: This approach determines whether multiple independent models -- all using the same data set but trained on different samples, employing different algorithms, and calling out different variables -- converge on a common statistical pattern. If they do, you can have greater confidence that the correlations they reveal have some causal validity. An ensemble-learning algorithm does this by training on the result sets of the independent models and using voting, averaging, "bagging," "boosting," and other functions to reduce variance among the patterns revealed in the various constituent models.
- A/B testing: This approach determines which alternative models -- between which some variables differ but others are held constant -- best predict the dependent variable of interest. Typically, in real-world experiments involving live data, successive runs of incrementally revised A and B models -- sometimes known as “champion” and “challenger” models -- converge on a set of variables with the highest predictive value.
- Robust modeling: This approach simply involves due diligence in the modeling process to determine whether the predictions you’ve made are stable with respect to alternate data sources, sampling techniques, algorithmic approaches, timeframes, and so on. In addition, robust outlier analysis is very important because, as Vincent Granville noted a few years ago, the increasing incidence of outliers in larger data sets can create spurious correlations, which might obscure the true patterns in the data or "reveal" patterns that don't exist.
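As one hedged illustration of the stability checks described above, the following Python sketch (standard library only) takes two unrelated noise series, adds a single extreme outlier, and then asks two questions: does the correlation survive when the outlier is removed, and how much does it move when the data are resampled with replacement (the same bootstrap sampling idea that underlies bagging)? The helper names, the outlier value, and the resample count are assumptions chosen for illustration, not a prescribed procedure.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # convention for this sketch: a constant series has no correlation
    return sxy / sqrt(sxx * syy)


def bootstrap_interval(x, y, n_resamples=500, seed=0):
    """Approximate 95% percentile interval for the correlation,
    obtained by resampling (x, y) pairs with replacement."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
    estimates.sort()
    return estimates[int(0.025 * n_resamples)], estimates[int(0.975 * n_resamples)]


# Two independent noise series: there is no real relationship to find.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(50)]
y = [rng.gauss(0, 1) for _ in range(50)]

# One extreme observation (hypothetical value) appended to both series.
x.append(10.0)
y.append(10.0)

r_all = pearson(x, y)                 # correlation including the outlier
r_trimmed = pearson(x[:-1], y[:-1])   # correlation with the outlier removed
low, high = bootstrap_interval(x, y)  # spread of the estimate under resampling

print(f"with outlier:    r = {r_all:.2f}")
print(f"without outlier: r = {r_trimmed:.2f}")
print(f"bootstrap 95% interval: [{low:.2f}, {high:.2f}]")
```

A correlation that collapses when one observation is dropped, or whose resampled estimates swing widely, is a candidate for the kind of pattern that does not hold up under alternate samples.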
When applied consistently in your work as a data scientist, these approaches can help ensure that the patterns you’re revealing actually reflect the domain you’re modeling. Without these in your methodological kit, you can’t be confident that the correlations you’re seeing won’t vanish the next time you run your statistical model.