Big data's pitfall: Answers that are clear, compelling, and wrong

Doing big data right takes sophisticated techniques to ensure ad hoc results are reliable

Page 2 of 3

Then there's the most obvious challenge: Making sure the analyses do what the analyst thinks they're doing. Traditional reports draw on data extracted from carefully designed data stores by professionals who understand how it's all structured. The reports are programmed by IT and distributed to users. The most important ones are subjected to independent audits to make sure they do what they're supposed to.

The question of how to perform quality assurance on big data analytics hasn't yet been answered. While there are ways to make sure the results are reliable, making use of them is labor-intensive and time-consuming.

One alternative, to illustrate: Extract a random sample of data, small enough to analyze one row at a time if need be and big enough to produce statistically meaningful results. Load it into Excel. Save the query. Perform your analysis semi-manually using Excel. Triple-check the answer, and have a friend check it too. This is just like programming: It's easy to miss your own mistakes. Spotting the mistakes made by others is much easier.

Next: With the same query, run the same analysis using your BI tool. If it gives you the same answer, you just might have put it together right.
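
The cross-check described above can be sketched in a few lines of Python. Treat it as a minimal illustration under stated assumptions, not a prescription: the rows are synthetic, the field name cycle_days and every function name are hypothetical, and plain Python stands in for both the Excel pass and the BI tool.

```python
import math
import random
import statistics


def draw_sample(rows, sample_size, seed):
    """Draw a reproducible random sample.

    The fixed seed plays the role of "saving the query": anyone can
    re-extract exactly the same rows later.
    """
    rng = random.Random(seed)
    return rng.sample(rows, sample_size)


def average_row_by_row(sample, field):
    """The semi-manual path: walk the sample one row at a time."""
    total = 0.0
    count = 0
    for row in sample:
        total += row[field]
        count += 1
    return total / count


def average_aggregated(sample, field):
    """Stand-in for the BI tool: one library call over the same sample."""
    return statistics.fmean(row[field] for row in sample)


def results_agree(first, second, tolerance=1e-9):
    """Compare the two answers, allowing for floating-point rounding."""
    return math.isclose(first, second, rel_tol=tolerance)


# Synthetic data for illustration only; "cycle_days" is a made-up metric.
data_rng = random.Random(7)
rows = [{"id": i, "cycle_days": data_rng.uniform(1.0, 30.0)} for i in range(10_000)]

sample = draw_sample(rows, 200, seed=42)
manual_answer = average_row_by_row(sample, "cycle_days")
tool_answer = average_aggregated(sample, "cycle_days")

if results_agree(manual_answer, tool_answer):
    print("Both paths agree on the sample.")
else:
    print("Mismatch: check the query and the analysis before trusting either.")
```

Agreement only shows that the two paths compute the same thing from the same rows. It says nothing about whether the query pulled the right rows in the first place, which is why a second pair of eyes still matters.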

This isn't, by the way, purely theoretical. Not that many years ago a client relied on a BI report to determine whether a new process it was piloting was an improvement over the old one.

According to the report, it wasn't, leading the team to hypothesize a number of different root causes. After two months of chasing their tails, it turned out the initial database query that provided the data being analyzed was inappropriate for the use to which it was being put, resulting in systematically wrong results. The new process was fine. The process metrics were faulty.

Big data danger No. 2: Spurious correlations

Logicians know that correlation doesn't prove causation. Statisticians know that when no real relationship exists, about one out of every 20 correlations tested will still come out significant at the 0.05 level, purely through random chance. They also know that every analysis to which data are subjected reduces the degrees of freedom by one. (So I'm told -- I'm not a professional statistician. If you have analysts mining big data who don't understand this principle any better than I do, it's time to send them to statistics school.)
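
The role of random chance is easy to see in a small simulation. The sketch below correlates pairs of columns of pure noise and counts how often the result clears the 0.05 bar. Everything in it is an assumption made for illustration: the function names, the sample sizes, and the hard-coded critical t value of about 1.984, which is the approximate two-sided 0.05 cutoff for 98 degrees of freedom (100 rows).

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


def false_positive_rate(n_tests=2000, n_rows=100, t_critical=1.984, seed=0):
    """Share of noise-vs-noise correlations that look "significant".

    t_critical is the approximate two-sided 0.05 cutoff for n_rows - 2 = 98
    degrees of freedom; change it if you change n_rows.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        # Two unrelated columns: any correlation between them is chance.
        xs = [rng.gauss(0.0, 1.0) for _ in range(n_rows)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n_rows)]
        r = pearson_r(xs, ys)
        t_statistic = r * math.sqrt((n_rows - 2) / (1.0 - r * r))
        if abs(t_statistic) > t_critical:
            hits += 1
    return hits / n_tests


rate = false_positive_rate()
print(f"Share of pure-noise correlations flagged as significant: {rate:.3f}")
```

The share should land near 0.05. The practical consequence: run 20 independent tests on unrelated data and the chance of at least one "significant" hit is 1 - 0.95^20, or roughly 64 percent.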

Back in college, when I took Psych 101, I learned there's a statistically significant correlation between the length of a person's big toe and their intelligence quotient. While there's a thin possibility a causal relationship exists -- perhaps genetic pleiotropy -- most cognitive psychologists ignore this correlation as an unimportant anomaly.

The question is this: After mining your company's data, would your executives be wise enough not to include toe measurements in the company's applicant screening program, right alongside drug testing? If so, could they do this without descending into gut-trusting? That's a fine line to tread, but fail to tread it properly and the possibility of very bad "data-driven decisions" is significant.
