Big data's pitfall: Answers that are clear, compelling, and wrong

Doing big data right takes sophisticated techniques to ensure ad hoc results are reliable

The cloud's dark secret is integration -- most implementations don't include it, whether IaaS, PaaS, or SaaS. Big data has its own dark secret, summed up in a line usually attributed to Mark Twain: It ain't what you don't know that gets you into trouble. It's what you know for sure that just ain't so.

Big data will make decision makers more certain. Making them more right? That's more doubtful.


There are three pieces to this puzzle, even beyond the big data cultural challenges mentioned last week: quality assurance, spurious correlations, and the well-known but often-ignored challenges associated with statistically analyzing data that weren't collected with the analyses in mind. Each presents grave dangers to a company's decision-making health.

Big data danger No. 1: Quality assurance

Just because someone has access to a database and a bunch of BI tools with which to mine it doesn't mean they know what they're doing. Even with all of the right technical skills, it's easy to generate convincing-looking statistics that are wrong.

Start with how easy it is to misunderstand what kind of data you have. Part of the point of big data technologies is that they're more flexible and take less up-front planning and analysis than old-school IT reports or even not-quite-so-old-school data warehouses.

In the pre-big-data era, IT delivered carefully constructed data views to users and user analysts, reducing the risk of misunderstanding the data being analyzed. With big data, this responsibility is shifting away from IT at the same time data structuring is becoming more ad hoc and therefore easier to get wrong. Get this wrong and analysts will start their work with data sets that are, in one way or another, problematic -- for example, by having an invisible systematic bias. No amount of downstream analysis will fix this.
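To make "invisible systematic bias" concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the channels, the order values, and the assumption that an analyst's extract happens to contain web orders only. Nothing in the extract itself reveals that store and phone orders are missing.

```python
# Hypothetical order records: (channel, order_value). All values are invented.
orders = [
    ("web", 120.0), ("web", 95.0), ("web", 140.0), ("web", 110.0),
    ("store", 40.0), ("store", 55.0), ("store", 35.0), ("store", 60.0),
    ("phone", 70.0), ("phone", 65.0),
]


def mean_order_value(rows):
    """Average order value across the given (channel, value) rows."""
    values = [value for _, value in rows]
    return sum(values) / len(values)


# Assumption: the ad hoc extract was built from the web system only,
# so every other channel is silently absent.
web_only_extract = [row for row in orders if row[0] == "web"]

true_mean = mean_order_value(orders)
extract_mean = mean_order_value(web_only_extract)

print(f"all channels: {true_mean:.2f}")
print(f"extract only: {extract_mean:.2f}")
```

The arithmetic on the extract is flawless, and the answer is still wrong for the business as a whole. More sophisticated statistics applied to the same extract would inherit the same bias.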

A second issue is data quality. Statisticians apply a number of tests to their data sets to make sure they're suitable for the intended analyses. If no one in your company knows how to use words like "heteroskedasticity" and "stationarity," you should probably hire someone who does before you use big data to make any big decisions.
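For readers who haven't met the term: heteroskedasticity means the scatter around a fitted line isn't constant across the data. The sketch below is a crude illustration in Python, loosely in the spirit of the split-sample variance comparisons statisticians use. It is not a formal test, and the residuals are invented so that their spread grows with the predictor.

```python
from statistics import pvariance

# Hypothetical residuals from some fitted model, ordered by the predictor x.
# They are constructed (not measured) so the spread grows as x grows.
xs = list(range(1, 21))
residuals = [(-1) ** x * 0.1 * x for x in xs]


def variance_ratio(values):
    """Variance of the second half divided by variance of the first half."""
    half = len(values) // 2
    return pvariance(values[half:]) / pvariance(values[:half])


ratio = variance_ratio(residuals)
print(f"variance ratio (late / early): {ratio:.2f}")
```

A ratio near 1 is what constant variance would look like. A ratio far above 1, as here, is a hint that the usual regression error estimates shouldn't be trusted without a closer look by someone who knows the formal tests.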

Simple example: Some time-series data include both a cyclical component and a linear trend component. Sales data, for example, might include both seasonality and overall growth. Perform a standard regression analysis on this kind of data, and the technical term for the result is "wrong." (In case you care: You need Box-Jenkins analysis for this kind of data.)
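Here is a small sketch of how that goes wrong, in Python. The sales figures are synthetic and noise-free: an assumed growth of 2 units per month plus an assumed 12-month seasonal swing, observed over 18 months. The sketch compares a plain straight-line fit against seasonal differencing, which is one of the preparatory steps used in Box-Jenkins-style modeling. It is not a full Box-Jenkins analysis.

```python
import math

MONTHS = 18              # observation window (assumed)
TRUE_GROWTH = 2.0        # units of sales growth per month (assumed)
SEASON_AMPLITUDE = 40.0  # size of the seasonal swing (assumed)
PERIOD = 12              # months per seasonal cycle

# Synthetic sales: baseline + linear growth + seasonal cycle, no noise.
sales = [
    100.0 + TRUE_GROWTH * t
    - SEASON_AMPLITUDE * math.cos(2 * math.pi * t / PERIOD)
    for t in range(MONTHS)
]


def ols_slope(ys):
    """Least-squares slope of ys regressed on the time index 0, 1, 2, ..."""
    n = len(ys)
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    return sxy / sxx


# Naive approach: fit a straight line and read the slope as "growth".
naive_growth = ols_slope(sales)

# Seasonal differencing: compare each month with the same month a year
# earlier, which cancels the seasonal cycle and leaves 12 months of growth.
seasonal_diffs = [sales[t] - sales[t - PERIOD] for t in range(PERIOD, MONTHS)]
differenced_growth = sum(seasonal_diffs) / len(seasonal_diffs) / PERIOD

print(f"true growth:        {TRUE_GROWTH:.2f}")
print(f"naive regression:   {naive_growth:.2f}")
print(f"seasonal differing: {differenced_growth:.2f}")
```

Because the 18-month window ends partway through a seasonal cycle, the straight-line fit absorbs part of the cycle into its slope and overstates growth, while the year-over-year comparison recovers the assumed rate. Real data adds noise and other complications, which is where the statistician comes in.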

Knowing to test -- and how to test -- for data quality is a key reason smart companies investing in big data are including the cost of professional statisticians and "data scientists" (I'm sure there's a difference; I'm just not sure what it is) in their big data budgets.
