Many people assume that big data means bigger is always better. They tend to approach the "bigger is better" question from one of several philosophical perspectives, which I characterize as follows:
- Faith: This is the notion that, somehow, greater volumes, velocities, and/or varieties of data will always deliver fresher insights, which amounts to the core value of big data analytics. If we're unable to find those insights, according to this perspective, it's only because we're not trying hard enough, we're not smart enough, or we're not using the right tools and approaches.
- Fetish: This is the notion that the sheer bigness of data is a value in its own right, regardless of whether we're deriving any specific insights from it. If we're evaluating the utility of big data solely on the specific business applications it supports, according to this outlook, we're not in tune with the modern need of data scientists to store data indiscriminately in data lakes to support future explorations.
- Burden: This is the notion that the bigness of data is not necessarily better or worse, but it is simply a fact of life that has the unfortunate consequence of straining the storage and processing capacity of existing databases, thereby necessitating new platforms (such as Hadoop). If we're not able to keep up with all this burdensome new data, or so this perspective leads us to believe, the core business imperative is to change over to a new type of database.
- Opportunity: This is, in my opinion, the right approach to big data. It's focused on extracting unprecedented insights more effectively and efficiently as the data scales to new heights, streams in faster, and originates in an ever-growing range of sources and formats. It doesn't treat big data as a faith or fetish, because it acknowledges that many differentiated insights can continue to be discovered at lower scales. It doesn't treat data's scale as a burden, either, but as simply a challenge to be addressed effectively through new database platforms, tooling, and practices.
Last year, I blogged on the hardcore use cases for big data in a discussion that was exclusively on the "opportunity" side of the equation. Later in the year, I observed that big data's core "bigness" value derives from the ability of incremental content to reveal incremental context. More context is better than less when what you're doing is analyzing data in order to ascertain its full significance. Likewise, more content is better than less when you're trying to identify all of the variables, relationships, and patterns in your problem domain to a finer degree of granularity. The bottom line: More context plus more content usually equals more data.
Big data's value also lies in its ability to correct errors that are more likely to crop up at smaller scales. In that same post, I cited a third party who observed that data scientists with less data in their training sets are susceptible to several modeling risks. For starters, at smaller scales you're more likely to overlook key predictive variables. You are also more likely to skew the model toward nonrepresentative samples. In addition, you're more likely to find spurious correlations that would disappear if you had more complete data sets revealing the underlying relationships at work.
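To make the spurious-correlation risk concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a result from the post cited above: the target and all 50 candidate features are independent random noise, so any correlation the code finds is spurious by construction. The function names, the feature count, and the sample sizes are hypothetical choices made for this example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_spurious_correlation(n_rows, n_features=50, seed=0):
    """Largest |correlation| between a noise target and pure-noise features.

    Assumption: every column is independent Gaussian noise, so the true
    correlation is zero and anything reported here is a sampling artifact.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(feature, target)))
    return best


small = strongest_spurious_correlation(n_rows=20)
large = strongest_spurious_correlation(n_rows=5000)
print(f"strongest spurious |r| with 20 rows:   {small:.3f}")
print(f"strongest spurious |r| with 5000 rows: {large:.3f}")
```

With only 20 rows, at least one of the noise features will typically look meaningfully correlated with the target. With 5,000 rows, the strongest apparent correlation shrinks toward zero. This covers only the sampling-noise side of the argument. More rows do not, by themselves, fix a nonrepresentative sample or supply a predictive variable that was never collected.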