In the pecking order of big data sexiness, production applications and exploratory data science sandboxes get all the attention. That's why, if you're not a big data pro, you've probably ignored the critical role of big data in the "landing area" of your data management and analytics infrastructure.
The role of a big data landing area is deliberately vague. It's clearly not the production front-end access and sandboxing layer where you run your fast queries, do your interactive exploration, and build and score your predictive models. It's clearly not the production hub layer, where you store your core system-of-reference data, manage metadata, and enforce data governance standards.
But in many ways, the big data landing area is the foundation for these production and development systems.
Your big data landing area serves any of several critical roles. It might be where you acquire and collect data sourced from operational systems, prior to delivering it to other operational systems downstream. It might be where you aggregate, match, merge, cleanse, transform, and enhance data acquired from sources, prior to delivering it to hubs or front-end marts. Or it might be where data that originated in any or all of these operational systems -- transactional, analytic, or content management -- spends the rest of its days in a historical archive.
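The aggregate-match-merge-cleanse-enhance role can be pictured as a small pipeline. The sketch below is purely illustrative: the source names (CRM and billing), field names, and matching key are hypothetical stand-ins, not anything prescribed by the article.

```python
# Hypothetical landing-area step: match records from two operational sources,
# cleanse them, and enhance them before delivery downstream.

def cleanse(record):
    """Normalize obvious inconsistencies in a raw source record."""
    return {
        "customer_id": str(record["customer_id"]).strip(),
        "email": (record.get("email") or "").strip().lower() or None,
    }

def merge(crm_records, billing_records):
    """Match records from two sources on customer_id and merge them."""
    billing_by_id = {str(r["customer_id"]).strip(): r for r in billing_records}
    merged = []
    for raw in crm_records:
        rec = cleanse(raw)
        billing = billing_by_id.get(rec["customer_id"], {})
        rec["balance"] = billing.get("balance", 0.0)  # enhance with billing data
        merged.append(rec)
    return merged

crm = [{"customer_id": " 42 ", "email": " Ada@Example.COM "}]
billing = [{"customer_id": "42", "balance": 19.95}]
print(merge(crm, billing))
```

In a real landing area this logic would run in an ETL tool or a distributed framework rather than plain Python, but the shape of the work is the same.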
Diving into the archive
Let's focus on the archive, which is where big data lives when it's no longer needed to support core production applications but still has value for compliance, e-discovery, security, diagnostics, and other support applications.
The traditional definition of an archive is that of a repository of historical data no longer required by operational applications. By this definition, it's obvious that many archives inexorably evolve into the biggest big data platform in many IT shops.
Archives may in fact be the first database in your organization to achieve big data status, in terms of growing to petabytes and storing heterogeneous information from a wide variety of sources. Because an archive's purpose is to persist historical data for as-needed retrieval and analysis, it must be optimized for fast query, search, and reporting.
In fact, queryable archiving has been a big data killer app for a good while. Telcos have long done call-detail record analysis on massively scalable archival platforms. Security incident and event monitoring, as well as antifraud applications, often demand huge databases that persist and correlate event data pulled from system-level security, identity, and other systems. Many IT log analysis applications -- for troubleshooting, diagnostics, and optimization -- run on databases that scale from the low terabytes into multipetabyte territory. Comprehensive time-series analysis of customer, inventory, logistics, and other trends must correlate large amounts of archival data with the most recent data provided by operational systems.
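The time-series pattern described above amounts to unioning long-term history with fresh operational data so a single trend query spans both tiers. The sketch below is a toy, assumption-laden illustration of that idea; the tier names and counts are invented, and real systems would do this with a federated or partitioned query engine.

```python
# Hypothetical two-tier trend query: archived history plus recent
# operational data, answered as one ordered time series.
from datetime import date

archive = {date(2012, 1, 1): 120, date(2012, 1, 2): 135}  # long-term archival tier
operational = {date(2012, 1, 3): 150}                     # recent operational tier

def daily_trend(start, end):
    """Answer a trend query across both tiers, in date order."""
    combined = {**archive, **operational}  # operational wins on overlap
    return [(d, v) for d, v in sorted(combined.items()) if start <= d <= end]

print(daily_trend(date(2012, 1, 1), date(2012, 1, 3)))
```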
The right tool for the job
Clearly, the queryable archive is a natural role for data-at-rest platforms such as Hadoop. But the same could be said for various NoSQL platforms, provided they've been architected to archive particular types of data at scale and speed. Likewise, don't count out your RDBMS for queryable archiving of structured data.
Depending on your requirements, you might deploy one or multiple archives for different big data sets, with different underlying platforms optimized for each. Whatever you decide to do, the key criterion is whether your big data platform(s), deployed as archives, support fast execution of all the expected types of queries that might be performed against all the data stored and managed there.
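One way to picture the multi-archive deployment is as a routing layer that dispatches each query to whichever platform is optimized for that data set. The sketch below is a hypothetical illustration only: the backend names, data sets, and lookup interface are invented for the example.

```python
# Hypothetical routing layer over multiple archives, each backed by a
# platform chosen for its data set (names are illustrative).
class Archive:
    def __init__(self, platform, store):
        self.platform = platform  # e.g., Hadoop, a NoSQL store, an RDBMS
        self.store = store

    def query(self, key):
        return self.store.get(key)

routes = {
    "call_records": Archive("hadoop_cdr", {"CDR-1": "3 min call"}),
    "security_events": Archive("nosql_events", {"EVT-9": "login alert"}),
}

def query_archive(data_set, key):
    """Dispatch a lookup to the archive that holds the given data set."""
    return routes[data_set].query(key)

print(query_archive("security_events", "EVT-9"))
```

The design point is simply that the front end needn't care which platform answers, so each archive can be tuned for its own query patterns.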