Big data in the cloud has so many potential functional service layers sprawling across so many nodes, clusters, and tiers that it's easy to feel overwhelmed.
Take a deep breath. Your first step should be to plan a comprehensive cloud data virtualization infrastructure. Virtualized cloud analytics is the big data paradigm for the new era. As an integration approach, it ensures unified access, modeling, deployment, optimization, and management of big data as a heterogeneous resource.
Data virtualization, like any virtualization, is an approach that allows you to access, administer, and optimize a heterogeneous infrastructure as if it were a single, logically unified resource. This enables you to abstract the external interface from the internal implementation of some service, functionality, or other resource.
Data virtualization's centerpiece is an abstraction layer, such as any of the SQL-virtualization approaches that support logically unified access, query, reporting, predictive analytics, and other applications against disparate back-end data repositories: relational, Hadoop, NoSQL, and so forth. Of course, data virtualization may in turn rely on other layers of infrastructure virtualization, such as storage and server platforms, in some cases spread across geographic locations and multiple cloud environments.
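To make the abstraction-layer idea concrete, here is a minimal Python sketch of a logical catalog that routes one query interface to two different kinds of back end. Everything in it is a hypothetical illustration rather than any product's API: the `VirtualCatalog`, `RelationalBackend`, and `DocumentBackend` names, the `orders` and `clicks` datasets, and the equality-only filter are all assumptions. An in-memory SQLite database stands in for a relational store and a plain dictionary stands in for a NoSQL-style document store.

```python
import sqlite3
from typing import Any, Dict, List, Protocol, Tuple


class Backend(Protocol):
    """Hypothetical adapter contract: return a dataset's rows as dicts."""

    def rows(self, dataset: str) -> List[Dict[str, Any]]: ...


class RelationalBackend:
    """Adapter over a SQL database (SQLite stands in for any RDBMS)."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.conn.row_factory = sqlite3.Row  # rows become name-addressable

    def rows(self, dataset: str) -> List[Dict[str, Any]]:
        # Sketch only: the table name is interpolated directly, so a real
        # layer would validate it against a known schema first.
        cursor = self.conn.execute(f'SELECT * FROM "{dataset}"')
        return [dict(row) for row in cursor]


class DocumentBackend:
    """Adapter over an in-memory document store (a stand-in for NoSQL)."""

    def __init__(self, collections: Dict[str, List[Dict[str, Any]]]) -> None:
        self.collections = collections

    def rows(self, dataset: str) -> List[Dict[str, Any]]:
        return [dict(doc) for doc in self.collections[dataset]]


class VirtualCatalog:
    """Maps logical dataset names to (backend, physical name) pairs."""

    def __init__(self) -> None:
        self._routes: Dict[str, Tuple[Backend, str]] = {}

    def register(self, logical: str, backend: Backend, physical: str) -> None:
        self._routes[logical] = (backend, physical)

    def query(self, logical: str, **equals: Any) -> List[Dict[str, Any]]:
        # Callers see only the logical name; the back end stays hidden.
        backend, physical = self._routes[logical]
        return [
            row for row in backend.rows(physical)
            if all(row.get(key) == value for key, value in equals.items())
        ]


# Demo data: one relational table and one document collection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "east"), (2, "west")])

catalog = VirtualCatalog()
catalog.register("orders", RelationalBackend(conn), "orders")
catalog.register(
    "clicks",
    DocumentBackend({"click_events": [{"user": "a", "region": "east"}]}),
    "click_events",
)

east_orders = catalog.query("orders", region="east")
east_clicks = catalog.query("clicks", region="east")
```

Because callers address only the logical names, re-registering `orders` against a different adapter would change where the data lives without changing any calling code. A production layer would also push filters down to each back end instead of filtering in memory, which this sketch deliberately skips.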
However many layers you're discussing, virtualization is the epitome of unsexy data topics. But it's fundamental if you want your big data cloud platform to address the following business imperatives:
- An advanced-analytic resource of elastic, fluid topology
- An all-consuming resource that ingests information originating in any source, format, and schema
- A latency-agile resource that persists, aggregates, and processes any dynamic mix of at-rest and in-motion information
- A federated resource that sprawls within and across value chains, spanning both private and public clouds
- A seamlessly interoperable resource that lets you change, scale, and evolve back-end data platforms without breaking existing tools and applications
Yes, that's a tall order. Clearly, data virtualization and its virtualized underpinnings are much easier to talk about than to do. Plus, none of it is cheap to implement, administer, or optimize.
Cloud-based big data will require virtualized infrastructures of growing complexity. It's no surprise that most data professionals approach this messy topic in much the same way that astronomers attempt to map the universe's dark matter. They know it's an essential, albeit tedious, chore. Truth be told, big data professionals would much prefer to point their strategic telescopes toward the sexy orbs -- Hadoop, NoSQL, and so on -- that shine brightest in the new technology firmament.
As the range of your cloudy big data applications grows, you'll almost certainly have to go further down the virtualization path. The stubborn heterogeneity of hybridized big data clouds will push you in that direction. Within your private clouds, constant big data platform churn will require a virtualization fabric that bridges new approaches with your legacy investments. Churn will stem from your ongoing platform modernization and migration efforts, from your need to incorporate innovative, fit-for-purpose platforms into your cloud, and from vendors' product-enhancement cycles. Unless you put all of your big data initiatives on a "one size fits all" public cloud service, you'll need to virtualize access to public, private, and hybrid cloud architectures in various shifting combinations.
Clearly, the extent to which you'll go the data-virtualization route will depend on the complexity of your business requirements and big data environment. It will also depend on your tolerance for risk, complexity, and headaches.
In the coming years, as more complex analytic models, rules, and information converge on the big data cloud, that platform will become a centerpiece of virtualized access, execution, and administration. Within this new world, MapReduce will be a key (but not the only) development framework. It will form part of a broader, but still largely undefined, virtualization architecture for inline analytics and transactional computing.
Nobody has yet stepped forward to outline the layers, interfaces, and abstractions that will glue the cloud big data universe together from end to end. That's yet another tall order.