Cloud is everybody's big, white, fluffy field of dreams. When somebody says their big data strategy is to "put it all into the cloud," you can't be sure whether they're a visionary or simply repeating something a guru told them at an industry conference.
The practical overlap between the big data and cloud paradigms is so extensive that you could validly claim you're doing cloud-based big data with an existing on-premises Hadoop, NoSQL, or enterprise data warehousing environment. Keep in mind that cloud is widely understood to include "private" deployments in addition to or in lieu of public cloud, SaaS, and multitenant hosted environments.
But if you limit your practical definition of "cloud" to public subscription services, you can get to the heart of the issue: identifying which big data applications are better suited for public cloud/SaaS deployments versus on-premises deployments (such as those involving pre-optimized hardware appliances or virtualized server clusters).
Put another way: When can you boost the scalability, elasticity, performance, cost-effectiveness, reliability, and manageability of big data by letting an external service provider manage it for you? Here are several clear use cases for big data in public clouds.
Enterprise applications already hosted in the cloud: If, like many organizations -- especially small and midmarket businesses -- you use cloud-based applications from an external service provider, much of your source transactional data is already in a public cloud. If you have deep historical data on that cloud platform, it might already have accumulated in big data magnitudes. To the extent the service provider or one of its partners offers a value-added analytics service -- such as churn analysis, marketing optimization, or off-site backup and archiving of customer data -- it might make sense to leverage that rather than host it all in-house.
High-volume external data sources that require considerable preprocessing: If, for example, you're doing customer sentiment monitoring on aggregated feeds of social media data, you probably don't have the server, storage, or bandwidth capacity in-house to do it justice. That's a clear example of an application where you'd want to leverage the social media filtering service provided by a public-cloud-based, big-data-powered service.
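To make the preprocessing step concrete, here is a minimal sketch of the kind of filtering such a service performs: ingest raw posts, discard irrelevant volume early, and tag what remains with a sentiment score. A real cloud service would use trained models over distributed infrastructure; the keyword lists and the `sentiment_filter` function below are illustrative assumptions only.

```python
# Toy sentiment preprocessing: drop posts that don't mention the brand,
# then score the rest by counting positive vs. negative words.
# (Assumed keyword lists; real services use trained models at scale.)

POSITIVE = {"love", "great", "excellent", "happy"}
NEGATIVE = {"hate", "terrible", "awful", "angry"}

def sentiment_filter(posts, brand):
    """Keep only posts mentioning the brand, scored by naive word counts."""
    results = []
    for post in posts:
        words = post.lower().split()
        if brand.lower() not in words:
            continue  # preprocessing: shed irrelevant volume before analysis
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        results.append({"post": post, "score": score})
    return results

feed = [
    "I love the new Acme phone",
    "Totally unrelated chatter",
    "Acme support was terrible and I am angry",
]
filtered = sentiment_filter(feed, "Acme")
```

The point of the example is the shape of the workload, not the scoring logic: most of the raw feed is discarded before any analysis runs, which is exactly the bandwidth- and compute-heavy step you'd rather leave to the provider.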
Tactical applications beyond your on-premises, big data capabilities: If you already have an on-premises big data platform dedicated to one application (such as a dedicated Hadoop cluster for high-volume ETL on unstructured data sources), it might make sense to use a public cloud to address new applications (say, multichannel marketing, social media analytics, geospatial analytics, query-able archiving, elastic data-science sandboxing) for which the current platform is unsuited or for which an as-needed, on-demand service is more robust or cost-effective. In fact, a public cloud offering might be the only feasible option if you need petabyte-scale, streaming, multistructured, big data capability ASAP.
Elastic provisioning of very large but short-lived analytic sandboxes: If you have a short-turnaround, short-term data science project that requires an exploratory data mart (aka sandbox) that's an order-of-magnitude larger than the norm, the cloud may be your only feasible or affordable option. You can quickly spin up cloud-based storage and processing power for the duration of the project, then just as rapidly deprovision it all when the project is over. I call this the "bubble mart" deployment model, and it's tailor-made for the cloud.
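The bubble-mart lifecycle -- spin up, analyze, tear down -- maps naturally onto a context-manager pattern. The sketch below models that lifecycle; the `provision_sandbox` and `deprovision_sandbox` functions are hypothetical stand-ins for whatever cluster-creation and teardown calls your cloud provider's SDK actually exposes.

```python
# Illustrative "bubble mart" lifecycle: provision a large analytic sandbox
# on demand, run the project, then deprovision everything -- even if the
# analysis fails partway through.

from contextlib import contextmanager

def provision_sandbox(terabytes, nodes):
    # Hypothetical: in practice this would call a cloud provider's SDK.
    return {"storage_tb": terabytes, "nodes": nodes, "active": True}

def deprovision_sandbox(sandbox):
    sandbox["active"] = False  # release all resources when the project ends

@contextmanager
def bubble_mart(terabytes, nodes):
    sandbox = provision_sandbox(terabytes, nodes)
    try:
        yield sandbox
    finally:
        deprovision_sandbox(sandbox)  # teardown runs unconditionally

with bubble_mart(terabytes=500, nodes=64) as mart:
    assert mart["active"]  # sandbox exists only for the project's duration
```

The design point is the `finally` clause: an elastic sandbox only pays off if deprovisioning is as automatic as provisioning, so teardown should never depend on the analysis completing cleanly.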
If you're already doing any of this, the strategic question on cloud-based big data is not where to start. As cloud-based big data services mature and continue to improve in price-performance, scalability, agility, and manageability, the question will be where to stop. By the end of this decade, as more and more applications and data move to the public cloud, the idea of building and running your own big data deployment may seem as impractical as designing your own servers does today.