At first glance, big data analytics seems to be the perfect sort of workload for the public cloud.
Particularly for batch jobs, the hyperscalable infrastructure of the cloud is ideal -- pay for the powerful server clusters you need while crunching data and stop paying when you’re done. No wonder that after EC2 and S3, one of the first major services AWS added was its Hadoop-based Elastic MapReduce service, followed by Redshift data warehousing a couple of years later.
But there’s a problem: Big data doesn’t like to be moved. The bandwidth to do so costs money, and as analytics veer ever closer to real time, the barrier to keeping cloud data and on-premises data in sync grows higher.
Here’s where the idea of copy data management and virtualization comes in. More companies are looking to the public cloud for backup and DR. So instead of data just sitting in the cloud waiting for disaster so it can be restored on premises, why not use virtual copies of that data for big data analytics or dev and test in the cloud?
For the most part, cloud backup and DR has been a small business proposition -- while large enterprises that want to maintain high availability have created dedicated backup datacenter sites where data is replicated frequently at high cost. In neither case has the data been used for anything except restoration in the event of calamity.
Although still relatively small, Actifio is the best-known company pitching the idea of maintaining a single, continuously updated copy of enterprise data and creating virtual copies for DR, backup, and analytics -- as well as for dev and test in a cloud environment. Founded in 2009, Actifio secured a $100 million round of funding in March 2014, led by Tiger Global Management. Actifio has partnered with IBM, SunGard, and others to provide a platform where a single “golden copy” of the data can be virtualized and leveraged in multiple ways.
Virtual data management addresses a key enterprise pain point. Not only is the volume of enterprise data growing at a ridiculously rapid pace, but data warehousing, Hadoop analytics, and accelerated application development together demand copies of that data, putting an ever greater burden on storage infrastructure. If a single physical copy serves backup/DR purposes, and virtual rather than physical copies serve analytics and dev and test, you can reduce the spend on storage infrastructure -- whether it resides on premises or in the cloud.
It seems to me only a matter of time before AWS, Google, and Microsoft get into the cloud copy data management game as well. Yes, particularly with data subject to regulation, there will be governance issues to worry about. But copy data management in the cloud has tremendous potential, because big data analytics lends itself to public cloud infrastructure and because dev and test is already one of the top uses of the public cloud.
At the same time, although it’s early days, streaming analytics and continuous capture of data from the Internet of things are beginning to take shape. And the general consensus is that the cloud provides the best platform for such a widely distributed architecture.
An interesting question arises from all this: If you’ve already copied your data into the cloud, at what point do you no longer feel the need to keep primary storage on premises? To that degree, copy data management is yet another milestone on the enterprise’s long road to the public cloud.