We’ve all heard the call for data consolidation. After all, enterprise data is strewn all over the place, and combining that sprawl would let us mine it more easily from a single location. The cloud looks like a natural home for the combined data, considering that hosting it there costs a fraction of what it does to host it locally.
We’ve taken this path before: Early data warehouses migrated operational data to a single data store that was structured and combined to enable business intelligence. Huge batch jobs took place every night or every week, rolling up and aggregating the data for this "single source of truth" instance.
But let’s not live in 1996, where that was really the only technically viable option.
These days, we can access distributed data where it lives, rather than relocate it to a common repository with a common data structure that sometimes destroys detail from the original. Today, if you have dozens of operational data stores, you can access them as if they were one consolidated database, even if they use different data models (such as NoSQL versus SQL) or if the data is unstructured.
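To make the idea concrete, here is a minimal sketch of one way to put a single access layer in front of two stores without copying either. Everything in it is an assumption for illustration: the `FederatedView` class, the `customers` table, and the shape of the order documents are hypothetical, an in-memory SQLite database stands in for a relational store, and a list of dictionaries stands in for a NoSQL document store. A real deployment would more likely use a data virtualization or federated query product than hand-written code.

```python
import sqlite3
from typing import Dict, List


def make_sql_store() -> sqlite3.Connection:
    """Build a hypothetical relational store: an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers (id, name, region) VALUES (?, ?, ?)",
        [(1, "Acme", "east"), (2, "Globex", "west")],
    )
    conn.commit()
    return conn


# Hypothetical document store: schemaless records standing in for a NoSQL database.
# Note that the records do not all share the same fields.
DOCUMENT_STORE: List[Dict] = [
    {"customer_id": 1, "order": {"total": 120.0}},
    {"customer_id": 1, "order": {"total": 80.0}},
    {"customer_id": 2, "order": {"total": 50.0}, "note": "rush"},
]


class FederatedView:
    """Presents both stores as one logical source; data stays where it lives."""

    def __init__(self, sql_conn: sqlite3.Connection, documents: List[Dict]) -> None:
        self._sql = sql_conn
        self._docs = documents

    def order_totals_by_customer(self) -> Dict[str, float]:
        """Join customer names (SQL) to order totals (documents) at query time."""
        totals: Dict[str, float] = {}
        rows = self._sql.execute("SELECT id, name FROM customers ORDER BY id")
        for customer_id, name in rows:
            totals[name] = sum(
                doc["order"]["total"]
                for doc in self._docs
                if doc["customer_id"] == customer_id
            )
        return totals


if __name__ == "__main__":
    view = FederatedView(make_sql_store(), DOCUMENT_STORE)
    print(view.order_totals_by_customer())
```

The caller sees one interface and one answer, yet neither store was reshaped or replicated to produce it. The join happens at read time, so each store remains the only copy of its own data.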
In fact, that 1996 approach is a bad one today, causing problems we don’t need to have. The more you copy the data, the more likely the copies are to drift out of sync with their sources and with each other. Copying data also means using more storage and thus spending more money, and you may need to buy more database licenses. Finally, as you try to scale the consolidated database, you’ll find that the complexity of that central database spins out of control.
Distributed data, whether in the cloud or locally hosted, is a scary concept to many in enterprise IT. Working with distributed data does require a great deal of planning as well as a strong understanding of database access and abstraction approaches. Moreover, working with distributed data in the cloud makes things a bit more complex because of the different database technologies used in the cloud versus in traditional data centers.
However, that price is worth paying, especially because the new database technologies are becoming pervasive and you will need to learn them anyway. Ultimately, the distributed approach is cheaper than going back to that 1996 approach.
My fear is that the low cost of the cloud will lead many to keep replicating data as if it were 1996. Please resist that urge.
Instead, get good at distributed data now.