Never put everything in one database basket, even if it's Hadoop

Those who recommend putting everything in a Hadoop data lake forget some obvious lessons of database history

Elephants never forget, they say, though I doubt pachyderms are the savants the proverb has led us to believe. I know a specific elephant -- named Hadoop -- who can't seem to remember the recent history of the EDW (enterprise data warehouse) market upon which it's encroaching. Specifically, some in the Hadoop arena seem to be repeating some aspects of the positioning overreach that long bedeviled that market.

I'm referring to the dubious notion that Hadoop can and should be the central consolidation hub for all your business's analytic data.


For years, before the big data era started in earnest, the EDW arena pushed this "all in one basket" notion. Though the notion of a single-version-of-the-truth repository for all analytical subject domains makes sense in the abstract, few customers saw any compelling need to spend the money, time, and resources to consolidate disparate analytic databases onto a single platform. Many companies have consolidated some core system-of-record data in EDWs, but it's still common everywhere to see enterprises dedicate tactical data warehouses, data marts, operational data stores, OLAP cubes, and other analytic databases for specific regions, lines of business, applications, and users.

Resistance to the concept of a single "enterprise data hub" will endure in the age of Hadoop. In fact, you can read that skepticism in the tone of Loraine Lawson's recent article on an equivalent dream -- that of a Hadoop-centric "data lake." Lawson likens the concept to that of a "Big Rock Candy Mountain," a "data-centered architecture, where distributed computing comes trickling down the rock and they hung the jerk that invented data silos." Citing Edd Dumbill's "data lake" discussion, she says, "And to prove it's more than just a developer's dream, he points out that Google and Facebook developers 'live the dream fully.'"

I don't get the logic of Dumbill's statement. Doesn't pointing to developers confirm it is indeed just a developer's dream? And singling out developers at two Silicon Valley firms doesn't show that this dream lives anywhere else: Google's MapReduce and file-system papers inspired Hadoop, and Facebook was among its earliest large-scale users.

In fact, the zeitgeist among actual users in the big data era has begun to shift toward a "hybrid" deployment model that blends EDW, Hadoop, NoSQL, in-memory, and other data platforms within a heterogeneous, cloud-enabling infrastructure.

Within the context of a hybrid architecture, this "data lake" dream seems to be specific to one big data deployment role: an exploratory "sandbox" that is the data-consolidation and statistical-modeling hub for teams of data scientists who need to sift through petabytes of multi-structured data. Data scientists everywhere are flocking to Hadoop as their all-data "sandbox," as I previously discussed.

I have no quibbles with one aspect of this "data lake" vision: that Hadoop is becoming a key application-development and -execution platform for big data analytics. As I have stated, data scientists are the pivotal application developers in the age of big data, and, as I also have discussed, Hadoop is rapidly evolving into a general-purpose distributed job-execution layer capable of running a wide range of jobs developed in a variety of languages.

But that's not the same as claiming Hadoop will be the only such platform. In fact, every big data platform -- Hadoop, MPP EDWs, NoSQL, in-memory, and streaming, for example -- acts as an application-development and -execution platform. The notion that any one of them will be the entire ocean for all analytics-centric application development is flat-out wrong.

This story, "Never put everything in one database basket, even if it’s Hadoop," was originally published at InfoWorld.com. Read more of Extreme Analytics and follow the latest developments in big data at InfoWorld.com. For the latest developments in business technology news, follow InfoWorld.com on Twitter.
