When you have a big enough hammer, everything begins to look like the same kind of nail.
That's one of the potential problems with Hadoop 2.0, the greatly reworked big data processing framework that's been at the center of a whole storm of developer and end user interest. Cloudera in particular has plans to make it into a hammer for all kinds of nails.
There's no question that Hadoop 2.0 is a major leap over its predecessor. Instead of being a mere batch data processing framework for MapReduce jobs (limited, boring), it's now turned into a general framework for deploying applications across a multi-node system, with MapReduce just being one of the many possible things that can be run across those nodes (flexible, exciting).
Cloudera's clearly excited by the possibilities inherent in such an arrangement. During a keynote presentation at the O'Reilly Strata-Hadoop World conference in New York City this past Tuesday, the company described an "enterprise data hub" powered by Hadoop, one where all manner of data could be funneled in, processed in place, and extracted as needed.
Sounds great, but how feasible is it? Especially given Hadoop's status as the shiny new big data toy on the block? Such a hub may be a long way off for any company that's late to the big data party and has only just now found a place for its multi-mega-terabyte data farms to live. Turning those "silos" (as Cloudera refers to legacy data repositories, with a near-audible sniff) into Hadoop installations isn't trivial.
The single biggest obstacle to making all that happen isn't Hadoop itself, although that's still a fairly major obstacle. In talking with vendors and users alike at Strata-Hadoop, it's clear Hadoop is still seen on all sides as a bucket of parts that needs major lifting and welding to be fully useful.
The most fruitful uses of Hadoop have been through the third parties that turn it into a ready-to-deploy product -- not just Cloudera or its quasi-rival Hortonworks, but cloud providers like Microsoft (a major Hortonworks partner), Amazon, SoftLayer, Rackspace, and just about every other name-brand cloud outfit. And few of them yet offer the kinds of really high-level abstraction we associate with powerful software tools, where the likes of Puppet or Python scripting are options rather than requirements.
The sheer number of moving parts and pointy edges that pop up out of Hadoop, even for smaller deployments, is still intimidating. A panel given by Dan McClary (principal product manager, Oracle) about Oracle building Hadoop appliances shed a lot of light on how much blood has to be shed, even by the likes of Oracle, to make Hadoop into a deliverable product. McClary was fairly sure that over time Hadoop's rough edges would get sanded down by back-pressure from the community and vendors alike, but that time had definitely not arrived yet.
But the single biggest obstacle remains moving apps into Hadoop. The new infrastructure within Hadoop for applications, YARN, is far more open-ended than before, but it isn't trivial to rewrite an application to run there. It's not impossible there could be jury-rigs to accelerate that process -- e.g., some kind of virtualization wrapper that would allow apps to be arbitrarily shoehorned into the framework -- but that's not trivial work either.
Small wonder, then, that a great deal of work right now is being done to make Hadoop play well with existing apps -- connectors, data funnels, and the like. Very little of the discussion I encountered focused on moving existing apps into Hadoop, although few disagreed that it would happen eventually; most of it revolved around taking one's existing analytics and connecting them to Hadoop. There are, I imagine, far more people who want to do that than there are people who want to scrap everything and start over.
That said, the sheer level of bustle at the O'Reilly conference was a tipoff as to how soon that might happen. By this time next year, when the conference moves to the far-larger Javits Convention Center in Manhattan, some of Cloudera's pronouncements may seem a little less wildly optimistic. But until then, the trend right now is toward using Hadoop as a complement to existing big-data systems, not as a forklift upgrade for them.
This story, "Cloudera pitches Hadoop for everything. Really?," was originally published at InfoWorld.com.