Cloudera CEO: We're taking Hadoop beyond MapReduce

In an exclusive interview, the voluble CEO of Cloudera, Mike Olson, holds forth on the company's new Impala project and the boundless potential of Hadoop

From the number of times you've heard the word "Hadoop," you'd think it referred to some magic elixir for making sense of big data. In reality, Hadoop is an open source framework for distributed data storage and processing -- with enormous analytics potential for those who know how to use it.

To demystify Hadoop, and to get a personal perspective from one of the leading lights in the space, IDG Enterprise Chief Content Officer John Gallant and InfoWorld Editor in Chief Eric Knorr turned to Mike Olson, CEO of Cloudera. The hour-long interview, an edited version of which appears below, is part of the ongoing IDG Enterprise CEO Interview Series.

Olson began his career in the '80s and '90s building, and later selling and managing, companies that developed relational database products. In 2000 he became CEO of Sleepycat Software, maker of the open source embedded database engine Berkeley DB, and he negotiated Sleepycat's sale to Oracle in 2006. Olson stayed with Oracle as vice president of embedded technologies for two years; shortly after departing, he "stumbled across" Hadoop. "When I saw how it was being used in the consumer Internet...I got excited and thought there would be an opportunity in traditional enterprises," he says.

As it turned out, three other entrepreneurs -- Christophe Bisciglia (Google), Amr Awadallah (Yahoo), and Jeff Hammerbacher (Facebook) -- all felt inspired to start a Hadoop venture at roughly the same time. "We banded together in the summer of 2008 to create just one, rather than four, such companies," says Olson. A year later, Doug Cutting, co-creator of the Hadoop project itself, joined Cloudera as chief architect.

With an impeccable pedigree, and customers the likes of Chevron, eBay, Monsanto, Morgan Stanley, and Samsung, Cloudera is widely considered the leader among Hadoop pure plays. We began the interview by asking Olson about the relevance of Hadoop to enterprise customers.

Q: Hadoop is coming up in conversation everywhere. What is it our readers should absolutely understand about Hadoop?

A: Hadoop is based on work done at Google, where the original systems were built to index the Web and digest user behavior. Hadoop is the open source version of that technology.

Hadoop consists of two pieces. The first is large, reliable, cheap, scale-out storage. Gang a bunch of Dell or Hewlett-Packard computers together, all of them with local disk, and store as much data as you want. If you need to store a little more data, buy a couple more boxes -- it's really incrementally scalable.

The software that manages that storage -- HDFS, the Hadoop Distributed File System -- is smart. If you've got a thousand computers ganged together, you know for sure one is going to fail. No problem. The software has been designed to watch for a disk that goes south or a server that goes offline. Multiple copies of the data are stored across the cluster and automatically re-replicated, so you can lose individual pieces and not lose data. And as you need to expand your pool, you just add a couple more boxes. You don't need to move all the data to the new machines; it's all automatic.
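To make that a little more concrete, here is a minimal sketch of what storing a file in HDFS looks like from Hadoop's Java FileSystem API. The NameNode address, paths, and replication factor below are illustrative assumptions, not details from the interview.

    // Minimal sketch: put a file into HDFS and ask for three copies of each
    // block. The cluster address and paths are hypothetical placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsPut {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode
            FileSystem fs = FileSystem.get(conf);

            // Copy a local file into the distributed store; HDFS splits it
            // into blocks and spreads those blocks across the cluster.
            Path dst = new Path("/data/weblogs/weblogs.txt");
            fs.copyFromLocalFile(new Path("/tmp/weblogs.txt"), dst);

            // Keep three copies of each block, so a failed disk or server
            // doesn't lose data; HDFS re-replicates lost copies on its own.
            fs.setReplication(dst, (short) 3);
            fs.close();
        }
    }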

What makes Hadoop really interesting is the second component: an engine called MapReduce, the data processing layer. The idea is, you've got all this data spread out among all these machines. In the database world, it used to be that if you wanted to ask a question, you pulled the data off the disk and moved it to the question-answering component. MapReduce is different. You've got data spread across 10 or 100 or 1,000 computers, so when you want to ask a question, you send that question out to all the computers, each one looks at the little fragment of data it's got, and they shoot their answers back. Then you collect the answers at the end.
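The canonical illustration of that model is word count, sketched here with Hadoop's Java MapReduce API (input and output paths come in as command-line arguments; the rest is the stock pattern). Each mapper sees only the fragment of data stored near it and emits partial answers; the reducers collect and sum them.

    // Classic MapReduce word count, lightly condensed. Mappers process local
    // data fragments; the framework routes each word to one reducer, which
    // sums the partial counts -- the scatter/gather model Olson describes.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                // Emit (word, 1) for every word in this machine's fragment.
                for (String word : line.toString().split("\\s+")) {
                    if (!word.isEmpty()) ctx.write(new Text(word), ONE);
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts,
                    Context ctx) throws IOException, InterruptedException {
                // Collect the partial answers shot back by the mappers.
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                ctx.write(word, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }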

At base, Hadoop is large-scale distributed storage and a way to spread computation out, to spread questions out across that storage. You can literally ask a question of a petabyte of data, 1,000 terabytes of data, and get an answer -- now. That's what Hadoop is.

The MapReduce framework can do brute-force data processing -- clean my data, summarize my data, analyze or digest it. It can also run powerful algorithms. Say we want to use machine learning over all of that user behavior in the weblogs. I want to understand what it is, John, that you like to browse on the Yahoo website. Then I want to see what Mike likes to browse on the Yahoo website. To the extent you two like the same things, I want to recommend content for you based on stuff that Mike looked at.
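The recommendation idea Olson is describing is simple user-similarity collaborative filtering. As a toy, in-memory illustration of the overlap computation (not Cloudera code; at web scale this would itself run as MapReduce jobs over the weblogs):

    // Toy sketch of "recommend to John what Mike browsed." Users and topics
    // are made up; real systems compute these overlaps across millions of
    // users with distributed jobs rather than in memory.
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class BrowseOverlap {
        public static void main(String[] args) {
            Set<String> john = new HashSet<>(Arrays.asList("finance", "sports", "tech"));
            Set<String> mike = new HashSet<>(Arrays.asList("finance", "tech", "movies"));

            // Jaccard similarity: shared topics over combined topics.
            Set<String> shared = new HashSet<>(john);
            shared.retainAll(mike);
            Set<String> union = new HashSet<>(john);
            union.addAll(mike);
            double similarity = (double) shared.size() / union.size();

            // If the users look alike, recommend to John what Mike browsed
            // but John hasn't seen yet.
            if (similarity > 0.4) {
                Set<String> recs = new HashSet<>(mike);
                recs.removeAll(john);
                System.out.println("Recommend to John: " + recs);
            }
        }
    }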

The MapReduce framework is really good at very complicated workloads -- machine learning, natural language processing -- that can be pointed at a bunch of different business problems. That's why Hadoop has done so well.

Q: How are people misunderstanding Hadoop?

A: A knock against Hadoop is that not all problems are MapReduce problems. It's this big batch data processing algorithm. It's really heavyweight and you've got to be a specialist to use it. It's batch mode, so you ask a question, you wait a while before you get any answers back. Not all data access is like that.

One misconception about Hadoop has been that HDFS and MapReduce are your only choices. Those are the two pieces that Google built first. But once you've got your data spread out among a thousand machines, you can imagine other things you'd like to do with it. Wouldn't you want to ask interactive-speed questions about it?

With Cloudera Impala, a new open source project that we launched a few weeks back, we've built an interactive-speed, high-performance, distributed query engine. It's actually another example of software originally designed at Google.

The idea is you've got one big pool of storage. Put anything you like in there and you can use the MapReduce framework to do the analytics and the data processing. You can equally now use the Impala framework to ask interactive-speed queries, and you can share data between those two; you can run queries that create results that you then MapReduce. You can use MapReduce to analyze data that you then query. That combination is huge.
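Here is a sketch of what that sharing can look like in practice: files a MapReduce job wrote into HDFS are exposed as a table and queried at interactive speed through Impala. Impala speaks the HiveServer2 protocol, so the standard Hive JDBC driver works against it; the host, port, schema, and paths below are placeholder assumptions, not details from the interview.

    // Hypothetical sketch: query MapReduce output through Impala over JDBC.
    // Assumes the Hive JDBC driver is on the classpath; Impala typically
    // exposes a HiveServer2-compatible endpoint on port 21050.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ImpalaOverMapReduceOutput {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://impala-host:21050/default;auth=noSasl");
                 Statement stmt = conn.createStatement()) {

                // Expose the word-count job's HDFS output as a table. No
                // copying: Impala reads the same files MapReduce wrote.
                stmt.execute(
                    "CREATE EXTERNAL TABLE IF NOT EXISTS wordcounts " +
                    "(word STRING, n BIGINT) " +
                    "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' " +
                    "LOCATION '/data/wordcount-out'");

                // An interactive-speed query over batch-produced data.
                ResultSet rs = stmt.executeQuery(
                    "SELECT word, n FROM wordcounts ORDER BY n DESC LIMIT 10");
                while (rs.next()) {
                    System.out.println(rs.getString("word") + "\t" + rs.getLong("n"));
                }
            }
        }
    }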

What really matters about it isn't that Impala makes SQL queries fast on Hadoop, although that's a big deal that opens up a lot of new use cases. What's really important here is that Hadoop is a general-purpose data storage and processing engine. We expect the community will create new engines that get at that same data in new ways.

Q: Essentially you'll be able to provide a portfolio of ways to look at that giant pool of data once it's pulled together.

A: Indeed. We have an advantage. We have a very large installed base.

Q: How big is the installed base?

A: Many, many thousands of production clusters. The thing about open source software is you never really know the answer to that question. But "considerable" is the answer.

In general, they deploy it in data centers that have other infrastructure. Nobody stands up Hadoop by itself. It's usually next to a relational database and maybe in service of a document system, so all these pieces are together. We can watch what they do. We talk to them. We ask them, basically: what did you bolt onto the side of your Hadoop cluster to solve a problem?

We saw the following issue with the installed base over and over again: They loved MapReduce because it let them do a bunch of stuff, but they'd run all these analytic jobs and then blast the results into a MySQL database or a relational database from one of the big vendors, just so they could get interactive queries out of it. Ah, we thought -- what if we could put that capability in there? We can look at what the installed base is doing right now for those other things nailed to the side of a Hadoop cluster.

We'll be rolling out more of these engines in the coming quarters. Again, I don't think we'll be the only company, but I will say we recognized, maybe a little ahead of the rest of the market, that there was nothing magical about MapReduce as the one engine, that there would be lots of choices.

Q: When you hear people talking about big data, what is it that you think they're not getting about the concept?

A: That's a loaded question. I'll confess, we've benefited from [the big data hype], but I think we need to be a little bit careful about it too. Big data is a loaded term right now. To hear some people tell it, it's the solution to all problems. If you believe what you read, it's why Obama won the election -- because he had better big data chops than anybody else.

Q: Do you think big data played a part?

A: Look, I think good analytics over large volumes of data really did change the kinds of decisions that certain groups made and that translated into an advantage. But let's be honest, big data won't make you any taller, it won't make you any younger, it won't make you any prettier. It's a hugely valuable asset if approached in the right way, if used to solve a real problem. But Hadoop isn't some pixie dust that's going to drive your profits up just because it's Hadoop.

Big data has a few properties. You've probably heard "volume, variety, velocity." You can have a lot of data, you can have a lot of different kinds of data, you can have data showing up real fast. Any one of those can be trouble, right? If you've got two or more of those problems -- a huge amount of highly variable data or wicked amounts of data showing up at a furious clip -- then Hadoop is really the only choice on the market. It's the only thing that was designed to scale in all three of those ways.

It's a misconception to think that if I don't have many, many terabytes, then Hadoop's not useful to me. [If you have] modest amounts of data that are highly variable and you might want to combine them into a single analysis, well, MapReduce sings at that stuff. The key is to understand the question you want to answer, the problem you want to solve with the data, and then to be sure that you've got the data that does that.

One example I really like is a customer of ours named Explorys Medical. Imagine you get sick and go to the doctor. You see the doc, he looks down your throat, he takes your pulse, he records a bunch of stats about you, maybe writes a prescription. You go home, a week goes by, and you're not feeling any better. You call him up. Okay, we're going to bring you back, send you to the lab, get some more tests done, maybe some imaging, maybe some blood work, change your prescription a little bit. Ah, still not better? Let's go see a specialist.

You know, if you're ill and it takes a while to get better, the trajectory of your treatment can span a few weeks and can have lots of different data types in it. All by itself, if you capture all that information, that would be a good picture of your treatment -- how it progressed, how you finally got better. Imagine, though, that it wasn't just you. Imagine if you could do that for all the patients in a hospital. Now you can start to reason. Dr. Jones seems to have a pretty good outcome with patients with this condition, or this drug seems to be pretty effective with men, not women.

Explorys gangs together that information, not for one hospital but for thousands of hospital clients. Their goal is to recommend effective treatments to doctors that allow hospitals to deliver that care cost-effectively. No matter how you feel about the election we just had, I think we'll all agree that healthcare costs are out of control. Using big data [is the way] to reason about that stuff. Think about it: blood test results and images and doctor prescriptions and blood pressure -- the machine learning required to build a model based on all of that can only happen in Hadoop.

Q: Do you think people have unrealistic expectations of Hadoop? Isn't it more like a platform on which you write applications to get the results you want? Out of a raw Hadoop installation, there's not a lot you can do without people with specialized knowledge and the ability to develop applications.

A: What you say was absolutely true three years ago and it was pretty true one year ago. I'm going to tell you that things are much better today. I won't say that they are all the way better yet, but you can go to MicroStrategy or Informatica or SAS -- big analytic tool providers -- and those folks have now integrated with Cloudera or with Hadoop generally. Business users and analysts can begin to get tooling that doesn't require them to write MapReduce jobs or write Impala queries.

The availability of skills in the market is a gate on adoption of these platforms. Let's be honest, the way we're going to solve this is with software. We're going to see better tools come into existence. In addition to those big established vendors, there are new vendors coming out now, companies like Platfora and Continuity and WibiData. These companies are building tools purpose-designed for Hadoop -- tools that know there are 500 computers under the hood, that know how to do machine learning, and that can visualize complex data in new ways.
