Knorr: But Hadoop is batch processing, not real-time.
Maritz: That's why Hadoop is the beginning, not the end. Hadoop today is HDFS plus MapReduce. In the future it's going to be HDFS with MapReduce plus relational query, plus transactions, plus complex event processing. All of these additional ways of working with data are going to be added on top of the HDFS substrate, and they'll all be pulling information out of and pushing information into this big "data lake" at the bottom, a phrase we've started to hear more and more customers use. The data lake is an important notion, because if there's one commonsense thing about getting value out of data, it's that the more Balkanized your data is, the harder it is to get value out of it.
Knorr: Yeah. Would you also be open to looking at other types of NoSQL data stores, like Cassandra?
Maritz: Yes, absolutely. But what you don't want to have to do is say, "Every time there's a particular view on the data, I have to create a new repository for that, and then Balkanize my data." So we think that this notion of using HDFS as the common substrate on which these data modalities get rebuilt is the direction of the future. So when we put Pivotal together, we very explicitly looked inside the EMC and VMware family and said, "Who knows how to work with a large number of machines over an underlying persistence store?"
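The "common substrate" idea Maritz describes can be illustrated with a toy sketch. Here a local directory of JSON-lines files stands in for the HDFS data lake (Pivotal's actual stack is not shown; the directory, field names, and record values are invented for illustration). Two different "modalities" then read the very same files: a batch-style aggregation and a relational-style filter, with no second copy of the data.

```python
import json
import os
import tempfile
from collections import Counter

# A directory of JSON-lines files stands in for the HDFS "data lake".
# (Invented sample data; real HDFS access would go through an HDFS client.)
lake = tempfile.mkdtemp()
events = [{"user": "a", "amount": 30}, {"user": "b", "amount": 70},
          {"user": "a", "amount": 50}]
with open(os.path.join(lake, "part-0000.jsonl"), "w") as f:
    for e in events:
        f.write(json.dumps(e) + "\n")

def scan(lake_dir):
    """Shared substrate: every engine reads the same underlying files."""
    for name in sorted(os.listdir(lake_dir)):
        with open(os.path.join(lake_dir, name)) as f:
            for line in f:
                yield json.loads(line)

# Modality 1: batch aggregation (a MapReduce-style rollup per user).
totals = Counter()
for rec in scan(lake):
    totals[rec["user"]] += rec["amount"]

# Modality 2: a relational-style filter over the very same files.
big_spenders = [r for r in scan(lake) if r["amount"] > 40]

print(totals)        # Counter({'a': 80, 'b': 70})
print(big_spenders)  # [{'user': 'b', 'amount': 70}, {'user': 'a', 'amount': 50}]
```

The point of the sketch is that neither "engine" required Balkanizing the data into its own repository; both views are computed from one store.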
It turns out we had two teams that had been living in and watching that movie for some time. One is the Greenplum team, because what they did is essentially pull the query processor out of Postgres ... and put back into Postgres a query processor that is parallelized, that knows how to do query execution over a large number of machines working in parallel. They just happened to do it on top of a Postgres substrate. That's what the Greenplum database is.
So we realized that they'd come to this realization themselves, and said, "Hey, if we take that query processor out, then instead of applying it to Postgres, we can apply it to HDFS." That's a hard thing to do, by the way. They spent ten years working on parallel query, which is a notoriously hard problem, but they've gone a long way down that road.
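The core pattern behind parallel query execution of the kind Maritz attributes to Greenplum can be sketched generically as scatter-gather: each worker aggregates only its own data partition, and a coordinator merges the partial results. This is a minimal illustration of the general shared-nothing technique, not Greenplum's actual planner or executor; the partition data and function names are invented.

```python
from concurrent.futures import ThreadPoolExecutor

# Rows are partitioned across "segment" workers, as in a shared-nothing design.
partitions = [
    [("books", 12.0), ("toys", 3.5)],
    [("books", 7.25), ("games", 9.0)],
    [("toys", 1.0), ("games", 2.5)],
]

def local_aggregate(rows):
    """Phase 1: each worker sums prices per category over its own partition only."""
    out = {}
    for category, price in rows:
        out[category] = out.get(category, 0.0) + price
    return out

def merge(partials):
    """Phase 2: the coordinator combines the per-worker partial sums."""
    out = {}
    for partial in partials:
        for category, total in partial.items():
            out[category] = out.get(category, 0.0) + total
    return out

# Run phase 1 in parallel across the partitions, then merge.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(local_aggregate, partitions))

print(merge(partials))  # {'books': 19.25, 'toys': 4.5, 'games': 11.5}
```

The hard part Maritz alludes to is everything this sketch omits: query planning, data redistribution between phases, joins, and fault handling across hundreds of machines.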
So all of a sudden that technology was available to be re-manifested on top of HDFS, which is what Pivotal HD is. The other team that had seen elements of this movie before was GemFire, which works on very high-end event processing and transaction processing. Again, they had problems where the event and transaction rates were too big to handle with a traditional big-iron clustered approach.
Knorr: Huge memory type stuff.
Maritz: Yes. The key thing is a lot of memory space; it's in memory, but scale-out. We're talking about hundreds of machines that you can throw at the problem. We're taking GemFire's expertise in handling large numbers of transactions in memory, scaled out across lots of machines, taking advantage of the cheap capacity that the cloud gives you, and applying that to an HDFS substrate. So we're starting to build out that suite of data modalities on top of HDFS and saying that you can take HDFS and do MapReduce, relational query, transaction processing, and high-speed event ingest as a suite of data capabilities on top of a common underlying substrate.
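The in-memory scale-out idea Maritz describes can be sketched as a partitioned key-value store: each key is routed to one node by hashing, so capacity and ingest throughput grow by adding nodes rather than buying bigger iron. This is a toy illustration of the general partitioning technique, not GemFire's API or architecture; node counts, key names, and helper functions are invented, and real systems add replication and rebalancing.

```python
# Each dict stands in for one node's in-memory storage.
NUM_NODES = 4
nodes = [{} for _ in range(NUM_NODES)]

def node_for(key):
    """Route a key to a node by hashing (invented helper for this sketch)."""
    return hash(key) % NUM_NODES

def put(key, value):
    nodes[node_for(key)][key] = value

def get(key):
    return nodes[node_for(key)].get(key)

# High-rate ingest spreads across all nodes instead of hitting one machine.
for i in range(1000):
    put(f"event-{i}", {"seq": i})

print(get("event-42"))            # {'seq': 42}
print([len(n) for n in nodes])    # entries spread across the four nodes
```

Because each key lives on exactly one node, reads and writes touch a single machine, which is what lets the aggregate transaction rate scale roughly with the node count.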
Knorr: It sounds to me like developing the complex event processing, as you were saying in the context of the Internet of things, is absolutely critical.
Maritz: Yes. And that was clearly one of the reasons why General Electric got interested in us, because they're looking to build a new generation of applications, many of which will have to deal with the Internet of things.