Q&A: Hortonworks CTO unfolds the big data road map

Hortonworks' Scott Gnau talks about Apache Spark vs. Hadoop and data in motion

Hortonworks has built its business on big data and Hadoop, but the Hortonworks Data Platform provides analytics and supports a range of technologies beyond core Hadoop, including MapReduce, Pig, Hive, and Spark. Hortonworks DataFlow, meanwhile, offers streaming analytics built on technologies like Apache NiFi and Kafka.

InfoWorld Executive Editor Doug Dineley and Editor at Large Paul Krill recently spoke with Hortonworks CTO Scott Gnau about how the company sees the data business shaking out, the Spark vs. Hadoop face-off, and Hortonworks' release strategy and efforts to build out the DataFlow platform for data in motion.

InfoWorld: How would you define Hortonworks' present position?

Gnau: We sit in a sweet spot where we want to leverage the community for innovation. At the same time, we also have to be somewhat the adult supervision to make sure that all this new stuff, when it gets integrated, works. That gets to one core belief we have: that we really are responsible for a platform and not just a collection of tech. We've modified the way we bring new releases to market such that we only rebase the core once a year. When I say "rebase the core," that means new HDFS, new YARN. But we will integrate new versions of projects on a quarterly basis. When you rebase the core or bring changes into core Hadoop functionality, there's a lot of interaction with the different projects. There's a lot of testing, and it introduces instability. It's software development 101. It's not that it's bad tech or bad developers; it introduces instability.

InfoWorld: This rebasing event, do you aim to do that at the same time each year?

Gnau: If we do it annually, yes, it will be at the same time each year. That would be the goal. The next target will be in the second half of 2017. In between, up to as frequently as quarterly, we will have nonrebasing releases where we'll either add new projects or add new functionality or newer versions of projects to that core.

How that manifests itself is in a couple of advantages. Number one, we think we can get newer stuff out faster in a way that's more consumable because of the stability it implies for our customers. We also think, conversely, that our customers will be more amenable to staying closer to the latest release because it's very understandable what's in it and what changed.

The example I have for that is our recent 2.5 release; basically, in 2.5 there were only two things we changed: Hive and Spark. That makes it very easy if you think about a customer whose operations staff is running around doing change management. Inside it, for the first time, we allowed customers to choose the new version of Spark or the old version of Spark, or actually run both at the same time. Now if you're running change management, you're saying, "OK, I can install all the new software and default it to run on the old version of Spark, so I don't have to go test anything. Where I have feature functionality that wants to take advantage of the new version of Spark, I can simply have those applications use that version."

InfoWorld: There's been talk that Spark is displacing Hadoop. What's happening as far as Spark versus Hadoop?

Gnau: I don't think it's Spark versus Hadoop; it's Spark and Hadoop. We've been very successful, and a lot of customers have been very successful, down that path. Even in our new release, when the latest version of Spark came out, it was in our distribution within 90 minutes of being published to Git. We're highly committed to Spark as an execution engine for the use cases where it's popular, so we've invested not only in the packaging but also in the contributions and committers we have, and in tools like Apache Zeppelin, which enables data scientists and Spark users to create notebooks and be more efficient about how they share and optimize the algorithms they're writing against those data sets. I don't view it as either/or but more as an "and."

In the end, for business-critical applications that are making a difference and are customer-facing, there is a lot of value behind the platform from a security, operationalization, backup and recovery, business continuity, and all those things that come with a platform. Again, I think the "and" becomes more important than the "or." Spark is really good for some workloads and really horrible for others, so I don't think it's Spark versus the world. I think it's Spark and the world for the use cases where it makes sense.

InfoWorld: Where does it make sense? Obviously you're committed to Hive for SQL. Spark also offers a SQL implementation. Do you make use of that? This space is interesting in that all these platform vendors want to offer every tool for basically every kind of processing.

Gnau: There are Spark vendors that want to offer only Spark.

InfoWorld: That's true. I'm thinking of Cloudera, you, and MapR: the established Hadoop vendors. These platforms have lots of tools, and we'd like to understand which of those tools are being used for what sorts of analytics.

Gnau: Simplistic, interactive workloads on reasonably small sets of data fit Spark. If you get into petabytes, you're not going to be able to buy enough memory to make Spark work effectively. If you get into very sophisticated SQL, it's not going to run. Yes, there are many tools for many things, and ultimately there is that interactive, simplistic, memory-resident, small-data use case that Spark fits. When you start to get to the bleeding edge of any of those parameters, it's going to be less effective, and the goal is to have that work then bleed into Hive.

InfoWorld: How opinionated can you be about your platform, and how free are you to decide that you are no longer going to support a tool or are retiring a tool?

Gnau: The hardest thing any product company can do is retire a product; it's the most horrid thing in the world. I don't know that you will see us retire a whole lot, but maybe there will be things that get put out to pasture. The nice thing is that there is still a live community out there, so even though we may not be focused on driving investment because we're not seeing demand in the market, there will still be a community that can go out and pick up things. I see it more as putting things out to pasture.

InfoWorld: To take one example, Storm is still obviously a core element and I assume that's because you've decided it's a better way to do stream processing than Spark or others.

Gnau: It's not a better way. It provides windowing functions, which are important to a number of use cases. I can imagine a world where you'll write SQL and you'll send that SQL off, and we'll grab it and we'll actually help decide how it should run and where it should run. That's going to be necessary for the thing itself to be sustainable.
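To make "windowing" concrete: a windowing function groups a stream of events into time-bounded buckets (tumbling, sliding, or session windows) so an aggregate can be computed per bucket. The following is a minimal Python sketch of a tumbling (fixed, non-overlapping) window; it illustrates the concept only and is not Storm's actual API:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_secs):
    """Group (timestamp, value) events into fixed, non-overlapping
    windows of window_secs seconds and sum the values in each window."""
    windows = defaultdict(int)
    for ts, value in events:
        # Each event falls into exactly one window, keyed by its start time.
        window_start = ts - (ts % window_secs)
        windows[window_start] += value
    return dict(sorted(windows.items()))

# Clicks at t=1, 3, 61, 62, and 125, counted in 60-second windows.
events = [(1, 1), (3, 1), (61, 1), (62, 1), (125, 1)]
print(tumbling_window_counts(events, 60))  # {0: 2, 60: 2, 120: 1}
```

A real stream processor does the same bucketing continuously and incrementally over unbounded input, which is what makes built-in windowing support valuable.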

There are some capabilities along those lines that we're doing here and there as placeholders, but I think as an industry, if we don't make it simpler to consume, there will be a problem industry-wide, regardless of whether we're smart or Cloudera is smart. It will be an industry problem because it won't be consumable by the masses. It's got to be consumable and easy. We're going to create some tools that will help you decide how you deploy and help you manage, where you can have an application that thinks it's talking to an API, versus having to run Hive for this and HBase for that and having to understand all those different things.

InfoWorld: Can you identify technologies that are emerging that you expect to be in the platform in the coming year or so?

Gnau: The biggest thing that's important is the whole notion of data in motion versus data at rest. When I say "data in motion," I'm not talking about just streaming or just data flow. I'm talking about data that's moving and how you do all of those things: How do you apply complex event processing and simple event processing? How do you guarantee delivery? How do you encrypt and protect, and how do you validate and establish provenance for data in motion? I see that as a huge bucket of opportunity.

Obviously, we made the acquisition of Onyara and released Hortonworks DataFlow based on Apache NiFi. Certainly that's one of the most visible things, but it is not NiFi alone: what you see inside Hortonworks DataFlow includes NiFi, Storm, Kafka, a bunch of components. You'll see us building out DataFlow as a platform for data in motion; we already have invested and will continue to invest along those lines. When I'm out and about and people say, "What do you think about streaming?" I say, well, streaming is a very small subset of the data-in-motion problem. It's an important thing to solve, but we need to think about it as a bigger opportunity, because we don't want to solve just one problem and then have six other problems that prevent us from being successful. That's going to be driven by devices, IoT, all the buzzwords out there.

InfoWorld: In this data-in-motion future, how central or how important is a time series database, a database built to store time series data as opposed to using something else?

Gnau: Time series analytics are important. I would submit that there are a number of ways those analytics can be engineered; a time series database is one of them. I don't know that a specific time series database is required for all the use cases. There may be other ways to get to the same answer, but time series and the temporal nature of data are increasingly important, and I think you will see some successful projects come up along those lines.
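As one illustration of engineering time series analytics without a dedicated time series database, a time-based moving average can be computed over ordered samples in a few lines of plain code. This is a hypothetical sketch of the technique, not tied to any Hortonworks component:

```python
from collections import deque

def sliding_window_avg(samples, horizon_secs):
    """For each (timestamp, value) sample, return the average of all
    values observed within the trailing horizon_secs (inclusive).
    Samples are assumed to be sorted by timestamp."""
    window = deque()   # (timestamp, value) pairs still inside the horizon
    total = 0.0
    averages = []
    for ts, value in samples:
        window.append((ts, value))
        total += value
        # Evict samples that have aged out of the horizon.
        while ts - window[0][0] > horizon_secs:
            _, old = window.popleft()
            total -= old
        averages.append((ts, total / len(window)))
    return averages

readings = [(0, 10.0), (30, 20.0), (90, 30.0)]
# With a 60-second horizon, the t=0 reading has aged out by t=90.
print(sliding_window_avg(readings, 60))  # [(0, 10.0), (30, 15.0), (90, 25.0)]
```

A purpose-built time series database optimizes exactly this kind of temporal windowing, ordering, and retention at scale, which is the trade-off Gnau is pointing at.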


Copyright © 2016 IDG Communications, Inc.
