Storing data in Hadoop generally means a choice between HDFS and Apache HBase. The former is great for high-speed writes and scans; the latter is ideal for random-access queries -- but you can't get both behaviors at once.
Hadoop vendor Cloudera is preparing its own Apache-licensed storage engine for Hadoop: Kudu is said to combine the best of both HDFS and HBase in a single package, and it could make Hadoop into a general-purpose data store with uses far beyond analytics.
Fast writes, fast updates, fast reads, fast everything
Kudu was created as a direct reflection of the applications customers are trying to build in Hadoop, according to Cloudera's director of product marketing, Matt Brandwein.
These applications are typically constructed by organizations that want to "integrate data quickly, data that changes, and access that data very quickly for analytics ... The problem is, today, there isn't a good storage back end for them to do that."
HDFS allows for fast writes and scans, but updates are slow and cumbersome; HBase is fast for updates and inserts, but "bad for analytics," said Brandwein.
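The tradeoff Brandwein describes can be sketched in miniature. The toy classes below are purely illustrative (none of this is Kudu, HDFS, or HBase code): an append-only log scans cheaply but must rewrite everything to change one row, while a keyed store updates one row cheaply but offers no sequential-layout advantage for full scans.

```python
class AppendOnlyLog:
    """HDFS-style sketch: appends and full scans are cheap,
    but there is no in-place update -- changing one row means
    rewriting the whole file."""

    def __init__(self):
        self.rows = []  # list of (key, value) pairs, append-only

    def write(self, row):
        self.rows.append(row)  # cheap sequential append

    def scan(self):
        return list(self.rows)  # cheap sequential read of everything

    def update(self, key, value):
        # Costly: rewrite every row just to change one
        self.rows = [(k, value if k == key else v) for k, v in self.rows]


class KeyedStore:
    """HBase-style sketch: random reads and writes by key are cheap,
    but an analytical scan must visit scattered entries rather than
    reading one sequential layout."""

    def __init__(self):
        self.rows = {}  # key -> value

    def update(self, key, value):
        self.rows[key] = value  # cheap random write / in-place update

    def get(self, key):
        return self.rows.get(key)  # cheap random read

    def scan(self):
        return sorted(self.rows.items())  # no sequential-read advantage


log = AppendOnlyLog()
log.write(("a", 1))
log.write(("b", 2))
log.update("a", 99)  # forces a full rewrite for a single-row change

kv = KeyedStore()
kv.update("a", 1)
kv.update("a", 99)  # cheap in-place update, no rewrite needed
```

A store that wants both cheap scans and cheap updates has to reconcile these two layouts, which is the gap Kudu aims to fill.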
Kudu is meant to do both well. Written in C++ rather than Java, it uses its own file format and was "built from the ground up to leverage modern hardware." Rather than bounce back and forth between HDFS and HBase, applications can use Kudu as a single unified data store. (Integration with Spark and Cloudera's Impala is planned too.)
Though Cloudera is behind the project, Brandwein made it clear there is "nothing Cloudera-specific about [Kudu]." The project is intended to be released as open source and eventually put under the governance of the Apache Software Foundation, in the same manner as Hadoop's other major components.
Replacement or enhancement?
If all this sounds like a straight-up replacement for HDFS or HBase, Brandwein noted that wasn't the immediate intention. Instead, Kudu is meant to complement and run side by side with those storage engines, since some applications may get more immediate benefit out of HDFS or HBase.
Last week, before the official release of the news, VentureBeat speculated about Kudu's possible implications for the rest of the big data industry. It "could present a new threat to data warehouses from Teradata and IBM’s PureData ... It may also be used as a highly scalable in-memory database that can handle massively parallel processing (MPP) workloads, not unlike HP’s Vertica and VoltDB."
This isn't likely to happen overnight, in the same way Kudu isn't likely to become a rip-and-replace substitute for HDFS or HBase. Teradata, in particular, decided it was better to have Hadoop as an ally -- it entered into partnerships with Hortonworks and added Hadoop support for many of its appliances.
Data warehouses still have markedly different needs and applications than Hadoop, so the two benefit more from working together than from one trying to subsume the other. Kudu will need time to come out of beta and provide a compelling use case for switching production systems, and it will take even longer for the existing data warehouse market to feel a genuine existential crisis.