Why run SQL on NoSQL? Speed, says Splice Machine

Splice Machine, maker of an RDBMS for Hadoop, claims its product can outperform conventional databases on workloads that need both scale and speed

If they call it a NoSQL database, why would you want to run SQL queries on it?

Counterintuitive as it sounds, that's exactly the idea being explored by a company called Splice Machine. Its product, also called Splice Machine, is a transactional RDBMS that uses Hadoop as a data store, and it's now available for download and general use as a public beta.

So does this mean all RDBMSes as we know them are facing a rip-and-replace? Probably not. From talking to Splice Machine and to one of the folks who has already implemented it, it's clear the touted benefits are aimed at larger workloads where scale matters, not at being a general substitute for the MySQLs of the world.

Splice Machine is essentially a Hadoop implementation of the Java-powered Apache Derby database project. Hadoop was built to run Java apps across clusters of machines, so Splice Machine applies that distributed-application approach to Derby database workloads. The resulting system runs standard ANSI SQL-99 queries, and Splice Machine (the company) provides services for porting from specific flavors of SQL, such as Oracle PL/SQL or Microsoft T-SQL.
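
To make that concrete, here's a minimal sketch of what querying Splice Machine over JDBC might look like, assuming Derby-style network-client conventions. The connection URL, credentials, and table are illustrative assumptions, not details taken from Splice Machine's documentation -- the point is simply that the query itself is ordinary ANSI SQL.

    import java.math.BigDecimal;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class SpliceQuery {
        public static void main(String[] args) throws Exception {
            // Hypothetical URL; Splice Machine is Derby-derived, so connections
            // follow Derby's network-client conventions.
            String url = "jdbc:splice://localhost:1527/splicedb;user=app;password=app";
            try (Connection conn = DriverManager.getConnection(url);
                 PreparedStatement ps = conn.prepareStatement(
                         "SELECT customer_id, SUM(amount) AS total " +
                         "FROM orders GROUP BY customer_id HAVING SUM(amount) > ?")) {
                ps.setBigDecimal(1, new BigDecimal("1000.00"));
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getInt(1) + "\t" + rs.getBigDecimal(2));
                    }
                }
            }
        }
    }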

But, again: If people are using Hadoop and NoSQL to get away from the strictures of RDBMSes, why implement such a thing inside Hadoop all over again?

Monte Zweben, CEO and founder of Splice Machine, sees it as a best-of-both-worlds situation: "Creating an ACID-compliant RDBMS on top of Hadoop [brings] the ease of data access through SQL, the reliability of real-time updates of ACID transactions, and a 10x price/performance improvement over traditional RDBMSs using [Hadoop's] parallelized scale-out technology."

Another reason for Hadoop data access via SQL was what Monte described as the "parking lot" problem. Many of Splice Machine's customers, he noted, use Hadoop as a default place to park data -- easy to deposit it there, not so easy to retrieve it. "A Hadoop RDBMS provides robust SQL access to that data in Hadoop, as well as even supporting real-time applications." Most of the apps existing Splice Machine customers run involve analytics systems like Cognos, Tableau, and Unica.

For further details, I turned to one such Splice Machine customer, marketing services company Harte Hanks, and its managing director of product innovation, Rob Fuller. An early adopter of Splice Machine, Harte Hanks has been using the product for about 15 months.

Rob's workload involved a 20-terabyte Oracle database ported to Splice Machine, which he claims delivered an order of magnitude better performance than Oracle. One figure he cited: a query that returned 2.5 billion rows took 183 seconds on Oracle but 20 seconds in Harte Hanks's proof-of-concept Splice Machine environment -- roughly a 9x speedup. Another query, which involved a complex set of joins, took 32 minutes on Oracle; in Splice Machine, it took 9.

The main long-term benefit Rob cited was the scale-out Hadoop provides. "Oracle does perform pretty well by default," he said, "but the challenge is when you outgrow your servers. You don't get much [scaling] benefit for adding an Oracle RAC, as you tend to bottleneck on the SAN."

For Harte Hanks, scaling out on Splice Machine presented some major benefits over Oracle. One was automatic balancing of data across the cluster; another was avoiding the costly licensing that Oracle imposes when a user wants to add nodes. Scaling Oracle in general is a costly proposition, requiring extra products and heavy lifting.

That said, Splice Machine isn't a drop-in replacement for an existing database. One issue, as mentioned above, is that queries need to be adapted to ANSI SQL-99 if they aren't already compatible. Another is that any application with custom stored procedures needs that code rewritten in Java before it can run in Splice Machine. Some of this work can be automated -- Monte estimates about a 70 to 95 percent accuracy rate for those tools -- but it still isn't trivial work. (Splice Machine does offer such porting as part of the support package it sells with the software.)
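
As a rough illustration of what such a rewrite looks like, here's a minimal sketch of a Java stored procedure following Derby's conventions, which Splice Machine inherits. The schema, class name, and registration DDL are assumptions for illustration, not taken from an actual port.

    import java.math.BigDecimal;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class OrderProcs {
        // Derby-style stored procedures are public static methods; OUT parameters
        // arrive as single-element arrays. "jdbc:default:connection" reuses the
        // calling session's connection, per the Derby convention.
        public static void totalOrders(int customerId, BigDecimal[] total)
                throws SQLException {
            try (Connection conn = DriverManager.getConnection("jdbc:default:connection");
                 PreparedStatement ps = conn.prepareStatement(
                         "SELECT SUM(amount) FROM orders WHERE customer_id = ?")) {
                ps.setInt(1, customerId);
                try (ResultSet rs = ps.executeQuery()) {
                    total[0] = rs.next() && rs.getBigDecimal(1) != null
                            ? rs.getBigDecimal(1) : BigDecimal.ZERO;
                }
            }
        }
    }

The method would then be registered with Derby-style DDL along the lines of CREATE PROCEDURE TOTAL_ORDERS(IN CUSTOMER_ID INT, OUT TOTAL DECIMAL(12,2)) PARAMETER STYLE JAVA LANGUAGE JAVA READS SQL DATA EXTERNAL NAME 'OrderProcs.totalOrders' -- standing in for whatever PL/SQL body did the same work.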

Splice Machine also isn't the only product out there that enables SQL querying on Hadoop. Hadapt, Cloudera, and Greenplum all have similar offerings. Cloudera in particular is positioning itself hard as a single-solution vendor, where customers can deposit data freely into Hadoop and then extract it later in a variety of ways. Its open source SQL engine for Hadoop, Impala, has an Apache Derby connection of its own: it relies on the metastore of the Hive data warehousing system, which by default runs on an embedded version of Derby. But Splice Machine touts full interaction with the data -- transactions, updates, and deletes -- not just analysis and reporting.
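
To illustrate that distinction, here's a minimal sketch of the kind of multi-statement ACID transaction Splice Machine advertises and read-only analytics engines don't attempt, written against plain JDBC. The connection URL and tables are the same illustrative assumptions as in the earlier example.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class SpliceTransaction {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:splice://localhost:1527/splicedb;user=app;password=app";
            try (Connection conn = DriverManager.getConnection(url)) {
                conn.setAutoCommit(false); // group the update and delete into one transaction
                try (PreparedStatement upd = conn.prepareStatement(
                             "UPDATE orders SET status = 'SHIPPED' WHERE order_id = ?");
                     PreparedStatement del = conn.prepareStatement(
                             "DELETE FROM order_holds WHERE order_id = ?")) {
                    upd.setInt(1, 42);
                    upd.executeUpdate();
                    del.setInt(1, 42);
                    del.executeUpdate();
                    conn.commit();   // both changes become visible atomically
                } catch (Exception e) {
                    conn.rollback(); // on failure, neither change is applied
                    throw e;
                }
            }
        }
    }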

Before Splice Machine had any of its products available for public use, it had already raised $19 million across two rounds of funding since 2012 -- an echo of the general fervor and curiosity around anything Hadoop. Pivotal and Cloudera have also raised tens of millions each in third-party investments, and may well be headed for IPOs. But Splice Machine stands out from them for being an actual Hadoop application rather than just another Hadoop vendor -- an application set to appeal both to existing Hadoop users and to those still on the fence about what Hadoop has to offer them.

This story, "Why run SQL on NoSQL? Speed, says Splice Machine," was originally published at InfoWorld.com. Get the first word on what the important tech news really means with the InfoWorld Tech Watch blog. For the latest developments in business technology news, follow InfoWorld.com on Twitter.

Copyright © 2014 IDG Communications, Inc.