MetaMatrix was one of a few “database virtualization” products that aimed to make all of your RDBMSes look like one giant unified schema. This is the sort of idea that makes perfect sense to people with no experience dealing with databases.
When I first heard of MetaMatrix, I was like, “Wow, that is the worst idea ever. There is no way it will perform well enough to avoid making me very, very angry.” Indeed, without sufficient scale, it didn’t.
Part of the problem was that when MetaMatrix was developed, it was almost a solution looking for a problem. These days, I’ve found the problem.
A typical big data project involves some semi-structured data, a bit of unstructured data (such as email), and a whole lot of structured data (stuff stored in an RDBMS). On one hand, you want to do analytics that cross data sources (schemas) and even types (RDBMS plus log files, for example). On the other hand, the cost of constructing these systems is relatively high.
While the cost of Hadoop on a per-node basis is pretty inconsequential, the cost of understanding all of the schemas, getting them into Hadoop, and structuring them well enough to perform the analytics is another story. In other words, big data is not costly, but ETL is. It isn’t the technology that costs you; it’s the massive IT integration project necessary to get there.
The second pain point is that any time you load from one system to another, you inject latency. The unwashed advocate will say, “That’s why I have real-time analytics like Spark” (the unwashed won’t mention Storm), but you’re not realistically going to integrate row-level updates with Spark, Storm, or Kafka. More likely, this will be some kind of scheduled batch and change data capture (CDC) process. Even if you did integrate using streaming, you would still inject latency (though less of it).
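To make that latency concrete, here is a minimal sketch of the kind of scheduled batch CDC pull I mean, assuming a PostgreSQL source reachable through psycopg2; the orders table, its updated_at column, and the connection details are hypothetical, and the staged file would still need its own load into HDFS afterward.

```python
# Hypothetical scheduled CDC extract: pull only rows changed since the last run.
# The connection string, "orders" table, and "updated_at" column are assumptions
# for illustration; the gap between scheduled runs is the latency described above.
import csv
import datetime

import psycopg2  # assumes a PostgreSQL source; any DB-API driver works the same way

WATERMARK_FILE = "last_extract.txt"


def read_watermark():
    """Return the timestamp of the last successful extract (epoch start if none)."""
    try:
        with open(WATERMARK_FILE) as f:
            return f.read().strip()
    except FileNotFoundError:
        return "1970-01-01 00:00:00"


def extract_changes():
    since = read_watermark()
    now = datetime.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")
    conn = psycopg2.connect("dbname=ops host=ops-db.example.com user=etl")
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT id, status, updated_at FROM orders "
            "WHERE updated_at > %s AND updated_at <= %s",
            (since, now),
        )
        with open("orders_delta.csv", "w", newline="") as out:
            csv.writer(out).writerows(cur)  # staged locally, loaded into HDFS later
    with open(WATERMARK_FILE, "w") as f:
        f.write(now)  # advance the watermark only after a successful extract


if __name__ == "__main__":
    extract_changes()  # typically kicked off by cron or an Oozie/Airflow schedule
```

However often that schedule runs, that interval is latency your analytics simply have to live with.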
The third pain is that, often, the data is coming from an operational system. You’re not “moving” to Hadoop; you’re adding a downstream repository. You already have a schema, so why do you need to “Sqoop” it into Hadoop? Well, one reason is that MapReduce, your DAG implementation (Spark, Tez, or Flink), and Hive don’t know how to query directly against the RDBMS. Even if they did, the RDBMS wouldn’t scale the way that Hadoop needs.
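For a sense of what that copy step amounts to, here is a minimal sketch of a Sqoop-style import written against Spark’s JDBC data source (a stand-in for illustration, not anyone’s production tooling); the JDBC URL, credentials, table name, split bounds, and HDFS path are all made up.

```python
# Hypothetical "copy the operational table into Hadoop" job. Spark's JDBC data
# source stands in for a Sqoop import here; every identifier below is made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders-copy").getOrCreate()

# Pull the table from the operational RDBMS over JDBC, split across 8 parallel reads.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://ops-db.example.com:5432/ops")
    .option("dbtable", "orders")
    .option("user", "etl")
    .option("password", "secret")
    .option("partitionColumn", "id")   # numeric key used to split the fetch
    .option("lowerBound", 1)
    .option("upperBound", 10000000)
    .option("numPartitions", 8)
    .load()
)

# Land a second, downstream copy in HDFS so Hive and Spark jobs can read it;
# this duplication (and the latency it carries) is exactly the pain described above.
orders.write.mode("overwrite").parquet("hdfs:///warehouse/orders")
```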
What you need is an architecture that reads the data natively as it is, makes it available to Hadoop, and scales at least in parallel to the Hadoop infrastructure. Ideally it would be directly integrated to use only the Hadoop infrastructure (which means using HDFS and the rest).
So far, integration of Hadoop with these systems is a bit backward. People are integrating Hive into database virtualization with JDBC/ODBC instead of integrating directly into the Hadoop infrastructure. The drawback is that if you have a few hundred Hadoop nodes to process your data, you may in turn need a few hundred database virtualization nodes to process your RDBMS data in conjunction with your Hadoop data. Ideally, you’d have one set of seamlessly scalable infrastructure for both. Moreover, you lose the advantage of some of the tooling on Hadoop to process the data together (Pig) and are stuck with bare SQL and traditional high-level languages (Java).
For the moment, database virtualization helps address the cost of ETL with a parallel, scalable architecture that can query Hive and your RDBMSes together. It can help mitigate some of the pain of latency with integrated caching semantics. It lets you query your data where it lies.
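As a rough illustration of what “query it where it lies” can look like, here is a sketch that joins a Hive table with a table read over JDBC using Spark SQL. This is a stand-in rather than a database virtualization product, and the web_logs table, customers table, URL, and credentials are all hypothetical.

```python
# Hypothetical federated-style query: the RDBMS table stays where it is and is
# read over JDBC at query time, while the Hive table is already in Hadoop.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("federated-sketch")
    .enableHiveSupport()   # needs a Hive metastore configured on the cluster
    .getOrCreate()
)

# RDBMS side: no ETL copy, just a JDBC read registered as a temporary view.
customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://ops-db.example.com:5432/ops")
    .option("dbtable", "customers")
    .option("user", "analyst")
    .option("password", "secret")
    .load()
)
customers.createOrReplaceTempView("customers")

# Hive side: data already living in Hadoop (say, parsed web logs), joined in place.
report = spark.sql("""
    SELECT c.region, COUNT(*) AS page_views
    FROM web_logs w
    JOIN customers c ON w.customer_id = c.id
    GROUP BY c.region
""")
report.show()
```

Note that the RDBMS side is still funneled through a handful of JDBC connections, which is precisely the scaling limitation of the JDBC/ODBC-style integration described above.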
We are still waiting for the next-generation database virtualization product that scales as part of your Hadoop infrastructure and lets you query your RDBMS virtualized inside of Hive (or Impala or Spark) with no ETL. But I see the evolution of that technology as virtually inevitable if people start to rediscover the value of database virtualization. I think it is fair to say that database virtualization technologies like Red Hat’s Teiid might be one of the next big trends in big data.