The NoSQL buzzword has been metastasizing for several years. The excitement about these fast data stores has been intoxicating, and we're as guilty as anyone of seeing the groundbreaking appeal of NoSQL. Yet the honeymoon is coming to an end, and it's time to start balancing our enthusiasm with some gimlet-eyed hard truths.
Don't get us wrong. We're still running to try the latest experiment in building a simple mechanism for storing data. We still find deep value in MongoDB, CouchDB, Cassandra, Riak, and other NoSQL standouts. We're still planning on tossing some of our most trusted data into these stacks of code because they're growing better and more battle-tested each day.
[ Also on InfoWorld: NoSQL standouts: New databases for new applications | First look: Oracle NoSQL Database | Get a digest of the key stories each day in the InfoWorld Daily newsletter. ]
But we're starting to feel the chafing, as the NoSQL systems are far from a perfect fit and often rub the wrong way. The smartest developers knew this from the beginning. They didn't burn the SQL manuals and send nastygrams to the sales force of their once devoted SQL vendor. No, the smart NoSQL developers simply noted that NoSQL stood for "Not Only SQL." If the masses misinterpreted the acronym, that was their problem.
This list of gripes, big and small, is thus an attempt to document this fact and to clear the air. It's meant to set things straight now so that we can do a better job understanding the trade-offs and the compromises.
One of the first gripes people have about SQL systems is the computational cost of executing a JOIN between two tables. The idea is to store the data in one and only one place. If you're keeping a list of customers, you put their street addresses in one table and use their customer IDs in every other table. When you pull the data, the JOIN connects the IDs with the addresses and everything remains consistent.
The trouble is that JOINs can be expensive, and some DBAs have concocted complex JOIN commands that boggle the mind, turning even the fastest hardware to sludge. It was no surprise that the NoSQL developers turned their lack of JOINs into a feature: Let's just keep the customer's address in the same table as everything else! The NoSQL way is to store key-value pairs for each person. When the time comes, you retrieve them all.
Alas, people who want their tables to be consistent still need JOINs. Once you start storing customers' addresses with everything else about them, you often end up with multiple copies of those addresses in each table. And when you have multiple copies, you need to update them all at the same time. Sometimes that works, but when it doesn't, NoSQL isn't ready to help with transactions.
Wait, you say, why not have a separate table with the customer's information? That way there will only be one record to change. It's a great idea, but now you get to write the JOIN yourself in your own logic.
Let's say you're OK to live without JOINing tables because you want the speed. It's an acceptable trade-off, and sometimes SQL DBAs denormalize tables for just this reason.
The trouble is that NoSQL makes it hard to keep the various entries consistent. There are often no transactions to make sure that changes to multiple tables are made together. For that, you're on your own, and a crash could ensure that tables turn inconsistent.
The earliest NoSQL implementations thumbed their nose at these transactions. They would offer data listings that were consistent, except when they weren't. In other words, they went after the lowest-value data where errors wouldn't make any material difference.
Now some NoSQL implementations offer something approaching a transaction. Oracle's NoSQL product, for instance, offers transactional control over data written to one node and lets you choose a flexible amount of consistency across multiple nodes. If you want perfect consistency, you have to wait for each write to reach all nodes. Several other NoSQL data stores are experimenting with adding more structure and protection like this.
Many NoSQL programmers like to brag about how their lightweight code and simple mechanism work extremely quickly. They're usually right when the tasks are as simple as the insides of NoSQL, but that changes when the problems get harder.
Consider the old challenge of a JOIN. Once NoSQL programmers start generating their own JOIN commands in their own logic, they start to try to do this efficiently. SQL developers have spent decades developing sophisticated engines to handle JOIN commands as efficiently as possible. One SQL developer told me he was trying to synchronize his code with the spinning hard disk so that he would request data only when the head was just above the right spot. This may seem extreme, but SQL developers have been working on similar hacks for decades.
There's no doubt that programmers spend days pulling out their hair trying to structure their SQL queries to take advantage of all of this latent intelligence. It may not be simple to tap, but when the programmer figures it out, the databases can really sing.
A sophisticated query language like SQL always has the potential to outshine an unsophisticated query language like those found in NoSQL. It may not matter with simple results, but when the action becomes complex, the SQL is being executed on the machine right next to the data. It has little overhead fetching the data and doing the work. A NoSQL server usually has to ship the data to where it's going.
In theory, SQL is supposed to be a standard language. If you use SQL for one database, you should be able to run the same query in another compliant version. This claim may work with a few simple queries, but every DBA knows that it can take years to learn the idiosyncrasies of SQL for different versions of the same database. Keywords are redefined, and queries that worked on one version won't work with another.
Each tool doesn't just have its own idiosyncrasies, it sports an entirely different philosophy and way of expressing it. There are no easy ways to switch between data stores and you're often left writing tons of glue code just to give yourself the option of switching in the future. This may not be too difficult when you're stuffing pairs of keys and values into the system, but it can grow increasingly aggravating the more complexity you introduce.
One of the great ideas from the NoSQL model is not requiring a schema. In other words, programmers don't need to decide in advance which columns will be available for each and every row in a table. One entry may have 20 strings attached to it, another may have 12 integers, and another might be completely blank. The programmers can make the decision whenever they need to store something. They don't need to ask permission of the DBA, and they don't need to fill out all the paperwork to add a new column.
All that freedom sounds intoxicating, and in the right hands it can speed development. But is it really a good idea for a database that might live through three teams of developers? Is it even workable for a database that might last beyond six months?
In other words, the developers might want the freedom to toss any old pair into a database, but do you want to be the fifth developer to come along after four have chosen their own keys? It's easy to imagine a variety of representations of "birthday," with each developer choosing his or her own representation as a key when adding a user's birthday to an entry. A team of developers might imagine almost anything: "bday," "b-day," "birthday".
The NoSQL structure offers no support to limit this problem because that would mean reimagining the schema. It doesn't want to harsh on the mellow of the totally cool developers. A schema would get in the way.
The fact is that adding a column to a table isn't a big deal, and the discipline might actually be good for the developer. Just as it helps to force developers to designate variable types, it also helps to force developers to designate the type of data attached to a column. Yes, the DBA may force the developer to fill out a form in triplicate before attaching that column, but it's not as bad as dealing with a half-dozen different keys created on the fly by a programmer.
Let's say you don't want all of the data in all of the rows, and you want the sum of a single column. SQL users can execute a query with the SUM operation and send one -- just one -- number back to you.
NoSQL users get all of the data shipped back to them and can then do the addition themselves. The addition isn't the problem because it takes about the same amount of time to add up the numbers on any machine. However, shipping the data around is slow, and the bandwidth required to ship all that data can be expensive.
There are few extras in NoSQL databases. If you want to do anything but store and retrieve data, you're probably going to do it yourself. In many cases, you're going to do it on a different machine with a complete copy of the data. The real problem is that it can often be useful to do all of the computation on the machine holding the data because shipping the data takes time. But tough for you.
Sure, you can get your NoSQL stack up and running on your server. Sure, you can write your own custom code to push and pull your data from the stack. But what if you want to do more? What if you want to buy one of those fancy reporting packages? Or a graphing package? Or to download some open source tools for creating charts?
Sorry, most of the tools are written for SQL databases. If you want to generate reports, create graphs, or do something with all of the data in your NoSQL stack, you'll need to start coding. The standard tools come ready to snarf data from Oracle, Microsoft SQL, MySQL, and Postgres. Your data is in NoSQL? They're working on it.
And they'll be laboring on it for a bit. Even if they jump through all of the hoops to get up and running with one of the NoSQL databases, they'll have to start all over again from the beginning to handle the next system. There are more than 20 different NoSQL choices, all of which sport their own philosophy and their own way of working with the data. It was hard enough for the tool makers to support the idiosyncrasies and inconsistencies in SQL, but it's even more complicated to make the tools work with every NoSQL approach.
This is a problem that will slowly go away. The developers can sense the excitement in NoSQL, and they'll be modifying their tools to work with these systems, but it will take time. Maybe then they'll start on MongoDB, which won't help you because you're running Cassandra. Standards help in situations like this, and NoSQL isn't big on standards.
NoSQL shortcomings in a nutshell
All of these NoSQL shortcomings can be reduced to one simple statement: NoSQL tosses away functionality for speed. If you don't need the functionality, you'll be fine, but if you need it in the future, you'll be sorry.
Revolutions are endemic to tech culture. A new group comes along and wonders why the last generation built something so complex, and they set out to tear down the old institutions. After a bit, they begin to realize why all of the old institutions were so complex, and they start implementing the features once again.
We're seeing this in the NoSQL world, as some of the projects start adding back things that look like transactions, schemas, and standards. This is the nature of progress. We tear things down only to build them back again. NoSQL is finished with the first phase of the revolution and now it's time for the second one. The king is dead. Long live the king.