Cassandra, CouchDB, MongoDB, Redis, Riak, Neo4J, and FlockDB reinvent the data store
Was it just two or three years ago when choosing a database was easy? Those with a Cadillac budget bought Oracle, those in a Microsoft shop installed SQL Server, those with no budget chose MySQL. Everyone in between tried to figure out where they belonged.
Those days are gone forever. Everyone and his brother are coming out with their own open source project for storing information. In most cases, these projects are tossing aside many of the belts-and-suspenders protections that people expect from the classic databases. There are enough of them now that some joker started calling them NoSQL and claiming, perhaps tongue-in-cheek, that the acronym stood for Not Only SQL.
[ Also on InfoWorld: Bossie Awards 2011: The best open source software of the year | Get the key insights on open source news and trends from InfoWorld's Technology: Open Source newsletter. ]
Everywhere you look, there are new NoSQL databases -- or "data stores," if you're one of those who feel that the word "database" can only be used by proper relational software offering belts-and-suspenders compliance with the ACID rules. Some of these new databases are quite sophisticated, while others are deliberately bare bones. But all are intended to deliver high performance by trading away the power of a relational database. Let the banks and their nervous programmers worry that Aunt Millie's pension check is deposited correctly, they say. You can't get your kicks if you're constantly checking everything in triplicate.
In most cases, the NoSQL rebels succeeded in building something that's blazingly fast and fairly scalable -- but only by abandoning traditional crutches. Old school DBAs are shaking their heads and chuckling through the presentations because they're sure the whippersnappers are going to stumble over the problems the veterans have already fixed. But the whippersnappers don't care because they have different project needs in mind. They're aiming at new targets.
What is surprising is how different the NoSQL projects are turning out to be. Whereas the old relational space largely converged on a set of features and a standard language, these new databases are all built by people going in their own direction. The packages may take basic pairs of keys and values, but they're tuned for different use cases. The major variations aren't in the format of the data but in how often it's replicated, cached, and sharded.
For instance, do you store data that is often retrieved, such as a person's email address? Or is the data squirreled away for a rainy day, as with log files? Do you expect many users with a small amount of data or just a few users with large volumes of data? Can your users survive if you lose one of their rows of data, or will they start to sue?
There's a different project for each answer. In the past, each architect would tweak the configuration of MySQL or Oracle in a different way. Now, the architect chooses a completely novel project.
There are great advantages to this Babelization if the needs of your project fit the abilities of one of the new databases. If they line up well, the performance boosts can be incredible because the project developers aren't striving to build one Dreadnought to solve every problem.
The experimentation is also fun because the designers don't feel compelled to make sure their data store is a drop-in replacement that speaks SQL like a native. They're coming up with new query languages and making different decisions about storing things like binary data. It has all the buzz of innovation.
The downsides to this experimentation are often ignored in all of the excitement. In the past, developers built nice cross-database libraries to smooth the differences and make it easier to switch. Many Java developers, for instance, write code that rests on the JDBC libraries. The databases are pretty close to interchangeable. None of these old libraries work with these new databases that thumb their nose at the old orthodoxies. Although many of the projects share similar approaches, moving from one to another can require more than a few lines of rewriting. (This may be changing because there's at least one project aiming to create Hibernate-bindings for the major NoSQL databases. Hibernate OGM could make it possible to point your Hibernate-based application at any of the NoSQL databases it's able to support.)
To make matters worse, many ancillary items are missing. There's a very fertile category of report-generating tools that work with any database out there, a fact that was made possible by the boring, old standard query language. None of the new databases will work with these tools out of the box, and they may never work without plenty of sweat. There may be hundreds or perhaps thousands of packages that work with SQL and only a few that are able to help NoSQL packages.
There are some indications that this kind of interchangeability will take a long time to emerge in the NoSQL space, if it appears at all. All of that experimentation is generating features that don't overlap very easily. Although it wouldn't be hard to write basic routines that abstract away the keys and values, anything more sophisticated would start to fray quickly. The query languages, for instance, are very different.
To get some feel for the options in this incredibly fertile space, I spent some time installing these packages and inserting some rows. It was, for the most part, simple, fun, and relatively fast. All of the packages are relatively stable and useful -- for the right projects. But none of them are as feature-rich or sophisticated as the best commercial SQL tools.
NoSQL databases: Cassandra
Facebook needed something fast and cheap to handle the billions of status updates, so it started this project and eventually moved it to Apache where it's found plenty of support in many communities. It's not just for Facebook any longer. Many of the committing programmers come from other companies, and the project chair works at DataStax.com, a company devoted to providing commercial support for Cassandra.
The heritage of the Cassandra project is obvious because it's a good tool for tracking lots of data, such as status updates at Facebook. The tool helps create a network of computers that all carry the same data. Each machine is meant to be equal to the others, and all of them should end up being consistent once the data propagates around the P2P network of nodes, though it's not guaranteed. The key phrase is "eventual consistency," not "perfect consistency." If you've watched your status updates disappear and reappear on Facebook, you'll understand what this means.
The tool runs in Java as a separate process waiting for interaction. There's already a collection of higher-level libraries for Java, Python, Ruby, and PHP, as well as some of the other languages.
Using Cassandra seems relatively simple, but I still found myself getting hung up on several barriers, such as defining a keyspace (which acts as a namespace but for the columns). Getting up to speed takes more than a few minutes because there are more than just the basic routines for storing collections of values. Cassandra is happy with a sparse matrix where each row stores only a few standard columns, and it builds the indices with this in mind.
Much of the complexity in the API is devoted to controlling just how quickly the cluster of nodes moves toward consistency. You can specify the speed of synchronization for columns and collections of values called supercolumns.
Getting everything running is now fairly well documented, but getting it running quickly requires a fair amount of both hardware and operating system tuning. The biggest bottleneck is the commit log. Optimizing the way that this is written to disk is the most important part of improving writes. Speeding up the extraction of data involves paying attention to the pattern of reads. Did your old, fancy database do this for you fairly automatically? Ah, don't complain. It's fun to think about the hardware and how it affects your software.
NoSQL databases: CouchDB
CouchDB stores documents, each of which is made up of a set of pairs that link key with a value. The most radical change is in the query. Instead of some basic query structure that's pretty similar to SQL, CouchDB searches for documents with two functions to map and reduce the data. One formats the document, and the other makes a decision about what to include.
I'm guessing that a solid Oracle jockey with a good knowledge of stored procedures does pretty much the same thing. Nevertheless, the map and reduce structure should be eye-opening for the basic programmer. Suddenly a client-side AJAX developer can write a fairly complicated search procedure that can encode some sophisticated logic.
There's a burgeoning community growing around CouchDB. All of the major languages now have client libraries that simplify the interaction with the database and make it possible to store your data. They don't always expose all of the power of the query function, but that's not necessary for every service. There are also companies like Couchbase that bundle CouchDB into commercial product offerings. Cloudant offers the database as a hosted service and partners with companies like CloudBees to support the code running on the Cloudant cloud. It's getting easier and easier to use CouchDB like a service.
NoSQL databases: MongoDB
Well, that's simplifying things a bit. The big difference is that MongoDB will create indices for the columns of your database and return queries faster when the indices are correctly constructed. That's part of your job, by the way. You want to anticipate which indices your users will need.
There's also a fair number of extra tools for working with the database. PHPMoAdmin, a cousin of the MySQL tool PHPMyAdmin, is just one of almost a dozen tools for admins. The proliferation of these tools is gradually erasing one of the standard reasons for sticking with a classic database. As I found more of them, I noticed that everything was more comfortable.
NoSQL databases: Redis
Like CouchDB and MongoDB, Redis stores documents or rows made up of key-value pairs. Unlike the rest of the NoSQL world, it stores more than just strings or numbers in the value. It will also include sorted and unsorted sets of strings as a value linked to a key, a feature that lets it offer some sophisticated set operations to the user. There's no need for the client to download data to compute the intersection when Redis can do it at the server.
This approach leads to some simple structures without much coding. Luke Melia tracked the visitors on his website by building a new set every minute. The union of the last five sets defined those who were "online" at that moment. The intersection of this union with a friends list produced the list of online friends. These sorts of set operations have many applications, and the Redis crowd is discovering just how powerful they can be.
Redis is also known for keeping the data in memory and only writing out the list of changes every once and a bit. Some don't even call it a database, preferring instead to focus on the positive by labeling it a powerful in-memory cache that also writes to disk. Traditional databases are slower because they wait until the disk gets the information before signaling that everything is OK. Redis waits only until the data is in memory, something that's obviously faster but potentially dangerous if the power fades at the wrong moment.
The project leaders are still exploring how to expand the project, an intriguing decision because there's more than one official version of Redis from the main team. There's even one official build of Redis that comes with a Lua interpreter and a disclaimer saying that "there is no guarantee that scripting works correctly or that it will be merged into future versions of Redis!" Projects like these are never boring.
Redis providers are starting to appear. OpenRedis promises it's "launching soon." Meanwhile, Redis Straight Up charges just $19 per month, plus all of the costs from Amazon's cloud. The service handles the configuration and passes the costs on to you.
NoSQL databases: Riak
Riak is one of the more sophisticated data stores. It offers most of the features found in others, then adds more control over duplication. Although the basic structure stores pairs of keys and values, the options for retrieving them and guaranteeing their consistency are quite rich.
Those of you who signed up for the Windows 10 upgrade but changed your mind may be able to crawl out
You may be better off sticking with Win7 or Win8.1, given a wide range of Win10 trade-offs and...
Samsung's throwing another phablet into the ring, but this one's curved on both sides
New sources are stepping up questions about Oracle's stewardship of the Java development platform
What you omit from your resume is just as important to job search success as what you include
Some apps on some iPads support full split-screen capabilities, so be prepared for a variable user...
The latest Start menu has few of Win7-era customizations -- but many new tricks worth knowing