Which freaking database should I use?

In the era of big data, the good old RDBMS is no longer the right tool for many database jobs. Here's a quick guide to choosing among NoSQL alternatives.


Key-value pair databases
Key-value pair databases include Apache Cassandra and the current 1.8 edition of Couchbase. These are highly scalable, but they offer developers no help with complex datasets. If you essentially need a disk-backed, distributed hash table and can look everything up by identity, these will scale well and be lightning fast. However, if you find that you're looking up a key to get to another key to get to another key to get to your value, you probably have a more complicated case.

Key-value pair databases come in many variations. They differ mainly in the trade-offs they make under the CAP theorem and in how they configure storage and memory use. Ultimately, each one is some form of hash table.

This is fine for flat parts lists, so long as the parts are not themselves composites of other parts. It is also fine for stock quotes, "right now," or other types of lists where the key has meaning and is the primary way you're going to look up the value. Usually these stores are combined with an index, and there is a way to query against the values or generate a list of keys, but if you need a lot of that, you should probably look elsewhere.
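To make the distinction concrete, here is a minimal sketch in Python. A plain dict stands in for the distributed hash table, and the key names, the sample data, and the get and resolve helpers are all made up for illustration; none of this is the API of Couchbase or Cassandra. The first lookup is the case these stores are built for. The second shows the key-to-key-to-key chain that suggests your data is more complicated than a hash table.

```python
# A plain dict stands in for the disk-backed, distributed hash table.
# All keys and values below are hypothetical sample data.
quotes = {
    "quote:ACME": {"price": 101.25, "currency": "USD"},
    "quote:INIT": {"price": 17.40, "currency": "USD"},
}


def get(store, key):
    """Look up a value by identity: one hop, no query language."""
    return store.get(key)


# The warning sign: values that are only pointers to the next key.
chained = {
    "user:42": "cart:9001",
    "cart:9001": "item:7",
    "item:7": {"name": "widget", "price": 3.50},
}


def resolve(store, key, max_hops=10):
    """Follow key -> key -> ... -> value and count the lookups needed."""
    value = store[key]
    hops = 1
    # Keep going while the value is itself a key in the store.
    while isinstance(value, str) and value in store:
        if hops >= max_hops:
            raise RuntimeError("too many hops; possible cycle")
        value = store[value]
        hops += 1
    return value, hops


acme = get(quotes, "quote:ACME")             # one lookup, done
widget, lookups = resolve(chained, "user:42")  # three lookups for one value
```

In a real cluster, each of those hops could be a separate network round trip, which is why the chained case stops being lightning fast.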

Column family/big table databases
Most key-value stores (including Cassandra) offer some form of grouping for columns and can be considered "column family" or "big table" as well. Some databases, such as HBase, were designed as column family stores from the beginning. This is a more advanced form of a key-value pair database in which the keys and values become composite. Think of it as a hash map crossed with a multidimensional array: each row key leads to a set of named columns, grouped into families, rather than to a single value.
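One way to picture the "hash map crossed with a multidimensional array" idea is as nested maps. The sketch below is an in-memory toy under that assumption; the ColumnFamilyTable class, its method names, and the sample row are hypothetical and are not how HBase or Cassandra expose their data.

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Toy model: row key -> column family -> column name -> value."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        # The full address of a value is a composite of three parts.
        self._rows[row_key][family][column] = value

    def get(self, row_key, family, column, default=None):
        # .get() avoids creating empty rows as a side effect of reading.
        return self._rows.get(row_key, {}).get(family, {}).get(column, default)

    def get_family(self, row_key, family):
        """Return a copy of every column in one family of one row."""
        return dict(self._rows.get(row_key, {}).get(family, {}))


table = ColumnFamilyTable()
table.put("user:42", "profile", "name", "Ada")
table.put("user:42", "profile", "email", "ada@example.com")
table.put("user:42", "activity", "last_login", "2012-07-01")

name = table.get("user:42", "profile", "name")
profile = table.get_family("user:42", "profile")
```

The row key still works like the key in a key-value store. The difference is that the value has structure the database knows about, so you can fetch one family without dragging the whole row along.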

According to Robin Schumacher, the vice president of products for DataStax, which sells a certified version of Cassandra, "A popular use case for Cassandra is time series data, which can come from devices, sensors, websites (e.g., Web logs), financial tick data, etc. The data typically comes in at a high rate of speed, can come from multiple locations at once, adds up quickly, and requires fast write capabilities as well as high-performance reads over time slices."
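The time series case can be sketched as one row per source whose columns are kept sorted by timestamp, so that a time slice is a contiguous range. This is an in-memory illustration only: the TimeSeriesRow class, the integer timestamps, and the sample readings are assumptions of mine, not Cassandra's storage engine or client API.

```python
import bisect


class TimeSeriesRow:
    """One row per source; columns are kept ordered by timestamp."""

    def __init__(self):
        self._timestamps = []  # sorted column names
        self._values = []      # column values, in the same order

    def append(self, timestamp, value):
        # Readings may arrive out of order, so find the sorted position.
        i = bisect.bisect_right(self._timestamps, timestamp)
        self._timestamps.insert(i, timestamp)
        self._values.insert(i, value)

    def time_slice(self, start, end):
        """Return (timestamp, value) pairs with start <= timestamp < end."""
        lo = bisect.bisect_left(self._timestamps, start)
        hi = bisect.bisect_left(self._timestamps, end)
        return list(zip(self._timestamps[lo:hi], self._values[lo:hi]))


sensor = TimeSeriesRow()
for ts, reading in [(100, 20.5), (130, 20.9), (110, 20.7), (120, 21.0)]:
    sensor.append(ts, reading)

window = sensor.time_slice(110, 130)
# window is [(110, 20.7), (120, 21.0)]
```

Because the columns are ordered, reading a slice means finding two positions and returning everything between them, with no need to examine readings outside the window.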

You can also use MapReduce on these, so they can be good analytical stores for semi-structured data. These are highly scalable, but not usually transactional. If the relationships between the data are as important as the data itself (such as distance or path calculations), then don't use a column family/big table database.

Document databases
Many developers think document databases are the Holy Grail since they fit neatly with object-oriented programming. With high-flying vendors like 10gen (MongoDB), Couchbase, and Apache's CouchDB, this is where most of the vendor buzz is generated.

Frank Weigel from Couchbase pointed out to me that the company is moving from a key-value pair database in version 1.8 to a document database in 2.0. According to him, the "document database is a natural progression. From clustering to accessing data, document databases and key-value stores are exactly the same, except in a document database, the database understands the documents in the datastore." In other words, the values are JSON, and the elements inside the JSON document can be indexed for better querying and search.
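Here is a rough sketch of that difference in Python. The store below is still keyed like a key-value store, but because it parses the JSON it can index a field inside the document. The DocumentStore class, its method names, and the sample documents are hypothetical and do not reflect the Couchbase or MongoDB APIs; the index also assumes simple top-level fields with hashable values such as strings and numbers.

```python
import json
from collections import defaultdict


class DocumentStore:
    """Toy key-value store that understands its JSON values."""

    def __init__(self):
        self._docs = {}     # key -> parsed document
        self._indexes = {}  # field name -> {field value -> set of keys}

    def create_index(self, field):
        # Build the index from whatever is already stored.
        index = defaultdict(set)
        for key, doc in self._docs.items():
            if field in doc:
                index[doc[field]].add(key)
        self._indexes[field] = index

    def put(self, key, json_text):
        old = self._docs.get(key)
        doc = json.loads(json_text)
        # Keep every index in step with the new version of the document.
        for field, index in self._indexes.items():
            if old is not None and field in old:
                index[old[field]].discard(key)
            if field in doc:
                index[doc[field]].add(key)
        self._docs[key] = doc

    def get(self, key):
        # Plain lookup by identity, exactly as in a key-value store.
        return self._docs.get(key)

    def find(self, field, value):
        if field in self._indexes:
            # Indexed field: jump straight to the matching keys.
            return sorted(self._indexes[field].get(value, set()))
        # No index: fall back to scanning every document.
        return sorted(k for k, d in self._docs.items() if d.get(field) == value)


store = DocumentStore()
store.put("u1", '{"name": "Ada", "city": "London"}')
store.put("u2", '{"name": "Linus", "city": "Helsinki"}')
store.create_index("city")
store.put("u3", '{"name": "Alan", "city": "London"}')

londoners = store.find("city", "London")  # answered from the index
linus = store.find("name", "Linus")       # answered by a full scan
```

Both find calls return the right keys, but only the first avoids touching every document, and that gap widens as the store grows.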

The sweet spot for these is where you're probably already generating JSON documents. As Max Schireson, president of 10gen, told me, you should consider a document database if your "data is too complex to model in a relational database. For example, a complex derivative security might be hard to store in a traditional format. Electronic health records provide another good example. If you were considering using an XML store, that's a strong sign to consider MongoDB and its use of JSON/BSON."

This is probably your operational store: the place where data from users, systems, social networks, or whatever else is collected. It is not likely where you are reporting from, though databases such as MongoDB often have some form of MapReduce available. While you can query on anything, at least in MongoDB, you will not generally achieve acceptable performance without an index.
