NoSQL databases break all the old rules

Amazon SimpleDB, CouchDB, Google App Engine, and Persevere may have a better way of storing data for your Web app

So you've got some data to store. In the past, the answer was simple: Hook up an official database, pour the data into it, and let the machine sort everything out for you while you spend your time writing big checks to the database manufacturer. Now things aren't so cut and dried. A fresh round of exciting new tools is tacking the two letters "db" onto a pile of code that breaks with the traditional relational model. Old database administrators call them "toys" and hint at terrible dangers to come from the follies of these young whippersnappers. The whippersnappers just tune out the warnings because the new tools are good enough and fast enough for what they need.

The non-relational upstarts are grabbing attention because they're willfully ignoring many of the rules that codify the hard lessons learned by the old database masters. The problem is that these belts-and-suspenders strictures often make it hard to create really, really big databases that suck up all of the cycles of a room full of machines. Because all Web application designers dream of building a startup that needs a really big room filled with machines to hold all of the data of all of the users, the rules need to be bent or even broken.

The first thing to go is the venerable old JOIN. College students used to dutifully work through exercises that taught them how to normalize the data by breaking the tables up into as many parts as practical. Disk space was expensive then, and a good normalization expert could really pack in the data. The problem is that JOINs are really, really slow when the data is spread out over several machines. Now that disk space is so cheap and many of the data models don't benefit as much from normalization, JOINs are easy to leave behind.
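
To make the trade-off concrete, here is a minimal Python sketch of the same order stored both ways. The record layouts and field names are invented for illustration; they don't come from any of the products reviewed here. The normalized version needs a second lookup (the JOIN), while the denormalized version spends extra disk space to answer in a single read.

```python
# Hypothetical data; table and field names are illustrative only.
customers = {1: {"name": "Ada", "city": "London"}}
orders_normalized = [{"order_id": 100, "customer_id": 1, "total": 25.0}]

def order_with_customer(order):
    """JOIN-like lookup: a second read, possibly served by another machine."""
    customer = customers[order["customer_id"]]
    return {**order, "name": customer["name"], "city": customer["city"]}

# Denormalized: the customer fields are copied into every order record.
orders_denormalized = [
    {"order_id": 100, "customer_id": 1, "total": 25.0,
     "name": "Ada", "city": "London"}
]

joined = order_with_customer(orders_normalized[0])
print(joined == orders_denormalized[0])
```

The cost of the second form is that a customer who moves has to be updated in every order that carries a copy of the city.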

The next trick is to start using phrases like "eventual consistency." Amazon's documentation for SimpleDB includes this inexact promise: "Consistency is usually reached within seconds, but a high system load or network partition might increase this time." The new twerps really get those codgers steamed when they talk about how all of the computers in the cluster will get around to replicating the data and giving consistent answers when the machines are good and ready. For the kids, consistency is akin to cod liver oil or making your bed in the morning.
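
A toy model shows what that promise looks like from the application's side. The sketch below is not how SimpleDB is built; the class, the method names, and the explicit replicate() step are all assumptions made for illustration. A write lands on one replica, and a read from another replica returns stale data until replication catches up.

```python
class EventuallyConsistentStore:
    """Toy model: writes hit one replica and reach the others later."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # queued (replica_index, key, value) copies

    def put(self, key, value):
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def get(self, key, replica_index):
        return self.replicas[replica_index].get(key)

    def replicate(self):
        """Stand-in for the background copying a real cluster does."""
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.put("comment:1", "first!")
stale = store.get("comment:1", replica_index=2)  # replica 2 hasn't heard yet
store.replicate()
fresh = store.get("comment:1", replica_index=2)
print(stale, fresh)
```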

This distinction between immediate and eventual consistency is deeply philosophical and depends on how important the data happens to be. The old guard who start reaching for their heart medication at the news of these new databases are usually bank programmers who want to make sure that the accounts balance at the end of the day. After all, the bank's brilliant leaders can't turn around and "invest" the cash in subprime mortgages if there's one penny missing after a failed database transaction. At least they're not hauling the DBAs before Congress to explain where the cash went.

But many modern Web sites will sail on without a hiccup if some transaction fails. I see glitches on Facebook regularly. The world won't end if some snarky, anonymous comment on Slashdot disappears. None of these sites cares if the accounting is as good as a bank's, and they don't really need all of the power of a traditional database. (Some wags suggest that banks put the money from an Oracle license into a fund to compensate the people who actually lose money on a failed transaction from one of these newfangled data stores.)

To get an understanding of this expanding tier of non-traditional databases, I took a few out for a ride and built up some test applications with them. The field was surprisingly diverse despite the fact that the offerings are so stripped down that they really don't have more than three major commands: Insert, Update, and Delete. Some offer clustering. Some are available only as a service. Some have grand pretensions to take over the entire server stack. Some play better with AJAX tools than others. None of them is right for everyone, and all of them are completely wrong for the bankers out there. (See the sidebar, "Open source and SaaS offerings rethink the database.")

I also excluded a few interesting tools because of space or just because they were slightly different. Sun, for instance, is now bundling a version of a relational database called Derby with its Java VM. Oracle has its own embedded tool once known as Sleepycat's Berkeley DB but now called the Oracle Embedded Database. Some programmers are even creating very low-rent libraries that write the objects directly to the disk. One project, Prevayler, brags that all of the code from one version could fit legibly on a T-shirt. These products are also stretching the meaning of the two letters "db," but they didn't fit in this comparison.

Amazon SimpleDB
SimpleDB is one of the most advanced and most cloud-like components of Amazon's great push into cloud computing services. Once you sign up and get your secret password, you can ship off some Web service XML filled with pairs of keys and values to SimpleDB and it will store the data for you -- well, as long as you keep paying the bills shown on the meter. You don't need to think about installing anything or backing it up. Amazon hides all of that work for you behind its Web service wall.

SimpleDB comes with two levels of hierarchy on top of the piles of data pairs. The top level is the "domain" and the second level is the "item." After you choose the domain and item names, you pour in the pairs. SimpleDB's comparatively feature-rich API includes the ability to sort the data and even count the number of items that match the query. You can even write queries that exclude values that don't start with a certain string. This may not sound like much to someone who uses SQL Server or Oracle, but some of these low-rent databases can't even sort the data in the result set.
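
The domain-item-pair layout is easy to picture as nested dictionaries. The following is an in-memory sketch of that shape, not Amazon's API: the put, select, and count functions are hypothetical stand-ins, and each attribute holds a single string value to keep things short.

```python
from collections import defaultdict

# domain name -> item name -> attribute name -> string value
store = defaultdict(dict)

def put(domain, item, pairs):
    """Add or overwrite key-value pairs on one item in one domain."""
    store[domain].setdefault(item, {}).update(pairs)

def select(domain, attribute, prefix="", sort=False):
    """Return (item, value) pairs whose value starts with the prefix."""
    matches = [
        (item, attrs[attribute])
        for item, attrs in store[domain].items()
        if attribute in attrs and attrs[attribute].startswith(prefix)
    ]
    return sorted(matches, key=lambda pair: pair[1]) if sort else matches

def count(domain, attribute, prefix=""):
    """Count the items that match instead of returning them."""
    return len(select(domain, attribute, prefix))

put("books", "item1", {"title": "Dune", "author": "Herbert"})
put("books", "item2", {"title": "Dracula", "author": "Stoker"})
put("books", "item3", {"title": "Emma", "author": "Austen"})

d_titles = select("books", "title", prefix="D", sort=True)
print(d_titles, count("books", "title", prefix="D"))
```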

SimpleDB is meant to be used with Amazon's Simple Storage Service (S3), because each of the values in the pairs is limited to 1,024 bytes. That's enough for many strings, but it's not enough for many content engines. So you store the larger content in S3 and keep a pointer to it in SimpleDB. A few libraries, such as an extension of the Java Persistence API, straddle the two clouds and handle this pointer juggling for you.
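
The pointer juggling itself is simple enough to sketch. In the Python below, two plain dictionaries stand in for S3 and for a single SimpleDB item; the "s3-pointer:" marker and the function names are assumptions for illustration, not part of either service. Values that fit under the limit are stored inline, and anything bigger goes to the blob store with only its key kept in the pair.

```python
import hashlib

MAX_VALUE_BYTES = 1024          # per-value limit cited above
POINTER_PREFIX = "s3-pointer:"  # hypothetical marker for "look in S3"

blob_store = {}   # stand-in for S3: key -> bytes
attributes = {}   # stand-in for one SimpleDB item: name -> string

def put_value(name, text):
    """Store small values inline and large ones behind a pointer."""
    data = text.encode("utf-8")
    if len(data) <= MAX_VALUE_BYTES:
        attributes[name] = text
        return
    key = hashlib.sha256(data).hexdigest()
    blob_store[key] = data
    attributes[name] = POINTER_PREFIX + key

def get_value(name):
    """Follow the pointer if there is one (assumes real values never start with the marker)."""
    value = attributes[name]
    if value.startswith(POINTER_PREFIX):
        return blob_store[value[len(POINTER_PREFIX):]].decode("utf-8")
    return value

put_value("title", "NoSQL databases break all the old rules")
put_value("body", "x" * 5000)
print(attributes["body"][:20], len(get_value("body")))
```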

There are other limitations that can lead you to start doing JOIN-like things with multiple calls. Each query can run for only 5 seconds. The answer can hold only 250 items. Each item can have only 250 pairs. Some people half joke about splitting a long value across keys like "description1," "description2," and "description3" and concatenating the pieces on the way out. There are many simple work-arounds for the limitations, but they start to make you wonder whether SimpleDB is supposed to make your life easier or harder.
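
The numbered-key work-around looks something like this in Python. It is a sketch under stated assumptions: the helper names are made up, the 1,024-byte limit is the figure quoted earlier, and the code splits on character boundaries so no chunk ends in the middle of a multibyte character. It also ignores the 250-pair ceiling, which a very long value would eventually hit.

```python
MAX_VALUE_BYTES = 1024  # per-value limit quoted earlier in the article

def split_value(name, text, limit=MAX_VALUE_BYTES):
    """Spread one long string across name1, name2, ... pairs."""
    pairs = {}
    chunk, size = [], 0
    for ch in text:
        ch_size = len(ch.encode("utf-8"))
        if size + ch_size > limit:
            pairs[f"{name}{len(pairs) + 1}"] = "".join(chunk)
            chunk, size = [], 0
        chunk.append(ch)
        size += ch_size
    if chunk:
        pairs[f"{name}{len(pairs) + 1}"] = "".join(chunk)
    return pairs

def join_value(name, pairs):
    """Reassemble the pieces by walking the numbered keys in order."""
    parts = []
    index = 1
    while f"{name}{index}" in pairs:
        parts.append(pairs[f"{name}{index}"])
        index += 1
    return "".join(parts)

long_text = "a" * 2500
pieces = split_value("description", long_text)
print(sorted(pieces), join_value("description", pieces) == long_text)
```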

Amazon is beginning to rewrite the APIs to push for more and better authentication. Come September 2009, calls to SimpleDB (and a few other services) will run through SSL, providing both security and authentication. Amazon is also enhancing the signature mechanism to use more sophisticated hashing algorithms that pack together more of the request. This is just one of the ways that Amazon is slowly rolling out small improvements.

The company is also creating more libraries that make it simpler to use the service. There are dozens of packages that work with all of the major languages and some of the minor ones. The documentation is extensive. It's usually possible to start up and begin storing your data in little time.

The price is now easier to handle. There's a "free tier" of service that lets you burn up to 25 hours of computation time per month -- enough, Amazon estimates, to run a basic logging tool that processes less than 2 million requests a month. Plus, Amazon recently slashed the price for storage from $1.50 to 25 cents per gigabyte. The company appears committed to keeping the charges transparent so users will have the right incentives to structure their consumption.

Amazon has one of the more advanced terms of service. There are plenty of clauses that work through some of the problems you might encounter, and several caught my untrained eye. For instance, Amazon claims, "We may delete, without liability of any kind, any of your Amazon SimpleDB Content that has not been accessed in the previous 6 months." This may be perfectly acceptable for the people who are taking the system out for a spin with test data and not paying for it, but the phrasing suggests a bit of the omnipotence that Amazon probably feels it needs to keep its datacenter running.

There are other squishy issues. For instance, the terms of service include a long list of forbidden content, such as material that promotes illegal activities or discriminates on the basis of "race, sex, religion, nationality, disability, sexual orientation, or age." Imagine you're running a Web site for some church campaigning against gay marriage. That sounds like it might be dinged for discriminating on the basis of sexual orientation. But let's say you're campaigning for gay marriage by protesting these churches. Are you discriminating on the basis of religion?

I feel sorry for the lawyers who are going to parse the complaints, but at least they can rest easy knowing they can pretty much delete your data "for any reason or for no reason." Whew. If you're just using the free service, Amazon doesn't have to give you any notice, but it promises a 60-day notice if you're a paying customer. You can get your data back -- if you pay the storage charges that keep accruing.

Google App Engine
Google App Engine isn't a database per se. It's a cloud for distributing Python applications, and it comes with its own database hidden away inside. It's not really possible to access the database without going through the application layer first. But it's not hard to wrap up a database call and format the data for the request, so it might be proper to think of App Engine as a database with a layer of embedded procedures that are written in Python.

This extra layer of customizability is often quite useful. Many of the complaints about the other toy databases revolve around how a missing feature makes it impossible to find the right data. If you want to add a bit more functionality to the database here, you can whip up many of the features locally in Python. If you want a JOIN, you can synthesize one in Python and probably customize the memory cache at the same time. This is especially useful for Web applications that let users store their data in the service. If you need to add security to restrict each user to the right data, you can code that in Python too.
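
Synthesizing a JOIN in application code usually means building an index on one side and walking the other. The sketch below is plain Python with invented entity fields; it doesn't call the App Engine data store at all, and the owner field used for the per-user restriction is an assumption for illustration.

```python
from collections import defaultdict

def hash_join(left, right, key):
    """Inner-join two lists of dicts on a shared key, in application code."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

def visible_to(rows, user):
    """Restrict results to rows owned by the requesting user."""
    return [row for row in rows if row["owner"] == user]

authors = [
    {"author_id": 1, "name": "Ada"},
    {"author_id": 2, "name": "Bob"},
]
posts = [
    {"author_id": 1, "title": "Hello", "owner": "ada"},
    {"author_id": 1, "title": "Again", "owner": "ada"},
    {"author_id": 2, "title": "Hi", "owner": "bob"},
]

joined = hash_join(authors, posts, "author_id")
print([row["title"] for row in visible_to(joined, "ada")])
```

In a real application the two lists would come from separate data store queries, which is where a memory cache starts to earn its keep.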

The App Engine data store is much more structured than Amazon's SimpleDB, and it gets much of this structure from Python's object model. You don't store key-value pairs, but Python objects, and those are defined with something that's pretty similar to an SQL schema. You can set the type of each column, make some of them required, and then ask for indexing across the columns that you'll need. The transaction mechanism is also deeply entwined with Python because each transaction is really just a Python function. This is a bit of a simplistic statement because there is a list of restrictions on what can happen inside this function (including rules such as each item can be updated only once). The good news is that the Google team is building special transaction methods that abstract away some of the common behavior (such as "Create or Update" a row).
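
The idea that a transaction is just a function can be shown without App Engine at all. The following is a pure-Python stand-in, not Google's implementation: the run_in_transaction helper here is written from scratch, the dictionary plays the part of the data store, and none of the real restrictions (such as the update-once rule) are enforced. The function runs against a private copy, and the changes are committed only if it returns without raising.

```python
import copy

datastore = {"counter": {"count": 0}}  # stand-in for the real data store

def run_in_transaction(func, *args):
    """Run func on a private copy; commit only if it returns normally."""
    working_copy = copy.deepcopy(datastore)
    result = func(working_copy, *args)  # an exception here skips the commit
    datastore.clear()
    datastore.update(working_copy)
    return result

def increment(store, key, amount):
    store[key]["count"] += amount
    return store[key]["count"]

def failing_update(store, key):
    store[key]["count"] = -1
    raise ValueError("simulated failure")

total = run_in_transaction(increment, "counter", 5)
try:
    run_in_transaction(failing_update, "counter")
except ValueError:
    pass  # the half-finished change was thrown away with the copy

print(total, datastore["counter"]["count"])
```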

Searching is deliberately set up to be SQL-like; in fact, Google offers its own SQL-like language, GQL, that's parsed into queries. There's also a Python-based set of methods that can be chained together to handle the data selection and querying. You don't need to waste the cycles parsing the query.
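
The difference between the two styles is easier to see side by side. The Query class below is a hypothetical, in-memory imitation of the chained style, written for this example; its method signatures are assumptions and are not the App Engine ones. The comment shows roughly what the same request would look like as a GQL-style string.

```python
import operator

OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}

class Query:
    """Toy chained query over a list of dicts; nothing is parsed."""

    def __init__(self, rows):
        self.rows = rows
        self.filters = []
        self.sort_key = None

    def filter(self, field, op, value):
        self.filters.append((field, OPS[op], value))
        return self  # returning self is what makes chaining work

    def order(self, field):
        self.sort_key = field
        return self

    def fetch(self, limit):
        rows = [
            row for row in self.rows
            if all(op(row[field], value) for field, op, value in self.filters)
        ]
        if self.sort_key is not None:
            rows = sorted(rows, key=lambda row: row[self.sort_key])
        return rows[:limit]

people = [
    {"name": "Cy", "age": 41},
    {"name": "Al", "age": 35},
    {"name": "Bo", "age": 22},
]

# Roughly: SELECT * FROM Person WHERE age > 30 ORDER BY name
results = Query(people).filter("age", ">", 30).order("name").fetch(10)
print([row["name"] for row in results])
```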

It's worth pointing out that the Python stack includes a number of features that aren't found in the best of databases. There's a library for manipulating image files by cropping and even a Google-esque "I feel lucky" function that will fix up the picture with some magic formula. If you want to e-mail someone, you can. You can also store data as Google documents, spreadsheets, and calendar items. It may seem like just a database at first, but it's easy to get sucked into the Google stack.

Until a few weeks ago, App Engine was in beta and using it was free. It's still free as long as you stay within some basic quotas. After that, Google is charging with a mechanism that's pretty similar to Amazon's. The price for storage is cheaper (12 cents per gigabyte per month), but the charge for bandwidth is about the same (10 cents per gigabyte coming in).