NoSQL databases break all the old rules

Amazon SimpleDB, CouchDB, Google App Engine, and Persevere may have a better way of storing data for your Web app

Google's terms of service carry a different set of responsibilities than Amazon's. You're required to formulate a privacy policy and guard the data of your users. If your users violate copyright rules, you must respond to DMCA (Digital Millennium Copyright Act) takedown notices or Google will do it for you. Google retains the right to delete any content at any time for any reason: "You agree that Google has no responsibility or liability for the deletion or failure to store any Content and other communications maintained or transmitted through use of the Service."

These terms have become more focused over the years. Google now promises to give you 90 days to get your data off of the servers if it decides to cancel your account -- something it can do "for any reason." Many of the changes I've noticed over that time seem to be focused on DMCA issues, which tie everyone in knots up and down the chain.

It's an interesting question what would happen if you decide to leave Google or Google asks you to leave. Google distributes a nice development tool that makes it easy for you to test your applications on your local machine. There's no technical reason why you couldn't host your service on your own server with these tools, except you would lose some of the cloud-like features. The data store included for testing wouldn't replicate itself automatically, but it seems to do everything else on my local machine. As always, there are some legal questions because "license is for the sole purpose of enabling you to use and enjoy the benefit of the Service."

Apache CouchDB
There's no reason why you need to work with a cloud to enjoy these new services. CouchDB is one of the many open source projects that build a simple database for storing key-value pairs. The project, written in Erlang, is supported under the aegis of the Apache Software Foundation. You can install it on any server by downloading the source files and compiling them. Then there are no more charges except paying for the server.

The CouchDB is similar to Amazon's tool, but it has some crucial differences. You still store key-value pairs as rows, but these pairs can be any of the standard JSON (JavaScript Object Notation) data types like Booleans and numbers. These values aren't limited to being 1,024-byte strings, something that makes it possible to store long values and even things like images. All of the requests and responses are formatted as JavaScript. There are no XML-based Web services, just JSON.

The biggest differences come when you're writing queries. CouchDB lets you write separate map functions and reduce functions using JavaScript. A simple query might just be a map function with a single "if" clause that tests to see whether the data is greater or less than some number. The reduce functions are only required if you're trying to compute some function across all of the data found by the map function. Counting the number of rows that are found is easy to do, and it's possible to carry off arbitrarily cool things as well, because the map function is limited only by what you can specify in JavaScript. This can be very powerful, although try as I might, I couldn't figure out any non-academic uses beyond counting the number of matches. The documentation includes one impressive reduction function that computes statistics, but I don't know if CouchDB is really the right tool for that kind of thing. If you need complex statistics, it may be better to stick with a traditional database with a traditional package for building reports with statistics.

There are still some limitations to this project. While the front page of the project calls it "a distributed, fault-tolerant, and schema-free document-oriented database," you won't get the distribution and fault-tolerance without some manual intervention. The nice AJAX interface to CouchDB includes a form that you can fill out to replicate the database. It's not automatic yet.

There are plans for an access control and security model, but these are not well-documented or even apparent in the APIs. They are designed to use pure JavaScript instead of SQL or some other language, which is a nice idea. You don't give or take away permissions to read documents; you just write a JavaScript function that returns true or false.

This approach isn't as limiting as it might seem. As I was working with these databases, I soon began to see how anyone could layer on a security model at the client with the judicious use of some encryption. Empowering the client reduces the need for much security work at the server, something I wrote about in Translucent Databases.

Observations like this are driving some of the more extreme users to push toward using CouchDB as the entire server stack. J. Chris Anderson, one of the committers on the project, wrote a fascinating piece arguing that CouchDB is all you need for an application server. The business logic for displaying and interacting with the data is written in JavaScript and downloaded from CouchDB as just another packet of JSON data.

In Anderson's eyes, there's no big reason to use Ruby, Python, Java, or PHP on the server when it can all be packaged in JavaScript. This may be a bit extreme because there will always be some business cases when the client machine can't be trusted to do the right thing, but they may be fewer than we know. Lightweight tools like CouchDB are encouraging people to rethink how much code we really need to get the job done.

At first glance, the Persevere database looks like most of the others. You push pairs of keys and values into it, and it stores them away. But that's just the beginning. Persevere provides a well-established hierarchy for objects that makes it possible to add much more structure to your database, giving it much of the form that we traditionally associate with the last generation of databases. Persevere is more of a back-end storage facility for JavaScript objects created by AJAX toolkits like Dojo, a detail that makes sense given that some of the principal developers work for SitePen, a consulting group with a core group of Dojo devotees.

Persevere is not like some of the other databases in this space that seem proud of the fact that they're "schema-free." It lets you add as much schema as you want to bring structure to your pairs. Instead of calling the top level of the hierarchy a domain (SimpleDB) or a document (CouchDB), Persevere calls them objects and even lets you create subclasses of the objects. If you want to enforce rules, you can insist that certain fields be filled with certain types, but there's no recommendation. The schema rules are optional.

The roots in the Dojo team are apparent because Dojo comes with a class, JsonRestStore, that can connect with Persevere and a number of other databases. including CouchDB. (Dojo 1.2 will also connect with Amazon's S3 but not SimpleDB and Google's Feed and Search APIs but not App Engine, at least out of the box.) The "Store" is sophisticated and has some surprising facilities. When I was playing with it initially, I hadn't given the clients the permissions to store data directly. The tool stored the data locally as if I were offline and had no connection with the database. When I granted the correct permissions later, the changes streamed through as if I had reconnected.

Persevere provides a great deal of connectivity through this tight connection with Dojo. You can create grid and tree widgets, then link them directly to the JsonRestStore; the widgets will let you edit the data. Voila! You've got remote access to a database in about 20 lines of JavaScript.

I encountered a number of small glitches that were probably due more to my lack of experience than to underlying bugs. Some things just started working correctly when I figured out exactly what to do. It's not so much Persevere itself you need to master, but the AJAX frameworks you're using in front of it. The documentation from Dojo is better than most AJAX frameworks, but it will take some time for Dojo to catch up with the underlying complexity that's hidden by the smooth surface of Persevere.

Cloud or cluster
After playing with these databases, I can understand why some people will keep using the word "toys" to describe them. They do very little, and their newness limits your options. There were a number of times when I realized that a fairly standard feature from the SQL world would make life simpler. Many of the standard SQL-based tools, like the reporting engines, can't connect with these oddities. There are a great many things that can be done with MySQL or Oracle out of the box.

But that doesn't mean that I'm not thinking of using them for one of my upcoming projects. They are solid data stores and so tightly integrated with AJAX that they make development very easy. Most Web sites don't need all of the functions of a MySQL or Oracle, and JOIN-free schemas are still pretty useful for many common data structures, including one-to-many and one-to-one relationships. Even many-to-one relationships are feasible until something needs to be changed. Given that database administrators are often denormalizing the tables to speed them up, you might say that these non-relational tools just save them a step.

One of the trickier questions is whether to use a cloud or build your own cluster of machines. Both Google and Amazon offer multimachine promises that CouchDB and Persevere can't match. You've got to push the buttons yourself with CouchDB. The Persevere team talks about scaling in the future. But it can be hard to guess how good the promises of Amazon and Google might be. What happens if Amazon or Google loses a disk? What if they lose a rack? They still don't make explicit promises and their terms of service explicitly disclaim any real responsibility.

Amazon's terms, for instance, repeat this sentiment a number of times: "We are not responsible for any unauthorized access to, alteration of, or the deletion, destruction, damage, loss or failure to store any of, Your Content (as defined in Section 10.2), your Applications, or other data which you submit or use in connection with your account or the Services."

I can't say I blame Amazon or Google because who knows who is ultimately responsible for a lost transaction? It could be any programmer in the stack, and it would be practically impossible to decide who trashed something. But it would be nice to have more information. Is the data in a SimpleDB stored in a RAID disk? Is a copy kept in another geographic area unlikely to be hit by the same earthquake, hurricane, or wildfire? The online backup community is starting to offer these kinds of details, but the clouds have not been so forthcoming.

All of these considerations make it clear to me that these are still toy databases that are best suited for applications that can survive a total loss of data. They're noble experiments that do a good job of making the limitations of scale apparent to programmers by forcing them to work with a data model that does a better job of matching the hardware. They are fun, fast, and so reasonable in price that you can forget about writing big checks and concentrate on figuring out how to work around the lack of JOINs.

InfoWorld Scorecard
Manageability (25.0%)
Scalability (20.0%)
Capability (25.0%)
Availability (20.0%)
Value (10.0%)
Overall Score (100%)
Amazon SimpleDB 8.0 9.0 8.0 8.0 8.0 8.2
Apache CouchDB 7.0 7.0 7.0 8.0 9.0 7.4
Google App Engine 8.0 9.0 8.0 8.0 8.0 8.2
Persevere Server 7.0 7.0 8.0 8.0 9.0 7.7
| 1 2 Page 4