NoSQL showdown: MongoDB vs. Couchbase

Which NoSQL database has richer querying, indexing, and ease of use?


Couchbase supports cross-data-center replication (XDCR), which provides live replication of database contents from one Couchbase cluster to a geographically remote cluster. Note that XDCR operates simultaneously with intracluster replication (the copying of live documents to their inactive replica counterparts on other cluster members), and all systems in an XDCR arrangement invisibly synchronize with one another. However, Couchbase does not provide automatic fail-over for XDCR arrangements. Instead, you must reroute traffic yourself, typically with a load balancer at the network layer, which presumes the XDCR group has been set up in a master-master configuration.

Couchbase indexing and queries

Queries on Couchbase Server are performed via "views," Couchbase terminology for indexes. Put another way, when you create an index, you're provided with a view that serves as your mechanism for querying Couchbase data. Views are new to Couchbase 2.0, as is the incremental mapreduce engine that powers the actual creation of views. Note that queries really didn't exist prior to Couchbase Server 2.0. Until this latest release, the database was a key/value storage system that simply did not understand the concept of a multifield document.

To define a view, you build a specific kind of document called a design document. The design document holds the JavaScript code that implements the mapreduce operations that create the view's index. Design documents are bound to specific buckets, which means that queries cannot execute across multiple buckets. Couchbase's "eventual consistency" plays a role in views as well. If you add a new document to a bucket or update an existing document, the change may not be immediately visible.
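
To make this concrete, a design document (stored under a name such as _design/books) holds one or more named views, each with a map function stored as a string. The map function itself is ordinary JavaScript; the view and field names below are illustrative:

    // Map function for a hypothetical view named "by_author".
    // Index entries are created only for documents that carry
    // an author field; the metadata argument is new in 2.0.
    function (doc, meta) {
      if (doc.author) {
        emit(doc.author, null);
      }
    }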

The map function in a design document's mapreduce specification filters and extracts information from the documents against which it executes. The result is a set of key/value pairs that comprise the query-accelerating index. The reduce function is optional. It is typically used to aggregate or sum the data manipulated by the map operation. Code in the reduce function can be used to implement operations that correspond roughly to SQL's ORDER BY, GROUP BY, and aggregation features.
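
For instance, if the map function above emitted emit(doc.author, doc.price) instead, a hand-written reduce could total the prices per author. The sketch below behaves like the built-in _sum; note that a reduce function must also handle the rereduce case, in which it receives its own partial results:

    function (keys, values, rereduce) {
      // On the first pass, values holds the prices emitted by map;
      // on a rereduce pass, it holds partial sums. Either way the
      // inputs are numbers, so a plain sum is correct in both cases.
      var total = 0;
      for (var i = 0; i < values.length; i++) {
        total += values[i];
      }
      return total;
    }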

Couchbase Server supplies built-in reduce functions: _count, _stats, and _sum. These built-in functions are optimized beyond what would be possible if written from scratch. For example, the _count function (which counts the number of rows returned by the map function) doesn't have to recount all the documents when called. If an item is added to or removed from the associated index, the count is incremented or decremented appropriately, so the _count function need merely retrieve the maintained value.
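
Using a built-in reduce is simply a matter of naming it in place of a JavaScript function body, as in this fragment of a design document's views section (view name illustrative):

    "titles_per_author": {
      "map": "function (doc, meta) { if (doc.author) { emit(doc.author, null); } }",
      "reduce": "_count"
    }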

Query parameters offer further filtering of an index. For example, you can use query parameters to define a query that returns a single entry or a specified range of entries from within an index. In addition, in Couchbase 2.0, document metadata is available. The usefulness of this becomes apparent when building mapreduce functions, as the map function can employ metadata to filter documents based on parameters such as expiration date and revision number.
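
Views are exposed over HTTP on port 8092, so a parameterized query can be issued with nothing more than curl. The bucket, design document, and view names below are illustrative:

    # Return at most 10 index entries whose keys fall between "a" and "m".
    # Keys are JSON-encoded, hence the quoted (%22) values.
    curl 'http://localhost:8092/books/_design/books/_view/by_author?startkey=%22a%22&endkey=%22m%22&limit=10'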

Couchbase indexes are updated incrementally. That is, when an index is updated, it's not reconstructed wholesale. Updates only involve those documents that have been changed or added or removed since the last time the index was updated. You can configure an index to be updated when specific circumstances occur. For example, you might configure an index to be updated whenever a query is issued against it. That, however, might be computationally expensive, so an alternative is to configure the index to be updated only after a specified number of documents within the view have been modified. Still another alternative is to have the view updated based on a time interval.
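
Two of these knobs can be sketched concretely. A query can demand a fresh index with the stale parameter (stale=false forces an update before results are returned; the default, stale=update_after, updates the index after returning), and a design document can carry auto-update options. The option name below follows Couchbase's automated-update settings but should be treated as illustrative:

    # Force the index to be brought up to date before answering.
    curl 'http://localhost:8092/books/_design/books/_view/by_author?stale=false'

    {
      "views": { ... },
      "options": { "updateMinChanges": 5000 }
    }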

Whatever configuration you choose, it's important to realize that a design document can hold multiple views and the configuration applies to all views in the document. If you update one index, all indexes in the document will be updated.

Finally, Couchbase distinguishes between development and production views, and the two are kept in separate namespaces. Development views can be modified; production views cannot. The reason for the distinction arises from the fact that, if you modify a view, all views in that design document will be invalidated, and all indexes defined by mapreduce functions in the design document will be rebuilt. Therefore, development views enable you to test your view code in a kind of sandbox before deploying it into a production view.

[Screenshot: The Couchbase Web Console's view of active servers, opened to a single member of the cluster. Memory cache and disk storage usage information is readily available.]

Managing Couchbase

For gathering statistics and managing a Couchbase Server cluster, the Couchbase Web Console -- available via any modern browser on port 8091 of a cluster member -- is the place to go. It provides a multitab view into cluster mechanics. The tabs include:

  • Cluster overview, which has general RAM and disk usage information (aggregated for the whole cluster), along with operations per second and bucket usage. The information is presented in a smoothly scrolling line graph.
  • Server nodes, which provides information similar to the above, but for individual members of the cluster. You can also see CPU usage and swap space usage. On this tab, you can add a new node to the cluster: Click the Add Server button and you're prompted for IP address and credentials.
  • Data buckets, which shows all the buckets on the cluster. You can see which nodes participate in the storage of a given bucket, how many items are in each bucket, RAM and disk usage attributed to a bucket, and so on.

The Couchbase Web Console provides much more information than can be covered here. An in-depth presentation of its capabilities can be found in Couchbase Server's online documentation.

For administrators who would rather perform their management duties on the metal, Couchbase provides a healthy set of command-line tools. General management functions are found in the couchbase-cli tool, which lists all the servers in a cluster, retrieves information for a single node, initiates rebalancing, manages buckets, and more. The cbstats command-line tool displays cluster statistics, and it can be made to fetch the statistics for a single node (the variety of statistical information retrieved is too diverse to list here). The cbepctl command lets you modify a cluster's RAM and disk management characteristics. For example, you can control the "checkpoint" settings, which govern how frequently replication occurs.
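
A few representative invocations give the flavor of these tools (host addresses and parameter values are illustrative; consult each tool's help output for the full syntax):

    # List the servers in the cluster
    couchbase-cli server-list -c 192.168.1.5:8091 -u Administrator -p password

    # Dump all statistics from one node (data port 11210)
    cbstats 192.168.1.5:11210 all

    # Raise the checkpoint item threshold that governs replication frequency
    cbepctl 192.168.1.5:11210 set checkpoint_param chk_max_items 10000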

Other command-line tools include data backup and restore, a tool to retrieve data from a node that has (for whatever reason) stopped running, and even a tool for generating an I/O workload on a cluster member to test its performance.

Couchbase Server is available in both Enterprise and Community editions. The Enterprise edition undergoes more thorough testing than the Community edition, and it receives the latest bug fixes. Also, hot fixes are available, as is 24/7 support (with the purchase of an annual subscription). Nevertheless, the Enterprise edition is free for testing and development on any number of nodes or for production use on up to two nodes. The Community edition, as you might guess, is free for any number of production nodes.

Pros and cons: Couchbase Server 2.0

Pros
  • Provides legacy Memcached capabilities
  • Supports spatial data and views
  • Now a true document database
Cons
  • Indexing mechanisms not well developed
  • JSON support is relatively immature
  • Does not support range sharding
MongoDB

MongoDB is about three years old, first released in late 2009. The goal behind MongoDB was to create a NoSQL database that offered high performance without casting out the good aspects of working with RDBMSes. For instance, the way that queries are designed and optimized in MongoDB is similar to how it would be done in an RDBMS. MongoDB's designers also wanted to make the database easier for application developers to work with -- for example, by allowing developers to change the data model quickly. MongoDB, whose name is short for "humongous," stores documents in BSON (Binary JSON), an extension of JSON that allows for the use of data types such as integers, dates, and byte arrays.

Two primary processes are at work in a MongoDB system, mongod and mongos. The mongod process is the real workhorse. In a sharded MongoDB cluster, mongod can be found playing one of two roles: config server or shard server. The config server tracks the cluster's metadata. (In a sharded MongoDB cluster, there must be at least three config servers for redundancy's sake.) Each config server knows which server in the cluster is responsible for a given document or, more precisely, where a given contiguous range of shard keys (called a chunk) belongs in the cluster.

Other mongod processes in the cluster run as shard servers, and these handle the actual reading and writing of the data. For fail-over purposes, each shard is served by two or more mongod instances arranged in a replica set. One process is primary, and the others are secondaries. All write requests go to the primary, while read requests can go to either the primary or a secondary.

Secondaries are updated asynchronously from the primary so that they can take over in the event of a primary's crash. This, however, means that some read requests (sent to secondaries) may not be consistent with write requests (sent to primaries). This is an instance of MongoDB's "eventual consistency." Over time, all secondaries will become consistent with write operations on the primary. Note that you can guarantee consistent read/write behavior by configuring a MongoDB system such that all I/O -- reads and writes -- go to the primary instances. In such an arrangement, secondaries act as standby servers, coming online only when the primary fails.
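
In the mongo shell, the routing of reads is controlled by a read preference. A minimal sketch, with collection and field names hypothetical:

    // Route every read on this connection to the primary,
    // trading read scaling for read-your-own-writes consistency.
    db.getMongo().setReadPref("primary")

    // Or relax the rule for one query that tolerates stale data.
    db.orders.find({ status: "open" }).readPref("secondaryPreferred")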

The mongos process, which runs at a conceptually higher level than the mongod processes, is best thought of as a kind of routing service. Database requests from clients arrive first at a mongos process, which determines which shard(s) in a sharded cluster can service each request. The mongos process dispatches I/O requests to the appropriate mongod processes, gathers the results, and returns them to the client. Note that in a nonsharded cluster, clients talk directly to a mongod process.
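
A minimal sharded deployment therefore involves all three process roles. The host names, ports, and data paths below are illustrative:

    # Three config servers track the cluster's metadata.
    mongod --configsvr --dbpath /data/cfg0 --port 27019
    # (and likewise on two more hosts)

    # Shard servers store the data.
    mongod --shardsvr --dbpath /data/shard0 --port 27018

    # The router fields client requests and consults the config servers.
    mongos --configdb cfg0.example.net:27019,cfg1.example.net:27019,cfg2.example.net:27019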

MongoDB scaling and replication

MongoDB doesn't have an explicit memory caching layer. Instead, all MongoDB operations are performed through memory-mapped files. Consequently, MongoDB hands off the chore of juggling memory caching versus persistence-to-disk to the operating system. You can tweak various flush-to-disk settings for optimal performance, however. For example, MongoDB maintains a journal of database operations (for recovery purposes) that is flushed to the disk every 100ms. Not only is this interval configurable, but you can configure the system so that write operations return only after the journal has been written to disk.
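
For example, a 2.4-era deployment might shorten the journal flush interval at startup and then confirm an individual write against the journal with getLastError (collection and field names are hypothetical):

    mongod --journal --journalCommitInterval 50

    // In the mongo shell: the acknowledgment returns only after
    // the journal entry for the write has reached disk.
    db.orders.insert({ item: "widget", qty: 2 })
    db.runCommand({ getLastError: 1, j: true })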

Documents are placed in named containers called collections, which are roughly equivalent to Couchbase's buckets. A collection serves as a means of partitioning related documents into separate groups. The effects of many multidocument operations in a MongoDB database are restricted to the collection in which those operations are performed. MongoDB supports sharding at the collection level, which means -- should requirements dictate -- you could construct a database with unsharded and sharded collections. Of course, only a sharded collection is protected against a single point of failure.

A document's membership in a particular shard is determined by a shard key, which is derived from one or more fields in each document. The exact fields can be specified by the database administrator. In addition, MongoDB provides autosharding, which means that, once you've configured sharding, MongoDB will automatically manage the storage of documents in the appropriate physical location. This includes rebalancing shards as the number of documents grows or the number of mongod instances changes.
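
Configuring autosharding is done from the mongo shell. A sketch, with database, collection, and field names hypothetical:

    // Mark the database, then the collection, as sharded;
    // documents are partitioned by the customerId field.
    sh.enableSharding("sales")
    sh.shardCollection("sales.orders", { customerId: 1 })

    // Inspect how chunks have been distributed.
    sh.status()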

As of the 2.4 release, MongoDB supports both hash-based sharding and range-based sharding. As you might guess, hash-based sharding hashes the shard key, which creates a relatively even distribution of documents across the cluster. With range-based sharding (the sole sharding type prior to 2.4), a given member of a MongoDB sharded cluster will store all the documents within a given subrange of the shard key's overall domain. More precisely, MongoDB defines a logical container, called a chunk, which is a subset of documents whose shard keys fall within a specific range. The mongos process then dictates which mongod process will manage a given chunk.
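
The hashed variant differs only in the shard key specification:

    // 2.4 and later: distribute documents by a hash of _id,
    // trading range-query locality for an even spread across shards.
    sh.shardCollection("sales.orders", { _id: "hashed" })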

Typically, you permit the load balancer to determine which cluster member manages a given shard range. However, with version 2.4, you can associate tags with shard ranges (a tag being nothing more than an identifying string). Once that's done, you can specify which member of a cluster will manage any shard ranges associated with a tag. In a sense, this lets you override some of the load balancer's decision making and steer identifiable subsets of the database to specific servers. For example, you could put the data most frequently accessed from California on the cluster member in California, the data most frequently accessed from Texas on the cluster member in Texas, and so on.
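
In shell terms, tag-aware sharding is a pair of calls (shard names, tags, and ranges are illustrative):

    // Pin the "CA" tag to a particular shard...
    sh.addShardTag("shard0000", "CA")

    // ...then steer a range of shard key values to that tag.
    // The lower bound is inclusive, the upper bound exclusive.
    sh.addTagRange("sales.orders", { state: "CA" }, { state: "CB" }, "CA")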

MongoDB locks at the database level, whereas locking was global prior to version 2.2. The system implements shared-read, exclusive-write locking (many concurrent readers, but only one writer) with priority given to waiting writers over waiting readers. MongoDB avoids contention via yield operations within locks. Predictive coding was added in the 2.2 release: If a process requests a document that is not in memory, it yields its lock so that other processes -- whose documents are in memory -- can be serviced. Long-running operations will also periodically yield locks.
