NoSQL showdown: MongoDB vs. Couchbase
MongoDB edges Couchbase Server with richer querying and indexing options, as well as superior ease of use
Couchbase promotes Couchbase Server as a solution for real-time access, not data warehousing. Nor is Couchbase Server suitable for batch-oriented analytic processing -- it is designed to be an operational data store.
Though Couchbase Server is based on Apache CouchDB, it is more than CouchDB with incremental modifications. For starters, Couchbase is an amalgam of CouchDB and Memcached, the distributed, in-memory, key/value storage system. In fact, Couchbase can be used as a direct replacement for Memcached. The system provides a separate port that unmodified, legacy Memcached clients can use, as well as "smart SDK" and proxy tools that improve its performance as a Memcached server.
For example, you can use a "thick client" deployment model, which will place the continuously updated knowledge of Memcached node topology on the client tier. This speeds response, as any request for a particular Memcached object will be sent from the client directly to the caching node for that object. This thick-client approach also plays an important role in the Couchbase system's resilience to node crashes (described later).
Couchbase includes its own object-level caching system based on Memcached, though with enhancements. For example, Couchbase tracks working sets (the documents most frequently accessed on a given node) in its object cache using NRU (not recently used) algorithms. All I/O operations act on this in-memory cache. Updates to documents in the cache are eventually persisted to disk. In addition, for updates, locking is employed at the document level -- not at the node, database, or partition level (which would hobble throughput with numerous I/O waits), nor at the field level (which would snarl the system with memory and CPU cycles required to track the locks).
Couchbase accelerates access by using "append only" persistence. This is used not only with the data, but with indexes as well. Updated information is never overwritten; instead, it is appended to the end of whatever data structure is being modified. Further, deleted space is reclaimed by compaction, an operation that can be scheduled to take place during times of low activity. Append-only storage speeds updates and allows read operations to occur while writes are taking place.
Couchbase scaling and replication
To facilitate horizontal scaling, Couchbase uses hash sharding, which ensures that data is distributed uniformly across all nodes. The system defines 1,024 partitions (a fixed number), and once a document's key is hashed into a specific partition, that's where the document lives. In Couchbase Server, the key used for sharding is the document ID, a unique identifier automatically generated and attached to each document. Each partition is assigned to a specific node in the cluster. If nodes are added or removed, the system rebalances itself by migrating partitions from one node to another.
There is no single point of failure in a Couchbase system. All partition servers in a Couchbase cluster are equal, with each responsible for only that portion of the data assigned to it. Each server in a cluster runs two primary processes: a data manager and a cluster manager. The data manager handles the actual data in the partition, while the cluster manager deals primarily with intranode operations.
System resilience is enhanced by document replication. The cluster manager process coordinates the communication of replication data with remote nodes, and the data manager process shepherds whatever replica data the cluster has assigned to the local node. Naturally, replica partitions are distributed throughout the cluster so that the replica copy of a partition is never on the same physical server as the active partition.
Like the documents themselves, replicas exist on a bucket basis -- a bucket being the primary unit of containment in Couchbase. Documents are placed into buckets, and documents in one bucket are isolated from documents in other buckets from the perspective of indexing and querying operations. When you create a new bucket, you are asked to specify the number of replicas (up to three) to create for that bucket. If a server crashes, the system will detect the crash, locate the replicas of the documents that lived on the crashed system, and promote those replicas to active status. The system maintains a cluster map, which defines the topology of the cluster, and this is updated in response to the crash.
Note that this scheme relies on thick clients -- embodied in the API libraries that applications use to communicate with Couchbase -- that are in constant communication with server nodes. These thick clients will fetch the updated cluster map, then reroute requests in response to the changed topology. In addition, the thick clients participate in load-balancing requests to the database. The work done to provide load balancing is actually distributed among the smart clients.