Big data showdown: Cassandra vs. HBase
Bigtable-inspired open source projects take different routes to the highly scalable, highly flexible, distributed, wide column data store
Both are referred to as column-oriented databases, which can be confusing because it sounds like a relational database that's been conceptually rotated 90 degrees. What's even more confusing is the fact that data is apparently arranged initially by row, and a table's primary key is the row key. However, unlike a relational database, no two rows in a column-oriented database need have the same columns. As mentioned above, you can add columns to a row on the fly (after the table has been created). In fact, you can add lots of columns to a row. The exact upper limit is hard to calculate, but it's unlikely you'll hit the limit even if you add tens of thousands of columns.
Apart from characteristics derived from the Bigtable definition, Cassandra and HBase share other similarities.
Both implement similar write paths that begin with logging write operations to a log file to ensure durability. Even if the remainder of the write fails, the operation saved in the log can be replayed. The data is written next to a memory cache, then finally to disk via a large, sequential write (essentially a copy of the memory cache). The overall memory-and-disk data structure used by both Cassandra and HBase is more or less a log-structured merge tree. The disk component in Cassandra is the SSTable; in HBase it is the HFile.
Both provide command-line shells implemented in JRuby. Both are written largely in Java, which is the primary programming language for accessing either -- though client libraries are available for both in many other programming languages.
Of course, both Cassandra and HBase are open source projects managed under the Apache Software Foundation, and both are available free under an Apache version 2 license.
Clusters and contrasts
Nevertheless, in spite of all these parallels, you'll find a number of key differences.
While nodes in both Cassandra in HBase are symmetrical -- meaning that clients can connect to any node in the cluster -- the symmetry is not complete. Cassandra requires that you identify some nodes as seed nodes, which serve as concentration points for intercluster communication. Meanwhile, on HBase, you must press some nodes into serving as master nodes, whose job it is to monitor and coordinate the actions of region servers. Thus, Cassandra guarantees high availability by allowing multiple seed nodes in a cluster, while HBase guarantees the same via standby master nodes -- one of which will become the new master should the current master fail.
Cassandra uses the Gossip protocol for internode communications, and Gossip services are integrated with the Cassandra software. HBase relies on Zookeeper -- an entirely separate distributed application -- to handle corresponding tasks. While HBase ships with a Zookeeper installation, nothing stops you from using a pre-existing Zookeeper ensemble with an HBase database.
While neither Cassandra nor HBase support real transactions, both provide some level of consistency control. HBase gives you strong record-level (that is, row-level) consistency. In fact, HBase supports ACID-level semantics on a per-row basis. Also, you can lock a row in HBase, though this is not encouraged, not only because it hampers concurrency, but also because a row lock will not survive a region split operation. In addition, HBase has a "check and put" operation, which provides atomic "read-modify-write" semantics on a single data element.
DataStax OpsCenter -- included in the free DataStax Community Edition of Cassandra -- provides both cluster monitoring and management. Here the schema of a database is examined. Note that keyspaces can be edited and column families added or deleted.