Big data showdown: Cassandra vs. HBase
Bigtable-inspired open source projects take different routes to the highly scalable, highly flexible, distributed, wide column data store
Meanwhile, though Cassandra is described as having "eventual" consistency, both read and write consistency can be tuned, not only by level, but in extent. That is, you can configure not only how many replica nodes must successfully complete the operation before it is acknowledged, but also whether the participating replica nodes span data centers.
Further, Cassandra has added lightweight transactions to its repertoire. Cassandra's lightweight transaction is a "compare and set" mechanism roughly comparable to HBase's "check and put" capability; HBase also has a "read-check-delete" operation for which Cassandra has no counterpart. Finally, Cassandra's 2.0 release added row-level write isolation: If a client updates multiple columns in a row, other clients will see either none of the updates or all of the updates.
In both Cassandra and HBase, the primary index is the row key, but data is stored on disk such that column family members are kept in close proximity to one another. It is therefore important to carefully plan the organization of column families. To keep query performance high, columns with similar access patterns should be placed in the same column family. Cassandra lets you create additional, secondary indexes on column values. This can improve data access in columns whose values have a high level of repetition -- such as a column that stores the state field of a customer's mailing address. HBase lacks built-in support for secondary indexes, but offers a number of mechanisms that provide secondary index functionality. These are described in HBase's online reference guide and on HBase community blogs.
As stated earlier, both databases have "command line" shells for issuing data manipulation commands. Both HBase's and Cassandra's shells are built on the JRuby shell, so you can write scripts that employ all of the JRuby shell's resources to interact with specific APIs provided by the databases. In addition, Cassandra has defined CQL, modeled after SQL. CQL is far richer than the query language used by HBase, and it can be executed directly in Cassandra's shell.
In fact, Cassandra is moving toward CQL as the database's primary programming interface, though Cassandra still supports the Thrift API. (Thrift is language independent, but it's now considered a legacy API.) The Cassandra documentation lists drivers for Java, C#, and Python, all of which employ CQL version 3. Finally, a JDBC driver is also available for Cassandra. It uses CQL in place of SQL as its data definition and data management languages.
HBase's native Java API provides the richest functionality to programmers, though HBase also sports the language-agnostic Thrift interface, as well as a RESTful Web service interface. While the data manipulation commands of HBase are not as rich as CQL, HBase does have a "filter" capability that executes on the server side of a session and improves scanning (search) throughput.
HBase has also introduced "coprocessors," which allow the execution of user code in the context of the HBase processes. The result is roughly comparable to the relational database world's triggers and stored procedures. Cassandra currently has no counterpart to HBase's coprocessors.
Cassandra's documentation is noticeably better than HBase's, and good documentation certainly flattens the learning curve. In my experience, setting up a development Cassandra cluster is simpler than setting up an HBase cluster. Of course, this is only important for development and testing purposes.
An HBase master node hosts a Web interface on port 60010. Here you can browse information such as the node's execution history, tables managed by the node, and region servers in the master's domain.