Review: Cassandra lowers the barriers to big data
Apache Cassandra 2.0 combines NoSQL flexibility and scalability with friendly SQL-like queries
Because Cassandra is distributed, a cluster's members require a mechanism for discovering one another and communicating state information. This is where Cassandra's Gossip protocol comes in. As you might suspect, Gossip gets its name from the human activity of passing information throughout a group via apparently random, person-to-person conversations.
Certain nodes in a cluster are designated as "seed" nodes. Each second, a timer on a Cassandra node fires, initiating communication with two or three randomly selected nodes in the cluster, one of which must be a seed node. Consequently, seed nodes will tend to have the most up-to-date view of a cluster. (When a new node is added to a cluster, it first contacts a seed node.)
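The peer selection described above can be sketched in a few lines of Python. This is only a simplified model of the behavior as this review describes it, not Cassandra's actual Gossiper code; the function name, its parameters, and the node names are hypothetical.

```python
import random


def pick_gossip_targets(self_node, live_nodes, seed_nodes, rng=random):
    """Choose peers for one gossip round (illustrative model only).

    Picks up to two random peers, then adds a seed node if none of the
    chosen peers is already a seed, so a round contacts two or three nodes.
    """
    peers = [n for n in live_nodes if n != self_node]
    seeds = [n for n in seed_nodes if n != self_node]

    # Start with up to two randomly selected peers.
    targets = rng.sample(peers, k=min(2, len(peers)))

    # Make sure at least one seed is contacted in every round.
    if seeds and not any(t in seeds for t in targets):
        targets.append(rng.choice(seeds))
    return targets
```

Called once per timer tick, for example as pick_gossip_targets("n3", nodes, seeds), the function always includes a seed when one is available, which is why seeds tend to hold the freshest view of the cluster.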
Cassandra works to keep Gossip communication efficient. Each node maintains two kinds of state. HeartBeatState tracks the node's version number, which is incremented any time information on the node changes, and how many times the node has been restarted. ApplicationState tracks the operational state of the node (such as the current load). Nodes exchange digests of HeartBeatState information with one another. If differences are found, the nodes then exchange digests of ApplicationState info, and ultimately the ApplicationState data itself. In addition, the Gossip algorithm first seeks to resolve differences that are "farther apart" (in terms of version numbers), since those are most likely to reflect the widest inconsistencies.
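A rough Python sketch of the digest comparison follows. It assumes a digest carries only an endpoint name, a restart counter (called generation here), and a version number, and it orders the work by the size of the version gap. The class, the function, and the gap rule used when the restart counters differ are all assumptions made for illustration, not Cassandra's real data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Digest:
    endpoint: str
    generation: int  # assumed: bumped each time the node restarts
    version: int     # assumed: bumped each time the node's state changes


def plan_exchange(local_view, remote_digests):
    """Return (endpoints_to_request, endpoints_to_send), widest gap first.

    local_view maps endpoint -> (generation, version) as known locally.
    """
    requests, sends = [], []
    for d in remote_digests:
        known = local_view.get(d.endpoint)
        if known is None:
            # Never heard of this endpoint: ask for everything.
            requests.append((d.version, d.endpoint))
            continue
        local_gen, local_ver = known
        if (d.generation, d.version) == (local_gen, local_ver):
            continue  # digests match, nothing to exchange
        remote_newer = (d.generation, d.version) > (local_gen, local_ver)
        if d.generation == local_gen:
            gap = abs(d.version - local_ver)
        else:
            # After a restart, treat the newer side's whole history as the gap.
            gap = d.version if remote_newer else local_ver
        (requests if remote_newer else sends).append((gap, d.endpoint))

    def widest_first(items):
        return [endpoint for _, endpoint in sorted(items, key=lambda i: -i[0])]

    return widest_first(requests), widest_first(sends)
```

Only endpoints whose digests differ end up in either list, so the full ApplicationState data travels only where the two nodes actually disagree.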
Working with Cassandra
RDBMS users familiar with SQL should feel right at home with CQL, the Cassandra Query Language, which can be executed from the Python-based Cassandra shell utility (cqlsh) or through any of several client drivers. Client drivers are available from websites like Planet Cassandra, where you'll find CQL-enabled drivers for Java, C#, Node.js, PHP, and others.
In the past, drivers communicated with a Cassandra cluster using a Thrift API -- Thrift being a framework for creating what amounts to language-independent remote procedure calls for client and server. Cassandra's Thrift API is now considered a legacy feature, as the CQL specification defines not only the CQL language, but an on-the-wire communication protocol as well.
CQL's syntax resembles its relational cousin's. It has familiar statements such as SELECT, INSERT, UPDATE, and DELETE, and these are accompanied by WHERE clauses. In addition, CQL's data types are what you would expect. You'll find integers, floats and doubles, blobs, and more. Of course, there are differences. For one, CQL has no JOIN operation. And when you write a FROM clause, you specify column families -- though, as of the latest version of CQL, the term "table" is used in place of "column family." CQL also lets you specify the desired consistency level for any operation, but its real benefit is that it is a data management language quickly grasped by relational programmers, and is independent of a specific programming API.
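To give a feel for the syntax, here is a small Python sketch that holds a few CQL statements as strings and passes them to a session object. The table and column names are hypothetical, and the session is assumed to behave like the DataStax Python driver's Session (an execute(query, parameters) method with %s placeholders) and to be already connected to a keyspace.

```python
# CQL statements as plain strings; the "users" table is a made-up example.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS users (
    user_id int PRIMARY KEY,
    name text,
    balance double
)
"""
INSERT_USER = "INSERT INTO users (user_id, name, balance) VALUES (%s, %s, %s)"
SELECT_USER = "SELECT name, balance FROM users WHERE user_id = %s"
DELETE_USER = "DELETE FROM users WHERE user_id = %s"


def run_demo(session):
    """Run the statements through any object exposing execute(query, params).

    The session is an assumption: something shaped like a driver session
    that is already bound to a keyspace.
    """
    session.execute(CREATE_TABLE)
    session.execute(INSERT_USER, (1, "alice", 12.5))
    rows = session.execute(SELECT_USER, (1,))  # lookup by primary key, no JOIN
    session.execute(DELETE_USER, (1,))
    return rows
```

Apart from the absence of JOIN and the requirement to query by key, the statements would look at home in most relational databases.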
Installing Cassandra is reasonably straightforward, particularly if you download the DataStax Community edition, which bundles a Web-based management application called OpsCenter. I downloaded and installed the tarball version of Cassandra on my Ubuntu Linux system (the apt-get version for some reason refused to install) and found that the real work lies in configuring a Cassandra cluster. The cassandra.yaml file holds scads of tunable parameters for the node and its cluster.