Review: Cassandra lowers the barriers to big data
Apache Cassandra 2.0 combines NoSQL flexibility and scalability with friendly SQL-like queries
For example, you can set the number of tokens that will be assigned to the node, which controls the proportion of data (relative to other nodes) that the node will be responsible for. (This is useful if your cluster is composed of heterogeneous hardware because more powerful members can be configured to handle heavier loads.) Happily, for a small trial installation, you need only configure the listening IP address for the current node and the IP addresses of the cluster's seed nodes.
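To make the token idea concrete, here is a minimal sketch of how token counts translate into data ownership. It assumes ownership is roughly proportional to each node's share of the cluster's tokens, which is an approximation; the node names and token counts are hypothetical, and `expected_ownership` is an illustrative helper, not part of Cassandra. (In `cassandra.yaml` the relevant setting is `num_tokens`.)

```python
from typing import Dict


def expected_ownership(tokens_per_node: Dict[str, int]) -> Dict[str, float]:
    """Approximate fraction of the data each node is responsible for.

    Assumption: a node's share is proportional to its token count
    relative to the total number of tokens in the cluster.
    """
    total = sum(tokens_per_node.values())
    if total <= 0:
        raise ValueError("cluster must have at least one token")
    return {node: count / total for node, count in tokens_per_node.items()}


# Hypothetical three-node cluster in which node-c is the more powerful
# machine and is therefore given twice as many tokens as the others.
cluster = {"node-a": 256, "node-b": 256, "node-c": 512}
shares = expected_ownership(cluster)

for node, share in shares.items():
    print(f"{node}: {share:.0%}")
```

In this hypothetical cluster, node-c ends up responsible for about half of the data and the other two nodes for about a quarter each.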
OpsCenter runs a server process on your management host that communicates with agent processes executing on the cluster's nodes. The agents gather usage and performance information and send it to the server, which provides a browser-based user interface for viewing the aggregated results. With OpsCenter, you can browse data, examine throughput graphs, manage column families, initiate cluster rebalancing, and so on. (As an aside, I was unable to get OpsCenter working on my Linux installation. The DataStax Community Edition installation on Windows worked only partially: it could not connect to the agent service.)
While documentation -- primarily in the form of FAQs, wikis, and blogs -- exists on the Apache Cassandra site and the Planet Cassandra site, DataStax is the most comprehensive source for Cassandra documentation and tutorials. In fact, Planet Cassandra's Getting Started page more or less points you to the DataStax pages.
DataStax maintains documentation of both current and previous versions; as Cassandra is updated, you can troubleshoot any earlier installations you continue to run. The Web pages are well hyperlinked and provide plenty of diagrams. Along with video tutorials, you'll also find reference guides for Java and C# drivers, as well as developer blogs on Cassandra internals.
Until recently, Cassandra provided no transactional capabilities. However, the latest release of Cassandra (version 2.0) adds "lightweight transactions" that employ an atomic "compare and set" architecture. In CQL, this is manifested as a conditional IF clause on UPDATE commands. The data is modified if a particular condition is true. You can imagine a CQL INSERT statement that will only add a new row if the row does not exist, and the presence of the transactional IF test will guarantee that the INSERT is atomic for the database.
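The compare-and-set behavior can be illustrated with a small in-memory model. This is only a sketch of the semantics, not how Cassandra implements them: `TinyTable` and its methods are hypothetical names, and a single-process lock stands in for the coordination Cassandra performs across replicas. The two methods correspond loosely to the CQL forms `INSERT ... IF NOT EXISTS` and `UPDATE ... IF column = value`.

```python
import threading
from typing import Any, Dict, Optional


class TinyTable:
    """Toy key/row store with compare-and-set style writes (illustrative only)."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, Any]] = {}
        self._lock = threading.Lock()  # stand-in for cluster-wide coordination

    def insert_if_not_exists(self, key: str, row: Dict[str, Any]) -> bool:
        """Add the row only if no row with this key exists; report whether it was applied."""
        with self._lock:
            if key in self._rows:
                return False
            self._rows[key] = dict(row)
            return True

    def update_if(self, key: str, column: str, expected: Any, new_value: Any) -> bool:
        """Set column to new_value only if it currently equals expected."""
        with self._lock:
            row = self._rows.get(key)
            if row is None or row.get(column) != expected:
                return False
            row[column] = new_value
            return True

    def get(self, key: str) -> Optional[Dict[str, Any]]:
        """Return a copy of the row, or None if the key is absent."""
        with self._lock:
            row = self._rows.get(key)
            return dict(row) if row is not None else None


users = TinyTable()
first = users.insert_if_not_exists("alice", {"email": "a@example.com"})
second = users.insert_if_not_exists("alice", {"email": "other@example.com"})
changed = users.update_if("alice", "email", "a@example.com", "alice@example.com")
stale = users.update_if("alice", "email", "a@example.com", "x@example.com")
print(first, second, changed, stale)
```

The second insert and the final update are rejected because their conditions no longer hold, which is the point of the check: the test and the write happen as one step, so two clients racing to create the same row cannot both succeed.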
Cassandra 2.0 also improves response performance with "eager retries." If a given replica is slow to respond to a read request, Cassandra will send that request to other replicas if there's a chance the other replicas might respond prior to the request timeout. With version 2.0, Cassandra now handles the removal of stale index entries "lazily." In the past, stale entries were cleaned up immediately, which required a synchronization lock. The new technique avoids the throughput-constricting lock.
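The eager-retry idea can be sketched as a small deterministic model. The assumptions here are mine, not Cassandra's actual policy: the coordinator waits a fixed `retry_after_ms` before sending the request to one additional replica, and whichever response arrives first wins. The function name and parameters are hypothetical, and latencies are supplied as plain numbers rather than measured.

```python
from typing import List, Optional, Tuple


def read_with_eager_retry(
    latencies_ms: List[float],
    retry_after_ms: float,
    timeout_ms: float,
) -> Optional[Tuple[int, float]]:
    """Return (replica_index, elapsed_ms) for the first response, or None on timeout.

    latencies_ms[i] is how long replica i would take to answer once asked.
    """
    if not latencies_ms:
        return None

    # The first replica is asked at time 0.
    arrivals = [(0, latencies_ms[0])]

    # If the first replica has not answered by retry_after_ms and another
    # replica exists, ask that replica too; its answer arrives later by
    # exactly the time we waited before sending the retry.
    if latencies_ms[0] > retry_after_ms and len(latencies_ms) > 1:
        arrivals.append((1, retry_after_ms + latencies_ms[1]))

    replica, elapsed = min(arrivals, key=lambda arrival: arrival[1])
    return (replica, elapsed) if elapsed <= timeout_ms else None


fast = read_with_eager_retry([5, 8], retry_after_ms=20, timeout_ms=100)
rescued = read_with_eager_retry([400, 10], retry_after_ms=20, timeout_ms=100)
failed = read_with_eager_retry([400, 300], retry_after_ms=20, timeout_ms=100)
print(fast, rescued, failed)
```

In the second call the first replica would have blown the timeout on its own, but the retry to the second replica returns in 30 ms, so the read succeeds. In the third call neither replica can answer in time and the read still times out.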
While Cassandra is a complicated system, its symmetrical treatment of cluster nodes makes it surprisingly easy to get up and running. The SQL-like nature of CQL is a great benefit, making it quicker and easier for developers moving from RDBMS environments to become productive.
Nevertheless, the learning curve for Cassandra is significant. It's a good idea to set up a small to modest development cluster and do plenty of experimenting, particularly with your data schema and configuration parameters. Performance issues can become significant as the application scales up.
Platforms: CentOS, Red Hat, Debian, Ubuntu, Mac OS X, Windows
Cost: Free, open source under the Apache License version 2.0
This article, "Review: Cassandra lowers the bar to big data," was originally published at InfoWorld.com.