When should you use Cassandra?
Cassandra has turned out to be especially useful for a well-defined set of applications. You'll hear about these most often:
- Time series comes up a lot when discussing column family databases. Time series data can include anything from temperature sensor data to schedules to stock prices to signal processing telemetry to epidemiology records to blogs. This sort of data is where document databases such as MongoDB tend to fail.
- Product catalogs are sometimes powered by Cassandra, although obviously other databases play in this space. You might also use a document database or even a straight search engine such as Elasticsearch or Solr, both of which are based on Lucene (arguably sort of a document database itself). But people pick Cassandra for good reasons, chief among them its reliable architecture. If you can trade away immediate atomic consistency (less of a concern with product catalog data), you can achieve a higher level of system reliability, meaning you can contact any node and still get to the data. This is why high-scale media services such as Netflix, Hulu, and Sky in Europe use Cassandra for catalogs.
- Recommendations are another area, though once again other technologies are frequently in the mix. For example, Mahout (the machine-learning project running atop Hadoop) can be used with Cassandra. One of the reasons to use a column-family database for recommendations instead of a graph database is scale. With a graph database, you have to do a lot of work (generally, manual sharding or partitioning) to run it at a large scale. However, what you often need for a recommendation engine is not complex data or complex relationships, but very simple row-based data. Cassandra delivers.
- Fraud and spam detection is somewhat related to recommendations and often involves time series data. Again, this may also demand a machine-learning tool like Mahout. Spammers and fraudsters are often more motivated -- and faster -- than you, so you'd better have a system that adapts quickly and can handle increasing amounts of data. For a high-scale service like eBay or Eventbrite, you don't need superconsistent/atomic reads and writes. What you need is a whole lot of them!
- Back-end storage for messaging, especially across data centers (so-called WAN replication), is a big deal. Messaging wouldn't be a straightforward column-family case if it weren't for Cassandra's caching support. Column family plus cache plus WAN replication is a powerful force. This isn't for every messaging use case. But if you need to persist and read messages across multiple data centers at a massive scale in an active-active configuration, as the New York Times or Comcast does, you might want to give Cassandra a spin.
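To make the time series case above a little more concrete, here is a toy Python sketch of the wide-row layout that column-family stores are commonly used for: readings are grouped into a partition per sensor and per day, and kept sorted by timestamp inside that partition. This is an in-memory illustration only, not Cassandra's API. The class name, the method names, and the choice of one-day buckets are all assumptions made for the example.

```python
from bisect import insort, bisect_left, bisect_right
from collections import defaultdict
from datetime import datetime, timezone


class TimeSeriesTable:
    """Toy in-memory stand-in for a wide-row time series layout (hypothetical, not a Cassandra client)."""

    def __init__(self):
        # Partition key (sensor_id, day) -> list of (timestamp, value), kept sorted by timestamp.
        self._partitions = defaultdict(list)

    @staticmethod
    def _bucket(ts):
        # Assumption: bucket by calendar day so one partition does not grow without bound.
        return ts.strftime("%Y-%m-%d")

    def write(self, sensor_id, ts, value):
        # Insert in sorted position, so reads never need to sort.
        insort(self._partitions[(sensor_id, self._bucket(ts))], (ts, value))

    def read_range(self, sensor_id, day, start, end):
        """Return readings with start <= timestamp <= end from a single partition."""
        rows = self._partitions.get((sensor_id, day), [])
        timestamps = [ts for ts, _ in rows]
        lo = bisect_left(timestamps, start)
        hi = bisect_right(timestamps, end)
        return rows[lo:hi]


def utc(hour, minute=0):
    # Helper for the example: a fixed, arbitrary date in UTC.
    return datetime(2024, 1, 15, hour, minute, tzinfo=timezone.utc)


table = TimeSeriesTable()
table.write("sensor-1", utc(9, 30), 21.5)
table.write("sensor-1", utc(8, 0), 20.1)   # arrives out of order
table.write("sensor-1", utc(11, 0), 22.3)
table.write("sensor-2", utc(9, 0), 18.0)   # lands in a different partition

morning = table.read_range("sensor-1", "2024-01-15", utc(8), utc(10))
print([value for _, value in morning])  # prints [20.1, 21.5]
```

The point of the sketch is that a range query touches one partition and reads a contiguous, already-sorted slice, which is roughly why this kind of data tends to map well onto a column-family model.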
What about HBase?
HBase isn't a Cassandra replacement and Cassandra isn't merely better than HBase. They each have their strengths and weaknesses. If you already run a Hadoop-oriented shop and have extensive Hadoop expertise and infrastructure, HBase may be a more natural fit.
While Cassandra is in the Hadoop ecosystem, it rolls its own in a lot of places, many of which work to your advantage. HBase's modularity can also be read as complexity and configuration side effects, as a lot of knobs must turn at various layers that aren't always aware of each other. Cassandra is more monolithic, which in some ways means more thought out as a consistent design.