The crux of it comes down to your read-to-write ratio. Cassandra was designed for heavy workloads of both writes and reads, where millisecond-level consistency matters less than throughput. HBase is optimized for reads and stronger write consistency. To a large degree, Cassandra tends to be used for operational systems and HBase more for data warehouse and batch-system-type use cases. There are crossovers, exceptions, and places where it doesn't matter -- or where it's simply a matter of the default configuration being more conservative in one than the other.
Speaking of transactions ...
Developers often get confused about when atomic consistency is really needed. Starting from an RDBMS makes it worse, because RDBMSes require more operations across more places to get the same amount of work done. It isn't natural for a "person" to be broken into multiple tables just because people have a variable number of phone numbers and addresses -- yet that's exactly what a well-normalized RDBMS schema requires. Most other database types (document, column family, you name it) can handle a variable number of phone numbers or addresses in a single entity (document, table, rowkey, and so on).
Generally, any database can make an operation to a single entity consistent. Thus, many operations that require a "transaction" in an RDBMS are naturally atomic in other databases.
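To make that concrete, here's a minimal sketch of the single-entity idea using a plain Python dict as a stand-in for a document or column family store; the names (`save_person`, `load_person`) are hypothetical, not any particular database's API:

```python
import json

# Stand-in for a key-value / document store: one "person" entity,
# with all of its variable-length details, lives under a single key.
people = {}

def save_person(person_id, person):
    # Replacing the value under one key is a single-entity write:
    # readers see either the old document or the new one, never a
    # half-updated mix of several tables.
    people[person_id] = json.dumps(person)

def load_person(person_id):
    return json.loads(people[person_id])

save_person("p1", {
    "name": "Ada",
    "phones": ["555-0100", "555-0199"],   # variable number, no join table
    "addresses": [{"city": "London"}],
})

# Adding a phone number is a read-modify-write on one entity -- not an
# insert into a separate phone_numbers table wrapped in a transaction.
p = load_person("p1")
p["phones"].append("555-0123")
save_person("p1", p)

print(len(load_person("p1")["phones"]))  # → 3
```

The point isn't the dict, of course -- it's that when the whole "person" lives under one key, the operation that would need a multi-table transaction in an RDBMS is naturally atomic.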
In any modern system, you make compromises about consistency no matter which database you use. No RDBMS really offers long-running transactions lightweight enough for "Internet" or "Web scale" systems. Few people run with serializable isolation, open a connection per user, and hold transactions open across user think time -- which is what the strongest atomic consistency would require. Moreover, any multithreaded batch or analytical system must make some compromise on consistency.
In many cases, a millisecond (or 500) doesn't make much difference. If I change a couple of rowkeys, Cassandra will eventually make them consistent, but in the meantime I could get a stale read. How problematic is that for, say, Netflix? It probably won't happen. Even if it does, you're unlikely to notice. Even if you notice, you're unlikely to complain about a momentary glitch where a show appears without episodes, you reload the page, and suddenly they're there. You're far more likely to notice if every catalog change brings the system to a screeching halt. Down-to-the-millisecond consistency is less important than scale and performance, given the nature of the data and its use case.
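A toy sketch helps show why Cassandra calls this "tunable" consistency: with N replicas, you write to W of them and read from R of them, and if R + W > N the read set always overlaps the write set. This is a simulation with hypothetical names, not the Cassandra driver API:

```python
# Toy model of Cassandra-style tunable consistency: N replicas,
# write to W of them, read from R of them.
N = 3
replicas = [{"value": None, "ts": 0} for _ in range(N)]

def write(value, ts, w):
    # A quorum write only waits for w replicas; the rest catch up later.
    for r in replicas[:w]:
        r["value"] = value
        r["ts"] = ts

def read(r_count):
    # A quorum read asks r_count replicas and keeps the newest answer.
    answers = replicas[-r_count:]   # deliberately the "other" end of the ring
    newest = max(answers, key=lambda r: r["ts"])
    return newest["value"]

write("season 1: 10 episodes", ts=1, w=2)  # W = 2
print(read(2))   # R = 2; R + W = 4 > 3, so the read must see the write
print(read(1))   # R = 1; R + W = 3 = N, so this can return a stale value
```

With R + W > N, at least one replica in every read set saw the latest write, so stale reads vanish at the cost of waiting on more nodes -- exactly the throughput-versus-freshness dial the paragraph above describes.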
Choose your weapon
Why make such a trade-off when you could have perfection? Quite simply: because you have to. Any guarantee of consistency requires some kind of lock or reduction in concurrency. That cost might show up as fewer concurrent threads, as limits on how widely you can distribute the data across machines or disks, or in whether you can replicate it across a WAN. Forget the CAP theorem; consistency always trades off against concurrency.
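The lock-versus-concurrency trade-off shows up even inside a single process. A minimal Python sketch: to keep a shared counter correct under many writers, every increment must serialize on a lock, so the writers take turns instead of running truly in parallel.

```python
import threading

lock = threading.Lock()
count = 0

def increment(times):
    global count
    for _ in range(times):
        with lock:          # only one writer at a time: consistency costs concurrency
            count += 1

threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(count)  # → 40000: correct, because the lock serialized every increment
```

Drop the lock and the count can come up short under a race; keep it and you've traded away concurrency for a guarantee. Distributed databases face the same choice, just with network hops and WAN links instead of a mutex.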
Cassandra is a great tool for data sets where scale is more important than immediate consistency and you have a great deal of both reads and writes. While it has not received as much attention as other NoSQL databases and slipped into a quiet period a couple of years back, Cassandra is widely used and deployed, and it's a great fit for time series, product catalogs, recommendations, and other applications. If you have those sorts of problems at scale, Cassandra should do the trick.
This article, "Get to know Cassandra, the NoSQL maverick," was originally published at InfoWorld.com. Read more of Andrew Oliver's Strategic Developer blog at InfoWorld.com.