Bigtable-inspired open source projects take different routes to the highly scalable, highly flexible, distributed, wide-column data store
In this brave new world of big data, a database technology called "Bigtable" would seem to be worth considering -- particularly if that technology is the creation of engineers at Google, a company that should know a thing or two about managing large quantities of data. If you believe that, two Apache database projects -- Cassandra and HBase -- have you covered.
Bigtable was originally described in a 2006 Google research publication. Interestingly, that paper doesn't describe Bigtable as a database, but as a "sparse, distributed, persistent multidimensional map" designed to store petabytes of data and run on commodity hardware. Rows are uniquely indexed, and Bigtable uses the row keys to partition data for distribution around the cluster. Columns can be defined within rows on the fly, making Bigtable for the most part schema-less.
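That "sparse, distributed, persistent multidimensional map" can be pictured as a map from (row key, column, timestamp) to an uninterpreted value. The following Python sketch is purely illustrative of the data model described in the paper; the class and method names are invented here and are not any real Bigtable, HBase, or Cassandra API:

```python
# Conceptual sketch of Bigtable's data model: a sparse map from
# (row_key, column, timestamp) to an uninterpreted value.
# Illustrative only -- not a real Bigtable/HBase/Cassandra API.
class SparseTable:
    def __init__(self):
        self._cells = {}  # {(row_key, column, timestamp): value}

    def put(self, row_key, column, timestamp, value):
        # Columns need not be declared up front: writing a new
        # column name simply creates it, which is what makes the
        # store effectively schema-less.
        self._cells[(row_key, column, timestamp)] = value

    def get(self, row_key, column):
        # Return the most recent version of the cell, or None.
        versions = [(ts, v) for (r, c, ts), v in self._cells.items()
                    if r == row_key and c == column]
        return max(versions)[1] if versions else None

table = SparseTable()
table.put("com.cnn.www", "contents:html", 1, "<html>v1</html>")
table.put("com.cnn.www", "contents:html", 2, "<html>v2</html>")
print(table.get("com.cnn.www", "contents:html"))  # prints the latest version
```

Note that the row key ("com.cnn.www" in the Bigtable paper's own example) is what the store hashes or ranges over to decide which cluster node owns the row.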
Cassandra and HBase have borrowed much from the original Bigtable definition. In fact, whereas Cassandra descends from both Bigtable and Amazon's Dynamo, HBase describes itself as an "open source Bigtable implementation." As such, the two share many characteristics, but there are also important differences.
Born for big data
Both Cassandra and HBase are NoSQL databases, a term for which you can find numerous definitions. Generally, it means you cannot manipulate the database with SQL. However, Cassandra has implemented CQL (Cassandra Query Language), the syntax of which is obviously modeled after SQL.
Both are designed to manage extremely large data sets. HBase documentation proclaims that an HBase database should have hundreds of millions or -- even better -- billions of rows. Anything less, and you're advised to stick with an RDBMS.
Both are distributed databases, not only in how data is stored, but also in how the data can be accessed. Clients can connect to any node in the cluster and access any data.
Both claim near linear scalability. Need to manage twice the data? Then double the number of nodes in your cluster.
Both safeguard against data loss from cluster node failure via replication. A row written to the database is primarily the responsibility of a single cluster node (the row-to-node mapping being determined by whatever partitioning scheme you've employed). But the data is mirrored to other cluster members called replica nodes (the user-configurable replication factor specifies how many). If the primary node fails, its data can still be fetched from one of the replica nodes.
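The row-to-node mapping plus replication can be sketched as follows. This is a toy hash-based partitioner for illustration, not the actual algorithm used by Cassandra or HBase:

```python
import hashlib

def replica_nodes(row_key, nodes, replication_factor):
    """Toy partitioner: hash the row key to pick a primary node,
    then mirror the row to the next nodes around the ring.
    Illustrative only -- not Cassandra's or HBase's real scheme."""
    h = int(hashlib.md5(row_key.encode()).hexdigest(), 16)
    primary = h % len(nodes)
    # Primary plus (replication_factor - 1) replicas, walking the ring.
    return [nodes[(primary + i) % len(nodes)]
            for i in range(replication_factor)]

nodes = ["node-a", "node-b", "node-c", "node-d"]
owners = replica_nodes("user:1001", nodes, replication_factor=3)
# owners[0] is the primary; if it fails, the row can still be
# read from either of the two remaining replicas.
print(owners)
```

The key property is that the mapping is deterministic: any client can recompute which nodes hold a given row without consulting a central directory.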
| | Installation and setup (15.0%) | Ease of use (30.0%) | … | … | Overall Score (100%) |
| --- | --- | --- | --- | --- | --- |
| Apache Cassandra 2.0 | 8.0 | 8.0 | 7.0 | 8.0 | 9.0 |
| Apache HBase 0.94.12 | 7.0 | 7.0 | 7.0 | 7.0 | 9.0 |