Review: HBase is massively scalable -- and hugely complex
Apache HBase offers extreme scalability, reliability, and flexibility, but at the cost of many moving parts
Apache HBase describes itself as "the Hadoop database," which can be a bit confusing, as Hadoop is typically understood to refer to the popular MapReduce processing framework. But Hadoop is really an umbrella name for an entire ecosystem of technologies, some of which HBase uses to create a distributed, column-oriented database built on the same principles as Google's Bigtable. HBase does not use Hadoop's MapReduce capabilities directly, though HBase can integrate with Hadoop to serve as a source or destination of MapReduce jobs.
The hallmarks of HBase are extreme scalability, high reliability, and the schema flexibility you get from a column-oriented database. While tables and column families must be defined in advance, you can add new columns on the fly. HBase also offers strong row-level consistency, built-in versioning, and "coprocessors" that provide the equivalents of triggers and stored procedures.
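To make the schema model concrete, here is a minimal in-memory sketch in Python. It is not the HBase client API: `ToyTable`, its methods, and the `max_versions` parameter are hypothetical names invented for illustration. It shows only the behavior described above: column families are fixed when the table is created, new columns can appear on any write, and each cell keeps a bounded number of timestamped versions.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a column-family table (NOT the HBase API)."""

    def __init__(self, families, max_versions=3):
        self.families = set(families)     # fixed when the table is created
        self.max_versions = max_versions  # versions retained per cell (assumed default)
        # row key -> (family, qualifier) -> list of (timestamp, value), newest first
        self.rows = defaultdict(dict)

    def put(self, row, family, qualifier, value, ts):
        # The family must already exist; the column qualifier need not.
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        cell = self.rows[row].setdefault((family, qualifier), [])
        cell.append((ts, value))
        cell.sort(key=lambda tv: tv[0], reverse=True)
        del cell[self.max_versions:]      # drop versions beyond the limit

    def get(self, row, family, qualifier, versions=1):
        # Return up to `versions` values, newest first.
        cell = self.rows.get(row, {}).get((family, qualifier), [])
        return [value for _, value in cell[:versions]]


table = ToyTable(families=["info"])
table.put("user1", "info", "email", "a@example.com", ts=1)
table.put("user1", "info", "email", "b@example.com", ts=2)  # second version of the cell
table.put("user1", "info", "nickname", "al", ts=3)          # new column, no schema change
```

Asking for `versions=2` of the `email` cell returns both values, newest first, while a write to an undeclared family such as `stats` is rejected.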
Designed to support queries of massive data sets, HBase is optimized for read performance. For writes, HBase favors consistency. Unlike "eventually consistent" Cassandra, HBase does not offer tunable consistency levels (settings that let a write be acknowledged after a single node has written it, or only after a quorum of nodes has). Thus, the price of HBase's strong consistency is that writes can be slower.
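For readers unfamiliar with the tunable approach, the arithmetic behind it can be sketched in a few lines of Python. This is an illustration of the general idea, not Cassandra's implementation; the function names are hypothetical, and the level names `ONE`, `QUORUM`, and `ALL` are used here simply as labels for "one node," "a majority," and "every replica."

```python
def required_acks(level, replicas):
    """How many replicas must confirm a write before it is acknowledged."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replicas // 2 + 1  # strict majority of the replicas
    if level == "ALL":
        return replicas
    raise ValueError(f"unknown consistency level: {level}")


def write_acknowledged(level, replicas, acks_received):
    """True once enough replicas have confirmed the write."""
    return acks_received >= required_acks(level, replicas)
```

With three replicas, a write at `ONE` is acknowledged after a single confirmation, while `QUORUM` waits for two. The lower setting returns sooner but leaves a window in which other replicas are stale, which is the trade-off HBase declines to expose.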
HDFS -- the Hadoop Distributed File System -- is the Hadoop ecosystem's foundation, and it's the file system atop which HBase resides. Designed to run on commodity hardware and tolerate member node failures, HDFS works best for batch processing systems that prefer streamed access to large data sets. This seems to make it inappropriate for the random access one would expect in database systems like HBase. But HBase takes steps to compensate for HDFS's otherwise incongruous behavior.
ZooKeeper, another Hadoop technology (though no longer used by current versions of the Hadoop MapReduce engine), is a distributed communication and coordination service. ZooKeeper maintains a synchronized, in-memory data structure that can be accessed by multiple clients. The data structure is organized like a file system, except that its components (znodes) can hold data and also serve as parent nodes in the hierarchical tree. Imagine a file system whose files can also be directories.
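The "files that are also directories" idea is easy to see in a toy model. The Python sketch below is a single-process illustration, not the ZooKeeper client API: `ZNode`, `ToyZTree`, and the example paths and payloads are hypothetical, and nothing here is synchronized or distributed.

```python
class ZNode:
    """Toy znode: holds its own data and may also have children."""

    def __init__(self, data=b""):
        self.data = data
        self.children = {}  # child name -> ZNode


class ToyZTree:
    """Single-process stand-in for a znode tree (NOT the ZooKeeper API)."""

    def __init__(self):
        self.root = ZNode()

    def _walk(self, path):
        # Follow each path component down from the root; KeyError if one is missing.
        node = self.root
        for part in [p for p in path.split("/") if p]:
            node = node.children[part]
        return node

    def create(self, path, data=b""):
        # The parent must already exist; the new node must not.
        parent_path, _, name = path.rstrip("/").rpartition("/")
        parent = self._walk(parent_path)
        if name in parent.children:
            raise ValueError(f"node already exists: {path}")
        parent.children[name] = ZNode(data)

    def get(self, path):
        return self._walk(path).data

    def get_children(self, path):
        return sorted(self._walk(path).children)


tree = ToyZTree()
tree.create("/hbase", b"cluster-config")      # a node that carries data...
tree.create("/hbase/master", b"host1")        # ...and also has children
tree.create("/hbase/rs", b"")
```

Here `/hbase` answers both kinds of question: `get` returns its payload, and `get_children` lists the nodes beneath it.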