Review: HBase is massively scalable -- and hugely complex
Apache HBase offers extreme scalability, reliability, and flexibility, but at the cost of many moving parts
You can run HBase atop a native file system for development purposes, but a deployed HBase cluster runs on HDFS, which -- as mentioned earlier -- seems like a poor playground for HBase. Despite the streaming-oriented underlying file system, HBase achieves fast random I/O. It accomplishes this magic through a combination of batching writes in memory and persisting data to disk using log-structured merge trees. As a result, all random writes are performed in memory, and when data is flushed to disk, it is first sorted, then written sequentially with an accompanying index. Random reads are first attempted in memory, as mentioned above. If the requested data is not in memory, the subsequent disk search is speedy because the data is sorted and indexed.
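The write/read path just described can be sketched in a few lines of plain Ruby. This is purely illustrative -- not HBase's actual implementation -- but it shows the shape of the technique: random writes land in an in-memory "memtable," a flush sorts the batch into an immutable run, and reads check memory before binary-searching the sorted runs.

```ruby
# Illustrative sketch of a log-structured merge write/read path (not HBase code).
class TinyLsm
  def initialize
    @memtable = {}  # random writes are absorbed here, in memory
    @runs = []      # each flush appends one sorted array of [key, value] pairs
  end

  def put(key, value)
    @memtable[key] = value
  end

  def flush
    return if @memtable.empty?
    @runs << @memtable.sort_by { |k, _| k }  # sort before writing "to disk"
    @memtable = {}
  end

  def get(key)
    return @memtable[key] if @memtable.key?(key)  # reads try memory first
    @runs.reverse_each do |run|                   # newest run takes precedence
      pair = run.bsearch { |k, _| key <=> k }     # fast: the run is sorted
      return pair[1] if pair
    end
    nil
  end
end

store = TinyLsm.new
store.put('row2', 'b')
store.put('row1', 'a')
store.flush                 # writes a sorted run
store.put('row1', 'a2')     # newer value stays in memory
store.get('row1')           # served from the memtable
store.get('row2')           # binary search over the flushed run
```

Because each flushed run is sorted, a lookup that misses memory costs only a binary search per run rather than a scan -- the same reason HBase's on-disk reads stay speedy.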
Working with HBase
HDFS was designed on the principle that it is easier to move computation (as in a MapReduce operation) close to the data being processed than it is to move the data close to the computation. As a result, it is not in HDFS's nature to ensure that related pieces of data (say, rows in a database) are co-located. This means it's possible that a block whose data is managed by a particular RegionServer will not be stored on the same physical host as that RegionServer. However, HDFS provides mechanisms that advertise block location and -- more important -- perform block relocation upon request. HBase uses these mechanisms to move blocks so that they are local to their owning RegionServer.
While HBase does not support transactions, neither is it eventually consistent; rather, HBase supports strong consistency, at least at the level of a single row. HBase has no sense of data types; everything is stored as an array of bytes. However, HBase does define a special "counter" datatype, which provides for an atomic increment operation -- useful for counting views of a Web page, for example. You can increment any number of counters within a single row via a single call, and without having to lock the row. Note that counters will be synchronized for write operations (multiple writes will always execute consistent increments) but not necessarily for read operations.
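The counter semantics described above can be sketched in plain Ruby. This is an illustrative model, not HBase code: several counters in the same row advance atomically in one call, so concurrent writers never lose increments, while reads remain unsynchronized, mirroring the caveat above. A Mutex stands in for the atomicity HBase provides internally.

```ruby
# Illustrative sketch of per-row counter semantics (not HBase code).
class CounterRow
  def initialize
    @counters = Hash.new(0)
    @lock = Mutex.new  # stands in for HBase's internal atomicity
  end

  # Apply several increments to one row in a single atomic call,
  # e.g. row.increment('views' => 1, 'clicks' => 2)
  def increment(deltas)
    @lock.synchronize do
      deltas.each { |column, delta| @counters[column] += delta }
    end
  end

  def read(column)
    @counters[column]  # reads are not synchronized, mirroring the caveat above
  end
end

row = CounterRow.new
threads = 4.times.map do
  Thread.new { 1000.times { row.increment('views' => 1, 'clicks' => 2) } }
end
threads.each(&:join)
row.read('views')   # => 4000 -- no increments lost to races
```

The single atomic call covering multiple columns is the useful property: a page-view tracker can bump 'views' and 'clicks' together without ever locking the row from the caller's point of view.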
The HBase shell is actually a modified, interactive Ruby shell running in JRuby, with Ruby executing in a Java VM. Anything you can do in the interactive Ruby shell you can do in the HBase shell, which means the HBase shell can be a powerful scripting environment.
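Because the shell accepts arbitrary Ruby, repetitive tasks can be scripted inline. The snippet below is plain Ruby of the kind you could type at the shell prompt -- here just generating a batch of row keys programmatically. The table and column names in the trailing comment are hypothetical, and the put call itself would need a live shell session.

```ruby
# Plain Ruby, typed as-is at the HBase shell prompt or run standalone:
# build a batch of zero-padded row keys programmatically.
row_keys = (1..5).map { |i| format('user-%04d', i) }
row_keys.first   # => "user-0001"

# In a live shell session you might then issue puts against a table variable
# (hypothetical table and column names):
#   row_keys.each { |rk| myTable.put rk, 'info:active', 'true' }
```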
The latest version of the shell provides a sort of object-oriented interface for manipulating HBase tables. You can, for instance, assign a table to a JRuby variable, then invoke methods on the table object using standard dot notation. For example, if you've defined a table and assigned it to the myTable variable, you could write (put) data to the table with something like:

myTable.put '<row>', '<col>', '<v>'

This would write the value <v> into the row <row> at column <col>.
There are some third-party management GUIs for HBase, such as hbase-explorer. HBase itself includes some built-in Web-based monitoring tools. An HBase master node serves a Web interface on port 60010. Browse to it, and you'll find information about the master node itself including start time, the current ZooKeeper port, a list of region servers, the average number of regions per region server, and so on. A list of tables is also provided. Click on a table and you're shown information such as the region servers that are hosting the table's components. This page also provides controls for initiating a compaction on the table or splitting the table's regions.