Review: HBase is massively scalable -- and hugely complex
Apache HBase offers extreme scalability, reliability, and flexibility, but at the cost of many moving parts
In addition, each region server node runs a monitoring Web interface at port 60030. Here you'll find lots of metrics: read and write latencies, for example, broken down into various percentiles. You can also see information about the regions managed by this region server, and you can generate a dump of the active threads on the server.
The HBase reference guide includes a Getting Started guide and an FAQ. It's a live document, so you'll find user community comments attached to each entry. The HBase website also provides links to the HBase Java API, as well as to videos and off-site sources of HBase information. More information can be found in the HBase wiki. While good, the HBase documentation is not quite on par with documentation I've seen on other database product sites, such as those for Cassandra and MongoDB. Nevertheless, there's plenty of material around the Internet, and the HBase community is large and active enough that HBase questions rarely go unanswered for long.
One of HBase's more interesting recent additions is support for "coprocessors" -- user code that executes as part of the HBase RegionServer and Master processes. There are roughly two kinds of coprocessors: observers and endpoints. An observer is a user-written Java class that defines methods to be invoked when certain HBase events occur. Think of an observer as the HBase counterpart to the RDBMS trigger. One observer, called a RegionObserver, can hook specific points in the flow of control of data manipulation operations like
The HBase endpoint coprocessor works much like a stored procedure. Once loaded, it can be invoked from an observer, for example, which makes it possible to add new features to HBase dynamically. There are various ways to load coprocessors into an HBase cluster, including via the HBase shell.
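To make the trigger analogy concrete, here is a minimal sketch of the observer idea in plain Java. It is a conceptual model only: WriteObserver, ToyRegion, and AuditObserver are hypothetical names invented for illustration and are not the HBase coprocessor API, whose real interfaces and method signatures are documented in the HBase Java API docs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Conceptual model only: every type here is hypothetical and is NOT the HBase API.
class ObserverSketch {
    // Hooks invoked around a write, in the spirit of an RDBMS trigger.
    interface WriteObserver {
        // Called before the write is applied; returning false vetoes it.
        boolean beforeWrite(String rowKey, String value);
        // Called after the write has been applied.
        void afterWrite(String rowKey, String value);
    }

    // A toy stand-in for a region: a sorted map plus its registered observers.
    static class ToyRegion {
        private final TreeMap<String, String> rows = new TreeMap<>();
        private final List<WriteObserver> observers = new ArrayList<>();

        void register(WriteObserver observer) {
            observers.add(observer);
        }

        // Runs the "before" hooks, applies the write, then runs the "after" hooks.
        boolean put(String rowKey, String value) {
            for (WriteObserver o : observers) {
                if (!o.beforeWrite(rowKey, value)) {
                    return false; // an observer rejected the write
                }
            }
            rows.put(rowKey, value);
            for (WriteObserver o : observers) {
                o.afterWrite(rowKey, value);
            }
            return true;
        }

        String get(String rowKey) {
            return rows.get(rowKey);
        }
    }

    // Example observer: rejects empty values and logs the keys of accepted writes.
    static class AuditObserver implements WriteObserver {
        final List<String> log = new ArrayList<>();

        public boolean beforeWrite(String rowKey, String value) {
            return value != null && !value.isEmpty();
        }

        public void afterWrite(String rowKey, String value) {
            log.add(rowKey);
        }
    }
}
```

In this model, registering an AuditObserver on a ToyRegion means a put with an empty value is refused before it reaches the stored rows, while accepted writes are recorded afterward. A real observer hooks into the server in a comparable before-and-after fashion, but runs inside the RegionServer process rather than in client code.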
Configuring a large HBase cluster can be difficult. An HBase cluster includes master nodes, RegionServer processes, HDFS processes, and an entire ZooKeeper cluster running side by side. Clearly, troubleshooting a failure can be a complex undertaking, as there are numerous moving parts to examine.
HBase is very much a developer-centric database. Its online reference guide is heavily linked into HBase's Java API docs. If you want to understand the role played by a particular HBase entity -- say, a Filter -- be prepared to be handed off to the Java API's documentation of the Filter class for a full explanation.
Given that access is by row and that rows are indexed by row keys, it follows that careful design of row key structure is critical for good performance. Ironically, programmers in the good old days of ISAM (Indexed Sequential Access Method) databases knew this well: Database access was all about the components -- and the ordering of those components -- in compound-key indexes.
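The effect of component ordering in a compound key can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not HBase client code: a TreeMap stands in for a table sorted by row key, and the names RowKeySketch, rowKey, and scanCustomer, the "#" separator, and the customer/order data are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

// Illustrative only: a sorted in-memory map stands in for a table ordered by row key.
// None of these names come from the HBase API.
class RowKeySketch {
    // Compound key: customer ID first, then a zero-padded sequence number,
    // so that string order matches numeric order within one customer.
    // Assumes customer IDs never contain the '#' separator.
    static String rowKey(String customerId, long orderSeq) {
        return customerId + "#" + String.format(Locale.ROOT, "%010d", orderSeq);
    }

    // Range scan over every key that starts with the customer's prefix.
    static List<String> scanCustomer(TreeMap<String, String> table, String customerId) {
        String start = customerId + "#";
        String stop = customerId + "$"; // '$' is the character immediately after '#'
        return new ArrayList<>(table.subMap(start, true, stop, false).values());
    }

    // A small table with rows inserted out of order; the map keeps them sorted by key.
    static TreeMap<String, String> sampleTable() {
        TreeMap<String, String> table = new TreeMap<>();
        table.put(rowKey("alice", 2), "order-A2");
        table.put(rowKey("bob", 1), "order-B1");
        table.put(rowKey("alice", 10), "order-A10");
        table.put(rowKey("alice", 1), "order-A1");
        return table;
    }

    public static void main(String[] args) {
        // Prints [order-A1, order-A2, order-A10]
        System.out.println(scanCustomer(sampleTable(), "alice"));
    }
}
```

Because the customer ID leads the key, all of one customer's rows sit next to each other and a single range scan retrieves them in sequence order. Reversing the components (sequence number first) would scatter each customer's rows across the key space and force a full scan for the same query, which is the kind of trade-off the ISAM programmers had to weigh.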
HBase employs a collection of battle-tested technologies from the Hadoop world, and it's well worth consideration when building a large, scalable, highly available, distributed database, particularly for those applications where strong consistency is important.
| Platforms | Requires Java SE version 6; can be run on Windows using Cygwin |
| Cost | Free, open source under the Apache License version 2.0 |
This article, "Review: HBase is massively scalable -- and hugely complex," was originally published at InfoWorld.com.