Berkeley DB adds XML smarts

Sleepycat’s Berkeley DB XML database library combines the sturdy Berkeley DB engine with XML document management

Berkeley DB XML is a database library built on the venerable Berkeley DB engine. Sleepycat engineers erected a layer atop Berkeley DB, extending that engine and creating a new one that provides XML document storage, management, and querying.

As a result, Berkeley DB XML inherits transaction protection, multiple database access, deadlock detection, encryption, database sizes up to 256TB, and more from Berkeley DB. In addition, a single application employing the Berkeley DB XML engine can simultaneously access and freely mix XML databases and “normal” Berkeley DB databases.

I downloaded version 1.2 of Berkeley DB XML and explored the package’s Java personality. Berkeley DB XML provides bindings for a number of popular languages: C/C++, Java, Perl, Tcl, and Python (as of Python 2.3). It comes with all the JAR (Java Archive) files and native DLLs needed to build a complete DB XML application, plus API documentation.

Finally — did I forget to mention? — Berkeley DB XML is an open-source product, so you get all the source code for the engine.

Containers and Queries

Berkeley DB XML stores everything in an abstract entity called a “container,” which is analogous to an RDBMS’s database. “Everything” in DB XML’s case is synonymous with “XML documents” because a document is the engine’s atom of persistence; you cannot store or otherwise manipulate pieces of documents. Behind the scenes, Berkeley DB XML converts each document to a string and stores each string as an individual record in the underlying Berkeley DB database.

Defining containers and adding documents is reasonably simple. From a Java programming perspective, the learning curve is small: once you have worked out which classes map to which entities within the engine and coded the proper initialization steps, you’re doing serious work.

Berkeley DB XML queries use the XPath 1.0 XML query language standard. The call into the query subsystem takes not only the XPath query itself but also the query’s context, which consists of the namespace, result type, query variables, and a flag indicating whether the query is “eager” or “lazy.” Eager queries assemble the entire result set before returning. Lazy queries don’t complete the query processing until code steps through the result set. Lazy evaluation is useful when the result set is large and the caller is unlikely to examine all of it.
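To make the eager/lazy distinction concrete, here is a minimal Java sketch. It uses the JDK’s own javax.xml.xpath classes rather than the Berkeley DB XML API, so every class and method name in it (EagerLazySketch, firstItemAbove, the $min variable, the sample documents) is an assumption made for illustration. It shows an XPath 1.0 query evaluated with a query variable and an explicit result type, run once eagerly over a list of documents and once lazily through an iterator.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Illustration only: plain JDK XPath, not the Berkeley DB XML classes.
class EagerLazySketch {
    // Counts documents actually evaluated, to make the laziness visible.
    static int evaluated = 0;

    // One XPath 1.0 query against one document. The "context" here is the
    // query variable $min and the requested result type (STRING).
    static String firstItemAbove(String xml, double min) {
        evaluated++;
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setXPathVariableResolver(
            name -> "min".equals(name.getLocalPart()) ? Double.valueOf(min) : null);
        try {
            // STRING yields the text of the first matching node, or "" if none.
            return (String) xpath.evaluate("/stock/item[@price > $min]",
                new InputSource(new StringReader(xml)), XPathConstants.STRING);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Eager: every document is evaluated before the caller sees anything.
    static List<String> eager(List<String> docs, double min) {
        List<String> results = new ArrayList<>();
        for (String xml : docs) {
            results.add(firstItemAbove(xml, min));
        }
        return results;
    }

    // Lazy: a document is evaluated only when the caller asks for the next result.
    static Iterator<String> lazy(List<String> docs, double min) {
        Iterator<String> source = docs.iterator();
        return new Iterator<String>() {
            public boolean hasNext() { return source.hasNext(); }
            public String next() { return firstItemAbove(source.next(), min); }
        };
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "<stock><item price='5'>bolt</item><item price='40'>drill</item></stock>",
            "<stock><item price='75'>saw</item></stock>");
        System.out.println(eager(docs, 10));
        evaluated = 0;
        Iterator<String> results = lazy(docs, 10);
        System.out.println(results.next() + " (documents evaluated so far: " + evaluated + ")");
    }
}
```

A caller that stops after the first lazy result pays for one evaluation; the eager version pays for all of them regardless.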

Although database storage is completely document-based, queries return either whole documents or pieces of documents. The latter query result can be difficult to untangle if the structure of your XML documents and the nature of the query return multiple pieces from within multiple documents.

Luckily, the Berkeley DB XML documentation suggests an iterative query tactic to avoid this: Program the first query to return a set of matching documents, then iterate through that set, re-issuing the query on individual documents, and examine the returned elements.
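The two-pass tactic can be sketched in a few lines of Java. Again, this is plain JDK XPath over an in-memory map standing in for a container; the class name, the map, and the method queryByDocument are hypothetical and are not part of the Berkeley DB XML API. The first pass asks only whether each document matches, and the second re-issues the same query against each matching document so that the returned pieces stay grouped by the document they came from.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustration only: a Map of name -> XML text stands in for a container.
class IterativeQuerySketch {
    static Object eval(String xpathQuery, String xml, QName resultType) {
        try {
            return XPathFactory.newInstance().newXPath()
                .evaluate(xpathQuery, new InputSource(new StringReader(xml)), resultType);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    static Map<String, List<String>> queryByDocument(Map<String, String> container,
                                                     String xpathQuery) {
        // Pass 1: collect the names of the documents that match at all.
        List<String> matching = new ArrayList<>();
        for (Map.Entry<String, String> doc : container.entrySet()) {
            if ((Boolean) eval(xpathQuery, doc.getValue(), XPathConstants.BOOLEAN)) {
                matching.add(doc.getKey());
            }
        }
        // Pass 2: re-issue the query on each matching document, one at a time,
        // so every returned element is tied to a known document.
        Map<String, List<String>> byDocument = new LinkedHashMap<>();
        for (String name : matching) {
            NodeList nodes =
                (NodeList) eval(xpathQuery, container.get(name), XPathConstants.NODESET);
            List<String> pieces = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                pieces.add(nodes.item(i).getTextContent());
            }
            byDocument.put(name, pieces);
        }
        return byDocument;
    }
}
```

The result is a map from document name to the elements found in that document, which avoids the untangling problem of one flat list of fragments.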

You can accelerate queries by defining indexes. Berkeley DB XML has a flexible indexing scheme that lets you create indexes for elements or for “edges” (paths to elements, rather than the elements themselves) and define the index structure so that it’s optimal for the expected queries.

The engine’s query system maintains index statistics and performs cost-based analysis for query optimization. Often-repeated queries can be precompiled for even greater performance.
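The idea behind precompiling an often-repeated query is the same one the JDK exposes through XPath.compile: parse the expression once, then evaluate the compiled form many times. The sketch below uses that JDK facility as an analogy only; it says nothing about how Berkeley DB XML compiles or optimizes queries, and the class name and sample query are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Illustration only: JDK XPath compilation, used here as an analogy.
class PrecompiledQuerySketch {
    static List<Double> orderTotals(List<String> orders) {
        try {
            // Parse the expression once...
            XPathExpression total = XPathFactory.newInstance().newXPath()
                .compile("sum(/order/line/@amount)");
            // ...then reuse the compiled form for every document.
            List<Double> totals = new ArrayList<>();
            for (String xml : orders) {
                totals.add((Double) total.evaluate(
                    new InputSource(new StringReader(xml)), XPathConstants.NUMBER));
            }
            return totals;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }
}
```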

Meta Features

Because Berkeley DB XML manages documents, you would expect it to allow you to attach information that isn’t in the document’s content. Berkeley DB XML meets those expectations by allowing you to attach metadata in a clever way that leaves the document’s content unmolested.

When you define metadata for a document, the engine “reflects” that metadata as attributes into the document’s root element: From the query’s perspective, the metadata name/value pairs are simply XPath-queryable tag attributes.

But that perspective is an illusion — the document’s contents are unchanged. The attributes have actually been “snuck in” by the engine for the benefit of the query, so you can search for attributes without having to use a special syntax.
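One way to picture the effect is the following Java sketch. It is not how the engine is implemented (the article does not describe the mechanism) and it does not use the Berkeley DB XML API; the class and method names are hypothetical. It builds a temporary parsed view of the stored document, adds the metadata pairs to the root element of that view, and runs the query against the view, while the stored text itself is never altered.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Illustration only: models the visible effect, not the engine's mechanism.
class MetadataReflectionSketch {
    static boolean matches(String storedXml, Map<String, String> metadata, String xpathQuery) {
        try {
            // A throwaway parsed copy; storedXml itself is an immutable String.
            Document view = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(storedXml)));
            // "Reflect" each metadata pair as an attribute on the root element.
            for (Map.Entry<String, String> pair : metadata.entrySet()) {
                view.getDocumentElement().setAttribute(pair.getKey(), pair.getValue());
            }
            // The query sees ordinary attributes and needs no special syntax.
            return (Boolean) XPathFactory.newInstance().newXPath()
                .evaluate(xpathQuery, view, XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For example, a stored memo with metadata author=pat would match the query /memo[@author='pat'] even though the stored text contains no author attribute.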

Berkeley DB XML’s processing of a whole XML document as a single unit does create some side effects in the way documents are accessed. As you might imagine, you cannot delete portions of a document; you can only delete the whole thing. Consequently, modifying or deleting part of an XML document is really an update operation, and an update can only be done by reading the old document, modifying it in memory, deleting its image from the container, then re-storing the updated version.
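The four steps of such an update (read, modify in memory, delete, re-store) look roughly like this in Java. The sketch uses a HashMap as a stand-in container and standard JDK XML classes; none of the names below come from the Berkeley DB XML API, and the helper assumes the XPath expression selects an element.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustration only: a HashMap of name -> XML text stands in for a container.
class WholeDocumentUpdateSketch {
    static final Map<String, String> container = new HashMap<>();

    // "Deleting part of a document" expressed as a whole-document update.
    static void removeElement(String name, String xpathToElement) {
        try {
            // 1. Read the old document.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(container.get(name))));
            // 2. Modify it in memory.
            Node target = (Node) XPathFactory.newInstance().newXPath()
                .evaluate(xpathToElement, doc, XPathConstants.NODE);
            if (target != null && target.getParentNode() != null) {
                target.getParentNode().removeChild(target);
            }
            StringWriter updated = new StringWriter();
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            serializer.transform(new DOMSource(doc), new StreamResult(updated));
            // 3. Delete the old image, then 4. store the updated version.
            container.remove(name);
            container.put(name, updated.toString());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```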

Happily, Berkeley DB XML provides an update method that does all this dirty work for you invisibly. But if your application employs transactions or locks, you have to keep in mind that lock granularity is at the document level. It’s not possible, for instance, to lock an element within an XML document. This could affect performance if you craft an app such that a lot of locking is going on and users hold each other up.
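The consequence of document-level lock granularity can be shown with a small Java model. This is a sketch built on java.util.concurrent, not Berkeley DB’s lock manager, and the names (DocumentLockSketch, lockFor, the element paths) are assumptions for illustration. The lock is keyed by document name alone, so two writers touching different elements of the same document still contend, while writers on different documents do not.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Illustration only: models document-level granularity with plain Java locks.
class DocumentLockSketch {
    private static final ConcurrentHashMap<String, ReentrantLock> LOCKS =
        new ConcurrentHashMap<>();

    // The element path is deliberately ignored: one lock per whole document.
    static ReentrantLock lockFor(String documentName, String elementPath) {
        return LOCKS.computeIfAbsent(documentName, key -> new ReentrantLock());
    }

    // Reports whether a second thread could take the lock right now.
    static boolean canLockFromAnotherThread(String documentName, String elementPath) {
        boolean[] acquired = {false};
        Thread other = new Thread(() -> {
            ReentrantLock lock = lockFor(documentName, elementPath);
            if (lock.tryLock()) {
                acquired[0] = true;
                lock.unlock();
            }
        });
        other.start();
        try {
            other.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return acquired[0];
    }

    public static void main(String[] args) {
        ReentrantLock held = lockFor("order-1", "/order/line[1]");
        held.lock();
        try {
            // Same document, different element: blocked by the document lock.
            System.out.println(canLockFromAnotherThread("order-1", "/order/line[2]"));
            // Different document: free to proceed.
            System.out.println(canLockFromAnotherThread("order-2", "/order/line[1]"));
        } finally {
            held.unlock();
        }
    }
}
```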

Unsung Magic

Possibly the greatest benefit of Berkeley DB XML is its masking of the Berkeley DB system complexity so that a programmer can easily add XML database capabilities to an application. Berkeley DB XML “pre-tweaks” Berkeley DB parameters for you, so you can go straight on to programming your app.

But while it hides these details, it does not make them unreachable. If you want to crawl under the hood and retune some of Berkeley DB’s parameters, you can. In fact, because all the source code is provided, if you want to crawl under the hood and reverse-engineer the entire engine, you can do that, too.

Berkeley DB XML’s real power is its foundation: The Berkeley DB system is fast and rock-solid. Even better, all the extensions available to Berkeley DB are instantly available to a Berkeley DB XML application. With free availability on a single-site installation, plenty of examples, and source code, how can you go wrong?

The only thing I missed in Berkeley DB XML was some sort of query console so that I could easily experiment with XPath queries and view the results. A Sleepycat engineer told me that, in the next release, they are providing a written sample that would incorporate many of the features of a query console. I can’t wait.

InfoWorld Scorecard
Berkeley DB XML
Value (10.0%): 9.0
Performance (20.0%): 9.0
Ease of use (20.0%): 8.0
Setup (15.0%): 8.0
Implementation (15.0%): 9.0
Scalability (20.0%): 8.0
Overall Score (100%): 8.5