Berkeley DB adds XML smarts
Sleepycat’s Berkeley DB XML database library combines the sturdy Berkeley DB engine with XML document management
Berkeley DB XML is a database library built on the venerable Berkeley DB engine. Sleepycat engineers erected a layer atop Berkeley DB, extending that engine and creating a new one that provides XML document storage, management, and querying.
As a result, Berkeley DB XML inherits transaction protection, multiple database access, deadlock detection, encryption, database sizes up to 256TB, and more from Berkeley DB. In addition, a single application employing the Berkeley DB XML engine can simultaneously access and freely mix XML databases and “normal” Berkeley DB databases.
I downloaded Version 1.2 of Berkeley DB XML and explored the package’s Java personality. Berkeley DB XML provides bindings for a number of popular languages: C/C++, Java, Perl, Tcl, and Python (as of Python 2.3). It comes with all the JAR (Java Archive) files and native DLLs needed to build a complete DB XML application, plus API documentation.
Finally — did I forget to mention? — Berkeley DB XML is an open-source product, so you get all the source code for the engine.
Containers and Queries
Berkeley DB XML stores everything in an abstract entity called a “container,” which is analogous to an RDBMS’s database. “Everything” in DB XML’s case is synonymous with “XML documents” because a document is the engine’s atom of persistence; you cannot store or otherwise manipulate pieces of documents. Behind the scenes, Berkeley DB XML converts each document to a string and stores each string as an individual record in the underlying Berkeley DB database.
Defining containers and adding documents is reasonably simple. From a Java programming perspective, you need only surmount the small learning curve of deducing which classes map to which entities within the engine and coding the proper initialization steps before you’re doing serious work.
Berkeley DB XML queries use the XPath 1.0 XML query language standard. The call into the query subsystem takes not only the XPath query itself but also the query’s context, which consists of the namespace, result type, query variables, and a flag indicating whether the query is “eager” or “lazy.” Eager queries assemble the entire result set before returning. Lazy queries don’t complete the query processing until code steps through the result set, which is useful when the result set is large and the caller is unlikely to examine all of it.
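To make the eager/lazy distinction concrete, here is a minimal Java sketch. It does not use the Berkeley DB XML API: it assumes a container can be modeled as a plain list of XML strings and evaluates XPath 1.0 expressions with the JDK's standard javax.xml.xpath package. The class name EagerVsLazyQuery, its methods, and the sample documents are all hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical illustration only; this is NOT the Berkeley DB XML API.
final class EagerVsLazyQuery {
    // Counts how many documents were actually evaluated.
    static int evaluations = 0;

    // True if the XPath 1.0 query selects at least one node in the document.
    static boolean matches(String xml, String query) {
        evaluations++;
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (Boolean) xpath.evaluate(
                query, new InputSource(new StringReader(xml)), XPathConstants.BOOLEAN);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad query or document", e);
        }
    }

    // Eager: evaluate every document and build the full result set up front.
    static List<String> eager(List<String> docs, String query) {
        List<String> results = new ArrayList<>();
        for (String doc : docs) {
            if (matches(doc, query)) {
                results.add(doc);
            }
        }
        return results;
    }

    // Lazy: evaluate documents only as the caller steps through the results.
    static Iterator<String> lazy(List<String> docs, String query) {
        final Iterator<String> source = docs.iterator();
        return new Iterator<String>() {
            private String pending; // next match, if already found

            @Override
            public boolean hasNext() {
                while (pending == null && source.hasNext()) {
                    String candidate = source.next();
                    if (matches(candidate, query)) {
                        pending = candidate;
                    }
                }
                return pending != null;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String result = pending;
                pending = null;
                return result;
            }
        };
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "<book><title>A</title><price>30</price></book>",
            "<book><title>B</title><price>10</price></book>",
            "<book><title>C</title><price>45</price></book>");
        String query = "/book[price > 20]";

        evaluations = 0;
        List<String> all = eager(docs, query);
        System.out.println("eager: " + all.size() + " results, "
            + evaluations + " documents evaluated");

        evaluations = 0;
        Iterator<String> results = lazy(docs, query);
        results.next(); // the caller looks at the first result only
        System.out.println("lazy: 1 result consumed, "
            + evaluations + " documents evaluated");
    }
}
```

In this sketch the eager call touches every document before it returns, while the lazy iterator touches only as many documents as it needs to produce the results the caller actually asks for.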
Although database storage is completely document-based, queries return either whole documents or pieces of documents. The latter query result can be difficult to untangle if the structure of your XML documents and the nature of the query return multiple pieces from within multiple documents.
Luckily, the Berkeley DB XML documentation suggests an iterative query tactic to avoid this: Program the first query to return a set of matching documents, then iterate through that set, re-issuing the query on individual documents, and examine the returned elements.
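The two-pass tactic can be sketched roughly as follows. This is an illustration under stated assumptions rather than the documentation's own code: the container is modeled as a map from document name to XML string, XPath evaluation comes from the JDK's javax.xml.xpath package instead of the Berkeley DB XML engine, and the class name IterativeQuery, the document names, and the sample data are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; this is NOT the Berkeley DB XML API.
final class IterativeQuery {
    // Runs an XPath 1.0 query against a single document and returns the node set.
    static NodeList select(String xml, String query) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (NodeList) xpath.evaluate(
                query, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad query or document", e);
        }
    }

    // Returns the matched element text, grouped by the document it came from.
    static Map<String, List<String>> queryByDocument(
            Map<String, String> container, String query) {
        // Pass 1: find which documents contain at least one match.
        List<String> matchingDocs = new ArrayList<>();
        for (Map.Entry<String, String> entry : container.entrySet()) {
            if (select(entry.getValue(), query).getLength() > 0) {
                matchingDocs.add(entry.getKey());
            }
        }
        // Pass 2: re-issue the query on each matching document, so every
        // returned element stays associated with its source document.
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String name : matchingDocs) {
            NodeList nodes = select(container.get(name), query);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            grouped.put(name, values);
        }
        return grouped;
    }

    public static void main(String[] args) {
        Map<String, String> container = new LinkedHashMap<>();
        container.put("order-1.xml",
            "<order><item qty='2'><name>pen</name></item>"
            + "<item qty='1'><name>ink</name></item>"
            + "<item qty='5'><name>pad</name></item></order>");
        container.put("order-2.xml",
            "<order><item qty='1'><name>cap</name></item></order>");
        container.put("order-3.xml",
            "<order><item qty='3'><name>nib</name></item></order>");

        System.out.println(queryByDocument(container, "/order/item[@qty > 1]/name"));
    }
}
```

Because each inner query runs against exactly one document, the caller never has to work out which document a returned fragment belongs to.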
You can accelerate queries by defining indexes. Berkeley DB XML has a flexible indexing scheme that lets you create indexes for elements or for “edges” (paths to elements, rather than the elements themselves) and define the index structure so that it’s optimal for the expected queries.
The engine’s query system maintains index statistics and performs cost-based analysis for query optimization. Often-repeated queries can be precompiled for even greater performance.
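As a rough picture of why an element index speeds up queries, the sketch below keeps a map from element value to the names of the documents that contain it, so an equality lookup avoids scanning every document. It is a conceptual illustration only and makes no claim about how Berkeley DB XML implements or declares its indexes. The class ElementValueIndex, its methods, and the sample documents are hypothetical, and XPath evaluation again comes from the JDK's javax.xml.xpath package.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; this is NOT how the engine stores indexes.
final class ElementValueIndex {
    private final String elementPath; // XPath selecting the indexed element
    private final Map<String, Set<String>> docsByValue = new TreeMap<>();

    ElementValueIndex(String elementPath) {
        this.elementPath = elementPath;
    }

    // Called when a document is added: record each indexed value it contains.
    void add(String docName, String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                elementPath, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                String value = nodes.item(i).getTextContent();
                docsByValue.computeIfAbsent(value, k -> new TreeSet<>()).add(docName);
            }
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad path or document", e);
        }
    }

    // Equality lookup: one map access instead of a scan over all documents.
    Set<String> lookup(String value) {
        return docsByValue.getOrDefault(value, Collections.emptySet());
    }

    public static void main(String[] args) {
        ElementValueIndex index = new ElementValueIndex("/book/author");
        index.add("a.xml", "<book><author>Knuth</author></book>");
        index.add("b.xml", "<book><author>Gray</author></book>");
        index.add("c.xml", "<book><author>Knuth</author><author>Gray</author></book>");

        System.out.println(index.lookup("Knuth"));
        System.out.println(index.lookup("Codd"));
    }
}
```

The trade-off is the usual one: the index costs extra work each time a document is added, in exchange for cheaper lookups on the indexed element.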