April 09, 2004

Berkeley DB adds XML smarts

Sleepycat’s Berkeley DB XML database library combines the sturdy Berkeley DB engine with XML document management

Berkeley DB XML is a database library built on the venerable Berkeley DB engine. Sleepycat engineers erected a layer atop Berkeley DB, extending that engine and creating a new one that provides XML document storage, management, and querying.

As a result, Berkeley DB XML inherits transaction protection, multiple database access, deadlock detection, encryption, database sizes up to 256TB, and more from Berkeley DB. In addition, a single application employing the Berkeley DB XML engine can simultaneously access and freely mix XML databases and “normal” Berkeley DB databases.

I downloaded Version 1.2 of Berkeley DB XML and explored the package’s Java personality. Berkeley DB XML provides bindings for a number of popular languages: C/C++, Java, Perl, Tcl, and Python (as of Python 2.3). It comes with all the JAR (Java Archive) files and native DLLs needed to build a complete DB XML application, plus API documentation.

Finally — did I forget to mention? — Berkeley DB XML is an open-source product, so you get all the source code for the engine.

Containers and Queries

Berkeley DB XML stores everything in an abstract entity called a “container,” which is analogous to an RDBMS’s database. “Everything” in DB XML’s case is synonymous with “XML documents” because a document is the engine’s atom of persistence; you cannot store or otherwise manipulate pieces of documents. Behind the scenes, Berkeley DB XML converts each document to a string and stores each string as an individual record in the underlying Berkeley DB database.
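To make the "one document, one record" idea concrete, here is a minimal sketch in plain Java. It does not use the Berkeley DB XML API; `ToyContainer` and its methods are hypothetical names, and an in-memory map stands in for the underlying Berkeley DB database. The only point is that each document is checked for well-formedness and then stored and retrieved whole, as a string.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

/** Hypothetical in-memory stand-in for a container; not the Berkeley DB XML API. */
public class ToyContainer {
    // One record per document: key = document name, value = the document text.
    private final Map<String, String> records = new LinkedHashMap<>();

    /** Stores a whole document as one string record, rejecting malformed XML. */
    public void putDocument(String name, String xml) throws Exception {
        // Parse only to check well-formedness; the stored form stays a string.
        DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        records.put(name, xml);
    }

    /** Returns the whole document; there is no call that stores or fetches a fragment. */
    public String getDocument(String name) {
        return records.get(name);
    }

    /** Number of documents (records) currently held. */
    public int size() {
        return records.size();
    }

    public static void main(String[] args) throws Exception {
        ToyContainer container = new ToyContainer();
        container.putDocument("order1", "<order id=\"1\"><item>disk</item></order>");
        System.out.println(container.getDocument("order1"));
    }
}
```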

Defining containers and adding documents is reasonably simple. From a Java programming perspective, you need only surmount a small learning curve: working out which classes map to which entities within the engine, and coding the proper initialization steps. After that, you’re doing serious work.

Berkeley DB XML queries use the XPath 1.0 XML query language standard. The call into the query subsystem takes not only the XPath query itself but also the query’s context, which consists of the namespace, result type, query variables, and a flag indicating whether the query is “eager” or “lazy.” Eager queries assemble the entire result set before returning. Lazy queries don’t complete the query processing until code steps through the result set. Lazy queries are useful when the result set is large and it’s likely that the caller won’t examine all of it.
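The eager/lazy distinction can be sketched with the JDK's own XPath 1.0 support (`javax.xml.xpath`) rather than the Berkeley DB XML classes. Everything named here (`EagerVsLazy`, `nameIfPriced`, the `$min` variable, the sample `item` documents) is an assumption made for illustration. The variable resolver plays the role of the query variables in the context, and a counter shows how many documents were actually evaluated in each mode.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

/** Illustration only: JDK XPath over a list of strings, not the Berkeley DB XML API. */
public class EagerVsLazy {
    // Counts how many documents have actually been run through the query.
    public static final AtomicInteger evaluated = new AtomicInteger();

    /** Returns the item's name if its price is at least $min, else an empty string. */
    public static String nameIfPriced(String xml, double min) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the query variable $min, much as a query context would.
        xpath.setXPathVariableResolver(
                name -> "min".equals(name.getLocalPart()) ? Double.valueOf(min) : null);
        evaluated.incrementAndGet();
        try {
            return xpath.evaluate("string(/item[price >= $min]/name)",
                    new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Lazy: nothing is evaluated until the caller pulls results from the stream. */
    public static Stream<String> lazy(List<String> docs, double min) {
        return docs.stream().map(d -> nameIfPriced(d, min)).filter(s -> !s.isEmpty());
    }

    /** Eager: the whole result set is assembled before returning. */
    public static List<String> eager(List<String> docs, double min) {
        return lazy(docs, min).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "<item><name>disk</name><price>80</price></item>",
                "<item><name>cable</name><price>5</price></item>",
                "<item><name>cpu</name><price>200</price></item>");
        System.out.println(eager(docs, 50) + " after " + evaluated.get() + " evaluations");
        evaluated.set(0);
        System.out.println(lazy(docs, 50).findFirst().orElse("")
                + " after " + evaluated.get() + " evaluations");
    }
}
```

A caller that only wants the first hit stops the lazy version early, which is the saving the lazy flag is meant to provide on large result sets.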

Although database storage is completely document-based, queries return either whole documents or pieces of documents. The latter query result can be difficult to untangle if the structure of your XML documents and the nature of the query return multiple pieces from within multiple documents.

Luckily, the Berkeley DB XML documentation suggests an iterative query tactic to avoid this: Program the first query to return a set of matching documents, then iterate through that set, re-issuing the query on individual documents, and examine the returned elements.
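A rough sketch of that two-pass tactic, again using the JDK's XPath classes on an in-memory map instead of a real container. The class name `IterativeQuery`, the `run` method, and the map-of-strings "container" are hypothetical. The first pass keeps only documents that match the query; the second pass re-issues the query against each matching document, so every returned element is tied to its source document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Illustration only: two-pass querying with JDK XPath, not the Berkeley DB XML API. */
public class IterativeQuery {

    /** Maps each matching document's name to the text of the elements the query selects. */
    public static Map<String, List<String>> run(Map<String, String> container, String query)
            throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Pass 1: find the documents that contain at least one match.
        Map<String, Document> matching = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : container.entrySet()) {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(entry.getValue())));
            Boolean hit = (Boolean) xpath.evaluate(
                    "boolean(" + query + ")", doc, XPathConstants.BOOLEAN);
            if (hit) {
                matching.put(entry.getKey(), doc);
            }
        }

        // Pass 2: re-issue the query per document and collect its elements.
        Map<String, List<String>> results = new LinkedHashMap<>();
        for (Map.Entry<String, Document> entry : matching.entrySet()) {
            NodeList nodes = (NodeList) xpath.evaluate(
                    query, entry.getValue(), XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            results.put(entry.getKey(), values);
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> container = new LinkedHashMap<>();
        container.put("a", "<book><author>Ann</author><author>Bob</author></book>");
        container.put("b", "<book><title>Untitled</title></book>");
        System.out.println(run(container, "//author"));
    }
}
```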

You can accelerate queries by defining indexes, and Berkeley DB XML has a flexible indexing scheme that lets you create indexes for elements (or “edges”, which are paths to elements, rather than the elements themselves) and define the index structure so that it’s optimal for the expected queries.
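As a toy illustration of why an edge index can be more selective than an element index, the sketch below builds an inverted index from parent/child edges to document names. It is an assumption-laden model, not Berkeley DB XML's index format or API: `EdgeIndex`, `add`, and `lookup` are hypothetical names, and the edge key is simply the parent tag joined to the child tag.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Illustration only: a toy edge index, not Berkeley DB XML's indexing scheme. */
public class EdgeIndex {
    // Edge key "parent/child" -> names of the documents containing that edge.
    private final Map<String, Set<String>> index = new TreeMap<>();

    /** Parses a document and records every parent/child element edge it contains. */
    public void add(String docName, String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        walk(docName, doc.getDocumentElement());
    }

    // Depth-first walk over element children, indexing one edge per child.
    private void walk(String docName, Element parent) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                Element child = (Element) n;
                String edge = parent.getTagName() + "/" + child.getTagName();
                index.computeIfAbsent(edge, k -> new TreeSet<>()).add(docName);
                walk(docName, child);
            }
        }
    }

    /** Returns the documents that contain the given edge, without scanning any document. */
    public Set<String> lookup(String edge) {
        return index.getOrDefault(edge, Collections.<String>emptySet());
    }

    public static void main(String[] args) throws Exception {
        EdgeIndex idx = new EdgeIndex();
        idx.add("d1", "<order><item><name>disk</name></item></order>");
        idx.add("d2", "<catalog><item><name>cable</name></item></catalog>");
        System.out.println(idx.lookup("order/item"));
        System.out.println(idx.lookup("item/name"));
    }
}
```

In this model, an index keyed on the element `item` alone would list both documents, while the edge `order/item` narrows the candidates to the one document where `item` sits under `order`.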

The engine’s query system maintains index statistics and performs cost-based analysis for query optimization. Often-repeated queries can be precompiled for even greater performance.
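The benefit of precompiling can be shown with the JDK's `XPathExpression`, which parses a query once and then evaluates it against any number of documents. This is an analogy using standard Java classes, not the Berkeley DB XML call for precompiled queries; `PrecompiledQuery` and `run` are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

/** Illustration only: compile an XPath query once, then reuse it. */
public class PrecompiledQuery {
    // The parsed form of the query, built a single time in the constructor.
    private final XPathExpression compiled;

    public PrecompiledQuery(String query) throws XPathExpressionException {
        this.compiled = XPathFactory.newInstance().newXPath().compile(query);
    }

    /** Evaluates the already-compiled query against one document and returns a string. */
    public String run(String xml) throws XPathExpressionException {
        return compiled.evaluate(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws XPathExpressionException {
        PrecompiledQuery itemCount = new PrecompiledQuery("count(//item)");
        // The parse cost is paid once; each call below only evaluates.
        System.out.println(itemCount.run("<order><item/><item/></order>"));
        System.out.println(itemCount.run("<order/>"));
    }
}
```

Note that the JDK documents `XPathExpression` as not thread-safe, so a shared precompiled query of this kind needs one instance per thread or external synchronization.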

Meta Features

Test Center Scorecard
Criterion weights: 20%, 20%, 20%, 15%, 15%, 10%
Berkeley DB XML scores: 8, 9, 8, 9, 8, 9
Overall score: 8.5 (Very Good)


©1994-2009 Infoworld, Inc.