XML databases evolve

Open source Apache Xindice, Berkeley DB XML set solid base for content management

Attention in the wide universe of databases and content management has been drawn lately to XML and, specifically, XML databases. You’ll get a good indication of the state of XML-based content management technology by examining developments at the ground floor: the XML database libraries that form a base for larger content management applications.

Two such libraries are the targets of this review: the Apache Software Foundation’s Xindice and Sleepycat’s Berkeley DB XML. Both are open source, both are free (although the nature of “free” differs between them), and both provide standards-compliant XML document manipulation. In addition, both are powerful developer tools that place eye-opening XML document storage, query, and retrieval capabilities into the hands of eager programmers.

Apache Xindice 1.0
Apache Xindice began as the dbXML Core project, whose source code was donated to the Apache Software Foundation in December 2001. Xindice’s documentation makes no bones about its intended audience: It will be of interest only to developers in need of a solution for storing and manipulating XML data.

Likewise, the Xindice Web site is clear about the package’s limitations; unlike Berkeley DB XML, Xindice does not deal well with large XML documents. Small-to-moderate documents are best for Xindice, although there’s no precise definition of a “small-to-moderate” XML document -- a megabyte or smaller is probably in the ballpark.

Installation is simple and deposits on your system the Xindice server executable, a command-line tool, documentation, source, and a number of examples. Xindice is written entirely in Java, so you’ll need JDK 1.3 or later installed to run the Xindice JAR (Java Archive) file.

The programming interface -- the XML:DB API -- is Java as well, but Xindice does not limit itself to the Java language. It is built on a client-server architecture and exposes an XML-RPC interface, so remote Java clients can access the server, as can clients written in other programming languages.

Xindice arranges its storage in the form of “collections,” and all collections exist within a root instance, “/db.” Think of collections as subfolders in file systems; collections contain “subcollections” to an arbitrary depth. The “files” in this analogy are the actual XML documents. Querying and updating are typically applied collectionwide, although you can adjust the granularity to manipulate individual documents.
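
The folder analogy can be made concrete with a small sketch. This is not the Xindice API: the `Collection` class, its method names, and the sample address-book documents are all hypothetical, and Python’s standard-library ElementTree (which understands only a subset of XPath) stands in for the query engine. The sketch shows how one path expression can be applied collectionwide, descending into subcollections.

```python
import xml.etree.ElementTree as ET


class Collection:
    """Hypothetical in-memory stand-in for a collection; not the Xindice API."""

    def __init__(self, name):
        self.name = name
        self.subcollections = {}   # child name -> Collection
        self.documents = {}        # document key -> parsed root element

    def child(self, name):
        """Return the named subcollection, creating it on first use."""
        if name not in self.subcollections:
            self.subcollections[name] = Collection(name)
        return self.subcollections[name]

    def add_document(self, key, xml_text):
        """Parse the XML text and file it under the given key."""
        self.documents[key] = ET.fromstring(xml_text)

    def query(self, path):
        """Apply one path expression to every document here and below."""
        hits = []
        for key in sorted(self.documents):
            for node in self.documents[key].findall(path):
                hits.append((self.name + "/" + key, node.text))
        for name in sorted(self.subcollections):
            hits.extend(self.subcollections[name].query(path))
        return hits


root = Collection("/db")
book = root.child("addressbook")
book.add_document(
    "a1", "<person><name>Ann</name><phone type='work'>555-0100</phone></person>")
book.add_document(
    "b2", "<person><name>Bob</name><phone type='home'>555-0199</phone></person>")

# The same expression runs against every document in the hierarchy.
all_names = root.query("./name")
work_phones = root.query("./phone[@type='work']")
print(all_names)
print(work_phones)
```

Narrowing the granularity to a single document would amount to running the same expression against one entry of `documents` rather than the whole tree.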

Command-line control
Xindice’s command-line tool is a godsend for new users. Experimenting with it provides an excellent introduction to Xindice’s capabilities and will give you a good feel for the programming API when it’s time to turn your attention to development. The command-line tool is also useful for jump-starting your database. The tool creates new collections, feeds XML documents into the collections, and even feeds whole subdirectory hierarchies into Xindice (in which case the subfolders appear in the database as subcollections).

Xindice uses XPath for querying collections and XUpdate for updating them. It would be nice if XQuery were supported, as it provides for much richer querying, but for now XQuery support is an entry on the Xindice team’s to-do list. The command-line tool is a great way to test out XPath and XUpdate expressions, but as of this writing the documentation for it is incomplete and leads one to erroneously conclude that XUpdate is not supported.

A number of sample Java programs are buried in an examples subfolder, with run scripts thoughtfully provided. A rather large Addressbook Web application is also included, although you must have an installation of Tomcat to run it. Here, as with the Xindice documentation, everything is a bit rough around the edges, and you must be willing to work your way through some mazes to avoid the occasional blind alley.

On the security front, you can password-protect a Xindice database. The server is also thread safe, so multiple clients can connect without worry. However, there is no transaction support built into Xindice; transactions are an optional service in the XML:DB API and may be added to the server in the future.

Xindice is an Apache project, so it progresses at a speed governed by the enthusiasm of its participants. In some cases this is remarkably prompt. But the process is inherently somewhat stochastic, so there are no guarantees concerning when important modifications or additions (such as handling larger XML files) will be made. What I’ve seen so far, however, will have me keeping a hopeful eye on the project.

Sleepycat Berkeley DB XML 2.0
Sleepycat recently released Version 2.0 of its DB XML database (see our review of an earlier edition at infoworld.com/1529). Berkeley DB XML sits on top of the venerable Berkeley DB database and inherits Berkeley DB’s transaction support, crash recovery, deadlock detection, encryption, and other features. In fact, you can freely intermix DB XML databases and “ordinary” Berkeley DB databases in the same application without having to link additional libraries into that application.

Berkeley DB XML is an open source tool, although there are licensing restrictions that vary depending on how you use and distribute applications built from the tool (details available at sleepycat.com).

Unlike Xindice, DB XML is not a client/server system; it is a library that you link into -- and that runs in the process space of -- your application. Bindings are available for numerous languages, including Java, C++, Perl, Python, Tcl (Tool Command Language), and PHP (PHP: Hypertext Preprocessor). There are also several third-party bindings available for other languages.

Much of what’s in the new 2.0 release is the direct result of user feedback. The preceding release handled documents as single entities, imparting an upper limit on the size of the document that DB XML could handle (that upper limit was typically set by available memory, and any XML document exceeding that limit was probably a good candidate for factoring). In this release, DB XML allows you to store documents either wholesale (as before) or per node -- carved up, if you will.

When you choose per-node storage, documents are taken apart and their individual nodes are stored in separate records in the database. Consequently, available disk space is the only real upper limit on the size of a document handled. The Berkeley DB system can deal with databases as large as 256TB, a ceiling that few people will ever hit.
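
To illustrate the idea of per-node storage, here is a minimal sketch that splits a document into one record per element. It does not reflect DB XML’s actual on-disk format: the record layout, the integer node IDs, and the function names `store_per_node` and `children_of` are assumptions made for illustration, and a plain dictionary stands in for the underlying database.

```python
import xml.etree.ElementTree as ET


def store_per_node(xml_text):
    """Split a document into one record per element, keyed by a node ID."""
    records = {}
    next_id = 0

    def walk(element, parent_id):
        nonlocal next_id
        node_id = next_id
        next_id += 1
        records[node_id] = {
            "parent": parent_id,                  # None for the root element
            "tag": element.tag,
            "text": (element.text or "").strip(),
        }
        for child in element:
            walk(child, node_id)

    walk(ET.fromstring(xml_text), None)
    return records


def children_of(records, node_id):
    """Return the records whose parent is the given node."""
    return [r for r in records.values() if r["parent"] == node_id]


doc = ("<catalog><item><title>Alpha</title></item>"
       "<item><title>Beta</title></item></catalog>")
records = store_per_node(doc)
print(len(records), "records")
print([r["tag"] for r in children_of(records, 0)])
```

Because each element is its own record, a query or update can fetch just the records it needs. A real store would also keep a key on the parent field so that `children_of` need not scan every record, as this sketch does.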

Document options
DB XML groups documents into containers, its counterpart to Xindice’s collections. You choose whole-document or per-node storage for a given container; all documents in that container are stored the same way.

Whole-document storage is best if your documents are reasonably small (measuring 1MB or less), and you must process each document intact. Also, a document retrieved from whole-document storage is byte-for-byte identical to the document that was placed in storage. That’s important if you want to be able to verify that the content of the document has not been tampered with -- for example, if you’ve added a digital signature to the document.
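
The byte-for-byte point is easy to demonstrate. In the sketch below, a SHA-256 digest stands in for a full digital signature, a dictionary stands in for whole-document storage, and the sample order document is invented for illustration. The digest of the retrieved bytes matches the recorded one, while a copy that merely drops whitespace between elements no longer matches.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of the raw document bytes as hex."""
    return hashlib.sha256(data).hexdigest()


original = (b'<?xml version="1.0"?>\n'
            b'<order id="7">\n  <total>19.95</total>\n</order>\n')

# Whole-document storage keeps the raw bytes exactly as submitted.
store = {"order-7": original}
recorded = digest(original)

retrieved = store["order-7"]
intact = digest(retrieved) == recorded

# A re-serialized copy carries the same elements and values but different bytes.
reserialized = b'<?xml version="1.0"?><order id="7"><total>19.95</total></order>'
changed = digest(reserialized) != recorded

print("retrieved copy verifies:", intact)
print("re-serialized copy differs:", changed)
```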

Per-node storage provides faster queries and updates, because the entire document need not be read in to be processed. And, as already stated, it allows you to manage extremely large XML documents.

DB XML 2.0 also has a new command-line tool. Like Xindice’s command-line tool, it’s the perfect way to familiarize yourself with the database’s capabilities. The commands accepted by the tool have a one-to-one correspondence with the product’s API. 

Sleepycat was in the process of finishing a tutorial for the command-line tool at the time of my review. I saw an early version that was already polished enough to be useful, and can say that the tutorial promises to be a worthy guide to the neophyte DB XML user.

DB XML 2.0 supports XPath and XUpdate as well as the more expressive XQuery. As with Xindice, you can use the command-line tool to familiarize yourself with the syntax of these query and update dialects. And, like Xindice, DB XML 2.0 provides numerous examples to work through and explore.

Quite a pair
Both Apache Xindice and Sleepycat Berkeley DB XML allow you to attach indexes to your databases for the purpose of speeding queries. DB XML, however, gives you greater control over the index type, and thereby allows you to fine-tune an index for the sorts of queries likely to take place.

In addition, the DB XML command-line tool will return the amount of time taken by a query, so you can experiment with different index types and query strategies to optimize performance.
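
The same experiment can be sketched outside either product. The code below uses neither library: the document set, the `city` field, and the helpers `scan`, `lookup`, and `timed` are all hypothetical, with a linear scan standing in for an unindexed query and a dictionary standing in for an equality index. It times both strategies against the same lookup and confirms that they return the same result.

```python
import time

# Hypothetical data set: 50,000 small documents with a "city" field.
documents = [{"id": i, "city": "city-%d" % (i % 1000)} for i in range(50_000)]


def scan(city):
    """Unindexed query: examine every document."""
    return [d["id"] for d in documents if d["city"] == city]


# Equality index on the city field: value -> list of matching document IDs.
index = {}
for d in documents:
    index.setdefault(d["city"], []).append(d["id"])


def lookup(city):
    """Indexed query: a single dictionary probe."""
    return index.get(city, [])


def timed(query, arg):
    """Run a query and return its result with the elapsed seconds."""
    start = time.perf_counter()
    result = query(arg)
    return result, time.perf_counter() - start


scan_result, scan_seconds = timed(scan, "city-42")
index_result, index_seconds = timed(lookup, "city-42")
print("scan:  %d hits in %.6f s" % (len(scan_result), scan_seconds))
print("index: %d hits in %.6f s" % (len(index_result), index_seconds))
```

The absolute timings depend on the machine; what the comparison shows is the habit of measuring each candidate index against the queries you actually expect to run.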

Xindice and DB XML 2.0 are top-notch database libraries, although DB XML provides a greater range of features and is polished to a more impressive sheen. Nevertheless, I expect to see the Xindice project’s feature list lengthen over time. Improvements in Xindice will only benefit the wide and growing XML database community.

InfoWorld Scorecard
Criterion (weight) | Apache Xindice 1.0 | Sleepycat Berkeley DB XML 2.0
Interoperability (20%) | 9.0 | 10.0
Documentation (10%) | 7.0 | 9.0
Scalability (20%) | 9.0 | 9.0
Value (10%) | 9.0 | 10.0
Setup (20%) | 9.0 | 9.0
Performance (20%) | 8.0 | 9.0
Overall Score (100%) | 8.6 | 9.3
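
The overall scores are consistent with a weighted average of the six category scores. The short sketch below reproduces that arithmetic from the weights and scores in the scorecard; the variable and function names are illustrative, and the weighted-average formula is an inference from the figures, not a published InfoWorld method.

```python
# Category weights in percent, as listed in the scorecard.
weights = {
    "Interoperability": 20,
    "Documentation": 10,
    "Scalability": 20,
    "Value": 10,
    "Setup": 20,
    "Performance": 20,
}

scores = {
    "Apache Xindice 1.0": {
        "Interoperability": 9.0, "Documentation": 7.0, "Scalability": 9.0,
        "Value": 9.0, "Setup": 9.0, "Performance": 8.0,
    },
    "Sleepycat Berkeley DB XML 2.0": {
        "Interoperability": 10.0, "Documentation": 9.0, "Scalability": 9.0,
        "Value": 10.0, "Setup": 9.0, "Performance": 9.0,
    },
}


def overall(product_scores):
    """Weighted average of the category scores, rounded to one decimal."""
    total = sum(weights[c] * product_scores[c] for c in weights)
    return round(total / 100, 1)


xindice_overall = overall(scores["Apache Xindice 1.0"])
dbxml_overall = overall(scores["Sleepycat Berkeley DB XML 2.0"])
print(xindice_overall, dbxml_overall)
```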