Managing your content with XML

Daisy and TeXtML CMSes take differing, yet successful, tacks

Content is the lifeblood of any organization that relies on information. If documents are lost in file cabinets or hidden away on hard drives, the knowledge they carry is buried. But when content is organized and searchable, that information lives on. It does useful work over and over again as it is referenced, consulted, and combined with other information.

The two CMSes (content management systems) in this review create organized and searchable repositories of digital documents. At first glance, the two products appear similar, and, fundamentally, they are. Both, for example, make extensive use of XML. Closer inspection, however, reveals that each is designed for somewhat different uses of content.

Daisy is an open source CMS whose strength is its flexible organization and navigation capabilities. Ixiasoft’s TeXtML is a commercial CMS that takes a more straightforward approach to content organization, but excels at text search.

Daisy 1.3
Daisy is an exceptionally modular system; its designers purposely decoupled its internal organs for greater flexibility. Among those pieces is the back-end database, MySQL. There’s the repository server, which manages the storage and retrieval of documents. The OpenJMS Java messaging service informs applications of updates to the repository. Finally, the Daisy wiki front end provides dynamic, Web-based repository view and access.

The database back end and the repository server are Daisy’s core components. The OpenJMS service is more ancillary: It passes status events to apps that request notification of changes in the repository’s content or structure.
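The notification role described above follows the familiar publish/subscribe pattern. The following is a minimal in-process sketch of that pattern only; it does not use JMS or OpenJMS, and the `EventBus` class and the event names are hypothetical stand-ins, not anything Daisy defines.

```python
from collections import defaultdict
from typing import Callable


class EventBus:
    """Toy publish/subscribe hub (hypothetical; not the OpenJMS API)."""

    def __init__(self):
        # event type -> list of callbacks that asked to be notified
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type: str, callback: Callable) -> None:
        self._subscribers[event_type].append(callback)

    def publish(self, event_type: str, payload: dict) -> None:
        # Deliver the event to every callback registered for this type.
        for callback in self._subscribers[event_type]:
            callback(payload)


received = []
bus = EventBus()
bus.subscribe("document-updated", received.append)

bus.publish("document-updated", {"id": 7, "version": 2})
bus.publish("document-deleted", {"id": 9})  # nobody subscribed, nothing delivered
```

An application that never subscribes simply hears nothing, which is why the messaging service can be treated as ancillary to the database and repository server.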

Strictly speaking, Daisy’s wiki component is merely an example of a front end for the repository. Daisy’s creators describe Daisy as a “content management framework,” precisely because it could be used to support other front ends.

Mind you, Daisywiki is not simply a sample application; it’s a fully functioning wiki, complete with a built-in editor, versioning, search pages, PDF publishing, and more.

I installed Daisy on my test system and, with the exception of a problem with Internet Explorer 5.0, I had it running within a half-hour. The installation constructs a small wiki-based Web site populated with an initial Welcome page. The installation includes all the tools for adding new documents, editing existing ones, adding and managing users, and so on. Because all the site’s pages are built from documents in a Daisy repository, Daisywiki is an excellent mechanism for exploring how Daisy works.

Inside Daisy
The internal structure of Daisy’s repository is unusual in that there is none. There are no folders or sub-folders, no collections — just a container in which documents float about like the meat and potatoes of a digital stew. All is not anarchy, though.

First, documents themselves are structured, being composed of parts and fields. A part carries binary data of a specific MIME type (RTF information or image data, for example), and a field carries simple data (such as a numeric value, a date, or a string). The structure and allowed content of a document’s parts and fields are defined by the document’s type (which is specified in yet another document). So, all documents within a repository must adhere to one of the defined document types. You can define as many document types as your imagination permits.
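To make the parts-and-fields idea concrete, here is a small sketch of a typed document model. It is an illustration under stated assumptions, not Daisy's actual schema: the class names, the `validate` helper, and the "Picture" type with its `ImageData` part are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentType:
    """Hypothetical stand-in for a document type definition."""
    name: str
    part_mime_types: dict  # part name -> required MIME type
    field_types: dict      # field name -> required Python type


@dataclass
class Document:
    name: str
    doc_type: DocumentType
    parts: dict = field(default_factory=dict)   # part name -> (MIME type, bytes)
    fields: dict = field(default_factory=dict)  # field name -> simple value


def validate(doc: Document) -> list:
    """Return a list of problems; an empty list means the document conforms."""
    problems = []
    for part_name, (mime, _data) in doc.parts.items():
        expected = doc.doc_type.part_mime_types.get(part_name)
        if expected is None:
            problems.append(f"unknown part: {part_name}")
        elif mime != expected:
            problems.append(f"part {part_name}: expected {expected}, got {mime}")
    for field_name, value in doc.fields.items():
        expected_type = doc.doc_type.field_types.get(field_name)
        if expected_type is None:
            problems.append(f"unknown field: {field_name}")
        elif not isinstance(value, expected_type):
            problems.append(f"field {field_name}: expected {expected_type.__name__}")
    return problems


picture_type = DocumentType("Picture", {"ImageData": "image/png"}, {"PictureContent": str})
good = Document("Harbor", picture_type,
                {"ImageData": ("image/png", b"\x89PNG")}, {"PictureContent": "boat"})
bad = Document("Broken", picture_type,
               {"ImageData": ("text/rtf", b"{}")}, {"PictureContent": 42})
```

Here `validate(good)` returns an empty list, while `validate(bad)` reports one problem for the part's MIME type and one for the field's value type.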

Second, a repository includes one or more “navigation documents,” each an XML-based specification that defines how users navigate through the repository. Having more than one navigation document effectively lets you define multiple repository views. Behind the scenes, navigation documents work their magic by performing a query on the repository. So, for example, one navigation document might arrange the contents by modification date; another, by title.

The Daisy API is a combination of HTTP and XML. In other words, you send commands to the Daisy repository via HTTP, and those commands are in the form of XML embedded in the HTTP request. Hence, you can control Daisy through just about any scripting language that can “talk” HTTP; you can even handcraft commands by typing in the proper URL. If, however, you’d rather work with a more robust API to the repository, Daisy provides a Java wrapper around the HTTP/XML interface.
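The general shape of an XML-over-HTTP command can be sketched with nothing but a standard library. The example below only builds a request and never sends it. The host name, the endpoint path, and the XML element names are all placeholders chosen for illustration; Daisy's real URLs and message formats are not given in this review and should be taken from its documentation.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_request(base_url: str, path: str, command: ET.Element) -> urllib.request.Request:
    """Wrap an XML command in an HTTP POST request (built, not sent)."""
    body = ET.tostring(command, encoding="utf-8")  # serialize the XML to bytes
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


# Hypothetical command payload; these element names are not Daisy's.
command = ET.Element("exampleCommand")
ET.SubElement(command, "name").text = "Welcome"

request = build_request("http://repository.example.invalid", "/hypothetical/endpoint", command)
```

Passing `request` to `urllib.request.urlopen` would actually transmit it. Any language with an HTTP client and an XML serializer can do the equivalent, which is the point of the design.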

The DQL (Daisy Query Language) is obviously derived from SQL. A query is a “select” clause, adorned with modifiers for filtering and ordering the results. Whereas in SQL those filters amount to comparisons on column values, in DQL the comparisons are performed on document fields. So, for example, to search for documents in the repository with a PictureContent field equal to “boat,” you would enter the following Daisy query: “select id, name where #PictureContent = ‘boat’”. This returns the ID number and name of each matching document.
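The semantics of that example query can be mimicked against a handful of in-memory records. This is a toy under loud assumptions: the `run_query` function understands only the single query shape quoted above, it is not a DQL parser, and the sample documents are invented.

```python
import re

# Matches only: select <columns> where #<Field> = '<value>'
QUERY_SHAPE = re.compile(
    r"^select\s+(?P<cols>[\w\s,]+?)\s+where\s+#(?P<field>\w+)\s*=\s*'(?P<value>[^']*)'$",
    re.IGNORECASE,
)


def run_query(query: str, documents: list) -> list:
    """Return the selected columns of every document whose field equals the value."""
    match = QUERY_SHAPE.match(query.strip())
    if match is None:
        raise ValueError("this sketch only understands: select ... where #Field = 'value'")
    columns = [c.strip() for c in match["cols"].split(",")]
    return [
        tuple(doc[c] for c in columns)
        for doc in documents
        if doc["fields"].get(match["field"]) == match["value"]
    ]


# Invented sample data.
documents = [
    {"id": 1, "name": "Harbor at dawn", "fields": {"PictureContent": "boat"}},
    {"id": 2, "name": "Mountain pass", "fields": {"PictureContent": "road"}},
    {"id": 3, "name": "Regatta", "fields": {"PictureContent": "boat"}},
]

rows = run_query("select id, name where #PictureContent = 'boat'", documents)
```

The result holds one (id, name) pair per matching document, here documents 1 and 3. The filter compares a document field rather than a table column.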

Daisy’s eschewing of a repository structure appears, at first glance, to be a severe omission. Further reflection, however, reveals this weakness as a strength. In a typical CMS, a document is placed into a specific collection within the repository, but that implies a redundancy: Someone has used the document’s content to determine which collection to put the document in. If you’ve properly tagged the document, however, and if your repository server can create a view of the repository derived from those tags, then the equivalent of a collection structure can be rendered at display time. And, unlike collection-based repository servers, such a “view-based” server renders multiple, different views of the same repository. This is exactly what Daisy does, and the result is quite impressive.
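The view-based argument can be illustrated in a few lines: one flat pool of tagged documents, and several "collection-like" arrangements computed from it on demand. The tag names and sample documents are invented, and the helper functions are hypothetical; Daisy's navigation documents express the same idea through queries, not through this code.

```python
from collections import defaultdict
from datetime import date

# One flat container; no folders. The tags carry all the structure.
documents = [
    {"name": "Install guide", "topic": "setup", "modified": date(2005, 3, 1)},
    {"name": "Query primer", "topic": "search", "modified": date(2005, 1, 15)},
    {"name": "Upgrade notes", "topic": "setup", "modified": date(2005, 2, 10)},
]


def view_by(docs, key):
    """Group document names under each value of one tag (a folder-like view)."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key]].append(doc["name"])
    return dict(groups)


def view_sorted(docs, key):
    """List document names ordered by one tag."""
    return [doc["name"] for doc in sorted(docs, key=lambda d: d[key])]


by_topic = view_by(documents, "topic")        # looks like two folders
by_date = view_sorted(documents, "modified")  # oldest first
by_title = view_sorted(documents, "name")     # alphabetical
```

None of the three views is stored anywhere. Each is derived at display time, so adding a fourth costs a new query and leaves the repository untouched.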

Ixiasoft TeXtML
TeXtML applies the bulk of its energies to the storage, retrieval, and management of text, and does so by creating an environment awash in XML.

It’s not much of a stretch to say that TeXtML takes text documents from our universe, maps them into their equivalents in an XML universe, and uses the capabilities of that universe to provide search and management functions that would not be available otherwise. (This is not to suggest that TeXtML can handle only text documents: It can easily store and retrieve documents with embedded binary data.)

TeXtML uses a collections paradigm for organizing documents. Collections appear as named folders on TeXtML’s administration console, and are navigated using the standard path constructs that anyone familiar with a file system would recognize.

How documents are stored in the repository, though, is a bit complicated. As stated above, documents are mapped to XML equivalents — but that is only partly true. On the one hand, documents are stored wholesale in their native format. On the other hand, when a document is placed in the repository, it is parsed into a kind of XML doppelganger document that TeXtML uses to build indexes for the document. The TeXtML repository keeps track of the relationship between the original document and its XML shadow. (This technique of creating XML shadow documents while keeping the original available helps TeXtML significantly with its indexing chores, thus speeding queries.)
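The original-plus-shadow arrangement can be sketched as a store that keeps the native bytes untouched while indexing a separately extracted text version. This is a simplified model, not TeXtML's implementation: the `ToyRepository` class is hypothetical, plain extracted text stands in for the XML shadow, and the index is a basic inverted index.

```python
import re
from collections import defaultdict


class ToyRepository:
    """Hypothetical store pairing each original with an indexable shadow."""

    def __init__(self):
        self.originals = {}            # doc id -> native bytes, stored wholesale
        self.shadows = {}              # doc id -> extracted text (stand-in for the XML shadow)
        self.index = defaultdict(set)  # word -> ids of documents containing it

    def add(self, doc_id, original: bytes, extracted_text: str) -> None:
        self.originals[doc_id] = original
        self.shadows[doc_id] = extracted_text
        # Index the shadow, never the original.
        for word in re.findall(r"\w+", extracted_text.lower()):
            self.index[word].add(doc_id)

    def search(self, word: str) -> list:
        """Look the word up in the index; no document is re-read."""
        return sorted(self.index.get(word.lower(), set()))


repo = ToyRepository()
repo.add("a", b"%PDF-1.4 ...binary...", "Annual sailing report")
repo.add("b", b"{\\rtf1 ...binary...", "Sailing schedule for spring")

hits = repo.search("Sailing")
```

A search touches only the index, and the caller can still fetch the untouched original for any hit, which is the benefit the review attributes to the shadow technique.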

The parsing is performed by TeXtML’s Universal Converter, which reads some 220-plus document formats. It is an optional component, but without it, the only querying you can do is on document metadata such as title, creation date, document type, and so on.

Indexes and Queries
TeXtML knows which parts of a given document are to be indexed via an index definition document. There is only one index definition document in the repository, and its content is entirely XML. So, when a new document enters the repository, it is dissected by the Universal Converter, and the index definition document is consulted to determine which elements are to be indexed. TeXtML creates indexes for full-text content, strings, numeric data, dates, and time.
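The role of an index definition can be shown with a small sketch: a table that names which elements of a parsed document feed which typed index. Everything here is assumed for illustration. TeXtML's index definition is itself an XML document whose format this review does not show, so a Python dictionary stands in for it, and the element names in the sample shadow are invented.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical index definition: index name -> (element path, value converter).
INDEX_DEFINITION = {
    "title": ("title", str),                               # string index
    "pages": ("meta/pages", int),                          # numeric index
    "published": ("meta/published", date.fromisoformat),   # date index
}


def index_entries(shadow_xml: str) -> dict:
    """Extract only the elements the definition names, converted to their types."""
    root = ET.fromstring(shadow_xml)
    entries = {}
    for index_name, (path, convert) in INDEX_DEFINITION.items():
        element = root.find(path)
        if element is not None and element.text:
            entries[index_name] = convert(element.text.strip())
    return entries


# Invented shadow document.
shadow = """
<document>
  <title>Quarterly report</title>
  <meta><pages>12</pages><published>2005-06-30</published></meta>
  <body>Not named in the definition, so not indexed in this sketch.</body>
</document>
"""

entries = index_entries(shadow)
```

Elements the definition does not mention, such as `body` here, contribute nothing to the indexes, so the definition alone decides what becomes searchable.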

TeXtML’s query language is yet another XML variant, entirely unlike XQuery. The dissimilarity is understandable. TeXtML is primarily intent on performing rapid document content search; less important is the capability to navigate an XML document’s structure using XPath-style expressions (as can happen in XQuery).

TeXtML’s demonstration download comes with a preloaded repository, as well as an application that allows you to experiment with the system’s querying capabilities. The application lets the user enter queries by filling in text boxes, generates the query invisibly, then executes it.

The installation also includes sample apps and queries, and the included programmer’s manual provides a line-by-line explanation of the VBScript programs. This is not to suggest that VBScript is your only programming avenue into TeXtML, which supports APIs for Java, native .NET, COM, and OLE DB (Object Linking and Embedding, Database). There is also a WebDAV extension; but, at the time of this writing, the API did not support some of TeXtML’s advanced features.

Concluding Content
Daisy could certainly benefit from a smoother installation. Hopefully, a turnkey version, expected as part of the next release, will eliminate that complaint. Beyond that, the Daisywiki is a joy to play with, and is an excellent test-drive of Daisy’s novel stuff-it-all-in-one-bag approach to document storage.

TeXtML is the product for scuba-diving through oceans of text content. It also provides safeguard features that Daisy doesn’t have, such as the Fault Tolerant Server, which replicates documents and transactions on multiple TeXtML servers.

If hard-core text searching is what you need in your CMS, then by all means give TeXtML a look. Daisy, however, has that powerful attribute that we are seeing more and more in high-quality software: open source. If you want to set up a wiki site in an evening or two, Daisy is very hard to beat.

InfoWorld Scorecard
Criterion (weight): Ixiasoft TeXtML Server / Daisy 1.3
Value (10.0%): 8.0 / 9.0
Integration (20.0%): 8.0 / 8.0
Scalability (10.0%): 9.0 / 8.0
Management (20.0%): 8.0 / 8.0
Flexibility (20.0%): 8.0 / 9.0
Ease of use (20.0%): 8.0 / 8.0
Overall Score (100%): 8.1 / 8.3
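The overall scores are consistent with a weighted average of the six criteria, using the percentages listed in the scorecard. The check below assumes that is how the totals were derived; the review does not state the formula.

```python
# Weights and scores copied from the scorecard.
WEIGHTS = {
    "Value": 0.10,
    "Integration": 0.20,
    "Scalability": 0.10,
    "Management": 0.20,
    "Flexibility": 0.20,
    "Ease of use": 0.20,
}

SCORES = {
    "Ixiasoft TeXtML Server": {
        "Value": 8.0, "Integration": 8.0, "Scalability": 9.0,
        "Management": 8.0, "Flexibility": 8.0, "Ease of use": 8.0,
    },
    "Daisy 1.3": {
        "Value": 9.0, "Integration": 8.0, "Scalability": 8.0,
        "Management": 8.0, "Flexibility": 9.0, "Ease of use": 8.0,
    },
}


def overall(scores: dict) -> float:
    """Weighted average of the criterion scores, rounded to one decimal place."""
    return round(sum(WEIGHTS[criterion] * scores[criterion] for criterion in WEIGHTS), 1)


totals = {product: overall(scores) for product, scores in SCORES.items()}
```

Under that assumption the computation gives 8.1 for TeXtML and 8.3 for Daisy, matching the published totals.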