The future of XML documents and relational databases

As new species of XML documents are emerging, vendors are unveiling increased RDBMS support for XML

When XML came along five years ago, promising to rewrite the rules of data management, vendors of relational databases took note, but they didn't panic. They'd already seen this movie a decade before, when the object database had been cast in the role of paradigm shifter. This new software species did emerge and did popularize the notion of persistence -- that is, the capability of storing and retrieving programming-language objects without arduous translation to and from relational tables. As it turned out, however, the old RDBMS dog could learn new tricks. Relational databases figured out how to store complex types using the SQL:1999 object model. Implementations of JDO (Java Data Objects) exist for relational as well as for object databases. And according to Microsoft, the forthcoming Yukon edition of SQL Server will be capable of persisting .Net objects.

Having absorbed objects, the RDBMS vendors are now working hard to absorb XML documents. Don't expect a simple rerun of the last movie, though. We've always known that most of the information that runs our businesses resides in the documents we create and exchange, and those documents have rarely been kept in our enterprise databases. Now that XML can represent both the documents that we see and touch -- such as purchase orders -- and the messages that exchange those documents on networks of Web services, it's more critical than ever that our databases can store and manage XML documents. A real summer blockbuster is in the making. No one knows exactly how it will turn out, but we can analyze the story so far and make some educated guesses.

Consensual Hallucination

The first step in the long journey of SQL/XML hybridization was to publish relational data as XML. BEA Chief Architect Adam Bosworth, who worked on the idea's SQL Server implementation, calls it "the consensual-hallucination approach -- we all agree to pretend there is a document."

XML publishing was the logical place to start because it's easy to represent a SQL result set in XML and because so many dynamic Web pages are fed by SQL queries. The traditional approach required programmatic access to the result set and programmatic construction of the Web page. The new approach materializes that dynamic Web page in a fully declarative way, using a SQL-to-XML query to produce an XML representation of the data and XSLT (eXtensible Stylesheet Language Transformation) to massage the XML into the HTML delivered to the browser.

Originally these virtual documents were created using proprietary SQL extensions such as SQL Server's FOR XML clause. There's now an emerging ISO/ANSI standard called SQL/XML, which defines a common approach. SQL/XML is supported today by Oracle and DB2. It defines XML-oriented operators that work with the native XML data types available in these products. SQL Server does not yet support an XML data type or the SQL/XML extensions, but Tom Rizzo, SQL Server group product manager at Redmond, Wash.-based Microsoft, says that Yukon, due in 2004, will.

Storing Documents

Most of the information in an enterprise lives in documents kept in file systems, not in relational databases. There have always been reasons to move those documents into databases -- centralized administration, full-text search -- but in the absence of a way to relate the data in the documents to the data in the database, those reasons weren't compelling. XML cinches the argument.

As business documents morph from existing formats to XML -- admittedly a long, slow process that has only just begun -- it becomes possible to correlate the two flavors of data. Consider an insurance application that stores claims data in a relational table and claims documents in XML. A hybrid SQL/XML database enables the application to extract fragments of XML from a subset of the documents. And that subset can be created by joining XML elements in the document with column values in relational tables.

These wildly powerful effects are currently achieved using a few different kinds of storage and query strategies. On the storage side, there are two general approaches. You can put whole documents into columns of the database, or you can "shred" the document into a collection of relational tables. The latter approach makes best use of the database's query engine and atomic update capabilities, but mapping from irregular XML data to SQL is much harder than mapping from SQL to XML. It helps if your XML documents are governed by XML Schema descriptions. These provide hints to the XML-to-SQL mapper and can be annotated to more precisely control the mapping. It also helps if your database supports objects that can receive irregularly shaped XML data. "We extended the relational base technology to include objects as part of SQL:1999," says Sandeepan Banerjee, director of product management at Oracle Server Technologies. "And this has matured with 8i and 9i to the point where the typing system of XML Schema can be fully represented in the object/relational typing."

Query Strategies

The foundation of all XML-oriented query strategies is XPath, a syntax built to descend treelike structures and to lop off branches. When an XSLT stylesheet transforms an XML document, it uses XPath to isolate fragments of the document. Relational databases that support XML queries -- including stalwarts Oracle, DB2, and SQL Server, newcomers such as OpenLink Software's Virtuoso, but not yet MySQL -- use XPath in the same way. At first, this XPath support was delivered in the form of proprietary extensions. More recently, the SQL/XML standard has defined a common set of XPath-aware SQL extensions. XPath is also used in the W3C's forthcoming XQuery standard, an ambitious effort to adapt the data-joining power of SQL to the world of semistructured XML data. "We're working heavily with XQuery in order to enable manipulation of XML content in ways familiar to SQL developers," says Jeff Jones, director of strategy for the information management group at Armonk, N.Y.-based IBM.

Although vendors are chomping at the bit for XQuery 1.0 to be finalized, their implementations of it will be less powerful, in some ways, than their current SQL/XML implementations. Most notably, XQuery does not define a syntax for updating elements within XML documents. Although SQL/XML's update mechanism is not yet approved, it has been defined and is already implemented in Oracle and DB2.

Has SQL/XML stolen XQuery's thunder? In the short term, XQuery may appear to be just an alternative way to do things that can be done equally well in SQL and XPath. But Redwood Shores, Calif.-based Oracle's Banerjee thinks that in the long run, it's possible that developers "will want to stay within an XML abstraction for all their data sources." In that case XQuery, a rich and complete programming language built to manipulate complex data, could emerge as a major paradigm.

The Future of Documents

Imagine a purchase order flowing through a business process some time in 2005. It's an XML document, created with a tool such as InfoPath, carrying a mixture of core data and contextual metadata. The core data, including the item number and department code, will wind up in the columns of a relational table. The contextual metadata, which might include a threaded discussion made from comments injected by the requester, the reviewer, and the approver, will remain in document form. "This human context is never stored in the RDBMS today," says Kingsley Idehen, CEO of Burlington, Mass.-based OpenLink. Yet it's the key to understanding how the data got there and what it means.

Once written, the purchase order is injected into a workflow orchestrated on top of a Web services network. A security service may enforce authorization policy by updating a SOAP header; a choreography service may search for sets of documents that have SOAP headers that contain the same correlation ID. These active intermediaries will need some kind of database technology to manage the XML that lives transiently in their queues, but it probably won't be a job for Oracle or DB2. Here a specialized XML database, such as Software AG's Tamino or Sleepycat Software's Berkeley DB XML may be better suited to the task. They're fast and, as Mike Champion, senior R&D advisor at Software AG in Darmstadt, Germany notes they're built to work well with dynamic XML documents even when those documents lack the schemas the RDBMS SQL/XML mappers rely on.

During the workflow and after it has been completed, the document will be accessible to interested parties via a certain URL. That URL might resolve to a projection of the document -- from a hybrid SQL/XML RDBMS, to an intranet Web server or a WebDAV repository such as Oracle's. Alternatively, the URL might resolve to the underlying instance of the document stored natively in the RDBMS. Either way, the state of the business process -- both core data and contextual metadata -- will be visible at all times to anyone who's interested in looking at it and is authorized to do so. What's more, both flavors of data carried in the document will be accessible to queries that reach across the enterprise, joining SQL and XML sources to create consolidated views.

Did I say that this will come to pass by 2005? Make that 2006 or maybe 2007. A major shift in the style of enterprise data management is under way, and there are huge architectural issues yet to be resolved. Oracle, not surprisingly, wants you to store everything in a centralized hybrid DBMS. IBM says it would rather enable you to federate data across a range of sources. Each strategy has merit, and most enterprises will wind up pursuing both -- in different ways, for various reasons. Despite these differences, we are witnessing a sacred union. SQL and XML have been pronounced man and wife, and the honeymoon has begun.

Copyright © 2003 IDG Communications, Inc.