XML's quirky namespaces

You may not understand namespaces now, but soon you may have to

Last month, Microsoft announced that the forthcoming Office 12 will save to XML by default and that earlier versions will be retrofitted to work with XML. This week Apple released its podcast-aware version of iTunes and defined an extension to RSS 2.0 for use with its online music store. Over the next year or so, these initiatives will create millions of new users of XML. They'll also expose thousands of developers to a feature of XML that's caused more than its fair share of headaches: namespaces.

You can perform all kinds of useful XML processing without ever touching a namespace, and many developers do. Most flavors of RSS don't use namespaces, for example. The tag names -- title, link, description -- are only implicitly associated with RSS feeds, and that's fine for many purposes. But what happens when you extract an RSS item from a feed, mix it with a chunk of XML from some other source, and produce an HTML page? Now you need to be able to distinguish the title of the RSS item from, say, the title of the HTML page.
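
To make the collision concrete, here is a minimal sketch in Python using the standard library's ElementTree. The mixed document, its h prefix, and the sample titles are invented for illustration. The page's elements are bound to the XHTML namespace, while the embedded RSS item is left unqualified, as in most flavors of RSS.

```python
import xml.etree.ElementTree as ET

# The XHTML namespace URI; the "h" prefix is an arbitrary local choice.
NS = {"h": "http://www.w3.org/1999/xhtml"}

# Hypothetical mixed document: an XHTML page embedding an un-namespaced RSS item.
MIXED = """\
<h:html xmlns:h="http://www.w3.org/1999/xhtml">
  <h:head><h:title>Page title</h:title></h:head>
  <h:body>
    <item>
      <title>Item title</title>
      <link>http://example.com/entry</link>
    </item>
  </h:body>
</h:html>
"""

root = ET.fromstring(MIXED)

# The qualified name selects only the XHTML title...
page_title = root.find("h:head/h:title", NS).text

# ...and the unqualified name selects only the RSS item's title.
item_title = root.find(".//item/title").text

print(page_title)  # Page title
print(item_title)  # Item title
```

Both elements are called title, but only one of them carries the XHTML namespace, so the two queries cannot be confused.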

Modular namespaces are a familiar concept in many realms. Area codes disambiguate phone numbers; domain names qualify URLs; package names scope identifiers in programs. Partitioning XML vocabularies in the same way seems like a natural thing to do, and it is. But for a variety of reasons explained in Ronald Bourret's "Namespace Myths Exploded" -- an essay written way back in 2000 that still resonates today -- XML namespaces cause a lot of confusion.

Recently, for example, I needed to process some RSS 1.0 feeds. An RSS 1.0 feed is actually rooted in the RDF (Resource Description Framework) namespace, though its items live in the RSS 1.0 namespace. Such feeds typically also weave in elements from other namespaces -- for example, Dublin Core metadata. My task was simple: parse the feed, use XPath queries to carve out items, and unpack the elements contained within those items.
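
The task described above can be sketched as follows, again with Python's ElementTree rather than any of the toolkits discussed in this column. The sample feed and the extract_items helper are hypothetical. The three namespace URIs are the published ones for RDF, RSS 1.0, and Dublin Core.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map used by the queries. The prefixes here are local choices
# and need not match the prefixes used inside the feed itself.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical RSS 1.0 feed: rooted in RDF, items in the RSS 1.0 default
# namespace, with Dublin Core metadata woven in.
FEED = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel rdf:about="http://example.com/feed">
    <title>Example feed</title>
    <link>http://example.com/</link>
    <description>Sample channel</description>
  </channel>
  <item rdf:about="http://example.com/1">
    <title>First post</title>
    <link>http://example.com/1</link>
    <dc:creator>Alice</dc:creator>
  </item>
  <item rdf:about="http://example.com/2">
    <title>Second post</title>
    <link>http://example.com/2</link>
    <dc:creator>Bob</dc:creator>
  </item>
</rdf:RDF>
"""


def extract_items(feed_xml):
    """Carve out each item and unpack its title, link, and creator."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.findall("rss:item", NS):
        items.append({
            "title": item.findtext("rss:title", namespaces=NS),
            "link": item.findtext("rss:link", namespaces=NS),
            "creator": item.findtext("dc:creator", namespaces=NS),
        })
    return items


root = ET.fromstring(FEED)
# An unqualified query matches nothing, because every item is namespaced.
print(len(root.findall("item")))            # 0
print(len(extract_items(FEED)))             # 2
print(extract_items(FEED)[0]["creator"])    # Alice
```

The unqualified query that silently returns nothing is the kind of surprise that makes namespaced feeds harder to process than they first appear.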

This proved surprisingly hard to do with my regular XML parser and toolkit, libxml2, which handles namespaces strictly. I then repeated the exercise using three other toolkits -- Python's minidom module, E4X (ECMAScript for XML) implemented using Rhino, and Mark Logic's XQuery-based Content Interaction Server. Each made the task simpler, though perhaps not laudably so in the case of minidom and E4X, neither of which requires namespace prefixes to resolve to Uniform Resource Identifiers. But what's most striking when you point a variety of XML toolkits at documents that use namespaces is how differently each of them approaches the problem.
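
One way to see why prefix-based matching is convenient but fragile is to compare the two lookup styles that minidom offers. This is a sketch under stated assumptions, not a reconstruction of the exercise above. The two sample documents and both helper functions are hypothetical, and the behavior shown is that of the minidom shipped with current Python 3, which may differ from older releases.

```python
from xml.dom import minidom

DC_NS = "http://purl.org/dc/elements/1.1/"

# Two hypothetical fragments that bind the same Dublin Core URI
# to different prefixes. To a namespace-aware reader they are equivalent.
DOC_A = ('<item xmlns:dc="http://purl.org/dc/elements/1.1/">'
         '<dc:creator>Alice</dc:creator></item>')
DOC_B = ('<item xmlns:dublin="http://purl.org/dc/elements/1.1/">'
         '<dublin:creator>Bob</dublin:creator></item>')


def creators_by_prefix(xml_text):
    """Match on the literal qualified name, prefix included."""
    dom = minidom.parseString(xml_text)
    return [n.firstChild.data for n in dom.getElementsByTagName("dc:creator")]


def creators_by_uri(xml_text):
    """Match on namespace URI plus local name, ignoring the prefix."""
    dom = minidom.parseString(xml_text)
    nodes = dom.getElementsByTagNameNS(DC_NS, "creator")
    return [n.firstChild.data for n in nodes]


print(creators_by_prefix(DOC_A))  # ['Alice']
print(creators_by_prefix(DOC_B))  # []
print(creators_by_uri(DOC_A))     # ['Alice']
print(creators_by_uri(DOC_B))     # ['Bob']
```

The prefix-based lookup is shorter to write, but it breaks as soon as a publisher picks a different prefix for the same namespace.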

That's understandable, given that namespaces were always -- and still are -- optional. But thanks to Microsoft and Apple, what was the exception may soon become the rule.

That's good news in the long run. We'll increasingly want to mix and remix XML data, and to do so we'll need to master namespaces. In the short run, though, I expect more of the turbulence we ran into this week. Sam Ruby and Mark Pilgrim, co-developers of the RSS/Atom Feed Validator and contributors to the Atom specification, found problems with Apple's specification of an iTunes namespace, and with Apple's -- and other podcast publishers' -- use of that namespace. Apple and those publishers should have known better. But they weren't the first to be bitten by the quirkiness of XML namespaces, and they won't be the last.