Mapping XML to Java, Part 2

Create a class library that uses the SAX API to map XML documents to Java objects

As I mentioned in Part 1, one of the big problems programmers face when using the SAX API is conceptual. I'd like to address that issue again as the foundation for the reusable class library that we will develop in this article.

Regardless of whether you use DOM or SAX, when mapping XML data into Java, two things happen -- navigation and data collection. DOM and SAX differ in how they handle those two aspects: DOM separates navigation from data collection, while SAX merges them.

Most of DOM's performance weakness stems from that separation. The DOM programming model makes separating navigation from data collection seem natural and necessary, but the separation is not, in fact, a runtime requirement. SAX pierces that illusion by merging navigation and data collection at runtime, at the cost of making its programming model less obvious.

Using DOM, once you've created your in-memory DOM tree, you navigate to find the node that interests you. Then, once you've found the correct node, you collect data. Navigation and data collection are two conceptually separate steps. Unfortunately, as previously mentioned, building and using the in-memory DOM tree carries a significant performance cost.
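To make that separation concrete, here is a minimal sketch of the DOM style, using the standard JAXP and org.w3c.dom classes. The class name DomSketch and the helper methods child and text are names invented for this illustration, and the XML is a cut-down version of the customer order data used later in this article. Navigation lives in one helper, data collection in another, and the whole document is in memory before either runs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative sketch only: DomSketch, child, and text are hypothetical names.
public class DomSketch {

    // Navigation: first direct child element with the given tag name, or null.
    static Element child(Element parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && n.getNodeName().equals(name)) {
                return (Element) n;
            }
        }
        return null;
    }

    // Data collection: trimmed text of an element we have already located.
    static String text(Element e) {
        return e.getTextContent().trim();
    }

    static List<String> itemNames(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The entire document is loaded into memory first.
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Element items = child(doc.getDocumentElement(), "Items");
            List<String> names = new ArrayList<>();
            for (Node n = items.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n instanceof Element && n.getNodeName().equals("Item")) {
                    // Navigate to the Name child, then collect its text.
                    names.add(text(child((Element) n, "Name")));
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<CustomerOrder>"
            + "<Customer><Name> Customer X </Name></Customer>"
            + "<Items><Item><Name> Item 1 </Name></Item>"
            + "<Item><Name> Item 2 </Name></Item></Items>"
            + "</CustomerOrder>";
        System.out.println(itemNames(xml)); // prints [Item 1, Item 2]
    }
}
```

Note that the customer's Name element never gets confused with an item's Name element here, because navigation always starts from a known parent node.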

With SAX, it's more of a juggling game. You listen to SAX events to keep track of where you are -- a different form of navigation. When the SAX events have positioned you in just the right place, you collect data. One of the reasons that SAX hasn't dominated the XML APIs is that the navigational aspect of its programming model is not as intuitive as it is with DOM.

So wouldn't it be really cool if we could get the navigational and data collection aspects of SAX into separate corners but keep the runtime performance advantages? Well, pay attention, because that's exactly what we will do. Nothing prevents us from separating navigation and data collection in the programming model during development while leaving them intermixed at runtime.

You are here

In Part 1, I went through some basic applications of SAX. I also mentioned that there were some situations that needed special attention, such as recursive data structures. To create a class library that separates out the navigational aspects of SAX in the programming model, we will need a general-purpose approach to dealing with navigation. That approach will have to deal with all the special cases, including ambiguous tag names and recursive data structures.

So, how do we do that?

The key to navigation in SAX: at all times keep track of where you are during parsing. The most complicated navigational case is keeping track of where you are while receiving SAX events for recursive data structures generated while parsing an XML document. The conventional programming approach to using recursive structures -- sometimes called walking the tree -- is to use either a stack data structure or recursive function calls. Unfortunately, we can't use recursive function calls in SAX because we have to return control back to the XML parser after processing each SAX event. But we can use a stack data structure to keep track of where we are as the SAX events arrive.

Using a stack fixes a second problem that I mentioned in my previous article: if an ambiguous tag name, such as name or location, appears in more than one place within the XML document, you have to do something to remove the ambiguity. Using the full XML path from the XML document root to the ambiguous tag accomplishes that, and a stack makes the full path from root to tag name available at all times. So, using a stack addresses both special cases.

To demonstrate, let's look at a simple example that pushes tag names onto a stack as they are discovered and concatenates the stack's contents into a path string that identifies the current position:

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.*;
import java.util.*;
public class Example1 extends DefaultHandler {
   // A stack to keep track of the tag names
   // that are currently open ( have called
   // "startElement", but not "endElement".)
   private Stack tagStack = new Stack();
   // Local list of item names...
   private Vector items = new Vector();
   // Customer name...
   private String customer;
   // Buffer for collecting data from
   // the "characters" SAX event.
   private CharArrayWriter contents = new CharArrayWriter();
   // Override methods of the DefaultHandler class
   // to gain notification of SAX Events.
   //
   // See org.xml.sax.ContentHandler for all available events.
   //
   public void startElement( String namespaceURI,
                             String localName,
                             String qName,
                             Attributes attr ) throws SAXException {
      contents.reset();
      // push the tag name onto the tag stack...
      tagStack.push( localName );
      // display the current path that has been found...
      System.out.println( "path found: [" + getTagPath() + "]" );
   }
   public void endElement( String namespaceURI,
                           String localName,
                           String qName ) throws SAXException {
      if ( getTagPath().equals( "/CustomerOrder/Customer/Name" ) ) {
         customer = contents.toString().trim();
      }
      else if ( getTagPath().equals( "/CustomerOrder/Items/Item/Name" ) ) {
         items.addElement( contents.toString().trim() );
      }
      // clean up the stack...
      tagStack.pop();
   }
   public void characters( char[] ch, int start, int length )
                  throws SAXException {
      // accumulate the contents into a buffer.
      contents.write( ch, start, length );
   }
   // Build the path string from the current state
   // of the stack...
   //
   // Very inefficient, but we'll address that later...
   private String getTagPath( ){
      //  build the path string...
      String buffer = "";
      Enumeration e = tagStack.elements();
      while( e.hasMoreElements()){
         buffer = buffer + "/" + (String) e.nextElement();
      }
      return buffer;
   }
   public Vector getItems() {
      return items;
   }
   public String getCustomerName() {
      return customer;
   }
   public static void main( String[] argv ){
      System.out.println( "Example1:" );
      try {
         // Create SAX 2 parser...
         XMLReader xr = XMLReaderFactory.createXMLReader();
         // Set the ContentHandler...
         Example1 ex1 = new Example1();
         xr.setContentHandler( ex1 );
         System.out.println();
         System.out.println("Tag paths located:");
         // Parse the file...
         xr.parse( new InputSource(
               new FileReader( "Example1.xml" )) );
         System.out.println();
         System.out.println("Names located:");
         // Display Customer
         System.out.println( "Customer Name: " + ex1.getCustomerName() );
         // Display all item names to stdout...
         System.out.println( "Order Items: " );
         String itemName;
         Vector items = ex1.getItems();
         Enumeration e = items.elements();
         while( e.hasMoreElements()){
            itemName = (String) e.nextElement();
            System.out.println( itemName );
         }
      }catch ( Exception e )  {
         e.printStackTrace();
      }
   }
}

Below you'll find sample data to use with our example:

<?xml version="1.0"?>
<CustomerOrder>
   <Customer>
      <Name> Customer X </Name>
      <Address> unknown  </Address>
   </Customer>
   <Items>
      <Item>
         <ProductCode> 098 </ProductCode>
         <Name> Item 1 </Name>
         <Price> 32.01 </Price>
      </Item>
      <Item>
         <ProductCode> 4093 </ProductCode>
         <Name> Item 2 </Name>
         <Price> 0.76 </Price>
      </Item>
      <Item>
         <ProductCode> 543 </ProductCode>
         <Name> Item 3 </Name>
         <Price> 1.42 </Price>
      </Item>
   </Items>
</CustomerOrder>

Running our example with the sample data yields the following output:

Example1:
Tag paths located:
path found: [/CustomerOrder]
path found: [/CustomerOrder/Customer]
path found: [/CustomerOrder/Customer/Name]
path found: [/CustomerOrder/Customer/Address]
path found: [/CustomerOrder/Items]
path found: [/CustomerOrder/Items/Item]
path found: [/CustomerOrder/Items/Item/ProductCode]
path found: [/CustomerOrder/Items/Item/Name]
path found: [/CustomerOrder/Items/Item/Price]
path found: [/CustomerOrder/Items/Item]
path found: [/CustomerOrder/Items/Item/ProductCode]
path found: [/CustomerOrder/Items/Item/Name]
path found: [/CustomerOrder/Items/Item/Price]
path found: [/CustomerOrder/Items/Item]
path found: [/CustomerOrder/Items/Item/ProductCode]
path found: [/CustomerOrder/Items/Item/Name]
path found: [/CustomerOrder/Items/Item/Price]
Names located:
Customer Name: Customer X
Order Items:
Item 1
Item 2
Item 3

You are here -- now do something

Now that you have an idea of where I'm going, what should we keep on the stack? Obviously, rebuilding a path string on every event just to keep track of where you are during parsing is inefficient. We also must tackle how to use the stack effectively. Even though I used a stack in the example, I didn't leverage it to activate the collection of data at key locations during parsing. I simply used it to report where we were, a "you are here" marker, on every single start-tag SAX event.

Back to the question: what should we keep on the stack? I just mentioned that we want to leverage the stack to help us activate the collection of data. So a good candidate for the contents of the stack is data collection actions. As those actions are sometimes just placeholders, I've named them tag trackers.

Tag trackers are markers that represent positions within the XML document. To reflect the structure of the XML document, each tag tracker has one parent and zero or more children. Starting with a root tag tracker, all other tag trackers are connected via a parent-child relationship. When a startElement SAX event occurs, the active tag tracker compares the tag name to the tag names that were associated with each of its child tag trackers when they were created. When a match is found, the active tag tracker places itself on the stack and makes the matching child tag tracker the new active tag tracker. Later, when the child has finished processing SAX events, the parent is popped back off the stack and reestablished as the active tag tracker.

Tag trackers not only mark positions within the XML document but also associate actions with those positions. That is where the navigational aspects and the data collection aspects coordinate. When a tag tracker activates, it fires an associated event, indicating that the SAX API has announced a particular position in the XML document. Unlike a raw SAX event, the tag tracker network fires that event only when the full path has been reached, making the event fully specified and unambiguous.

Tag trackers work as a group in a tag tracker network to navigate an XML document. Programs that use a tag tracker network start by creating a root tag tracker node. Then, for each XML tag that can occur at the root of the XML document, they create a child tag tracker and bind it to the root tag tracker. That process is repeated recursively for each child, at every level of the XML document that is to be mapped, until every XML tag in which the program is interested has a tag tracker linked into the network.
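Before we get to the real thing, here is a rough sketch of the mechanics just described. It is not the class library developed in this series: TrackerSketch, Tracker, track, onEnd, SKIP, and collect are all names invented for this illustration. The sketch also makes two simplifying assumptions. It fires a tracker's action when the element ends rather than when the tracker activates, so that the element's text is available, and it parks unmatched tags on a shared do-nothing tracker.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch only; every name here is hypothetical.
public class TrackerSketch extends DefaultHandler {

    // One position in the document: child trackers keyed by tag name,
    // plus an optional action to run with the element's text.
    static class Tracker {
        final Map<String, Tracker> children = new HashMap<>();
        Consumer<String> action;

        // Create (or reuse) the child tracker bound to tagName.
        Tracker track(String tagName) {
            return children.computeIfAbsent(tagName, k -> new Tracker());
        }

        Tracker onEnd(Consumer<String> a) { action = a; return this; }
    }

    // Stand-in for tags nobody asked about; it has no children and no action.
    private static final Tracker SKIP = new Tracker();

    final Tracker root = new Tracker();
    private final Deque<Tracker> stack = new ArrayDeque<>();
    private Tracker active = root;
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes a) {
        text.setLength(0);
        Tracker child = active.children.get(qName);
        stack.push(active);                       // parent waits on the stack
        active = (child != null) ? child : SKIP;  // matching child takes over
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (active.action != null) active.action.accept(text.toString().trim());
        text.setLength(0);
        active = stack.pop();                     // parent is active again
    }

    // Build a small network for the customer order data and run a parse.
    static List<String> collect(String xml) {
        List<String> found = new ArrayList<>();
        TrackerSketch handler = new TrackerSketch();
        Tracker order = handler.root.track("CustomerOrder");
        order.track("Customer").track("Name").onEnd(s -> found.add("customer: " + s));
        order.track("Items").track("Item").track("Name").onEnd(s -> found.add("item: " + s));
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(collect("<CustomerOrder>"
            + "<Customer><Name> Customer X </Name><Address> unknown </Address></Customer>"
            + "<Items><Item><Name> Item 1 </Name></Item></Items>"
            + "</CustomerOrder>")); // prints [customer: Customer X, item: Item 1]
    }
}
```

The two Name tags never collide, because each action hangs off a tracker that can be reached only through its full path from the root. No path strings are built or compared during parsing; each start tag costs one map lookup and one push.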

Our next example simply demonstrates tag trackers and a stack to keep track of where we are during XML parsing:
