Mapping XML to Java, Part 1

Employ the SAX API to map XML documents to Java objects

XML is hot. Because XML is a form of self-describing data, it can be used to encode rich data models. It's easy to see XML's utility as a data exchange medium between very dissimilar systems. Data can be easily exposed or published as XML from all kinds of systems: legacy COBOL programs, databases, C++ programs, and so on.

TEXTBOX:

TEXTBOX_HEAD: Mapping XML to Java: Read the whole series!

:END_TEXTBOX

However, using XML to build systems poses two challenges. First, while generating XML is a straightforward procedure, the inverse operation, using XML data from within a program, is not. Second, current XML technologies are easy to misapply, which can leave a programmer with a slow, memory-hungry system. Indeed, heavy memory requirements and slow speeds can prove problematic for systems that use XML as their primary data exchange format.

Some standard tools currently available for working with XML are better than others. The SAX API in particular has some important runtime features for performance-sensitive code. In this article, we will develop some patterns for applying the SAX API. You will be able to create fast XML-to-Java mapping code with a minimum memory footprint, even for fairly complex XML structures (with the exception of recursive structures).

In Part 2 of this series, we will cover applying the SAX API to recursive XML structures in which some of the XML elements represent lists of lists. We will also develop a class library that manages the navigational aspects of the SAX API. This library simplifies writing XML mapping code based on SAX.

Mapping code is similar to compiling code

Writing programs that use XML data is like writing a compiler. That is, most compilers convert source code into a runnable program in three steps. First, a lexer module groups characters into words or tokens that the compiler recognizes -- a process known as tokenizing. A second module, called the parser, analyzes groups of tokens in order to recognize legal language constructs. Last, a third module, the code generator, takes a set of legal language constructs and generates executable code. Sometimes, parsing and code generation are intermixed.

To use XML data in a Java program, we must undergo a similar process. First, we analyze every character in the XML text in order to recognize legal XML tokens such as start tags, attributes, end tags, and CDATA sections.

Second, we verify that the tokens form legal XML constructs. If an XML document consists entirely of legal constructs per the XML 1.0 specification, it is well-formed. At the most basic level, we need to make sure that, for instance, all of the tagging has matching opening and closing tags, and the attributes are properly structured in the opening tag.

Also, if a DTD is available, we have the option to make sure that the XML constructs found during parsing are legal in terms of the DTD, as well as being well-formed XML.

Finally, we use the data contained in the XML document to accomplish something useful -- I call this mapping XML into Java.

XML Parsers

Fortunately, there are off-the-shelf components -- XML parsers -- that perform some of these compiler-related tasks for us. XML parsers handle all lexical analysis and parsing tasks for us. Many currently available Java-based XML parsers support two popular parsing standards: the SAX and DOM APIs.

The availability of an off-the-shelf XML parser may make it seem that the hard part of using XML in Java has been done for you. In reality, applying an off-the-shelf XML parser is an involved task.

SAX and DOM APIs

The SAX API is event-based. XML parsers that implement the SAX API generate events that correspond to different features found in the parsed XML document. By responding to this stream of SAX events in Java code, you can write programs driven by XML-based data.

The DOM API is an object-model-based API. XML parsers that implement DOM create a generic object model in memory that represents the contents of the XML document. Once the XML parser has completed parsing, the memory contains a tree of DOM objects that offers information about both the structure and contents of the XML document.

The DOM concept grew out of the HTML browser world, where a common document object model represents the HTML document loaded in the browser. This HTML DOM then becomes available for scripting languages like JavaScript. HTML DOM has been very successful in this application.

Dangers of DOM

At first glance, the DOM API seems to be more feature-rich, and therefore better, than the SAX API. However, DOM has serious efficiency problems that can hurt performance-sensitive applications.

The current group of XML parsers that support DOM implement the in-memory object model by creating many tiny objects that represent DOM nodes containing either text or other DOM nodes. This sounds natural enough, but has negative performance implications. One of the most expensive operations in Java is the new operator. Correspondingly, for every new operator executed in Java, the JVM garbage collector must eventually remove the object from memory when no references to the object remain. The DOM API tends to really thrash the JVM memory system with its many small objects, which are typically tossed aside soon after parsing.

Another DOM issue is the fact that it loads the entire XML document into memory. For large documents, this becomes a problem. Again, since the DOM is implemented as many tiny objects, the memory footprint is even larger than the XML document itself because the JVM stores a few extra bytes of information regarding all of these objects, as well as the contents of the XML document.

It is also troubling that many Java programs don't actually use DOM's generic object structure. Instead, as soon as the DOM structure loads in memory, they copy the data into an object model specific to a particular problem domain -- a subtle yet wasteful process.

Another subtle issue for the DOM API is that code written for it must scan the XML document twice. The first pass creates the DOM structure in memory, the second locates all XML data the program is interested in. Certain coding styles may traverse the DOM structure several additional times while locating different pieces of XML data. By contrast, SAX's coding style encourages locating and collecting XML data in a single pass.

Some of these issues could be addressed with a better underlying data-structure design to internally represent the DOM object model. Issues such as encouraging multiple processing passes and translating between generic and specific object models cannot be addressed within the XML parsers.

SAX for survival

Compared to the DOM API, the SAX API is an attractive approach. SAX doesn't have a generic object model, so it doesn't have the memory or performance problems associated with abusing the new operator. And with SAX, there is no generic object model to ignore if you plan to use a specific problem-domain object model instead. Moreover, since SAX processes the XML document in a single pass, it requires much less processing time.

SAX does have a few drawbacks, but they are mostly related to the programmer, not the runtime performance of the API. Let's look at a few.

The first drawback is conceptual. Programmers are accustomed to navigating to get data; to find a file on a file server, you navigate by changing directories. Similarly, to get data from a database, you write an SQL query for the data you need. With SAX, this model is inverted. That is, you set up code that listens to the list of every available piece of XML data available. That code activates only when interesting XML data are being listed. At first, the SAX API seems odd, but after a while, thinking in this inverted way becomes second nature.

The second drawback is more dangerous. With SAX code, the naive "let's take a hack at it" approach will backfire fairly quickly, because the SAX parser exhaustively navigates the XML structure while simultaneously supplying the data stored in the XML document. Most people focus on the data-mapping aspect and neglect the navigational aspect. If you don't directly address the navigational aspect of SAX parsing, the code that keeps track of the location within the XML structure during SAX parsing will become spread out and have many subtle interactions. This problem is similar to those associated with overdependence on global variables. But if you learn to properly structure SAX code to keep it from becoming unwieldy, it is more straightforward than using the DOM API.

Basic SAX

There are currently two published versions of the SAX API. We'll use version 2 (see Resources) for our examples. Version 2 uses different class and method names than version 1, but the structure of the code is the same.

SAX is an API, not a parser, so this code is generic across XML parsers. To get the examples to run, you will need to access an XML parser that supports SAX v2. I use Apache's Xerces parser. (See Resources.) Review your parser's getting-started guide for specifics on invoking a SAX parser.

The SAX API specification is pretty straightforward. In includes many details, but its primary task is to create a class that implements the ContentHandler interface, a callback interface used by XML parsers to notify your program of SAX events as they are found in the XML document.

The SAX API also conveniently supplies a DefaultHandler implementation class for the ContentHandler interface.

Once you've implemented the ContentHandler or extended the DefaultHandler, you need only direct the XML parser to parse a particular document.

Our first example extends the DefaultHandler to print each SAX event to the console. This will give you a feel for what SAX events will be generated and in what order.

To get started, here's the sample XML document we will use in our first example:

<?xml version="1.0"?>
<simple date="7/7/2000" >
   <name> Bob </name>
   <location> New York </location>
</simple>

Next, we see the source code for XML mapping code of the first example:

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.*;
public class Example1 extends DefaultHandler {
   // Override methods of the DefaultHandler class
   // to gain notification of SAX Events.
   //
        // See org.xml.sax.ContentHandler for all available events.
   //
   public void startDocument( ) throws SAXException {
      System.out.println( "SAX Event: START DOCUMENT" );
   }
   public void endDocument( ) throws SAXException {
      System.out.println( "SAX Event: END DOCUMENT" );
   }
   public void startElement( String namespaceURI,
              String localName,
              String qName,
              Attributes attr ) throws SAXException {
         System.out.println( "SAX Event: START ELEMENT[ " +
                  localName + " ]" );
      // Also, let's print the attributes if
      // there are any...
                for ( int i = 0; i < attr.getLength(); i++ ){
                   System.out.println( "   ATTRIBUTE: " +
                  attr.getLocalName(i) +
                  " VALUE: " +
                  attr.getValue(i) );
      }
   }
   public void endElement( String namespaceURI,
              String localName,
              String qName ) throws SAXException {
      System.out.println( "SAX Event: END ELEMENT[ " +
                  localName + " ]" );
   }
   public void characters( char[] ch, int start, int length )
                  throws SAXException {
      System.out.print( "SAX Event: CHARACTERS[ " );
      try {
         OutputStreamWriter outw = new OutputStreamWriter(System.out);
         outw.write( ch, start,length );
         outw.flush();
      } catch (Exception e) {
         e.printStackTrace();
      }
      System.out.println( " ]" );
   }
   public static void main( String[] argv ){
      System.out.println( "Example1 SAX Events:" );
      try {
         // Create SAX 2 parser...
         XMLReader xr = XMLReaderFactory.createXMLReader();
         // Set the ContentHandler...
         xr.setContentHandler( new Example1() );
            // Parse the file...
         xr.parse( new InputSource(
               new FileReader( "Example1.xml" )) );
      }catch ( Exception e )  {
         e.printStackTrace();
      }
   }
}

Finally, here is the output generated by running the first example with our sample XML document:

Example1 SAX Events:
SAX Event: START DOCUMENT
SAX Event: START ELEMENT[ simple ]
   ATTRIBUTE: date VALUE: 7/7/2000
SAX Event: CHARACTERS[
    ]
SAX Event: START ELEMENT[ name ]
SAX Event: CHARACTERS[  Bob  ]
SAX Event: END ELEMENT[ name ]
SAX Event: CHARACTERS[
    ]
SAX Event: START ELEMENT[ location ]
SAX Event: CHARACTERS[  New York  ]
SAX Event: END ELEMENT[ location ]
SAX Event: CHARACTERS[
 ]
SAX Event: END ELEMENT[ simple ]
SAX Event: END DOCUMENT

As you can see, the SAX parser will call the appropriate ContentHandler method for every SAX event it discovers in the XML document.

Hello world

Now that we understand the basic pattern of SAX, we can start to do something slightly useful: extract values from our simple XML document and demonstrate the classic hello world program.

1 2 3 4 Page 1
Page 1 of 4