Cut, paste, split, and assemble XML documents with VTD-XML

VTD-XML eliminates the performance overhead associated with updating XML

My last JavaWorld article "Simplify XML Processing with VTD-XML" looked at three important benefits of VTD-XML: performance, memory usage, and ease of use. VTD-XML makes XML applications not only easier to write, but also leaner and faster. XML applications written in VTD-XML are 10 times more responsive when compared to the same applications written with the Document Object Model (DOM) and are capable of serving 10 times the workload, while maintaining the same quality of services for proportionally bigger XML messages.

However, parsing is often only part of what needs to be done in an XML-based application; many applications change the XML data as well. Consider the following three use-cases involving XML content updates:

  1. An address-book application internally saves data into XML format, and the user wants to update the contact information (e.g., a phone number)
  2. An XML-enabled content switch inspects the incoming and outgoing SOAP messages and selectively turns on the values of mustUnderstand attributes in the SOAP header
  3. An SOA (service-oriented architecture) security application canonicalizes a subset of XML data for subsequent signing and encryption, then inserts the digest and ciphered text back into the XML payload

For those use-cases, the ability to efficiently update XML content contributes significantly to overall application performance. With VTD-XML's incremental update feature, the performance overhead normally associated with updating XML through DOM and SAX is eliminated, as I will illustrate in the example below.

The double whammy of updating XML with DOM or SAX

Unfortunately, DOM and SAX (Simple API for XML) tax application performance twice when applying changes to XML. The first tax is parsing, which is notoriously slow and CPU-intensive; the second, and worse, is the reserialization needed to generate the updated XML. Consider modifying the text content of the following XML snippet from "red" to "blue."

 <color> red </color>  

Using DOM, you would have to go through the following three steps:

  1. Build the DOM tree
  2. Navigate to and then update the text node
  3. Write the updated structure back into XML
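
The three steps above can be sketched with the JDK's built-in DOM and transformer classes. This is only an illustration of the round trip, not a benchmark; the class name DomUpdate and the helper recolor() are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper class, used only to illustrate the DOM round trip.
public class DomUpdate {
    static String recolor(String xml, String newColor) {
        try {
            // Step 1: build the DOM tree (the whole document is parsed into objects)
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            // Step 2: navigate to the text node and update it
            Node text = doc.getDocumentElement().getFirstChild();
            text.setNodeValue(newColor);

            // Step 3: write the entire structure back out as XML
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("DOM round trip failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(recolor("<color> red </color>", " blue "));
    }
}
```

Even for this one-word change, every byte of the document passes through the parser and then through the serializer.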

SAX and Pull parsers are not even worth mentioning here, since neither lets you navigate the tree structure. As the XML file grows, writing out the updated XML (effectively, many string concatenations, buffer allocations, and encoding conversions) further degrades overall application performance, which is already constrained by slow parsing.

Notice that the same task can be done far more efficiently by a human using a text editor. To edit the XML snippet in the previous example, just open the file in a simple text editor, move the cursor to the start of the text node, replace "red" with "blue," and you're done. This time the update is "incremental," meaning it does not touch irrelevant parts of the document. And if we humans can edit XML like this, why can't XML parsers?

VTD-XML enables incremental update

VTD-XML is the first XML parser engineered from the ground up to support incremental updates. In other words, VTD-XML not only parses XML blazingly fast, but also makes possible zero-overhead XML content updates, distancing itself further from DOM and SAX as an advanced and powerful XML parser.

How does VTD-XML accomplish all those feats? By solving three deeply rooted problems surrounding DOM and SAX, each of which can be described as a round-trip performance penalty. The first round-trip is object allocation and garbage collection. Every time an object (e.g., DOM's nodes) is allocated, it will eventually go out of scope and get garbage-collected. The second round-trip applies to virtually all text parsers. Historically, the first step of text processing is always to break input text into tokens, which are small and discrete character arrays containing relevant text data. Applying any small change to the original text requires all the token contents be put back together. The last round-trip is character decoding and re-encoding. DOM and SAX decode XML's native encoding (e.g., UTF-8) into UCS format during parsing, which must be re-encoded into the native encoding when any content changes.

VTD-XML eliminates all of these problems. There is no encoding conversion, no discrete strings, and virtually no object allocations. Virtual Token Descriptor (VTD) is the name of the "non-extractive" tokenization technique largely responsible for VTD-XML's unrivaled efficiency. VTD records are 64-bit integers that encode the lengths, starting offsets, types, and nesting depths of tokens in XML. In other words, digging a little deeper, you will find that the old, discrete-string-based tokenization is doing a little too much; offsets and lengths (also known as non-extractive tokenization) are all you need to represent tokens.
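
To make the idea of a 64-bit record concrete, here is a minimal sketch that packs a token's type, depth, length, and offset into one long and stores the records in a single bulk-allocated array. The field widths, the type constants, and the class name VtdRecordSketch are assumptions chosen for illustration; they should not be read as VTD-XML's actual record layout.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: field widths and type codes below are assumptions for this sketch.
public class VtdRecordSketch {
    static final int LENGTH_SHIFT = 32;   // 20 bits of length start here
    static final int DEPTH_SHIFT = 52;    // 8 bits of nesting depth
    static final int TYPE_SHIFT = 60;     // 4 bits of token type
    static final int TYPE_START_TAG = 0;  // hypothetical type code
    static final int TYPE_TEXT = 5;       // hypothetical type code

    // Pack the four fields into one 64-bit integer; the low 30 bits hold the offset.
    static long pack(int type, int depth, int length, int offset) {
        return ((long) (type & 0xF) << TYPE_SHIFT)
             | ((long) (depth & 0xFF) << DEPTH_SHIFT)
             | ((long) (length & 0xFFFFF) << LENGTH_SHIFT)
             | (offset & 0x3FFFFFFFL);
    }

    static int type(long r)   { return (int) (r >>> TYPE_SHIFT) & 0xF; }
    static int depth(long r)  { return (int) (r >>> DEPTH_SHIFT) & 0xFF; }
    static int length(long r) { return (int) (r >>> LENGTH_SHIFT) & 0xFFFFF; }
    static int offset(long r) { return (int) (r & 0x3FFFFFFFL); }

    public static void main(String[] args) {
        // The original document stays intact; records only point into it.
        byte[] doc = "<color> red </color>".getBytes(StandardCharsets.US_ASCII);

        // One array allocation holds up to 1,024 records; no per-token objects are created.
        long[] records = new long[1024];
        int count = 0;
        records[count++] = pack(TYPE_START_TAG, 0, 5, 1); // element name "color"
        records[count++] = pack(TYPE_TEXT, 0, 5, 7);      // text " red "

        for (int i = 0; i < count; i++) {
            long r = records[i];
            // Token text is materialized only on demand, straight from the source bytes.
            String token = new String(doc, offset(r), length(r), StandardCharsets.US_ASCII);
            System.out.println("type=" + type(r) + " depth=" + depth(r) + " [" + token + "]");
        }
    }
}
```

Because a record is just a number, a token never has to be copied out of the document unless the application actually asks for its text.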

Simple in concept, VTD-XML erases nearly every vice of DOM and SAX. Below is a summary of what differences VTD-XML makes in terms of memory usage, performance, and incremental update.

  • Conserving memory:
    1. Because VTD records are not objects, they are not subject to per-object memory overhead typically associated with Java objects (8 bytes per object)
    2. VTD storage can be bulk-allocated (i.e., using large memory blocks): when allocating a large memory block to store 1,024 VTD tokens, one only incurs the per-array memory overhead once, essentially reducing the per-record overhead to almost nothing
    3. It is easy to reuse VTD token buffers
  • High performance:
    1. VTD-XML's high performance in parsing is a by-product of VTD's memory-conserving features: less memory usage means less memory is allocated
    2. Large memory blocks are faster to allocate and garbage collect than many discrete objects
  • Incremental update:

    Because VTD-XML mandates the original XML as part of its internal representation and exclusively uses offsets and lengths to represent tokens, updating XML content can be done with surgical precision and no longer wastes any CPU cycles on recomposing irrelevant portions of the document

To help you warm up to the concept of incremental update, here is a quick example that uses VTD-XML to replace an attribute value in an XML file, shown in the figure below.

Replace an attribute value.

To generate the updated XML the same way you would with a text editor, our application (source shown below) first parses test.xml using VTDGen's parseFile() method, navigates the cursor to the attribute, replaces the old value (marked green in the figure) with a new one, and then immediately writes the updated document to updated.xml:


    import com.ximpleware.*;
    import java.io.*;

    public class increUpdate {
        public static void main(String[] args) {
            try {
                VTDGen vg = new VTDGen();
                File fo = new File("updated.xml");
                FileOutputStream fos = new FileOutputStream(fo);
                if (vg.parseFile("test.xml", false)) {
                    VTDNav vn = vg.getNav();
                    if (vn.matchElement("purchaseOrder")) {
                        int i = vn.getAttrVal("orderDate");
                        if (i != -1) {
                            // Get the starting offset of "1999-10-21"
                            int os1 = vn.getTokenOffset(i);
                            // Get the ending offset of "1999-10-21"
                            int os2 = vn.getTokenOffset(i) + vn.getTokenLength(i);
                            // Get the total number of bytes of XML
                            int os3 = vn.getXML().length();
                            byte[] xml = vn.getXML().getBytes();

                            // Write everything before "1999-10-21"
                            fos.write(xml, 0, os1);
                            // Write "2006-6-17"
                            fos.write("2006-6-17".getBytes());
                            // Write everything after
                            fos.write(xml, os2, os3 - os2);
                            fos.close();
                        }
                    }
                }
            } catch (Exception e) {
                System.out.println("exception occurred ==>" + e);
            }
        }
    }

Content manipulation with VTD-XML

Enabling more than just incremental update, VTD-XML's non-extractive tokenization unleashes an array of new features. The list below defines four of the XML content manipulation operations made possible by VTD-XML:

  • Cut: Given an XML document, carve away some portions of it (e.g., an attribute, a text node, or an element) while keeping it well-formed.
  • Paste: After copying a chunk of an XML document, stick the copy into another XML document while keeping it well-formed.
  • Split: A beauty of XML is that any child element of the root element is itself well-formed XML. VTD-XML can cut out the child elements of a single large XML document and dump each into its own XML file.
  • Assemble: Pull out element fragments and combine them into a new XML file.

At the API level, VTD-XML supports those operations at the following three granularity levels:

  • A single token: Once you have the index value of a VTD record, you can use VTDNav's getTokenOffset(...) and getTokenLength(...) methods to return the starting offset and length of a token. Both values are useful when the token content must be updated.
  • A group of adjacent tokens: Sometimes a group of adjacent tokens (e.g., an attribute name/value pair) must be replaced or deleted in one fell swoop. In this case, you need the starting offset of the first token in the group and the ending offset of the last.
  • An element: VTDNav's getElementFragment(...) method returns the offset and length of an element as it appears in the XML document. This allows you to directly manipulate the element content with byte-level fidelity.
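
Whatever the granularity, the edit itself reduces to replacing one byte range of the original document. The sketch below shows that operation without the library: the class RangeEdit and the method replaceRange() are hypothetical, and the offsets are found with a plain string search, where a VTD-XML application would obtain them from getTokenOffset(...) and getTokenLength(...).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of VTD-XML: edits a document by byte range.
public class RangeEdit {
    // Returns a copy of doc in which bytes [start, end) are replaced by replacement.
    // Passing an empty replacement deletes the range.
    static byte[] replaceRange(byte[] doc, int start, int end, byte[] replacement) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(doc.length + replacement.length);
        out.write(doc, 0, start);                       // everything before the range, untouched
        out.write(replacement, 0, replacement.length);  // the new content
        out.write(doc, end, doc.length - end);          // everything after the range, untouched
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String source = "<item id=\"7\" note=\"old\">x</item>";
        byte[] doc = source.getBytes(StandardCharsets.US_ASCII);

        // Group of adjacent tokens: the attribute name "note" plus its value.
        // Start at the whitespace before the name so the result stays tidy.
        int start = source.indexOf(" note=");
        int end = source.indexOf("\"old\"") + "\"old\"".length();

        byte[] cut = replaceRange(doc, start, end, new byte[0]);
        System.out.println(new String(cut, StandardCharsets.US_ASCII));
    }
}
```

The same helper covers the single-token case (replace just the value) and the element case (pass the offset and length of the whole fragment).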

Here is a quick overview of the basic VTD-XML concepts you will encounter in the examples that follow.

  • Essential classes:
    1. VTDGen is the name of the class encapsulating parsing functions.
    2. After parsing, you can obtain an instance of VTDNav, which allows you to move to different locations in the tree.
    3. AutoPilot is the wrapper class for XPath and node iterators.
  • Cursor-based model: There is only one cursor available. After parsing, the cursor is at the root element. You can use a global stack to remember the position of the cursor.
  • Stateless XPath evaluation: VTD-XML's XPath evaluation returns one node at a time, unless the node set is empty. An instance of the AutoPilot class acts like a magic hand that moves the cursor position across the XML tree according to the XPath expression.

Cutting XML

Our first example deletes part of the content of a CD catalog file named cd.xml, which is shown below. Notice that two of the CDs are priced below 10, with the other two priced above.

        <TITLE>Empire Burlesque</TITLE>
        <ARTIST>Bob Dylan</ARTIST>
        <TITLE>Still got the blues</TITLE>
        <ARTIST>Gary More</ARTIST>
        <COMPANY>Virgin redords</COMPANY>
        <TITLE>Hide your heart</TITLE>
        <ARTIST>Bonnie Tyler</ARTIST>
        <COMPANY>CBS Records</COMPANY>
        <TITLE>Greatest Hits</TITLE>
        <ARTIST>Dolly Parton</ARTIST>

Our first application (shown below) carves away from the XML file all CDs more expensive than 10. The corresponding XPath expression is /CATALOG/CD[PRICE > 10]. The main application logic first instantiates VTDGen to parse the XML file, then instantiates AutoPilot and selects the XPath expression. During the XPath evaluation, the application repeatedly calls getElementFragment(), which returns a long that encodes both the offset and the length of the segment containing the element matching the XPath. In the final step, the application creates a new file named cd_after.xml by copying the byte content of the original XML, minus those selected segments.
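
The final copying step can be sketched on its own, independent of the library. The example below assumes each fragment is a long with the offset in the lower 32 bits and the length in the upper 32 bits, and that fragments arrive in document order without overlapping. The class CutFragments, its methods, and the tiny sample document are hypothetical; fragments are built by hand here instead of coming from an XPath evaluation.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the "copy everything except the selected segments" step.
public class CutFragments {
    // Assumed encoding: upper 32 bits = length, lower 32 bits = offset.
    static long fragment(int offset, int length) {
        return ((long) length << 32) | (offset & 0xFFFFFFFFL);
    }

    // Copies xml to a new array, skipping each fragment.
    // Fragments must be in document order and must not overlap.
    static byte[] cut(byte[] xml, long[] fragments) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(xml.length);
        int copied = 0; // index of the first byte not yet copied or skipped
        for (long f : fragments) {
            int offset = (int) f;
            int length = (int) (f >>> 32);
            out.write(xml, copied, offset - copied); // keep the bytes before this fragment
            copied = offset + length;                // skip the fragment itself
        }
        out.write(xml, copied, xml.length - copied); // keep the tail of the document
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String source = "<list><item>a</item><item>b</item><item>c</item></list>";
        byte[] xml = source.getBytes(StandardCharsets.US_ASCII);
        int itemLength = "<item>a</item>".length();

        long[] selected = {
            fragment(source.indexOf("<item>a"), itemLength),
            fragment(source.indexOf("<item>c"), itemLength)
        };
        System.out.println(new String(cut(xml, selected), StandardCharsets.US_ASCII));
    }
}
```

Only the bytes that survive are copied, and they are copied verbatim, so whitespace, entity references, and encoding in the untouched parts are preserved exactly.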
