Simplify XML processing with VTD-XML

A new option that overcomes the problems of DOM and SAX

| File name/size | VTD-XML (ms) | VTD-XML buffer reuse (ms) | SAX (ms) | DOM (ms) | DOM deferred (ms) | Piccolo (ms) | Pull (ms) |
|---|---|---|---|---|---|---|---|
| nav_50_0.xml (10304 bytes) | 0.2 | 0.185 | 0.55 | 0.802 | 0.773 | 0.701 | 0.398 |
| officeOrder.xml (10591 bytes) | 0.186 | 0.174 | 0.41 | 0.617 | 0.615 | 0.526 | 0.432 |
| form.xml (15845 bytes) | 0.274 | 0.258 | 0.227 | 0.214 | 0.486 | 0.773 | 0.921 |
| book.xml (22996 bytes) | 0.368 | 0.354 | 0.743 | 2.391 | 2.046 | 0.843 | 0.857 |
| soap_small.xml (26734 bytes) | 0.58 | 0.563 | 1.221 | 3.825 | 3.068 | 1.346 | 1.137 |
| cd.xml (30831 bytes) | 0.569 | 0.549 | 1.205 | 5.092 | 4.376 | 1.211 | 1.362 |
| bioinfo.xml (34759 bytes) | 0.529 | 0.517 | 1.068 | 4.126 | 4.366 | 1.188 | 1.33 |
| soap_mid.xml (134334 bytes) | 2.885 | 2.804 | 6.028 | 32.846 | 21.896 | 6.668 | 5.828 |

Table 3. Big XML files

| File name/size | VTD-XML (ms) | VTD-XML buffer reuse (ms) | SAX (ms) | DOM (ms) | DOM deferred (ms) | Piccolo (ms) | Pull (ms) |
|---|---|---|---|---|---|---|---|
| po1m.xml (1.01 MB) | 25.71 | 20.08 | 36.4 | 186.16 | 115.67 | 47.62 | 63.27 |
| soap.xml (2.59 MB) | 64.7 | 57.27 | 123.18 | 502.32 | 380.74 | 134.83 | 93.96 |
| bioinfo_big.xml (4.27 MB) | 70.1 | 73.9 | 131.8 | 629.1 | 442.02 | 151.62 | 177.64 |
| SUAS.xml (13.13 MB) | 359.91 | 315.24 | 665.36 | 1961.01 | 1296.08 | 820.38 | 637.72 |
| address.xml (15.24 MB) | 315.06 | 276 | 658.56 | 2158.5 | 1822.22 | 617.48 | 684.57 |

Memory usage comparisons

Since SAX and Pull do not build data structures in memory, the meaningful comparison is between DOM (both with and without deferred node expansion) and VTD-XML. To this end, this section benchmarks the multiplying factor, the ratio of memory usage to document size, for large files, where memory usage is a particular concern.

Figure 4.
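The multiplying factor described above is a simple ratio, and a rough way to estimate it for a DOM tree can be sketched as follows. This is only an illustration, not the benchmark code behind the article's figures: the class name `MultiplyingFactorSketch`, the helper methods, and the generated stand-in document are all assumptions, and heap measurements taken through `Runtime` are approximate at best.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical sketch: estimates memory usage / document size for a DOM tree.
class MultiplyingFactorSketch {

    /** Ratio of in-memory size to document size, both in bytes. */
    static double multiplyingFactor(long memoryBytes, long documentBytes) {
        if (documentBytes <= 0) {
            throw new IllegalArgumentException("document size must be positive");
        }
        return (double) memoryBytes / documentBytes;
    }

    /** Rough estimate of the heap currently in use; System.gc() is only a hint. */
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        System.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) throws Exception {
        // Generated stand-in document (an assumption); the article's test files are not used here.
        StringBuilder sb = new StringBuilder("<orders>");
        for (int i = 0; i < 20000; i++) {
            sb.append("<order id=\"").append(i).append("\"><qty>1</qty></order>");
        }
        sb.append("</orders>");
        byte[] xml = sb.toString().getBytes(StandardCharsets.UTF_8);

        long before = usedHeap();
        // Default JAXP configuration; which DOM variant it builds depends on the parser in use.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml));
        long after = usedHeap();

        long estimate = Math.max(0, after - before);
        System.out.printf("document: %d bytes, DOM estimate: %d bytes, factor: %.1f%n",
                xml.length, estimate, multiplyingFactor(estimate, xml.length));
        // Touch the tree afterwards so it stays reachable during the measurement.
        System.out.println("root element: " + doc.getDocumentElement().getNodeName());
    }
}
```

Because garbage collection timing varies between runs and JVMs, a number obtained this way should be read as an order-of-magnitude estimate rather than a precise measurement.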

Navigation performance comparisons

This section presents the navigation latency of VTD-XML and DOM (without deferred node expansion), that is, the time it takes to visit every node in the document. To traverse the nodes, the DOM code relies on the NodeIterator interface, while the VTD-XML code calls the selectElement(...) and iterate(...) methods of the AutoPilot class. As expected, navigation is much faster than parsing. For VTD-XML, the navigation cost is between 15 percent and 30 percent of the parsing cost; for DOM, the ratio is 5 percent to 7 percent. This does not mean that VTD-XML navigates slower than DOM: the difference between the two ratios is the result of VTD-XML's far superior parsing performance, which makes the same navigation time a larger fraction of a much smaller parsing time.
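The DOM side of such a traversal can be sketched with the standard DOM Level 2 Traversal API that ships with the JDK. The class name `DomNavigationSketch`, its helper methods, and the tiny inline document are assumptions made for illustration; this is not the article's benchmark harness, and the VTD-XML counterpart is omitted because it requires the VTD-XML library.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;

// Hypothetical sketch: visits every element of a DOM tree with NodeIterator.
class DomNavigationSketch {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    /** Walks the whole document in document order and counts the element nodes. */
    static int countElements(Document doc) {
        // Assumes the DOM implementation supports the Traversal module,
        // as the JDK's built-in implementation does.
        DocumentTraversal traversal = (DocumentTraversal) doc;
        NodeIterator it = traversal.createNodeIterator(
                doc, NodeFilter.SHOW_ELEMENT, null, true);
        int count = 0;
        while (it.nextNode() != null) {
            count++;
        }
        it.detach();
        return count;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<catalog><cd><title>A</title></cd><cd><title>B</title></cd></catalog>";

        long t0 = System.nanoTime();
        Document doc = parse(xml);
        long t1 = System.nanoTime();
        int elements = countElements(doc);
        long t2 = System.nanoTime();

        // Single-shot timings are noisy; a real benchmark would warm up and average many runs.
        System.out.printf("elements: %d, parse: %d ns, navigate: %d ns%n",
                elements, t1 - t0, t2 - t1);
    }
}
```

Dividing the navigation time by the parsing time gives the kind of ratio quoted above, although a single unwarmed run on a tiny document will not reproduce the article's percentages.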

Table 4. Small files

| File name/size | VTD-XML (ms) | DOM (ms) |
|---|---|---|
| soap2.xml (1727 bytes) | 0.00671 | 0.00676 |
| nav_48_0.xml (4608 bytes) | 0.028 | 0.0155 |
| cd_catalog.xml (5035 bytes) | 0.0388 | 0.0385 |
| nav_63_0.xml (6848 bytes) | 0.0431 | 0.0238 |
| nav_78_0.xml (6920 bytes) | 0.043 | 0.0244 |

Table 5. Mid-sized files

| File name/size | VTD-XML (ms) | DOM (ms) |
|---|---|---|
| nav_50_0.xml (10304 bytes) | 0.063 | 0.034 |
| officeOrder.xml (10591 bytes) | 0.0788 | 0.051 |
| form.xml (15845 bytes) | 0.065 | 0.046 |
| book.xml (22996 bytes) | 0.149 | 0.144 |
| soap_small.xml (26734 bytes) | 0.225 | 0.193 |
| cd.xml (30831 bytes) | 0.226 | 0.3 |
| bioinfo.xml (34759 bytes) | 0.236 | 0.178 |
| soap_mid.xml (134334 bytes) | 1.61 | 1.151 |

Table 6. Large files

| File name/size | VTD-XML (ms) | DOM (ms) |
|---|---|---|
| po1m.xml (1.01 MB) | 11.19 | 10.84 |
| soap.xml (2.59 MB) | 32.84 | 35.44 |
| bioinfo_big.xml (4.27 MB) | 30.43 | 38.26 |
| SUAS.xml (13.13 MB) | 21.43 | 21.82 |
| address.xml (15.24 MB) | 132.18 | |


Result analysis

In Dennis Sosnoski's JavaWorld article from four years ago, Piccolo was chosen as the overall winner among many SAX implementations. This has changed: the latest Xerces SAX parser has taken over as the best-performing SAX parser. This article's test results also indicate that, compared with Xerces SAX, XPP3 turned in a robust performance and wasn't far behind.

Also interesting is the finding that when documents are small (less than 10 KB), DOM parsing performance doesn't trail SAX by as much as it does when documents are big. For small XML files, DOM's deferred node expansion results in slower parsing than DOM with full node expansion.
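For readers who want to compare the two DOM modes themselves, the following sketch toggles deferred node expansion through the Xerces feature `http://apache.org/xml/features/dom/defer-node-expansion`. It assumes a Xerces-based JAXP parser (such as the one bundled with the JDK); the class name `DeferredExpansionSketch` and the small generated document are hypothetical, and the single-shot timings are only indicative.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;

// Hypothetical sketch: parses the same document with and without deferred node expansion.
class DeferredExpansionSketch {

    // Xerces-specific feature identifier; other parsers may not recognize it.
    static final String DEFER_FEATURE =
            "http://apache.org/xml/features/dom/defer-node-expansion";

    static Document parse(String xml, boolean deferred) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        try {
            factory.setFeature(DEFER_FEATURE, deferred);
        } catch (ParserConfigurationException e) {
            // Assumption not met: fall back to whatever the parser does by default.
            System.err.println("Feature not supported, using parser default: " + e.getMessage());
        }
        return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        // Small generated stand-in document (an assumption), roughly in the "small file" range.
        StringBuilder sb = new StringBuilder("<items>");
        for (int i = 0; i < 200; i++) {
            sb.append("<item id=\"").append(i).append("\">value</item>");
        }
        sb.append("</items>");
        String xml = sb.toString();

        for (boolean deferred : new boolean[] {true, false}) {
            long start = System.nanoTime();
            Document doc = parse(xml, deferred);
            long elapsed = System.nanoTime() - start;
            System.out.printf("deferred=%b, root=%s, parse time=%d ns%n",
                    deferred, doc.getDocumentElement().getNodeName(), elapsed);
        }
    }
}
```

A meaningful comparison would repeat each parse many times after a warm-up phase, since JIT compilation and class loading dominate the first few runs.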

Yet VTD-XML's performance so dominates the other parsers that it is in a class by itself, and the real comparison is between VTD-XML with and without buffer reuse. The significantly better memory usage means VTD-XML can be used to process large XML files, and the performance benefit applies broadly to XML of all sizes.

Conclusion

VTD-XML is a new, next-generation XML parser that overcomes many of the technical issues surrounding DOM and SAX. The combination of VTD-XML's high performance and low memory usage has a few interesting implications. First, given how cheap DRAM (dynamic random access memory) chips are, basing application development on a SAX parser offers few incentives unless there is absolutely no way to hold the XML document in memory. Second, applications become simple to write with VTD-XML, and they will also be much faster than previously thought possible. Because VTD-XML covers XML files from small to big, selecting the right processing model is now a simple process, and developers no longer have to switch between the two drastically different parsing styles of DOM and SAX. Last but not least, VTD-XML should offer convincing answers to a few long-standing complaints about XML. For example, VTD-XML has a built-in native XML indexing capability that should, once and for all, change the perception that XML is slow. Performance-wise, VTD-XML should mark the beginning of the "10x XML" era. More importantly, the next stop for VTD-XML, just around the corner, is the "100x XML" era.

Jimmy Zhang is the founder of XimpleWare, a provider of high-performance XML processing solutions. He has worked in the fields of electronic design automation and voice-over IP with numerous Silicon Valley tech companies. He graduated from the University of California, Berkeley, with both an MS and a BS from the department of EECS.

This story, "Simplify XML processing with VTD-XML" was originally published by JavaWorld.


Copyright © 2006 IDG Communications, Inc.
