XML News from Monday, October 16, 2006

XimpleWare has released VTD-XML 1.7, a free (GPL) non-extractive Java library for processing XML that supports XPath. This appears to be an example of what Sam Wilmot calls "in situ parsing". In other words, rather than creating objects representing the content of an XML document, VTD-XML just passes pointers into the actual, real XML. (These are the abstract pointers of your data structures textbook, not C-style addresses in memory. In this case the pointers are int indexes into the file.) You don't even need to hold the document in memory. It can remain on disk. This should improve speed and memory usage. Current tree models typically require at least 3 times the size of the actual document, and often more. Using a model based on indexes into one big array might allow these to reduce their requirements to twice the size of the original document or even less. VTD-XML claims 1.3 times, but I haven't verified that.
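To make the index-based idea concrete, here is a minimal sketch of non-extractive tokenizing. It is not VTD-XML's API or its actual record layout: the class name InSituSketch, the packing of offset and length into one long, and the toy scanner (which ignores comments, CDATA sections, and attributes) are all assumptions made for illustration. The point is only that each element name is recorded as a position in the original bytes, and a String is built only when someone asks for it.

```java
import java.nio.charset.StandardCharsets;
import java.util.stream.LongStream;

// Hypothetical illustration of "in situ" tokens; not VTD-XML's real design.
final class InSituSketch {

    // Pack an offset (high 32 bits) and a length (low 32 bits) into one long.
    static long pack(int offset, int length) {
        return ((long) offset << 32) | (length & 0xFFFFFFFFL);
    }

    static int offset(long token) { return (int) (token >>> 32); }

    static int length(long token) { return (int) token; }

    private static boolean isSpace(byte b) {
        return b == ' ' || b == '\t' || b == '\r' || b == '\n';
    }

    // Record where each start-tag name sits in doc, without copying any text.
    // Toy scanner: skips end tags, processing instructions, and declarations.
    static long[] elementNames(byte[] doc) {
        LongStream.Builder tokens = LongStream.builder();
        for (int i = 0; i + 1 < doc.length; i++) {
            if (doc[i] != '<') continue;
            byte next = doc[i + 1];
            if (next == '/' || next == '?' || next == '!') continue;
            int start = i + 1;
            int end = start;
            while (end < doc.length && doc[end] != '>' && doc[end] != '/'
                    && !isSpace(doc[end])) {
                end++;
            }
            tokens.add(pack(start, end - start));
            i = end; // resume scanning after the name
        }
        return tokens.build().toArray();
    }

    // Materialize a token as a String only on demand.
    static String text(byte[] doc, long token) {
        return new String(doc, offset(token), length(token), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] doc = "<order id=\"1\"><item/><note>hi</note></order>"
                .getBytes(StandardCharsets.UTF_8);
        for (long token : elementNames(doc)) {
            System.out.println(offset(token) + "+" + length(token)
                    + " " + text(doc, token));
        }
    }
}
```

Each token here costs eight bytes no matter how long the name is, which is the kind of saving an array-of-indexes model is after. A real implementation would also need depth and token-type information, and would have to deal with everything this scanner skips.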

However, VTD-XML currently only supports the built-in entity references (&quot; &amp; &apos; &gt; &lt;). There are some other limits. Element names are limited to 2048 characters. Documents can't be much bigger than a billion characters, so SAX (or XOM) is still needed for really huge documents. There's also a maximum depth to the document, though exactly what it is isn't specified. All this means VTD-XML is not a conformant XML parser. Given this, comparisons to other parsers are unfair and misleading. I've seen many products that outperform real XML parsers by subsetting XML and cutting out the hard parts. It's often the last 10% that kills the performance. :-( The other question I have for anything claiming these speed gains is whether it correctly implements well-formedness testing, including the internal DTD subset. Will VTD-XML correctly report all malformed documents as malformed? Has it been tested against the W3C XML conformance test suite? I'm not sure.
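For a sense of what conformance demands here, the following sketch probes a few cases with the XML parser bundled in the JDK (javax.xml.parsers), not with VTD-XML. The class name ConformanceProbe and the helper rootText are hypothetical, and three hand-picked documents are in no way a substitute for the W3C conformance test suite. They only show the behavior a conformant parser must have: expand an entity declared in the internal DTD subset, and reject both an undeclared entity reference and mismatched tags.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical probe using the JDK's own parser; VTD-XML is not involved.
final class ConformanceProbe {

    // Returns the root element's text content, or null if the parser
    // rejects the document (for example, because it is malformed).
    static String rootText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (SAXException rejected) {
            return null;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Entity declared in the internal DTD subset: must be expanded.
        String declared = "<!DOCTYPE r [<!ENTITY e \"expanded\">]><r>&e;</r>";
        // No declaration for the entity: a well-formedness error.
        String undeclared = "<r>&e;</r>";
        // Mismatched tags: also a well-formedness error.
        String mismatched = "<r><a></r>";

        System.out.println("declared entity   -> " + rootText(declared));
        System.out.println("undeclared entity -> " + rootText(undeclared));
        System.out.println("mismatched tags   -> " + rootText(mismatched));
    }
}
```

The JDK parser's default error handler may also write a fatal-error message to standard error for the two rejected documents. A parser that handles only the five built-in references cannot pass the first case, which is why speed comparisons against parsers that do are hard to read fairly.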