Keywords: SAX, XML, test, conformance
Elliotte Rusty Harold is an adjunct professor of computer science at Polytechnic University in Brooklyn. He's the author of numerous books on XML including the XML 1.1 Bible, XML in a Nutshell, Effective XML, and Processing XML with Java. Most recently he has been working on XOM, the only tree-based API for XML that absolutely guarantees well-formedness.
Although SAX, the Simple API for XML,
is a broadly, almost universally implemented standard among Java parsers,
many SAX parsers have serious bugs. The lack of a complete SAX conformance test
suite has been a severe hindrance to interoperability. For example, about half
of SAX parsers call endDocument()
even after reporting a fatal error, while
the other half don’t. Existing XML test suites mostly focus on whether the
parser correctly answers boolean questions of well-formedness or validity,
while ignoring the much more complex questions of whether the parser correctly
reports document content in the correct order. Indeed the XML specification
is mostly silent on exactly which parts of the document a parser is required to report.
Not surprisingly this has led to a number of inconsistencies between parsers as well
as outright bugs in more than a few implementations.
This paper demonstrates a conformance suite written in Java that tests parsers which claim to implement the SAX API. The framework asks the parser to read a collection of input documents and then logs the methods the parser invokes and their arguments. This log takes the form of an XML document that can be compared against the expected results.
The documents in the test set are derived from the W3C XML conformance test suite. The software includes a framework for testing parsers against this document collection and measuring their conformance to both the core and optional parts of both XML and SAX. Conformance results for major parsers including Xerces, Crimson, and Piccolo are reported. A number of areas in which deficiencies in the SAX specification have led to varying parser behavior are identified.
Existing test suites
Results for different parsers
Is endDocument invoked after a fatal error?
What kinds of exceptions can parsers throw?
How much data is passed after a fatal error?
What is the type of enumerated attributes?
XML 1.1 support
Problems with Specific Parsers
Future Directions for more research
Test Suite Availability
There are several existing test suites for XML and SAX. However, they have limited coverage and failed to expose many bugs I noticed during the development of XOM [XOM].
There is an embryonic, semi-official test suite for SAX [Arnold 2001]. However, this includes only a few dozen JUnit-based tests for the most basic features of SAX2. There are also a couple of thousand SAX 1 tests based on a draft version of the OASIS/NIST XML test suite. However, these perform limited output testing and leave many holes in coverage. The latest version is 0.2 from November 12, 2001. Work appears to have been abandoned.
The most comprehensive XML test suite is the W3C's XML test suite [W3C XML Group], which bundles tests gathered from a variety of sources including James Clark, the OASIS/NIST XML test suite, Sun, IBM, Henry S. Thompson, and others. This offers the broadest coverage of a range of XML documents. However, it focuses on testing binary decisions. Is the document well-formed or not? Is the document valid or not? It provides a limited number of output tests, based on the Second XML Canonical Form [Sun].
To properly test a SAX parser, it is necessary to verify that it reports the right events with the right content in the right order. This exceeds the scope of the XML Test Suite, which is API independent. However, because the W3C test suite is so broad, it became the primary source of input data for this new test suite. In order to pass the tests, a parser must be able to correctly process all the documents in the W3C XML test suite. The difference is not in the documents themselves. It is in the scope of the output. Passing the W3C test suite primarily requires correctly identifying well-formed and malformed, valid and invalid documents. My test suite requires not only this, but also the reporting of the right content in the right order at the right time using the right methods.
The W3C test suite is divided into numerous test cases stored in several directories, mostly organized by the submitter. The test cases are further subdivided into well-formed and malformed, valid and invalid, namespace aware and non-namespace aware, and external entity using and self-contained test cases. The master file lists a typical case like this:
<TEST TYPE="invalid" URI="invalid/attr06.xml" ID="attr06" SECTIONS="3.3.1"> Tests the "Name Token" VC for the NMTOKENS attribute type.</TEST>
This says that the test case document can be found at the relative URL "invalid/attr06.xml", that the document at that URL is invalid (but well-formed); that it tests section 3.3.1 of the XML specification, and more specifically it tests the name token validity constraint for the NMTOKENS attribute type. Here I'm not so interested in testing whether the document is valid or invalid as I am in testing that all the content from that document is properly reported through SAX.
To compare the output of different parsers,
it's necessary to place the output
in a standard format that can be easily diffed. It seemed natural to use XML for this purpose.
A single class that implemented ContentHandler, EntityResolver, ErrorHandler, and
DTDHandler--the four required SAX interfaces--was written that logged all its calls to an XML document.
(More precisely, it created a XOM
Document object which was later serialized.)
However, the problem is thornier than it may appear at first.
It is necessary to produce well-formed output
even for malformed input. We must not assume that the SAX parser will detect such bugs
because that would require assuming that the parser is non-buggy,
precisely what we're endeavoring to determine.
For instance, we cannot assume element names will not contain white space or
PCDATA will not contain nulls.
For example, suppose we begin with this test document (test case ibm-valid-P10-ibm10v02.xml):
<?xml version="1.0"?>
<!DOCTYPE student [
  <!ELEMENT student (#PCDATA)>
  <!ATTLIST student
    first CDATA #REQUIRED
    middle CDATA #IMPLIED
    last CDATA #REQUIRED >
  <!ENTITY myfirst "Snow">
  <!ENTITY mymiddle "Y">
  <!ENTITY mylast ''>
]>
<!-- testing AttValue with empty char inside single quote -->
<student first='' last=''>My Name is Snow &mylast; Man. </student>
When parsed it produces this output:
<?xml version="1.0" encoding="UTF-8"?>
<ConformanceResults>
  <startDocument/>
  <startElement>
    <namespaceURI/>
    <qualifiedName>student</qualifiedName>
    <attributes>
      <attribute>
        <namespaceURI/>
        <localName>first</localName>
        <qualifiedName>first</qualifiedName>
        <value/>
        <type>CDATA</type>
      </attribute>
      <attribute>
        <namespaceURI/>
        <localName>last</localName>
        <qualifiedName>last</qualifiedName>
        <value/>
        <type>CDATA</type>
      </attribute>
    </attributes>
  </startElement>
  <char>M</char> <char>y</char> <char>\s</char>
  <char>N</char> <char>a</char> <char>m</char> <char>e</char> <char>\s</char>
  <char>i</char> <char>s</char> <char>\s</char>
  <char>S</char> <char>n</char> <char>o</char> <char>w</char> <char>\s</char> <char>\s</char>
  <char>M</char> <char>a</char> <char>n</char> <char>.</char> <char>\s</char>
  <endElement>
    <namespaceURI/>
    <qualifiedName>student</qualifiedName>
  </endElement>
  <endDocument/>
</ConformanceResults>
The general format could have the following DTD:
<!ELEMENT locator EMPTY>
<!ELEMENT startDocument EMPTY>
<!ELEMENT endDocument EMPTY>
<!ELEMENT fatalError EMPTY>
<!ELEMENT char (#PCDATA)>
<!ELEMENT ignorable (#PCDATA)>
<!ELEMENT localName (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT systemID (#PCDATA)>
<!ELEMENT qualifiedName (#PCDATA)>
<!ELEMENT namespaceURI (#PCDATA)>
<!ELEMENT value (#PCDATA)>
<!ELEMENT type (#PCDATA)>
<!ELEMENT ConformanceResults (startDocument | startElement | endElement | char | ignorable | notation | unparsedEntity | resolveEntity | endDocument | fatalError | processingInstruction | locator )*>
<!ELEMENT attributes (attribute)*>
<!ELEMENT attribute (namespaceURI?, localName?, qualifiedName?, value?, type?)>
<!ELEMENT startElement (namespaceURI?, localName?, qualifiedName?, attributes?)>
<!ELEMENT endElement (namespaceURI?, localName?, qualifiedName?)>
<!ELEMENT notation (name?, systemID?)>
<!ELEMENT unparsedEntity (name?, publicID?, systemID?, notation?)>
<!ELEMENT bug (#PCDATA)>
<!ATTLIST bug reason (CDATA) #IMPLIED>
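To make the logging idea concrete, here is a minimal sketch of a handler that records a few ContentHandler callbacks in roughly this format. It is not the framework's actual code: the real logger built a XOM Document, whereas this sketch appends to a StringBuilder, skips attributes and the other interfaces, and uses an escaping scheme (the class name EventLogSketch, the [U+XXXX] markers) that is purely an assumption for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical logger: records ContentHandler callbacks as XML text.
// Attributes and the other SAX interfaces are omitted for brevity.
class EventLogSketch extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();

    // Assumed escaping: \s makes a space visible, and anything that could
    // make the log itself malformed is replaced by a visible marker.
    static String escape(String text) {
        StringBuilder escaped = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c == ' ') escaped.append("\\s");
            else if (c == '<') escaped.append("&lt;");
            else if (c == '>') escaped.append("&gt;");
            else if (c == '&') escaped.append("&amp;");
            else if (c < 0x20 || Character.isSurrogate(c) || c >= 0xFFFE)
                escaped.append("[U+").append(String.format("%04X", (int) c)).append(']');
            else escaped.append(c);
        }
        return escaped.toString();
    }

    // Writes <name/> for null or empty values, <name>value</name> otherwise.
    private void element(String name, String value) {
        if (value == null || value.isEmpty()) {
            out.append('<').append(name).append("/>");
        } else {
            out.append('<').append(name).append('>').append(escape(value))
               .append("</").append(name).append('>');
        }
    }

    @Override public void startDocument() { out.append("<startDocument/>"); }
    @Override public void endDocument() { out.append("<endDocument/>"); }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        out.append("<startElement>");
        element("namespaceURI", uri);
        element("localName", local);
        element("qualifiedName", qName);
        out.append("</startElement>");
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        out.append("<endElement>");
        element("namespaceURI", uri);
        element("localName", local);
        element("qualifiedName", qName);
        out.append("</endElement>");
    }

    // One <char> element per character, however the parser chunks the text.
    @Override
    public void characters(char[] ch, int start, int length) {
        for (int i = start; i < start + length; i++) {
            element("char", String.valueOf(ch[i]));
        }
    }

    String result() {
        return "<ConformanceResults>" + out + "</ConformanceResults>";
    }

    // Parses the given document with the JDK's default parser and returns the log.
    static String logEvents(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        EventLogSketch handler = new EventLogSketch();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.result();
    }
}
```

Because every character gets its own element, two parsers that split the same text into different characters() calls still produce identical logs.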
This output format is designed to avoid some common problems:
The actual generation also introduces some issues:
characters. Initially, I combined these into a single element in the output using the well-known algorithm [Harold 2002]. However, that proved difficult to compare by eye and caused problems when different parsers identified ignorable white space in different places. Consequently I decided to report each char separately.
java.util.SortedMap [SortedMap] where the keys were a concatenation of the local name, qualified name, and namespace URI. The null (\u0000) was used as a concatenation character because this would never show up in any legal data.
endPrefixMapping() order is also indeterminate in SAX. Once again an arbitrary but reproducible sorting was chosen using a
SortedMap. However, this time the map had to be maintained across several method calls and only flushed when a
startElement() was seen or the next call after an
startElement(). Again, I sorted by lexical order of the names.
http://xml.org/sax/features/external-general-entities is set to true.
http://xml.org/sax/features/external-parameter-entities is set to true. Theoretically, a parser does not have to support these two features. In practice, all eight parsers tested did support them.
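The attribute sorting described above can be sketched as follows. This is a hedged illustration, not the framework's code: the class and method names are hypothetical, and it returns only the qualified names in their sorted order, where a real logger would emit the full attribute record.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import org.xml.sax.Attributes;

// Hypothetical helper: gives attributes an arbitrary but reproducible order.
class AttributeOrderSketch {
    // NUL cannot appear in legal XML names or URIs, so it is a safe separator.
    private static final char SEPARATOR = '\u0000';

    static List<String> sortedQNames(Attributes atts) {
        // Key: local name, qualified name, namespace URI, joined by NUL.
        // Value: the attribute's index in the parser-supplied Attributes object.
        SortedMap<String, Integer> byKey = new TreeMap<>();
        for (int i = 0; i < atts.getLength(); i++) {
            String key = atts.getLocalName(i) + SEPARATOR
                       + atts.getQName(i) + SEPARATOR
                       + atts.getURI(i);
            byKey.put(key, i);
        }
        // TreeMap iterates in key order, which fixes the reporting order.
        List<String> ordered = new ArrayList<>();
        for (int index : byKey.values()) {
            ordered.add(atts.getQName(index));
        }
        return ordered;
    }
}
```

Whatever order a parser happens to use for its Attributes object, the log then lists the attributes identically.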
The canonical output a parser was supposed to report was generated via a bootstrap process. I began by running the parser that experience had led me to expect was most often correct, Xerces2-J 2.6.1 [Xerces J], through the test harness. Then I compared its output to the output of seven other parsers. Where a difference was found, I manually inspected the reason for the difference to determine, based on the XML 1.0 and SAX specifications, which parser was correct. If Xerces proved incorrect, then the expected result was modified to use the correct result. More than once, both parsers were arguably correct. In these cases the comparison code needed to be adjusted to allow for the differences.
After repeating this process several times (and reducing the bugs in the test framework) it became apparent that while Xerces made many mistakes, it had one very nice property: almost all the mistakes were predictable and reproducible. Thus they could be fixed automatically. Specifically, the following changes needed to be made in Xerces' output:
endDocument elements, especially after fatal errors.
XMLReader when generating the expected results. Xerces has several bugs that are only exposed when the parser is reused. These have now been fixed in CVS as a result of this author's report [Harold 2004].
In some cases these fixes duplicate each other. For instance,
not reusing the
XMLReader avoids most problems with extraneous characters.
However, that's not a problem since the fixes are all careful to check that the problem
exists in a particular case before fixing it.
These could all be fixed automatically. However, two cases remained which needed to be fixed by hand:
Once the bootstrapping process was complete, it became possible to compare the results for different parsers. Eight currently available XML parsers were tested:
IBM also produces XML4J. However, this is just a rebranded Xerces.
Three of these (Saxon, GNU JAXP, and DOM4J) are not independent. They are all descendants of David Megginson's Ælfred parser from Microstar [Megginson 1998] .
Currently only Xerces-J and the Oracle XML parser appear to be actively developed. The other six seem to have been abandoned. The lack of a competitive market for SAX parsers in Java came as something of a surprise and is a cause for concern. To a large extent, most users seem satisfied with Xerces; and there does not appear to be a large demand for alternate parsers. The C world has a much broader choice with at least four major parsers that implement the SAX API (libxml2 [Veillard], expat [expat], Oracle XML Parser for C++ [Oracle C], and Xerces-C++ [Xerces C]).
There were several particularly common errors that were exhibited many times by multiple parsers. These are definitely places anyone writing or contemplating writing a parser should watch carefully.
The most common error, though perhaps an arguable one, was failing to invoke endDocument()
for a malformed document. The API documentation for the
endDocument() method states, "The SAX parser will invoke this method only once, and it
will be the last method invoked during the parse. The parser shall not invoke this method
until it has either abandoned parsing (because of an unrecoverable error) or reached
the end of input."
One could wish the language were slightly clearer
(I would prefer it to say "exactly once" rather than "only once") but it still
indicates that endDocument() should be called even in the event of a fatal error.
On the other hand, the API documentation for fatalError() states,
"The application must assume that the document is unusable after the parser
has invoked this method, and should continue (if at all) only
for the sake of collecting additional error messages: in fact, SAX parsers
are free to stop reporting any other events once this method has been invoked."
This implies that it is acceptable not to call endDocument() after a fatal error.
Given the apparent inconsistency in the spec, what do the authors have to say?
David Brownell, the second maintainer of the SAX specification and the probable author of this
statement, was explicit that endDocument() must
always be called. According to Brownell, "If it's not, that's a SAX conformance bug.
Sadly: last I looked, it wasn't an uncommon bug to omit
calling it in the 'abandoned parsing' case. That makes
it tough to use endDocument() to do things like clean up
That seems clear enough. However, David Megginson, the original maintainer of the SAX specification, disagrees:
My original intention at the start of SAX development was that endDocument would not necessarily be called once the parser was in an error state, but the documentation might not have been clear and David Brownell might have clarified things the other way after he took over. [Megginson 2004]
Furthermore, Megginson has recently announced plans to revise this
as part of the final release of SAX 2.0.1 [Megginson 2004 2],
and as I write these
words, it's being hashed out one more time on the sax-devel mailing list.
The situation is at best unclear. My opinion is that
endDocument() must be called,
and parsers certainly can and should do this.
However, reasonable people may disagree, with support from both the spec and the maintainers.
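Checking how a given parser behaves here is straightforward. The following is a minimal sketch, assuming the JDK's default JAXP parser and a hypothetical EndDocumentProbe class; it is not taken from the test suite itself. The handler swallows the fatal error rather than rethrowing it, so whatever happens afterwards is the parser's own decision.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical probe: does the parser call endDocument() after a fatal error?
class EndDocumentProbe extends DefaultHandler {
    boolean fatalErrorSeen;
    boolean endDocumentSeen;

    @Override
    public void fatalError(SAXParseException e) {
        fatalErrorSeen = true; // record only; do not rethrow
    }

    @Override
    public void endDocument() {
        endDocumentSeen = true;
    }

    static EndDocumentProbe probe(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        EndDocumentProbe handler = new EndDocumentProbe();
        reader.setContentHandler(handler);
        reader.setErrorHandler(handler);
        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException expected) {
            // parse() normally ends with a SAXException for malformed input
        }
        return handler;
    }

    public static void main(String[] args) throws Exception {
        EndDocumentProbe result = probe("<doc><a></doc>"); // mismatched tags
        System.out.println("fatalError reported: " + result.fatalErrorSeen);
        System.out.println("endDocument called:  " + result.endDocumentSeen);
    }
}
```

The second line of output is the interesting one, and it depends entirely on which parser the factory returns.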
The SAX documentation is clear that on encountering a well-formedness error, the
parse() method must throw a SAXException.
"If the application needs to pass through other types of exceptions,
it must wrap those exceptions in a SAXException or an exception derived from a SAXException."
It is also explicitly allowed to throw
IOExceptions for an I/O error.
However, it is clearly wrong for the parse() method
to throw a
RuntimeException. Nonetheless many parsers threw
various runtime exceptions when encountering
malformed documents. Piccolo was the worst offender here,
but even the relatively well-behaved Xerces had a few problems in this area.
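A harness can separate the permitted outcomes from the forbidden one with a few catch clauses. This is only a sketch under the assumption that the parser comes from the JDK's SAXParserFactory; the class name ParseOutcomeSketch and the result strings are invented for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical classifier for the way parse() terminates.
class ParseOutcomeSketch {
    static String classify(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // DefaultHandler rethrows fatal errors, so they surface from parse().
        reader.setErrorHandler(new DefaultHandler());
        try {
            reader.parse(new InputSource(new StringReader(xml)));
            return "completed";
        } catch (SAXException e) {
            return "SAXException";      // allowed: the normal report of malformedness
        } catch (IOException e) {
            return "IOException";       // allowed: genuine I/O failure
        } catch (RuntimeException e) {
            // not allowed: the parser leaked an internal failure
            return "RuntimeException: " + e.getClass().getName();
        }
    }
}
```

Any result beginning with "RuntimeException" marks a conformance failure of the kind described above.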
Another common source of errors is reporting too much data after a fatal error. The XML spec says, "After encountering a fatal error, the processor MAY continue processing the data to search for further errors and MAY report such errors to the application. In order to support correction of errors, the processor MAY make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor MUST NOT continue normal processing (i.e., it MUST NOT continue to pass character data and information about the document's logical structure to the application in the normal way)." [Bray 2004] However, parsers often mismark the bounds of a fatal error. For example, consider James Clark's test not-wf-sa-027.xml [Clark27]:
<doc> <!-- abc </doc>
Under some circumstances, Xerces 2.6.1 passes "abc </doc>" to the characters() method before reporting a fatal error [Harold 2004]. This bug has been fixed in CVS.
Or consider James Clark's test not-wf/sa/045.xml [Clark45] :
<doc> <a/ </doc>
before it realizes the closing greater than sign is missing.
Another common, though minor, error was reporting the wrong type for attributes declared with enumerations. The SAX spec requires these to be reported with the type NMTOKEN. However, several parsers use the non-standard ENUMERATION type instead.
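To see which type a particular parser reports, it is enough to parse a small document with an enumerated attribute and look at Attributes.getType(). The sketch below assumes the JDK's default parser; the class name and the sample document are hypothetical, and the output is deliberately not stated because it varies by parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical probe: which type string is reported for an enumerated attribute?
class EnumeratedTypeProbe {
    static String reportedType(String xml, final String attributeQName) throws Exception {
        final String[] seen = new String[1];
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                String type = atts.getType(attributeQName); // null if the attribute is absent
                if (type != null) {
                    seen[0] = type;
                }
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return seen[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE doc ["
                   + "<!ELEMENT doc EMPTY>"
                   + "<!ATTLIST doc color (red|green) #IMPLIED>"
                   + "]><doc color='red'/>";
        System.out.println("reported type: " + reportedType(xml, "color"));
    }
}
```

A parser that follows the SAX convention described above prints NMTOKEN; one that prints ENUMERATION shows the deviation.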
About 5% of the test cases cover XML 1.1. Except for Xerces none of the parsers explicitly support this. (Some of the parsers almost accidentally pass some of the 1.1 tests though.)
What follows are bugs uncovered in individual parsers. Most of these were exposed in multiple tests. Conformance ranged from a low of about 18% to a high just over 90%. Most products could radically improve their scores just by fixing one or two key problems that account for most of their failures. There's quite a bit of low hanging fruit here waiting to be picked off.
Xerces is the most popular and broadly used parser I tested.
It will become the default parser in Java 1.5.
Xerces 2.6.2 achieves a conformance score of only 91%, surprisingly low,
especially given that it served as the base for testing other parsers.
(The score's even worse, an abysmal 26%, if you require that it call
endDocument() following a well-formedness error.)
However, almost all of the Xerces problems were related to a few easily worked around bugs.
In order of frequency, these were:
endDocument after a fatal error.
characters (xmltest/not-wf/sa/027.xml) [Harold 2004]
Although Xerces posts very low scores, most of these are for problems that can be worked around if you know you're using Xerces. Several of the bugs in Xerces have now been fixed in CVS, and should no longer be problems as of version 2.7.0. It is probably the parser of choice for most applications.
Besides Xerces, the Oracle XML Parser for Java is the only SAX parser written in Java
that is currently maintained. Thus it's disappointing that it doesn't do a better job.
It passed only 42% (19% when requiring
endDocument()) of the tests.
Furthermore, unlike Xerces, many of the errors
were XML conformance errors rather than less serious SAX conformance errors.
Notable problems included:
endDocument(), even in well-formed documents, with an empty root element tag (sun/invalid/dtd01.xml sun/invalid/el05.xml)
fatalError() until it's passed most of the content in, including 1.0 illegal characters (see all the IBM 1.1 test cases)
xmlns attributes even when it isn't supposed to
xmlns attributes, in contrast to the SAX spec.
In fairness, I must note that shortly before the deadline for the submission of conference papers, Oracle released version 10 of their parser, which shows signs of being significantly improved. I hope to have updated results covering this new version of the Oracle parser at the conference. However, version 9 is clearly too nonconformant to both SAX and XML to rely on.
Sun has abandoned their home-grown Crimson parser
in favor of the IBM developed Xerces for the next 1.5 release of the Java Development Kit (JDK).
However, Crimson is the parser
bundled with the JDK through version 1.4.2_03 and is still used
by many Java programmers by default.
It passes 88% of the tests (but only 29% if you require endDocument on well-formedness errors).
Failures are mostly similar to Xerces'.
In particular, it also does not call endDocument()
following a fatal error. However, it has a few unique problems as well:
Sun is moving to Xerces; and, given these problems, I see little reason why other programmers shouldn't make the switch as well.
When Yuval Oren first released Piccolo two years ago, I had very high hopes for it. It was a very small, very fast parser that filled an important niche of non-validating but entity resolving parser. It was notable for being built using a formal grammar and the parser generator tools JFlex and BYACC/J rather than a handrolled parser, as most implementers working in Java had done up to that point. However, the initial releases had numerous bugs, and no progress has been made on fixing these since July, 2002. My tests uncovered many more problems I had not previously noticed. These include:
<mixed1></mixed1> calls characters with no text to report, while it does not do so for empty-element tags such as
<mixed1/>. It's not 100% obvious that this is illegal, but it's certainly strange.
The overall conformance rate was only 57%. I cannot at this time recommend Piccolo for serious work, though it might make an interesting "fixer-upper" project if someone wished to begin plugging its holes.
Michael Kay's Ælfred derivative posted the highest overall conformance scores in the tests (over 90%),
until I turned on checking for entity resolution, at which point the scores dropped to 0%.
Such an unbelievably low score makes one question the validity of the tests.
However, on investigation the test proved correct. Saxon's Ælfred calls resolveEntity()
for the document entity as well as for all external entities. However, the SAX specification specifically
prohibits this, "The parser will call this method before opening any
external entity except the top-level document entity." (emphasis added)
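A check for this behavior needs nothing more than a counting EntityResolver and a document with no external entities at all. The following is a sketch with a hypothetical class name, assuming the JDK's default parser; it is not the comparison code used in the test suite.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical probe: counts resolveEntity() calls during a parse.
class ResolveEntityProbe {
    static int resolveCalls(String xml) throws Exception {
        final int[] calls = new int[1];
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(new EntityResolver() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                calls[0]++;
                return null; // let the parser open the entity itself
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return calls[0];
    }
}
```

For a self-contained document such as `<doc/>`, any count above zero means the parser asked the resolver about the document entity itself.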
Besides this, it had two other significant failure modes:
Kay has halted further work on this parser now that
an XML parser is bundled with the JDK.
If anyone is interested in picking this product up again, it would be straightforward to
fix the bugs in character class detection and prevent it from calling resolveEntity()
for the document entity. The problems with well-formedness of entity
replacement text may run deeper in the code base though.
dom4j's Ælfred derivative has the same bug in entity resolution that Saxon's Ælfred exhibited, and consequently scores identically at 0%. However, even when this bug is ignored, this parser performs noticeably worse than Saxon's Ælfred with only 60%. It shared all of Saxon's problems including failure to detect malformed entities used in a well-formed way and allowing unpaired surrogates. However, it also had several new problems:
fatalError() for a malformed document (xmltest/not-wf/sa/050.xml)
I can't see any particular reason to choose this parser over Saxon's Ælfred derivative.
At only 46% conformance, GNU JAXP scored significantly worse than the
other two Ælfred derivatives.
Its problems included most of those of the other two Ælfred derivatives,
with one very important exception: it does not call resolveEntity()
for the document entity.
these tended to be masked by other bugs in GNU JAXP. In addition, it had these problems:
ArrayIndexOutOfBoundsException, rather than
SAXException (not-wf-sa-017 through 019, 024-33)
startDocument() (not-wf-sa-099, not-wf-sa-152, o-p24fail2, o-p39fail5, ibm-not-wf-P02-ibm02n30.xml, ibm-not-wf-P24-ibm24n03.xml)
GNU JAXP has some features the other Ælfred derivatives don't, such as validation and DOM support. However, its low conformance level makes it a very poor choice for basic SAX work. Its rejection of many well-formed documents is particularly horrendous. I really can't recommend this to anyone.
James Clark's XP is the oldest parser tested.
The parser code itself dates to 1998, prior to the advent of SAX2.
However, Hussein Shafie made a few minor fixes and improvements to Clark's original
code, and wrote a SAX 2 interface for the parser.
It performs very well for such an old parser, scoring 87.25%
(though requiring endDocument() events would reduce its score to 25%).
Among others, problems included:
There seems little reason to use this product in production today.
In many ways this experiment was a test of the SAX specification itself. A good specification leaves little room for interpretation and carefully spells out those areas where different implementations may behave differently. How well does SAX meet this criterion? In other words, is it really a testable spec?
With some reasonable assumptions, I think the answer is yes. There is much guaranteed behavior from any conformant SAX parser. However, there are a few tricky areas. Specifically,
startElement() by default, or can they be null? Here, pretty much everyone in the SAX community except the maintainer of the SAX specification agrees. In fact, all parsers I tested always passed the full name for the qName argument, including Brownell's own Ælfred. However, after a significant discussion on the sax-devel mailing list in 2002 [sax-devel], Brownell unilaterally rewrote the SAX spec to reflect his view. (Previously it had been unclear.) Since I'm writing a test suite, I can unilaterally write it to reflect my view (and incidentally the behavior of all parsers).
Is endDocument() always called? Even if it's always called for a fatal error, what about these cases:
characters() throws an unexpected
Error or other non-exception
fatalError? I suggest no, because in this case it's the client code that's throwing the exception. However, again both sides of the argument can point to different sentences in the spec to buttress their position.
Is startDocument() always called? Can a parser throw a fatal error before calling startDocument(), and then call
endDocument()? GNU JAXP does this when encountering a malformed encoding declaration; for instance, in the Sun not-wf encoding tests.
Should IOExceptions be reported to
fatalError()? Especially when it's really a character encoding error rather than an I/O error like a broken socket? Or only for lower byte-level I/O errors such as a broken stream?
I also have one feature request for SAX. It would be very helpful to define standard read-only SAX properties analogous to Java's java.version and java.vendor system properties that provide the vendor and version of the parser being used. For example, http://xml.org/sax/properties/vendor and http://xml.org/sax/properties/version.
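From the application side, using such properties might look like the sketch below. The two property URIs are the ones proposed above and are hypothetical: no parser is known to define them, so the code falls back to "unknown" whenever the parser does not recognize a property. The class name is likewise invented for illustration.

```java
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical use of the proposed (not standardized) identification properties.
class ParserIdentitySketch {
    static final String VENDOR = "http://xml.org/sax/properties/vendor";
    static final String VERSION = "http://xml.org/sax/properties/version";

    // Returns the property value as a string, or "unknown" if unavailable.
    static String propertyOrUnknown(XMLReader reader, String uri) {
        try {
            Object value = reader.getProperty(uri);
            return value == null ? "unknown" : value.toString();
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return "unknown"; // today's parsers are expected to land here
        }
    }

    static String describe(XMLReader reader) {
        return propertyOrUnknown(reader, VENDOR) + " " + propertyOrUnknown(reader, VERSION);
    }
}
```

A test harness could use describe() to label its result files instead of relying on the implementation class name.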
Ultimately I hope the SAX community will come to consensus on these issues, and issue a revised version of SAX (2.0.2?) which nails down these inconsistencies.
The test suite is far from complete.
It pretty thoroughly tests three of the four required SAX interfaces.
ErrorHandler is tested to the extent possible
given SAX's almost complete lack of requirements for what this interface actually does.
However, much work remains to be done:
DeclHandler especially, for those parsers that implement them.
If you'd like to run the test suite for yourself, you can download it from http://www.cafeconleche.org/SAXTest/. An Ant build file is included that will produce both the expected data and the individual parser results from the W3C XML Test Suite. Because the license agreement for the test suite is unclear, you'll also need to download that from the W3C at http://www.w3.org/XML/Test and install it in the same directory. The results cited here were produced using the December 10, 2003 drop of the XML test suite. Of course changes to both the test suite and the parsers are likely to change the exact numbers.
I think this indicates that the designers of XML made the wrong choice about where to cut the difference between validating and non-validating parsers. Since all parsers must support the internal DTD subset without exception, it's really not very hard to also add support for the external DTD subset. Many parser writers have expressed a desire to be able to ignore the internal and external DTD subsets. However, a parser that did this would not be conformant to XML 1.0, much less to SAX.
Xerces is not incorrect here, and indeed passes this test. However, it does not report the maximum possible amount of content before the well-formedness error which some other parsers do. To make the comparison work, this content needs to be added by hand.
To add insult to injury, on further analysis the single test Saxon passed proved to be a false positive due to a bug in the comparison code. It really failed all tests.
Local names must always be passed. Even according to David Brownell's rules, only qualified names are sometimes optional.