SAX Conformance Testing

Keywords: SAX, XML, test, conformance

Elliotte Rusty Harold
Polytechnic University
Dept. of Computer Science
Brooklyn
NY
USA
elharo@metalab.unc.edu
http://www.cafeconleche.org/

Biography

Elliotte Rusty Harold is an adjunct professor of computer science at Polytechnic University in Brooklyn. He's the author of numerous books on XML including the XML 1.1 Bible, XML in a Nutshell, Effective XML, and Processing XML with Java. Most recently he has been working on XOM, the only tree-based API for XML that absolutely guarantees well-formedness.

Abstract

While SAX [SAX] , the simple API for XML, is a broadly, almost universally implemented standard among Java parsers, many SAX parsers have serious bugs. The lack of a complete SAX conformance test suite has been a severe hindrance to interoperability. For example, about half of SAX parsers call endDocument() even after reporting a fatal error, while the other half don’t. Existing XML test suites mostly focus on whether the parser correctly answers boolean questions of well-formedness or validity, while ignoring the much more complex questions of whether the parser correctly reports document content in the correct order. Indeed the XML specification is mostly silent on exactly which parts of the document a parser is required to report. Not surprisingly this has led to a number of inconsistencies between parsers as well as outright bugs in more than a few implementations.

This paper demonstrates a conformance suite written in Java that tests parsers which claim to implement the SAX API. The framework asks the parser to read a collection of input documents and then logs the methods the parser invokes and their arguments. This log takes the form of an XML document that can be compared against the expected results.

The documents in the test set are derived from the W3C XML conformance test suite. The software includes a framework for testing parsers against this document collection and measuring their conformance to both the core and optional parts of both XML and SAX. Conformance results for major parsers including Xerces, Crimson, and Piccolo are reported. A number of areas in which deficiencies in the SAX specification have led to varying parser behavior are identified.

Existing test suites
Comparing Output
Bootstrapping
Results for different parsers
     Common Errors
         Is endDocument invoked after a fatal error?
         What kinds of exceptions can parsers throw?
         How much data is passed after a fatal error?
         What is the type of enumerated attributes?
         XML 1.1 support
     Problems with Specific Parsers
         Xerces-J 2.6.2
         Oracle 9.2.0.6.0
         Crimson
         Piccolo
         Saxon's Ælfred
         dom4j's Ælfred
         GNU JAXP
         XP
SAX Issues
Future Directions for more research
Test Suite Availability
Footnotes
Bibliography

Existing test suites

There are several existing test suites for XML and SAX. However, they offer limited coverage, and they failed to expose many bugs I noticed during the development of XOM [XOM].

There is an embryonic, semi-official test suite for SAX [Arnold 2001] . However, this includes only a few dozen JUnit based tests for the most basic features of SAX2. There are also a couple of thousand SAX 1 tests based on a draft version of the OASIS/NIST XML test suite. However, these perform limited output testing, and leave many holes in coverage. The latest version is 0.2 from November 12, 2001. Work appears to have been abandoned.

The most comprehensive XML test suite is the W3C's XML test suite [W3C XML Group] , which bundles tests gathered from a variety of sources including James Clark, the OASIS/NIST XML test suite, Sun, IBM, Henry S. Thompson, and others. This offers the broadest coverage of a range of XML documents. However, it focuses on testing binary decisions. Is the document well-formed or not? Is the document valid or not? It provides a limited number of output tests, based on the Second XML Canonical Form [Sun].

To properly test a SAX parser, it is necessary to verify that it reports the right events with the right content in the right order. This exceeds the scope of the XML Test Suite, which is API independent. However, because the W3C test suite is so broad, it became the primary source of input data for this new test suite. In order to pass the tests, a parser must be able to correctly process all the documents in the W3C XML test suite. The difference is not in the documents themselves. It is in the scope of the output. Passing the W3C test suite primarily requires correctly identifying well-formed and malformed, valid and invalid documents. My test suite requires not only this, but also the reporting of the right content in the right order at the right time using the right methods.

The W3C test suite is divided into numerous test cases stored in several directories, mostly organized by the submitter. The test cases are further subdivided into well-formed and malformed, valid and invalid, namespace aware and non-namespace aware, and external entity using and self-contained test cases. The master file lists a typical case like this:

<TEST TYPE="invalid" URI="invalid/attr06.xml" ID="attr06" SECTIONS="3.3.1">
    Tests the "Name Token" VC for the NMTOKENS attribute type.</TEST>

This says that the test case document can be found at the relative URL "invalid/attr06.xml", that the document at that URL is invalid (but well-formed); that it tests section 3.3.1 of the XML specification, and more specifically it tests the name token validity constraint for the NMTOKENS attribute type. Here I'm not so interested in testing whether the document is valid or invalid as I am in testing that all the content from that document is properly reported through SAX.
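As a rough illustration of how a harness might collect these entries, the following sketch reads TEST elements with SAX itself. The class name TestCatalog, the Entry holder, and the idea of passing the catalog as a single string are assumptions made for this example; the real master file pulls in per-submitter catalogs, which is not shown here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the TEST entries of a catalog document.
class TestCatalog {

    /** One TEST entry: its ID, type, relative URI, and spec sections. */
    static final class Entry {
        final String id;
        final String type;
        final String uri;
        final String sections;

        Entry(String id, String type, String uri, String sections) {
            this.id = id;
            this.type = type;
            this.uri = uri;
            this.sections = sections;
        }
    }

    /** Parses the catalog text and returns one Entry per TEST element. */
    static List<Entry> read(String catalogXml) {
        final List<Entry> entries = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("TEST".equals(qName)) {
                    entries.add(new Entry(atts.getValue("ID"),
                                          atts.getValue("TYPE"),
                                          atts.getValue("URI"),
                                          atts.getValue("SECTIONS")));
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(catalogXml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not read catalog", e);
        }
        return entries;
    }
}
```

Each Entry's uri would then be resolved against the catalog's location to find the input document for one test run.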

Comparing Output

In order to compare the output of different parsers, it's necessary to place the output in a standard format that can be easily diffed. It seemed natural to use XML for this purpose. A single class that implemented ContentHandler, ErrorHandler, EntityResolver, and DTDHandler (the four required SAX interfaces) was written that logged all its calls to an XML document. (More specifically, it created a XOM Document object which was later serialized.) However, the problem is thornier than it may appear at first. It is necessary to produce well-formed output even for malformed input. We must not assume that the SAX parser will detect such bugs because that would require assuming that the parser is non-buggy, precisely what we're endeavoring to determine. For instance, we cannot assume element names will not contain white space or PCDATA will not contain nulls. Suppose we begin with this test document (test case ibm-valid-P10-ibm10v02.xml) [ibm10v02]:

<?xml version="1.0"?>
<!DOCTYPE student [
	<!ELEMENT student (#PCDATA)>
	<!ATTLIST student
		first CDATA #REQUIRED
		middle CDATA #IMPLIED
		last CDATA #REQUIRED > 
	<!ENTITY myfirst "Snow">
	<!ENTITY mymiddle "Y">
	<!ENTITY mylast ''>
]>
<!-- testing AttValue with empty char inside single quote -->
<student first='' last=''>My Name is Snow &mylast; Man. </student>

When parsed it produces this output:

<?xml version="1.0" encoding="UTF-8"?>
<ConformanceResults>
    <startDocument/>
    <startElement>
        <namespaceURI/>
        <qualifiedName>student</qualifiedName>
        <attributes>
            <attribute>
                <namespaceURI/>
                <localName>first</localName>
                <qualifiedName>first</qualifiedName>
                <value/>
                <type>CDATA</type>
            </attribute>
            <attribute>
                <namespaceURI/>
                <localName>last</localName>
                <qualifiedName>last</qualifiedName>
                <value/>
                <type>CDATA</type>
            </attribute>
        </attributes>
    </startElement>
    <char>M</char>
    <char>y</char>
    <char>\s</char>
    <char>N</char>
    <char>a</char>
    <char>m</char>
    <char>e</char>
    <char>\s</char>
    <char>i</char>
    <char>s</char>
    <char>\s</char>
    <char>S</char>
    <char>n</char>
    <char>o</char>
    <char>w</char>
    <char>\s</char>
    <char>\s</char>
    <char>M</char>
    <char>a</char>
    <char>n</char>
    <char>.</char>
    <char>\s</char>
    <endElement>
        <namespaceURI/>
        <qualifiedName>student</qualifiedName>
    </endElement>
    <endDocument/>
</ConformanceResults>
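A much reduced sketch of such a logging handler follows. It is not the XOM-based class described above: it appends to a StringBuilder, covers only a handful of callbacks, and escapes only markup characters (the white space and illegal character escaping is left out). The class name EventLogger and the run() helper are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical, simplified logger: one output element per SAX callback.
class EventLogger extends DefaultHandler {
    private final StringBuilder log = new StringBuilder("<ConformanceResults>\n");

    // Null arguments are omitted; empty strings become empty elements.
    private void child(String indent, String name, String value) {
        if (value == null) return;
        log.append(indent).append('<').append(name);
        if (value.isEmpty()) {
            log.append("/>\n");
        } else {
            log.append('>').append(escape(value))
               .append("</").append(name).append(">\n");
        }
    }

    // Only markup characters are escaped in this sketch.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    @Override public void startDocument() { log.append("    <startDocument/>\n"); }
    @Override public void endDocument() { log.append("    <endDocument/>\n"); }
    @Override public void fatalError(SAXParseException e) { log.append("    <fatalError/>\n"); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        log.append("    <startElement>\n");
        child("        ", "namespaceURI", uri);
        child("        ", "localName", localName);
        child("        ", "qualifiedName", qName);
        log.append("        <attributes>\n");
        for (int i = 0; i < atts.getLength(); i++) {
            log.append("            <attribute>\n");
            child("                ", "namespaceURI", atts.getURI(i));
            child("                ", "localName", atts.getLocalName(i));
            child("                ", "qualifiedName", atts.getQName(i));
            child("                ", "value", atts.getValue(i));
            child("                ", "type", atts.getType(i));
            log.append("            </attribute>\n");
        }
        log.append("        </attributes>\n    </startElement>\n");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        log.append("    <endElement>\n");
        child("        ", "namespaceURI", uri);
        child("        ", "localName", localName);
        child("        ", "qualifiedName", qName);
        log.append("    </endElement>\n");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // One element per char, so chunking differences between parsers vanish.
        for (int i = start; i < start + length; i++) {
            child("    ", "char", String.valueOf(ch[i]));
        }
    }

    /** Parses the text with the platform's default parser and returns the log. */
    static String run(String xml) {
        EventLogger logger = new EventLogger();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), logger);
        } catch (SAXException e) {
            // Usually follows a fatal error, which has already been logged.
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return logger.log.append("</ConformanceResults>\n").toString();
    }
}
```

Because the handler never trusts its arguments to be legal XML, a fuller version would also have to escape anything that cannot appear in a well-formed document.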

The general format could have the following DTD:

    <!ELEMENT locator EMPTY>
    <!ELEMENT startDocument EMPTY>
    <!ELEMENT endDocument EMPTY>
    <!ELEMENT fatalError EMPTY>
    <!ELEMENT char (#PCDATA)>
    <!ELEMENT ignorable (#PCDATA)>
    <!ELEMENT localName (#PCDATA)>
    <!ELEMENT name (#PCDATA)>
    <!ELEMENT systemID (#PCDATA)>
    <!ELEMENT qualifiedName (#PCDATA)>
    <!ELEMENT namespaceURI (#PCDATA)>
    <!ELEMENT value (#PCDATA)>
    <!ELEMENT type (#PCDATA)>
    
    <!ELEMENT ConformanceResults 
      (startDocument | startElement | endElement | char | ignorable 
      | notation | unparsedEntity | resolveEntity | endDocument | fatalError
      | processingInstruction | locator
      )*>
      
    <!ELEMENT attributes (attribute)*>
    <!ELEMENT attribute  (namespaceURI?, localName?, qualifiedName?, value?, type?)>
    
    <!ELEMENT startElement  (namespaceURI?, localName?, qualifiedName?, attributes?)>
    <!ELEMENT endElement  (namespaceURI?, localName?, qualifiedName?)>
    
    <!ELEMENT notation (name?, systemID?)>
    <!ELEMENT unparsedEntity (name?, publicID?, systemID?, notation?)>
    
    <!ELEMENT bug (#PCDATA)>
    <!ATTLIST bug reason CDATA #IMPLIED>

This output format is designed to avoid some common problems:

Attributes are not used because attribute value normalization makes value comparison problematic. With attributes we could not test the proper reporting of line breaks.
Indenting is used to make the code easier to read. The indenting is reproducible. Identical input and processing will produce byte-for-byte identical output. The output does not need to be fairly illegible canonical XML (contrast with the OASIS XSLT test suite [Van Vleet 2001] ) so long as it's reproducible.
Arguments passed as null do not appear in the output. Arguments passed as empty strings become empty elements.
For ease of visual comparison all white space characters are escaped using backslash escapes as shown in Table 1 . In addition since Java chars do not correspond to legal XML characters (especially in XML 1.0, but also in XML 1.1 when halves of surrogate pairs are received), XML illegal Java chars were escaped as \u + the hexadecimal code for the character. This was also used for unprintable characters like the C1 controls, purely for ease of manual comparison.
This also neatly avoids any potential issues with linefeed normalization when the document is parsed for comparison. Some parsers have known bugs with white space handling, and we don't want to sweep the problems under the rug.
UTF-8 is used for the output.
The output is pure XML 1.0.
Namespaces are not used, so prefix scoping does not become an issue.

Carriage return	\r
Linefeed	\n
Space	\s
Tab	\t
Backslash	\\

Table 1
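The escaping in Table 1 can be sketched as a single method applied to each Java char. The class name CharEscaper, the four-digit uppercase form of the \u escape, and the exact set of characters treated as illegal or unprintable are assumptions for this example; the text above only says that such characters are written as \u plus the hexadecimal code.

```java
// Hypothetical helper illustrating the backslash escapes of Table 1.
class CharEscaper {

    /** Returns the logged form of one Java char. */
    static String escape(char c) {
        switch (c) {
            case '\r': return "\\r";
            case '\n': return "\\n";
            case ' ':  return "\\s";
            case '\t': return "\\t";
            case '\\': return "\\\\";
            default:   break;
        }
        boolean c0Control = c < 0x20;                // illegal in XML 1.0
        boolean c1OrDel = c >= 0x7F && c <= 0x9F;    // unprintable
        boolean loneHalf = Character.isSurrogate(c); // chars are logged one at a time
        boolean nonChar = c == 0xFFFE || c == 0xFFFF;
        if (c0Control || c1OrDel || loneHalf || nonChar) {
            // Assumed format: backslash, u, four uppercase hex digits.
            return String.format("\\u%04X", (int) c);
        }
        return String.valueOf(c);
    }
}
```

Markup characters such as < and & are not handled here; in the real output they would be taken care of by the XML serializer.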

The actual generation also introduces some issues:

Different parsers may make different numbers of calls to characters. Initially, I combined these into a single element in the output using the well-known algorithm [Harold 2002] . However, that proved both difficult to compare by eyeball, and caused problems when different parsers identified ignorable white space in different places. Consequently I decided to report each char separately.
Attribute order needs to be normalized before the content is written. Any reproducible sort order will do. I chose lexical ordering by qualified name. The simplest way to do this was to use a java.util.SortedMap [SortedMap] where the keys were a concatenation of the local name, qualified name, and namespace URI. The null (\u0000) was used as a concatenation character because this would never show up in any legal data.
startPrefixMapping() and endPrefixMapping() order is also indeterminate in SAX. Once again an arbitrary but reproducible sorting was chosen using a SortedMap. However, this time the map had to be maintained across several method calls and only flushed when a startElement() was seen or the next call after an endElement().
Notation and unparsedEntity order is also indeterminate, though they must all appear before the first call to startElement(). Again, I sorted by lexical order of the names.
Optional features such as Locator, LexicalHandler and DeclHandler are ignored.
Non-fatal errors and warnings, which the parser is not required to report, are ignored.
Default values are used for all features that have default values. Features with undefined default values are set as follows: http://xml.org/sax/features/external-general-entities is set to true. http://xml.org/sax/features/external-parameter-entities is set to true. Theoretically, a parser does not have to support these two features. In practice, all eight parsers tested did support them [1].
Different parsers absolutize system IDs (used in NOTATION and unparsed ENTITY declarations) differently. For file URLs some use file:/// and some use only file:/. And of course the complete URL depends on the local file system. Thus the comparison needs to test only the relative parts of these absolute URLs. On the other hand it does have to notice if the URL has not been properly absolutized.
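The attribute normalization described in this list might look roughly like the sketch below. The class name AttributeSorter is an assumption, and returning the qualified names in sorted order is only one way to expose the result; the key construction follows the description above (local name, qualified name, and namespace URI joined by \u0000).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import org.xml.sax.Attributes;

// Hypothetical helper: puts attributes into a reproducible order.
class AttributeSorter {

    /** Returns the attributes' qualified names in normalized order. */
    static List<String> sortedQNames(Attributes atts) {
        // Key: local name, qualified name, and namespace URI joined by the
        // null character, which cannot occur in legal XML data.
        SortedMap<String, Integer> order = new TreeMap<>();
        for (int i = 0; i < atts.getLength(); i++) {
            String key = atts.getLocalName(i) + '\u0000'
                       + atts.getQName(i) + '\u0000'
                       + atts.getURI(i);
            order.put(key, i);
        }
        List<String> result = new ArrayList<>();
        for (int index : order.values()) {
            result.add(atts.getQName(index));
        }
        return result;
    }
}
```

A logging handler would iterate over the map's values in the same way, writing one attribute element per index instead of collecting names.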

Bootstrapping

The canonical output a parser was supposed to report was generated via a bootstrap process. I began by running the parser that experience had led me to expect was most often correct, Xerces2-J 2.6.1 [Xerces J], through the test harness. Then I compared its output to the output of seven other parsers. Where a difference was found, I manually inspected the reason for the difference to determine, based on the XML 1.0 and SAX specifications, which parser was correct. If Xerces proved incorrect, then the expected result was modified to use the correct result. More than once, both parsers were arguably correct. In these cases the comparison code needed to be adjusted to allow for the differences.

After repeating this process several times (and reducing the bugs in the test framework) it became apparent that while Xerces made many mistakes, it had one very nice property: almost all the mistakes were predictable and reproducible. Thus they could be fixed automatically. Specifically, the following changes needed to be made in Xerces' output:

Add missing startDocument elements
Add missing endDocument elements, especially after fatal errors.
Remove extraneous post-root characters that result from earlier failures to flush the buffer
Do not reuse the XMLReader when generating the expected results. Xerces has several bugs that are only exposed when the parser is reused. These have now been fixed in CVS as a result of this author's report. [Harold 2004]
Replace CharacterConversionExceptions and UTFDataFormatExceptions with fatalErrors

In some cases these fixes duplicate each other. For instance, not reusing the XMLReader avoids most problems with extraneous characters. However, that's not a problem since the fixes are all careful to check that the problem exists in a particular case before fixing it.

These could all be fixed automatically. However, two cases remained which needed to be fixed by hand:

xmlconf/oasis/p02fail30.xml (needs to allow start-tag before fatal error) [2]
ibm/not-wf/P01/ibm01n01.xml (flips order of fatalError and endDocument)
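To make the idea of an automatic fix concrete, here is a minimal sketch of the first two corrections (missing startDocument and endDocument), applied to a plain list of event names. The class name ExpectedResultFixer and the list-of-strings representation are assumptions for illustration; the actual framework worked on the XML result documents.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical fixer: checks that a problem exists before repairing it.
class ExpectedResultFixer {

    /** Returns a copy of the event list with the document events restored. */
    static List<String> fix(List<String> events) {
        List<String> fixed = new ArrayList<>(events);
        // Add a missing startDocument, but only if it really is missing.
        if (fixed.isEmpty() || !fixed.get(0).equals("startDocument")) {
            fixed.add(0, "startDocument");
        }
        // Add a missing endDocument, which is typical after a fatal error.
        if (!fixed.get(fixed.size() - 1).equals("endDocument")) {
            fixed.add("endDocument");
        }
        return fixed;
    }
}
```

Because each repair tests for its own precondition, running the fixer on an already correct log leaves it unchanged.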

Results for different parsers

Once the bootstrapping process was complete, it became possible to compare the results for different parsers. Eight currently available XML parsers were tested:

Xerces-J 2.6.1 [Xerces J]
Crimson as shipped in Sun's JDK 1.4.2_03 [Crimson]
Oracle 9.2.0.6.0 [Oracle]
GNU JAXP [GNU JAXP]
DOM4J [DOM4J]
Saxon [Kay 2002]
Piccolo [Piccolo]
xp [XP]

IBM also produces XML4J. However, this is just a rebranded Xerces.

Three of these (Saxon, GNU JAXP, and DOM4J) are not independent. They are all descendants of David Megginson's Ælfred parser from Microstar [Megginson 1998] .

Currently only Xerces-J and the Oracle XML parser appear to be actively developed. The other six seem to have been abandoned. The lack of a competitive market for SAX parsers in Java came as something of a surprise and is a cause for concern. To a large extent, most users seem satisfied with Xerces; and there does not appear to be a large demand for alternate parsers. The C world has a much broader choice with at least four major parsers that implement the SAX API (libxml2 [Veillard], expat [expat], Oracle XML Parser for C++ [Oracle C], and Xerces-C++ [Xerces C]).

Common Errors

There were several particularly common errors that were exhibited many times by multiple parsers. These are definitely places anyone writing or contemplating writing a parser should watch carefully.

Is endDocument invoked after a fatal error?

The most common error, though perhaps an arguable one, was failing to invoke endDocument() for a malformed document. The API documentation for the endDocument() method states, "The SAX parser will invoke this method only once, and it will be the last method invoked during the parse. The parser shall not invoke this method until it has either abandoned parsing (because of an unrecoverable error) or reached the end of input." [ContentHandler] One could wish the language were slightly clearer (I would prefer it to say "exactly once" rather than "only once") but it still implies that endDocument() should be called even in the event of a fatal error. On the other hand, the API documentation for ErrorHandler.fatalError() states, "The application must assume that the document is unusable after the parser has invoked this method, and should continue (if at all) only for the sake of collecting additional error messages: in fact, SAX parsers are free to stop reporting any other events once this method has been invoked." [ErrorHandler] This implies that it is acceptable not to call endDocument() after a fatal error.

Given the apparent inconsistency in the spec, what do the authors have to say? David Brownell, the second maintainer of the SAX specification and the probable author of this statement, was explicit that endDocument must always be called. According to Brownell, "If it's not, that's a SAX conformance bug. Sadly: last I looked, it wasn't an uncommon bug to omit calling it in the 'abandoned parsing' case. That makes it tough to use endDocument() to do things like clean up application state." [Brownell 2002]

That seems clear enough. However, David Megginson, the original maintainer of the SAX specification, disagrees:

My original intention at the start of SAX development was that endDocument would not necessarily be called once the parser was in an error state, but the documentation might not have been clear and David Brownell might have clarified things the other way after he took over. [Megginson 2004]

Furthermore, Megginson has recently announced plans to revise this as part of the final release of SAX 2.0.1 [Megginson 2004 2], and as I write these words, it's being hashed out one more time on the sax-devel mailing list. The situation is at best unclear. My opinion is that endDocument must be called, and parsers certainly can and should do this. However, reasonable people may disagree, with support from both the spec and the maintainers.
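Whichever reading of the spec one prefers, it is easy to find out what a given parser actually does. The following sketch records whether endDocument() arrives after fatalError(); the class name EndDocumentProbe and its fields are assumptions for this example, and it tests whatever parser SAXParserFactory happens to return.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical probe: does the parser call endDocument() after a fatal error?
class EndDocumentProbe extends DefaultHandler {
    boolean sawFatalError;
    boolean endDocumentAfterFatalError;

    @Override
    public void fatalError(SAXParseException e) {
        // Record the error without rethrowing, so the parser decides what to do next.
        sawFatalError = true;
    }

    @Override
    public void endDocument() {
        if (sawFatalError) {
            endDocumentAfterFatalError = true;
        }
    }

    /** Parses the given text and returns the probe with its flags set. */
    static EndDocumentProbe probe(String xml) {
        EndDocumentProbe handler = new EndDocumentProbe();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // Many parsers throw after reporting the fatal error; that is fine here.
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler;
    }
}
```

Calling EndDocumentProbe.probe("<doc><a/</doc>") and inspecting endDocumentAfterFatalError shows which side of the debate the installed parser takes.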

What kinds of exceptions can parsers throw?

The SAX documentation is clear that on encountering a well-formedness error, the parse() method must throw a SAXException. "If the application needs to pass through other types of exceptions, it must wrap those exceptions in a SAXException or an exception derived from a SAXException." [SAXException] It is also explicitly allowed to throw IOExceptions for an I/O error. [ErrorHandler] However, it is clearly wrong for the parse() method to throw a RuntimeException. Nonetheless many parsers threw NullPointerExceptions, ArrayIndexOutOfBoundsExceptions, NegativeArraySizeExceptions, and more when encountering malformed documents. Piccolo was the worst offender here, but even the relatively well-behaved Xerces had a few problems in this area.
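A harness can separate the permitted exceptions from the forbidden ones by catching each kind explicitly. The sketch below is one way to do that; the class name ExceptionAudit and the result strings are assumptions for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical audit: which kind of exception escapes from parse()?
class ExceptionAudit {

    /** Parses the text and names the kind of exception thrown, if any. */
    static String classify(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return "no exception";
        } catch (SAXException e) {
            return "SAXException";        // allowed: well-formedness error
        } catch (IOException e) {
            return "IOException";         // allowed: I/O error
        } catch (ParserConfigurationException e) {
            return "configuration problem";
        } catch (RuntimeException e) {
            // Not allowed: NullPointerException and friends are parser bugs.
            return "BUG: " + e.getClass().getName();
        }
    }
}
```

DefaultHandler's fatalError() rethrows the SAXParseException, so a conformant parser should yield "SAXException" for any malformed input.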

How much data is passed after a fatal error?

Another common source of errors is reporting too much data after a fatal error. The XML spec says, "After encountering a fatal error, the processor MAY continue processing the data to search for further errors and MAY report such errors to the application. In order to support correction of errors, the processor MAY make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor MUST NOT continue normal processing (i.e., it MUST NOT continue to pass character data and information about the document's logical structure to the application in the normal way)." [Bray 2004] However, parsers often mismark the bounds of a fatal error. For example, consider James Clark's test not-wf-sa-027.xml [Clark27]:

<doc>
<!-- abc
</doc>

Under some circumstances, Xerces 2.6.1 passes "abc </doc>" to the characters() method before reporting a fatal error [Harold 2004]. This bug has been fixed in CVS.

Or consider James Clark's test not-wf/sa/045.xml [Clark45] :

<doc>
<a/
</doc>

Crimson calls startElement() for the a element before it realizes the closing greater-than sign is missing.

What is the type of enumerated attributes?

Another common, though minor, error was reporting the wrong type for attributes declared with enumerations. The SAX spec requires these to be reported with the type NMTOKEN. However, several parsers use the non-standard ENUMERATION type instead.
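A check for this is little more than a set lookup against the type names that the Attributes.getType() documentation lists. The class name AttributeTypes is an assumption for this sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical check: is a reported attribute type one that SAX allows?
class AttributeTypes {
    // The type names listed in the Attributes.getType() documentation.
    private static final Set<String> STANDARD = new HashSet<>(Arrays.asList(
        "CDATA", "ID", "IDREF", "IDREFS", "NMTOKEN", "NMTOKENS",
        "ENTITY", "ENTITIES", "NOTATION"));

    /** Returns true if the type string is one of the standard SAX names. */
    static boolean isStandard(String type) {
        return STANDARD.contains(type);
    }
}
```

A harness would call isStandard(atts.getType(i)) for every attribute and flag values such as ENUMERATION.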

XML 1.1 support

About 5% of the test cases cover XML 1.1. Except for Xerces none of the parsers explicitly support this. (Some of the parsers almost accidentally pass some of the 1.1 tests though.)

Problems with Specific Parsers

What follows are bugs uncovered in individual parsers. Most of these were exposed in multiple tests. Conformance ranged from a low of about 18% to a high just over 90%. Most products could radically improve their scores just by fixing one or two key problems that account for most of their failures. There's quite a bit of low hanging fruit here waiting to be picked off.

Xerces-J 2.6.2

Xerces is the most popular and broadly used parser I tested. It will become the default parser in Java 1.5. Xerces 2.6.2 achieves a conformance score of only 91%, surprisingly low, especially given that it served as the base for testing other parsers. (The score's even worse, an abysmal 26%, if you require that it call endDocument() following a well-formedness error.) However, almost all of the Xerces problems related to a few easily worked around bugs. In order of frequency, these were:

Never calls endDocument after a fatal error.
Occasionally reports the text following fatal error by passing it to characters (xmltest/not-wf/sa/027.xml ) [Harold 2004]
Occasionally passes data from comments into characters (ibm-valid-P09-ibm09v05.xml).
Treats encoding errors (mismatched byte order mark, bad UTF-8 data) as I/O errors rather than well-formedness errors.

Although Xerces posts very low scores, most of these are for problems that can be worked around if you know you're using Xerces. Several of the bugs in Xerces have now been fixed in CVS, and should no longer be problems as of version 2.7.0. It is probably the parser of choice for most applications.

Oracle 9.2.0.6.0

Besides Xerces, the Oracle XML Parser for Java is the only SAX parser written in Java that is currently maintained. Thus it's disappointing that it doesn't do a better job. It passed only 42% (19% when requiring endDocument()) of the tests. Furthermore, unlike Xerces, many of the errors were XML conformance errors rather than less serious SAX conformance errors. Notable problems included:

Does not normalize white space in NMTOKENS attributes when not validating (/xmltest/valid/sa/037.xml, 096.xml /sa/111.xml)
Does not always allow non-ASCII characters in names (/xmltest/valid/sa/063.xml)
Does not handle character references for characters outside Unicode's Basic Multilingual Plane properly (xmltest/valid/sa/064.xml /xmltest/valid/sa/089.xml)
Reads the character reference &#13; as \n rather than \r (xmltest/valid/sa/067.xml, /xmltest/valid/sa/107.xml)
Does not report NOTATION declarations (xmltest/valid/sa/069.xml xmltest/valid/sa/091.xml sun/valid/notation01.xml)
Does not use DTD to determine ignorable white space (xmltest/valid/sa/093.xml sun/valid/element.xml sun/invalid/el01.xml) This is not just reporting ignorable white space as characters, which is legal. It reports non-ignorable whitespace as ignorable.
Doesn't read attribute types from the external DTD subset (sun/valid/pe01.xml)
Does not always call endDocument(), even in well-formed documents, with an empty root element tag (sun/invalid/dtd01.xml sun/invalid/el05.xml)
Does not apply default attributes from external DTD subset in an invalid document (ibm/invalid/P32/ibm32i01.xml)
Reports too much content from malformed document (ibm/not-wf/P10/ibm10n01.xml)
Has trouble with unusual Unicode chars in processing instruction names (ibm-valid-P85-ibm85v01.xml through ibm-valid-P85-ibm87v01.xml ibm89v01.xml)
No 1.1 support, but does not call fatalError() until it's passed most of the content in, including 1.0 illegal characters (see all the IBM 1.1 test cases)
Reports xmlns attributes even when it isn't supposed to
Reports a namespace URI for xmlns attributes, in contrast to SAX spec.

In fairness, I must note that shortly before the deadline for the submission of conference papers, Oracle released version 10 of their parser, which shows signs of being significantly improved. I hope to have updated results covering this new version of the Oracle parser at the conference. However, version 9 is clearly too nonconformant to both SAX and XML to rely on.

Crimson

Sun has abandoned their home-grown Crimson parser in favor of the IBM developed Xerces for the next 1.5 release of the Java Development Kit (JDK). However, Crimson is the parser bundled with the JDK through version 1.4.2_03 and is still used by many Java programmers by default. It passes 88% of the tests (but only 29% if you require endDocument on well-formedness errors). Failures are mostly similar to Xerces'. In particular, it also does not call endDocument following a fatal error. However, it has a few unique problems as well:

Does not resolve external entities, even when http://xml.org/sax/features/external-general-entities and http://xml.org/sax/features/external-parameter-entities are turned on, unless the parser is validating
Not XML 1.1 aware
Uses the non-standard "ENUMERATION" attribute type
Reports white space inside an element declared EMPTY as ignorable, rather than characters
Treats namespace errors as errors rather than fatal errors

Sun is moving to Xerces; and, given these problems, I see little reason why other programmers shouldn't make the switch as well.

Piccolo

When Yuval Oren first released Piccolo two years ago, I had very high hopes for it. It was a very small, very fast parser that filled an important niche of non-validating but entity resolving parser. It was notable for being built using a formal grammar and the parser generator tools JFlex and BYACC/J rather than a handrolled parser, as most implementers working in Java had done up to that point. However, the initial releases had numerous bugs, and no progress has been made on fixing these since July, 2002. My tests uncovered many more problems I had not previously noticed. These include:

For two-tag empty elements like <mixed1></mixed1> calls characters with no text to report, while it does not do so for empty-element tags such as <mixed1/>. It's not 100% obvious that this is illegal, but it's certainly strange.
Uses ENUMERATION type instead of NMTOKEN
Reports namespace prefixes as attributes by default (/sun/invalid/attr08.xml)
Sometimes doesn't call fatalError when encountering a malformed document (e.g. not-wf/attlist11.xml)
Sometimes reports attributes that aren't there (e.g. sun/not-wf/element00.xml, xmltest/not-wf/sa/017.xml)
Does not detect namespace well-formedness errors by default (oasis/p04pass1.xml)
Sometimes fails to report complete attribute local name (la instead of lang, ibm/valid/P33/ibm33v01.xml)
Changes tabs into spaces (eduni/errata-2e/E20.xml)
Doesn't check XML version declaration (eduni/xml-1.1/008.xml)
Doesn't notice when two attributes with different prefixes and same local names have the same namespace URI (eduni/namespaces/1.0/009.xml)
Allows multiple colons in names (eduni/namespaces/1.0/013.xml)
Allows prefix unbinding in 1.0 (eduni/namespaces/1.0/023.xml)
Overall namespace handling is very flaky, see all the rmt cases
Doesn't always call startDocument (xmltest/not-wf/sa/030.xml)
Sometimes reports content from after first well-formedness error (xmltest/not-wf/sa/036.xml through 41.xml, 43.xml, 44.xml)

The overall conformance rate was only 57%. I cannot at this time recommend Piccolo for serious work, though it might make an interesting "fixer-upper" project if someone wished to begin plugging its holes.

Saxon's Ælfred

Michael Kay's Ælfred derivative posted the highest overall conformance scores in the tests (over 90%), until I turned on checking for entity resolution at which point the scores dropped to 0.05%! [3] Such an unbelievably low score makes one question the validity of the tests. However, on investigation the test proved correct. Saxon's Ælfred calls resolveEntity() for the document entity as well as for all external entities. However, the SAX specification specifically prohibits this: "The parser will call this method before opening any external entity except the top-level document entity." (emphasis added) [EntityResolver] Besides this, it had several other significant failure modes:

Does not require entity replacement text to be well-balanced, as long as the final document is well-formed.
Does not detect or complain of unpaired surrogates or a lot of other illegal chars (A quick check of the source code shows it is using Java's rules for name characters rather than the similar but not identical XML rules.)
Allows colon as attribute name
Does not absolutize system IDs of NOTATIONs (ibm/not-wf/P41/ibm41n12.xml)
Flunks a lot of the namespace tests such as no colons in PI names
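The gap between Java's and XML's notions of a name character is easy to demonstrate. In the sketch below, isXmlNameStart() uses the NameStartChar ranges from the fifth edition of XML 1.0, which postdates the parsers tested here, and Character.isJavaIdentifierStart() stands in for "Java's rules"; both choices, like the class name NameChars, are assumptions for illustration rather than a reconstruction of the Ælfred source.

```java
// Hypothetical comparison of Java identifier rules with XML name rules.
class NameChars {

    /** NameStartChar as defined in XML 1.0, fifth edition (an assumption here). */
    static boolean isXmlNameStart(int cp) {
        return cp == ':' || cp == '_'
            || (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')
            || (cp >= 0xC0 && cp <= 0xD6) || (cp >= 0xD8 && cp <= 0xF6)
            || (cp >= 0xF8 && cp <= 0x2FF) || (cp >= 0x370 && cp <= 0x37D)
            || (cp >= 0x37F && cp <= 0x1FFF) || (cp >= 0x200C && cp <= 0x200D)
            || (cp >= 0x2070 && cp <= 0x218F) || (cp >= 0x2C00 && cp <= 0x2FEF)
            || (cp >= 0x3001 && cp <= 0xD7FF) || (cp >= 0xF900 && cp <= 0xFDCF)
            || (cp >= 0xFDF0 && cp <= 0xFFFD) || (cp >= 0x10000 && cp <= 0xEFFFF);
    }

    /** True when Java's identifier-start rule and the XML rule give different answers. */
    static boolean disagree(int cp) {
        return Character.isJavaIdentifierStart(cp) != isXmlNameStart(cp);
    }
}
```

For instance, the dollar sign can start a Java identifier but not an XML name, while the colon can start an XML name but not a Java identifier, so a parser that borrows Java's rules both accepts and rejects the wrong documents.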

Kay has halted further work on this parser now that an XML parser is bundled with the JDK. [Kay 2002] If anyone is interested in picking this product up again, it would be straight-forward to fix the bugs in character class detection and prevent it from calling resolveEntity() for the document entity. The problems with well-formedness of entity replacement text may run deeper in the code base though.
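Detecting the resolveEntity() problem takes only a resolver that remembers what it was asked for. The sketch below is one possible probe; the class name ResolveEntityProbe and the made-up system ID passed in by the caller are assumptions, and because parsers absolutize system IDs differently, a real check would compare more loosely than the exact match used here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical probe: is resolveEntity() called for the document entity?
class ResolveEntityProbe extends DefaultHandler {
    final List<String> requestedSystemIds = new ArrayList<>();

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        requestedSystemIds.add(systemId);
        return null; // let the parser open the entity itself
    }

    /** Returns true if the parser asked the resolver for the document itself. */
    static boolean resolvesDocumentEntity(String xml, String documentSystemId) {
        ResolveEntityProbe probe = new ResolveEntityProbe();
        InputSource source = new InputSource(new StringReader(xml));
        source.setSystemId(documentSystemId);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, probe);
        } catch (SAXException e) {
            // Malformed input does not affect this check.
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        // Exact match only; a fuller check would allow for absolutization.
        return probe.requestedSystemIds.contains(documentSystemId);
    }
}
```

A parser that follows the EntityResolver documentation should make resolvesDocumentEntity() return false for a document with no external entities.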

dom4j's Ælfred

dom4j's Ælfred derivative has the same bug in entity resolution that Saxon's Ælfred exhibited, and consequently scores identically at 0%. However, even when this bug is ignored, this parser performs noticeably worse than Saxon's Ælfred with only 60%. It shared all of Saxon's problems including failure to detect malformed entities used in a well-formed way and allowing unpaired surrogates. However, it also had several new problems:

Passes an empty string for an element's local name (/not-wf/sa/036.xml and 037.xml, 40 through 44, not-wf-sa-151, valid-sa-002) in endElement(). [4]
Does not always call fatalError() for a malformed document (xmltest/not-wf/sa/050.xml)
Does not absolutize unparsed entity URLs (not-wf-sa-083, ibm-invalid-P76-ibm76i01.xml, ibm-not-wf-P11-ibm11n01.xml)
Allows tabs in notation public IDs (o-p12fail7)

I can't see any particular reason to choose this parser over Saxon's Ælfred derivative.

GNU JAXP

At only 46% conformance, GNU JAXP scored significantly worse than the other two Ælfred derivatives. Its problems included most of those of the other two Ælfred derivatives, with one very important exception: it does not call resolveEntity() for the document entity. However, the shared problems tended to be masked by other bugs in GNU JAXP. In addition, it had these unique problems:

Throws various runtime exceptions such as ArrayIndexOutOfBoundsException, rather than SAXException (not-wf-sa-017 through 019, 024-33)
Does not always call startDocument() (not-wf-sa-099, not-wf-sa-152, o-p24fail2, o-p39fail5, ibm-not-wf-P02-ibm02n30.xml, ibm-not-wf-P24-ibm24n03.xml)
Rejects many well-formed and even valid documents (invalid--002 through empty, and uri01 through o-e2). That it reports an error in hundreds of consecutive well-formed test cases suggests this may be a problem with parser reuse. However, even if it is, parsers are supposed to be reusable. This is a major failure.
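
All three of these failures can be observed from outside the parser. The sketch below is an illustration only; ReuseProbe and its methods are hypothetical names, and the default JAXP parser stands in for the parser under test. It classifies how a parse ended (normally, with a SAXException, or with some stray RuntimeException), notes whether startDocument() was ever called, and then reuses the same XMLReader instance for a second, well-formed document.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical probe; not part of the test suite described in this paper.
final class ReuseProbe extends DefaultHandler {
    boolean startDocumentSeen;

    @Override
    public void startDocument() {
        startDocumentSeen = true;
    }

    static XMLReader newReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Parses one document and describes how the parse ended.
    static String classify(XMLReader reader, String xml) {
        ReuseProbe probe = new ReuseProbe();
        reader.setContentHandler(probe);
        String outcome;
        try {
            reader.parse(new InputSource(new StringReader(xml)));
            outcome = "ok";
        } catch (SAXException e) {
            outcome = "SAXException";
        } catch (IOException e) {
            outcome = "IOException";
        } catch (RuntimeException e) {
            // A conformant parser should not let these escape for malformed input.
            outcome = "RuntimeException: " + e.getClass().getName();
        }
        return outcome + (probe.startDocumentSeen ? "" : " (no startDocument)");
    }

    public static void main(String[] args) {
        XMLReader reader = newReader();
        System.out.println(classify(reader, "<a><b></a>"));  // malformed
        System.out.println(classify(reader, "<a><b/></a>")); // well-formed, same parser instance
    }
}
```

If the second call reports anything other than a normal parse, the parser is carrying state over from the failed document.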

GNU JAXP has some features the other Ælfred derivatives don't, such as validation and DOM support. However, its low conformance level makes it a very poor choice for basic SAX work. Its rejection of many well-formed documents is particularly horrendous. I really can't recommend this to anyone.

XP

James Clark's XP is the oldest parser tested. The parser code itself dates to 1998, prior to the advent of SAX2. However, Hussein Shafie made a few minor fixes and improvements to Clark's original code, and wrote a SAX 2 interface for the parser. It performs very well for such an old parser, scoring 87.25% (though requiring endDocument() events would reduce its score to 25%). Among others, problems included:

Allows attribute name consisting of a single colon (valid-sa-012)
Reports attributes of type NMTOKENS as having type CDATA (valid-sa-058, valid-sa-096, valid-sa-111)
Reports attributes of type ENTITY as having type CDATA (valid-sa-091)
Reports attributes of type ID as having type CDATA (ibm-invalid-P56-ibm56i03.xml)
Rejects some well-formed documents (ibm-invalid-P51-ibm51i01.xml)
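
The attribute type problems show up directly in the values returned by Attributes.getType(). As a rough sketch (AttributeTypeDump and the sample document are hypothetical, not drawn from the test suite), the handler below records the type the parser reports for each attribute of a document whose internal DTD subset declares ID, NMTOKENS, and CDATA attributes. A parser with the defects listed above would report CDATA for all three.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example; not part of the test suite described in this paper.
final class AttributeTypeDump extends DefaultHandler {
    // Sample document with attribute types declared in the internal DTD subset.
    static final String SAMPLE =
          "<!DOCTYPE doc [\n"
        + "  <!ELEMENT doc EMPTY>\n"
        + "  <!ATTLIST doc id ID #IMPLIED\n"
        + "                tokens NMTOKENS #IMPLIED\n"
        + "                note CDATA #IMPLIED>\n"
        + "]>\n"
        + "<doc id='d1' tokens='a b c' note='text'/>";

    final Map<String, String> types = new TreeMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        for (int i = 0; i < atts.getLength(); i++) {
            types.put(atts.getQName(i), atts.getType(i));
        }
    }

    // Returns a map from attribute qualified name to the reported type.
    static Map<String, String> typesOf(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            AttributeTypeDump handler = new AttributeTypeDump();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.types;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(typesOf(SAMPLE));
    }
}
```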

There seems little reason to use this product in production today.

SAX Issues

In many ways this experiment was a test of the SAX specification itself. A good specification leaves little room for interpretation and carefully spells out those areas where different implementations may behave differently. How well does SAX meet this criterion? In other words, is it really a testable spec?

With some reasonable assumptions, I think the answer is yes. There is much guaranteed behavior from any conformant SAX parser. However, there are a few tricky areas. Specifically,

Must a parser report the maximum amount of content before reporting a fatal error?
I think here the answer is no. There is no requirement in either the XML or SAX specification that any content from a malformed document be available. Indeed, it is only the streaming nature of SAX that enables such content to be presented at all. Tree-based APIs like DOM [Le Hors 2000] and JDOM [Hunter] do not endeavor to provide any such content.
However, there is an explicit requirement in the XML specification that content which follows the first well-formedness error not be made available through normal channels, "After encountering a fatal error, the processor MAY continue processing the data to search for further errors and MAY report such errors to the application. In order to support correction of errors, the processor MAY make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor MUST NOT continue normal processing (i.e., it MUST NOT continue to pass character data and information about the document's logical structure to the application in the normal way)." [Bray 2004] Given this, it is important to verify that such content is not provided.
Are qNames passed to startElement() by default, or can they be null? Here, pretty much everyone in the SAX community except the maintainer of the SAX specification agrees that they are passed by default. In fact, all parsers I tested, including Brownell's own Ælfred, always passed the full name for the qName argument. However, after a significant discussion on the sax-devel mailing list in 2002 [sax-devel], Brownell unilaterally rewrote the SAX spec to reflect his view. (Previously it had been unclear.) Since I'm writing a test suite, I can unilaterally write it to reflect my view (and incidentally the behavior of all parsers).
Is endDocument() always called? Even if it's always called for a fatal error, what about these cases:
1. What if an intermediate method such as startElement() or characters() throws an unexpected SAXException? [Wilson]
2. What if an intermediate method throws a RuntimeException?
3. What if an intermediate method throws an Error or other non-exception Throwable?
Should any or all of these result in a call to fatalError()? I suggest no, because in this case it's the client code that's throwing the exception. However, again both sides of the argument can point to different sentences in the spec to buttress their position.
Is startDocument() always called? Can a parser throw a fatal error before calling startDocument(), and then call endDocument()? GNU JAXP does this when encountering a malformed encoding declaration, for instance in the Sun not-wf encoding tests.
Is it possible for content to contain both ignorable and non-ignorable white space? To contain ignorable white space and non-white space text? All parsers that do distinguish ignorable white space seem to agree that the answer is yes.
How are notation system IDs handled when they contain a non-URI as in oasis/p11pass1.xml?
Should IOExceptions be reported to fatalError()? Especially when the problem is really a character encoding error rather than an I/O error like a broken socket? Or only for lower, byte-level I/O errors such as a broken stream?
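
Several of these questions can at least be answered empirically for a given parser. The following sketch is an assumption-laden illustration rather than the framework's actual code; FatalErrorWatcher is a hypothetical name. Its fatalError() deliberately returns without throwing, so that whatever the parser does next is visible: the handler counts content events delivered after the fatal error, which the XML specification forbids, and records whether endDocument() was called, which is the behavior on which parsers differ.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical watcher; not part of the test suite described in this paper.
final class FatalErrorWatcher extends DefaultHandler {
    boolean fatalSeen;
    boolean endDocumentSeen;
    int contentAfterFatal;

    @Override
    public void fatalError(SAXParseException e) {
        fatalSeen = true; // do not rethrow; leave the next move to the parser
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (fatalSeen) contentAfterFatal++;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (fatalSeen) contentAfterFatal++;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (fatalSeen) contentAfterFatal++;
    }

    @Override
    public void endDocument() {
        endDocumentSeen = true;
    }

    // Parses a document and returns the handler with its observations.
    static FatalErrorWatcher watch(String xml) {
        FatalErrorWatcher watcher = new FatalErrorWatcher();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(watcher);
            reader.setErrorHandler(watcher);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXParseException expected) {
            // Normal outcome for a malformed document.
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return watcher;
    }

    public static void main(String[] args) {
        FatalErrorWatcher w = watch("<a>text<b></a><c/>");
        System.out.println("fatalError called:    " + w.fatalSeen);
        System.out.println("content after error:  " + w.contentAfterFatal);
        System.out.println("endDocument called:   " + w.endDocumentSeen);
    }
}
```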

I also have one feature request for SAX. It would be very helpful to define standard read-only SAX properties analogous to Java's java.version and java.vendor system properties that provide the vendor and version of the parser being used. For example, http://xml.org/sax/properties/vendor and http://xml.org/sax/properties/version.
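
From the client's side, using such properties might look like the sketch below. The two property URIs are the ones proposed above and are not part of SAX; as far as I know no parser recognizes them, so the code treats SAXNotRecognizedException as the normal case. ParserIdentity and its method names are hypothetical.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical client code for the proposed, not yet standardized, properties.
final class ParserIdentity {
    static final String VENDOR = "http://xml.org/sax/properties/vendor";
    static final String VERSION = "http://xml.org/sax/properties/version";

    // Reads one property, falling back to "unknown" when the parser lacks it.
    static String read(XMLReader reader, String name) {
        try {
            Object value = reader.getProperty(name);
            return value == null ? "unknown" : value.toString();
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return "unknown";
        }
    }

    static String describe(XMLReader reader) {
        return read(reader, VENDOR) + " " + read(reader, VERSION);
    }

    static String describeDefaultParser() {
        try {
            return describe(SAXParserFactory.newInstance().newSAXParser().getXMLReader());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describeDefaultParser());
    }
}
```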

Ultimately I hope the SAX community will come to consensus on these issues, and issue a revised version of SAX (2.0.2?) which nails down these inconsistencies.

Future Directions for More Research

The test suite is far from complete. It pretty thoroughly tests three of the four required SAX interfaces: ContentHandler, EntityResolver, and DTDHandler. ErrorHandler is tested to the extent possible given SAX's almost complete lack of requirements for what this interface actually does. However, much work remains to be done:

It should be possible to identify the types of errors, and quantify their severity. Not all errors are created equal. For instance, failing to detect a malformed document is much worse than failing to absolutize a notation's system identifier.
SAX is only formally defined for Java. However, it has been unofficially ported to many other languages including C++ [Xerces C], Python [xml.sax], and Perl [Perl XML]. It would be possible to port the test framework to these environments as well.
The output and comparison code is actually quite decoupled from the SAX test framework. It might be possible to use the same expected output to test other APIs such as StAX [Fry 2003].
Currently SAX parsers are tested only with the default combination of features. Tests should also be performed with different combinations of features, at least those which SAX parsers are required to support. In particular, it would be useful to test with different settings for the features that control namespaces, validation, and loading of the external DTD subset. In other work, I have noted that loading the external DTD subset when not validating is a common source of non-conformance for many parsers.
It would be helpful to test the conformance of the optional parts of SAX, LexicalHandler and DeclHandler especially, for those parsers that implement them.
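
Testing with different feature combinations, as suggested in the list above, amounts to enumerating settings and configuring a fresh parser for each. The sketch below shows one way this might be done; FeatureMatrix is a hypothetical name and this is not code from the current framework. It uses three standard SAX feature URIs, taking external-parameter-entities as the feature that governs loading of the external DTD subset, and records which settings the parser accepts. A real test run would then parse the document collection under each accepted configuration.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical enumeration of feature settings; not part of the current framework.
final class FeatureMatrix {
    static final String[] FEATURES = {
        "http://xml.org/sax/features/namespaces",
        "http://xml.org/sax/features/validation",
        "http://xml.org/sax/features/external-parameter-entities"
    };

    static XMLReader newReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Tries every true/false combination and reports what the parser accepted.
    static List<String> probe() {
        List<String> report = new ArrayList<>();
        int combinations = 1 << FEATURES.length;
        for (int mask = 0; mask < combinations; mask++) {
            XMLReader reader = newReader(); // fresh parser for each combination
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < FEATURES.length; i++) {
                boolean value = (mask & (1 << i)) != 0;
                String status;
                try {
                    reader.setFeature(FEATURES[i], value);
                    status = String.valueOf(value);
                } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
                    status = "unsupported";
                }
                String shortName = FEATURES[i].substring(FEATURES[i].lastIndexOf('/') + 1);
                line.append(shortName).append('=').append(status).append(' ');
            }
            report.add(line.toString().trim());
        }
        return report;
    }

    public static void main(String[] args) {
        for (String line : probe()) {
            System.out.println(line);
        }
    }
}
```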

Test Suite Availability

If you'd like to run the test suite for yourself, you can download it from http://www.cafeconleche.org/SAXTest/. An Ant build file is included that will produce both the expected data and the individual parser results from the W3C XML Test Suite. Because the license agreement for the test suite is unclear, you'll also need to download that from the W3C at http://www.w3.org/XML/Test and install it in the same directory. The results cited here were produced using the December 10, 2003 drop of the XML test suite. Of course changes to both the test suite and the parsers are likely to change the exact numbers.

Footnotes

[1] I think this indicates that the designers of XML made the wrong choice about where to cut the difference between validating and non-validating parsers. Since all parsers must support the internal DTD subset without exception, it's really not very hard to also add support for the external DTD subset. Many parser writers have expressed a desire to be able to ignore the internal and external DTD subsets. However, a parser that did this would not be conformant to XML 1.0, much less to SAX.

[2] Xerces is not incorrect here, and indeed passes this test. However, it does not report the maximum possible amount of content before the well-formedness error which some other parsers do. To make the comparison work, this content needs to be added by hand.

[3] To add insult to injury, on further analysis the single test Saxon passed proved to be a false positive due to a bug in the comparison code. It really failed all tests.

[4] Local names must always be passed. Even according to David Brownell's rules, only qualified names are sometimes optional.

Bibliography

[Arnold 2001]: Arnold, Curt and David Brownell. November 12, 2001. http://xmlconf.sourceforge.net/?selected=sax.
[Bray 2004]: Bray, Tim and Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, and François Yergeau, eds., Extensible Markup Language (XML) 1.0 (Third Edition), February 4, 2004, http://www.w3.org/TR/2004/REC-xml-20040204/
[SAX]: Brownell, David and David Megginson. http://sax.sourceforge.net/.
[Brownell 2002]: Brownell, David. May 2, 2002. "[Sax-devel] endDocument throwing an exception". http://www.geocrawler.com/archives/3/13179/2002/5/50/8558085/
[Clark27]: Clark, James. http://dev.w3.org/cvsweb/~checkout~/2001/XML-Test-Suite/xmlconf/xmltest/not-wf/sa/027.xml?content-type=text/plain
[Clark45]: Clark, James. http://dev.w3.org/cvsweb/~checkout~/2001/XML-Test-Suite/xmlconf/xmltest/not-wf/sa/045.xml?content-type=text/plain
[Crimson]: Crimson 1.1 Release, http://xml.apache.org/crimson/
[expat]: Clark, James, et al. "The Expat XML Parser". http://expat.sourceforge.net
[XP]: Clark, James and Hussein Shafie, http://www.xmlmind.com/_xpforjaxp/docs/
[ContentHandler]: http://www.saxproject.org/apidoc/org/xml/sax/ContentHandler.html
[EntityResolver]: http://www.saxproject.org/apidoc/org/xml/sax/EntityResolver.html
[ErrorHandler]: http://www.saxproject.org/apidoc/org/xml/sax/ErrorHandler.html
[Fry 2003]: Fry, Christopher, et al. November 3, 2003. JSR 173: Streaming API for XML, http://jcp.org/en/jsr/detail?id=173
[GNU JAXP]: The Gnu JAXP Project. GNU JAXP. http://www.gnu.org/software/classpathx/jaxp/
[Harold 2002]: Harold, Elliotte Rusty. "Receiving Characters" in Processing XML with Java. Boston: Addison-Wesley, 2002, pp. 284-288. http://www.cafeconleche.org/books/xmljava/chapters/ch06s07.html
[Harold 2004]: Harold, Elliotte Rusty. February 19, 2004. "Too much malformed data is reported". http://nagoya.apache.org/bugzilla/show_bug.cgi?id=27081
[XOM]: Harold, Elliotte Rusty. http://www.cafeconleche.org/XOM/.
[Hunter]: Hunter, Jason. http://www.jdom.org
[ibm10v02]: IBM. http://dev.w3.org/cvsweb/~checkout~/2001/XML-Test-Suite/xmlconf/ibm/valid/P10/ibm10v02.xml?rev=1.1.1.1&content-type=text/plain
[Kay 2002]: Kay, Michael. "The Ælfred XML Parser", November 28, 2002. http://saxon.sourceforge.net/aelfred.html
[Le Hors 2000]: Le Hors, Arnaud, Philippe Le Hégaret, Lauren Wood, Gavin Nicol, Jonathan Robie, Mike Champion, and Steve Byrne. Document Object Model (DOM) Level 2 Core Specification, November 13 2000, http://www.w3.org/TR/2000/REC-DOM-Level-2-Core-20001113
[Megginson 1998]: Megginson, David. June 4, 1998. "Announcement: change in AElfred maintainer". http://listserv.heanet.ie/cgi-bin/wa?A2=ind9806&L=xml-l&T=0&F=&S=&P=9737
[Megginson 2004]: Megginson, David. March 3, 2004. "Re: [xml-dev] SAX - endDocument() confusion again". http://lists.xml.org/archives/xml-dev/200403/msg00048.html
[Megginson 2004 2]: Megginson, David. March 5, 2004. "SAX/Java Proposed Changes". http://lists.xml.org/archives/xml-dev/200403/msg00122.html
[Oracle]: Oracle XML Developer's Kit for Java 9.2.0.6.0, http://otn.oracle.com/tech/xml/xdk/xdk_java.html
[Oracle C]: Oracle XML Developer's Kit for C, http://otn.oracle.com/tech/xml/xdk/xdk_c.html
[Piccolo]: Oren, Yuval. "Piccolo XML Parser for Java". http://piccolo.sourceforge.net
[Perl XML]: The Perl XML Project. Perl::SAX. http://sax.perl.org/
[sax-devel]: http://sourceforge.net/mailarchive/forum.php?forum_id=1472&max_rows=25&style=ultimate&viewmonth=200205
[SAXException]: http://www.saxproject.org/apidoc/org/xml/sax/SAXException.html
[SortedMap]: http://java.sun.com/j2se/1.4.2/docs/api/java/util/SortedMap.html
[DOM4J]: Strachan, James. http://dom4j.org/
[Sun]: Sun Microsystems. http://dev.w3.org/cvsweb/2001/XML-Test-Suite/xmlconf/sun/cxml.html?rev=1.3
[Van Vleet 2001]: VanVleet, Lynda, G. Ken Holman, and David Marston. March 3, 2001. "OASIS XSLT/XPath Conformance Committee Procedures and Deliverables". http://www.w3.org/2001/01/qa-ws/pp/ken-holman-oasis/xsltconf.htm
[Veillard]: Veillard, Daniel. "The XML C parser and toolkit of Gnome" http://xmlsoft.org
[W3C XML Group]: W3C XML Group http://www.w3.org/XML/Test.
[Wilson]: Wilson, John. May 2, 2002. "[Sax-devel] endDocument throwing an exception". http://www.geocrawler.com/mail/msg.php3?msg_id=8558085&list=13179
[Xerces C]: XML Apache Project, Xerces C++ Parser, http://xml.apache.org/xerces-c/
[Xerces J]: XML Apache Project, Xerces 2.6.1, http://xml.apache.org/xerces2-j/
[xml.sax]: xml.sax. http://xpipe.sourceforge.net/cgi-bin/doc.py?module=xml.sax
