Testing XML

Elliotte Rusty Harold

Thursday, September 14, 2006

elharo@metalab.unc.edu

http://www.cafeaulait.org/

Benefits of Test Driven Development are Well Known

Faster development; faster time to market
More robust, error free code
YAGNI: Writing tests first avoids writing unneeded code.
Easier to find and fix bugs
Easier to add features or change behavior; less worry about unintentionally introducing bugs
Makes refactoring/optimization possible: any change that doesn't break a test suite is de facto acceptable.

More and More Applications Are Generating XML

Web Services
File formats: OpenOffice, Word 12, etc.
Config files: Apple's plist format
RSS/Atom
And this must be tested!

XML is not Just a Text File

Cannot do a straight binary compare
Cannot do a straight text compare
Must use a parser based test tool

Different ways of representing the same syntax

CDATA sections vs. entity references:

<![CDATA[<Oxygen/> has an Eclipse plugin for editing XML]]>

&lt;Oxygen/> has an Eclipse plugin for editing XML
Entity references vs. numeric character references

&lt;Oxygen/> has an Eclipse plugin for editing XML

&#60;Oxygen/> has an Eclipse plugin for editing XML
Decimal vs. hexadecimal character references

&#60;Oxygen/> has an Eclipse plugin for editing XML

&#x3C;Oxygen/> has an Eclipse plugin for editing XML
Attribute order

<property name="packages" value="nu.xom.*"/>

<property value="nu.xom.*" name="packages" />
White space inside tags

<property name="packages" value="nu.xom.*"/>

<property  name = "packages"  value = "nu.xom.*" />
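
A minimal sketch of why a parser-based comparison is needed, using only the DOM parser bundled with the JDK. The class and method names (SyntaxVariantsDemo, sameContent) are made up for illustration. It assumes a DOM Level 3 implementation (Java 5 or later) and turns coalescing on so CDATA sections are reported as ordinary text:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class SyntaxVariantsDemo {

    // Parse a string and return its root element. Checked exceptions are
    // wrapped so the helper is easy to call from a test.
    static Element root(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setCoalescing(true); // report CDATA sections as plain text
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)))
                          .getDocumentElement();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    // True if the two strings parse to equal DOM trees.
    static boolean sameContent(String xml1, String xml2) {
        return root(xml1).isEqualNode(root(xml2));
    }

    public static void main(String[] args) {
        String cdata  = "<p><![CDATA[<Oxygen/> has an Eclipse plugin]]></p>";
        String entity = "<p>&lt;Oxygen/> has an Eclipse plugin</p>";
        System.out.println("text compare:   " + cdata.equals(entity));
        System.out.println("parsed compare: " + sameContent(cdata, entity));
    }
}
```

The text comparison sees two different strings; the parsed comparison sees one element with one text child in both cases.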

Content can change but still be OK

Unexpected content
Different order
Comments
Processing instructions
Namespace prefixes
Boundary whitespace

Key Question

Does this document contain the information it needs to contain?
The question is not: "Does it not contain anything else?"

An XML document

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" 
                       "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>CFBundleDevelopmentRegion</key>
    <string>English</string>
    <key>CFBundleExecutable</key>
    <string>thunderbird-bin</string>
    <key>CFBundleGetInfoString</key>
    <string>Thunderbird 1.0.2, © 2005 The Mozilla Organization</string>
    <key>CFBundleIconFile</key>
    <string>thunderbird</string>
    <key>CFBundleIdentifier</key>
    <string>org.mozilla.thunderbird</string>
    <key>CFBundleInfoDictionaryVersion</key>
    <string>6.0</string>
    <key>CFBundleName</key>
    <string>Thunderbird</string>
    <key>CFBundlePackageType</key>
    <string>APPL</string>
    <key>CFBundleShortVersionString</key>
    <string>1.0.2</string>
    <key>CFBundleSignature</key>
    <string>MOZM</string>
    <key>CFBundleVersion</key>
    <string>1.0.2</string>
    <key>NSAppleScriptEnabled</key>
    <true/>
</dict>
</plist>

An Alternate Representation of the Same document

<?xml version="1.0" encoding="MacRoman"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd" [
  <!ENTITY version "1.0.2">
]>
<plist version = '1.0'>
<dict>
    <key>CFBundleDevelopmentRegion</key>
    <string>English</string>
    <key>CFBundleExecutable</key>
    <string>thunderbird-bin</string>
    <key>CFBundleGetInfoString</key>
    <string>Thunderbird &version;, &#xA9; 2005 The Mozilla Organization</string>
    <key>CFBundleIconFile</key>
    <string>thunderbird</string>
    <key>CFBundleIdentifier</key>
    <string>org.mozilla.thunderbird</string>
    <key>CFBundleInfoDictionaryVersion</key>
    <string>6.0</string>
    <key>CFBundleName</key>
    <string>Thunderbird</string>
    <key>CFBundlePackageType</key>
    <string>APPL</string>
    <key>CFBundleShortVersionString</key>
    <string>&version;</string>
    <key>CFBundleSignature</key>
    <string>MOZM</string>
    <key>CFBundleVersion</key>
    <string>&version;</string>
    <key>NSAppleScriptEnabled</key>
    <true/>
</dict>
</plist>

A Representation of the Same Information in a Different Document

<?xml version="1.0" encoding="MacRoman"?>
<?xml-stylesheet href="plist.css" type="text/css"?>
<!-- Removing all the white space may not make this document as easy to
     read, but it could make it faster to parse since there are fewer nodes
     to handle. -->
<!DOCTYPE plist SYSTEM "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0"><dict>
<key>CFBundleVersion</key><string>1.0.2</string>
<key>NSAppleScriptEnabled</key><true/>
<key>CFBundleIdentifier</key><string>org.mozilla.thunderbird</string>
<key>CFBundleInfoDictionaryVersion</key><string>6.0</string>
<key>CFBundleName</key><string>Thunderbird</string>
<key>CFBundlePackageType</key><string>APPL</string>
<key>CFBundleShortVersionString</key><string>1.0.2</string>
<key>CFBundleSignature</key><string>MOZM</string>
<key>CFBundleDevelopmentRegion</key><string>English</string>
<key>CFBundleExecutable</key><string>thunderbird-bin</string>
<key>CFBundleGetInfoString</key><string>Thunderbird 1.0.2, © 2005 The Mozilla Organization</string>
<key>CFBundleIconFile</key><string>thunderbird</string>
</dict></plist>

The XML Infoset

The InfoSet defines 11 Kinds of Information Items

The Document Information Item
Element Information Items
Attribute Information Items
Processing instruction Information Items
Unexpanded Entity Reference Information Items
Character Information Items
Comment Information Items
The Document Type Declaration Information Item
Unparsed Entity Information Items
Notation Information Items
Namespace Declaration Information Items

Not everyone agrees that this is a good thing, or that this is the right list!

Element Information Items

An Element Information Item includes:

namespace name
local name
children: a list of element, processing instruction, unexpanded entity reference, character, and comment information items, one for each element, processing instruction, unexpanded entity reference, data character, and comment appearing immediately within the current element
attributes: an unordered set of attribute information items, one for each of the attributes (specified or defaulted from the DTD) of this element. Namespace declarations (xmlns attributes) are not included.
declared namespaces: an unordered set of namespace declaration information items, one for each of the namespaces declared either in the start-tag of this element or defaulted from the DTD.
in-scope namespaces: An unordered set of namespace declaration information items, one for each of the namespaces in effect for this element
base URI: The absolute URI of the external entity in which this element appears, as defined in XML Base. If this is not known, this property is null.
parent

The InfoSet Omits:

The internal and external DTD subsets; especially ELEMENT and ATTLIST declarations
Whether an empty element uses two tags or one
What kind of quotes surround attributes
Insignificant white space in attributes
White space that occurs between attributes
Attribute order
CDATA sections
Parsed entities
Comments in the DTD

Infoset-significant details that sometimes should be ignored anyway

You may well wish to ignore more details when comparing:

Boundary whitespace
Child element order
Comments
Processing instructions
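
One way to ignore some of these details is to strip them from both DOM trees before comparing. The sketch below is only an illustration under stated assumptions: the names LooseCompareDemo, strip and looselyEqual are invented here, whitespace-only text is treated as boundary whitespace, and child element order is still significant (ignoring order needs more work than fits on a slide).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class LooseCompareDemo {

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    // Recursively remove comments, processing instructions and
    // whitespace-only text nodes below the given node.
    static void strip(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // fetch before any removal
            short type = child.getNodeType();
            boolean blankText = type == Node.TEXT_NODE
                && child.getNodeValue().trim().isEmpty();
            if (type == Node.COMMENT_NODE
                || type == Node.PROCESSING_INSTRUCTION_NODE
                || blankText) {
                node.removeChild(child);
            } else {
                strip(child);
            }
            child = next;
        }
    }

    // Compare two documents after stripping the ignorable details.
    static boolean looselyEqual(String xml1, String xml2) {
        Document d1 = parse(xml1);
        Document d2 = parse(xml2);
        strip(d1);
        strip(d2);
        d1.normalize(); // merge text nodes left adjacent by the removals
        d2.normalize();
        return d1.getDocumentElement().isEqualNode(d2.getDocumentElement());
    }
}
```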

Direct Testing with DOM

More or less Infoset based
Supported out of the box in Java 1.4 and later
But very painful: JDOM, XOM, etc. are much easier to use
Basic approach (irrespective of API):
1. Parse document (possibly in a fixture)
2. Navigate to the piece you want to test
3. Use Java (or Python, or C#, or whatever) to make the test

DOM Example

private Document plist;

protected void setUp() 
  throws IOException, ParserConfigurationException, SAXException {
  DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
  factory.setNamespaceAware(true); // NEVER FORGET THIS!
  DocumentBuilder builder = factory.newDocumentBuilder();
  
  plist = builder.parse(new File("thunderbirdplist.xml"));
}

public void testNoTwoKeyElementsAreAdjacentDOM() {
   
  Element root = plist.getDocumentElement();
  Element dict = (Element) root.getElementsByTagName("dict").item(0);
  NodeList children = dict.getElementsByTagName("*");
  for (int i = 0; i < children.getLength(); i++) {
    Node element = children.item(i);
    if (element.getNodeName().equals("key")) {
      Node next = children.item(i+1);
      // a key must be followed by something...
      assertNotNull("key has no following element", next);
      // ...and that something must not be another key
      assertFalse(next.getNodeName().equals("key"));
    }
  }
  
}

Canonical XML

Resolves all purely syntactic differences so binary comparisons are possible.
Equal infosets compare equal; non-equal infosets compare unequal
May be too strong: counts boundary white space, element order, etc.
Occasionally too weak: misses attribute types and document type declaration
Comments are included or excluded at user option
Exclusive XML canonicalization avoids a few bugs in the spec

A Canonicalized Document

No DOCTYPE
No entity references except for the five predefined ones
No numeric character references
UTF-8
No XML declaration
No empty-element tags
Double quotes on normalized attribute values
No extra white space inside tags

<plist version="1.0">
<dict>
    <key>CFBundleDevelopmentRegion</key>
    <string>English</string>
    <key>CFBundleExecutable</key>
    <string>thunderbird-bin</string>
    <key>CFBundleGetInfoString</key>
    <string>Thunderbird 1.0.2, © 2005 The Mozilla Organization</string>
    <key>CFBundleIconFile</key>
    <string>thunderbird</string>
    <key>CFBundleIdentifier</key>
    <string>org.mozilla.thunderbird</string>
    <key>CFBundleInfoDictionaryVersion</key>
    <string>6.0</string>
    <key>CFBundleName</key>
    <string>Thunderbird</string>
    <key>CFBundlePackageType</key>
    <string>APPL</string>
    <key>CFBundleShortVersionString</key>
    <string>1.0.2</string>
    <key>CFBundleSignature</key>
    <string>MOZM</string>
    <key>CFBundleVersion</key>
    <string>1.0.2</string>
    <key>NSAppleScriptEnabled</key>
    <true></true>
</dict>
</plist>

JUnit Canonicalization (using XOM)

import java.io.*;
import nu.xom.*;
import nu.xom.canonical.*;
import junit.framework.Assert;

public class CanonicalAssert extends Assert {

    public static void assertCanonicalEquals(Document expected, Document actual) {
        
        ByteArrayOutputStream expectedBytes = new ByteArrayOutputStream();
        ByteArrayOutputStream actualBytes = new ByteArrayOutputStream();
        
        try {
            Canonicalizer expectedCanonicalizer 
              = new Canonicalizer(expectedBytes);
            expectedCanonicalizer.write(expected);
            byte[] expectedArray = expectedBytes.toByteArray();
        
            Canonicalizer actualCanonicalizer 
              = new Canonicalizer(actualBytes);
            actualCanonicalizer.write(actual);
            byte[] actualArray = actualBytes.toByteArray();
            
            assertEquals(expectedArray.length, actualArray.length);
            for (int i = 0; i < expectedArray.length; i++) {
                assertEquals(expectedArray[i], actualArray[i]);
            }
        }
        catch (IOException ex) {
            fail("IOException while canonicalizing");
        }        
        
    }    

}

Document Subset Canonicalization

Use an XPath to select the part of the document to canonicalize
Result may not be well-formed, but will be a byte sequence
Infoset inclusions and omissions pretty much the same as with full document canonicalization
Inheritance of xml: attributes

JUnit Document Subset Canonicalization (using XOM 1.1)

public void assertCanonicalEquals(Document expected, Document actual, String xpath) {
  
  ByteArrayOutputStream expectedBytes = new ByteArrayOutputStream();
  ByteArrayOutputStream actualBytes = new ByteArrayOutputStream();
  
  try {
    Canonicalizer expectedCanonicalizer = new Canonicalizer(expectedBytes);
    Nodes expectedNodes = expected.query(xpath);
    expectedCanonicalizer.write(expectedNodes);
    byte[] expectedArray = expectedBytes.toByteArray();
  
    Canonicalizer actualCanonicalizer = new Canonicalizer(actualBytes);
    Nodes actualNodes = actual.query(xpath);
    actualCanonicalizer.write(actualNodes);
    byte[] actualArray = actualBytes.toByteArray();
  
    assertEquals(expectedArray.length, actualArray.length);
    for (int i = 0; i < expectedArray.length; i++) {
      assertEquals(expectedArray[i], actualArray[i]);
    }
  }
  catch (IOException ex) {
    fail("IOException while canonicalizing");
  }    
  
}

Exclusive XML Canonicalization

Same as canonical XML for full documents.
No inheritance of xml: attributes
For document subsets, only namespace declarations that are visibly used are output; unused in-scope namespaces inherited from ancestors are omitted
Normally the better choice for document subset canonicalization

JUnit Exclusive Document Subset Canonicalization

public void assertCanonicalEquals(Document expected, Document actual, String xpath) {
  
  ByteArrayOutputStream expectedBytes = new ByteArrayOutputStream();
  ByteArrayOutputStream actualBytes = new ByteArrayOutputStream();
  
  try {
    Canonicalizer expectedCanonicalizer = new Canonicalizer(
      expectedBytes, Canonicalizer.EXCLUSIVE_XML_CANONICALIZATION);
    Nodes expectedNodes = expected.query(xpath);
    expectedCanonicalizer.write(expectedNodes);
    byte[] expectedArray = expectedBytes.toByteArray();
  
    Canonicalizer actualCanonicalizer = new Canonicalizer(
      actualBytes, Canonicalizer.EXCLUSIVE_XML_CANONICALIZATION);
    Nodes actualNodes = actual.query(xpath);
    actualCanonicalizer.write(actualNodes);
    byte[] actualArray = actualBytes.toByteArray();
  
    assertEquals(expectedArray.length, actualArray.length);
    for (int i = 0; i < expectedArray.length; i++) {
      assertEquals(expectedArray[i], actualArray[i]);
    }
  }
  catch (IOException ex) {
    fail("IOException while canonicalizing");
  }    
  
}

Canonicalization tools and libraries

Validity

Validity is not required, but it's very useful for testing
Very easy to use (as testing XML goes)
Great tool support
Well understood and documented

DTDs

<!ENTITY % plistObject 
  "(array | data | date | dict | real | integer | string | true | false )" >
<!ELEMENT plist %plistObject;>
<!ATTLIST plist version CDATA "1.0" >

<!-- Collections -->
<!ELEMENT array (%plistObject;)*>
<!ELEMENT dict (key, %plistObject;)*>
<!ELEMENT key (#PCDATA)>

<!--- Primitive types -->
<!ELEMENT string (#PCDATA)>
<!ELEMENT data (#PCDATA)> <!-- Contents interpreted as Base-64 encoded -->
<!ELEMENT date (#PCDATA)> <!-- Contents should conform to a subset of ISO 8601 
                               (in particular, YYYY '-' MM '-' DD 'T' HH ':' MM ':' SS 'Z'.  
                               Smaller units may be omitted with a loss of precision) -->

<!-- Numerical primitives -->
<!ELEMENT true EMPTY>  <!-- Boolean constant true -->
<!ELEMENT false EMPTY> <!-- Boolean constant false -->
<!ELEMENT real (#PCDATA)> <!-- Contents should represent a 
                               floating point number matching 
                               ("+" | "-")? d+ ("."d*)? ("E" ("+" | "-") d+)? 
                               where d is a digit 0-9.  -->
<!ELEMENT integer (#PCDATA)> <!-- Contents should represent a (possibly signed) 
                                  integer number in base 10 -->

Validating via JUnit

public void testValidOutput() throws SAXException, IOException {

  File f = new File("filename.xml");
  InputSource in = new InputSource(new FileInputStream(f));
  
  XMLReader parser = XMLReaderFactory.createXMLReader(); 
  parser.setFeature("http://xml.org/sax/features/validation", true);
  parser.setErrorHandler(new ErrorHandler() {

  public void warning(SAXParseException exception) {
    // skip
  }

  public void error(SAXParseException exception) 
   throws SAXException {
    throw exception;  
  }

  public void fatalError(SAXParseException exception) 
   throws SAXException {
    throw exception;   
  }
    
  });
  parser.parse(in);

}

Changing the DTD

DTD validation is always relative to DTD specified by DOCTYPE.
Sometimes you need to check a document that has no DOCTYPE.
Sometimes you want to substitute a different DTD
Use an EntityResolver

import org.xml.sax.*;
import java.io.*;

public class LocalResolver implements EntityResolver {
 
   public InputSource resolveEntity (String publicID, String systemID)
     throws SAXException, IOException
   {
     // publicID is null when the DOCTYPE has only a system identifier
     if ("-//Apple Computer//DTD PLIST 1.0//EN".equals(publicID)) {
       InputStream in = new FileInputStream("plist.dtd");
       return new InputSource(in);
     } 
     else {
       return null;
     }
   }
 }
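
The resolver only takes effect once it is installed on the parser. A sketch of that wiring, assuming the JAXP DOM parser and a cut-down, hypothetical stand-in for the plist DTD held in a string so nothing is fetched from disk or the network (ResolverDemo, DTD and isValid are names invented for this example):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ResolverDemo {

    static final String PUBLIC_ID = "-//Apple Computer//DTD PLIST 1.0//EN";

    // Hypothetical, simplified substitute for the real plist DTD.
    static final String DTD =
          "<!ELEMENT plist (dict)>"
        + "<!ATTLIST plist version CDATA \"1.0\">"
        + "<!ELEMENT dict (key, (string | true | false))*>"
        + "<!ELEMENT key (#PCDATA)>"
        + "<!ELEMENT string (#PCDATA)>"
        + "<!ELEMENT true EMPTY>"
        + "<!ELEMENT false EMPTY>";

    // Validate against whatever DTD the resolver substitutes.
    static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setEntityResolver(new EntityResolver() {
                public InputSource resolveEntity(String publicID, String systemID) {
                    if (PUBLIC_ID.equals(publicID)) {
                        return new InputSource(new StringReader(DTD));
                    }
                    return null; // fall back to default resolution
                }
            });
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException ex) throws SAXException {
                    throw ex; // validity errors are not fatal by default
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException ex) {
            return false;
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```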

Mock DOCTYPE

Trickier when there's no DOCTYPE at all.
Need to add a DOCTYPE directly into the stream the parser reads.
In Java, do this with SequenceInputStream and mark and reset
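
A rough sketch of the SequenceInputStream half of this idea. It assumes the document has neither a DOCTYPE nor an XML declaration; if it may start with an XML declaration, the DOCTYPE has to go after it, which is where mark and reset come in (not shown). The DOCTYPE, with its one-element internal DTD subset, and the names MockDoctypeDemo, withDoctype and isValid are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class MockDoctypeDemo {

    // Hypothetical DOCTYPE with the whole DTD in the internal subset.
    static final String DOCTYPE =
        "<!DOCTYPE greeting [ <!ELEMENT greeting (#PCDATA)> ]>";

    // Put the DOCTYPE bytes in front of the document bytes.
    static InputStream withDoctype(InputStream document) {
        InputStream prefix = new ByteArrayInputStream(
            DOCTYPE.getBytes(StandardCharsets.UTF_8));
        return new SequenceInputStream(prefix, document);
    }

    // Validate the document as if it had carried the DOCTYPE all along.
    static boolean isValid(InputStream document) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException ex) throws SAXException {
                    throw ex;
                }
            });
            builder.parse(withDoctype(document));
            return true;
        } catch (SAXException ex) {
            return false;
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```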

W3C Schemas

Additional Data Typing
Much easier to attach different schemas

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="plist">
    <xsd:complexType>
      <xsd:sequence minOccurs="1" maxOccurs="1">
        <xsd:element name="dict">
          <xsd:complexType>
            <xsd:sequence minOccurs="1" maxOccurs="123">
              <xsd:element name="key"   type="xsd:token"/>
              <xsd:choice>
                <xsd:element name="string" type="xsd:string"/>
                <xsd:element name="true"></xsd:element>
                <xsd:element name="false"></xsd:element>
              </xsd:choice>
            </xsd:sequence>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
      <xsd:attribute name="version" type="xsd:string" fixed="1.0"/>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>

JUnit test for W3C Schemas

public void testSchemaValidOutput() throws SAXException, IOException {

  File f = new File("filename.xml");
  InputSource in = new InputSource(new FileInputStream(f));

  XMLReader parser = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser"); 
  parser.setFeature("http://xml.org/sax/features/validation", true);
  parser.setFeature("http://apache.org/xml/features/validation/schema", true);
  parser.setProperty(
   "http://apache.org/xml/properties/schema/external-noNamespaceSchemaLocation", 
   "examples/plist.xsd");
  // also http://apache.org/xml/properties/schema/external-schemaLocation
  parser.setErrorHandler(new ErrorHandler() {

  public void warning(SAXParseException exception) throws SAXException {
    // skip
  }

  public void error(SAXParseException exception) throws SAXException {
    throw exception;      
  }

  public void fatalError(SAXParseException exception) throws SAXException {
    throw exception;       
  }
    
  });
  parser.parse(in);
  
}
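
The test above relies on Xerces-specific features and properties. On Java 5 and later, the parser-independent javax.xml.validation API can do the same job. A sketch under stated assumptions: the schema is a trimmed, hypothetical variant of the plist schema kept in a string, and XsdValidationDemo and isValid are names made up here.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class XsdValidationDemo {

    // Trimmed-down stand-in for plist.xsd.
    static final String XSD =
          "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
        + " <xsd:element name='plist'>"
        + "  <xsd:complexType>"
        + "   <xsd:sequence>"
        + "    <xsd:element name='dict'>"
        + "     <xsd:complexType>"
        + "      <xsd:sequence minOccurs='0' maxOccurs='unbounded'>"
        + "       <xsd:element name='key' type='xsd:token'/>"
        + "       <xsd:choice>"
        + "        <xsd:element name='string' type='xsd:string'/>"
        + "        <xsd:element name='true'/>"
        + "        <xsd:element name='false'/>"
        + "       </xsd:choice>"
        + "      </xsd:sequence>"
        + "     </xsd:complexType>"
        + "    </xsd:element>"
        + "   </xsd:sequence>"
        + "   <xsd:attribute name='version' type='xsd:string' fixed='1.0'/>"
        + "  </xsd:complexType>"
        + " </xsd:element>"
        + "</xsd:schema>";

    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // With no ErrorHandler set, errors are thrown as SAXExceptions.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException ex) {
            return false;
        } catch (IOException ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```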

RELAX NG schema

The most powerful schema language of all
Quite easy to specify different schemas

namespace a = "http://relaxng.org/ns/compatibility/annotations/1.0"

plistObject =
  array | data | date | dict | real | integer | \string | true | false
plist = element plist { attlist.plist, plistObject }
attlist.plist &= [ a:defaultValue = "1.0" ] attribute version { text }?

# Collections
array = element array { attlist.array, plistObject* }
attlist.array &= empty
dict = element dict { attlist.dict, (key, plistObject)* }
attlist.dict &= empty
key = element key { attlist.key, text }
attlist.key &= empty

# - Primitive types
\string = element string { attlist.string, text }
attlist.string &= empty
data = element data { attlist.data, text }
attlist.data &= empty

# Contents interpreted as Base-64 encoded
date = element date { attlist.date, text }
attlist.date &= empty
# Contents should conform to a subset of ISO 8601 
# (in particular, YYYY '-' MM '-' DD 'T' HH ':' MM ':' SS 'Z'.  
# Smaller units may be omitted with a loss of precision)

# Numerical primitives
true = element true { attlist.true, empty }
attlist.true &= empty

# Boolean constant true
false = element false { attlist.false, empty }
attlist.false &= empty

# Boolean constant false
real = element real { attlist.real, text }
attlist.real &= empty

# Contents should represent a floating point number
# matching ("+" | "-")? d+ ("."d*)? ("E" ("+" | "-") d+)? 
# where d is a digit 0-9.
integer = element integer { attlist.integer, text }
attlist.integer &= empty
start = plist

JUnit test for RELAX NG Schemas

Not bundled with JDK; must install third party library
Can still use javax.xml.validation API.

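public void testRELAXNGValid() 
  throws ParserConfigurationException, SAXException, IOException {

  // some of this might be moved into fixtures
  DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
  builderFactory.setNamespaceAware(true); // NEVER FORGET THIS!
  DocumentBuilder parser = builderFactory.newDocumentBuilder();
  Document document = parser.parse(new File("filename.xml"));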

  SchemaFactory factory 
    = SchemaFactory.newInstance(XMLConstants.RELAXNG_NS_URI);
  Source source = new StreamSource(new File("plist.rnc"));
  Schema schema = factory.newSchema(source);
  Validator validator = schema.newValidator();
 
  validator.validate(new DOMSource(document));
  // throws exception if document is invalid

}

XPath

More declarative
Query for presence (or absence) of specific content
Ignore everything else.
More robust, less specific navigation with // and descendant axis
boolean() function reduces many XPaths to true-false answers
Can be plugged into various APIs: DOM, XOM, JDOM, etc.

Some XPath Tests for the plist

There is a CFBundleExecutable:
boolean(//key[. = 'CFBundleExecutable'])
There is exactly one CFBundleIconFile:
count(//key[. = 'CFBundleIconFile']) = 1
The software is copyrighted by Mozilla:
contains(//key[. = 'CFBundleGetInfoString']/following-sibling::string, '© 2005 The Mozilla Organization')
The CFBundleSignature is four letters:
string-length(//key[. = 'CFBundleSignature']/following-sibling::string) = 4
No two key elements are adjacent:
count(//key/following-sibling::*[1]/self::key) = 0
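
These expressions can be tried out directly with javax.xml.xpath. The sketch below wraps them in a small helper and runs them against a shortened, hypothetical copy of the plist held in a string. PlistXPathDemo and check are names invented for this example, and the copyright sign is written as \u00A9 so the source file encoding does not matter.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class PlistXPathDemo {

    // Shortened sample plist; only the keys the checks look at.
    static final String PLIST =
          "<plist version='1.0'><dict>"
        + "<key>CFBundleExecutable</key><string>thunderbird-bin</string>"
        + "<key>CFBundleGetInfoString</key>"
        + "<string>Thunderbird 1.0.2, \u00A9 2005 The Mozilla Organization</string>"
        + "<key>CFBundleIconFile</key><string>thunderbird</string>"
        + "<key>CFBundleSignature</key><string>MOZM</string>"
        + "</dict></plist>";

    // Evaluate an XPath expression against a document, as a boolean.
    static boolean check(String xml, String xpath) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(xml)));
            XPath query = XPathFactory.newInstance().newXPath();
            return (Boolean) query.evaluate(xpath, doc, XPathConstants.BOOLEAN);
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        String[] tests = {
            "boolean(//key[. = 'CFBundleExecutable'])",
            "count(//key[. = 'CFBundleIconFile']) = 1",
            "string-length(//key[. = 'CFBundleSignature']/following-sibling::string) = 4",
            "count(//key/following-sibling::*[1]/self::key) = 0"
        };
        for (String test : tests) {
            System.out.println(check(PLIST, test) + "  " + test);
        }
    }
}
```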

XPath in a JUnit test

On top of javax.xml.xpath
Bundled with Java 1.5; a standard extension for Java 1.4 and earlier
There are other frameworks you could use

import org.xml.sax.InputSource;
import javax.xml.xpath.*;

import junit.framework.*;
import java.io.*;

public class PListXPathTest extends TestCase {

  private InputSource plist;
  private XPath query;
  
  protected void setUp() throws IOException {
    plist = new InputSource(new FileInputStream("thunderbirdplist.xml"));    
    query = XPathFactory.newInstance().newXPath();
  }
  
  public void testNoTwoKeyElementsAreAdjacent() 
    throws XPathExpressionException {
     // //key/following-sibling::*[1]/self::key is empty
     
     Boolean result = (Boolean) query.evaluate(
     "//key/following-sibling::*[1]/self::key", 
     plist, XPathConstants.BOOLEAN);
     assertFalse(result.booleanValue());
    
  }

}

Writing as XSLT

Very convenient for authoring tests
Doesn't work so well for unit testing
Relatively hard to debug when something breaks

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">

    <xsl:if test="not(//key[. = 'CFBundleExecutable'])">
      No CFBundleExecutable
    </xsl:if> 

    <xsl:if test="count(//key[. = 'CFBundleIconFile']) = 0">
      There is no CFBundleIconFile
    </xsl:if>

    <xsl:if test="count(//key[. = 'CFBundleIconFile']) &gt; 1">
      There is more than one CFBundleIconFile
    </xsl:if>
    
    <xsl:if test="not(contains(//key[. = 'CFBundleGetInfoString']/following-sibling::string, '© 2005 The Mozilla Organization'))">
      Missing copyright
    </xsl:if>

    <xsl:if test="string-length(//key[. = 'CFBundleSignature']/following-sibling::string) != 4">
      The CFBundleSignature is not four letters
    </xsl:if>

    <xsl:if test="count(//key/following-sibling::*[1]/self::key) != 0">
      Adjacent key elements
    </xsl:if>

  </xsl:template>

</xsl:stylesheet>
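
A stylesheet like this can be driven from Java with javax.xml.transform: run it, and treat any output as a failure report. The sketch below assumes a two-check, hypothetical abbreviation of the stylesheet kept in a string; XsltCheckDemo and complaints are names invented for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltCheckDemo {

    // Abbreviated version of the checking stylesheet.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' "
        + "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        + "<xsl:if test=\"not(//key[. = 'CFBundleExecutable'])\">"
        + "No CFBundleExecutable. </xsl:if>"
        + "<xsl:if test=\"count(//key/following-sibling::*[1]/self::key) != 0\">"
        + "Adjacent key elements. </xsl:if>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Returns the stylesheet's output: empty if every check passed.
    static String complaints(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toString().trim();
        } catch (TransformerException ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

In a JUnit test this reduces to asserting that complaints(...) returns the empty string, with the returned text doubling as the failure message.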

Combining XPath with Java

XPath is not Turing complete.
Some things are easier (or possible) if you don't do them in pure XPath.
Use XPath to select the relevant element: //key[. = 'CFBundleVersion']/following-sibling::string[1]
Then test its value with Java.

XPath+Java Test

CFBundleVersion looks like a version string
Find it with XPath
Test it with a regular expression: \d+\.\d+(\.\d+)?
Could use XPath 2.0, but not yet widely supported

public void testCFBundleVersionFormat() 
  throws XPathExpressionException {
       
    String regex = "\\d+\\.\\d+(\\.\\d+)?";
    String xpath = "//key[. = 'CFBundleVersion']/following-sibling::string[1]";
      
    String version = (String) query.evaluate(xpath, plist, XPathConstants.STRING);
      
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(version);
    assertTrue(matcher.matches());
      
}

Schematron

According to Schematron inventor Rick Jelliffe:
The Schematron differs in basic concept from other schema languages in that it is not based on grammars but on finding tree patterns in the parsed document. This approach allows many kinds of structures to be represented which are inconvenient and difficult in grammar-based schema languages.
Makes it easy to write tests for individual units of the XML document
XPath Based
Validator is implemented in XSLT
W3C Schemas are conservative: everything not permitted is forbidden.
Schematron is liberal: everything not forbidden is permitted.
Handles unordered structures very well
Handles descendant constraints very well
Almost self-documenting
http://www.ascc.net/xml/resource/schematron/schematron.html

Schematron Syntax

A schema contains a title and one or more patterns
Each pattern contains rule child elements
Each rule contains assert elements and has a context attribute
Each assert element has a test attribute containing an XPath expression which returns (or can be cast to) a boolean.
The contents of each assert element are printed if the assertion test fails

Schematron schema for plists

Includes all the constraints previously listed as XPaths:

<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
  <title>A Schematron Schema for the Thunderbird PLists</title>

  <pattern>
    <rule context="plist">
      <assert test=
        "//key[. = 'CFBundleExecutable']">
        There is a CFBundleExecutable.
      </assert>
      <assert test=
        "count(//key[. = 'CFBundleIconFile']) = 1">
        There is exactly one CFBundleIconFile
      </assert>
      <assert test=
        "contains(//key[. = 'CFBundleGetInfoString']/following-sibling::string, 
         '© 2005 The Mozilla Organization')">
       The software is copyrighted by Mozilla.
      </assert>
      <assert test="string-length(//key[. = 
        'CFBundleSignature']/following-sibling::string) = 4">
        The CFBundleSignature is four letters
      </assert>
    </rule>
  </pattern>

  <!-- some tests are simpler here -->
  <pattern>
    <rule context="key">
      <assert test=
        "name(following-sibling::*[1]) != 'key'">
        No two key elements are adjacent.
      </assert>
    </rule>
  </pattern>

</schema>

Running Schematron

Use skeleton1-5.xsl to generate an XSLT stylesheet:

$ xsltproc skeleton1-5.xsl plist.sct
<?xml version="1.0" standalone="yes"?>
<axsl:stylesheet xmlns:axsl="http://www.w3.org/1999/XSL/Transform" xmlns:sch="http://www.ascc.net/xml/schematron" version="1.0">
  <axsl:template match="*|@*" mode="schematron-get-full-path">
    <axsl:apply-templates select="parent::*" mode="schematron-get-full-path"/>
    <axsl:text>/</axsl:text>
    <axsl:if test="count(. | ../@*) = count(../@*)">@</axsl:if>
    <axsl:value-of select="name()"/>
    ...

Apply the stylesheet to the input documents:

$ xsltproc plist.xsl thunderbirdplist.xml
$

Or after deliberately invalidating the input:

$ xsltproc plist.xsl thunderbirdplist.xml
<?xml version="1.0"?>
The CFBundleSignature is four letters
$

You can customize the skeleton to produce different output.

JUnit test for Schematron Schema

The generated stylesheet can be integrated into testing like any other XSLT solution.

With a little meta work, the original Schematron schema can be integrated into the unit test.

public void testWithSchematron() 
 throws TransformerException, IOException {
  
  StreamSource skeleton = new StreamSource(new File("skeleton1-5.xsl"));
  StreamSource schema = new StreamSource(new File("plist.sct"));
  StringWriter temp = new StringWriter();
  StreamResult result = new StreamResult(temp);
  
  // generate the stylesheet
  TransformerFactory factory = TransformerFactory.newInstance();
  Transformer xform = factory.newTransformer(skeleton);
  xform.transform(schema, result);
  temp.flush();
  temp.close();
  String stylesheet = temp.toString();
  
  // now flip
  StringReader in = new StringReader(stylesheet);
  StreamSource sheet = new StreamSource(in);
  Transformer validator = factory.newTransformer(sheet);
  validator.setOutputProperty("method", "text");
  temp = new StringWriter();
  result = new StreamResult(temp);
  validator.transform(new StreamSource(new File("thunderbirdplist.xml")), result);
  temp.flush();
  String output = temp.toString();
  
  // Check for no output if all tests pass. 
  assertEquals(output, "", output); 
  // note use of output for both assertion message
  // and test
}

Same issue as handcrafted XSLT-stylesheet-based tests: not good for unit tests.
It is hard to find the place to set a breakpoint in the debugger and hard to step through, since you end up deep in the XSLT code. After all, the code is really XSLT, not Java. It's like trying to debug a Python program using an assembly-level debugger.
The assertion is funny: we're basically checking that the stylesheet produces no output. This requires the text output method.

XMLUnit

From the XMLUnit web page:

For those of you who've got into it you'll know that test driven development is great. It gives you the confidence to change code safe in the knowledge that if something breaks you'll know about it. Except for those bits you don't know how to test. Until now XML has been one of them. Oh sure you can use "<stuff></stuff>".equals("<stuff></stuff>"); but is that really gonna work when some joker decides to output a <stuff/>? -- damned right it's not ;-)

XML can be used for just about anything so deciding if two documents are equal to each other isn't as easy as a character for character match. Sometimes
<stuff-doc>
  <stuff>
    Stuff Stuff Stuff
  </stuff>
  <more-stuff>
    Some More Stuff
  </more-stuff>
</stuff-doc> 
equals
<stuff-doc>
  <more-stuff>
    Some More Stuff</more-stuff>
  <stuff>Stuff Stuff Stuff</stuff>
</stuff-doc> 
and sometimes it doesn't... With XMLUnit you get the control, and you get to decide.

XMLUnit

Developed by Jeff Martin and Tim Bacon
Open Source: BSD license
Addresses many of the issues we've been discussing today, but wraps it in a nice JUnit based framework
Based on JAXP and DOM
Can use XPath
Can parse badly formed HTML (or just use TagSoup)

A simple test case

import java.io.*;
import javax.xml.parsers.*;
import org.custommonkey.xmlunit.*;
import org.xml.sax.*;

public class SimpleTest extends XMLTestCase {

  public void testHelloWorld() 
    throws SAXException, IOException, ParserConfigurationException {
   
    String expected = "<GREETING>Hello World!</GREETING>";
    String actual = "<GREETING>Hello World!</GREETING>";
    assertXMLEqual(expected, actual);
    
  }

}

But this is not just a String comparison!


  public void testHelloWorld2() 
    throws SAXException, IOException, ParserConfigurationException {
   
    String expected = "<?xml version='1.0'?><GREETING >Hello World!</GREETING>";
    String actual = "<GREETING>Hello World!</GREETING>";
    assertXMLEqual(expected, actual);
    
  }

Readers and Documents

Can also compare two java.io.Reader objects whose content will be parsed

  public void testHelloWorld3() 
    throws SAXException, IOException, ParserConfigurationException {
   
    Reader in1 = new InputStreamReader(new FileInputStream("hello1.xml"), "UTF-8");
    Reader in2 = new InputStreamReader(new FileInputStream("hello2.xml"), "UTF-8");
    assertXMLEqual(in1, in2);
    
  }

This is poor design. Readers do not handle XML encoding properly. Do not use these methods.

Instead compare DOM documents:

  public void testHelloWorld4() 
    throws SAXException, IOException, ParserConfigurationException {
   
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true); // NEVER FORGET THIS!
    DocumentBuilder builder = factory.newDocumentBuilder();
    
    Document in1 = builder.parse(new File("hello1.xml"));
    Document in2 = builder.parse(new File("hello2.xml"));
    assertXMLEqual(in1, in2);
    
  }
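Why Readers are risky can be sketched with the JDK alone. In the example below, the class EncodingDemo and its method names are hypothetical. A document is stored as ISO-8859-1 bytes and declares that encoding. When the parser receives the raw bytes it can read the declaration itself; when the bytes are decoded beforehand with a guessed charset (UTF-8 here, as in testHelloWorld3), the text is already damaged before the parser sees it.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; the names are illustrative only.
class EncodingDemo {

  // A document stored as ISO-8859-1 bytes that declares its own encoding.
  static byte[] latin1Bytes() {
    String xml = "<?xml version='1.0' encoding='ISO-8859-1'?><a>\u00E9</a>";
    return xml.getBytes(StandardCharsets.ISO_8859_1);
  }

  static DocumentBuilderFactory newFactory() {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    return factory;
  }

  // Hand the parser raw bytes so it can honor the encoding declaration.
  static String textFromStream() {
    try {
      Document doc = newFactory().newDocumentBuilder()
          .parse(new ByteArrayInputStream(latin1Bytes()));
      return doc.getDocumentElement().getTextContent();
    } catch (Exception ex) {
      throw new IllegalStateException(ex);
    }
  }

  // Decode the bytes up front with a guessed charset, then parse characters.
  static String textFromReader() {
    try {
      Reader in = new InputStreamReader(
          new ByteArrayInputStream(latin1Bytes()), StandardCharsets.UTF_8);
      Document doc = newFactory().newDocumentBuilder()
          .parse(new InputSource(in));
      return doc.getDocumentElement().getTextContent();
    } catch (Exception ex) {
      throw new IllegalStateException(ex);
    }
  }
}
```

textFromStream() recovers the original e-acute character; textFromReader() does not, because the wrong charset was applied before parsing. Passing a File or InputStream to the DocumentBuilder, as in testHelloWorld4, avoids the guess entirely.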

Assertion messages

Of course you can provide your own assertion message:

    public void testHelloWorld5() 
      throws SAXException, IOException, ParserConfigurationException {
        
        String expected = "<GREETING>Hello World!</GREETING>";
        String actual = "<GREETING>\nHello World!\n</GREETING>";
        assertXMLEqual("White space seems to count", expected, actual);
        
    }

Concepts of Equality

Identical: no DOM-level differences inside the root element (XMLUnit always ignores the prolog).
Similar: some differences allowed:
- Element Order
- Namespace prefixes
- Attribute defaulted or present
- Boundary whitespace (optional)
This is tested by assertXMLEqual()/assertXMLNotEqual().
Not equal: clearly different information content

Some documents that are equal but not identical

These tests all pass:

  public void testSiblingOrder() 
    throws SAXException, IOException, ParserConfigurationException {
    
    String expected = "<a><x/><y/></a>";
    String actual = "<a><y/><x/></a>";
    assertXMLEqual("Sibling order seems to count", expected, actual);
    
  }

  public void testNamespacePrefix() 
    throws SAXException, IOException, ParserConfigurationException {
    
    String expected = "<a xmlns='http://www.example.org'><x/></a>";
    String actual = "<pre:a xmlns:pre='http://www.example.org'><pre:x/></pre:a>";
    assertXMLEqual(expected, actual);
    
  }
  
  public void testDOCTYPE() 
    throws SAXException, IOException, ParserConfigurationException {
    
    String expected = "<!DOCTYPE a [<!ATTLIST a b CDATA 'test'>]>\n" +
      "<a><x/></a>";
    String actual = "<a b='test'><x/></a>";
    assertXMLEqual(expected, actual);
    
  }
  
  public void testCommentInProlog() 
    throws SAXException, IOException, ParserConfigurationException {
    
    String expected = "<!-- test -->" +
      "<a><x/></a>";
    String actual = "<a><x/></a>";
    assertXMLEqual(expected, actual);
    
  }

  public void testProcessingInstructionInProlog() 
    throws SAXException, IOException, ParserConfigurationException {
    
    String expected = "<?xml-stylesheet type='text/css' href='file.css'?>" +
      "<a><x/></a>";
    String actual = "<a><x/></a>";
    assertXMLEqual(expected, actual);
    
  }
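The namespace prefix test passes because a namespace-aware parser reports each element by namespace URI and local name; the prefix is only a spelling. A minimal JDK-only sketch of that idea follows. The class NamespaceDemo and the {uri}local notation it prints are choices made for this illustration, not XMLUnit API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of XMLUnit.
class NamespaceDemo {

  // Parse a string and return its root element.
  static Element root(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      // Without this call the parser does not report namespace URIs at all.
      factory.setNamespaceAware(true);
      return factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement();
    } catch (Exception ex) {
      throw new IllegalStateException(ex);
    }
  }

  // Expanded name in {uri}local form; the prefix plays no part in it.
  static String expandedName(Element element) {
    return "{" + element.getNamespaceURI() + "}" + element.getLocalName();
  }
}
```

For the two documents in testNamespacePrefix, expandedName returns {http://www.example.org}a for both roots, even though getNodeName() returns a for one and pre:a for the other.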

CDATA Handling is Broken

This test fails:

    public void testCDATA() 
      throws SAXException, IOException, ParserConfigurationException {
        
        String expected = "<a>Hello</a>";
        String actual = "<a><![CDATA[Hello]]></a>";
        assertXMLEqual(expected, actual);
        
    }
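One possible workaround, assuming you parse the documents yourself and compare Document objects, is to ask JAXP to coalesce CDATA sections into ordinary text nodes before any comparison happens. The sketch below uses only the JDK; CoalescingDemo is a hypothetical name, and isEqualNode stands in for the comparison. Whether this makes the XMLUnit assertion above pass in your version is something to verify rather than assume.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of XMLUnit.
class CoalescingDemo {

  // Parse a string, optionally merging CDATA sections into text nodes.
  static Element root(String xml, boolean coalescing) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      factory.setCoalescing(coalescing);
      return factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement();
    } catch (Exception ex) {
      throw new IllegalStateException(ex);
    }
  }

  // Compare two documents as DOM trees under the chosen parser setting.
  static boolean equalTrees(String expected, String actual, boolean coalescing) {
    return root(expected, coalescing).isEqualNode(root(actual, coalescing));
  }
}
```

With coalescing off, <a>Hello</a> and <a><![CDATA[Hello]]></a> produce different node types and compare unequal; with coalescing on, both contain a single text node. A factory configured this way could presumably be handed to XMLUnit through setControlDocumentBuilderFactory and setTestDocumentBuilderFactory, which are listed on the XMLUnit configuration slide.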

The Diff class

More detailed comparisons

Can compare:

public Diff(String control, String test) throws SAXException, IOException, ParserConfigurationException
public Diff(Reader control, Reader test) throws SAXException, IOException, ParserConfigurationException
public Diff(Document controlDoc, Document testDoc)
public Diff(String control, Transform testTransform) throws IOException, TransformerException, ParserConfigurationException, SAXException
public Diff(InputSource control, InputSource test) throws SAXException, IOException, ParserConfigurationException
public Diff(DOMSource control, DOMSource test)

Distinguishes between similarity and identity:

public boolean similar()
public boolean identical()

Supports custom rules through configurable DifferenceEngine and ElementQualifier:

public Diff(Document controlDoc, Document testDoc, DifferenceEngine comparator)
public Diff(Document controlDoc, Document testDoc, DifferenceEngine comparator, ElementQualifier elementQualifier)

public void overrideDifferenceListener(DifferenceListener delegate)
public void overrideElementQualifier(ElementQualifier delegate)

Testing for identity

These tests all fail:

  public void testSiblingOrderIdentity() 
    throws SAXException, IOException, ParserConfigurationException {
    
    String expected = "<a><x/><y/></a>";
    String actual = "<a><y/><x/></a>";
    Diff diff = new Diff(expected, actual);
    assertTrue(diff.identical());
    
  }

  public void testNamespacePrefixIdentity() 
    throws SAXException, IOException, ParserConfigurationException {
    
    String expected = "<a xmlns='http://www.example.org'><x/></a>";
    String actual = "<pre:a xmlns:pre='http://www.example.org'><pre:x/></pre:a>";
    Diff diff = new Diff(expected, actual);
    assertTrue(diff.identical());
    
  }

  public void testDOCTYPEIdentity() 
    throws SAXException, IOException, ParserConfigurationException {
    
    String expected = "<!DOCTYPE a [<!ATTLIST a b CDATA 'test'>]>\n" +
      "<a><x/></a>";
    String actual = "<a b='test'><x/></a>";
    Diff diff = new Diff(expected, actual);
    assertTrue(diff.identical());
    
  }

Beware assertXMLIdentical. It's at least confusing and possibly exactly backwards.

XPath Based Tests

assertXpathExists: assert that an XPath expression selects at least one node
assertXpathNotExists: assert that an XPath expression does not select any nodes
assertXpathsEqual: assert that the node-sets obtained by evaluating two XPath expressions are similar
assertXpathsNotEqual: assert that the nodes obtained by evaluating two XPath expressions are different
assertXpathValuesEqual: assert that the string-value of two XPath expressions evaluated against two context nodes are similar
assertXpathValuesNotEqual: assert that the string-value of two XPath expressions are different
assertXpathEvaluatesTo: assert that the string-value of an XPath expression is equal to a specified string

XPath Example

  private Document plist;
  
  protected void setUp() 
    throws IOException, ParserConfigurationException, SAXException {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true); // NEVER FORGET THIS!
    DocumentBuilder builder = factory.newDocumentBuilder();
    
    plist = builder.parse(new File("thunderbirdplist.xml"));
  }
  
  public void testNoTwoKeyElementsAreAdjacent() 
    throws TransformerException {
     
    assertXpathNotExists(
     "//key/following-sibling::*[1]/self::key", 
     plist);
    
  }

  public void testCreatorCodeIsMOZM() throws TransformerException {
     
    assertXpathEvaluatesTo("MOZM",
     "//key[. = 'CFBundleSignature']/following-sibling::string",
     plist);
    
  }  
  
  
  public void testThereIsAnIcon() throws TransformerException {
     
    assertXpathExists(
     "//key[. = 'CFBundleIconFile']", 
     plist);
    assertXpathExists(
     "//key[. = 'CFBundleIconFile']/following-sibling::string", 
     plist);
    
  }
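The same XPath expressions can be tried outside a test case with the JDK's javax.xml.xpath package, which is handy when an assertion fails and you want to see what an expression actually selects. This is a sketch under assumptions: PlistXPathDemo is a hypothetical class, and the inline document is a tiny stand-in for thunderbirdplist.xml with a made-up icon file name and no DOCTYPE (so the parser does not try to fetch the DTD).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of XMLUnit.
class PlistXPathDemo {

  // Stand-in data; only the MOZM signature comes from the slides.
  static final String PLIST =
      "<plist version='1.0'><dict>"
      + "<key>CFBundleSignature</key><string>MOZM</string>"
      + "<key>CFBundleIconFile</key><string>example.icns</string>"
      + "</dict></plist>";

  static Document parse() throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    return factory.newDocumentBuilder()
        .parse(new InputSource(new StringReader(PLIST)));
  }

  // Number of nodes the expression selects (0 means "does not exist").
  static int count(String expression) {
    try {
      XPath xpath = XPathFactory.newInstance().newXPath();
      NodeList nodes = (NodeList) xpath.evaluate(
          expression, parse(), XPathConstants.NODESET);
      return nodes.getLength();
    } catch (Exception ex) {
      throw new IllegalStateException(ex);
    }
  }

  // String-value of the expression: for a node-set, the first node's text.
  static String stringValue(String expression) {
    try {
      XPath xpath = XPathFactory.newInstance().newXPath();
      return xpath.evaluate(expression, parse());
    } catch (Exception ex) {
      throw new IllegalStateException(ex);
    }
  }
}
```

Note that //key[. = 'CFBundleSignature']/following-sibling::string selects every later string sibling, two in this stand-in document. Its string-value is MOZM only because XPath takes the first node in document order; appending [1] to the expression makes that intent explicit.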

DifferenceListener

DifferenceListener interface compares two nodes and tells whether they're identical, similar, or different.
differenceFound method is invoked for non-identical nodes
This is one-way: a listener can declare two different nodes identical or similar, but it cannot declare two identical nodes unequal
Can't control the tree walking order or skip nodes completely

package org.custommonkey.xmlunit;

public interface DifferenceListener {

    public final int RETURN_ACCEPT_DIFFERENCE = 0;
    public final int RETURN_IGNORE_DIFFERENCE_NODES_IDENTICAL = 1;
    public final int RETURN_IGNORE_DIFFERENCE_NODES_SIMILAR = 2;

    public int differenceFound(Difference difference);
    public void skippedComparison(Node control, Node test);

}

A DifferenceListener That Considers Text Nodes and CDATA Sections Equal

import org.custommonkey.xmlunit.*;
import org.w3c.dom.Node;

public class CDATAEqualsText implements DifferenceListener {

  public int differenceFound(Difference diff) {

    Node expected = diff.getControlNodeDetail().getNode();
    Node actual = diff.getTestNodeDetail().getNode();
    
    if ((expected.getNodeType() == Node.CDATA_SECTION_NODE 
       && actual.getNodeType() == Node.TEXT_NODE)
       ||
       (actual.getNodeType() == Node.CDATA_SECTION_NODE 
       && expected.getNodeType() == Node.TEXT_NODE)) {
     
      if (expected.getNodeValue().equals(actual.getNodeValue())) {
        return RETURN_IGNORE_DIFFERENCE_NODES_IDENTICAL;
      }
      
    }
    
    return RETURN_ACCEPT_DIFFERENCE;
    
    // We could really use something like DOM's NodeFilter
    // to indicate whether to process or skip the children
    
  }


  public void skippedComparison(Node expected, Node actual) {}

}

Comparing Two Documents with the Custom DifferenceListener

  String expected = "<root>Hello</root>";
  String actual = "<root><![CDATA[Hello]]></root>";
  DifferenceListener listener = new CDATAEqualsText();
  Diff myDiff = new Diff(expected, actual);
  myDiff.overrideDifferenceListener(listener);
  assertTrue(myDiff.identical());

ElementQualifier

ElementQualifier determines which nodes to compare

Important for comparing elements in different order:

package org.custommonkey.xmlunit;

public interface ElementQualifier {

  public boolean qualifyForComparison(Element control, Element test);

}

Return true if the two elements should be compared, false otherwise
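As an illustration of the kind of rule a qualifier can express, here is a sketch that pairs elements only when they have the same expanded name and the same id attribute. The class NameAndIdQualifier, the choice of the id attribute, and the element helper are all assumptions made for this example. The class deliberately does not declare implements ElementQualifier so that it compiles with the JDK alone; in a real test you would add that clause.

```java
import java.io.StringReader;
import java.util.Objects;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical qualifier; same method shape as ElementQualifier.
class NameAndIdQualifier {

  // True if the two elements should be compared with each other.
  public boolean qualifyForComparison(Element control, Element test) {
    boolean sameName =
        Objects.equals(control.getNamespaceURI(), test.getNamespaceURI())
        && Objects.equals(control.getLocalName(), test.getLocalName());
    // getAttribute returns "" when the attribute is absent.
    boolean sameId = control.getAttribute("id").equals(test.getAttribute("id"));
    return sameName && sameId;
  }

  // Convenience for trying the rule out: parse a string, return its root.
  static Element element(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      return factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement();
    } catch (Exception ex) {
      throw new IllegalStateException(ex);
    }
  }
}
```

Under this rule <item id='1'/> qualifies for comparison with <item id='1'>x</item> (the content difference is then reported by the comparison itself), but not with <item id='2'/> or <entry id='1'/>. XMLUnit ships qualifiers of its own, so check its documentation before writing one.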

The XMLUnit class

Contains various global configuration methods:

package org.custommonkey.xmlunit;

public final class XMLUnit {

  public static void setControlParser(String className) throws FactoryConfigurationError;
  public static DocumentBuilder getControlParser() throws ParserConfigurationException;
  
  public static DocumentBuilderFactory getControlDocumentBuilderFactory();
  public static void setControlDocumentBuilderFactory(DocumentBuilderFactory factory);
  
  public static void setTestParser(String className) throws FactoryConfigurationError;
  public static DocumentBuilder getTestParser() throws ParserConfigurationException;
  
  public static DocumentBuilderFactory getTestDocumentBuilderFactory() ;
  public static void setTestDocumentBuilderFactory(DocumentBuilderFactory factory);
  
  public static void setIgnoreWhitespace(boolean ignore);
  public static boolean getIgnoreWhitespace();
}
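To show what ignoring boundary whitespace means at the DOM level, here is a rough JDK-only sketch that removes whitespace-only text nodes before comparing two trees. WhitespaceDemo and its methods are hypothetical, and this is not a description of how setIgnoreWhitespace is implemented inside XMLUnit, which may normalize text differently.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of XMLUnit.
class WhitespaceDemo {

  static Element root(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      return factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement();
    } catch (Exception ex) {
      throw new IllegalStateException(ex);
    }
  }

  // Recursively remove text nodes that contain nothing but whitespace.
  static void stripWhitespaceOnlyText(Node node) {
    Node child = node.getFirstChild();
    while (child != null) {
      Node next = child.getNextSibling(); // remember before removing
      if (child.getNodeType() == Node.TEXT_NODE
          && child.getNodeValue().trim().isEmpty()) {
        node.removeChild(child);
      } else {
        stripWhitespaceOnlyText(child);
      }
      child = next;
    }
  }

  // Strict tree comparison, whitespace included.
  static boolean equalAsParsed(String a, String b) {
    return root(a).isEqualNode(root(b));
  }

  // Tree comparison after dropping whitespace-only text nodes.
  static boolean equalIgnoringBoundaryWhitespace(String a, String b) {
    Element first = root(a);
    Element second = root(b);
    stripWhitespaceOnlyText(first);
    stripWhitespaceOnlyText(second);
    return first.isEqualNode(second);
  }
}
```

A pretty-printed document and its single-line equivalent differ under equalAsParsed but match under equalIgnoringBoundaryWhitespace. Whitespace inside real text, such as the line breaks around Hello World! in testHelloWorld5, is left alone by this sketch.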

To Learn More

This presentation: http://www.cafeconleche.org/slides/sdbestpractices2006/testingxml/
XMLUnit: http://xmlunit.sourceforge.net/
XOM: http://www.xom.nu/
JUnit: http://junit.sourceforge.net/
Schematron: http://www.schematron.com/resources.html
Canonical XML: http://www.w3.org/TR/xml-c14n
XML Infoset: http://www.w3.org/TR/xml-infoset
