SAX

Elliotte Rusty Harold

Software Development 2000 East

Wednesday, November 1, 2000

elharo@metalab.unc.edu

http://metalab.unc.edu/xml/

Where we're going

A Brief Review of XML Rules and Terminology
Reading XML through SAX2

SAX Requirements

You need a JDK
You need some free class libraries
You need a text editor
You need some data to process

Prerequisites

You should be familiar with Java, including I/O, classes, objects, polymorphism, etc.
You should know XML, including well-formedness, validity, namespaces, and so forth
I will briefly review the proper terminology

A simple example

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/css" href="song.css"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<SONG xmlns="http://metalab.unc.edu/xml/namespace/song"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <TITLE>Hot Cop</TITLE>
  <PHOTO 
    xlink:type="simple" xlink:show="onLoad" xlink:href="hotcop.jpg"
    ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200"/>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <!-- The publisher is actually Polygram but I needed 
       an example of a general entity reference. -->
  <PUBLISHER xlink:type="simple" xlink:href="http://www.amrecords.com/">
    A &amp; M Records
  </PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>
<!-- You can tell what album I was 
     listening to when I wrote this example -->

Markup and Character Data

Markup includes:
- Tags
- Entity References
- Comments
- Processing Instructions
- Document Type Declarations
- XML Declaration
- CDATA Section Delimiters
Character data includes everything else

Markup and Character Data Example

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/css" href="song.css"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<SONG xmlns="http://metalab.unc.edu/xml/namespace/song"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <TITLE>Hot Cop</TITLE>
  <PHOTO 
    xlink:type="simple" xlink:show="onLoad" xlink:href="hotcop.jpg"
    ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200"/>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <!-- The publisher is actually Polygram but I needed 
       an example of a general entity reference. -->
  <PUBLISHER xlink:type="simple" xlink:href="http://www.amrecords.com/">
    A &amp; M Records
  </PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>
<!-- You can tell what album I was 
     listening to when I wrote this example -->

Entities

An XML document is made up of one or more physical storage units called entities
Entity references :
- Parsed internal general entity references like &amp;
- Parsed external general entity references
- Unparsed external general entity references
- External parameter entity references
- Internal parameter entity references
Reading an XML document is not the same thing as reading an XML file

The file contains entity references.
The document contains the entities' replacement text.
When you use a parser to read a document you'll get the text including characters like <. You will not see the entity references.

Parsed Character Data

Character data left after entity references are replaced with their text
Given the element
<PUBLISHER>A &amp; M Records</PUBLISHER>

The parsed character data is

A & M Records

CDATA sections

Used to include large blocks of text with lots of normally illegal literal characters like < and &, typically XML or HTML.

<p>You can use a default <code>xmlns</code>
attribute to avoid having to add the svg prefix to all
your elements:</p>
<![CDATA[
  <svg xmlns="http://www.w3.org/Graphics/SVG/SVG-19991203.dtd" 
       width="12cm" height="10cm">
    <ellipse rx="110" ry="130" />
    <rect x="4cm" y="1cm" width="3cm" height="6cm" />
  </svg>
]]>

CDATA is for human authors, not for programs!
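
A minimal sketch of why a program should not care how the author escaped the text: the parser hands the same character data to characters() whether it came from a CDATA section or from entity references. The class name CharacterDataDemo and the sample strings are made up for illustration, and the parser is obtained through JAXP's SAXParserFactory only because it ships with the JDK.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original slides
class CharacterDataDemo {

  // Concatenates every chunk of text the parser passes to characters()
  static String textOf(String xml) {
    final StringBuilder text = new StringBuilder();
    try {
      XMLReader parser
       = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      parser.setContentHandler(new DefaultHandler() {
        @Override
        public void characters(char[] ch, int start, int length) {
          // only the slice from start to start+length belongs to this event
          text.append(ch, start, length);
        }
      });
      parser.parse(new InputSource(new StringReader(xml)));
    }
    catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException(e);
    }
    return text.toString();
  }

  public static void main(String[] args) {
    // Both lines print: <rect x="4cm"/> & more
    System.out.println(textOf("<p><![CDATA[<rect x=\"4cm\"/> & more]]></p>"));
    System.out.println(textOf("<p>&lt;rect x=\"4cm\"/&gt; &amp; more</p>"));
  }

}
```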

Comments


Comments are for humans, not programs.

Processing Instructions

Divided into a target and data for the target
The target must be an XML name
The data can have an effectively arbitrary format

<?robots index="yes" follow="no"?>
<?xml-stylesheet href="pelicans.css" type="text/css"?>
<?php 
  mysql_connect("database.unc.edu", "clerk", "password"); 
  $result = mysql("CYNW", "SELECT LastName, FirstName FROM Employees 
    ORDER BY LastName, FirstName"); 
  $i = 0;
  while ($i < mysql_numrows ($result)) {
     $fields = mysql_fetch_row($result);
     echo "<person>$fields[1] $fields[0] </person>\r\n";
     $i++;
  }
  mysql_close();
?>

These are for programs
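
As a rough illustration of the target/data split, the sketch below collects what a SAX parser reports through processingInstruction(). The class name ProcessingInstructionDemo and the tiny test document are assumptions made for this example. Note that an XML declaration at the start of the document is not reported this way.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original slides
class ProcessingInstructionDemo {

  // Returns one "target -> data" string per processing instruction
  static List<String> instructions(String xml) {
    final List<String> found = new ArrayList<String>();
    try {
      XMLReader parser
       = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      parser.setContentHandler(new DefaultHandler() {
        @Override
        public void processingInstruction(String target, String data) {
          found.add(target + " -> " + data);
        }
      });
      parser.parse(new InputSource(new StringReader(xml)));
    }
    catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException(e);
    }
    return found;
  }

  public static void main(String[] args) {
    String xml = "<?xml version=\"1.0\"?>\n"
     + "<?robots index=\"yes\" follow=\"no\"?>\n"
     + "<doc/>";
    // prints [robots -> index="yes" follow="no"]
    System.out.println(instructions(xml));
  }

}
```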

The XML Declaration

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

Looks like a processing instruction but isn't.
version attribute
- required
- always has the value 1.0
encoding attribute
- UTF-8
- ISO-8859-1
- etc.
standalone attribute
- yes
- no

Document Type Declaration

<!DOCTYPE SONG SYSTEM "song.dtd">

Document Type Definition

<!ELEMENT SONG (TITLE, COMPOSER+, PRODUCER*, 
 PUBLISHER*, YEAR?, LENGTH?, ARTIST+)>

<!ELEMENT TITLE (#PCDATA)>

<!ELEMENT COMPOSER (#PCDATA)>
<!ELEMENT PRODUCER (#PCDATA)>
<!ELEMENT PUBLISHER (#PCDATA)>
<!ELEMENT LENGTH (#PCDATA)>
<!-- This should be a four digit year like "1999",
     not a two-digit year like "99" -->
<!ELEMENT YEAR (#PCDATA)>

<!ELEMENT ARTIST (#PCDATA)>

XML Names

Used for element, attribute, and entity names
Can contain any alphabetic, ideographic, or numeric Unicode character
Can contain hyphen, underscore, or period
Can also contain colons but these are reserved for namespaces
Can begin with any alphabetic or ideographic character or the underscore but not digits or other punctuation marks

XML Namespaces

Raison d'etre:
1. To distinguish between elements and attributes from different vocabularies with different meanings.
2. To group all related elements and attributes together so that a parser can easily recognize them.
Each element is given a prefix
Each prefix (as well as the empty prefix) is associated with a URI
Elements with the same URI are in the same namespace
URIs are purely formal. They do not necessarily point to a page.

Namespace Syntax

Elements and attributes that are in namespaces have names that contain exactly one colon. They look like this:
```
rdf:description
xlink:type
xsl:template
```
Everything before the colon is called the prefix
Everything after the colon is called the local part or local name.
The complete name including the colon is called the qualified name or raw name.

Namespace URIs

Each prefix in a qualified name is associated with a URI.
For example, all elements in XSLT 1.0 style sheets are associated with the http://www.w3.org/1999/XSL/Transform URI.
The customary prefix xsl is a shorthand for the longer URI http://www.w3.org/1999/XSL/Transform.
You can't use the URI in the element name directly.

Binding Prefixes to Namespace URIs

Prefixes are bound to namespace URIs by attaching an xmlns:prefix attribute to the prefixed element or one of its ancestors.

<svg:svg xmlns:svg="http://www.w3.org/Graphics/SVG/SVG-19991203.dtd" 
 width="12cm" height="10cm">
  <svg:ellipse rx="110" ry="130" />
  <svg:rect x="4cm" y="1cm" width="3cm" height="6cm" />
</svg:svg>

Bindings have scope within the element where they're declared.
An SVG processor can recognize all three of these elements as SVG elements because they all have prefixes bound to the particular URI defined by the SVG specification.
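
The following sketch shows, under a few assumptions, how a namespace-aware SAX parser reports the three names of each element: the namespace URI, the local name, and the qualified name. The class name NamespaceReportDemo is invented for this example, and the parser comes from JAXP's SAXParserFactory, which has to be switched to namespace-aware mode explicitly.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original slides
class NamespaceReportDemo {

  // Returns "{uri}localName written as qName" for each start-tag
  static List<String> elementNames(String xml) {
    final List<String> names = new ArrayList<String>();
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true); // JAXP factories default to false
      XMLReader parser = factory.newSAXParser().getXMLReader();
      parser.setContentHandler(new DefaultHandler() {
        @Override
        public void startElement(String namespaceURI, String localName,
         String qName, Attributes atts) {
          names.add("{" + namespaceURI + "}" + localName
           + " written as " + qName);
        }
      });
      parser.parse(new InputSource(new StringReader(xml)));
    }
    catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException(e);
    }
    return names;
  }

  public static void main(String[] args) {
    String xml
     = "<svg:svg xmlns:svg=\"http://www.w3.org/Graphics/SVG/SVG-19991203.dtd\">"
     + "<svg:ellipse rx=\"110\" ry=\"130\"/>"
     + "<svg:rect width=\"3cm\" height=\"6cm\"/>"
     + "</svg:svg>";
    for (String name : elementNames(xml)) {
      System.out.println(name);
    }
  }

}
```

All three elements are reported with the same namespace URI, which is what lets a processor group them regardless of the prefix the author chose.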

The Default Namespace

Indicate that an unprefixed element and all its unprefixed descendant elements belong to a particular namespace by attaching an xmlns attribute with no prefix:

<DATASCHEMA xmlns="http://www.w3.org/2000/P3Pv1">
  <DATA name="vehicle.make" type="text" short="Make" 
        category="preference" size="31"/>
  <DATA name="vehicle.model" type="text" short="Model" 
        category="preference" size="31"/>
  <DATA name="vehicle.year" type="number" short="Year" 
        category="preference" size="4"/>
  <DATA name="vehicle.license.state." type="postal." short="State" 
        category="preference" size="2"/>
  <DATA name="vehicle.license.number" type="text" 
        short="License Plate Number" category="preference" size="12"/>
</DATASCHEMA>

Both the DATASCHEMA and DATA elements are in the http://www.w3.org/2000/P3Pv1 namespace.
Default namespaces apply only to elements, not to attributes. Thus in the above example the name, type, short, category, and size attributes are not in any namespace. Unprefixed attributes are never in any namespace.
You can change the default namespace within a particular element by adding an xmlns attribute to the element.
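
To make the element/attribute asymmetry concrete, here is a small sketch that asks the parser for the namespace URI of each element and of each attribute. The class name DefaultNamespaceDemo and the shortened document are assumptions for illustration; the parser is a namespace-aware one obtained from JAXP.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original slides
class DefaultNamespaceDemo {

  // Lists the namespace of every element and attribute in document order
  static List<String> namespaces(String xml) {
    final List<String> report = new ArrayList<String>();
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true);
      XMLReader parser = factory.newSAXParser().getXMLReader();
      parser.setContentHandler(new DefaultHandler() {
        @Override
        public void startElement(String namespaceURI, String localName,
         String qName, Attributes atts) {
          report.add(qName + " in " + namespaceURI);
          for (int i = 0; i < atts.getLength(); i++) {
            // SAX uses the empty string to mean "no namespace"
            String uri = atts.getURI(i);
            report.add("@" + atts.getQName(i) + " in "
             + (uri.isEmpty() ? "(no namespace)" : uri));
          }
        }
      });
      parser.parse(new InputSource(new StringReader(xml)));
    }
    catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException(e);
    }
    return report;
  }

  public static void main(String[] args) {
    String xml = "<DATASCHEMA xmlns=\"http://www.w3.org/2000/P3Pv1\">"
     + "<DATA name=\"vehicle.make\" type=\"text\"/>"
     + "</DATASCHEMA>";
    for (String line : namespaces(xml)) {
      System.out.println(line);
    }
  }

}
```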

How Parsers Handle Namespaces

Namespaces were added to XML 1.0 after the fact, but care was taken to ensure backwards compatibility.
An XML 1.0 parser that does not know about namespaces will most likely not have any troubles reading a document that uses namespaces.
A namespace aware parser also checks to see that all prefixes are mapped to URIs. Otherwise it behaves almost exactly like a non-namespace aware parser.
Other software that sits on top of the raw XML parser, an XSLT engine for example, may treat elements differently depending on what namespace they belong to. However, the XML parser itself mostly doesn't care as long as all well-formedness and namespace constraints are met.
A possible exception occurs in the unlikely event that elements with different prefixes belong to the same namespace or elements with the same prefix belong to different namespaces
Many parsers have the option of whether to report namespace violations so that you can turn namespace processing on or off as you see fit.

Canonical XML

A W3C standard for determining when two documents are the same after:
- Entity references are resolved
- Document is converted to Unicode
- Unicode combining forms are combined
- Comments are stripped
- White space is normalized
- Default attribute values are added
If at all possible, your programs should depend only on the canonical form of the document
Canonical form of hotcop.xml:
<?xml-stylesheet type="text/css" href="song.css"?><SONG>
 <TITLE>Hot Cop</TITLE>
 <COMPOSER>Jacques Morali</COMPOSER>
 <COMPOSER>Henri Belolo</COMPOSER>
 <COMPOSER>Victor Willis</COMPOSER>
 <PRODUCER>Jacques Morali</PRODUCER>
 <PUBLISHER>A &amp; M Records</PUBLISHER>
 <LENGTH>6:20</LENGTH>
 <YEAR>1978</YEAR>
 <ARTIST>Village People</ARTIST>
</SONG>

Part II: Reading XML Documents with SAX

The stereotypical "Desperate Perl Hacker" (DPH) is supposed to be able to write an XML parser in a weekend.
The parser does the hard work for you.
Your code reads the document through the parser's API.

Parser APIs

SAX, the Simple API for XML
- SAX1
- SAX2
DOM, the Document Object Model
- DOM Level 0
- DOM Level 1
- DOM Level 2
Proprietary APIs
- Parser specific APIs
- Sun's Java API for XML Parsing = SAX1 + DOM1 + a few factory classes
- JSR-000031 XML Data Binding Specification from Bluestone, Sun, webMethods et al.
  The proposed specification will define an XML data-binding facility for the JavaTM Platform. Such a facility compiles an XML schema into one or more Java classes. These automatically-generated classes handle the translation between XML documents that follow the schema and interrelated instances of the derived classes. They also ensure that the constraints expressed in the schema are maintained as instances of the classes are manipulated.

SAX

Public domain, developed on xml-dev mailing list
Maintained by David Megginson
org.xml.sax package
http://www.megginson.com/SAX/
Event based
SAX1 omits:
- Comments
- Lexical Information (CDATA sections, entity references, etc.)
- DTD declarations
- Validation
- Namespaces

SAX Parsers for Java

Parser	URL	Validating	Namespaces	DOM1	DOM2	SAX1	SAX2	License
Apache XML Project's Xerces Java	http://xml.apache.org/xerces-j/index.html	X	X	X	X	X	X	Apache Software License, Version 1.1
IBM's XML for Java	http://www.alphaworks.ibm.com/formula/xml	X	X	X	X	X	X	License
James Clark's XP	http://www.jclark.com/xml/xp/index.html					X		Modified BSD
Microstar's Ælfred	http://home.pacbell.net/david-b/xml/	Namespaces		DOM1	DOM2	SAX1	SAX2	open source
Silfide's SXP	http://www.loria.fr/projets/XSilfide/EN/sxp/			X		X		Non-GPL viral open source license
Sun's Java API for XML	http://java.sun.com/products/xml	X	X	X		X		free beer
Oracle's XML Parser for Java	http://technet.oracle.com/	X	X	X		X		free beer

What SAX1 doesn't do

Completely ignores document type declaration
Validation and other optional results of DTD (attribute defaulting, external entities, etc.) are at parser default
Comments
XML Declaration
Does not report CDATA sections, entity references, and other non-canonical information from the document.
No explicit support for namespaces

SAX2

Adds:
- Namespace support
- Optional Validation
- Optional Lexical events for comments, CDATA sections, entity references
A lot more configurable
Deprecates a lot of SAX1
Adapter classes convert between SAX1 parsers and SAX2 XMLReaders.

The SAX2 Process

Use the factory method XMLReaderFactory.createXMLReader() to retrieve a parser-specific implementation of the XMLReader interface
Your code registers a ContentHandler with the parser
An InputSource feeds the document into the parser
As the document is read, the parser calls back to the methods of the ContentHandler to tell it what it's seeing in the document.
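
The four steps above can be sketched in a few lines. This is only an illustration: the class name ElementCounter is invented, the document is read from a string rather than a URL, and newer JDKs mark XMLReaderFactory as deprecated (hence the suppressed warning), although it still returns the JDK's built-in parser.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical demo class; DefaultHandler supplies do-nothing
// implementations of the ContentHandler methods we don't override
class ElementCounter extends DefaultHandler {

  private int count = 0;

  // The parser calls this once for every start-tag it sees
  @Override
  public void startElement(String namespaceURI, String localName,
   String qName, Attributes atts) {
    count++;
  }

  @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in newer JDKs
  static int countElements(String xml) {
    try {
      // 1. retrieve a parser-specific XMLReader
      XMLReader parser = XMLReaderFactory.createXMLReader();
      // 2. register the ContentHandler
      ElementCounter counter = new ElementCounter();
      parser.setContentHandler(counter);
      // 3. feed the document in through an InputSource;
      // 4. the callbacks happen while parse() runs
      parser.parse(new InputSource(new StringReader(xml)));
      return counter.count;
    }
    catch (SAXException | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // prints 3
    System.out.println(countElements("<a><b/><b/></a>"));
  }

}
```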

Making an XMLReader

The XMLReaderFactory.createXMLReader() method instantiates the XMLReader implementation named by the org.xml.sax.driver system property:
```
try {
  XMLReader parser = XMLReaderFactory.createXMLReader();
} 
catch (SAXException e) {
  System.err.println(e);
}
```

The XMLReaderFactory.createXMLReader(String className) method instantiates the XMLReader implementation named by its argument:

try {
  XMLReader parser 
   = XMLReaderFactory.createXMLReader(   
      "org.apache.xerces.parsers.SAXParser");
} 
catch (SAXException e) {
  System.err.println(e);
}

Or you can use the constructor in the package-specific class:
```
XMLReader parser = new SAXParser();
```

Parsing a Document with XMLReader

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.*;


public class SAX2Checker {

  public static void main(String[] args) {
    
    if (args.length == 0) {
      System.out.println("Usage: java SAX2Checker URL1 URL2..."); 
    } 
    
    // set up the parser 
    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    } 
    catch (SAXException e) {
      try {
        parser = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
      }
      catch (SAXException e2) {
        System.err.println("Error: could not locate a parser.");
        return;
      }
    }
     
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      // command line should offer URIs or file names
      try {
        parser.parse(args[i]);
        // If there are no well-formedness errors
        // then no exception is thrown
        System.out.println(args[i] + " is well formed.");
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber() 
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not check " + args[i] 
         + " because of the IOException " + e);
      }
      
    }  
  
  }

}

Sample Output from SAX2Checker

C:\>java SAX2Checker http://metalab.unc.edu/xml/
http://metalab.unc.edu/xml/ is not well formed.
The element type "dt" must be terminated by the 
matching end-tag "</dt>". 
at line 186, column 5

The ContentHandler interface

package org.xml.sax;


public interface ContentHandler {

    public void setDocumentLocator(Locator locator);
    
    public void startDocument() throws SAXException;
    
    public void endDocument()	throws SAXException;
    
    public void startPrefixMapping(String prefix, String uri) 
     throws SAXException;

    public void endPrefixMapping(String prefix) throws SAXException;

    public void startElement(String namespaceURI, String localName,
		 String qualifiedName, Attributes atts) throws SAXException;

    public void endElement(String namespaceURI, String localName,
     String qualifiedName) throws SAXException;

    public void characters(char[] ch, int start, int length) 
     throws SAXException;

    public void ignorableWhitespace(char ch[], int start, int length)
     throws SAXException;

    public void processingInstruction(String target, String data)
     throws SAXException;

    public void skippedEntity(String name) throws SAXException;
     
}

SAX2 Event Reporter

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.*;

public class EventReporter implements ContentHandler {

  public void setDocumentLocator(Locator locator) {
    System.out.println("setDocumentLocator(" + locator + ")");         
  }
  
  public void startDocument() throws SAXException {
    System.out.println("startDocument()"); 
  }

  public void endDocument() throws SAXException {
    System.out.println("endDocument()"); 
  }
  
  public void startElement(String namespaceURI, String localName, String qName, Attributes atts)
   throws SAXException {
    namespaceURI = '"' + namespaceURI + '"';
    localName = '"' + localName + '"';
    qName = '"' + qName + '"';
    String attributeString = "{";
    for (int i = 0; i < atts.getLength(); i++) {
      attributeString += atts.getQName(i) + "=\"" + atts.getValue(i) + "\"";
      if (i != atts.getLength()-1) attributeString += ", ";
    }
    attributeString += "}";
    System.out.println("startElement(" + namespaceURI + ", " + localName + ", " 
    + qName + ", " + attributeString + ")"); 
  }
  
  public void endElement(String namespaceURI, String localName, String qName) 
   throws SAXException {
    namespaceURI = '"' + namespaceURI + '"';
    localName = '"' + localName + '"';
    qName = '"' + qName + '"';
    System.out.println("endElement(" + namespaceURI + ", " + localName + ", " 
    + qName + ")"); 
  }
  
  public void characters(char[] text, int start, int length) 
   throws SAXException {
    // Only the slice from start to start+length belongs to this event
    String textString = "[" + new String(text, start, length) + "]";
    System.out.println("characters(" + textString + ", " + start + ", " +  length + ")"); 
  }
  
  public void ignorableWhitespace(char[] text, int start, int length)
   throws SAXException {
    System.out.println("ignorableWhitespace()"); 
  }
  
  public void processingInstruction(String target, String data)
   throws SAXException {
    System.out.println("processingInstruction(" + target + ", " + data + ")"); 
  }

  public void startPrefixMapping(String prefix, String uri) 
   throws SAXException {
    System.out.println("startPrefixMapping(\"" + prefix + "\", \"" + uri + "\")");         
  }
  
  public void endPrefixMapping(String prefix) throws SAXException {
    System.out.println("endPrefixMapping(\"" + prefix + "\")");                 
  }

  public void skippedEntity(String name) throws SAXException {
    System.out.println("skippedEntity(" + name + ")");                         
  }

  // Could easily have put main() method in a separate class
  public static void main(String[] args) {
    
    XMLReader parser;
    try {
     parser = XMLReaderFactory.createXMLReader();
    }
    catch (Exception e) {
      // fall back on Xerces parser by name
      try {
        parser = XMLReaderFactory.createXMLReader(
         "org.apache.xerces.parsers.SAXParser");
      }
      catch (Exception ee) {
        System.err.println("Couldn't locate a SAX parser");
        return;          
      }
    }

     
    if (args.length == 0) {
      System.out.println(
       "Usage: java EventReporter URL1 URL2..."); 
    } 
      
    // Install the ContentHandler      
    parser.setContentHandler(new EventReporter());
    
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      // command line should offer URIs or file names
      try {
        parser.parse(args[i]);
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber() 
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not report on " + args[i] 
         + " because of the IOException " + e);
      }
      
    }  
  
  }

}

Event Reporter Output

UserLand's RSS based list of Web logs

UserLand's RSS based list of Web logs is at http://static.userland.com/weblogMonitor/logs.xml

Full list

Goal: Return a list of all the URLs in this list as java.net.URL objects

Design Decisions

Should we return an array, an Enumeration, a List, or what?
Perhaps we should use multiple threads?

SAX Design

We do not know how many URLs there will be when we start parsing so let's use a Vector
Single threaded for simplicity but a real program would use multiple threads
- One to load and parse the data
- Another thread (probably the main thread) to serve the data
- Early data could be provided before the entire document had been read
The character data of each url element needs to be stored. Everything else can be ignored.
A startElement() with the name url indicates that we need to start storing this data.
An endElement() with the name url indicates that we need to stop storing this data, convert it to a URL and put it in the Vector
Hide the XML parsing inside a non-public class to avoid accidentally calling the methods from unexpected places or threads?

User Interface Class

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.util.*;
import java.io.*;

public class Weblogs {
     
  public static List listChannels() 
   throws IOException, SAXException {
    return listChannels(
     "http://static.userland.com/weblogMonitor/logs.xml"); 
  }
  
  public static List listChannels(String uri) 
   throws IOException, SAXException {
    
    // set up the parser 
    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    } 
    catch (SAXException e) {
      try {
        parser = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
      }
      catch (SAXException e2) {
        System.err.println("Error: could not locate a parser.");
        return null;
      }
    }
    
    Vector urls = new Vector(1000);
    URIGrabber u = new URIGrabber(urls);
    parser.setContentHandler(u);
    parser.parse(uri);
    return urls;
    
  }
  
  public static void main(String[] args) {
   
    try {
      List urls;
      if (args.length > 0) urls = listChannels(args[0]);
      else urls = listChannels();
      Iterator iterator = urls.iterator();
      while (iterator.hasNext()) {
        System.out.println(iterator.next()); 
      }
    }
    catch (IOException e) {
      System.err.println(e); 
    }
    catch (SAXParseException e) {
      System.err.println(e); 
      System.err.println("at line " + e.getLineNumber() 
       + ", column " + e.getColumnNumber()); 
    }
    catch (SAXException e) {
      System.err.println(e); 
    }
    catch (/* Unexpected */ Exception e) {
      e.printStackTrace(); 
    }
    
  }
  
}

ContentHandler Class

import org.xml.sax.*;
import java.net.*;
import java.util.Vector;

             // conflicts with java.net.ContentHandler
class URIGrabber implements org.xml.sax.ContentHandler {
    
  private Vector urls;
     
  URIGrabber(Vector urls) {
    this.urls = urls;
  }
    
  // do nothing methods  
  public void setDocumentLocator(Locator locator) {}
  public void startDocument() throws SAXException {}
  public void endDocument() throws SAXException {}
  public void startPrefixMapping(String prefix, String uri) 
   throws SAXException {}
  public void endPrefixMapping(String prefix) throws SAXException {}
  public void skippedEntity(String name) throws SAXException {}  
  public void ignorableWhitespace(char[] text, int start, int length)
   throws SAXException {}
  public void processingInstruction(String target, String data)
   throws SAXException {}
  
  
  // Remember, there's no guarantee all the text of the
  // url element will be returned in a single call to characters
  private StringBuffer urlBuffer;
  private boolean collecting = false;
  
  public void startElement(String namespaceURI, String localName,
   String rawName, Attributes atts) throws SAXException {
	  
    if (rawName.equals("url")) {
      collecting = true;
      urlBuffer = new StringBuffer();
    } 
    
  }
  
  public void characters(char[] text, int start, int length) 
   throws SAXException {
    
    if (collecting) {
      urlBuffer.append(text, start, length);
    } 
    
  }
  
  public void endElement(String namespaceURI, String localName,
   String rawName) throws SAXException {
	  
    if (rawName.equals("url")) {
      collecting = false;
      String url = urlBuffer.toString();
      try {
        urls.addElement(new URL(url));
      }
      catch (MalformedURLException e) {
        // skip this url
      }
    }
    
  } 
    
}

Weblogs Output

% java Weblogs shortlogs.xml
http://www.mozillazine.org
http://www.salonherringwiredfool.com/
http://www.scripting.com/
http://www.slashdot.org/

Features and Properties

SAX2 parsers--that is XMLReaders--are configured by features and properties
Feature and property names are absolute URIs
A feature is boolean, on or off, true or false; a property is an object

public boolean getFeature(String name)
 throws SAXNotRecognizedException, SAXNotSupportedException
public void setFeature(String name, boolean value)
 throws SAXNotRecognizedException, SAXNotSupportedException
public Object getProperty(String name)
 throws SAXNotRecognizedException, SAXNotSupportedException
public void setProperty(String name, Object value)
 throws SAXNotRecognizedException, SAXNotSupportedException

Features can be read-only or read/write.
Some features may be modifiable while parsing; others only before parsing starts

For example,

try {
  if (xmlReader.getFeature("http://xml.org/sax/features/validation")) {
    System.out.println("Parser is validating.");
  } 
  else {
    System.out.println("Parser is not validating.");
  }
} 
catch (SAXException e) {
  System.out.println("Do not know if parser validates");
}

Feature/Property SAXExceptions

SAXNotRecognizedException: the parser does not recognize a requested feature or property
SAXNotSupportedException: the parser does not support a requested feature/property or the feature/property is read-only
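
A hedged sketch of how a program might tell the two exceptions apart when probing a parser. The class name FeatureProbe and the example.com feature URI are made up; the JDK's bundled parser, obtained here through JAXP, is assumed.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical demo class, not part of the original slides
class FeatureProbe {

  // Describes how the parser reacts when asked about a feature
  static String probe(String featureName) {
    try {
      XMLReader parser
       = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      boolean value = parser.getFeature(featureName);
      return "recognized, currently " + value;
    }
    catch (SAXNotRecognizedException e) {
      // the parser has never heard of this feature name
      return "not recognized";
    }
    catch (SAXNotSupportedException e) {
      // the parser knows the name but cannot report it right now
      return "recognized but not supported";
    }
    catch (ParserConfigurationException | SAXException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(probe("http://xml.org/sax/features/validation"));
    // a made-up name that no parser should recognize
    System.out.println(probe("http://example.com/features/no-such-feature"));
  }

}
```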

Required Features

http://xml.org/sax/features/namespaces
- If true, then perform namespace processing.
- If false, then, at parser option, do not perform namespace processing
- access: (parsing) read-only; (not parsing) read/write
- true by default
http://xml.org/sax/features/namespace-prefixes
- If true, then report the original prefixed names and attributes used for namespace declarations.
- If false, then do not report attributes used for namespace declarations, and optionally do not report original prefixed names.
- false by default
- access: (parsing) read-only; (not parsing) read/write
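
A sketch of what the namespace-prefixes feature changes in practice, assuming the JDK's bundled parser in namespace-aware mode. The class name NamespacePrefixesDemo is hypothetical. The feature is set before parsing starts, since it is read-only during a parse.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original slides
class NamespacePrefixesDemo {

  // Returns the qualified names of the attributes reported on the root element
  static List<String> rootAttributes(String xml, boolean reportPrefixes) {
    final List<String> names = new ArrayList<String>();
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true);
      XMLReader parser = factory.newSAXParser().getXMLReader();
      parser.setFeature(
       "http://xml.org/sax/features/namespace-prefixes", reportPrefixes);
      parser.setContentHandler(new DefaultHandler() {
        private boolean seenRoot = false;
        @Override
        public void startElement(String namespaceURI, String localName,
         String qName, Attributes atts) {
          if (seenRoot) return; // only look at the root element
          seenRoot = true;
          for (int i = 0; i < atts.getLength(); i++) {
            names.add(atts.getQName(i));
          }
        }
      });
      parser.parse(new InputSource(new StringReader(xml)));
    }
    catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException(e);
    }
    return names;
  }

  public static void main(String[] args) {
    String xml
     = "<svg:svg xmlns:svg=\"http://www.w3.org/Graphics/SVG/SVG-19991203.dtd\""
     + " width=\"12cm\"><svg:rect/></svg:svg>";
    // with the feature off, the xmlns:svg declaration is not reported
    System.out.println(rootAttributes(xml, false));
    // with the feature on, it shows up alongside the ordinary attributes
    System.out.println(rootAttributes(xml, true));
  }

}
```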

Core Features

http://xml.org/sax/features/namespaces
http://xml.org/sax/features/namespace-prefixes
http://xml.org/sax/features/string-interning
- If true, then all element names, prefixes, attribute names, Namespace URIs, and local names are internalized using java.lang.String.intern().
- If false, then names are not necessarily internalized.
- access: (parsing) read-only; (not parsing) read/write
http://xml.org/sax/features/validation
- If true, then report all validation errors
- If false, then do not report validation errors.
- access: (parsing) read-only; (not parsing) read/write
http://xml.org/sax/features/external-general-entities
- If true, then include all external general (text) entities.
- false: Do not include external general entities.
- access: (parsing) read-only; (not parsing) read/write
http://xml.org/sax/features/external-parameter-entities
- If true, then include all external parameter entities, including the external DTD subset.
- false: Do not include any external parameter entities, even the external DTD subset.
- access: (parsing) read-only; (not parsing) read/write

adapted from SAX2 documentation by David Megginson

Turning on Validation

Not all parsers are validating but Xerces-J is.
Validity errors are not fatal; therefore they do not throw SAXParseExceptions
Must install an ErrorHandler as well as a ContentHandler
Must set the feature http://xml.org/sax/features/validation

Three Levels of Errors

In increasing order of severity
1. A warning; e.g. ambiguous content model, a constraint for compatibility
2. A recoverable error: typically a validity error
3. A fatal error: typically a well-formedness error
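
The difference between the last two levels can be seen in a short sketch: a validity error is passed to error() and parsing continues, while a well-formedness error ends the parse with a SAXParseException. The class name ErrorLevelsDemo, the three-way classification, and the tiny internal DTD are assumptions for this example; validation is switched on through JAXP's SAXParserFactory.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original slides
class ErrorLevelsDemo {

  // Returns "valid", "invalid", or "malformed"
  static String classify(String xml) {
    final int[] validityErrors = {0};
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setValidating(true);
      XMLReader parser = factory.newSAXParser().getXMLReader();
      parser.setErrorHandler(new DefaultHandler() {
        // recoverable errors: count them and let the parse continue
        @Override
        public void error(SAXParseException e) {
          validityErrors[0]++;
        }
        // warning() does nothing and fatalError() rethrows,
        // both as inherited from DefaultHandler
      });
      parser.parse(new InputSource(new StringReader(xml)));
    }
    catch (SAXParseException e) {
      return "malformed"; // a fatal error stopped the parse
    }
    catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException(e);
    }
    return validityErrors[0] > 0 ? "invalid" : "valid";
  }

  public static void main(String[] args) {
    String dtd = "<!DOCTYPE SONG ["
     + "<!ELEMENT SONG (TITLE)> <!ELEMENT TITLE (#PCDATA)>]>";
    // prints valid
    System.out.println(classify(dtd + "<SONG><TITLE>Hot Cop</TITLE></SONG>"));
    // prints invalid
    System.out.println(classify(dtd + "<SONG><ARTIST>Village People</ARTIST></SONG>"));
    // prints malformed
    System.out.println(classify(dtd + "<SONG><TITLE>Hot Cop</SONG>"));
  }

}
```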

The ErrorHandler interface

package org.xml.sax;

public interface ErrorHandler {
 
  public void warning(SAXParseException exception)
   throws SAXException;

  public void error(SAXParseException exception)
   throws SAXException;
    
  public void fatalError(SAXParseException exception)
   throws SAXException;
    
}

An ErrorHandler for Reporting Validity Errors

import org.xml.sax.*;
import java.io.*;


public class ValidityErrorReporter implements ErrorHandler {
 
  Writer out;
 
  public ValidityErrorReporter(Writer out) {
    this.out = out;
  }
 
  public ValidityErrorReporter() {
    this(new OutputStreamWriter(System.out));
  }
 
  public void warning(SAXParseException ex)
   throws SAXException {

    try {
      out.write(ex.getMessage() + "\r\n");
      out.write(" at line " + ex.getLineNumber() + ", column " 
       + ex.getColumnNumber() + "\r\n");
      out.flush();
    }
    catch (IOException e) {
      throw new SAXException(e); 
    }
    
  }

  public void error(SAXParseException ex)
   throws SAXException {
    
    try {
      out.write(ex.getMessage() + "\r\n");
      out.write(" at line " + ex.getLineNumber() + ", column " 
       + ex.getColumnNumber() + "\r\n");
      out.flush();
    }
    catch (IOException e) {
      throw new SAXException(e); 
    }
    
  }
    
  public void fatalError(SAXParseException ex)
   throws SAXException {
    
    try {
      out.write(ex.getMessage() + "\r\n");
      out.write(" at line " + ex.getLineNumber() + ", column " 
       + ex.getColumnNumber() + "\r\n");
      out.flush();
    }
    catch (IOException e) {
      throw new SAXException(e); 
    }
    
  }
    
}

Validating

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.*;


public class Validator {

  public static void main(String[] args) {

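    // createXMLReader() throws a checked SAXException
    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    }
    catch (SAXException e) {
      System.err.println("Error: could not locate a parser.");
      return;
    }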

    // turn on validation
    try {
      parser.setFeature(
       "http://xml.org/sax/features/validation", true);
      parser.setErrorHandler(new ValidityErrorReporter());
    }
    catch (SAXNotRecognizedException e) {
      System.err.println(
       "Installed XML parser cannot validate;"
       + " checking for well-formedness instead...");
    }
    catch (SAXNotSupportedException e) {
      System.err.println(
       "Cannot turn on validation here; "
       + "checking for well-formedness instead...");
    }

    if (args.length == 0) {
      System.out.println("Usage: java Validator URL1 URL2...");
    }

    // start parsing...
    for (int i = 0; i < args.length; i++) {

      // command line should offer URIs or file names
      try {
        parser.parse(args[i]);
        // If there are no well-formedness errors,
        // then no exception is thrown
        System.out.println(args[i] + " is well formed.");
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber()
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not check " + args[i]
         + " because of the IOException " + e);
      }

    }

  }

}

Schemas

An XML syntax
Let you specify the contents of elements
Type derivation
Xerces validates against schemas if the document uses xsi:schemaLocation or xsi:noNamespaceSchemaLocation to point at a schema
Standard is not quite finished yet

Schema for Songs

<xsd:schema xmlns:xsd="http://www.w3.org/1999/XMLSchema">
 
  <xsd:element name="SONG" type="SongType"/>

  <xsd:complexType name="SongType">
  
    <xsd:element name="TITLE"     type="xsd:string" minOccurs="1" maxOccurs="1"/>
    <xsd:element name="COMPOSER"  type="xsd:string" minOccurs="1" maxOccurs="unbounded"/>
    <xsd:element name="PRODUCER"  type="xsd:string" minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element name="PUBLISHER" type="xsd:string" minOccurs="0" maxOccurs="1"/>
  
    <xsd:element name="LENGTH" type="xsd:timeDuration" minOccurs="1" maxOccurs="1"/>
    <xsd:element name="YEAR"   type="xsd:year" minOccurs="1" maxOccurs="1"/>

    <xsd:element name="ARTIST" type="xsd:string" minOccurs="0" maxOccurs="unbounded"/>
    
  </xsd:complexType>

</xsd:schema>

Core Properties

http://xml.org/sax/properties/lexical-handler
- data type: org.xml.sax.ext.LexicalHandler
- description: An optional extension handler for items like comments that are not part of the information set and may be omitted.
- access: read/write
http://xml.org/sax/properties/declaration-handler
- data type: org.xml.sax.ext.DeclHandler
- description: An optional extension handler for ATTLIST and ELEMENT declarations (but not notations and unparsed entities).
- access: read/write
http://xml.org/sax/properties/dom-node
- data type: org.w3c.dom.Node
- description: When parsing, the current DOM node being visited if this is a DOM iterator; when not parsing, the root DOM node for iteration.
- access: (parsing) read-only; (not parsing) read/write
http://xml.org/sax/properties/xml-string
- data type: java.lang.String
- description: The literal string of characters that was the source for the current event.
- access: read-only

adapted from SAX2 documentation by David Megginson

Nonstandard Features in Xerces

http://apache.org/xml/features/validation/dynamic
- True: The parser will validate the document if a DTD is specified in a DOCTYPE declaration or using the appropriate schema attributes like xsi:noNamespaceSchemaLocation.
- False: Validation is determined by the state of the http://xml.org/sax/features/validation feature.
- Default is false
http://apache.org/xml/features/validation/warn-on-duplicate-attdef
- True: Warn on duplicate attribute declaration.
- False: Do not warn on duplicate attribute declaration.
- Default: true
http://apache.org/xml/features/validation/warn-on-undeclared-elemdef
- True: Warn if element referenced in content model is not declared.
- False: Do not warn if element referenced in content model is not declared.
- Default: true
http://apache.org/xml/features/allow-java-encodings
- True: Allow Java encoding names like 8859_1 in XML and text declarations.
- False: Do not allow Java encoding names in XML and text declarations.
- Default: false
http://apache.org/xml/features/continue-after-fatal-error
- True: Continue after fatal error.
- False: Stops parse on first fatal error.
- Default: false
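
Because these features are specific to Xerces, another parser may not recognize them. The following is a minimal sketch of probing for a feature before relying on it. It assumes the parser bundled with the JDK, obtained through JAXP's SAXParserFactory rather than through Xerces directly; the class name FeatureProbe and its methods are illustrative, not part of SAX.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

class FeatureProbe {

  // Assumption: whatever parser JAXP finds first is good enough here
  static XMLReader newReader() throws Exception {
    return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
  }

  // Returns true if the reader accepted the setting, false otherwise
  static boolean trySetFeature(XMLReader reader, String name, boolean value) {
    try {
      reader.setFeature(name, value);
      return true;
    }
    catch (SAXNotRecognizedException e) {
      return false; // the parser has never heard of this feature
    }
    catch (SAXNotSupportedException e) {
      return false; // known feature, but this value cannot be set now
    }
  }

  public static void main(String[] args) throws Exception {
    XMLReader reader = newReader();
    String[] names = {
      "http://xml.org/sax/features/namespaces",
      "http://apache.org/xml/features/continue-after-fatal-error",
      "http://apache.org/xml/features/validation/dynamic"
    };
    for (int i = 0; i < names.length; i++) {
      boolean accepted = trySetFeature(reader, names[i], true);
      System.out.println(names[i] + ": "
       + (accepted ? "set" : "not available"));
    }
  }

}
```

Which of the Xerces features are accepted depends on the installed parser, so the program reports the outcome rather than assuming it.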

Nonstandard Properties in Xerces

None for the SAXParser
The DOM parser has a couple

Properties for Extension Handlers

Extension handlers are non-required interfaces in the org.xml.sax.ext package.
To set the LexicalHandler for an XML reader, set the property http://xml.org/sax/properties/lexical-handler.
To set the DeclHandler for an XML reader, set the property http://xml.org/sax/properties/declaration-handler.
If the reader does not support the requested property, it will throw a SAXNotRecognizedException or a SAXNotSupportedException.

Handling Attributes in SAX2

The startElement() method in ContentHandler receives as an argument an Attributes object containing all the attributes on that tag.
public void startElement(String namespaceURI, String localName, String qualifiedName, Attributes atts) throws SAXException

The Attributes interface:

package org.xml.sax;

public interface Attributes {

  public int    getLength();

  /* Look up an attribute's Namespace URI by index.*/
  public String getURI(int index);
  public String getLocalName(int index);
  public String getQName(int index);
  public String getType(int index);
  public String getValue(int index);
  public int    getIndex(String uri, String localPart);
  public int    getIndex(String qualifiedName);
  public String getType(String uri, String localName);
  public String getType(String qualifiedName);
  public String getValue(String uri, String localName);
  public String getValue(String qualifiedName);

}
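
The index-based getters make it possible to walk every attribute of an element without knowing the names in advance. A minimal sketch, assuming a namespace-aware parser from JAXP's SAXParserFactory; the class AttributeLister and its output format are illustrative only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class AttributeLister extends DefaultHandler {

  final List<String> lines = new ArrayList<String>();

  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {

    // one line per attribute: owner, qualified name, {URI}local name,
    // declared type and value
    for (int i = 0; i < atts.getLength(); i++) {
      lines.add(qualifiedName + "/@" + atts.getQName(i)
       + " {" + atts.getURI(i) + "}" + atts.getLocalName(i)
       + " " + atts.getType(i) + " " + atts.getValue(i));
    }

  }

  static List<String> list(String xml) throws Exception {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true);
    XMLReader reader = factory.newSAXParser().getXMLReader();
    AttributeLister lister = new AttributeLister();
    reader.setContentHandler(lister);
    reader.parse(new InputSource(new StringReader(xml)));
    return lister.lines;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<PHOTO xmlns:xlink='http://www.w3.org/1999/xlink'"
     + " xlink:href='hotcop.jpg' WIDTH='100'/>";
    for (String line : list(xml)) System.out.println(line);
  }

}
```

Attributes that are not declared in a DTD are reported with the type CDATA, and unprefixed attributes have an empty namespace URI.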

Attributes Example

import org.xml.sax.*;
import org.apache.xerces.parsers.*;
import java.io.*;
import java.util.*;
import org.xml.sax.helpers.*;


public class XLinkSpider extends DefaultHandler {

  public static Enumeration listURIs(String systemId) 
   throws SAXException, IOException {
    
    // set up the parser 
    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    } 
    catch (SAXException e) {
      try {
        parser = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
      }
      catch (SAXException e2) {
        System.err.println("Error: could not locate a parser.");
        return null;
      }
    }
      
    // Install the Content Handler   
    XLinkSpider spider = new XLinkSpider();   
    parser.setContentHandler(spider);
    parser.parse(systemId);
    return spider.uris.elements();
      
  }
  
  private Vector uris = new Vector();

  public void startElement(String namespaceURI, String localName, 
   String rawName, Attributes atts) throws SAXException {
    
     String uri = atts.getValue("http://www.w3.org/1999/xlink", "href");
     if (uri != null) uris.addElement(uri);
    
  }
  

  public static void main(String[] args) {
    
    if (args.length == 0) {
      System.out.println("Usage: java XLinkSpider URL1 URL2..."); 
    } 
      
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      try {
        Enumeration uris = listURIs(args[i]);
        while (uris.hasMoreElements()) {
          String s = (String) uris.nextElement();
          System.out.println(s);
        }
      }
      catch (Exception e) {
        System.err.println(e);
        e.printStackTrace(); 
      }
      
    } // end for
  
  } // end main

} // end XLinkSpider

Resolving Entities

The EntityResolver allows you to substitute your own URI lookup scheme for external entities
Especially useful for entities that use URL and URI schemes not supported by Java's protocol handlers; e.g. jdbc: or isbn:

The EntityResolver interface:

package org.xml.sax;

import java.io.IOException;

public interface EntityResolver {  

  public InputSource resolveEntity (String publicId,
   String systemId) throws SAXException, IOException;
    
}

EntityResolver Example

import org.xml.sax.*;

public class RSSResolver implements EntityResolver {

  public InputSource resolveEntity(String publicId, String systemId) {

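    // publicId is null when the declaration has no public identifier,
    // so compare from the constant side to avoid a NullPointerException
    if ("-//Netscape Communications//DTD RSS 0.91//EN".equals(publicId)
     || "http://my.netscape.com/publish/formats/rss-0.91.dtd".equals(systemId)) {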
      return new InputSource("http://metalab.unc.edu/xml/dtds/rss.dtd");
    } 
    else {
      // use the default behaviour
      return null;
    }
    
  }
   
}
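
A resolver has no effect until it is installed with setEntityResolver(). The sketch below installs one that serves a DTD from memory, so the example runs without network access. It assumes a parser from JAXP's SAXParserFactory that reads the external DTD subset; the system ID, the entity name label, and the class LocalDtdDemo are all made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class LocalDtdDemo {

  // hypothetical system ID; nothing is ever fetched from it
  static final String DTD_ID = "http://example.invalid/song.dtd";

  static String parseWithLocalDtd(String xml) throws Exception {

    XMLReader reader
     = SAXParserFactory.newInstance().newSAXParser().getXMLReader();

    reader.setEntityResolver(new EntityResolver() {
      public InputSource resolveEntity(String publicId, String systemId) {
        if (DTD_ID.equals(systemId)) {
          // substitute an in-memory DTD for the remote one
          InputSource dtd = new InputSource(
           new StringReader("<!ENTITY label \"Polygram\">"));
          dtd.setSystemId(systemId);
          return dtd;
        }
        return null; // use the default behaviour
      }
    });

    // collect all character data so the expanded entity is visible
    final StringBuilder text = new StringBuilder();
    reader.setContentHandler(new DefaultHandler() {
      public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
      }
    });

    reader.parse(new InputSource(new StringReader(xml)));
    return text.toString();

  }

  public static void main(String[] args) throws Exception {
    String xml = "<!DOCTYPE PUBLISHER SYSTEM \"" + DTD_ID + "\">"
     + "<PUBLISHER>&label;</PUBLISHER>";
    System.out.println(parseWithLocalDtd(xml));
  }

}
```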

Handling DTDs

The DTDHandler interface covers those aspects of DTDs that a non-validating parser may care about and that are not handled by other interfaces:
- Notation Declarations
- Unparsed Entity Declarations
Attribute Defaults are handled transparently by startElement() and the Attributes interface
Parsed entities are handled transparently by ContentHandler unless you install an EntityResolver

The DTDHandler interface:

package org.xml.sax;

public interface DTDHandler {
       
  public void notationDecl(String name, String publicId, String systemId)
   throws SAXException;
 
  public void unparsedEntityDecl(String name, String publicId, 
   String systemId, String notationName) throws SAXException;
    
}
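
A minimal sketch of a DTDHandler that simply records what the parser reports, assuming a parser from JAXP's SAXParserFactory. The class DtdEventLog, the notation TXT, and the entity lyrics are illustrative names. Only the declared names are recorded, because a parser may rewrite system IDs into absolute URIs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.DTDHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class DtdEventLog implements DTDHandler {

  final List<String> events = new ArrayList<String>();

  public void notationDecl(String name, String publicId, String systemId) {
    events.add("notation " + name);
  }

  public void unparsedEntityDecl(String name, String publicId,
   String systemId, String notationName) {
    events.add("unparsed entity " + name + " uses notation " + notationName);
  }

  static List<String> log(String xml) throws Exception {
    XMLReader reader
     = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    DtdEventLog log = new DtdEventLog();
    reader.setDTDHandler(log);
    reader.parse(new InputSource(new StringReader(xml)));
    return log.events;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<!DOCTYPE SONG [\n"
     + "  <!NOTATION TXT SYSTEM \"text/plain\">\n"
     + "  <!ENTITY lyrics SYSTEM \"hotcop.txt\" NDATA TXT>\n"
     + "]>\n"
     + "<SONG/>";
    for (String event : log(xml)) System.out.println(event);
  }

}
```

The unparsed entity itself is never read by the parser; the handler only learns that it was declared.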

DTDHandler Example

Program to map unparsed entities with notation "text/plain" to CDATA sections
AttributeHandler will have to make actual replacements
Will finish with XMLFilter

TextEntityReplacer

import org.xml.sax.*;
import java.util.*;
import java.net.*;
import java.io.*;


public class TextEntityReplacer implements DTDHandler {

  /* This class stores the notation and entity declarations 
     for a single document. It is not designed to be reused
     for multiple parses, though that would be a straightforward
     extension. The public and system IDs of the document
     being parsed are set in the constructor.    
  */ 
  
  private URL systemID;
  private String publicID;
  
  public TextEntityReplacer(String publicID, String systemID) 
   throws MalformedURLException {
    System.err.println("created");
    this.publicID = publicID;
    this.systemID = new URL(systemID);
  }

  // store all notations in a hashtable. We'll need them later
  private Hashtable notations = new Hashtable();

  // for the DTDHandler interface
  public void notationDecl(String name, String publicID, String systemID)
   throws SAXException {
    
    Notation n = new Notation(name, publicID, systemID);
    notations.put(name, n);
    
  }
  
  private class Notation {
    
    String name;
    String publicID;
    String systemID;
    
    Notation(String name, String publicID, String systemID) {
      this.name = name;
      this.publicID = publicID;
      this.systemID = systemID;
    } 
    
  }
 
   
  // store all unparsed entities in a hashtable. We'll need them later
  private Hashtable unparsedEntities = new Hashtable();

  // for the DTDHandler interface
  public void unparsedEntityDecl(String name, String publicID, 
   String systemID, String notationName) throws SAXException {
    
    UnparsedEntity e = new UnparsedEntity(name, publicID, systemID, notationName);
    unparsedEntities.put(name, e);
    
  }    

  private class UnparsedEntity {
    
    String name;
    String publicID;
    String systemID;
    String notationName;
    
    UnparsedEntity(String name, String publicID, String systemID, String notationName) {
      this.name = name;
      this.notationName = notationName;
      this.publicID = publicID;
      this.systemID = systemID;
    } 
    
  }


  public boolean isText(String notationName) {
    
    Object o = notations.get(notationName);
    if (o == null) return false;
    Notation n = (Notation) o;
    // a notation declared with only a public ID has a null system ID
    if (n.systemID != null && n.systemID.startsWith("text/")) return true;
    return false;
    
  }
  
  public String getText(String entityName) throws IOException {
    
    Object o = unparsedEntities.get(entityName);
    if (o == null) return "";
    UnparsedEntity entity = (UnparsedEntity) o;
    if (!isText(entity.notationName)) {
      return " binary data "; // could throw an exception instead
    }
    
    URL source;
    try {
      source = new URL(systemID, entity.systemID);     
    }
    catch (Exception e) {
      return " unresolvable entity "; // could throw an exception instead
    }
    
    // I'm not really handling character encodings here. 
    // A more detailed look at the MIME type would allow that.
    Reader in = new BufferedReader(new InputStreamReader(source.openStream()));
    StringBuffer result = new StringBuffer();
    int c;
    while ((c = in.read()) != -1) {
      // Is this necessary or will parser escape string automatically????
   /*   switch (c) {
        case '<': 
          result.append("&lt;");
          break;
        case '>': 
          result.append("&gt;");
          break;
        case '"': 
          result.append("&quot;");
          break;
        case '\'': 
          result.append("&apos;");
          break;
        case '&': 
          result.append("&amp;");
          break;
        default:
          result.append((char) c); 
      }*/
      result.append((char) c);
    }
    
    return result.toString();
    
  }

}

Handling Declarations

The optional DeclHandler interface covers those aspects of DTDs only a validating parser cares about:
- Element declarations
- Attribute declarations
- Internal entity declarations
- External entity declarations
An optional extension that not all parsers (particularly non-validating parsers) support
To set the DeclHandler for a parser, set the "http://xml.org/sax/properties/declaration-handler" property. A SAXNotRecognizedException or SAXNotSupportedException will be thrown if the parser doesn't support DeclHandler.

The DeclHandler interface:

package org.xml.sax.ext;

import org.xml.sax.SAXException;


public interface DeclHandler {

  public void elementDecl(String name, String model)
   throws SAXException;

  public void attributeDecl(String elementName, String attributeName, 
   String type, String defaultValue, String value) throws SAXException;

  public void internalEntityDecl(String name, String value)
   throws SAXException;

  public void externalEntityDecl(String name, String publicId,
   String systemId) throws SAXException;

}
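
A minimal sketch of installing a DeclHandler through the declaration-handler property. It assumes a parser from JAXP's SAXParserFactory that supports the property, and it extends org.xml.sax.ext.DefaultHandler2, a convenience class from later SAX2 releases that supplies empty DeclHandler methods. The class DeclarationLog and its output format are illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

class DeclarationLog extends DefaultHandler2 {

  final List<String> decls = new ArrayList<String>();

  public void elementDecl(String name, String model) {
    decls.add("ELEMENT " + name + " " + model);
  }

  public void attributeDecl(String elementName, String attributeName,
   String type, String defaultValue, String value) {
    decls.add("ATTLIST " + elementName + " " + attributeName + " " + type);
  }

  public void internalEntityDecl(String name, String value) {
    decls.add("ENTITY " + name + " \"" + value + "\"");
  }

  static List<String> log(String xml) throws Exception {
    XMLReader reader
     = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    DeclarationLog log = new DeclarationLog();
    // throws SAXNotRecognizedException or SAXNotSupportedException
    // if the parser cannot report declarations
    reader.setProperty(
     "http://xml.org/sax/properties/declaration-handler", log);
    reader.parse(new InputSource(new StringReader(xml)));
    return log.decls;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<!DOCTYPE SONG [\n"
     + "  <!ELEMENT SONG (#PCDATA)>\n"
     + "  <!ATTLIST SONG YEAR CDATA #IMPLIED>\n"
     + "  <!ENTITY label \"Polygram\">\n"
     + "]>\n"
     + "<SONG/>";
    for (String decl : log(xml)) System.out.println(decl);
  }

}
```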

Handling Lexical Events

The LexicalHandler interface reports:
- Comments
- CDATA sections
- Document type declaration
- Entities
An optional extension that not all parsers support
To set the LexicalHandler for a parser, set the "http://xml.org/sax/properties/lexical-handler" property. A SAXNotRecognizedException or SAXNotSupportedException will be thrown if the parser doesn't report lexical events.

The LexicalHandler interface:

package org.xml.sax.ext;

import org.xml.sax.SAXException;


public interface LexicalHandler {

  public void startDTD(String name, String publicId, String systemId)
     throws SAXException;
  public void endDTD() throws SAXException;
  public void startEntity(String name) throws SAXException;
  public void endEntity(String name) throws SAXException;
  public void startCDATA() throws SAXException;
  public void endCDATA() throws SAXException;
  public void comment (char[] text, int start, int length) 
   throws SAXException;

}

LexicalHandler Example

import org.xml.sax.*;
import org.xml.sax.ext.*;
import org.xml.sax.helpers.*;
import java.io.IOException;


public class SAXCommentReader implements LexicalHandler {

  public void startDTD(String name, String publicId, String systemId)
   throws SAXException {}
  public void endDTD() throws SAXException {}
  public void startEntity(String name) throws SAXException {}
  public void endEntity(String name) throws SAXException {}
  public void startCDATA() throws SAXException {}
  public void endCDATA() throws SAXException {}

  public void comment (char[] text, int start, int length)
   throws SAXException {

    String comment = new String(text, start, length);
    System.out.println(comment);

  }

  public static void main(String[] args) {

    // set up the parser
    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    }
    catch (SAXException e) {
      try {
        parser = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
      }
      catch (SAXException e2) {
        System.err.println("Error: could not locate a parser.");
        return;
      }
    }

    // turn on comment handling
    try {
      parser.setProperty("http://xml.org/sax/properties/lexical-handler",
       new SAXCommentReader());
    }
    catch (SAXNotRecognizedException e) {
      System.err.println(
       "Installed XML parser does not provide lexical events...");
      return;
    }
    catch (SAXNotSupportedException e) {
      System.err.println(
       "Cannot turn on comment processing here");
      return;
    }

    if (args.length == 0) {
      System.out.println("Usage: java SAXCommentReader URL1 URL2...");
    }

    // start parsing...
    for (int i = 0; i < args.length; i++) {

      try {
        parser.parse(args[i]);
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber()
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not check " + args[i]
         + " because of the IOException " + e);
      }

    }

  }

}

The Locator interface

Tells the callback class where in the document (line number, column number) a particular event took place
Optional but recommended
Parsers give the callback class a Locator by passing it to the setDocumentLocator() method of ContentHandler

The Locator interface:

package org.xml.sax;


public interface Locator {
    
  public String getPublicId();
  public String getSystemId();
  public int    getLineNumber();
  public int    getColumnNumber();
    
}

Locator Example

import org.xml.sax.*;
import org.apache.xerces.parsers.*; 
import java.io.*;


public class LocationReporter implements ContentHandler {

  Locator locator = null;

  public void setDocumentLocator(Locator locator) {
    this.locator = locator;  
  }
  
  private String reportPosition() {
    
    if (locator != null) {
      
      String publicID = locator.getPublicId();
      String systemID = locator.getSystemId();
      int line        = locator.getLineNumber();
      int column      = locator.getColumnNumber();
      
      String name;
      if (publicID != null) name = publicID;
      else name = systemID;
      
      return " in " + name + " at line " + line 
       + ", column " + column;
    }
    return "";
    
  }
  
  public void startDocument() throws SAXException {
    System.out.println("Document started" + reportPosition()); 
  }

  public void endDocument() throws SAXException {
    System.out.println("Document ended" + reportPosition()); 
  }
  
  public void characters(char[] text, int start, int length) 
   throws SAXException {
    System.out.println("Got some characters" + reportPosition()); 
  }
  
  public void ignorableWhitespace(char[] text, int start, int length)
   throws SAXException {
    System.out.println("Got some ignorable white space" 
     + reportPosition()); 
  }
  
  public void processingInstruction(String target, String data)
   throws SAXException {
    System.out.println("Got a processing instruction" 
     + reportPosition()); 
  }
  
  // Changed methods for SAX2
  public void startElement(String namespaceURI, String localName,
	 String rawName, Attributes atts) throws SAXException {
    System.out.println("Element " + rawName + " started" 
     + reportPosition()); 
  }
  
  public void endElement(String namespaceURI, String localName,
	 String rawName) throws SAXException {
    System.out.println("Element " + rawName + " ended" 
     + reportPosition()); 
  } 

  // new methods for SAX2
  public void startPrefixMapping(String prefix, String uri) 
   throws SAXException {
    System.out.println("Started mapping prefix " + prefix + " to URI " 
     + uri + reportPosition());     
  }

  public void endPrefixMapping(String prefix) throws SAXException {
    System.out.println("Stopped mapping prefix " 
     + prefix + reportPosition());         
  }

  public void skippedEntity(String name) throws SAXException {
    System.out.println("Skipped entity " + name + reportPosition());         
  }  

  // Could easily have put main() method in a separate class
  public static void main(String[] args) {
    
    XMLReader parser = new SAXParser();
     
    if (args.length == 0) {
      System.out.println(
       "Usage: java LocationReporter URL1 URL2..."); 
    } 
      
    // Install the Content Handler      
    parser.setContentHandler(new LocationReporter());
    
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      // command line should offer URIs or file names
      try {
        parser.parse(args[i]);
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber() 
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not report on " + args[i] 
         + " because of the IOException " + e);
      }
      
    }  
  
  }

}

The DefaultHandler class

Implements the main interfaces with do-nothing methods
- EntityResolver
- DTDHandler
- ContentHandler
- ErrorHandler
Replaces HandlerBase from SAX1
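
Because DefaultHandler supplies empty bodies for every callback, a subclass only overrides what it needs. A minimal sketch that counts elements, assuming a parser from JAXP's SAXParserFactory; the class ElementCounter is illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ElementCounter extends DefaultHandler {

  int count = 0;

  // the only callback this handler cares about;
  // every other event falls through to DefaultHandler's no-ops
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {
    count++;
  }

  static int count(String xml) throws Exception {
    XMLReader reader
     = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    ElementCounter counter = new ElementCounter();
    reader.setContentHandler(counter);
    reader.parse(new InputSource(new StringReader(xml)));
    return counter.count;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<SONG><TITLE>Hot Cop</TITLE><YEAR>1978</YEAR></SONG>";
    System.out.println(count(xml) + " elements"); // prints "3 elements"
  }

}
```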

The NamespaceSupport class

Mostly for internal parser use
Occasionally useful for tasks like finding out whether a document contains any XLinks

The NamespaceSupport class:

package org.xml.sax.helpers;

import java.util.Enumeration;

public class NamespaceSupport {

  public final static String XMLNS = "http://www.w3.org/XML/1998/namespace";

  public NamespaceSupport();

  public void reset();
  public void pushContext();
  public void popContext();
  public boolean declarePrefix(String prefix, String uri);
  public String getURI(String prefix);
  public Enumeration getPrefixes();
  public Enumeration getDeclaredPrefixes();
  public String[] processName(String qualifiedName, String[] parts, 
   boolean isAttribute);
   
}
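
A minimal sketch of using NamespaceSupport on its own, without a parser, to split qualified names the way a namespace-aware parser would. The class NamespaceDemo is illustrative; the two namespace URIs are the ones from the song example.

```java
import org.xml.sax.helpers.NamespaceSupport;

class NamespaceDemo {

  // Returns { namespace URI, local name, qualified name },
  // or null if the name uses an undeclared prefix
  static String[] resolve(String qualifiedName, boolean isAttribute) {

    NamespaceSupport support = new NamespaceSupport();

    // one context per element; declarations must come before lookups
    support.pushContext();
    support.declarePrefix("", "http://metalab.unc.edu/xml/namespace/song");
    support.declarePrefix("xlink", "http://www.w3.org/1999/xlink");

    String[] parts
     = support.processName(qualifiedName, new String[3], isAttribute);

    support.popContext();
    return parts;

  }

  public static void main(String[] args) {
    String[] element   = resolve("PHOTO", false);
    String[] plain     = resolve("WIDTH", true);
    String[] qualified = resolve("xlink:href", true);
    System.out.println("{" + element[0] + "}" + element[1]);
    System.out.println("{" + plain[0] + "}" + plain[1]);
    System.out.println("{" + qualified[0] + "}" + qualified[1]);
  }

}
```

The default namespace applies to the unprefixed element PHOTO but not to the unprefixed attribute WIDTH, which is why processName() needs the isAttribute flag.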

Filtering XML

The XMLFilter interface is like an XML reader, "except that it obtains its events from another XML reader rather than a primary source like an XML document or database. Filters can modify a stream of events as they pass on to the final application."
The parent is the parser it gets the data from.

Only two methods in the interface:

public void setParent(XMLReader parent)
public XMLReader getParent()

XMLFilterImpl is a default filter that simply passes along all events it receives:
public class XMLFilterImpl implements XMLFilter, EntityResolver, DTDHandler, ContentHandler, ErrorHandler

Only new methods are constructors:

public XMLFilterImpl()
public XMLFilterImpl(XMLReader parent)
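
A minimal sketch of a filter that changes events on their way to the application: it lower-cases element names and passes everything else through untouched. It assumes a namespace-aware parser from JAXP's SAXParserFactory as the parent; the class LowerCaseFilter is illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class LowerCaseFilter extends XMLFilterImpl {

  LowerCaseFilter(XMLReader parent) {
    super(parent);
  }

  // rewrite the names, then let XMLFilterImpl forward the event
  public void startElement(String uri, String localName,
   String rawName, Attributes atts) throws SAXException {
    super.startElement(uri, localName.toLowerCase(Locale.ROOT),
     rawName.toLowerCase(Locale.ROOT), atts);
  }

  public void endElement(String uri, String localName, String rawName)
   throws SAXException {
    super.endElement(uri, localName.toLowerCase(Locale.ROOT),
     rawName.toLowerCase(Locale.ROOT));
  }

  static List<String> names(String xml) throws Exception {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true);
    XMLReader parent = factory.newSAXParser().getXMLReader();

    // the application talks to the filter exactly as it would to a parser
    XMLReader filter = new LowerCaseFilter(parent);
    final List<String> names = new ArrayList<String>();
    filter.setContentHandler(new DefaultHandler() {
      public void startElement(String uri, String localName,
       String rawName, Attributes atts) {
        names.add(rawName);
      }
    });
    filter.parse(new InputSource(new StringReader(xml)));
    return names;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(names("<SONG><TITLE>Hot Cop</TITLE></SONG>"));
  }

}
```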

XMLFilter Example

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.util.*;
import java.io.IOException;


public class UnparsedTextFilter extends XMLFilterImpl {

  private TextEntityReplacer replacer;

  public UnparsedTextFilter(XMLReader parent) {
    super(parent);
    System.err.println("created UnparsedTextFilter");
  }

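  public void parse(InputSource input) throws IOException, SAXException {
    System.err.println("parsing");
    replacer = new TextEntityReplacer(input.getPublicId(), input.getSystemId());
    // Route notation and unparsed entity declarations to the replacer,
    // then let XMLFilterImpl hand the document to the parent reader
    this.setDTDHandler(replacer); 
    super.parse(input);
  }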
  // The other parse() method just calls this one 

  public void parse(String systemId) throws IOException, SAXException {
    parse(new InputSource(systemId)); 
  }

  public void startElement(String uri, String localName, 
   String rawName, Attributes attributes) throws SAXException {
    
    Vector extraText = new Vector();

    // Are there any unparsed entities in the attributes?
    for (int i = 0; i < attributes.getLength(); i++) {
      if (attributes.getType(i).equals("ENTITY")) {
        try {
          System.out.println("replacing");
          String s = replacer.getText(attributes.getValue(i));
          if (s != null) extraText.addElement(s);
        }
        catch (IOException e) {
          System.err.println(e); 
        }
      } 
      
    }    

    super.startElement(uri, localName, rawName, attributes);
    
    // Now spew out the values of the unparsed entities:
    Enumeration e = extraText.elements();
    while (e.hasMoreElements()) {
      Object o = e.nextElement();
      String s = (String) o;
      super.characters(s.toCharArray(), 0, s.length()); 
    }
    
  }

}

TextMerger

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.util.*;
import java.io.IOException;
import org.apache.xerces.parsers.*; 
import org.apache.xml.serialize.*;


public class TextMerger {

  public static void main(String[] args) {
  
    System.err.println("starting");
    XMLReader parser = new UnparsedTextFilter(new SAXParser());
    
    //essentially a pretty printer
    XMLSerializer printer 
     = new XMLSerializer(System.out, new OutputFormat());
    
    parser.setContentHandler(printer);
    
    for (int i = 0; i < args.length; i++) {
      try {
        parser.parse(args[i]);
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber() 
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not report on " + args[i] 
         + " because of the IOException " + e);
      }      
    }
  
  }

}

InputSource

Encapsulates access to data so that it looks the same whether it's coming from a
- URL
- file
- stream
- reader
- database
- something else
Used in SAX1 and SAX2
Allows the source to be changed

The InputSource class:

package org.xml.sax;

import java.io.*;

public class InputSource {

  public InputSource() 
  public InputSource(String systemID) 
  public InputSource(InputStream in)
  public InputSource(Reader in)

  public void setPublicId(String publicID)
  public String getPublicId()
  public void setSystemId(String systemID)
  public String getSystemId()

  public void setByteStream(InputStream byteStream)
  public InputStream getByteStream()
  public void setEncoding(String encoding)
  public String getEncoding()
  public void setCharacterStream(Reader characterStream)
  public Reader getCharacterStream()

}

Example of InputSource

import org.xml.sax.*;
import java.io.*;
import java.net.*;
import java.util.zip.*;
...
try {

  URL u = new URL("http://metalab.unc.edu/xml/examples/1998validstats.xml.gz"); 
  InputStream raw = u.openStream();
  // wrap the raw network stream, not the InputSource declared below
  InputStream decompressed = new GZIPInputStream(raw);
  InputSource in = new InputSource(decompressed);
  // read the document... 

}
catch (IOException e) {
  System.err.println(e);
}
catch (SAXException e) {
  System.err.println(e);
}
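
The same idea can be tried without a network connection. The sketch below compresses a small document in memory and parses it back through a GZIPInputStream wrapped in an InputSource. It assumes a parser from JAXP's SAXParserFactory; the class GzipSourceDemo and its methods are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class GzipSourceDemo {

  // stands in for a .gz file on a server
  static byte[] gzip(String xml) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    GZIPOutputStream out = new GZIPOutputStream(buffer);
    out.write(xml.getBytes(StandardCharsets.UTF_8));
    out.close();
    return buffer.toByteArray();
  }

  // parses the compressed bytes and returns the root element's name
  static String rootName(byte[] compressed) throws Exception {

    InputStream raw = new ByteArrayInputStream(compressed);
    InputStream decompressed = new GZIPInputStream(raw);
    InputSource in = new InputSource(decompressed);

    final String[] root = new String[1];
    XMLReader reader
     = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    reader.setContentHandler(new DefaultHandler() {
      public void startElement(String uri, String localName,
       String rawName, Attributes atts) {
        if (root[0] == null) root[0] = rawName; // first element seen
      }
    });
    reader.parse(in);
    return root[0];

  }

  public static void main(String[] args) throws Exception {
    byte[] data = gzip("<SONG><TITLE>Hot Cop</TITLE></SONG>");
    System.out.println(rootName(data)); // prints SONG
  }

}
```

The parser never learns that the data was compressed; it only sees the byte stream the InputSource hands it.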

What SAX2 doesn't do

ELEMENT, ATTLIST, ENTITY declarations are only optionally reported
Schema declarations aren't reported at all
Lexical events are only optionally reported
SAX2 can be configured on top of a lot of different parsers with different capabilities. What the parser does is more important than what SAX2 does.

Event Based API Caveats

You do not always have all the information you need at the time of a given callback
You may need to store information in various data structures (stacks, queues, vectors, arrays, etc.) and act on it at a later point
For example, the characters() method is not guaranteed to give you the maximum number of contiguous characters. It may split a single run of characters over multiple method calls.
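
A minimal sketch of the usual workaround: accumulate the text in a buffer between startElement() and endElement(), and only act on it once the element is complete. It assumes a parser from JAXP's SAXParserFactory; the class TitleReader is illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class TitleReader extends DefaultHandler {

  // non-null only while the parser is inside a TITLE element
  private StringBuilder buffer = null;

  final List<String> titles = new ArrayList<String>();

  public void startElement(String uri, String localName,
   String rawName, Attributes atts) {
    if (rawName.equals("TITLE")) buffer = new StringBuilder();
  }

  // may be called several times for one run of text
  public void characters(char[] text, int start, int length) {
    if (buffer != null) buffer.append(text, start, length);
  }

  public void endElement(String uri, String localName, String rawName) {
    if (rawName.equals("TITLE")) {
      titles.add(buffer.toString()); // the text is complete only now
      buffer = null;
    }
  }

  static List<String> titles(String xml) throws Exception {
    XMLReader reader
     = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    TitleReader handler = new TitleReader();
    reader.setContentHandler(handler);
    reader.parse(new InputSource(new StringReader(xml)));
    return handler.titles;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<SONG><TITLE>Hot &amp; Cold Cop</TITLE></SONG>";
    System.out.println(titles(xml)); // prints [Hot & Cold Cop]
  }

}
```

Parsers commonly break the text at entity references such as &amp;, so a handler that used each characters() call on its own would see only fragments of the title.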

To Learn More

This presentation: http://metalab.unc.edu/xml/slides/sd2000east/sax

Questions?
