A Brief Review of XML Rules and Terminology
Reading XML through SAX2
You need a JDK
You need some free class libraries
You need a text editor
You need some data to process
Are familiar with Java including I/O, classes, objects, polymorphism, etc.
Know XML including well-formedness, validity, namespaces, and so forth
I will briefly review proper terminology
Markup includes:
Tags
Entity References
Comments
Processing Instructions
Document Type Declarations
XML Declaration
CDATA Section Delimiters
Character data includes everything else
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/css" href="song.css"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<SONG xmlns="http://metalab.unc.edu/xml/namespace/song"
xmlns:xlink="http://www.w3.org/1999/xlink">
<TITLE>Hot Cop</TITLE>
<PHOTO
xlink:type="simple" xlink:show="onLoad" xlink:href="hotcop.jpg"
ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200"/>
<COMPOSER>Jacques Morali</COMPOSER>
<COMPOSER>Henri Belolo</COMPOSER>
<COMPOSER>Victor Willis</COMPOSER>
<PRODUCER>Jacques Morali</PRODUCER>
<!-- The publisher is actually Polygram but I needed
an example of a general entity reference. -->
<PUBLISHER xlink:type="simple" xlink:href="http://www.amrecords.com/">
A &amp; M Records
</PUBLISHER>
<LENGTH>6:20</LENGTH>
<YEAR>1978</YEAR>
<ARTIST>Village People</ARTIST>
</SONG>
<!-- You can tell what album I was
listening to when I wrote this example -->
An XML document is made up of one or more physical storage units called entities
Entity references :
Parsed internal general entity references like &amp;
Parsed external general entity references
Unparsed external general entity references
External parameter entity references
Internal parameter entity references
Reading an XML document is not the same thing as reading an XML file
The file contains entity references.
The document contains the entities' replacement text.
When you use a parser to read a document you'll get the text including characters like <. You will not see the entity references.
Character data left after entity references are replaced with their text
Given the element
<PUBLISHER>A &amp; M Records</PUBLISHER>
The parsed character data is
A & M Records
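A minimal sketch of the point above: the parser hands your code the replacement text, not the entity reference. The class name ParsedTextDemo and the helper textOf() are made up for illustration, and the reader is obtained through JAXP's SAXParserFactory (an assumption made so the example runs on a plain JDK) rather than through XMLReaderFactory.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ParsedTextDemo {

    // Parses an XML string and returns all character data the parser reports.
    static String textOf(String xml) {
        final StringBuilder text = new StringBuilder();
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    // May be called several times for one element, so append.
                    text.append(ch, start, length);
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Demo only: wrap the checked parser exceptions.
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The file contains &amp; but the handler sees a plain ampersand.
        System.out.println(textOf("<PUBLISHER>A &amp; M Records</PUBLISHER>"));
    }
}
```

Run as is, this prints A & M Records.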
Used to include large blocks of text with lots of normally illegal literal characters like < and &, typically XML or HTML.
<p>You can use a default <code>xmlns</code>
attribute to avoid having to add the svg
prefix to all
your elements:</p>
<![CDATA[
<svg xmlns="http://www.w3.org/Graphics/SVG/SVG-19991203.dtd"
width="12cm" height="10cm">
<ellipse rx="110" ry="130" />
<rect x="4cm" y="1cm" width="3cm" height="6cm" />
</svg>
]]>
CDATA is for human authors, not for programs!
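One way to see why CDATA is only an authoring convenience: a default SAX ContentHandler receives the same characters whether the author used a CDATA section or escaped each character. The sketch below compares the two; CdataDemo and textOf() are illustrative names, and JAXP's SAXParserFactory is assumed as the way to get a reader.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class CdataDemo {

    // Returns the character data reported for the given XML string.
    static String textOf(String xml) {
        final StringBuilder text = new StringBuilder();
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Demo only: wrap the checked parser exceptions.
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String withCdata = "<p><![CDATA[<rect x=\"4cm\"/> & more]]></p>";
        String escaped = "<p>&lt;rect x=\"4cm\"/&gt; &amp; more</p>";
        // Both forms yield the same parsed character data.
        System.out.println(textOf(withCdata).equals(textOf(escaped)));
    }
}
```

A program that needs to know where a CDATA section started and ended has to ask for lexical events explicitly; the content handler alone cannot tell the two documents apart.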
<!-- Before posting this page, I need to double check the number
of pelicans in Louisiana in 1970 -->
Comments are for humans, not programs.
Divided into a target and data for the target
The target must be an XML name
The data can have an effectively arbitrary format
<?robots index="yes" follow="no"?>
<?xml-stylesheet href="pelicans.css" type="text/css"?>
<?php
mysql_connect("database.unc.edu", "clerk", "password");
$result = mysql("CYNW", "SELECT LastName, FirstName FROM Employees
ORDER BY LastName, FirstName");
$i = 0;
while ($i < mysql_numrows ($result)) {
$fields = mysql_fetch_row($result);
echo "<person>$fields[1] $fields[0] </person>\r\n";
$i++;
}
mysql_close();
?>
These are for programs
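As a rough illustration of the target/data split, the sketch below records each processing instruction a SAX parser reports. PiDemo and instructions() are hypothetical names, and the reader comes from JAXP's SAXParserFactory as an assumption. Note that the XML declaration is not reported, since it is not a processing instruction.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class PiDemo {

    // Returns one "target -> data" string per processing instruction.
    static List<String> instructions(String xml) {
        final List<String> found = new ArrayList<>();
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void processingInstruction(String target, String data) {
                    // The parser splits the instruction at the first white space.
                    found.add(target + " -> " + data);
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Demo only: wrap the checked parser exceptions.
            throw new IllegalStateException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>"
            + "<?robots index=\"yes\" follow=\"no\"?>"
            + "<doc/>";
        for (String pi : instructions(xml)) {
            System.out.println(pi);
        }
    }
}
```

The data arrives as one opaque string; any pseudo-attribute structure inside it is up to the program that owns the target.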
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
Looks like a processing instruction but isn't.
version
attribute
required
always has the value 1.0
encoding
attribute
UTF-8
8859_1
etc.
standalone
attribute
yes
no
<!DOCTYPE SONG SYSTEM "song.dtd">
<!ELEMENT SONG (TITLE, COMPOSER+, PRODUCER*, PUBLISHER*, YEAR?, LENGTH?, ARTIST+)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT COMPOSER (#PCDATA)>
<!ELEMENT PRODUCER (#PCDATA)>
<!ELEMENT PUBLISHER (#PCDATA)>
<!ELEMENT LENGTH (#PCDATA)>
<!-- This should be a four digit year like "1999",
not a two-digit year like "99" -->
<!ELEMENT YEAR (#PCDATA)>
<!ELEMENT ARTIST (#PCDATA)>
Used for element, attribute, and entity names
Can contain any alphabetic, ideographic, or numeric Unicode character
Can contain hyphen, underscore, or period
Can also contain colons but these are reserved for namespaces
Can begin with any alphabetic or ideographic character or the underscore but not digits or other punctuation marks
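The rules above can be approximated in a few lines of Java. This is only a sketch: it follows the rules as stated on this page and leans on Character.isLetter() and Character.isLetterOrDigit(), which are close to but not identical with the character classes in the XML 1.0 specification. XmlNameCheck and isName() are made-up names.

```java
class XmlNameCheck {

    // Approximate test for an XML name, following the rules listed above.
    static boolean isName(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        // First character: a letter (alphabetic or ideographic) or underscore.
        int first = s.codePointAt(0);
        if (!(Character.isLetter(first) || first == '_')) {
            return false;
        }
        // Remaining characters: letters, digits, hyphen, underscore,
        // period, or the colon reserved for namespaces.
        int i = Character.charCount(first);
        while (i < s.length()) {
            int c = s.codePointAt(i);
            boolean ok = Character.isLetterOrDigit(c)
                || c == '-' || c == '_' || c == '.' || c == ':';
            if (!ok) {
                return false;
            }
            i += Character.charCount(c);
        }
        return true;
    }

    public static void main(String[] args) {
        String[] candidates = {"xsl:template", "_private", "first.name", "1999", "has space"};
        for (String candidate : candidates) {
            System.out.println(candidate + ": " + isName(candidate));
        }
    }
}
```

A real parser applies the exact productions from the specification, so treat this as a quick sanity check rather than a validator.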
Raison d'etre:
To distinguish between elements and attributes from different vocabularies with different meanings.
To group all related elements and attributes together so that a parser can easily recognize them.
Each element is given a prefix
Each prefix (as well as the empty prefix) is associated with a URI
Elements with the same URI are in the same namespace
URIs are purely formal. They do not necessarily point to a page.
Elements and attributes that are in namespaces have names that contain exactly one colon. They look like this:
rdf:description
xlink:type
xsl:template
Everything before the colon is called the prefix
Everything after the colon is called the local part or local name.
The complete name including the colon is called the qualified name or raw name.
Each prefix in a qualified name is associated with a URI.
For example, all elements in XSLT 1.0 style sheets are associated with the http://www.w3.org/1999/XSL/Transform URI.
The customary prefix xsl is a shorthand for the longer URI http://www.w3.org/1999/XSL/Transform.
You can't use the URI in the element name directly.
Prefixes are bound to namespace URIs by attaching an xmlns:prefix attribute to the prefixed element or one of its ancestors.
<svg:svg xmlns:svg="http://www.w3.org/Graphics/SVG/SVG-19991203.dtd"
width="12cm" height="10cm">
<svg:ellipse rx="110" ry="130" />
<svg:rect x="4cm" y="1cm" width="3cm" height="6cm" />
</svg:svg>
Bindings have scope within the element where they're declared.
An SVG processor can recognize all three of these elements as SVG elements because they all have prefixes bound to the particular URI defined by the SVG specification.
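A short sketch of how a namespace aware SAX parser reports prefixed elements: the handler receives the namespace URI and the local name separately, so the prefix itself stops mattering. NamespaceNamesDemo and elementNames() are illustrative names, and JAXP's SAXParserFactory with setNamespaceAware(true) is assumed as the way to get a namespace aware reader.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class NamespaceNamesDemo {

    // Returns "{namespace URI}local name" for every element in the document.
    static List<String> elementNames(String xml) {
        final List<String> names = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String namespaceURI, String localName,
                                         String qName, Attributes atts) {
                    names.add("{" + namespaceURI + "}" + localName);
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Demo only: wrap the checked parser exceptions.
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml =
            "<svg:svg xmlns:svg=\"http://www.w3.org/Graphics/SVG/SVG-19991203.dtd\""
            + " width=\"12cm\" height=\"10cm\">"
            + "<svg:ellipse rx=\"110\" ry=\"130\" />"
            + "<svg:rect x=\"4cm\" y=\"1cm\" width=\"3cm\" height=\"6cm\" />"
            + "</svg:svg>";
        for (String name : elementNames(xml)) {
            System.out.println(name);
        }
    }
}
```

Changing the prefix from svg to anything else, while keeping the URI, produces exactly the same list.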
Indicate that an unprefixed element and all its unprefixed descendant elements belong to a particular namespace by attaching an xmlns attribute with no prefix:
<DATASCHEMA xmlns="http://www.w3.org/2000/P3Pv1">
<DATA name="vehicle.make" type="text" short="Make"
category="preference" size="31"/>
<DATA name="vehicle.model" type="text" short="Model"
category="preference" size="31"/>
<DATA name="vehicle.year" type="number" short="Year"
category="preference" size="4"/>
<DATA name="vehicle.license.state." type="postal." short="State"
category="preference" size="2"/>
<DATA name="vehicle.license.number" type="text"
short="License Plate Number" category="preference" size="12"/>
</DATASCHEMA>
Both the DATASCHEMA and DATA elements are in the http://www.w3.org/2000/P3Pv1 namespace.
Default namespaces apply only to elements, not to attributes.
Thus in the above example the name, type, short, category, and size attributes are not in any namespace.
Unprefixed attributes are never in any namespace.
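The difference between elements and attributes under a default namespace can be checked with a small handler. In this sketch, which assumes a namespace aware reader from JAXP's SAXParserFactory, the element reports the default namespace URI while its unprefixed attribute reports the empty string. DefaultNamespaceDemo and namespacesOf() are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class DefaultNamespaceDemo {

    // For the root element, returns { element namespace URI,
    // namespace URI of the attribute with the given local name }.
    static String[] namespacesOf(String xml, final String attributeName) {
        final String[] result = new String[2];
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String namespaceURI, String localName,
                                         String qName, Attributes atts) {
                    if (result[0] != null) {
                        return; // only look at the root element
                    }
                    result[0] = namespaceURI;
                    for (int i = 0; i < atts.getLength(); i++) {
                        if (atts.getLocalName(i).equals(attributeName)) {
                            result[1] = atts.getURI(i);
                        }
                    }
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Demo only: wrap the checked parser exceptions.
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<DATA xmlns=\"http://www.w3.org/2000/P3Pv1\""
            + " name=\"vehicle.make\" type=\"text\"/>";
        String[] ns = namespacesOf(xml, "name");
        System.out.println("element namespace:   [" + ns[0] + "]");
        System.out.println("attribute namespace: [" + ns[1] + "]");
    }
}
```

SAX2 represents "no namespace" as the empty string rather than null, which is why the second value prints as empty brackets.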
You can change the default namespace within a particular element by adding an xmlns attribute to the element.
Namespaces were added to XML 1.0 after the fact, but care was taken to ensure backwards compatibility.
An XML 1.0 parser that does not know about namespaces will most likely not have any troubles reading a document that uses namespaces.
A namespace aware parser also checks to see that all prefixes are mapped to URIs. Otherwise it behaves almost exactly like a non-namespace aware parser.
Other software that sits on top of the raw XML parser, an XSLT engine for example, may treat elements differently depending on what namespace they belong to. However, the XML parser itself mostly doesn't care as long as all well-formedness and namespace constraints are met.
A possible exception occurs in the unlikely event that elements with different prefixes belong to the same namespace or elements with the same prefix belong to different namespaces
Many parsers have the option of whether to report namespace violations so that you can turn namespace processing on or off as you see fit.
A W3C standard for determining when two documents are the same after:
Entity references are resolved
Document is converted to Unicode
Unicode combining forms are combined
Comments are stripped
White space is normalized
Default attribute values are added
If at all possible, your programs should depend only on the canonical form of the document
Canonical form of hotcop.xml:
<?xml-stylesheet type="text/css" href="song.css"?><SONG> <TITLE>Hot Cop</TITLE> <COMPOSER>Jacques Morali</COMPOSER> <COMPOSER>Henri Belolo</COMPOSER> <COMPOSER>Victor Willis</COMPOSER> <PRODUCER>Jacques Morali</PRODUCER> <PUBLISHER>A &amp; M Records</PUBLISHER> <LENGTH>6:20</LENGTH> <YEAR>1978</YEAR> <ARTIST>Village People</ARTIST> </SONG>
The stereotypical "Desperate Perl Hacker" (DPH) is supposed to be able to write an XML parser in a weekend.
The parser does the hard work for you.
Your code reads the document through the parser's API.
SAX, the Simple API for XML
SAX1
SAX2
DOM, the Document Object Model
DOM Level 0
DOM Level 1
DOM Level 2
Proprietary APIs
Parser specific APIs
Sun's Java API for XML Parsing = SAX1 + DOM1 + a few factory classes
JSR-000031 XML Data Binding Specification from Bluestone, Sun, webMethods et al.
The proposed specification will define an XML data-binding facility for the JavaTM Platform. Such a facility compiles an XML schema into one or more Java classes. These automatically-generated classes handle the translation between XML documents that follow the schema and interrelated instances of the derived classes. They also ensure that the constraints expressed in the schema are maintained as instances of the classes are manipulated.
Public domain, developed on xml-dev mailing list
Maintained by David Megginson
org.xml.sax package
Event based
SAX1 omits:
Comments
Lexical Information (CDATA sections, entity references, etc.)
DTD declarations
Validation
Namespaces
Parser | URL | Validating | Namespaces | DOM1 | DOM2 | SAX1 | SAX2 | License |
---|---|---|---|---|---|---|---|---|
Apache XML Project's Xerces Java | http://xml.apache.org/xerces-j/index.html | X | X | X | X | X | X | Apache Software License, Version 1.1 |
IBM's XML for Java | http://www.alphaworks.ibm.com/formula/xml | X | X | X | X | X | X | License |
James Clark's XP | http://www.jclark.com/xml/xp/index.html | X | Modified BSD | |||||
Microstar's Ælfred | http://home.pacbell.net/david-b/xml/ | Namespaces | DOM1 | DOM2 | SAX1 | SAX2 | open source | |
Silfide's SXP | http://www.loria.fr/projets/XSilfide/EN/sxp/ | X | X | Non-GPL viral open source license | ||||
Sun's Java API for XML | http://java.sun.com/products/xml | X | X | X | X | free beer | ||
Oracle's XML Parser for Java | http://technet.oracle.com/ | X | X | X | X | free beer |
Completely ignores document type declaration
Validation and other optional results of DTD (attribute defaulting, external entities, etc.) are at parser default
Comments
XML Declaration
Does not report CDATA sections, entity references, and other non-canonical information from the document.
No explicit support for namespaces
Adds:
Namespace support
Optional Validation
Optional Lexical events for comments, CDATA sections, entity references
A lot more configurable
Deprecates a lot of SAX1
Adapter classes convert between parsers.
Use the factory method XMLReaderFactory.createXMLReader() to retrieve a parser-specific implementation of the XMLReader interface
Your code registers a ContentHandler with the parser
An InputSource feeds the document into the parser
As the document is read, the parser calls back to the methods of the ContentHandler to tell it what it's seeing in the document.
The XMLReaderFactory.createXMLReader() method instantiates the XMLReader implementation named by the org.xml.sax.driver system property:
try {
XMLReader parser = XMLReaderFactory.createXMLReader();
}
catch (SAXException e) {
System.err.println(e);
}
The XMLReaderFactory.createXMLReader(String className) method instantiates the XMLReader implementation named by its argument:
try {
XMLReader parser
= XMLReaderFactory.createXMLReader(
"org.apache.xerces.parsers.SAXParser");
}
catch (SAXException e) {
System.err.println(e);
}
Or you can use the constructor in the package-specific class:
XMLReader parser = new SAXParser();
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.*;

public class SAX2Checker {

  public static void main(String[] args) {

    if (args.length == 0) {
      System.out.println("Usage: java SAX2Checker URL1 URL2...");
    }

    // set up the parser
    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    }
    catch (SAXException e) {
      try {
        parser = XMLReaderFactory.createXMLReader(
         "org.apache.xerces.parsers.SAXParser");
      }
      catch (SAXException e2) {
        System.err.println("Error: could not locate a parser.");
        return;
      }
    }

    // start parsing...
    for (int i = 0; i < args.length; i++) {

      // command line should offer URIs or file names
      try {
        parser.parse(args[i]);
        // If there are no well-formedness errors
        // then no exception is thrown
        System.out.println(args[i] + " is well formed.");
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber()
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not check " + args[i]
         + " because of the IOException " + e);
      }

    }

  }

}
C:\>java SAX2Checker http://metalab.unc.edu/xml/
http://metalab.unc.edu/xml/ is not well formed.
The element type "dt" must be terminated by the
matching end-tag "</dt>".
at line 186, column 5
package org.xml.sax;

public interface ContentHandler {

  public void setDocumentLocator(Locator locator);
  public void startDocument() throws SAXException;
  public void endDocument() throws SAXException;
  public void startPrefixMapping(String prefix, String uri)
   throws SAXException;
  public void endPrefixMapping(String prefix) throws SAXException;
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) throws SAXException;
  public void endElement(String namespaceURI, String localName,
   String qualifiedName) throws SAXException;
  public void characters(char[] ch, int start, int length)
   throws SAXException;
  public void ignorableWhitespace(char ch[], int start, int length)
   throws SAXException;
  public void processingInstruction(String target, String data)
   throws SAXException;
  public void skippedEntity(String name) throws SAXException;

}
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.*;

public class EventReporter implements ContentHandler {

  public void setDocumentLocator(Locator locator) {
    System.out.println("setDocumentLocator(" + locator + ")");
  }

  public void startDocument() throws SAXException {
    System.out.println("startDocument()");
  }

  public void endDocument() throws SAXException {
    System.out.println("endDocument()");
  }

  public void startElement(String namespaceURI, String localName,
   String qName, Attributes atts) throws SAXException {

    namespaceURI = '"' + namespaceURI + '"';
    localName = '"' + localName + '"';
    qName = '"' + qName + '"';
    String attributeString = "{";
    for (int i = 0; i < atts.getLength(); i++) {
      attributeString += atts.getQName(i) + "=\""
       + atts.getValue(i) + "\"";
      if (i != atts.getLength()-1) attributeString += ", ";
    }
    attributeString += "}";
    System.out.println("startElement(" + namespaceURI + ", "
     + localName + ", " + qName + ", " + attributeString + ")");

  }

  public void endElement(String namespaceURI, String localName,
   String qName) throws SAXException {

    namespaceURI = '"' + namespaceURI + '"';
    localName = '"' + localName + '"';
    qName = '"' + qName + '"';
    System.out.println("endElement(" + namespaceURI + ", "
     + localName + ", " + qName + ")");

  }

  public void characters(char[] text, int start, int length)
   throws SAXException {

    // Only the range from start to start+length belongs to this event;
    // the rest of the array may hold unrelated parser data
    String textString = "[" + new String(text, start, length) + "]";
    System.out.println("characters(" + textString + ", "
     + start + ", " + length + ")");

  }

  public void ignorableWhitespace(char[] text, int start, int length)
   throws SAXException {
    System.out.println("ignorableWhitespace()");
  }

  public void processingInstruction(String target, String data)
   throws SAXException {
    System.out.println("processingInstruction(" + target + ", "
     + data + ")");
  }

  public void startPrefixMapping(String prefix, String uri)
   throws SAXException {
    System.out.println("startPrefixMapping(\"" + prefix + "\", \""
     + uri + "\")");
  }

  public void endPrefixMapping(String prefix) throws SAXException {
    System.out.println("endPrefixMapping(\"" + prefix + "\")");
  }

  public void skippedEntity(String name) throws SAXException {
    System.out.println("skippedEntity(" + name + ")");
  }

  // Could easily have put main() method in a separate class
  public static void main(String[] args) {

    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    }
    catch (Exception e) {
      // fall back on Xerces parser by name
      try {
        parser = XMLReaderFactory.createXMLReader(
         "org.apache.xerces.parsers.SAXParser");
      }
      catch (Exception ee) {
        System.err.println("Couldn't locate a SAX parser");
        return;
      }
    }

    if (args.length == 0) {
      System.out.println(
       "Usage: java EventReporter URL1 URL2...");
    }

    // Install the Content Handler
    parser.setContentHandler(new EventReporter());

    // start parsing...
    for (int i = 0; i < args.length; i++) {

      // command line should offer URIs or file names
      try {
        parser.parse(args[i]);
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber()
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not report on " + args[i]
         + " because of the IOException " + e);
      }

    }

  }

}
UserLand's RSS based list of Web logs is at http://static.userland.com/weblogMonitor/logs.xml
Design Decisions
Should we return an array, an Enumeration, a List, or what?
Perhaps we should use multiple threads?
We do not know how many URLs there will be when we start parsing
so let's use a Vector
Single threaded for simplicity but a real program would use multiple threads
One to load and parse the data
Another thread (probably the main thread) to serve the data
Early data could be provided before the entire document had been read
The character data of each url element needs to be stored.
Everything else can be ignored.
A startElement() with the name url indicates that we need to start storing this data.
An endElement() with the name url indicates that we need to stop storing this data, convert it to a URL and put it in the Vector
Hide the XML parsing inside a non-public class to avoid accidentally calling the methods from unexpected places or threads?
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.util.*;
import java.io.*;

public class Weblogs {

  public static List listChannels()
   throws IOException, SAXException {
    return listChannels(
     "http://static.userland.com/weblogMonitor/logs.xml");
  }

  public static List listChannels(String uri)
   throws IOException, SAXException {

    // set up the parser
    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    }
    catch (SAXException e) {
      try {
        parser = XMLReaderFactory.createXMLReader(
         "org.apache.xerces.parsers.SAXParser");
      }
      catch (SAXException e2) {
        System.err.println("Error: could not locate a parser.");
        return null;
      }
    }

    Vector urls = new Vector(1000);
    URIGrabber u = new URIGrabber(urls);
    parser.setContentHandler(u);
    parser.parse(uri);
    return urls;

  }

  public static void main(String[] args) {

    try {
      List urls;
      if (args.length > 0) urls = listChannels(args[0]);
      else urls = listChannels();
      Iterator iterator = urls.iterator();
      while (iterator.hasNext()) {
        System.out.println(iterator.next());
      }
    }
    catch (IOException e) {
      System.err.println(e);
    }
    catch (SAXParseException e) {
      System.err.println(e);
      System.err.println("at line " + e.getLineNumber()
       + ", column " + e.getColumnNumber());
    }
    catch (SAXException e) {
      System.err.println(e);
    }
    catch (/* Unexpected */ Exception e) {
      e.printStackTrace();
    }

  }

}
import org.xml.sax.*;
import java.net.*;
import java.util.Vector;

// conflicts with java.net.ContentHandler
class URIGrabber implements org.xml.sax.ContentHandler {

  private Vector urls;

  URIGrabber(Vector urls) {
    this.urls = urls;
  }

  // do nothing methods
  public void setDocumentLocator(Locator locator) {}
  public void startDocument() throws SAXException {}
  public void endDocument() throws SAXException {}
  public void startPrefixMapping(String prefix, String uri)
   throws SAXException {}
  public void endPrefixMapping(String prefix) throws SAXException {}
  public void skippedEntity(String name) throws SAXException {}
  public void ignorableWhitespace(char[] text, int start, int length)
   throws SAXException {}
  public void processingInstruction(String target, String data)
   throws SAXException {}

  // Remember, there's no guarantee all the text of the
  // url element will be returned in a single call to characters
  private StringBuffer urlBuffer;
  private boolean collecting = false;

  public void startElement(String namespaceURI, String localName,
   String rawName, Attributes atts) throws SAXException {

    if (rawName.equals("url")) {
      collecting = true;
      urlBuffer = new StringBuffer();
    }

  }

  public void characters(char[] text, int start, int length)
   throws SAXException {

    if (collecting) {
      urlBuffer.append(text, start, length);
    }

  }

  public void endElement(String namespaceURI, String localName,
   String rawName) throws SAXException {

    if (rawName.equals("url")) {
      collecting = false;
      String url = urlBuffer.toString();
      try {
        urls.addElement(new URL(url));
      }
      catch (MalformedURLException e) {
        // skip this url
      }
    }

  }

}
% java Weblogs shortlogs.xml
http://www.mozillazine.org
http://www.salonherringwiredfool.com/
http://www.scripting.com/
http://www.slashdot.org/
SAX2 parsers--that is XMLReaders--are configured by features and properties
Feature and property names are absolute URIs
A feature is boolean, on or off, true or false; a property is an object
public boolean getFeature(String name)
throws SAXNotRecognizedException, SAXNotSupportedException
public void setFeature(String name, boolean value)
throws SAXNotRecognizedException, SAXNotSupportedException
public Object getProperty(String name)
throws SAXNotRecognizedException, SAXNotSupportedException
public void setProperty(String name, Object value)
throws SAXNotRecognizedException, SAXNotSupportedException
Features can be read-only or read/write.
Some features may be modifiable while parsing; others only before parsing starts
For example,
try {
if (xmlReader.getFeature("http://xml.org/sax/features/validation")) {
System.out.println("Parser is validating.");
}
else {
System.out.println("Parser is not validating.");
}
}
catch (SAXException e) {
System.out.println("Do not know if parser validates");
}
SAXNotRecognizedException: the parser does not recognize a requested feature or property
SAXNotSupportedException: the parser does not support a requested feature/property or the feature/property is read-only
http://xml.org/sax/features/namespaces
If true, then perform namespace processing.
If false, then, at parser option, do not perform namespace processing
access: (parsing) read-only; (not parsing) read/write
true by default
http://xml.org/sax/features/namespace-prefixes
If true, then report the original prefixed names and attributes used for namespace declarations.
If false, then do not report attributes used for namespace declarations, and optionally do not report original prefixed names.
false by default
access: (parsing) read-only; (not parsing) read/write
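To make the effect of these two features concrete, the sketch below counts the attributes reported on an element that carries one namespace declaration and one ordinary attribute. It assumes a namespace aware reader from JAXP's SAXParserFactory; NamespacePrefixesDemo and reportedAttributes() are made-up names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class NamespacePrefixesDemo {

    static final String NAMESPACE_PREFIXES =
        "http://xml.org/sax/features/namespace-prefixes";

    // Returns how many attributes the parser reports on the root element.
    static int reportedAttributes(String xml, boolean namespacePrefixes) {
        final int[] count = {-1};
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // namespaces feature on
            XMLReader reader = factory.newSAXParser().getXMLReader();
            // Must be set before parsing starts; it is read-only while parsing.
            reader.setFeature(NAMESPACE_PREFIXES, namespacePrefixes);
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String namespaceURI, String localName,
                                         String qName, Attributes atts) {
                    if (count[0] < 0) {
                        count[0] = atts.getLength();
                    }
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Demo only: wrap the checked parser exceptions.
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        String xml =
            "<svg:svg xmlns:svg=\"http://www.w3.org/Graphics/SVG/SVG-19991203.dtd\""
            + " width=\"12cm\"/>";
        System.out.println("namespace-prefixes=false: " + reportedAttributes(xml, false));
        System.out.println("namespace-prefixes=true:  " + reportedAttributes(xml, true));
    }
}
```

With the feature off, only width should be reported; with it on, the xmlns:svg declaration should appear as an attribute too.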
http://xml.org/sax/features/namespaces
http://xml.org/sax/features/namespace-prefixes
http://xml.org/sax/features/string-interning
If true, then all element names, prefixes, attribute names, Namespace URIs, and local names are internalized using java.lang.String.intern().
If false, then names are not necessarily internalized.
access: (parsing) read-only; (not parsing) read/write
http://xml.org/sax/features/validation
If true, then report all validation errors
If false, then do not report validation errors.
access: (parsing) read-only; (not parsing) read/write
http://xml.org/sax/features/external-general-entities
If true, then include all external general (text) entities.
false: Do not include external general entities.
access: (parsing) read-only; (not parsing) read/write
http://xml.org/sax/features/external-parameter-entities
If true, then include all external parameter entities, including the external DTD subset.
false: Do not include any external parameter entities, even the external DTD subset.
access: (parsing) read-only; (not parsing) read/write
adapted from SAX2 documentation by David Megginson
Not all parsers are validating but Xerces-J is.
Validity errors are not fatal; therefore they do not throw SAXParseExceptions
Must install an ErrorHandler as well as a ContentHandler
Must set the feature http://xml.org/sax/features/validation
In increasing order of severity
A warning; e.g. ambiguous content model, a constraint for compatibility
A recoverable error: typically a validity error
A fatal error: typically a well-formedness error
package org.xml.sax;
public interface ErrorHandler {
public void warning(SAXParseException exception)
throws SAXException;
public void error(SAXParseException exception)
throws SAXException;
public void fatalError(SAXParseException exception)
throws SAXException;
}
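A small counting handler makes the three severity levels visible. In this sketch, SeverityCounter and check() are illustrative names, the DTD is supplied in the internal subset so the example is self-contained, and the reader is assumed to come from JAXP's SAXParserFactory. An invalid but well-formed document should produce an error and let the parse finish; a malformed one should produce a fatal error and stop.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

class SeverityCounter implements ErrorHandler {

    int warnings;
    int errors;
    int fatalErrors;

    public void warning(SAXParseException e) { warnings++; }

    // Validity errors are recoverable: count them and let parsing continue.
    public void error(SAXParseException e) { errors++; }

    // Well-formedness errors end the parse.
    public void fatalError(SAXParseException e) throws SAXParseException {
        fatalErrors++;
        throw e;
    }

    // Parses the document with validation turned on and returns the counts.
    static SeverityCounter check(String xml) {
        SeverityCounter counter = new SeverityCounter();
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature("http://xml.org/sax/features/validation", true);
            reader.setErrorHandler(counter);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            // Fatal error: already counted in fatalError().
        } catch (Exception e) {
            // Demo only: wrap any other checked exception.
            throw new IllegalStateException(e);
        }
        return counter;
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE SONG ["
            + "<!ELEMENT SONG (TITLE)><!ELEMENT TITLE (#PCDATA)>]>";
        SeverityCounter invalid = check(dtd + "<SONG/>");
        SeverityCounter malformed = check(dtd + "<SONG><TITLE>Hot Cop</SONG>");
        System.out.println("invalid:   errors=" + invalid.errors
            + " fatal=" + invalid.fatalErrors);
        System.out.println("malformed: errors=" + malformed.errors
            + " fatal=" + malformed.fatalErrors);
    }
}
```

Rethrowing from fatalError() is a design choice; a handler that returns normally still cannot make the parser continue unless the parser offers a feature for that.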
import org.xml.sax.*;
import java.io.*;

public class ValidityErrorReporter implements ErrorHandler {

  Writer out;

  public ValidityErrorReporter(Writer out) {
    this.out = out;
  }

  public ValidityErrorReporter() {
    this(new OutputStreamWriter(System.out));
  }

  public void warning(SAXParseException ex) throws SAXException {
    report(ex);
  }

  public void error(SAXParseException ex) throws SAXException {
    report(ex);
  }

  public void fatalError(SAXParseException ex) throws SAXException {
    report(ex);
  }

  // All three severities are written out the same way
  private void report(SAXParseException ex) throws SAXException {

    try {
      out.write(ex.getMessage() + "\r\n");
      out.write(" at line " + ex.getLineNumber() + ", column "
       + ex.getColumnNumber() + "\r\n");
      out.flush();
    }
    catch (IOException e) {
      throw new SAXException(e);
    }

  }

}
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.*;

public class Validator {

  public static void main(String[] args) {

    // createXMLReader() throws a checked SAXException
    // that must be caught here
    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    }
    catch (SAXException e) {
      System.err.println("Error: could not locate a parser.");
      return;
    }

    // turn on validation
    try {
      parser.setFeature(
       "http://xml.org/sax/features/validation", true);
      parser.setErrorHandler(new ValidityErrorReporter());
    }
    catch (SAXNotRecognizedException e) {
      System.err.println(
       "Installed XML parser cannot validate;"
       + " checking for well-formedness instead...");
    }
    catch (SAXNotSupportedException e) {
      System.err.println(
       "Cannot turn on validation here; "
       + "checking for well-formedness instead...");
    }

    if (args.length == 0) {
      System.out.println("Usage: java Validator URL1 URL2...");
    }

    // start parsing...
    for (int i = 0; i < args.length; i++) {

      // command line should offer URIs or file names
      try {
        parser.parse(args[i]);
        // If there are no well-formedness errors,
        // then no exception is thrown
        System.out.println(args[i] + " is well formed.");
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber()
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not check " + args[i]
         + " because of the IOException " + e);
      }

    }

  }

}
An XML syntax
Let you specify the contents of elements
Type derivation
Xerces validates against schemas if the document uses xsi:schemaLocation or xsi:noNamespaceSchemaLocation to point at a schema
Standard is not quite finished yet
<xsd:schema xmlns:xsd="http://www.w3.org/1999/XMLSchema">
<xsd:element name="SONG" type="SongType"/>
<xsd:complexType name="SongType">
<xsd:element name="TITLE" type="xsd:string"
minOccurs="1" maxOccurs="1"/>
<xsd:element name="COMPOSER" type="xsd:string"
minOccurs="1" maxOccurs="unbounded"/>
<xsd:element name="PRODUCER" type="xsd:string"
minOccurs="0" maxOccurs="unbounded"/>
<xsd:element name="PUBLISHER" type="xsd:string"
minOccurs="0" maxOccurs="1"/>
<xsd:element name="LENGTH" type="xsd:timeDuration"
minOccurs="1" maxOccurs="1"/>
<xsd:element name="YEAR" type="xsd:year"
minOccurs="1" maxOccurs="1"/>
<xsd:element name="ARTIST" type="xsd:string"
minOccurs="0" maxOccurs="unbounded"/>
</xsd:complexType>
</xsd:schema>
http://xml.org/sax/properties/lexical-handler
data type:
org.xml.sax.ext.LexicalHandler
description: An optional extension handler for items like comments that are not part of the information set and may be omitted.
access: read/write
http://xml.org/sax/properties/declaration-handler
data type:
org.xml.sax.ext.DeclHandler
description: An optional extension handler for ATTLIST and ELEMENT declarations (but not notations and unparsed entities).
access: read/write
http://xml.org/sax/properties/dom-node
data type: org.w3c.dom.Node
description: When parsing, the current DOM node being visited if this is a DOM iterator; when not parsing, the root DOM node for iteration.
access: (parsing) read-only; (not parsing) read/write
http://xml.org/sax/properties/xml-string
data type: java.lang.String
description: The literal string of characters that was the source for the current event.
access: read-only
adapted from SAX2 documentation by David Megginson
http://apache.org/xml/features/validation/dynamic
True: The parser will validate the document if a DTD is specified in a DOCTYPE declaration or using the appropriate schema attributes like xsi:noNamespaceSchemaLocation.
False: Validation is determined by the state of the http://xml.org/sax/features/validation feature.
Default is false
http://apache.org/xml/features/validation/warn-on-duplicate-attdef
True: Warn on duplicate attribute declaration.
False: Do not warn on duplicate attribute declaration.
Default: true
http://apache.org/xml/features/validation/warn-on-undeclared-elemdef
True: Warn if element referenced in content model is not declared.
False: Do not warn if element referenced in content model is not declared.
Default: true
http://apache.org/xml/features/allow-java-encodings
True: Allow Java encoding names like 8859_1 in XML and text declarations.
False: Do not allow Java encoding names in XML and text declarations.
Default: false
http://apache.org/xml/features/continue-after-fatal-error
True: Continue after fatal error.
False: Stops parse on first fatal error.
Default: false
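Because these features are parser-specific, a reader other than Xerces may reject them. The following is a minimal sketch of setting features defensively. The class name FeatureProbe and its helper methods are made up for illustration, and it assumes the reader is obtained through the JDK's JAXP SAXParserFactory rather than by constructing the Xerces SAXParser directly.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper class, not part of SAX2 or Xerces.
class FeatureProbe {

    // Assumption: use whatever SAX2 reader the JAXP factory provides.
    static XMLReader newReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    // Returns true if the reader accepted the setting, false if it
    // does not recognize the feature or cannot set it to this value.
    static boolean trySetFeature(XMLReader reader, String uri, boolean value) {
        try {
            reader.setFeature(uri, value);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = newReader();
        String[] features = {
            "http://xml.org/sax/features/validation",
            "http://apache.org/xml/features/validation/dynamic",
            "http://apache.org/xml/features/continue-after-fatal-error"
        };
        for (String uri : features) {
            System.out.println(uri + " accepted: "
                + trySetFeature(reader, uri, true));
        }
    }
}
```

Which of the Apache feature URIs are accepted depends on the parser behind the factory, so the program reports the result instead of assuming it.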
None for the SAXParser
The DOM parser has a couple
Extension handlers are non-required interfaces in the org.xml.sax.ext package.
To set the LexicalHandler for an XML reader, set the property http://xml.org/sax/properties/lexical-handler.
To set the DeclHandler for an XML reader, set the property http://xml.org/sax/properties/declaration-handler.
If the reader does not support the requested property, it will throw a SAXNotRecognizedException or a SAXNotSupportedException.
The startElement() method in ContentHandler receives as an argument an Attributes object containing all attributes of that element.
public void startElement(String namespaceURI, String localName,
 String qualifiedName, Attributes atts) throws SAXException
The Attributes interface:
package org.xml.sax;
public interface Attributes {
public int getLength();
/* Look up an attribute's Namespace URI by index.*/
public String getURI(int index);
public String getLocalName(int index);
public String getQName(int index);
public String getType(int index);
public String getValue(int index);
public int getIndex(String uri, String localPart);
public int getIndex(String qualifiedName);
public String getType(String uri, String localName);
public String getType(String qualifiedName);
public String getValue(String uri, String localName);
public String getValue(String qualifiedName);
}
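A minimal sketch of walking an Attributes object by index may help here. The class AttributeLister and its list() helper are hypothetical names. It assumes a namespace-aware reader from the JDK's JAXP factory and records each attribute as {namespace URI}local name=value.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of SAX2.
class AttributeLister extends DefaultHandler {

    final List<String> seen = new ArrayList<>();

    @Override
    public void startElement(String namespaceURI, String localName,
            String qualifiedName, Attributes atts) {
        // Index-based access; lookups by name are also available.
        for (int i = 0; i < atts.getLength(); i++) {
            seen.add("{" + atts.getURI(i) + "}" + atts.getLocalName(i)
                + "=" + atts.getValue(i));
        }
    }

    // Parses an in-memory document and returns the attributes found.
    static List<String> list(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed for URIs and local names
        XMLReader reader = factory.newSAXParser().getXMLReader();
        AttributeLister lister = new AttributeLister();
        reader.setContentHandler(lister);
        reader.parse(new InputSource(new StringReader(xml)));
        return lister.seen;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<PHOTO xmlns:xlink='http://www.w3.org/1999/xlink' "
            + "xlink:href='hotcop.jpg' WIDTH='100'/>";
        for (String s : list(xml)) {
            System.out.println(s);
        }
    }
}
```

An attribute with no prefix, such as WIDTH, is in no namespace, so its URI is the empty string.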
import org.xml.sax.*;
import org.apache.xerces.parsers.*;
import java.io.*;
import java.util.*;
import org.xml.sax.helpers.*;

public class XLinkSpider extends DefaultHandler {

  public static Enumeration listURIs(String systemId)
   throws SAXException, IOException {

    // set up the parser
    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    }
    catch (SAXException e) {
      try {
        parser = XMLReaderFactory.createXMLReader(
         "org.apache.xerces.parsers.SAXParser");
      }
      catch (SAXException e2) {
        System.err.println("Error: could not locate a parser.");
        return null;
      }
    }

    // Install the Content Handler
    XLinkSpider spider = new XLinkSpider();
    parser.setContentHandler(spider);
    parser.parse(systemId);
    return spider.uris.elements();
  }

  private Vector uris = new Vector();

  public void startElement(String namespaceURI, String localName,
   String rawName, Attributes atts) throws SAXException {
    String uri = atts.getValue("http://www.w3.org/1999/xlink", "href");
    if (uri != null) uris.addElement(uri);
  }

  public static void main(String[] args) {

    if (args.length == 0) {
      System.out.println("Usage: java XLinkSpider URL1 URL2...");
    }

    // start parsing...
    for (int i = 0; i < args.length; i++) {
      try {
        Enumeration uris = listURIs(args[i]);
        while (uris.hasMoreElements()) {
          String s = (String) uris.nextElement();
          System.out.println(s);
        }
      }
      catch (Exception e) {
        System.err.println(e);
        e.printStackTrace();
      }
    } // end for

  } // end main

} // end XLinkSpider
The EntityResolver allows you to substitute your own URI lookup scheme for external entities
Especially useful for entities that use URL and URI schemes not supported by Java's protocol handlers; e.g. jdbc: or isbn:
The EntityResolver interface:
package org.xml.sax;
import java.io.IOException;
public interface EntityResolver {
public InputSource resolveEntity (String publicId,
String systemId) throws SAXException, IOException;
}
import org.xml.sax.*;

public class RSSResolver implements EntityResolver {

  public InputSource resolveEntity(String publicId, String systemId) {

    // Either ID may be null, so compare from the constant side
    if ("-//Netscape Communications//DTD RSS 0.91//EN".equals(publicId)
     || "http://my.netscape.com/publish/formats/rss-0.91.dtd".equals(systemId)) {
      return new InputSource("http://metalab.unc.edu/xml/dtds/rss.dtd");
    }
    else {
      // use the default behaviour
      return null;
    }

  }

}
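A resolver does not have to redirect to another URL; it can hand the parser any InputSource. The following is a minimal sketch that serves a DTD from memory. The class InMemoryDtdResolver, the system ID http://example.invalid/song.dtd, and the DTD text are all made up for illustration, and the reader is assumed to come from the JDK's JAXP factory.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of SAX2.
class InMemoryDtdResolver implements EntityResolver {

    // Hypothetical system ID and DTD text, used only for this sketch.
    static final String DTD_ID = "http://example.invalid/song.dtd";
    static final String DTD_TEXT = "<!ENTITY label 'A and M Records'>";

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        if (DTD_ID.equals(systemId)) {
            InputSource local = new InputSource(new StringReader(DTD_TEXT));
            local.setSystemId(systemId);
            return local;
        }
        return null; // fall back to the parser's default lookup
    }

    // Parses the document with the resolver installed and returns its text.
    static String textOf(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(new InMemoryDtdResolver());
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE SONG SYSTEM \"" + DTD_ID + "\">"
            + "<SONG><PUBLISHER>&label;</PUBLISHER></SONG>";
        System.out.println(textOf(xml));
    }
}
```

No network access is needed for the DTD, because the resolver answers before the parser tries to open the URL.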
The DTDHandler interface covers those aspects of DTDs that a non-validating parser may care about and that are not handled by other interfaces:
Notation Declarations
Unparsed Entity Declarations
Attribute defaults are handled transparently by startElement() and the Attributes interface
Parsed entities are handled transparently by ContentHandler unless you install an EntityResolver
The DTDHandler interface:
package org.xml.sax;
public interface DTDHandler {
public void notationDecl(String name, String publicId, String systemId)
throws SAXException;
public void unparsedEntityDecl(String name, String publicId,
String systemId, String notationName) throws SAXException;
}
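A minimal sketch of a DTDHandler that simply records what it is told. NotationCatalog and its read() helper are hypothetical names, the document's system ID is a placeholder, and the reader is assumed to come from the JDK's JAXP factory.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; DefaultHandler already implements DTDHandler.
class NotationCatalog extends DefaultHandler {

    // notation name -> system ID as reported by the parser
    final Map<String, String> notations = new LinkedHashMap<>();
    // unparsed entity name -> notation name
    final Map<String, String> unparsedEntities = new LinkedHashMap<>();

    @Override
    public void notationDecl(String name, String publicId, String systemId) {
        // The parser may have made the system ID absolute before reporting it.
        notations.put(name, systemId);
    }

    @Override
    public void unparsedEntityDecl(String name, String publicId,
            String systemId, String notationName) {
        unparsedEntities.put(name, notationName);
    }

    static NotationCatalog read(String xml) throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        NotationCatalog catalog = new NotationCatalog();
        reader.setDTDHandler(catalog);
        InputSource input = new InputSource(new StringReader(xml));
        // Placeholder base URI; nothing is fetched from it.
        input.setSystemId("http://example.invalid/song.xml");
        reader.parse(input);
        return catalog;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE SONG ["
            + "<!NOTATION TXT SYSTEM \"text/plain\">"
            + "<!ENTITY lyrics SYSTEM \"lyrics.txt\" NDATA TXT>"
            + "]><SONG/>";
        NotationCatalog catalog = read(xml);
        System.out.println("Notations: " + catalog.notations.keySet());
        System.out.println("Unparsed entities: " + catalog.unparsedEntities);
    }
}
```

The unparsed entity itself is never loaded by the parser; the application decides what to do with it.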
Program to map unparsed entities with notation "text/plain" to CDATA sections
AttributeHandler will have to make actual replacements
Will finish with XMLFilter
import org.xml.sax.*;
import java.util.*;
import java.net.*;
import java.io.*;

public class TextEntityReplacer implements DTDHandler {

  /* This class stores the notation and entity declarations
     for a single document. It is not designed to be reused
     for multiple parses, though that would be a straightforward
     extension. The public and system IDs of the document
     being parsed are set in the constructor. */

  private URL systemID;
  private String publicID;

  public TextEntityReplacer(String publicID, String systemID)
   throws MalformedURLException {
    System.err.println("created");
    this.publicID = publicID;
    this.systemID = new URL(systemID);
  }

  // store all notations in a hashtable. We'll need them later
  private Hashtable notations = new Hashtable();

  // for the DTDHandler interface
  public void notationDecl(String name, String publicID, String systemID)
   throws SAXException {
    Notation n = new Notation(name, publicID, systemID);
    notations.put(name, n);
  }

  private class Notation {

    String name;
    String publicID;
    String systemID;

    Notation(String name, String publicID, String systemID) {
      this.name = name;
      this.publicID = publicID;
      this.systemID = systemID;
    }

  }

  // store all unparsed entities in a hashtable. We'll need them later
  private Hashtable unparsedEntities = new Hashtable();

  // for the DTDHandler interface
  public void unparsedEntityDecl(String name, String publicID,
   String systemID, String notationName) throws SAXException {
    UnparsedEntity e
     = new UnparsedEntity(name, publicID, systemID, notationName);
    unparsedEntities.put(name, e);
  }

  private class UnparsedEntity {

    String name;
    String publicID;
    String systemID;
    String notationName;

    UnparsedEntity(String name, String publicID, String systemID,
     String notationName) {
      this.name = name;
      this.notationName = notationName;
      this.publicID = publicID;
      this.systemID = systemID;
    }

  }

  public boolean isText(String notationName) {
    Object o = notations.get(notationName);
    if (o == null) return false;
    Notation n = (Notation) o;
    if (n.systemID.startsWith("text/")) return true;
    return false;
  }

  public String getText(String entityName) throws IOException {

    Object o = unparsedEntities.get(entityName);
    if (o == null) return "";
    UnparsedEntity entity = (UnparsedEntity) o;
    if (!isText(entity.notationName)) {
      return " binary data "; // could throw an exception instead
    }

    URL source;
    try {
      source = new URL(systemID, entity.systemID);
    }
    catch (Exception e) {
      return " unresolvable entity "; // could throw an exception instead
    }

    // I'm not really handling character encodings here.
    // A more detailed look at the MIME type would allow that.
    Reader in = new BufferedReader(
     new InputStreamReader(source.openStream()));
    StringBuffer result = new StringBuffer();
    int c;
    while ((c = in.read()) != -1) {
      // Is this necessary, or will the parser escape
      // the string automatically?
      /* switch (c) {
           case '<':
             result.append("&lt;");
             break;
           case '>':
             result.append("&gt;");
             break;
           case '"':
             result.append("&quot;");
             break;
           case '\'':
             result.append("&apos;");
             break;
           case '&':
             result.append("&amp;");
             break;
           default:
             result.append((char) c);
         } */
      result.append((char) c);
    }
    in.close();
    return result.toString();

  }

}
The optional DeclHandler interface covers those aspects of DTDs only a validating parser cares about:
Element declarations
Attribute declarations
Internal entity declarations
External entity declarations
An optional extension that not all parsers (particularly non-validating parsers) support
To set the DeclHandler for a parser, set the http://xml.org/sax/properties/declaration-handler property.
A SAXNotRecognizedException or SAXNotSupportedException will be thrown if the parser doesn't support DeclHandler
package org.xml.sax.ext;
import org.xml.sax.SAXException;
public interface DeclHandler {
public void elementDecl(String name, String model)
throws SAXException;
public void attributeDecl(String elementName, String attributeName,
String type, String defaultValue, String value) throws SAXException;
public void internalEntityDecl(String name, String value)
throws SAXException;
public void externalEntityDecl(String name, String publicId,
String systemId) throws SAXException;
}
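A minimal sketch of a DeclHandler that logs the declarations in an internal DTD subset. DeclarationLogger and declarationsIn() are hypothetical names. It assumes the JDK's built-in parser, obtained through JAXP, which supports the declaration-handler property; other parsers may throw instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DeclHandler;

// Hypothetical example class, not part of SAX2.
class DeclarationLogger implements DeclHandler {

    final List<String> log = new ArrayList<>();

    @Override
    public void elementDecl(String name, String model) {
        log.add("ELEMENT " + name + " " + model);
    }

    @Override
    public void attributeDecl(String elementName, String attributeName,
            String type, String defaultValue, String value) {
        log.add("ATTLIST " + elementName + " " + attributeName + " " + type);
    }

    @Override
    public void internalEntityDecl(String name, String value) {
        log.add("ENTITY " + name + " = " + value);
    }

    @Override
    public void externalEntityDecl(String name, String publicId,
            String systemId) {
        log.add("ENTITY " + name + " SYSTEM " + systemId);
    }

    static List<String> declarationsIn(String xml) throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        DeclarationLogger logger = new DeclarationLogger();
        // Throws SAXNotRecognizedException or SAXNotSupportedException
        // if the parser has no support for DeclHandler.
        reader.setProperty(
            "http://xml.org/sax/properties/declaration-handler", logger);
        reader.parse(new InputSource(new StringReader(xml)));
        return logger.log;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE SONG ["
            + "<!ELEMENT SONG (#PCDATA)>"
            + "<!ATTLIST SONG YEAR CDATA #IMPLIED>"
            + "<!ENTITY label \"Polygram\">"
            + "]><SONG/>";
        for (String line : declarationsIn(xml)) {
            System.out.println(line);
        }
    }
}
```

The exact spelling of the content model string passed to elementDecl() is up to the parser, so the sketch only echoes it.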
The LexicalHandler interface reports:
Comments
CDATA sections
Document type declaration
Entities
An optional extension that not all parsers support
To set the LexicalHandler for a parser, set the http://xml.org/sax/properties/lexical-handler property.
A SAXNotRecognizedException or SAXNotSupportedException will be thrown if the parser doesn't report lexical events
package org.xml.sax.ext;
import org.xml.sax.SAXException;
public interface LexicalHandler {
public void startDTD(String name, String publicId, String systemId)
throws SAXException;
public void endDTD() throws SAXException;
public void startEntity(String name) throws SAXException;
public void endEntity(String name) throws SAXException;
public void startCDATA() throws SAXException;
public void endCDATA() throws SAXException;
public void comment (char[] text, int start, int length)
throws SAXException;
}
import org.xml.sax.*;
import org.xml.sax.ext.*;
import org.xml.sax.helpers.*;
import java.io.IOException;

public class SAXCommentReader implements LexicalHandler {

  public void startDTD(String name, String publicId, String systemId)
   throws SAXException {}
  public void endDTD() throws SAXException {}
  public void startEntity(String name) throws SAXException {}
  public void endEntity(String name) throws SAXException {}
  public void startCDATA() throws SAXException {}
  public void endCDATA() throws SAXException {}

  public void comment (char[] text, int start, int length)
   throws SAXException {
    String comment = new String(text, start, length);
    System.out.println(comment);
  }

  public static void main(String[] args) {

    // set up the parser
    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    }
    catch (SAXException e) {
      try {
        parser = XMLReaderFactory.createXMLReader(
         "org.apache.xerces.parsers.SAXParser");
      }
      catch (SAXException e2) {
        System.err.println("Error: could not locate a parser.");
        return;
      }
    }

    // turn on comment handling
    try {
      parser.setProperty("http://xml.org/sax/properties/lexical-handler",
       new SAXCommentReader());
    }
    catch (SAXNotRecognizedException e) {
      System.err.println(
       "Installed XML parser does not provide lexical events...");
      return;
    }
    catch (SAXNotSupportedException e) {
      System.err.println(
       "Cannot turn on comment processing here");
      return;
    }

    if (args.length == 0) {
      System.out.println("Usage: java SAXCommentReader URL1 URL2...");
    }

    // start parsing...
    for (int i = 0; i < args.length; i++) {
      try {
        parser.parse(args[i]);
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber()
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not check " + args[i]
         + " because of the IOException " + e);
      }
    }

  }

}
Tells the callback class where in the document (line number, column number) a particular event took place
Optional but recommended
Parsers give the callback class a Locator by passing it to the setDocumentLocator() method of ContentHandler
The Locator interface:
package org.xml.sax;
public interface Locator {
public String getPublicId();
public String getSystemId();
public int getLineNumber();
public int getColumnNumber();
}
import org.xml.sax.*;
import org.apache.xerces.parsers.*;
import java.io.*;

public class LocationReporter implements ContentHandler {

  Locator locator = null;

  public void setDocumentLocator(Locator locator) {
    this.locator = locator;
  }

  private String reportPosition() {

    if (locator != null) {
      String publicID = locator.getPublicId();
      String systemID = locator.getSystemId();
      int line = locator.getLineNumber();
      int column = locator.getColumnNumber();

      String name;
      if (publicID != null) name = publicID;
      else name = systemID;

      return " in " + name + " at line " + line
       + ", column " + column;
    }
    return "";

  }

  public void startDocument() throws SAXException {
    System.out.println("Document started" + reportPosition());
  }

  public void endDocument() throws SAXException {
    System.out.println("Document ended" + reportPosition());
  }

  public void characters(char[] text, int start, int length)
   throws SAXException {
    System.out.println("Got some characters" + reportPosition());
  }

  public void ignorableWhitespace(char[] text, int start, int length)
   throws SAXException {
    System.out.println("Got some ignorable white space" + reportPosition());
  }

  public void processingInstruction(String target, String data)
   throws SAXException {
    System.out.println("Got a processing instruction" + reportPosition());
  }

  // Changed methods for SAX2
  public void startElement(String namespaceURI, String localName,
   String rawName, Attributes atts) throws SAXException {
    System.out.println("Element " + rawName + " started" + reportPosition());
  }

  public void endElement(String namespaceURI, String localName,
   String rawName) throws SAXException {
    System.out.println("Element " + rawName + " ended" + reportPosition());
  }

  // new methods for SAX2
  public void startPrefixMapping(String prefix, String uri)
   throws SAXException {
    System.out.println("Started mapping prefix " + prefix + " to URI "
     + uri + reportPosition());
  }

  public void endPrefixMapping(String prefix) throws SAXException {
    System.out.println("Stopped mapping prefix "
     + prefix + reportPosition());
  }

  public void skippedEntity(String name) throws SAXException {
    System.out.println("Skipped entity " + name + reportPosition());
  }

  // Could easily have put main() method in a separate class
  public static void main(String[] args) {

    XMLReader parser = new SAXParser();

    if (args.length == 0) {
      System.out.println(
       "Usage: java LocationReporter URL1 URL2...");
    }

    // Install the Content Handler
    parser.setContentHandler(new LocationReporter());

    // start parsing...
    for (int i = 0; i < args.length; i++) {

      // command line should offer URIs or file names
      try {
        parser.parse(args[i]);
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber()
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not report on " + args[i]
         + " because of the IOException " + e);
      }

    }

  }

}
Implements the main interfaces with do-nothing methods
EntityResolver
DTDHandler
ContentHandler
ErrorHandler
Replaces HandlerBase from SAX1
Mostly for internal parser use
Occasionally useful for tasks like finding out whether a document contains any XLinks
The NamespaceSupport class:
package org.xml.sax.helpers;
public class NamespaceSupport {
public final static String XMLNS = "http://www.w3.org/XML/1998/namespace";
public NamespaceSupport();
public void reset();
public void pushContext();
public void popContext();
public boolean declarePrefix(String prefix, String uri);
public String getURI(String prefix);
public Enumeration getPrefixes();
public Enumeration getDeclaredPrefixes();
public String[] processName(String qualifiedName, String[] parts,
boolean isAttribute);
}
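A minimal sketch of using NamespaceSupport on its own, outside a parser. NamespaceSupportDemo and resolve() are hypothetical names. The prefix bindings are the ones from the SONG example document.

```java
import org.xml.sax.helpers.NamespaceSupport;

// Hypothetical example class, not part of SAX2.
class NamespaceSupportDemo {

    // Splits a qualified name into { namespace URI, local name, qualified name }.
    // Returns null if the name uses a prefix that has not been declared.
    static String[] resolve(String qualifiedName, boolean isAttribute) {
        NamespaceSupport support = new NamespaceSupport();
        support.pushContext(); // as if entering the SONG element
        support.declarePrefix("", "http://metalab.unc.edu/xml/namespace/song");
        support.declarePrefix("xlink", "http://www.w3.org/1999/xlink");
        String[] parts =
            support.processName(qualifiedName, new String[3], isAttribute);
        support.popContext(); // as if leaving the SONG element
        return parts;
    }

    public static void main(String[] args) {
        String[] title = resolve("TITLE", false);
        System.out.println(title[0] + " " + title[1]);
        String[] href = resolve("xlink:href", true);
        System.out.println(href[0] + " " + href[1]);
        String[] width = resolve("WIDTH", true);
        System.out.println("[" + width[0] + "] " + width[1]);
    }
}
```

The isAttribute flag matters because the default namespace applies to unprefixed element names but not to unprefixed attribute names.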
The XMLFilter interface is like an XML reader, "except that it obtains its events from another XML reader rather than a primary source like an XML document or database. Filters can modify a stream of events as they pass on to the final application."
The parent is the parser it gets the data from.
Only two methods are added to those inherited from XMLReader:
public void setParent(XMLReader parent)
public XMLReader getParent()
XMLFilterImpl is a default filter that simply passes along all events it receives:
public class XMLFilterImpl implements XMLFilter, EntityResolver, DTDHandler,
ContentHandler, ErrorHandler
Only new methods are constructors:
public XMLFilterImpl()
public XMLFilterImpl(XMLReader parent)
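A minimal sketch of a filter built on XMLFilterImpl: it removes one kind of element, with everything inside it, and passes all other events through. ElementDropper and survivingElements() are hypothetical names, and the parent reader is assumed to be a namespace-aware reader from the JDK's JAXP factory.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical example class, not part of SAX2.
class ElementDropper extends XMLFilterImpl {

    private final String dropped; // local name of the element to remove
    private int depth = 0;        // greater than 0 while inside a dropped element

    ElementDropper(XMLReader parent, String droppedLocalName) {
        super(parent);
        this.dropped = droppedLocalName;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) throws SAXException {
        if (depth > 0 || localName.equals(dropped)) {
            depth++; // swallow this element and its descendants
            return;
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (depth > 0) {
            depth--;
            return;
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        if (depth == 0) super.characters(ch, start, length);
    }

    // Returns the local names of the elements the final handler still sees.
    static String survivingElements(String xml, String dropped)
            throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader filter = new ElementDropper(
            factory.newSAXParser().getXMLReader(), dropped);
        final StringBuilder names = new StringBuilder();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                    String qName, Attributes atts) {
                names.append(localName).append(' ');
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return names.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<SONG><TITLE>Hot Cop</TITLE>"
            + "<PRODUCER>Jacques Morali</PRODUCER><YEAR>1978</YEAR></SONG>";
        System.out.println(survivingElements(xml, "PRODUCER"));
    }
}
```

The client installs its handlers on the filter exactly as it would on a parser; calling parse() on the filter is what connects it to its parent.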
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.util.*;
import java.io.IOException;

public class UnparsedTextFilter extends XMLFilterImpl {

  private TextEntityReplacer replacer;

  public UnparsedTextFilter(XMLReader parent) {
    super(parent);
    System.err.println("created UnparsedTextFilter");
  }

  public void parse(InputSource input) throws IOException, SAXException {
    System.err.println("parsing");
    replacer = new TextEntityReplacer(input.getPublicId(),
     input.getSystemId());
    this.setDTDHandler(replacer);
    // hand the document to the parent; without this nothing is parsed
    super.parse(input);
  }

  // The other parse() method just calls this one
  public void parse(String systemId) throws IOException, SAXException {
    parse(new InputSource(systemId));
  }

  public void startElement(String uri, String localName, String rawName,
   Attributes attributes) throws SAXException {

    Vector extraText = new Vector();

    // Are there any unparsed entities in the attributes?
    for (int i = 0; i < attributes.getLength(); i++) {
      if (attributes.getType(i).equals("ENTITY")) {
        try {
          System.out.println("replacing");
          String s = replacer.getText(attributes.getValue(i));
          if (s != null) extraText.addElement(s);
        }
        catch (IOException e) {
          System.err.println(e);
        }
      }
    }

    super.startElement(uri, localName, rawName, attributes);

    // Now spew out the values of the unparsed entities:
    Enumeration e = extraText.elements();
    while (e.hasMoreElements()) {
      Object o = e.nextElement();
      String s = (String) o;
      super.characters(s.toCharArray(), 0, s.length());
    }

  }

}
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.util.*;
import java.io.IOException;
import org.apache.xerces.parsers.*;
import org.apache.xml.serialize.*;

public class TextMerger {

  public static void main(String[] args) {

    System.err.println("starting");
    XMLReader parser = new UnparsedTextFilter(new SAXParser());

    // essentially a pretty printer
    XMLSerializer printer
     = new XMLSerializer(System.out, new OutputFormat());
    parser.setContentHandler(printer);

    for (int i = 0; i < args.length; i++) {
      try {
        parser.parse(args[i]);
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber()
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not report on " + args[i]
         + " because of the IOException " + e);
      }
    }

  }

}
Encapsulates access to data so that it looks the same whether it's coming from a
URL
file
stream
reader
database
something else
Used in SAX1 and SAX2
Allows the source to be changed
package org.xml.sax;
import java.io.*;
public class InputSource {
public InputSource()
public InputSource(String systemID)
public InputSource(InputStream in)
public InputSource(Reader in)
public void setPublicId(String publicID)
public String getPublicId()
public void setSystemId(String systemID)
public String getSystemId()
public void setByteStream(InputStream byteStream)
public InputStream getByteStream()
public void setEncoding(String encoding)
public String getEncoding()
public void setCharacterStream(Reader characterStream)
public Reader getCharacterStream()
}
import org.xml.sax.*;
import java.io.*;
import java.net.*;
import java.util.zip.*;
...
try {
  URL u = new URL("http://metalab.unc.edu/xml/examples/1998validstats.xml.gz");
  InputStream raw = u.openStream();
  // wrap the raw stream, not the InputSource that is declared below
  InputStream decompressed = new GZIPInputStream(raw);
  InputSource in = new InputSource(decompressed);
// read the document...
}
catch (IOException e) {
System.err.println(e);
}
catch (SAXException e) {
System.err.println(e);
}
ELEMENT, ATTLIST, ENTITY declarations are only optionally reported
Schema declarations aren't reported at all
Lexical events are only optionally reported
SAX2 can be configured on top of a lot of different parsers with different capabilities. What the parser does is more important than what SAX2 does.
You do not always have all the information you need at the time of a given callback
You may need to store information in various data structures (stacks, queues, vectors, arrays, etc.) and act on it at a later point
For example, the characters() method is not guaranteed to give you the maximum number of contiguous characters. It may split a single run of characters over multiple method calls.
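A minimal sketch of the usual workaround: collect text in a buffer and act on it only when the element ends. TextAccumulator and textOf() are hypothetical names, and the reader is assumed to be a namespace-aware reader from the JDK's JAXP factory.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of SAX2.
class TextAccumulator extends DefaultHandler {

    private final String target; // local name of the element of interest
    private final StringBuilder buffer = new StringBuilder();
    private boolean collecting = false;
    final List<String> values = new ArrayList<>();

    TextAccumulator(String targetLocalName) {
        this.target = targetLocalName;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) {
        if (localName.equals(target)) {
            buffer.setLength(0); // start a fresh run of text
            collecting = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one run of text.
        if (collecting) buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (localName.equals(target)) {
            values.add(buffer.toString().trim()); // only now is the text complete
            collecting = false;
        }
    }

    static List<String> textOf(String xml, String targetLocalName)
            throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        TextAccumulator handler = new TextAccumulator(targetLocalName);
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.values;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<SONG><PUBLISHER> A &amp; M Records </PUBLISHER>"
            + "<ARTIST>Village People</ARTIST></SONG>";
        System.out.println(textOf(xml, "PUBLISHER"));
    }
}
```

The sketch assumes the target element does not nest inside itself; a stack of buffers would be needed for that case.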
This presentation: http://metalab.unc.edu/xml/slides/sd2000east/sax