SAX uses the Observer design pattern to tell client applications what’s in a document.[1] Java developers are most familiar with this pattern from the event architecture of the AWT and Swing. In that context, the client programmer implements an interface such as MouseListener that receives events through well-known methods. Then the programmer registers the MouseListener object with a component such as a Button using the addMouseListener() method. When the end user moves or clicks the mouse in the button’s area, the button invokes a method in the registered MouseListener object. In this example, the Button class plays the role of the Subject, the MouseListener interface plays the role of the Observer, and the client-defined implementation of that interface plays the role of the ConcreteObserver.
SAX works in a very similar way. However in SAX, XMLReader plays the role of Subject and the org.xml.sax.ContentHandler interface plays the role of Observer. The biggest difference between the AWT and SAX is that SAX does not allow more than one listener to be registered with each XMLReader. Otherwise, the pattern is exactly the same.
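The one-listener restriction is easy to see in a small experiment. The following is only a sketch: SingleHandlerDemo, CountingHandler, and parseWithTwoHandlers() are names invented for this illustration, the parser is obtained through JAXP's SAXParserFactory rather than the factory used later in this chapter, and the handler extends the DefaultHandler convenience class (covered at the end of this section) to stay short.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class SingleHandlerDemo {

    // Hypothetical observer that only counts start-tags.
    static class CountingHandler extends DefaultHandler {
        int elements = 0;

        @Override
        public void startElement(String namespaceURI, String localName,
                String qualifiedName, Attributes atts) {
            elements++;
        }
    }

    // Registers two handlers in turn and returns how many elements each saw.
    static int[] parseWithTwoHandlers(String xml) throws Exception {
        XMLReader parser =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        CountingHandler first = new CountingHandler();
        CountingHandler second = new CountingHandler();
        parser.setContentHandler(first);
        // This call replaces the first handler; it does not add a second one.
        parser.setContentHandler(second);
        parser.parse(new InputSource(new StringReader(xml)));
        return new int[] { first.elements, second.elements };
    }

    public static void main(String[] args) throws Exception {
        int[] counts = parseWithTwoHandlers("<a><b/><c/></a>");
        System.out.println("first saw " + counts[0]
            + " elements, second saw " + counts[1]);
    }
}
```

Only the handler registered last receives callbacks. If you need several observers, one common workaround is to write a single handler that forwards each call to the others.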
Example 6.2 shows the SAX ContentHandler interface. Almost any significant SAX program you write is going to use this interface in one form or another.
Example 6.2. The SAX ContentHandler interface
package org.xml.sax;

public interface ContentHandler {

  public void setDocumentLocator(Locator locator);
  public void startDocument() throws SAXException;
  public void endDocument() throws SAXException;
  public void startPrefixMapping(String prefix, String uri)
   throws SAXException;
  public void endPrefixMapping(String prefix)
   throws SAXException;
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) throws SAXException;
  public void endElement(String namespaceURI, String localName,
   String qualifiedName) throws SAXException;
  public void characters(char[] text, int start, int length)
   throws SAXException;
  public void ignorableWhitespace(char[] text, int start,
   int length) throws SAXException;
  public void processingInstruction(String target, String data)
   throws SAXException;
  public void skippedEntity(String name)
   throws SAXException;

}
The ContentHandler interface declares eleven methods. As the parser (that is, the XMLReader) reads a document, it invokes the methods in this interface. When the parser reads a start-tag, it calls the startElement() method. When the parser reads some text content, it calls the characters() method. When the parser reads an end-tag, it calls the endElement() method. When the parser reads a processing instruction, it calls the processingInstruction() method. The details of what the parser has read, such as the name and attributes of a start-tag, are passed as arguments to the method.
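To watch these callbacks arrive, you can log each one along with its arguments. This is a minimal sketch rather than one of the book's numbered examples: the CallbackTrace class, its trace() helper, and the sample document are all invented here, the parser comes from JAXP's SAXParserFactory, and the class extends the DefaultHandler convenience class (described at the end of this section) so that only four methods need to be written.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class CallbackTrace extends DefaultHandler {

    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String namespaceURI, String localName,
            String qualifiedName, Attributes atts) {
        // The element name and its attributes arrive as arguments.
        StringBuilder line = new StringBuilder("startElement " + qualifiedName);
        for (int i = 0; i < atts.getLength(); i++) {
            line.append(' ').append(atts.getQName(i))
                .append('=').append(atts.getValue(i));
        }
        events.add(line.toString());
    }

    @Override
    public void characters(char[] text, int start, int length) {
        events.add("characters " + new String(text, start, length));
    }

    @Override
    public void endElement(String namespaceURI, String localName,
            String qualifiedName) {
        events.add("endElement " + qualifiedName);
    }

    @Override
    public void processingInstruction(String target, String data) {
        events.add("processingInstruction " + target + " " + data);
    }

    // Parses an in-memory document and returns the callbacks in arrival order.
    static List<String> trace(String xml) throws Exception {
        XMLReader parser =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        CallbackTrace handler = new CallbackTrace();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><?target some data?><p id=\"1\">Hi</p></doc>";
        for (String event : trace(xml)) {
            System.out.println(event);
        }
    }
}
```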
Order is maintained throughout. That is, the parser always invokes these methods in the same order it sees items in the document. In many cases, the parser calls back to these methods immediately. For example, the parser calls the startElement() method as soon as it has read a complete start-tag. It will not read past that start-tag until the startElement() method has returned. This means you’ll generally receive some content from invalid and even malformed documents before the parser detects the error. Consequently you should be careful not to take irreversible actions until you’ve reached the end of a document.
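The following sketch illustrates the point about malformed documents. It feeds the parser a document whose second item element is never closed and records which start-tags were reported before the parser gave up. EarlyEvents and parseAndLog() are names made up for this illustration, the parser comes from JAXP's SAXParserFactory, and the exact wording of the error message depends on the parser you use.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class EarlyEvents extends DefaultHandler {

    final List<String> log = new ArrayList<>();

    @Override
    public void startElement(String namespaceURI, String localName,
            String qualifiedName, Attributes atts) {
        log.add("start " + qualifiedName);
    }

    // Parses the document and returns the start-tags seen, followed by
    // either "parsed cleanly" or the error that stopped the parse.
    static List<String> parseAndLog(String xml) throws Exception {
        XMLReader parser =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        EarlyEvents handler = new EarlyEvents();
        parser.setContentHandler(handler);
        // DefaultHandler also implements ErrorHandler; its fatalError()
        // rethrows the exception instead of printing to System.err.
        parser.setErrorHandler(handler);
        try {
            parser.parse(new InputSource(new StringReader(xml)));
            handler.log.add("parsed cleanly");
        } catch (SAXException e) {
            handler.log.add("error " + e.getMessage());
        }
        return handler.log;
    }

    public static void main(String[] args) throws Exception {
        // The second <item> is never closed, so this is malformed.
        String xml = "<order><item>clock</item><item>bell</order>";
        for (String entry : parseAndLog(xml)) {
            System.out.println(entry);
        }
    }
}
```

By the time the error entry is logged, the handler has already been told about the order element and its items. A handler that had, say, charged a credit card in startElement() would have no way to take that back.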
A concrete example should help make this clearer. I’m going to write a very simple program that extracts all the text content from an XML document while stripping out all the tags, comments, and processing instructions. This will be divided into two parts, a class that implements ContentHandler and a class that feeds the document into the parser.
Example 6.3, TextExtractor, is the class that implements ContentHandler. It has to provide all eleven methods declared in ContentHandler. However, the only one that’s actually needed is characters(). The other ten are do-nothing methods. They have empty method bodies, and nothing happens when the parser invokes them.
Example 6.3. A SAX ContentHandler that writes all #PCDATA onto a java.io.Writer
import org.xml.sax.*;
import java.io.*;

public class TextExtractor implements ContentHandler {

  private Writer out;

  public TextExtractor(Writer out) {
    this.out = out;
  }

  public void characters(char[] text, int start, int length)
   throws SAXException {

    try {
      out.write(text, start, length);
    }
    catch (IOException e) {
      throw new SAXException(e);
    }

  }

  // do-nothing methods
  public void setDocumentLocator(Locator locator) {}
  public void startDocument() {}
  public void endDocument() {}
  public void startPrefixMapping(String prefix, String uri) {}
  public void endPrefixMapping(String prefix) {}
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {}
  public void endElement(String namespaceURI, String localName,
   String qualifiedName) {}
  public void ignorableWhitespace(char[] text, int start,
   int length) throws SAXException {}
  public void processingInstruction(String target, String data){}
  public void skippedEntity(String name) {}

} // end TextExtractor
Besides the eleven methods declared in ContentHandler, TextExtractor has a constructor and an out field. The constructor sets this field to the Writer on which the parsed text will be output. You can always add as many additional methods, fields, and constructors as you need. You’re not limited to just those declared in the interface.
All the real work of this class happens inside characters(). When the parser reads content between tags, it passes this text to the characters() method inside an array of chars. The index of the first character of the text inside that array is given by the start argument. The number of characters is given by the length argument. In this class, the characters() method writes the sub-array of text from start to start+length onto the Writer stored in the out field.
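The role of start and length may be easier to see in a variant that collects the text in memory instead of writing it out. This is just a sketch: CharactersDemo and extractText() are hypothetical names, the parser is obtained from JAXP's SAXParserFactory, and the class extends DefaultHandler (described at the end of this section) so that only characters() needs a body.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class CharactersDemo extends DefaultHandler {

    private final StringBuilder collected = new StringBuilder();

    @Override
    public void characters(char[] text, int start, int length) {
        // Only text[start] through text[start + length - 1] belong to
        // this call. The rest of the array is the parser's business,
        // so something like new String(text) would be a bug here.
        collected.append(text, start, length);
    }

    // Returns all the text content of an in-memory document.
    static String extractText(String xml) throws Exception {
        XMLReader parser =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        CharactersDemo handler = new CharactersDemo();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.collected.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extractText("<a>Hello <b>World</b></a>"));
    }
}
```

Because each call appends only its own slice, the result is the same no matter how the parser chooses to divide the text among calls to characters().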
The characters() method in this class invokes the write() method in java.io.Writer. It happens that the write() method is declared to throw an IOException. The ContentHandler interface does not declare that characters() throws IOException. Therefore this exception must be caught. However, rather than simply ignoring it or printing a pointless message on System.err, we can wrap the IOException inside SAXException, which characters() is declared to throw, and then throw that exception. This signals the parser that something went wrong, and the parser will pass the exception along to the client application. If the client application wants to know what originally went wrong, it can find out by invoking SAXException’s getException() method.
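Here is roughly what that looks like from the client application's side. The sketch uses a Writer that always fails so there is something to catch. WrappedExceptionDemo, FailingWriter, WritingHandler, and findCause() are all invented for this illustration, and the parser is obtained through JAXP's SAXParserFactory.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class WrappedExceptionDemo {

    // Hypothetical Writer that always fails, standing in for a full disk.
    static class FailingWriter extends Writer {
        @Override
        public void write(char[] buffer, int offset, int length)
                throws IOException {
            throw new IOException("disk full");
        }
        @Override public void flush() {}
        @Override public void close() {}
    }

    // Same wrapping technique as the characters() method in TextExtractor.
    static class WritingHandler extends DefaultHandler {
        private final Writer out;

        WritingHandler(Writer out) {
            this.out = out;
        }

        @Override
        public void characters(char[] text, int start, int length)
                throws SAXException {
            try {
                out.write(text, start, length);
            } catch (IOException e) {
                throw new SAXException(e);
            }
        }
    }

    // Returns the exception wrapped inside the SAXException,
    // or null if the parse finished without one.
    static Exception findCause(String xml) throws Exception {
        XMLReader parser =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        parser.setContentHandler(new WritingHandler(new FailingWriter()));
        try {
            parser.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXException e) {
            return e.getException();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(findCause("<a>some text</a>"));
    }
}
```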
In contrast, none of the do-nothing methods such as startElement() and processingInstruction() will ever throw any exceptions. Therefore, they are not declared to throw SAXException even though ContentHandler would support this declaration. There’s no need to clutter up the code with unnecessary throws clauses, nor is it good programming practice to advertise a possible exception in the method signature when you know that exception will never occur.
By itself the TextExtractor class does nothing. There’s no code in the class to actually invoke any of the methods or parse a document. Although code to do this could be placed in a main() method in TextExtractor, I prefer to place it in a class of its own called ExtractorDriver which is shown in Example 6.4.
Example 6.4. The driver method for the text extractor program
import org.xml.sax.*;
import org.xml.sax.helpers.XMLReaderFactory;
import java.io.*;

public class ExtractorDriver {

  public static void main(String[] args) {

    if (args.length <= 0) {
      System.out.println(
        "Usage: java ExtractorDriver url"
      );
      return;
    }

    try {
      XMLReader parser = XMLReaderFactory.createXMLReader();

      // Since this just writes onto the console, it's best
      // to use the system default encoding, which is what
      // we get by not specifying an explicit encoding here.
      Writer out = new OutputStreamWriter(System.out);
      ContentHandler handler = new TextExtractor(out);

      parser.setContentHandler(handler);

      parser.parse(args[0]);

      out.flush();
    }
    catch (Exception e) {
      System.err.println(e);
    }

  }

}
The main() method in this class performs the following steps:
Build an instance of XMLReader using the XMLReaderFactory.createXMLReader() method.
Construct a new TextExtractor object.
Pass this object to the setContentHandler() method of the XMLReader.
Pass the URL of the document you want to parse (read from the command line) to the XMLReader’s parse() method.
One thing to note: there’s still no code that actually invokes the characters() or any other method in the TextExtractor class! This is for the same reason that you never see any code to invoke actionPerformed() or mouseClicked() when writing GUI programs in Java. The code that actually calls these methods is hidden deep inside the class library. You rarely need to concern yourself with it directly. Here the relevant code that calls characters() is hiding somewhere inside the parser-specific implementation of the XMLReader interface.
Let’s suppose you run this program over the original XML order document, Example 1.2 from Chapter 1. The results look like this:
C:\>java ExtractorDriver order.xml
Chez Fred Birdsong Clock 244 12 21.95 135 Airline Highway Narragansett RI 02882 263.40 18.44 8.95 290.79
The text of the original document, including white space, has been preserved. However, the markup has all been stripped. This is exactly what we asked for.
In the next few sections, we’ll explore the individual methods of the ContentHandler interface and their behavior in more detail.
TextExtractor only really used one of the eleven methods declared in ContentHandler. The other ten methods were all do-nothing methods with empty bodies. In fact, few SAX programs actually use all eleven methods. Most of the time about half suffice. To take advantage of this, SAX includes the org.xml.sax.helpers.DefaultHandler convenience class that implements the ContentHandler interface (and several other callback interfaces discussed in upcoming chapters) with do-nothing methods:
public class DefaultHandler implements ContentHandler, DTDHandler, EntityResolver, ErrorHandler
Instead of implementing ContentHandler directly and cluttering up your code with irrelevant methods, you can extend DefaultHandler. Then you only have to override those methods you actually care about, not all eleven.
For example, if TextExtractor were built on top of DefaultHandler, it would be the smaller and simpler class shown in Example 6.5.
Example 6.5. A subclass of DefaultHandler that writes all #PCDATA onto a java.io.Writer
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import java.io.*;

public class TextExtractor extends DefaultHandler {

  private Writer out;

  public TextExtractor(Writer out) {
    this.out = out;
  }

  public void characters(char[] text, int start, int length)
   throws SAXException {

    try {
      out.write(text, start, length);
    }
    catch (IOException e) {
      throw new SAXException(e);
    }

  }

}
Programs in this book use content handlers that implement ContentHandler directly and content handlers that extend DefaultHandler, mostly depending on which subjectively feels more natural to the problem at hand. You should feel free to use whichever variation you prefer.
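As one more illustration of overriding only what you need, here is a sketch of a handler that uses three of the eleven methods to pull out the text of particular elements. It is not one of the book's numbered examples: the NameCollector class, its collect() helper, and the name element it looks for are assumptions made for this illustration, and the parser is obtained through JAXP's SAXParserFactory.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class NameCollector extends DefaultHandler {

    private final List<String> names = new ArrayList<>();
    // Non-null only while the parser is inside a <name> element.
    private StringBuilder current = null;

    @Override
    public void startElement(String namespaceURI, String localName,
            String qualifiedName, Attributes atts) {
        if ("name".equals(qualifiedName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] text, int start, int length) {
        // Text outside <name> elements is ignored.
        if (current != null) {
            current.append(text, start, length);
        }
    }

    @Override
    public void endElement(String namespaceURI, String localName,
            String qualifiedName) {
        if ("name".equals(qualifiedName)) {
            names.add(current.toString());
            current = null;
        }
    }

    // Returns the text of every <name> element in an in-memory document.
    static List<String> collect(String xml) throws Exception {
        XMLReader parser =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        NameCollector handler = new NameCollector();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<order><name>Clock</name><qty>12</qty>"
            + "<name>Bell</name></order>";
        System.out.println(collect(xml));
    }
}
```

The other eight methods are inherited from DefaultHandler and do nothing, which is exactly what this program wants from them.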
[1] Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides, Addison-Wesley, 1995, pp. 293-303
Copyright 2001, 2002 Elliotte Rusty Harold | elharo@metalab.unc.edu | Last Modified May 26, 2002 |