XOM makes XML Easier

Elliotte Rusty Harold

Software Development 2004 West

Thursday, March 18, 2004

elharo@metalab.unc.edu

http://www.cafeconleche.org/

Outline

What's wrong with existing APIs
Design Principles
XOM Basics
Cool Stuff!

A few opinions

XML was, as has been fretted over before, ugly, hard, and boring to code with. Not any more :). XOM rocks! I'm using it in all my projects now.
Keep it up!

--Patrick Collison

I did some XML Programming during the last month with Java's DOM. this was not funny !! I also played with Ruby's powerful REXML. this is a great API because it uses the power of Ruby and it was designed for Ruby and is not a generic interface like DOM. this is why REXML is so popular in the Ruby world.

and this is why I like XOM. for me it fits much better to Java than DOM. I hope that XOM will become for Java what REXML is for Ruby now.

--Markus Jais

Overall, I found XOM to be an amazingly well-organized, intuitive API that's easy to learn and to use. I like how it reinforces good practices and provides insight about XML -- such as the lack of whitespace when XML is produced without a serializer and the identical treatment of text whether it consists of character entities, CDATA sections, or regular characters.

I can't compare it to JDOM, but it's appreciably more pleasant to work with than the Simple API for XML Processing.

--Rogers Cadenhead

i spent yesterday writing the code to render my application config as xml. using xom was like falling off a log. no muss, no fuss, the methods did what i expected, and any confusion was quickly ironed out by a visit to the (copious) examples, or the javadocs. i did run into what might be a bug, but it only showed up because i made a dumb cut-n-paste error (see my other email).

after i get the output tidied up, i'll move on to reading it back in. i'm confident that that will be almost as easy...

--Dirk Bergstrom

Current version

1.0d25: pre-alpha, final API?
1.0 alpha 1: API freeze
1.0 beta 1: all known bugs fixed
1.0: Documentation complete

Why Me?

Four Styles of XML API

Event Based Push: SAX, XNI
Event Based Pull: XMLPULL, CyberNeko, StAX
Tree: DOM, JDOM, dom4j, Sparta, etc.
Data Binding: Castor, Zeus, JAXB, JaxMe, etc.

Push APIs

Read-only
Fast
Streamable
Memory efficient
Complete
Essentially correct
Client programs can get quite complex and confusing

Pull APIs

Read-only
Fast
Streamable
Memory efficient
Client programs can be much simpler than SAX
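
To make the push/pull contrast concrete, here is a minimal sketch that counts start-tags twice: once with SAX (the parser calls the client) and once with StAX (the client calls the parser). It assumes the SAX and StAX implementations bundled with a current JDK; the class name PushVsPull and its method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of SAX, StAX, or XOM.
final class PushVsPull {

    // Push: the parser drives and calls back into our handler.
    static int countWithSax(String xml) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
        return count[0];
    }

    // Pull: the client drives and asks the parser for the next event.
    static int countWithStax(String xml) {
        int count = 0;
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    count++;
                }
            }
            reader.close();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
        return count;
    }

    public static void main(String[] args) {
        String xml = "<SONG><TITLE>Hot Cop</TITLE><YEAR>1978</YEAR></SONG>";
        System.out.println(countWithSax(xml));   // prints 3
        System.out.println(countWithStax(xml));  // prints 3
    }
}
```

In the pull version the state lives in ordinary local variables and a loop; in the push version it has to be smuggled into the handler, which is where larger SAX clients tend to get confusing.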

Data Binding APIs

Map XML documents to Java classes
Read/Write
Allow in-memory manipulation
Hide the XML details
Common assumptions:
- Documents have schemas
- Documents are valid.
- Structures are fairly flat and definitely not recursive.
- Narrative documents aren't worth considering.
- Mixed content doesn't exist.
- Choices don't exist.
- Order doesn't matter.
Sees the world through object-colored glasses

Tree APIs

Model an XML document using classes that represent nodes
Composition builds a tree
Read/Write
Allow in-memory manipulation
The simplest arbitrary XML API
Tend to be profligate with memory

DOM

Uses factories and interfaces, and yet not interoperable
Fails to enforce all XML constraints; allows creation of malformed documents
Namespaces are represented twice: as node properties and as xmlns attributes
Live lists
Just plain ugly; does not adhere to Java conventions
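
A small sketch of why live lists bite, assuming the DOM implementation bundled with the JDK. The class name LiveListPitfall is hypothetical. The loop looks like it removes every child, but because the NodeList shrinks as nodes are removed, it skips every other one.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical demo class illustrating the live NodeList problem.
final class LiveListPitfall {

    // Tries to remove all four children of a root element
    // and returns the number of children that survive.
    static int removeAllNaively() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
        Element root = doc.createElement("root");
        doc.appendChild(root);
        for (int i = 0; i < 4; i++) {
            root.appendChild(doc.createElement("child" + i));
        }

        // The list is live: it reflects every removal immediately,
        // so the indexes shift underneath the loop.
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            root.removeChild(children.item(i));
        }
        return children.getLength();
    }

    public static void main(String[] args) {
        System.out.println(removeAllNaively()); // prints 2
    }
}
```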

DOM Ugliness

No method overloading
Short type constants for node types
Methods in the Node superinterface that only work for one or two subinterfaces
Incomplete: no standard loading or serialization
A single exception class with short type codes
Does not guarantee Java features like equals(), hashCode(), and toString()
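
Two of these points in code, as a sketch against the DOM bundled with the JDK: node kinds are distinguished by short constants rather than by class, and every failure arrives as the same DOMException carrying a short code. The class DomTypeCodes and its methods are made up for this illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical demo class; only the org.w3c.dom calls are real API.
final class DomTypeCodes {

    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    // Node kinds are told apart by a short type code, not by instanceof.
    static String describe(Node node) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:  return "element";
            case Node.TEXT_NODE:     return "text";
            case Node.COMMENT_NODE:  return "comment";
            case Node.DOCUMENT_NODE: return "document";
            default:                 return "other";
        }
    }

    // A bad element name is reported through the one DOMException class;
    // the caller must inspect the public short field "code" to learn why.
    static boolean badNameIsRejectedWithTypeCode() {
        Document doc = newDocument();
        try {
            doc.createElement("not a name");
            return false;
        } catch (DOMException ex) {
            return ex.code == DOMException.INVALID_CHARACTER_ERR;
        }
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        System.out.println(describe(doc));
        System.out.println(describe(doc.createTextNode("Hot Cop")));
        System.out.println(badNameIsRejectedWithTypeCode());
    }
}
```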

Reasons for DOM Ugliness

Had to be backwards compatible with unplanned object models in third generation web browsers.
Designed by a committee trying to reconcile differences between the object models implemented by Netscape, Microsoft, and other vendors.
A cross-language API defined in IDL
Needed to support weak scripting languages like JavaScript and AppleScript
Must work for both HTML and XML.

What I learned from DOM

A node supertype is very useful
Interfaces are a bad idea
Successful APIs must be simple

JDOM

Simplest of the existing APIs (but it could be simpler)
There's more than one way to do it:
- 3+ ways to read an attribute value
- 5+ ways to read a child element
Not always well-formed:
- Processing instruction data
- Text content
- Internal DTD subset
Setter methods don't return void

Is JDOM too Java-centric?

Too weakly typed: Everything is an Object
Too strongly typed: nothing is a Node
Cloneable
Serializable
Many checked exceptions
Are JDOM committers committed to classes?

What I learned from JDOM

Classes and constructors are good
Thread safety is not necessary
Live lists are trouble
Keep everything in one package
Don't release too early
Don't optimize until the API is right
You don't need to build your own parser, transformer, or search engine
You can fight the W3C

dom4j

Forked from JDOM
More complex
Uses interfaces instead of classes

Conclusion: We can do better

nu.xom: A New XML Object Model

A complete streaming tree model for XML 1.0 instance documents
Free as in speech (LGPL)
Pure Java
Java 1.2 and later (internal dependence on Collections API)

Design Goals

Easy to use
Easy to learn
Fast enough
Small enough
No gotchas

Design Principles

Principle of Least Surprise
As simple as it can be and no simpler!
Use Java idioms where they fit (and only where they fit)
There's exactly one way to do it
Start small and grow as necessary:
- It's easier to put something in than take something out.
  if I may make one point that highly influenced the end-game when we were finishing up XML 1.0 in 1998: if you leave something out, you can always put it in later. The reverse is not true.
  
  --Tim Bray on the public-qt-comments mailing list
- During the design I added methods that were necessary to produce certain sample programs.

Principles of API Design

APIs are written by experts for non-experts
It is the class's responsibility to enforce its class invariants
Verify preconditions
Do not allow clients to do bad things.
Hide as much of the implementation as possible.
Design for subclassing or prohibit it
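
A minimal sketch of these principles applied to XML comment data. The class CommentData is hypothetical and is not XOM's Comment class; it only checks the two XML 1.0 comment rules (no "--" inside, no trailing "-") and ignores illegal characters for brevity.

```java
// Hypothetical class, not part of XOM. It is final (subclassing is
// prohibited rather than designed for), hides its field, and checks
// every precondition before changing state.
final class CommentData {

    private String data;

    CommentData(String data) {
        setData(data);
    }

    // Copy constructor instead of Cloneable; the source object
    // already satisfies the invariant, so no re-check is needed.
    CommentData(CommentData other) {
        this.data = other.data;
    }

    void setData(String data) {
        if (data == null) {
            throw new NullPointerException("Comment data must be non-null.");
        }
        if (data.indexOf("--") != -1) {
            throw new IllegalArgumentException(
                "Comment data must not contain a double hyphen.");
        }
        if (data.endsWith("-")) {
            throw new IllegalArgumentException(
                "Comment data must not end with a hyphen.");
        }
        this.data = data;
    }

    String getData() {
        return data;
    }

    // Because the invariant always holds, this is always a legal comment.
    String toXML() {
        return "<!--" + data + "-->";
    }
}
```

Since no client can ever put the object into a bad state, toXML() needs no checks of its own and cannot emit a malformed comment.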

XML Principles

All objects can be written as well-formed XML text
Impossible to create malformed documents
Validity can be enforced by subclasses
Syntax sugar is not represented:
- CDATA sections
- Character and entity references
- Attribute order
- Defaulted vs. specified attributes
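
Why CDATA sections and references count as mere syntax sugar: three different spellings of the same content produce the same text once parsed. This sketch uses the JDK's DOM parser rather than XOM so that it runs without extra libraries; the class name SyntaxSugar is made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses the JDK's DOM parser, not XOM.
final class SyntaxSugar {

    // Parses a small document and returns the text of its root element.
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // Entity reference, character reference, CDATA section:
        // each line prints "A & M Records"
        System.out.println(textOf("<PUBLISHER>A &amp; M Records</PUBLISHER>"));
        System.out.println(textOf("<PUBLISHER>A &#38; M Records</PUBLISHER>"));
        System.out.println(textOf("<PUBLISHER><![CDATA[A & M Records]]></PUBLISHER>"));
    }
}
```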

Java Design Principles

Not thread safe
Classes do not implement Serializable; use XML.
Classes do not implement Cloneable; use copy constructors.
Lack of generics really hurts in the Collections API. Hence, don't use it in the public API.
Problems detectable in testing throw runtime exceptions
Assertions that can be turned off are pointless

Development Style

This is a cathedral, not a bazaar
Unit testing
Massive samples

Create and serialize a document

import java.math.BigInteger;
import nu.xom.Element;
import nu.xom.Document;

public class FibonacciXML {

  public static void main(String[] args) {
   
      BigInteger low  = BigInteger.ONE;
      BigInteger high = BigInteger.ONE;      
      
      Element root = new Element("Fibonacci_Numbers");  
      for (int i = 1; i <= 10; i++) {
        Element fibonacci = new Element("fibonacci");
        fibonacci.appendChild(low.toString());
        root.appendChild(fibonacci);
		
        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }
      Document doc = new Document(root);
      System.out.println(doc.toXML());  

  }

}

FibonacciXML Output

% java -classpath ~/XOM/build/classes:. FibonacciXML
<?xml version="1.0"?>
<Fibonacci_Numbers><fibonacci>1</fibonacci><fibonacci>1</fibonacci><fibonacci>2</fibonacci><fibonacci>3</fibonacci><fibonacci>5</fibonacci><fibonacci>8</fibonacci><fibonacci>13</fibonacci><fibonacci>21</fibonacci><fibonacci>34</fibonacci><fibonacci>55</fibonacci></Fibonacci_Numbers>

Parsing a document

try {
  Builder parser = new Builder();
  Document doc = parser.build(url);
  System.out.println(doc.toXML());
}
catch (ParsingException ex) {
  System.out.println(url + " is not well-formed.");
  System.out.println(ex.getMessage());
}
catch (IOException ex) { 
  System.out.println("Due to an IOException, "
  + "the parser could not check " + url); 
}

The Node Class

public abstract class Node {

  public       String     getValue();
  public final Document   getDocument();
  public       String     getBaseURI();
  public final ParentNode getParent();
  public       Node       getChild(int position);
  public       int        getChildCount();

  public final void       detach();
  public       Node       copy();    
  public       String     toXML(); 
  
  public final boolean    equals(Object o);
  public final int        hashCode();
      
}

Example: PropertyPrinter

getValue() returns the XPath string value of a node
toXML() returns a String containing the XML form of the node

import java.io.*;
import nu.xom.*;


public class PropertyPrinter {

    private Writer out;
    
    public PropertyPrinter(Writer out) {
      if (out == null) {
        throw new NullPointerException("Writer must be non-null.");
      }
      this.out = out;
    }
    
    public PropertyPrinter() {
      this(new OutputStreamWriter(System.out));
    }
    
    private int nodeCount = 0;
    
    public void writeNode(Node node) throws IOException {
      
        if (node == null) {
            throw new NullPointerException("Node must be non-null.");
        }
        if (node instanceof Document) { 
            // starting a new document, reset the node count
            nodeCount = 1; 
        }
      
        String type      = node.getClass().getName(); // never null
        String value     = node.getValue();
        
        String name      = null; 
        String localName = null;
        String uri       = null;
        String prefix    = null;

        if (node instanceof Element) {
            Element element = (Element) node;
            name = element.getQualifiedName();
            localName = element.getLocalName();
            uri = element.getNamespaceURI();
            prefix = element.getNamespacePrefix();
        }
        else if (node instanceof Attribute) {
            Attribute attribute = (Attribute) node;
            name = attribute.getQualifiedName();
            localName = attribute.getLocalName();
            uri = attribute.getNamespaceURI();
            prefix = attribute.getNamespacePrefix();
        }

      
        StringBuffer result = new StringBuffer();
        result.append("Node " + nodeCount + ":\r\n");
        result.append("  Type: " + type + "\r\n");
        if (name != null) {
            result.append("  Name: " + name + "\r\n");
        }
        if (localName != null) {
            result.append("  Local Name: " + localName + "\r\n");
        }
        if (prefix != null) {
            result.append("  Prefix: " + prefix + "\r\n");
        }
        if (uri != null) {
            result.append("  Namespace URI: " + uri + "\r\n");
        }
        if (value != null) {
            result.append("  Value: " + value + "\r\n");
        }
      
        out.write(result.toString());
        out.write("\r\n");
        out.flush();
      
        nodeCount++;
      
    }
    
    public static void main(String[] args) throws Exception {
     
      Builder builder = new Builder();
      for (int i = 0; i < args.length; i++) {
          PropertyPrinter p = new PropertyPrinter();
          File f = new File(args[i]);
          Document doc = builder.build(f);
          p.writeNode(doc);
      }   
        
    }

}

PropertyPrinter Output

% java -classpath ~/XOM/build/classes:. PropertyPrinter hotcop.xml
Node 1:
  Type: nu.xom.Document
  Value:
  Hot Cop

  Jacques Morali
  Henri Belolo
  Victor Willis
  Jacques Morali


    A & M Records

  6:20
  1978
  Village People

Example: TreeReporter

Recursive, pre-order traversal
getFirstChild()
Indexed navigation is the key
No iterators; no siblings

import java.io.IOException;
import nu.xom.*;


public class TreeReporter {

    public static void main(String[] args) {
     
        if (args.length <= 0) {
          System.out.println("Usage: java TreeReporter URL");
          return; 
        }
         
        TreeReporter iterator = new TreeReporter();
        try {
          Builder parser = new Builder();
          
          // Read the entire document into memory
          Node document = parser.build(args[0]); 
          
          // Process it starting at the root
          iterator.followNode(document);
    
        }
        catch (IOException ex) { 
          System.out.println(ex); 
        }
        catch (ParsingException ex) { 
          System.out.println(ex); 
        }
  
    } // end main

    private PropertyPrinter printer = new PropertyPrinter();
  
    // note use of recursion
    public void followNode(Node node) throws IOException {
    
        printer.writeNode(node);
        for (int i = 0; i < node.getChildCount(); i++) {
            followNode(node.getChild(i));
        }
    
    }

}

TreeReporter Output

% java -classpath ~/XOM/build/classes:. TreeReporter hotcop.xml
Node 1:
  Type: nu.xom.Document
  Value:
  Hot Cop

  Jacques Morali
  Henri Belolo
  Victor Willis
  Jacques Morali


    A & M Records

  6:20
  1978
  Village People


Node 2:
  Type: nu.xom.ProcessingInstruction
  Value: type="text/css" href="song.css"

Node 3:
  Type: nu.xom.DocType
  Value:

Node 4:
  Type: nu.xom.Element
  Name: SONG
  Local Name: SONG
  Prefix:
  Namespace URI: http://metalab.unc.edu/xml/namespace/song
  Value:
  Hot Cop

  Jacques Morali
  Henri Belolo
  Victor Willis
  Jacques Morali


    A & M Records

  6:20
  1978
  Village People

...

The Document Class

Subclass of ParentNode
Document children are:
- Comments
- Processing Instructions
- Zero or one DocType
- One Root Element

package nu.xom;

public class Document extends ParentNode {

  public Document(Element root);
  public Document(Document doc);
  
  public final DocType getDocType() ;
  public final Element getRootElement();
  public       void    setRootElement(Element root);
  public       void    setBaseURI(String URI);
  public final String  getBaseURI();
  
  public       void    insertChild(int position, Node c);
  public       void    removeChild(int position);
  public       void    removeChild(Node child);

  public final String  getValue() ;
  public final String  toXML();
  public       Node    copy();
  
}

Example: Validating XHTML

The document must validate against one of the three DTDs found in Appendix A.
The root element of the document must be <html>.
The root element of the document must designate the XHTML namespace using the xmlns attribute [XMLNAMES]. The namespace for XHTML is defined to be http://www.w3.org/1999/xhtml.
There must be a DOCTYPE declaration in the document prior to the root element. The public identifier included in the DOCTYPE declaration must reference one of the three DTDs found in Appendix A using the respective Formal Public Identifier. The system identifier may be changed to reflect local system conventions.

Verify Root Element is html in the XHTML namespace

      boolean valid = true;       
      DocType doctype = document.getDocType();
    
      if (doctype == null) {
        valid = false;
      }
      else {
        // check doctype
      }
    
      Element root = document.getRootElement();
      String uri = root.getNamespaceURI();
      String prefix = root.getNamespacePrefix();
      if (!root.getLocalName().equals("html")) {
        valid = false;
      }
      if (!uri.equals("http://www.w3.org/1999/xhtml")) {
        valid = false;
      }
      if (!prefix.equals("")) valid = false;

The Element Class

Largest class in XOM
Subclass of ParentNode
Every Element has:
- Local name
- Namespace prefix (which can be the empty string)
- Namespace URI (which can be the empty string)
- A collection of Attributes
- A collection of additional namespaces
- A list of children
- A ParentNode (which may be null)
- An owner Document (which may be null)

Element Constructors:

    public Element(String name);
    public Element(String name, String uri);
    public Element(Element element);

    Element para = new Element("para");
    Element p = new Element("p", "http://www.w3.org/1999/xhtml");
    Element text = new Element("svg:text", "http://www.w3.org/2000/svg");

Element Properties

Getters:

    public final String getLocalName();
    public final String getQualifiedName();
    public final String getNamespacePrefix();
    public final String getNamespaceURI();
    public final String getNamespaceURI(String prefix);

Setters:

    public void setLocalName(String localName);
    public void setNamespaceURI(String URI);
    public void setNamespacePrefix(String prefix);

Methods to get child elements

    public final Elements getChildElements();
    public final Elements getChildElements(String name);
    public final Elements getChildElements(String localName, String namespace);
    public final Element  getFirstChildElement(String name);
    public final Element  getFirstChildElement(String localName, String namespace);

The Elements class

A read-only list containing only Element objects

public final class Elements {

    public int     size();
    public Element get(int index);
    
}

Recursive Descent

public void process(Element element) {

  Elements children = element.getChildElements();
  for (int i = 0; i < children.size(); i++) {
    process(children.get(i));
  }

}

Example: TreeViewer

import javax.swing.*;
import javax.swing.tree.*;
import nu.xom.*;

public class TreeViewer {

    // Initialize the per-element data structures
    public static MutableTreeNode processElement(Element element) {

        String data;
        if (element.getNamespaceURI().equals(""))
            data = element.getLocalName();
        else {
            data =
                '{'
                    + element.getNamespaceURI()
                    + "} "
                    + element.getQualifiedName();
        }

        MutableTreeNode node = new DefaultMutableTreeNode(data);
        Elements children = element.getChildElements();
        for (int i = 0; i < children.size(); i++) {
            node.insert(processElement(children.get(i)), i);
        }

        return node;

    }

    public static void display(Document doc) {

        Element root = doc.getRootElement();
        JTree tree = new JTree(processElement(root));
        JScrollPane treeView = new JScrollPane(tree);
        JFrame f = new JFrame("XML Tree");


        String version = System.getProperty("java.version");
        if (version.startsWith("1.2") || version.startsWith("1.1")) {
            f.setDefaultCloseOperation(JFrame.HIDE_ON_CLOSE); 
        }
        else {
            // JFrame.EXIT_ON_CLOSE == 3 but this named constant is not
            // available in Java 1.2
            f.setDefaultCloseOperation(3);
        }
        f.getContentPane().add(treeView);
        f.pack();
        f.show();

    }

    public static void main(String[] args) {

        try {
            Builder builder = new Builder();
            for (int i = 0; i < args.length; i++) {
                Document doc = builder.build(args[i]);
                display(doc);
            }
        }
        catch (Exception ex) {
            System.err.println(ex);
        }

    } // end main()

} // end TreeViewer

Attribute Methods on Element

    public       void      addAttribute(Attribute attribute);
    public       void      removeAttribute(Attribute attribute);
    public final Attribute getAttribute(String name);
    public final Attribute getAttribute(String localName, String namespaceURI);
    public final String    getAttributeValue(String name);
    public final String    getAttributeValue(String localName, String namespaceURI);
    public final int       getAttributeCount();
    public final Attribute getAttribute(int i);

Example: IDTagger

import java.io.IOException;
import nu.xom.*;

public class IDTagger {

  private static int id = 1;

  public static void processElement(Element element) {
    
    if (element.getAttribute("ID") == null) {
      element.addAttribute(new Attribute("ID", "_" + id));
      id = id + 1; 
    }
    
    // recursion
    Elements children = element.getChildElements();
    for (int i = 0; i < children.size(); i++) {
      processElement(children.get(i));   
    }
    
  }

  public static void main(String[] args) {
     
    Builder builder = new Builder();
    
    for (int i = 0; i < args.length; i++) {
        
      try {
        // Read the entire document into memory
        Document document = builder.build(args[i]); 
       
        processElement(document.getRootElement());
        
        System.out.println(document.toXML());         
      }
      catch (ParsingException ex) {
        System.err.println(ex);
        continue; 
      }
      catch (IOException ex) {
        System.err.println(ex);
        continue; 
      }
      
    }
  
  } // end main

}

Additional Namespaces

Only for namespace prefixes used in attribute values and element content (e.g. XSLT and W3C Schemas)
Never used when an element or attribute in scope already has the prefix

public void addNamespaceDeclaration(String prefix, String URI);
public void removeNamespaceDeclaration(String prefix);

Enumerating Namespaces

Don't normally need to do this; most of the time the namespace of any given element or attribute is sufficient

These methods allow you to list all the namespaces in-scope on any given element:

public final int    getNamespaceDeclarationCount();
public final String getNamespacePrefix(int index);
public final String getNamespaceURI(String prefix);

The Text Class

Represents character data in element content
By default, the Builder places the maximum possible contiguous amount of text in each node.
CDATA sections are silently preserved from build to serialization when possible

package nu.xom;

public class Text extends Node {

  public Text(String data);
  public Text(Text text);

  public       void   setValue(String data);
  public final String getValue();
  
  public final Node getChild(int i);
  public final int  getChildCount();

  public final String toString();

  public       Node    copy();
  public final String  toXML();

}

ROT13XML

import java.io.IOException;
import nu.xom.*;

public class ROT13XML {

    // note use of recursion
    public static void encode(Node node) {
    
        if (node instanceof Text) {
          Text text = (Text) node;
          String data = text.getValue();
          text.setValue(rot13(data));
        }
        
        // recurse the children
        for (int i = 0; i < node.getChildCount(); i++) {
            encode(node.getChild(i));
        } 
    
    }
  
    public static String rot13(String s) {
    
        StringBuffer out = new StringBuffer(s.length());
        for (int i = 0; i < s.length(); i++) {
          int c = s.charAt(i);
          if (c >= 'A' && c <= 'M') out.append((char) (c+13));
          else if (c >= 'N' && c <= 'Z') out.append((char) (c-13));
          else if (c >= 'a' && c <= 'm') out.append((char) (c+13));
          else if (c >= 'n' && c <= 'z') out.append((char) (c-13));
          else out.append((char) c);
        } 
        return out.toString();
    
    }

  public static void main(String[] args) {

    if (args.length <= 0) {
      System.out.println("Usage: java ROT13XML URL");
      return;
    }
    
    String url = args[0];
    
    try {
      Builder parser = new Builder();
      
      // Read the document
      Document document = parser.build(url); 
      
      // Modify the document
      ROT13XML.encode(document);

      // Write it out again
      System.out.println(document.toXML());

    }
    catch (IOException ex) { 
      System.out.println(
      "Due to an IOException, the parser could not encode " + url
      ); 
    }
    catch (ParsingException ex) { 
      System.out.println(ex);
    }
     
  } // end main

}

ROT13XML Output

% java -classpath ~/XOM/build/classes:. ROT13XML hotcop.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="song.css"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<SONG xmlns="http://metalab.unc.edu/xml/namespace/song" xmlns:xlink="http://www.w3.org/1999/xlink">
  <TITLE>Ubg Pbc</TITLE>
  <PHOTO xlink:type="simple" xlink:show="onLoad" xlink:href="hotcop.jpg" ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200" />
  <COMPOSER>Wnpdhrf Zbenyv</COMPOSER>
  <COMPOSER>Uraev Orybyb</COMPOSER>
  <COMPOSER>Ivpgbe Jvyyvf</COMPOSER>
  <PRODUCER>Wnpdhrf Zbenyv</PRODUCER>
  <!-- The publisher is actually Polygram but I needed
       an example of a general entity reference. -->
  <PUBLISHER xlink:type="simple" xlink:href="http://www.amrecords.com/">
    N &amp; Z Erpbeqf
  </PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Ivyyntr Crbcyr</ARTIST>
</SONG>
<!-- You can tell what album I was
     listening to when I wrote this example -->

The Attribute Class

Subclasses Node
Each Attribute has:
- Local name
- Namespace prefix (which can be the empty string)
- Namespace URI (which can be the empty string)
- A type
- A value
- A parent Element (which may be null)
- An owner Document (which may be null)

Attribute Constructors

   public Attribute(String localName, String value);
   public Attribute(String localName, String value, Type type);
   public Attribute(String name, String URI, String value, Type type);
   public Attribute(Attribute attribute);

Attribute Getter and Setter Methods

public final Type   getType();
public       void   setType(Type type);
public final String getValue();
public       void   setValue(String value);
public final String getLocalName();
public       void   setLocalName(String localName);
public final String getQualifiedName();
public final String getNamespaceURI();
public final String getPrefix();
public       void   setNamespace(String prefix, String URI);

Example: XLinkSpider

import java.net.*;
import java.util.*;
import nu.xom.*;

public class XLinkSpider {

    private Set spidered   = new HashSet();
    private Builder parser = new Builder();
    private List queue     = new LinkedList();
    
    public static final String XLINK_NS 
      = "http://www.w3.org/1999/xlink";
    public static final String XML_NS 
      = "http://www.w3.org/XML/1998/namespace";
    
    public void search(URL url) {
        
        try {
            String systemID = url.toExternalForm();
            Document doc = parser.build(systemID);
            System.out.println(url);
            search(doc.getRootElement(), url);
        }
        catch (Exception ex) {
            // just skip this document
        }
        
        if (queue.isEmpty()) return;
        
        URL discovered = (URL) queue.remove(0);
        spidered.add(discovered);
        search(discovered);      
        
    }

    private void search(Element element, URL base) {

        Attribute href = element.getAttribute("href", XLINK_NS); 
        Attribute xmlbase = element.getAttribute("base", XML_NS);
        try {
            if (xmlbase != null) {
                base = new URL(base, xmlbase.getValue());
            }
        }
        catch (MalformedURLException ex) {
            // Probably just no protocol handler for the 
            // kind of URLs used inside this element
            return;
        }
        if (href != null) {
            String uri = href.getValue();
            // absolutize URL
            try {
                URL discovered = new URL(base, uri);
                // strip ref field if any
                discovered = new URL(
                  discovered.getProtocol(),
                  discovered.getHost(),
                  discovered.getFile()
                );
                
                if (!spidered.contains(discovered) 
                  && !queue.contains(discovered)) {
                    queue.add(discovered);   
                }
            }
            catch (MalformedURLException ex) {
                // skip this one   
            }
        }
        Elements children = element.getChildElements();
        for (int i = 0; i < children.size(); i++) {
            search(children.get(i), base);
        }
        
    }

    public static void main(String[] args) {
      
        XLinkSpider spider = new XLinkSpider();
        for (int i = 0; i < args.length; i++) { 
            try { 
                spider.search(new URL(args[i]));
            }
            catch (MalformedURLException ex) {
                System.err.println(ex);   
            }
        }
      
    }   // end main()

}

XLinkSpider Output

% java -classpath ~/XOM/build/classes:. XLinkSpider http://www.rddl.org
http://www.rddl.org
http://www.rddl.org/purposes
http://www.rddl.org/rddl.rdfs
http://www.rddl.org/rddl-integration.rxg
http://www.rddl.org/modules/rddl-1.rxm
http://www.rddl.org/modules/xhtml-attribs-1.rxm
http://www.rddl.org/modules/xhtml-base-1.rxm
http://www.rddl.org/modules/xhtml-basic-form-1.rxm
http://www.rddl.org/modules/xhtml-basic-table-1.rxm
http://www.rddl.org/modules/xhtml-basic10-model-1.rxm
http://www.rddl.org/modules/xhtml-basic10.rxm
http://www.rddl.org/modules/xhtml-blkphras-1.rxm
http://www.rddl.org/modules/xhtml-blkstruct-1.rxm
http://www.rddl.org/modules/xhtml-for-rddl.rxm
http://www.rddl.org/modules/xhtml-framework-1.rxm
http://www.rddl.org/modules/xhtml-hypertext-1.rxm
http://www.rddl.org/modules/xhtml-image-1.rxm
http://www.rddl.org/modules/xhtml-inlphras-1.rxm
http://www.rddl.org/modules/xhtml-inlstruct-1.rxm
http://www.rddl.org/modules/xhtml-link-1.rxm
http://www.rddl.org/modules/xhtml-list-1.rxm
http://www.rddl.org/modules/xhtml-meta-1.rxm
...
http://www.w3.org/TR/xhtml-basic
http://www.w3.org/TR/xml-infoset/
http://www.w3.org/tr/xhtml1
http://www.w3.org/TR/xhtml-modularization/
http://www.rddl.org/purposes/software
http://www.ascc.net/xml/schematron
http://www.w3.org/2001/XMLSchema
http://www.examplotron.org
...

Attribute.Type

Inner class that uses the type-safe enum pattern for the ten DTD attribute types, plus UNDECLARED:
- Attribute.Type.CDATA
- Attribute.Type.ID
- Attribute.Type.IDREF
- Attribute.Type.IDREFS
- Attribute.Type.NMTOKEN
- Attribute.Type.NMTOKENS
- Attribute.Type.NOTATION
- Attribute.Type.ENTITY
- Attribute.Type.ENTITIES
- Attribute.Type.ENUMERATION
- Attribute.Type.UNDECLARED
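
For readers who have not met the type-safe enum pattern, here is a rough sketch of its usual shape in pre-enum Java. AttrType is a hypothetical stand-in that lists only a few constants; it is not XOM's actual implementation.

```java
// Hypothetical stand-in for a type-safe enum; not XOM's source code.
final class AttrType {

    // The only instances that will ever exist.
    static final AttrType CDATA      = new AttrType("CDATA");
    static final AttrType ID         = new AttrType("ID");
    static final AttrType IDREF      = new AttrType("IDREF");
    static final AttrType UNDECLARED = new AttrType("UNDECLARED");

    private final String name;

    // Private constructor: clients cannot invent new types.
    private AttrType(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    @Override
    public String toString() {
        return "[AttrType: " + name + "]";
    }
}
```

Unlike DOM's short constants, a method that takes such a parameter cannot be handed an arbitrary number, and two types can be compared safely with ==.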

The ProcessingInstruction Class

ProcessingInstruction extends Node
Each ProcessingInstruction has:
- Target, a string
- Data, a string
- plus the usual properties of any Node
Pseudo-attributes are not specifically supported

package nu.xom;

public class ProcessingInstruction extends Node {

  public ProcessingInstruction(String target, String data);
  public ProcessingInstruction(ProcessingInstruction instruction);

  public final String getTarget();
  public       void   setTarget(String target);
  protected    void   checkTarget(String target);
  public final String getValue();
  public       void   setValue(String data);
  protected    void   checkValue(String data);
  
  public final Node getChild(int i);
  public final int  getChildCount();

  public final Node   copy();
  public final String toXML();

  public final String toString();

}

Example: PoliteSpider

Robots processing instruction:

<?robots index="yes | no"
         follow="yes | no" ?>
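
Since pseudo-attributes are only a convention inside the processing instruction's data string, the client has to pick them apart itself. One possible approach is sketched below with a regular expression; PseudoAttributeSketch is a hypothetical helper that returns a Map, and it is not the PseudoAttributes class used in the listing. It does not resolve character references or report malformed data.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a simplified sketch, not XOM sample code.
final class PseudoAttributeSketch {

    // name="value" or name='value', with optional space around '='
    private static final Pattern PAIR = Pattern.compile(
        "([A-Za-z_][\\w.-]*)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

    // Returns the pseudo-attributes in the order they appear.
    static Map<String, String> parse(String data) {
        Map<String, String> result = new LinkedHashMap<String, String>();
        Matcher matcher = PAIR.matcher(data);
        while (matcher.find()) {
            // Group 2 holds a double-quoted value, group 3 a single-quoted one.
            String value = matcher.group(2) != null
                ? matcher.group(2) : matcher.group(3);
            result.put(matcher.group(1), value);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> robots = parse("index=\"no\" follow='yes'");
        System.out.println(robots.get("index"));   // prints no
        System.out.println(robots.get("follow"));  // prints yes
    }
}
```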

package nu.xom.samples;

import java.net.*;
import java.util.*;
import nu.xom.*;

public class PoliteSpider {

    private Set spidered   = new HashSet();
    private Builder parser = new Builder();
    private List queue     = new LinkedList();
    
    public static final String XLINK_NS 
     = "http://www.w3.org/1999/xlink";
    public static final String XML_NS 
     = "http://www.w3.org/XML/1998/namespace";
    
    public void search(URL url) {
        
        try {
            String systemID = url.toExternalForm();
            Document doc = parser.build(systemID);
            
            boolean follow = true;
            boolean index = true;
            for (int i = 0; i < doc.getChildCount(); i++) {
                Node child = doc.getChild(i); 
                if (child instanceof Element) break;  
                if (child instanceof ProcessingInstruction){
                    ProcessingInstruction instruction 
                      = (ProcessingInstruction) child;
                    if (instruction.getTarget().equals("robots")) {
                        Element data 
                          = PseudoAttributes.getAttributes(instruction); 
                        Attribute indexAtt = data.getAttribute("index"); 
                        if (indexAtt != null) {
                            String value = indexAtt.getValue().trim();
                            if (value.equals("no")) index = false;
                        }
                        Attribute followAtt = data.getAttribute("follow"); 
                        if (followAtt != null) {
                            String value = followAtt.getValue().trim();
                            if (value.equals("no")) follow = false;
                        }
                    }   
                }  
            }
            
            if (index) System.out.println(url);
            if (follow) search(doc.getRootElement(), url);
        }
        catch (Exception ex) {
            // just skip this document
        }
        
        if (queue.isEmpty()) return;
        
        URL discovered = (URL) queue.remove(0);
        spidered.add(discovered);
        search(discovered);      
        
    }

    private void search(Element element, URL base) {

        Attribute href = element.getAttribute("href", XLINK_NS); 
        Attribute xmlbase = element.getAttribute("base", XML_NS);
        try {
            if (xmlbase != null) base = new URL(base, xmlbase.getValue());
        }
        catch (MalformedURLException ex) {
            //Java can't handle the kind of URLs used inside this element
            return;
        }
        if (href != null) {
            String uri = href.getValue();
            // absolutize URL
            try {
                URL discovered = new URL(base, uri);
                // strip ref field if any
                discovered = new URL(
                  discovered.getProtocol(),
                  discovered.getHost(),
                  discovered.getFile()
                );
                
                if (!spidered.contains(discovered) 
                  && !queue.contains(discovered)) {
                    queue.add(discovered);   
                }
            }
            catch (MalformedURLException ex) {
                // skip this one   
            }
        }
        Elements children = element.getChildElements();
        for (int i = 0; i < children.size(); i++) {
            search(children.get(i), base);
        }
        
    }

    public static void main(String[] args) {
      
        PoliteSpider spider = new PoliteSpider();
        for (int i = 0; i < args.length; i++) { 
            try { 
                spider.search(new URL(args[i]));
            }
            catch (MalformedURLException ex) {
                System.err.println(ex);   
            }
        }
      
    }   // end main()

}

The DocType Class

Represents the document type declaration
Not the document type definition!
Properties:
- Root element name
- Public ID (may be null)
- System ID (may be null)
- Internal DTD subset (read-only, may be null)
Limited to one per document, in the prolog only

public class DocType extends Node{

 public DocType(String rootElementName, String publicID, String systemID);
 public DocType(String rootElementName, String systemID);
 public DocType(String rootElementName);
 public DocType(DocType doctype);
    
 public       void   setRootElementName(String name);
 public final String getRootElementName();
 public final String getInternalDTDSubset();
 public       void   setPublicID(String id);
 public final String getPublicID();
 public       void   setSystemID(String id);
 public final String getSystemID();
 
 public final Node getChild(int i);
 public final int  getChildCount();

 public final Node   copy();
 public final String toXML();
 
}
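
To make the declaration (as opposed to the definition) concrete, here is a rough sketch of how the root element name, public ID, and system ID map onto the text of a document type declaration. DocTypeText is a hypothetical helper, not XOM's toXML() implementation. It ignores the internal DTD subset and assumes the IDs contain no double quotes.

```java
// Hypothetical helper, not XOM source: formats a document type declaration
// from the root element name, public ID, and system ID.
final class DocTypeText {

    public static String format(String rootElementName,
      String publicID, String systemID) {

        StringBuilder result = new StringBuilder("<!DOCTYPE ");
        result.append(rootElementName);
        if (publicID != null) {
            // XML requires a system ID whenever a public ID is present
            if (systemID == null) {
                throw new IllegalArgumentException(
                  "Public ID without system ID");
            }
            result.append(" PUBLIC \"").append(publicID)
                  .append("\" \"").append(systemID).append('"');
        }
        else if (systemID != null) {
            result.append(" SYSTEM \"").append(systemID).append('"');
        }
        return result.append('>').toString();
    }

    public static void main(String[] args) {
        System.out.println(format("html",
          "-//W3C//DTD XHTML 1.0 Strict//EN", "DTD/xhtml1-strict.dtd"));
        System.out.println(format("root", null, "root.dtd"));
        System.out.println(format("root", null, null));
    }
}
```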

Validating XHTML

It must validate against one of the three DTDs found in Appendix A.
The root element of the document must be <html>.
The root element of the document must designate the XHTML namespace using the xmlns attribute [XMLNAMES]. The namespace for XHTML is defined to be http://www.w3.org/1999/xhtml.
There must be a DOCTYPE declaration in the document prior to the root element. The public identifier included in the DOCTYPE declaration must reference one of the three DTDs found in Appendix A using the respective Formal Public Identifier. The system identifier may be changed to reflect local system conventions.

Three XHTML DTDs:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "DTD/xhtml1-frameset.dtd">

XHTMLValidator

import java.io.IOException;
import nu.xom.*;

public class XHTMLValidator {

  public static void main(String[] args) {
    
    for (int i = 0; i < args.length; i++) {
      validate(args[i]);
    }   
    
  }

  private static Builder builder = new Builder(true);
                         /* turn on validation ^^^^ */
  
  // not thread safe
  public static void validate(String source) {
        
      Document document;
      try {
        document = builder.build(source); 
      }
      catch (ParsingException ex) {  
        System.out.println(source 
         + " is invalid XML, and thus not XHTML."); 
        return; 
      }
      catch (IOException ex) {  
        System.out.println("Could not read: " + source); 
        return; 
      }
      
      // If we get this far, then the document is valid XML.
      // Check to see whether the document is actually XHTML 
      boolean valid = true;       
      DocType doctype = document.getDocType();
    
      if (doctype == null) {
        System.out.println("No DOCTYPE");
        valid = false;
      }
      else {
        // verify the DOCTYPE
        String name     = doctype.getRootElementName();
        String publicID = doctype.getPublicID();
      
        if (!name.equals("html")) {
          System.out.println(
           "Incorrect root element name " + name);
          valid = false;
        }
    
        if (publicID == null
         || (!publicID.equals("-//W3C//DTD XHTML 1.0 Strict//EN")
           && !publicID.equals(
            "-//W3C//DTD XHTML 1.0 Transitional//EN")
           && !publicID.equals(
            "-//W3C//DTD XHTML 1.0 Frameset//EN"))) {
          valid = false;
          System.out.println(source 
           + " does not seem to use an XHTML 1.0 DTD");
        }
      }
    
    
      // Check the namespace on the root element
      Element root = document.getRootElement();
      String uri = root.getNamespaceURI();
      String prefix = root.getNamespacePrefix();
      if (!uri.equals("http://www.w3.org/1999/xhtml")) {
        valid = false;
        System.out.println(source 
         + " does not properly declare the"
         + " http://www.w3.org/1999/xhtml namespace"
         + " on the root element");        
      }
      if (!prefix.equals("")) {
        valid = false;
        System.out.println(source 
         + " does not use the empty prefix for XHTML");        
      }
      
      if (valid) System.out.println(source + " is valid XHTML.");
    
  }

}

XHTMLValidator Output

% java -classpath ~/XOM/build/classes:. XHTMLValidator http://www.w3.org/ http://www.cafeconleche.org/
http://www.w3.org/ is valid XHTML.
http://www.cafeconleche.org/ is invalid XML, and thus not XHTML.

The Comment Class

package nu.xom;

public class Comment extends Node {

  public Comment(String data);
  public Comment(Comment comment);

  public final String getValue();
  public       void   setValue(String data);
  
  public final Node getChild(int i);
  public final int  getChildCount();
  
  public final Node   copy();
  public final String toXML();
  
  public final String toString();
	
}

Example: CommentReader

import java.io.IOException;
import nu.xom.*;

public class CommentReader {

    public static void list(Node node) {
        
        for (int i = 0; i < node.getChildCount(); i++) {           
            Node child = node.getChild(i);
            if (child instanceof Comment) {
                System.out.println(child.toXML());
            }
            else {
                list(child);   
            }
        }
        
    } 

    public static void main(String[] args) {
  
        if (args.length <= 0) {
          System.out.println("Usage: java CommentReader URL");
          return;
        }
        
        try {
          Builder parser = new Builder();
          Document doc = parser.build(args[0]);
          list(doc);
        }
        catch (ParsingException ex) {
          System.out.println(args[0] + " is not well-formed.");
          System.out.println(ex.getMessage());
        }
        catch (IOException ex) { 
          System.out.println(
           "Due to an IOException, the parser could not read " 
           + args[0]
          ); 
        }
  
    }

}

CommentReader Output

$ java -classpath  ~/XOM/build/classes/:. CommentReader http://www.w3.org/TR/2004/REC-DOM-Level-3-Val-20040127/xml-source.xml
<!-- $Id: xml-source.xml,v 1.7 2004/01/26 22:31:28 plehegar Exp $ -->
<!--
  *************************************************************************
  * FRONT MATTER                                                          *
  *************************************************************************
  -->
<!--
  ******************************************************
  | filenames to be used for each section              |
  ******************************************************
-->
<!--
    ******************************************************
    * DOCUMENT ABSTRACT                                  *
    ******************************************************
    -->
<!-- $Id: xml-source.xml,v 1.7 2004/01/26 22:31:28 plehegar Exp $ -->
<!-- $Id: xml-source.xml,v 1.7 2004/01/26 22:31:28 plehegar Exp $ -->
<!--
 *************************************************************************
 * BEGINNING OF COPYRIGHT NOTICE                                         *
 *************************************************************************
-->
<!--
 *************************************************************************
 * END OF COPYRIGHT NOTICE                                               *
 *************************************************************************
-->
<!-- $Id: xml-source.xml,v 1.7 2004/01/26 22:31:28 plehegar Exp $ -->
<!--
 *************************************************************************
 * BEGINNING OF VALIDATION
 *************************************************************************
-->
<!--
  ******************************************************
  Last known edit 12/03/2003
  Suggestions welcome, especially if accompanied by
  proposed revisions already marked up as per spec.dtd!
  ******************************************************
  -->
<!--
  ******************************************************
  | OVERVIEW                                            |
  ******************************************************
  -->
<!--
  ******************************************************
  | ISSUES                                             |
  ******************************************************
<div2 id="Level-3-VAL-Issue-List">
  <head>Issue List</head>

  <div3 id="VAL-Issues-List-Resolved">
    <head>Resolved Issues</head>

    <issue id="VAL-Issue-8" status="open">
      <p>For Validation interfaces there should be no dependency on DOM Core.
      </p>
      <p>The <code>NodeEditVAL</code> interface will not extend DOM Core.  It is simply an object that expresses similar interfaces.</p>
    </issue>

  </div3>

-->...

The Builder Class

package nu.xom;

public class Builder {

    public Builder();
    public Builder(boolean validate);
    public Builder(boolean validate, NodeFactory factory);
    public Builder(XMLReader parser);
    public Builder(NodeFactory factory);
    public Builder(XMLReader parser, boolean validate);
    public Builder(XMLReader parser, boolean validate, NodeFactory factory);
    
    public Document build(String systemID) 
      throws ParsingException, ValidityException, IOException;
    public Document build(InputStream in) 
      throws ParsingException, ValidityException, IOException;
    public Document build(InputStream in, String baseURI) 
      throws ParsingException, ValidityException, IOException;
    public Document build(File in) 
      throws ParsingException, ValidityException, IOException;
    public Document build(Reader in) 
      throws ParsingException, ValidityException, IOException;
    public Document build(Reader in, String baseURI) 
      throws ParsingException, ValidityException, IOException;
    public Document build(String document, String baseURI) 
      throws ParsingException, ValidityException, IOException;
      
    public NodeFactory getNodeFactory();
    
}

Example: Schema Validating

try {      
  XMLReader xerces = XMLReaderFactory.createXMLReader(
   "org.apache.xerces.parsers.SAXParser"); 
  xerces.setFeature(
   "http://apache.org/xml/features/validation/schema",
    true);                         
  Builder parser = new Builder(xerces, true);
  parser.build(url);
  System.out.println(url + " is schema valid.");
}
catch (SAXException ex) {
  System.out.println("Could not load Xerces.");
}
catch (ParsingException ex) {
  System.out.println(url + " is not schema valid.");
  System.out.println(ex.getMessage());
}
catch (IOException ex) { 
  System.out.println("Due to an IOException, Xerces could not check " 
  + url); 
}

Serializer

public class Serializer {

    public Serializer(OutputStream out);
    public Serializer(OutputStream out, String encoding);
 
    public int     getIndent();
    public void    setIndent(int indent);
    public String  getLineSeparator();
    public void    setLineSeparator(String lineSeparator);
    public int     getMaxLength();
    public void    setMaxLength(int length);
    public boolean getPreserveBaseURI();
    public void    setPreserveBaseURI(boolean preserve);
    public boolean getNormalizationFormC();
    public void    setNormalizationFormC(boolean normalize);

    public void    write(Document doc) throws IOException;
    public void    flush() throws IOException;

}

Example: Pretty Printing

import java.io.IOException;
import nu.xom.*;

public class PrettyPrinter {

    public static void main(String[] args) {
  
        if (args.length <= 0) {
          System.out.println("Usage: java PrettyPrinter URL");
          return;
        }
        
        try {
          Builder parser = new Builder();
          Document doc = parser.build(args[0]);
          Serializer serializer = new Serializer(System.out, "ISO-8859-1");
          serializer.setIndent(4);
          serializer.setMaxLength(64);
          serializer.setPreserveBaseURI(true);
          serializer.write(doc);
          serializer.flush();
        }
        catch (ParsingException ex) {
          System.out.println(args[0] + " is not well-formed.");
          System.out.println(ex.getMessage());
        }
        catch (IOException ex) { 
          System.out.println(
           "Due to an IOException, the parser could not check " 
           + args[0]
          ); 
        }
  
    }

}

Encoding

Serializer supports all encodings available in the VM
Understands:
- UTF-8, UTF-16, UTF-32
- ISO-8859-1 through ISO-8859-15
- TIS-620
- US-ASCII
- GB18030
- EBCDIC-37
Modular design makes it fairly easy to add more by contribution
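
Supporting many encodings means deciding, character by character, whether the target encoding can represent a character directly or whether a numeric character reference is needed. A minimal sketch of that idea using the JDK's java.nio.charset classes follows. EncodingEscaper is a hypothetical name, and this is not how XOM's Serializer is actually implemented.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical sketch, not XOM source: escapes element text for a
// particular output encoding.
final class EncodingEscaper {

    public static String escape(String text, String encoding) {
        CharsetEncoder encoder = Charset.forName(encoding).newEncoder();
        StringBuilder result = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            // Work in code points so surrogate pairs stay together
            int codePoint = text.codePointAt(i);
            int length = Character.charCount(codePoint);
            String character = text.substring(i, i + length);
            if (codePoint == '&') {
                result.append("&amp;");
            }
            else if (codePoint == '<') {
                result.append("&lt;");
            }
            else if (encoder.canEncode(character)) {
                result.append(character);
            }
            else {
                // Not representable in this encoding: use a hexadecimal
                // character reference instead
                result.append("&#x")
                      .append(Integer.toHexString(codePoint).toUpperCase())
                      .append(';');
            }
            i += length;
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Which encodings are available depends on the VM
        System.out.println(Charset.isSupported("ISO-8859-15"));
        System.out.println(escape("caf\u00E9 & cr\u00E8me", "US-ASCII"));
    }
}
```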

Connecting to other Models

SAXConverter feeds data into a SAX ContentHandler
DOMConverter does two-way conversion of DOM Document objects

The Wrong Side of 80/20

Notations
Unparsed entities
Skipped entities
DTD model
Original encoding
Standalone declaration
Version declaration

Subclassing

Classes are designed and documented for subclassing.
Subclasses cannot relax constraints
Subclasses can add constraints by overriding setter methods
Subclasses can add functionality or utility
Factories can be used to build in the subclasses during parsing
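
One way to arrange classes so that subclasses can add constraints but never relax them: keep the well-formedness check in code the subclass cannot bypass, and leave the setter overridable. A simplified sketch in plain Java; SimpleComment and ShortComment are hypothetical stand-ins, not XOM classes, and XOM's own internals may differ.

```java
// Hypothetical stand-ins for illustration; these are not XOM classes.
class SimpleComment {

    private String data;

    public SimpleComment(String data) {
        // Calling an overridable method from a constructor is safe here
        // only because the subclass below keeps no state of its own
        setValue(data);
    }

    public final String getValue() {
        return data;
    }

    // Overridable: a subclass may add checks before delegating to super
    public void setValue(String data) {
        store(data);
    }

    // Private, so no subclass can skip the well-formedness check
    private void store(String data) {
        if (data.indexOf("--") >= 0 || data.endsWith("-")) {
            throw new IllegalArgumentException(
              "Comment data cannot contain -- or end with -");
        }
        this.data = data;
    }
}

class ShortComment extends SimpleComment {

    public ShortComment(String data) {
        super(data);
    }

    // Adds a constraint; the superclass check still runs afterwards
    public void setValue(String data) {
        if (data.length() > 20) {
            throw new IllegalArgumentException("Comment too long");
        }
        super.setValue(data);
    }

    public static void main(String[] args) {
        SimpleComment comment = new ShortComment("short enough");
        System.out.println(comment.getValue());
        try {
            comment.setValue("not -- well-formed");
        }
        catch (IllegalArgumentException ex) {
            System.out.println("Rejected: " + ex.getMessage());
        }
    }
}
```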

NodeFactory

Can change classes of nodes
Can change node types
Can change node numbers
Can filter
Can process arbitrarily large documents
Can process in a stream

package nu.xom;

public class NodeFactory {

    public Element  makeRootElement(String name, String namespace);
    public Element  startMakingElement(String name, String namespace);
    public Nodes    finishMakingElement(Element element);
    
    public Document startMakingDocument();
    public void     finishMakingDocument(Document document);
    public Nodes    makeAttribute(String name, String URI, String value, Attribute.Type type);
    public Nodes    makeComment(String data);
    public Nodes    makeDocType(String rootElementName, String publicID, String systemID);
    public Nodes    makeText(String data);
    public Nodes    makeProcessingInstruction(String target, String data);
    
}

Factories

Builder uses a factory to build nodes
Default factory builds standard classes
Can change factories by passing a NodeFactory to the Builder constructor

Subclassing enables:

Extra utility methods:

public String getAttributeValue(String name, String uri, String defaultValue)

Read-only tree
Application specific classes:
- XHTMLElement
- PElement
- DivElement
- etc.

Processing Arbitrarily Large Documents

Subclass NodeFactory
Override finishMakingElement()
Process each element inside finishMakingElement()
Return an empty Nodes object if you're finished with the element and want to remove it from the tree
Return super.finishMakingElement() if you're not finished with the element

Streaming Processing of Large Documents

Goal: Print all the headlines in an RSS feed without storing the entire document in memory

import java.io.IOException;
import nu.xom.*;

public class RSSHeadlines extends NodeFactory {

    private boolean inTitle = false;
    private Nodes empty = new Nodes();

    public Element startMakingElement(String name, String namespace) {              
        if ("title".equals(name) ) {
            inTitle = true; 
            return new Element(name, namespace);
        }
        return null;            
    }

    public Nodes finishMakingElement(Element element) {
        if ("title".equals(element.getQualifiedName()) ) {
            System.out.println(element.getValue());
            inTitle = false;
        }
        return empty;
    }

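    public Nodes makeComment(String data) {
        return empty;  
    }    

    // Keep text only while inside a title element, so that text
    // elsewhere in the feed is not retained in memory
    public Nodes makeText(String data) {
        if (inTitle) return super.makeText(data);
        return empty;
    }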

    public Element makeRootElement(String name, String namespace) {
        return new Element(name, namespace); 
    }

    public Nodes makeAttribute(String name, String namespace, 
      String value, Attribute.Type type) {
        return empty;
    }

    public Nodes makeDocType(String rootElementName, 
      String publicID, String systemID) {
        return empty;    
    }

    public Nodes makeProcessingInstruction(
      String target, String data) {
        return empty; 
    }    
    
    public static void main(String[] args) {
  
        String url = "http://www.bbc.co.uk/syndication/feeds/news/ukfs_news/world/rss091.xml";
        if (args.length > 0) {
          url = args[0];
        }
        
        try {
          Builder parser = new Builder(new RSSHeadlines());
          parser.build(url);
        }
        catch (ParsingException ex) {
          System.out.println(url + " is not well-formed.");
          System.out.println(ex.getMessage());
        }
        catch (IOException ex) { 
          System.out.println(
           "Due to an IOException, the parser could not read " + url
          ); 
        }
  
    }

}

Output of RSS Headlines

% java -classpath ~/XOM/build/classes:. RSSHeadlines
BBC News | World | UK Edition
BBC News
Qurei calls for action on barrier
US reveals 'al-Qaeda Iraq plot'
Sudan's western rebels 'crushed'
Russia seeks missing politician
Greek militants' trial begins
Pakistan warned on nuclear trade
US crackdown on 'Karadzic aides'
Battle over Twin Towers payout
Kerry looks good heading south
Moscow mourns train blast victims
Football: Mboma to quit
Tennis: Rusedski hearing due
S Africa election date announced
Kenyan judge faces graft tribunal
Haitian government warns of coup
US killer gets stay of execution
Food aid to North Korea dries up
Farmers decry US-Australia pact
Dutch MPs split over asylum bill
Tourists hurt in Cairo stabbing
Israeli court hears barrier case
Prince meets Iran quake victims
...

Performance

"Premature optimization is the root of all evil" -- Donald Knuth, 1974
Pretty damn good
Fast enough
Certainly one of the most memory-efficient tree-based APIs; possibly the most memory-efficient

Candidates for Optimization

Replace ArrayLists with direct arrays
Cache index when traversing
Remove casts?

How does XOM differ from JDOM?

Absolutely correct; no malformedness
Fewer "convenience" methods and classes
toXML()
JDOM Elements contain a list; a XOM Element is a list; thus
Typed navigation via loops instead of the Java Collections API, Lists, and Iterators
No support for skipped entities
XOM classes do not implement Serializable or Cloneable

In JDOM's Favor

XPath Support

In XOM's Favor

Streaming
Canonical XML Support
XInclude Support

XOM is simpler!

Number of public methods (and constructors) in:

                       DOM2   JDOM b10   XOM 1.0d25
Node                     25      6 *         11
Attribute                 5     28           21
Element                  16     78           37
ProcessingInstruction     3     16            9
Comment                   0      7            9
Builder                 N/A     33 **        16
Document                 17     43           13

* Content

** SAXBuilder

Future Directions

DTD API
XPath
Catalog support
XML Encryption
XML Digital Signatures

Props

Joshua Bloch for Effective Java
Ken Arnold for Perfection and Simplicity
Bruce Eckel for Does Java need Checked Exceptions?
Bertrand Meyer for Object Oriented Software Construction
Jason Hunter and Brett McLaughlin for JDOM
Kent Beck and Erich Gamma for JUnit
The members of the xom-interest mailing list for numerous helpful suggestions and critiques

To Learn More

XOM Site: http://www.cafeconleche.org/XOM/
XOM-interest mailing list: http://lists.ibiblio.org/mailman/listinfo/xom-interest
Getting Started with XOM by Michael Fitzgerald, http://www.xml.com/pub/a/2002/11/27/xom.html
XML Made Simpler by Rogers Cadenhead, Linux Magazine, March 2003, http://www.linux-mag.com/2003-03/java_xom_01.html
The nu.xom.samples package has simple examples of many XOM features.
This presentation: http://cafeconleche.org/slides/sd2004west/xom/
