What's wrong with existing APIs
Design Principles
XOM Basics
Cool Stuff!
XML was, as has been fretted over before, ugly, hard, and boring to code with. Not any more :). XOM rocks! I'm using it in all my projects now.
Keep it up!
--Patrick Collison
I did some XML Programming during the last month with Java's DOM. this was not funny !! I also played with Ruby's powerful REXML. this is a great API because it uses the power of Ruby and it was designed for Ruby and is not a generic interface like DOM. this is why REXML is so popular in the Ruby world.
and this is why I like XOM. for me it fits much better to Java than DOM. I hope that XOM will become for Java what REXML is for Ruby now.
--Markus Jais
Overall, I found XOM to be an amazingly well-organized, intuitive API that's easy to learn and to use. I like how it reinforces good practices and provides insight about XML -- such as the lack of whitespace when XML is produced without a serializer and the identical treatment of text whether it consists of character entities, CDATA sections, or regular characters.
I can't compare it to JDOM, but it's appreciably more pleasant to work with than the Simple API for XML Processing.
--Rogers Cadenhead
i spent yesterday writing the code to render my application config as xml. using xom was like falling off a log. no muss, no fuss, the methods did what i expected, and any confusion was quickly ironed out by a visit to the (copious) examples, or the javadocs. i did run into what might be a bug, but it only showed up because i made a dumb cut-n-paste error (see my other email).
after i get the output tidied up, i'll move on to reading it back in. i'm confident that that will be almost as easy...
--Dirk Bergstrom
I just started to use XOM in my beanshell scripts and have found it intuitive and very simple to use. It produces code that is very clear at a higher level of abstraction than I usually am forced to work.
--Gary Furash
1.0: Stable API
1.0.1: A few bug fixes
1.1: New features: XPath, Exclusive XML Canonicalization, etc.
Event Based Push: SAX, XNI
Event Based Pull: XMLPULL, CyberNeko, StAX
Tree: DOM, JDOM, dom4j, Sparta, etc.
Data Binding: Castor, Zeus, JAXB, JaxMe, etc.
Read-only
Fast
Streamable
Memory efficient
Complete
Essentially correct
Client programs can get quite complex and confusing
Read-only
Fast
Streamable
Memory efficient
Client programs can be much simpler than SAX
Map XML documents to Java classes
Read/Write
Allow in-memory manipulation
Hide the XML details
Common assumptions:
Documents have schemas
Documents are valid.
Structures are fairly flat and definitely not recursive.
Narrative documents aren't worth considering.
Mixed content doesn't exist.
Choices don't exist.
Order doesn't matter.
Sees the world through object-colored glasses
Model an XML document using classes that represent nodes
Composition builds a tree
Read/Write
Allow in-memory manipulation
The simplest arbitrary XML API
Tend to be profligate with memory
Uses factories and interfaces, and yet not interoperable
Fails to enforce all XML constraints; allows creation of malformed documents
Namespaces properties and attributes
Live lists
Just plain ugly; does not adhere to Java conventions
No method overloading
Short type constants for node types
Methods in the Node superinterface that only work for one or two subinterfaces
Incomplete: no standard loading or serialization
A single exception class with short type codes
Does not guarantee Java features like equals(), hashCode(), and toString()
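The "live lists" complaint above can be illustrated with the DOM parser bundled in the JDK. This is a minimal sketch, not taken from the talk: the class name LiveListDemo and the four-element sample document are assumptions made for illustration. It tries to delete every <a> element while walking a live NodeList by index.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses only the JDK's DOM implementation.
class LiveListDemo {

    // Tries to remove every <a> element while iterating over the live
    // NodeList by index. Returns how many <a> elements are still attached.
    static int removeWhileIterating() {
        try {
            String xml = "<r><a/><a/><a/><a/></r>";
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();

            NodeList list = root.getElementsByTagName("a"); // live view
            for (int i = 0; i < list.getLength(); i++) {
                Node victim = list.item(i);
                victim.getParentNode().removeChild(victim);
                // The list has already shrunk, so after i++ the loop
                // skips the element that slid into the vacated slot.
            }
            return root.getElementsByTagName("a").getLength();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println("Elements left: " + removeWhileIterating());
    }
}
```

With the JDK's default DOM implementation, two of the four elements survive the loop, because the list shrinks underneath the index on every removal.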
Had to be backwards compatible with unplanned object models in third generation web browsers.
Designed by a committee trying to reconcile differences between the object models implemented by Netscape, Microsoft, and other vendors.
A cross-language API defined in IDL
Needed to support weak scripting languages like JavaScript and AppleScript
Must work for both HTML and XML.
A node supertype is very useful
Interfaces are a bad idea
Successful APIs must be simple
Simplest of the existing APIs (but it could be simpler)
There's more than one way to do it:
3+ ways to read an attribute value
5+ ways to read a child element
Not always well-formed:
Processing instruction data
Text content
Internal DTD subset
Setter methods don't return void
Too weakly typed: Everything is an Object
Too strongly typed: nothing is a Node
Cloneable
Serializable
Many checked exceptions
Classes and constructors are good
Thread safety is not necessary
Live lists are trouble
Keep everything in one package
Don't release too early
Don't optimize until the API is right
You don't need to build your own parser, transformer, or search engine
You can fight the W3C
Forked from JDOM
More complex
Uses interfaces instead of classes
A complete streaming tree model for XML 1.0 instance documents
Free as in speech (LGPL)
Pure Java
Java 1.2 and later (internal dependence on Collections API)
Easy to use
Easy to learn
Fast enough
Small enough
No gotchas
Principle of Least Surprise
As simple as it can be and no simpler!
Use Java idioms where they fit (and only where they fit)
There's exactly one way to do it
Start small and grow as necessary:
It's easier to put something in than take something out.
if I may make one point that highly influenced the end-game when we were finishing up XML 1.0 in 1998: if you leave something out, you can always put it in later. The reverse is not true.
--Tim Bray on the public-qt-comments mailing list
During the design I added methods that were necessary to produce certain sample programs.
APIs are written by experts for non-experts
It is the class's responsibility to enforce its class invariants
Verify preconditions
Do not allow clients to do bad things.
Hide as much of the implementation as possible.
Design for subclassing or prohibit it
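A minimal sketch of these points, assuming a hypothetical CheckedComment class (it is not XOM's Comment class and checks only two of the XML comment rules): every path that changes state runs the same precondition check, and the class is final so that no subclass can bypass it.

```java
// Hypothetical stand-in, not a XOM class. It enforces only two
// well-formedness rules for comments: no "--" inside the data and
// no trailing "-". Illegal XML characters are not checked here.
final class CheckedComment {

    private String data;

    CheckedComment(String data) {
        setValue(data);
    }

    // Every change to the state goes through this one check, so the
    // object can always be written out as a well-formed comment.
    void setValue(String data) {
        if (data == null) {
            throw new NullPointerException("Comment data must be non-null.");
        }
        if (data.indexOf("--") >= 0) {
            throw new IllegalArgumentException(
                "Comment data must not contain a double hyphen.");
        }
        if (data.endsWith("-")) {
            throw new IllegalArgumentException(
                "Comment data must not end with a hyphen.");
        }
        this.data = data;
    }

    String getValue() {
        return data;
    }

    String toXML() {
        return "<!--" + data + "-->";
    }

    public static void main(String[] args) {
        System.out.println(new CheckedComment(" fine ").toXML());
        try {
            new CheckedComment("not -- fine");
        } catch (IllegalArgumentException ex) {
            System.out.println("Rejected: " + ex.getMessage());
        }
    }
}
```

The failures are runtime exceptions because a client that passes bad data has a bug that testing can detect.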
All objects can be written as well-formed XML text
Impossible to create malformed documents
Validity can be enforced by subclasses
Syntax sugar is not represented:
CDATA sections
Character and entity references
Attribute order
Defaulted vs. specified attributes
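Why CDATA sections and character references count as syntax sugar can be sketched with the parser bundled in the JDK. This example uses the JDK's DOM API rather than XOM, and the class name SyntaxSugarDemo is an assumption made for illustration. Three different spellings of the same content produce the same text value.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses only the JDK's DOM implementation.
class SyntaxSugarDemo {

    // Parses the document and returns the text content of its root element.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        String entity  = rootText("<t>1 &lt; 2</t>");           // entity reference
        String charRef = rootText("<t>1 &#x3C; 2</t>");         // character reference
        String cdata   = rootText("<t><![CDATA[1 < 2]]></t>");  // CDATA section

        // All three documents carry the same characters
        System.out.println(entity.equals(charRef) && charRef.equals(cdata));
    }
}
```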
Not thread safe
Classes do not implement Serializable; use XML.
Classes do not implement Cloneable; use copy constructors.
Lack of generics really hurts in the Collections API. Hence, don't use it.
Problems detectable in testing throw runtime exceptions
Assertions that can be turned off are pointless
This is a cathedral, not a bazaar
Unit testing
Massive samples
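The "use copy constructors" idiom mentioned above can be sketched as follows. Item is a hypothetical stand-in for a tree node, not a XOM class, and the sketch uses generics for brevity even though the talk targets older Java versions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a tree node; not a XOM class.
class Item {

    private String name;
    private final List<Item> children = new ArrayList<Item>();

    Item(String name) {
        setName(name);
    }

    // Copy constructor: makes a deep copy of the original and all of
    // its descendants. Nothing is shared with the original tree.
    Item(Item original) {
        this.name = original.name;
        for (Item child : original.children) {
            children.add(new Item(child));
        }
    }

    final void setName(String name) {
        if (name == null) {
            throw new NullPointerException("Name must be non-null.");
        }
        this.name = name;
    }

    String getName() {
        return name;
    }

    void appendChild(Item child) {
        if (child == null) {
            throw new NullPointerException("Child must be non-null.");
        }
        children.add(child);
    }

    int getChildCount() {
        return children.size();
    }

    Item getChild(int position) {
        return children.get(position);
    }

    public static void main(String[] args) {
        Item original = new Item("para");
        original.appendChild(new Item("emphasis"));

        Item copy = new Item(original);
        original.getChild(0).setName("strong"); // changes the original only

        System.out.println(copy.getChild(0).getName()); // still "emphasis"
    }
}
```

Unlike clone(), a copy constructor is an ordinary constructor: it runs the normal initialization code and states in its signature exactly what is being copied.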
import java.math.BigInteger;
import nu.xom.Element;
import nu.xom.Document;

public class FibonacciXML {

    public static void main(String[] args) {
        BigInteger low  = BigInteger.ONE;
        BigInteger high = BigInteger.ONE;
        Element root = new Element("Fibonacci_Numbers");
        for (int i = 1; i <= 10; i++) {
            Element fibonacci = new Element("fibonacci");
            fibonacci.appendChild(low.toString());
            root.appendChild(fibonacci);
            BigInteger temp = high;
            high = high.add(low);
            low = temp;
        }
        Document doc = new Document(root);
        System.out.println(doc.toXML());
    }
}
% java -classpath ~/XOM/build/classes:. FibonacciXML
<?xml version="1.0"?>
<Fibonacci_Numbers><fibonacci>1</fibonacci><fibonacci>1</fibonacci><fibonacci>2</fibonacci><fibonacci>3</fibonacci><fibonacci>5</fibonacci><fibonacci>8</fibonacci><fibonacci>13</fibonacci><fibonacci>21</fibonacci><fibonacci>34</fibonacci><fibonacci>55</fibonacci></Fibonacci_Numbers>
try {
    Builder parser = new Builder();
    Document doc = parser.build(url);
    System.out.println(doc.toXML());
}
catch (ParsingException ex) {
    System.out.println(url + " is not well-formed.");
    System.out.println(ex.getMessage());
}
catch (IOException ex) {
    System.out.println("Due to an IOException, "
      + "the parser could not check " + url);
}
public abstract class Node {
public String getValue();
public final Document getDocument();
public String getBaseURI();
public final ParentNode getParent();
public Node getChild(int position);
public int getChildCount();
public final void detach();
public Node copy();
public String toXML();
public final boolean equals(Object o);
public final int hashCode();
}
getValue() returns the XPath string value of a node
toXML() returns a String containing the XML form of the node
import java.io.*;
import nu.xom.*;

public class PropertyPrinter {

    private Writer out;

    public PropertyPrinter(Writer out) {
        if (out == null) {
            throw new NullPointerException("Writer must be non-null.");
        }
        this.out = out;
    }

    public PropertyPrinter() {
        this(new OutputStreamWriter(System.out));
    }

    private int nodeCount = 0;

    public void writeNode(Node node) throws IOException {
        if (node == null) {
            throw new NullPointerException("Node must be non-null.");
        }
        if (node instanceof Document) {
            // starting a new document, reset the node count
            nodeCount = 1;
        }
        String type = node.getClass().getName(); // never null
        String value = node.getValue();
        String name = null;
        String localName = null;
        String uri = null;
        String prefix = null;
        if (node instanceof Element) {
            Element element = (Element) node;
            name = element.getQualifiedName();
            localName = element.getLocalName();
            uri = element.getNamespaceURI();
            prefix = element.getNamespacePrefix();
        }
        else if (node instanceof Attribute) {
            // cast to Attribute here; casting to Element would throw
            // a ClassCastException
            Attribute attribute = (Attribute) node;
            name = attribute.getQualifiedName();
            localName = attribute.getLocalName();
            uri = attribute.getNamespaceURI();
            prefix = attribute.getNamespacePrefix();
        }
        StringBuffer result = new StringBuffer();
        result.append("Node " + nodeCount + ":\r\n");
        result.append(" Type: " + type + "\r\n");
        if (name != null) {
            result.append(" Name: " + name + "\r\n");
        }
        if (localName != null) {
            result.append(" Local Name: " + localName + "\r\n");
        }
        if (prefix != null) {
            result.append(" Prefix: " + prefix + "\r\n");
        }
        if (uri != null) {
            result.append(" Namespace URI: " + uri + "\r\n");
        }
        if (value != null) {
            result.append(" Value: " + value + "\r\n");
        }
        out.write(result.toString());
        out.write("\r\n");
        out.flush();
        nodeCount++;
    }

    public static void main(String[] args) throws Exception {
        Builder builder = new Builder();
        for (int i = 0; i < args.length; i++) {
            PropertyPrinter p = new PropertyPrinter();
            File f = new File(args[i]);
            Document doc = builder.build(f);
            p.writeNode(doc);
        }
    }
}
% java -classpath ~/XOM/build/classes:. PropertyPrinter hotcop.xml
Node 1:
 Type: nu.xom.Document
 Value: Hot Cop Jacques Morali Henri Belolo Victor Willis Jacques Morali A & M Records 6:20 1978 Village People
Recursive, pre-order traversal
getFirstChild()
Indexed navigation is the key
No iterators; no siblings
import java.io.IOException;
import nu.xom.*;

public class TreeReporter {

    public static void main(String[] args) {
        if (args.length <= 0) {
            System.out.println("Usage: java TreeReporter URL");
            return;
        }
        TreeReporter iterator = new TreeReporter();
        try {
            Builder parser = new Builder();
            // Read the entire document into memory
            Node document = parser.build(args[0]);
            // Process it starting at the root
            iterator.followNode(document);
        }
        catch (IOException ex) {
            System.out.println(ex);
        }
        catch (ParsingException ex) {
            System.out.println(ex);
        }
    } // end main

    private PropertyPrinter printer = new PropertyPrinter();

    // note use of recursion
    public void followNode(Node node) throws IOException {
        printer.writeNode(node);
        for (int i = 0; i < node.getChildCount(); i++) {
            followNode(node.getChild(i));
        }
    }
}
% java -classpath ~/XOM/build/classes:. TreeReporter hotcop.xml
Node 1:
 Type: nu.xom.Document
 Value: Hot Cop Jacques Morali Henri Belolo Victor Willis Jacques Morali A & M Records 6:20 1978 Village People

Node 2:
 Type: nu.xom.ProcessingInstruction
 Value: type="text/css" href="song.css"

Node 3:
 Type: nu.xom.DocType
 Value:

Node 4:
 Type: nu.xom.Element
 Name: SONG
 Local Name: SONG
 Prefix:
 Namespace URI: http://metalab.unc.edu/xml/namespace/song
 Value: Hot Cop Jacques Morali Henri Belolo Victor Willis Jacques Morali A & M Records 6:20 1978 Village People
...
Subclass of ParentNode
Document children are:
Comments
Processing Instructions
Zero or one DocType
One Root Element
package nu.xom;
public class Document extends ParentNode {
public Document(Element root);
public Document(Document doc);
public final DocType getDocType() ;
public final Element getRootElement();
public void setRootElement(Element root);
public void setBaseURI(String URI);
public final String getBaseURI();
public void insertChild(int position, Node c);
public void removeChild(int position);
public void removeChild(Node child);
public final String getValue() ;
public final String toXML();
public Node copy();
}
The document must validate against one of the three DTDs found in Appendix A.
The root element of the document must be <html>.
The root element of the document must designate the XHTML namespace using the xmlns attribute [XMLNAMES]. The namespace for XHTML is defined to be http://www.w3.org/1999/xhtml.
There must be a DOCTYPE declaration in the document prior to the root element. The public identifier included in the DOCTYPE declaration must reference one of the three DTDs found in Appendix A using the respective Formal Public Identifier. The system identifier may be changed to reflect local system conventions.
boolean valid = true;
DocType doctype = document.getDocType();
if (doctype == null) {
    valid = false;
}
else {
    // check doctype
}
Element root = document.getRootElement();
String uri = root.getNamespaceURI();
String prefix = root.getNamespacePrefix();
if (!uri.equals("http://www.w3.org/1999/xhtml")) {
    valid = false;
}
if (!prefix.equals("")) valid = false;
Largest class in XOM
Subclass of ParentNode
Every Element has:
Local name
Namespace prefix (which can be the empty string)
Namespace URI (which can be the empty string)
A collection of Attributes
A collection of additional namespaces
A list of children
A ParentNode (which may be null)
An owner Document (which may be null)
public Element(String name);
public Element(String name, String uri);
public Element(Element element);
Element para = new Element("para");
Element p = new Element("p", "http://www.w3.org/1999/xhtml");
Element text = new Element("svg:text", "http://www.w3.org/TR/2000/svg");
Getters:
public final String getLocalName();
public final String getQualifiedName();
public final String getNamespacePrefix();
public final String getNamespaceURI();
public final String getNamespaceURI(String prefix);
Setters:
public void setLocalName(String localName);
public void setNamespaceURI(String URI);
public void setNamespacePrefix(String prefix);
public final Elements getChildElements(String name);
public final Elements getChildElements(String localName, String namespace);
public final Element getFirstChildElement(String name);
public final Element getFirstChildElement(String localName, String namespace);
A read-only list containing only Element objects
public final class Elements {
public int size();
public Element get(int index);
}
public void process(Element element) {
Elements children = element.getChildElements();
for (int i = 0; i < children.size(); i++) {
process(children.get(i));
}
}
import javax.swing.*;
import javax.swing.tree.*;
import nu.xom.*;

public class TreeViewer {

    // Initialize the per-element data structures
    public static MutableTreeNode processElement(Element element) {
        String data;
        if (element.getNamespaceURI().equals("")) data = element.getLocalName();
        else {
            data = '{' + element.getNamespaceURI() + "} " + element.getQualifiedName();
        }
        MutableTreeNode node = new DefaultMutableTreeNode(data);
        Elements children = element.getChildElements();
        for (int i = 0; i < children.size(); i++) {
            node.insert(processElement(children.get(i)), i);
        }
        return node;
    }

    public static void display(Document doc) {
        Element root = doc.getRootElement();
        JTree tree = new JTree(processElement(root));
        JScrollPane treeView = new JScrollPane(tree);
        JFrame f = new JFrame("XML Tree");
        String version = System.getProperty("java.version");
        if (version.startsWith("1.2") || version.startsWith("1.1")) {
            f.setDefaultCloseOperation(JFrame.HIDE_ON_CLOSE);
        }
        else {
            // JFrame.EXIT_ON_CLOSE == 3 but this named constant is not
            // available in Java 1.2
            f.setDefaultCloseOperation(3);
        }
        f.getContentPane().add(treeView);
        f.pack();
        f.show();
    }

    public static void main(String[] args) {
        try {
            Builder builder = new Builder();
            for (int i = 0; i < args.length; i++) {
                Document doc = builder.build(args[i]);
                display(doc);
            }
        }
        catch (Exception ex) {
            System.err.println(ex);
        }
    } // end main()

} // end TreeViewer
public void addAttribute(Attribute attribute);
public void removeAttribute(Attribute attribute);
public final Attribute getAttribute(String name);
public final Attribute getAttribute(String localName, String namespaceURI);
public final String getAttributeValue(String name);
public final String getAttributeValue(String localName, String namespaceURI);
public final int getAttributeCount();
public final Attribute getAttribute(int i);
import java.io.IOException;
import nu.xom.*;

public class IDTagger {

    private static int id = 1;

    public static void processElement(Element element) {
        if (element.getAttribute("ID") == null) {
            element.addAttribute(new Attribute("ID", "_" + id));
            id = id + 1;
        }
        // recursion
        Elements children = element.getChildElements();
        for (int i = 0; i < children.size(); i++) {
            processElement(children.get(i));
        }
    }

    public static void main(String[] args) {
        Builder builder = new Builder();
        for (int i = 0; i < args.length; i++) {
            try {
                // Read the entire document into memory
                Document document = builder.build(args[i]);
                processElement(document.getRootElement());
                System.out.println(document.toXML());
            }
            catch (ParsingException ex) {
                System.err.println(ex);
                continue;
            }
            catch (IOException ex) {
                System.err.println(ex);
                continue;
            }
        }
    } // end main
}
Only for namespace prefixes used in attribute values and element content (e.g. XSLT and W3C Schemas)
Never used when an element or attribute in scope already has the prefix
public void addNamespaceDeclaration(String prefix, String URI);
public void removeNamespaceDeclaration(String prefix);
Don't normally need to do this; most of the time the namespace of any given element or attribute is sufficient
These methods allow you to list all the namespaces declared on any given element:
public final int getNamespaceDeclarationCount();
public final String getNamespacePrefix(int index);
public final String getNamespaceURI(String prefix);
Represents character data in element content
By default, the Builder places the maximum possible contiguous amount of text in each node.
CDATA sections are silently preserved from build to serialization when possible
package nu.xom;
public class Text extends Node {
public Text(String data);
public Text(Text text);
public void setValue(String data);
public final String getValue();
public final Node getChild(int i);
public final int getChildCount();
public final String toString();
public Node copy();
public final String toXML();
}
import java.io.IOException;
import nu.xom.*;

public class ROT13XML {

    // note use of recursion
    public static void encode(Node node) {
        if (node instanceof Text) {
            Text text = (Text) node;
            String data = text.getValue();
            text.setValue(rot13(data));
        }
        // recurse the children
        for (int i = 0; i < node.getChildCount(); i++) {
            encode(node.getChild(i));
        }
    }

    public static String rot13(String s) {
        StringBuffer out = new StringBuffer(s.length());
        for (int i = 0; i < s.length(); i++) {
            int c = s.charAt(i);
            if (c >= 'A' && c <= 'M') out.append((char) (c+13));
            else if (c >= 'N' && c <= 'Z') out.append((char) (c-13));
            else if (c >= 'a' && c <= 'm') out.append((char) (c+13));
            else if (c >= 'n' && c <= 'z') out.append((char) (c-13));
            else out.append((char) c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        if (args.length <= 0) {
            System.out.println("Usage: java ROT13XML URL");
            return;
        }
        String url = args[0];
        try {
            Builder parser = new Builder();
            // Read the document
            Document document = parser.build(url);
            // Modify the document
            ROT13XML.encode(document);
            // Write it out again
            System.out.println(document.toXML());
        }
        catch (IOException ex) {
            System.out.println(
              "Due to an IOException, the parser could not encode " + url
            );
        }
        catch (ParsingException ex) {
            System.out.println(ex);
        }
    } // end main
}
% java -classpath ~/XOM/build/classes:. ROT13XML hotcop.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="song.css"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<SONG xmlns="http://metalab.unc.edu/xml/namespace/song" xmlns:xlink="http://www.w3.org/1999/xlink">
  <TITLE>Ubg Pbc</TITLE>
  <PHOTO xlink:type="simple" xlink:show="onLoad" xlink:href="hotcop.jpg" ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200" />
  <COMPOSER>Wnpdhrf Zbenyv</COMPOSER>
  <COMPOSER>Uraev Orybyb</COMPOSER>
  <COMPOSER>Ivpgbe Jvyyvf</COMPOSER>
  <PRODUCER>Wnpdhrf Zbenyv</PRODUCER>
  <!-- The publisher is actually Polygram but I needed an example of a general entity reference. -->
  <PUBLISHER xlink:type="simple" xlink:href="http://www.amrecords.com/">
    N &amp; Z Erpbeqf
  </PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Ivyyntr Crbcyr</ARTIST>
</SONG>
<!-- You can tell what album I was listening to when I wrote this example -->
Subclasses Node
Each Attribute has:
Local name
Namespace prefix (which can be the empty string)
Namespace URI (which can be the empty string)
A type
A value
A parent Element (which may be null)
An owner Document (which may be null)
public Attribute(String localName, String value);
public Attribute(String localName, String value, Type type);
public Attribute(String name, String URI, String value, Type type);
public Attribute(Attribute attribute);
public final Type getType();
public void setType(Type type);
public final String getValue();
public void setValue(String value);
public final String getLocalName();
public void setLocalName(String localName);
public final String getQualifiedName();
public final String getNamespaceURI();
public final String getNamespacePrefix();
public void setNamespace(String prefix, String URI);
import java.net.*;
import java.util.*;
import nu.xom.*;

public class XLinkSpider {

    private Set spidered = new HashSet();
    private Builder parser = new Builder();
    private List queue = new LinkedList();

    public static final String XLINK_NS = "http://www.w3.org/1999/xlink";
    public static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    public void search(URL url) {
        try {
            String systemID = url.toExternalForm();
            Document doc = parser.build(systemID);
            System.out.println(url);
            search(doc.getRootElement(), url);
        }
        catch (Exception ex) {
            // just skip this document
        }
        if (queue.isEmpty()) return;
        URL discovered = (URL) queue.remove(0);
        spidered.add(discovered);
        search(discovered);
    }

    private void search(Element element, URL base) {
        Attribute href = element.getAttribute("href", XLINK_NS);
        Attribute xmlbase = element.getAttribute("base", XML_NS);
        try {
            if (xmlbase != null) {
                base = new URL(base, xmlbase.getValue());
            }
        }
        catch (MalformedURLException ex) {
            // Probably just no protocol handler for the
            // kind of URLs used inside this element
            return;
        }
        if (href != null) {
            String uri = href.getValue();
            // absolutize URL
            try {
                URL discovered = new URL(base, uri);
                // strip ref field if any
                discovered = new URL(
                  discovered.getProtocol(),
                  discovered.getHost(),
                  discovered.getFile()
                );
                if (!spidered.contains(discovered)
                  && !queue.contains(discovered)) {
                    queue.add(discovered);
                }
            }
            catch (MalformedURLException ex) {
                // skip this one
            }
        }
        Elements children = element.getChildElements();
        for (int i = 0; i < children.size(); i++) {
            search(children.get(i), base);
        }
    }

    public static void main(String[] args) {
        XLinkSpider spider = new XLinkSpider();
        for (int i = 0; i < args.length; i++) {
            try {
                spider.search(new URL(args[i]));
            }
            catch (MalformedURLException ex) {
                System.err.println(ex);
            }
        }
    } // end main()
}
% java -classpath ~/XOM/build/classes:. XLinkSpider http://www.rddl.org
http://www.rddl.org
http://www.rddl.org/purposes
http://www.rddl.org/rddl.rdfs
http://www.rddl.org/rddl-integration.rxg
http://www.rddl.org/modules/rddl-1.rxm
http://www.rddl.org/modules/xhtml-attribs-1.rxm
http://www.rddl.org/modules/xhtml-base-1.rxm
http://www.rddl.org/modules/xhtml-basic-form-1.rxm
http://www.rddl.org/modules/xhtml-basic-table-1.rxm
http://www.rddl.org/modules/xhtml-basic10-model-1.rxm
http://www.rddl.org/modules/xhtml-basic10.rxm
http://www.rddl.org/modules/xhtml-blkphras-1.rxm
http://www.rddl.org/modules/xhtml-blkstruct-1.rxm
http://www.rddl.org/modules/xhtml-for-rddl.rxm
http://www.rddl.org/modules/xhtml-framework-1.rxm
http://www.rddl.org/modules/xhtml-hypertext-1.rxm
http://www.rddl.org/modules/xhtml-image-1.rxm
http://www.rddl.org/modules/xhtml-inlphras-1.rxm
http://www.rddl.org/modules/xhtml-inlstruct-1.rxm
http://www.rddl.org/modules/xhtml-link-1.rxm
http://www.rddl.org/modules/xhtml-list-1.rxm
http://www.rddl.org/modules/xhtml-meta-1.rxm
...
http://www.w3.org/TR/xhtml-basic
http://www.w3.org/TR/xml-infoset/
http://www.w3.org/tr/xhtml1
http://www.w3.org/TR/xhtml-modularization/
http://www.rddl.org/purposes/software
http://www.ascc.net/xml/schematron
http://www.w3.org/2001/XMLSchema
http://www.examplotron.org
...
Inner class that uses the type-safe enum pattern for the 10 DTD types, plus UNDECLARED:
Attribute.Type.CDATA
Attribute.Type.ID
Attribute.Type.IDREF
Attribute.Type.IDREFS
Attribute.Type.NMTOKEN
Attribute.Type.NMTOKENS
Attribute.Type.NOTATION
Attribute.Type.ENTITY
Attribute.Type.ENTITIES
Attribute.Type.ENUMERATION
Attribute.Type.UNDECLARED
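The type-safe enum pattern named above predates Java's enum keyword. A minimal sketch of the general pattern follows, assuming a hypothetical AttributeKind class with only three of the constants; it illustrates the idiom and is not XOM's actual source.

```java
// Hypothetical illustration of the type-safe enum pattern;
// not the source of XOM's Attribute.Type.
final class AttributeKind {

    // The only instances that will ever exist
    static final AttributeKind CDATA      = new AttributeKind("CDATA");
    static final AttributeKind ID         = new AttributeKind("ID");
    static final AttributeKind UNDECLARED = new AttributeKind("UNDECLARED");

    private final String name;

    // Private constructor: clients cannot create new kinds
    private AttributeKind(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    public String toString() {
        return "[AttributeKind: " + name + "]";
    }

    // equals() and hashCode() are inherited from Object. Identity
    // comparison is enough because the set of instances is fixed.

    public static void main(String[] args) {
        AttributeKind kind = AttributeKind.ID;
        if (kind == AttributeKind.ID) {
            System.out.println(kind + " is an ID attribute");
        }
    }
}
```

Compared with int or short constants, the compiler rejects any argument that is not one of the declared instances, so a method taking an AttributeKind cannot be handed an arbitrary number.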
ProcessingInstruction extends Node
Each ProcessingInstruction has:
Target, a string
Data, a string
plus the usual properties of any Node
Pseudo-attributes are not specifically supported
package nu.xom;
public class ProcessingInstruction extends Node {
public ProcessingInstruction(String target, String data);
public ProcessingInstruction(ProcessingInstruction instruction);
public final String getTarget();
public void setTarget(String target);
protected void checkTarget(String target);
public final String getValue();
public void setValue(String data);
protected void checkValue(String data);
public final Node getChild(int i);
public final int getChildCount();
public final Node copy();
public final String toXML();
public final String toString();
}
Robots processing instruction:
<?robots index="yes | no"
follow="yes | no" ?>
package nu.xom.samples;

import java.net.*;
import java.util.*;
import nu.xom.*;

public class PoliteSpider {

    private Set spidered = new HashSet();
    private Builder parser = new Builder();
    private List queue = new LinkedList();

    public static final String XLINK_NS = "http://www.w3.org/1999/xlink";
    public static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    public void search(URL url) {
        try {
            String systemID = url.toExternalForm();
            Document doc = parser.build(systemID);
            boolean follow = true;
            boolean index = true;
            // Look for a robots processing instruction in the prolog
            for (int i = 0; i < doc.getChildCount(); i++) {
                Node child = doc.getChild(i);
                if (child instanceof Element) break;
                if (child instanceof ProcessingInstruction) {
                    ProcessingInstruction instruction = (ProcessingInstruction) child;
                    if (instruction.getTarget().equals("robots")) {
                        Element data = PseudoAttributes.getAttributes(instruction);
                        Attribute indexAtt = data.getAttribute("index");
                        if (indexAtt != null) {
                            String value = indexAtt.getValue().trim();
                            if (value.equals("no")) index = false;
                        }
                        Attribute followAtt = data.getAttribute("follow");
                        if (followAtt != null) {
                            String value = followAtt.getValue().trim();
                            if (value.equals("no")) follow = false;
                        }
                    }
                }
            }
            if (index) System.out.println(url);
            if (follow) search(doc.getRootElement(), url);
        }
        catch (Exception ex) {
            // just skip this document
        }
        if (queue.isEmpty()) return;
        URL discovered = (URL) queue.remove(0);
        spidered.add(discovered);
        search(discovered);
    }

    private void search(Element element, URL base) {
        Attribute href = element.getAttribute("href", XLINK_NS);
        Attribute xmlbase = element.getAttribute("base", XML_NS);
        try {
            if (xmlbase != null) base = new URL(base, xmlbase.getValue());
        }
        catch (MalformedURLException ex) {
            // Java can't handle the kind of URLs used inside this element
            return;
        }
        if (href != null) {
            String uri = href.getValue();
            // absolutize URL
            try {
                URL discovered = new URL(base, uri);
                // strip ref field if any
                discovered = new URL(
                  discovered.getProtocol(),
                  discovered.getHost(),
                  discovered.getFile()
                );
                if (!spidered.contains(discovered)
                  && !queue.contains(discovered)) {
                    queue.add(discovered);
                }
            }
            catch (MalformedURLException ex) {
                // skip this one
            }
        }
        Elements children = element.getChildElements();
        for (int i = 0; i < children.size(); i++) {
            search(children.get(i), base);
        }
    }

    public static void main(String[] args) {
        // use the polite spider itself, not XLinkSpider
        PoliteSpider spider = new PoliteSpider();
        for (int i = 0; i < args.length; i++) {
            try {
                spider.search(new URL(args[i]));
            }
            catch (MalformedURLException ex) {
                System.err.println(ex);
            }
        }
    } // end main()
}
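PoliteSpider relies on a PseudoAttributes helper whose source is not shown here. Because pseudo-attributes are not specifically supported, some code has to pick them out of the processing instruction data. The following is a rough sketch of one way to do that with a regular expression. The class PseudoAttributeParser and its parse method are hypothetical, return a Map rather than an Element, accept only ASCII names, and do not resolve character or entity references inside values.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not the PseudoAttributes class used by PoliteSpider.
class PseudoAttributeParser {

    // name = "value" or name = 'value'; names are limited to ASCII here
    private static final Pattern PAIR = Pattern.compile(
        "([A-Za-z_][\\w.\\-]*)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

    // Returns the pseudo-attributes in document order. If a name is
    // repeated, the last value wins.
    static Map<String, String> parse(String data) {
        Map<String, String> result = new LinkedHashMap<String, String>();
        Matcher matcher = PAIR.matcher(data);
        while (matcher.find()) {
            // group 2 holds a double-quoted value, group 3 a single-quoted one
            String value = matcher.group(2) != null
                ? matcher.group(2) : matcher.group(3);
            result.put(matcher.group(1), value);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> atts = parse("index=\"no\" follow='yes'");
        System.out.println("index:  " + atts.get("index"));
        System.out.println("follow: " + atts.get("follow"));
    }
}
```

In a spider like the one above, the string passed to parse would come from the processing instruction's getValue() method.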
Represents the document type declaration
Not the document type definition!
Properties:
Root element name
Public ID (may be null)
System ID (may be null)
Internal DTD subset (read-only, may be null)
Limited to one per document, in the prolog only
public class DocType extends Node{
public DocType(String rootElementName, String publicID, String systemID);
public DocType(String rootElementName, String systemID);
public DocType(String rootElementName);
public DocType(DocType doctype);
public void setRootElementName(String name);
public final String getRootElementName();
public final String getInternalDTDSubset();
public void setInternalDTDSubset(String subset); // 1.1 and later
public void setPublicID(String id);
public final String getPublicID();
public void setSystemID(String id);
public final String getSystemID();
public final Node getChild(int i);
public final int getChildCount();
public final Node copy();
public final String toXML();
}
It must validate against one of the three DTDs found in Appendix A.
The root element of the document must be <html>.
The root element of the document must designate the XHTML namespace using the xmlns attribute [XMLNAMES]. The namespace for XHTML is defined to be http://www.w3.org/1999/xhtml.
There must be a DOCTYPE declaration in the document prior to the root element. The public identifier included in the DOCTYPE declaration must reference one of the three DTDs found in Appendix A using the respective Formal Public Identifier. The system identifier may be changed to reflect local system conventions.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "DTD/xhtml1-frameset.dtd">
import java.io.IOException;
import nu.xom.*;

public class XHTMLValidator {

    public static void main(String[] args) {
        for (int i = 0; i < args.length; i++) {
            validate(args[i]);
        }
    }

    private static Builder builder = new Builder(true);
                           /* turn on validation ^^^^ */

    // not thread safe
    public static void validate(String source) {
        Document document;
        try {
            document = builder.build(source);
        }
        catch (ParsingException ex) {
            System.out.println(source + " is invalid XML, and thus not XHTML.");
            return;
        }
        catch (IOException ex) {
            System.out.println("Could not read: " + source);
            return;
        }
        // If we get this far, then the document is valid XML.
        // Check to see whether the document is actually XHTML
        boolean valid = true;
        DocType doctype = document.getDocType();
        if (doctype == null) {
            System.out.println("No DOCTYPE");
            valid = false;
        }
        else {
            // verify the DOCTYPE
            String name = doctype.getRootElementName();
            String publicID = doctype.getPublicID();
            if (!name.equals("html")) {
                System.out.println(
                  "Incorrect root element name " + name);
                valid = false;
            }
            if (publicID == null
              || (!publicID.equals("-//W3C//DTD XHTML 1.0 Strict//EN")
                && !publicID.equals(
                  "-//W3C//DTD XHTML 1.0 Transitional//EN")
                && !publicID.equals(
                  "-//W3C//DTD XHTML 1.0 Frameset//EN"))) {
                valid = false;
                System.out.println(source
                  + " does not seem to use an XHTML 1.0 DTD");
            }
        }
        // Check the namespace on the root element
        Element root = document.getRootElement();
        String uri = root.getNamespaceURI();
        String prefix = root.getNamespacePrefix();
        if (!uri.equals("http://www.w3.org/1999/xhtml")) {
            valid = false;
            System.out.println(source
              + " does not properly declare the"
              + " http://www.w3.org/1999/xhtml namespace"
              + " on the root element");
        }
        if (!prefix.equals("")) {
            valid = false;
            System.out.println(source
              + " does not use the empty prefix for XHTML");
        }
        if (valid) System.out.println(source + " is valid XHTML.");
    }
}
% java -classpath ~/XOM/build/classes:. XHTMLValidator http://www.w3.org/ http://www.cafeconleche.org/
http://www.w3.org/ is valid XHTML.
http://www.cafeconleche.org/ is invalid XML, and thus not XHTML.
package nu.xom;
public class Comment extends Node {
public Comment(String data);
public Comment(Comment comment);
public final String getValue();
public void setValue(String data);
public final Node getChild(int i);
public final int getChildCount();
public final Node copy();
public final String toXML();
public final String toString();
}
import java.io.IOException;
import nu.xom.*;

public class CommentReader {

    // Recursively print every comment found below the given node
    public static void list(Node node) {
        for (int i = 0; i < node.getChildCount(); i++) {
            Node child = node.getChild(i);
            if (child instanceof Comment) {
                System.out.println(child.toXML());
            }
            else {
                list(child);
            }
        }
    }

    public static void main(String[] args) {

        if (args.length <= 0) {
            System.out.println("Usage: java CommentReader URL");
            return;
        }

        try {
            Builder parser = new Builder();
            Document doc = parser.build(args[0]);
            list(doc);
        }
        catch (ParsingException ex) {
            System.out.println(args[0] + " is not well-formed.");
            System.out.println(ex.getMessage());
        }
        catch (IOException ex) {
            System.out.println(
              "Due to an IOException, the parser could not read "
              + args[0]
            );
        }
    }
}
$ java -classpath ~/XOM/build/classes/:. CommentReader http://www.w3.org/TR/2004/REC-DOM-Level-3-Val-20040127/xml-source.xml <!-- $Id: xml-source.xml,v 1.7 2004/01/26 22:31:28 plehegar Exp $ --> <!-- ************************************************************************* * FRONT MATTER * ************************************************************************* --> <!-- ****************************************************** | filenames to be used for each section | ****************************************************** --> <!-- ****************************************************** * DOCUMENT ABSTRACT * ****************************************************** --> <!-- $Id: xml-source.xml,v 1.7 2004/01/26 22:31:28 plehegar Exp $ --> <!-- $Id: xml-source.xml,v 1.7 2004/01/26 22:31:28 plehegar Exp $ --> <!-- ************************************************************************* * BEGINNING OF COPYRIGHT NOTICE * ************************************************************************* --> <!-- ************************************************************************* * END OF COPYRIGHT NOTICE * ************************************************************************* --> <!-- $Id: xml-source.xml,v 1.7 2004/01/26 22:31:28 plehegar Exp $ --> <!-- ************************************************************************* * BEGINNING OF VALIDATION ************************************************************************* --> <!-- ****************************************************** Last known edit 12/03/2003 Suggestions welcome, especially if accompanied by proposed revisions already marked up as per spec.dtd! 
****************************************************** --> <!-- ****************************************************** | OVERVIEW | ****************************************************** --> <!-- ****************************************************** | ISSUES | ****************************************************** <div2 id="Level-3-VAL-Issue-List"> <head>Issue List</head> <div3 id="VAL-Issues-List-Resolved"> <head>Resolved Issues</head> <issue id="VAL-Issue-8" status="open"> <p>For Validation interfaces there should be no dependency on DOM Core. </p> <p>The <code>NodeEditVAL</code> interface will not extend DOM Core. It is simply an object that expresses similar interfaces.</p> </issue> </div3> -->...
package nu.xom;

public class Builder {

    public Builder();
    public Builder(boolean validate);
    public Builder(boolean validate, NodeFactory factory);
    public Builder(XMLReader parser);
    public Builder(NodeFactory factory);
    public Builder(XMLReader parser, boolean validate);
    public Builder(XMLReader parser, boolean validate, NodeFactory factory);

    public Document build(String systemID)
      throws ParsingException, ValidityException, IOException;
    public Document build(InputStream in)
      throws ParsingException, ValidityException, IOException;
    public Document build(InputStream in, String baseURI)
      throws ParsingException, ValidityException, IOException;
    public Document build(File in)
      throws ParsingException, ValidityException, IOException;
    public Document build(Reader in)
      throws ParsingException, ValidityException, IOException;
    public Document build(Reader in, String baseURI)
      throws ParsingException, ValidityException, IOException;
    public Document build(String document, String baseURI)
      throws ParsingException, ValidityException, IOException;

    public NodeFactory getNodeFactory();
}
try {
    XMLReader xerces = XMLReaderFactory.createXMLReader(
      "org.apache.xerces.parsers.SAXParser");
    xerces.setFeature(
      "http://apache.org/xml/features/validation/schema",
      true);
    Builder parser = new Builder(xerces, true);
    parser.build(url);
    System.out.println(url + " is schema valid.");
}
catch (SAXException ex) {
    System.out.println("Could not load Xerces.");
}
catch (ParsingException ex) {
    // ValidityException is a subclass of ParsingException
    System.out.println(url + " is not schema valid.");
    System.out.println(ex.getMessage());
}
catch (IOException ex) {
    System.out.println("Due to an IOException, Xerces could not check "
      + url);
}
public class Serializer {
public Serializer(OutputStream out);
public Serializer(OutputStream out, String encoding);
public int getIndent();
public void setIndent(int indent);
public String getLineSeparator();
public void setLineSeparator(String lineSeparator);
public int getMaxLength();
public void setMaxLength(int length);
public boolean getPreserveBaseURI();
public void setPreserveBaseURI(boolean preserve);
public boolean getNormalizationFormC();
public void setNormalizationFormC(boolean preserve);
public void write(Document doc) throws IOException;
public void flush() throws IOException;
}
import java.io.IOException;
import nu.xom.*;

public class PrettyPrinter {

    public static void main(String[] args) {

        if (args.length <= 0) {
            System.out.println("Usage: java PrettyPrinter URL");
            return;
        }

        try {
            Builder parser = new Builder();
            Document doc = parser.build(args[0]);
            Serializer serializer = new Serializer(System.out, "ISO-8859-1");
            serializer.setIndent(4);
            serializer.setMaxLength(64);
            serializer.setPreserveBaseURI(true);
            serializer.write(doc);
            serializer.flush();
        }
        catch (ParsingException ex) {
            System.out.println(args[0] + " is not well-formed.");
            System.out.println(ex.getMessage());
        }
        catch (IOException ex) {
            System.out.println(
              "Due to an IOException, the parser could not check "
              + args[0]
            );
        }
    }
}
Serializer supports all encodings available in the VM
Understands:
UTF-8, UTF-16, UTF-32
ISO-8859-1 through ISO-8859-15
TIS-620
US-ASCII
GB18030
EBCDIC-37
Modular design makes it fairly easy to add more by contribution
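Because the serializer depends on what the VM provides, it can help to check whether an encoding is present before asking for it. The following is a minimal sketch that uses only the JDK's java.nio.charset.Charset class. EncodingCheck and isAvailable are hypothetical names, not part of XOM, and the sketch says nothing about how Serializer itself looks up encodings.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper for illustration; not part of XOM.
class EncodingCheck {

    // Returns true if this VM can handle the named charset.
    static boolean isAvailable(String name) {
        try {
            return Charset.isSupported(name);
        }
        catch (IllegalCharsetNameException ex) {
            // The name is not even a syntactically legal charset name
            return false;
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "UTF-8", "UTF-16", "ISO-8859-1", "US-ASCII", "TIS-620", "GB18030"
        };
        for (String name : names) {
            System.out.println(name + ": " + isAvailable(name));
        }
    }
}
```

Only a handful of charsets such as UTF-8, UTF-16, ISO-8859-1, and US-ASCII are guaranteed on every Java platform; the rest vary by VM, which is why the check is made at run time.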
SAXConverter feeds data into a SAX ContentHandler
DOMConverter does two-way conversion of DOM Document objects
Notations
Unparsed entities
Skipped entities
DTD model
Original encoding
Standalone declaration
Version declaration
Classes are designed and documented for subclassing.
Subclasses cannot relax constraints
Subclasses can add constraints by overriding setter methods
Subclasses can add functionality or utility
Factories can be used to build in the subclasses during parsing
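The rule that subclasses can add constraints but cannot relax them can be sketched without XOM. In the hypothetical classes below (neither is a XOM class, and this is not XOM's actual code), the base class enforces its own check in a private method that the setter always calls. A subclass override can tighten the rule before delegating to super, but it has no way around the base check.

```java
import java.util.Locale;

// Hypothetical classes illustrating the pattern; not taken from XOM.
class Label {

    private String value;

    Label(String value) {
        checkNotEmpty(value);
        this.value = value;
    }

    // Non-final setter: subclasses may override it to add checks,
    // but the only way to store a value is through this version.
    public void setValue(String value) {
        checkNotEmpty(value);
        this.value = value;
    }

    public final String getValue() {
        return value;
    }

    // Base constraint; private, so a subclass cannot override it away.
    private static void checkNotEmpty(String value) {
        if (value == null || value.length() == 0) {
            throw new IllegalArgumentException("Empty label");
        }
    }
}

class LowerCaseLabel extends Label {

    LowerCaseLabel(String value) {
        super(requireLowerCase(value));
    }

    // Adds a constraint, then defers to the base class for its own check.
    @Override
    public void setValue(String value) {
        super.setValue(requireLowerCase(value));
    }

    private static String requireLowerCase(String value) {
        if (value != null && !value.equals(value.toLowerCase(Locale.ROOT))) {
            throw new IllegalArgumentException(
              "Label must be lower case: " + value);
        }
        return value;
    }
}
```

With these classes, new LowerCaseLabel("abc") succeeds, new LowerCaseLabel("ABC") is rejected by the subclass, and setValue("") is still rejected by the base class even when called on the subclass.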
Can change classes of nodes
Can change node types
Can change node numbers
Can filter
Can process arbitrarily large documents
Can process in a stream
package nu.xom;

public class NodeFactory {

    public Element makeRootElement(String name, String namespace);
    public Element startMakingElement(String name, String namespace);
    public Nodes finishMakingElement(Element element);
    public Document startMakingDocument();
    public void finishMakingDocument(Document document);
    public Nodes makeAttribute(String name, String URI,
      String value, Attribute.Type type);
    public Nodes makeComment(String data);
    public Nodes makeDocType(String rootElementName,
      String publicID, String systemID);
    public Nodes makeText(String data);
    public Nodes makeProcessingInstruction(String target, String data);
}
Builder uses a factory to build nodes
Default factory builds standard classes
Can change factories by passing a NodeFactory to the Builder constructor
Subclassing enables:
Extra utility methods:
public String getAttributeValue(String name, String uri, String defaultValue)
Read-only tree
Application specific classes:
XHTMLElement
PElement
DivElement
etc.
Subclass NodeFactory
Override finishMakingElement()
Process each element inside finishMakingElement()
Return an empty Nodes object if you're finished with the element and want to remove it from the tree
Return super.finishMakingElement(element) if you're not finished with the element
Goal: Print all the headlines in an RSS feed without storing the entire document in memory
import java.io.IOException;
import nu.xom.*;

public class RSSHeadlines extends NodeFactory {

    private boolean inTitle = false;
    private Nodes empty = new Nodes();

    // Only title elements are ever built
    public Element startMakingElement(String name, String namespace) {
        if ("title".equals(name) ) {
            inTitle = true;
            return new Element(name, namespace);
        }
        return null;
    }

    // Print each title, then discard it so the tree never grows
    public Nodes finishMakingElement(Element element) {
        if ("title".equals(element.getQualifiedName()) ) {
            System.out.println(element.getValue());
            inTitle = false;
        }
        return empty;
    }

    public Nodes makeComment(String data) {
        return empty;
    }

    public Element makeRootElement(String name, String namespace) {
        return new Element(name, namespace);
    }

    public Nodes makeAttribute(String name, String namespace,
      String value, Attribute.Type type) {
        return empty;
    }

    public Nodes makeDocType(String rootElementName,
      String publicID, String systemID) {
        return empty;
    }

    public Nodes makeProcessingInstruction(
      String target, String data) {
        return empty;
    }

    public static void main(String[] args) {

        String url = "http://www.bbc.co.uk/syndication/feeds/news/ukfs_news/world/rss091.xml";
        if (args.length > 0) {
            url = args[0];
        }

        try {
            Builder parser = new Builder(new RSSHeadlines());
            parser.build(url);
        }
        catch (ParsingException ex) {
            System.out.println(url + " is not well-formed.");
            System.out.println(ex.getMessage());
        }
        catch (IOException ex) {
            System.out.println(
              "Due to an IOException, the parser could not read " + url
            );
        }
    }
}
% java -classpath ~/XOM/build/classes:. RSSHeadlines
BBC News | World | UK Edition
BBC News
Ailing Pope to stay in hospital
UK's Kenya envoy in fresh attack
Tsunami survivors found on island
'Nepal crisis cabinet' unveiled
Bush to make key policy speech
Sunnis say Iraq poll illegitimate
Egypt to host Middle East summit
Germany renews pledge to Israel
US hostage photo 'is doll hoax'
Golf: Langer gives up captaincy
Football: Ref scandal escalates
Zimbabwe expels SA union leaders
Africans 'worst-hit by warming'
Clinton made UN's tsunami envoy
Jet skids off New Jersey runway
Trauma risk for tsunami survivors
US 'ties N Korea to nuclear deal'
Five million Germans out of work
Conference examines Roma plight
Syria and Jordan talk about peace
Ex-UN chief warns of water wars
South Asia group postpones talks
Couple arrested over tsunami baby
Heroes who defied the Holocaust
...
Goal: Print all the headlines in an RSS feed
Requires XOM 1.1
import java.io.IOException;
import nu.xom.*;

public class XPathHeadlines {

    public static void main(String[] args) {

        String url = "http://www.bbc.co.uk/syndication/feeds/news/ukfs_news/world/rss091.xml";
        if (args.length > 0) {
            url = args[0];
        }

        try {
            Builder parser = new Builder();
            Document doc = parser.build(url);
            // Select every title element anywhere in the document
            Nodes titles = doc.query("//title");
            for (int i = 0; i < titles.size(); i++) {
                System.out.println(titles.get(i).getValue());
            }
        }
        catch (ParsingException ex) {
            System.out.println(url + " is not well-formed.");
            System.out.println(ex.getMessage());
        }
        catch (IOException ex) {
            System.out.println(
              "Due to an IOException, the parser could not read " + url
            );
        }
    }
}
"Premature optimization is the root of all evil" -- Donald Knuth, 1974
Pretty damn good
Fast enough
Document objects tend to be 4-6 times the size of the input document
Replace ArrayLists with direct arrays
Use strings instead of UTF-8 in Text class
Store base URIs in a WeakHashMap
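The WeakHashMap item can be illustrated in isolation. This is a minimal sketch, assuming the idea is to keep rarely set data such as a base URI out of every node object and in a side table instead. SketchNode and BaseURITable are hypothetical names, not XOM classes, and the code is not taken from XOM.

```java
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical stand-in for a tree node; not nu.xom.Node.
class SketchNode {
    // Deliberately no baseURI field: most nodes never have one set,
    // so a field would waste space in every instance.
}

// Hypothetical side table that holds base URIs outside the nodes.
class BaseURITable {

    // Keys are held weakly, so an entry can vanish once its node
    // is no longer strongly reachable. Not thread safe.
    private static final Map<SketchNode, String> baseURIs
      = new WeakHashMap<SketchNode, String>();

    static void setBaseURI(SketchNode node, String uri) {
        if (uri == null) baseURIs.remove(node);
        else baseURIs.put(node, uri);
    }

    // Returns the explicitly stored base URI, or null if none was set.
    static String getBaseURI(SketchNode node) {
        return baseURIs.get(node);
    }

    public static void main(String[] args) {
        SketchNode node = new SketchNode();
        setBaseURI(node, "http://www.example.com/");
        System.out.println(getBaseURI(node));
        System.out.println(getBaseURI(new SketchNode()));
    }
}
```

The trade-off is a map lookup on each access in exchange for one fewer field per node, which pays off only when few nodes carry the data.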
Absolutely correct; no malformedness
Fewer "convenience" methods and classes
toXML()
JDOM Elements contain a list; a XOM Element is a list; thus
Typed navigation via loops instead of the Java Collections API, Lists, and Iterators
No support for skipped entities
XOM classes do not implement Serializable or Cloneable
Streaming
Canonical XML
XInclude
Number of public methods (and constructors) in | DOM2 | JDOM b10 | XOM 1.0d25 |
---|---|---|---|
Node | 25 | 8 * | 11 (13 in 1.1) |
Attribute | 5 | 29 | 20 |
Element | 16 | 73 | 37 |
ProcessingInstruction | 3 | 14 | 9 |
Comment | 0 | 5 | 9 |
Builder | N/A | 32 ** | 16 |
Document | 17 | 41 | 13 |
* Content
** SAXBuilder
DTD API
Catalog support
XML Encryption
XML Digital Signatures
Joshua Bloch for Effective Java
Ken Arnold for Perfection and Simplicity
Bruce Eckel for Does Java need Checked Exceptions?
Bertrand Meyer for Object Oriented Software Construction
Jason Hunter and Brett McLaughlin for JDOM
Kent Beck and Erich Gamma for JUnit
The members of the xom-interest mailing list for numerous helpful suggestions and critiques
XOM Site: http://www.xom.nu/
XOM-interest mailing list: http://lists.ibiblio.org/mailman/listinfo/xom-interest
Getting Started with XOM by Michael Fitzgerald, http://www.xml.com/pub/a/2002/11/27/xom.html
XML Made Simpler by Rogers Cadenhead, Linux Magazine, March 2003, http://www.linux-mag.com/2003-03/java_xom_01.html
The nu.xom.samples package has simple examples of many XOM features.
This presentation: http://cafeconleche.org/slides/sd2005west/xom/