DOM

Elliotte Rusty Harold

Software Development 2002 East

Thursday, November 21, 2002

elharo@metalab.unc.edu

http://www.cafeconleche.org/

Where we're going

A Brief Review of XML Rules and Terminology
Reading XML through the DOM
Writing XML through the DOM

Relevant Standards

Active Standards
- XML 1.0
- Namespaces in XML
- XPath 1.0
- Document Object Model Level 2
- XML Information Set
Under development:
- Document Object Model Level 3
Deprecated and obsolete:
- Document Object Model Level 1

DOM Requirements

You need a JDK
You need some free class libraries
You need a text editor
You need some data to process

Prerequisites

Are familiar with Java including I/O, classes, objects, polymorphism, etc.
Know XML including well-formedness, validity, namespaces, and so forth
I will briefly review proper terminology

A simple example

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/css" href="song.css"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<SONG xmlns="http://www.cafeconleche.org/namespace/song"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <TITLE>Hot Cop</TITLE>
  <PHOTO 
    xlink:type="simple" xlink:show="onLoad" xlink:href="hotcop.jpg"
    ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200"/>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <!-- The publisher is actually Polygram but I needed 
       an example of a general entity reference. -->
  <PUBLISHER xlink:type="simple" xlink:href="http://www.amrecords.com/">
    A &amp; M Records
  </PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>
<!-- You can tell what album I was 
     listening to when I wrote this example -->

View in Browser

Markup and Character Data

Markup includes:
- Tags
- Entity References
- Comments
- Processing Instructions
- Document Type Declarations
- XML Declaration
- CDATA Section Delimiters
Character data includes everything else

Markup and Character Data Example

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/css" href="song.css"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<SONG xmlns="http://www.cafeconleche.org/namespace/song"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <TITLE>Hot Cop</TITLE>
  <PHOTO 
    xlink:type="simple" xlink:show="onLoad" xlink:href="hotcop.jpg"
    ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200"/>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <!-- The publisher is actually Polygram but I needed 
       an example of a general entity reference. -->
  <PUBLISHER xlink:type="simple" xlink:href="http://www.amrecords.com/">
    A &amp; M Records
  </PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>
<!-- You can tell what album I was 
     listening to when I wrote this example -->

Elements vs. Entities

Logically, an XML document is divided into elements.
Physically, an XML document is divided into entities
Entity references:
- Parsed internal general entity references like &
- Parsed external general entity references
- Unparsed external general entity references
- External parameter entity references
- Internal parameter entity references
Reading an XML document is not the same thing as reading an XML file

The file contains entity references.
The document contains the entities' replacement text.
When you use a parser to read a document you'll get the text including characters like <. You will not see the entity references.

Parsed Character Data

Character data left after entity references are replaced with their text
Given the element
<PUBLISHER>A & M Records</PUBLISHER>

The parsed character data is

A & M Records
CDATA sections are replaced with their actual text.
APIs report the parsed character data; not the unparsed data.

Questions?

DOM, The Document Object Model

The DOM (like XML) is not a triumph of elegance; it's a triumph of "if we do not hang together, we shall hang separately." At least the Browser Wars were not followed by API Wars. Better a common API that we all love to hate than a bazillion contending APIs that carve the Web up into contending enclaves of True Believers.

--Mike Champion on the xml-dev mailing list, September 27, 2001

Trees

An XML document is a tree.
It has a root.
It has nodes.
It is amenable to recursive processing.

Document Object Model

Defines how XML and HTML documents are represented as objects in programs
W3C Standard
Defined in IDL; thus language independent
HTML as well as XML
Writing as well as reading
Covers everything except internal and external DTD subsets
DOM focuses more on the document; SAX focuses more on the parser.

DOM Evolution

DOM Implementations for Java

Apache XML Project's Xerces Java: http://xml.apache.org/xerces-j/index.html
IBM's XML for Java: http://www.alphaworks.ibm.com/formula/xml
Sun's Java API for XML http://java.sun.com/products/xml
Oracle: http://technet.oracle.com/tech/xml
GNU JAXP: http://www.gnu.org/software/classpathx/jaxp/jaxp.html

Eight Modules:

Eight Modules:
- Core: org.w3c.dom *
- HTML: org.w3c.dom.html
- Views: org.w3c.dom.views
- StyleSheets: org.w3c.dom.stylesheets
- CSS: org.w3c.dom.css
- Events: org.w3c.dom.events *
- Traversal: org.w3c.dom.traversal *
- Range: org.w3c.dom.range
Only the core and traversal modules really apply to XML. The other six are for HTML.
* indicates Xerces support

Which modules and features are supported?

A DOM application can use the hasFeature() method of the DOMImplementation interface to determine whether a module is supported or not.
- XML Module: "XML"
- HTML Module: "HTML"
- Views Module: "Views"
- StyleSheets Module: "StyleSheets"
- CSS Module: "CSS"
- CSS (extended interfaces) Module: "CSS2"
- Events Module: "Events"
- User Interface Events (UIEvent interface) Module: "UIEvents"
- Mouse Events Module: "MouseEvents"
- Mutation Events Module: "MutationEvents"
- HTML Events Module: "HTMLEvents"
- Traversal Module: "Traversal"
- Range Module: "Range"

Which modules are supported?

import org.apache.xerces.dom.DOMImplementationImpl;
import org.w3c.dom.*;
import java.io.*;


public class ModuleChecker {

  public static void main(String[] args) {
     
    // parser dependent
    DOMImplementation implementation 
     = DOMImplementationImpl.getDOMImplementation();
    String[] features = {"XML", "HTML", "Views", "StyleSheets",
     "CSS", "CSS2", "Events", "UIEvents", "MouseEvents", 
     "MutationEvents", "HTMLEvents", "Traversal", "Range"};
    
    for (int i = 0; i < features.length; i++) {
      if (implementation.hasFeature(features[i], "2.0")) {
        System.out.println("Implementation supports " 
         + features[i]);
      } 
      else {
        System.out.println("Implementation does not support " 
         + features[i]);
      } 
    }
  
  }

}

Which modules are supported? Results

% java ModuleChecker
Implementation supports XML
Implementation does not support HTML
Implementation does not support Views
Implementation does not support StyleSheets
Implementation does not support CSS
Implementation does not support CSS2
Implementation supports Events
Implementation does not support UIEvents
Implementation does not support MouseEvents
Implementation supports MutationEvents
Implementation does not support HTMLEvents
Implementation supports Traversal
Implementation does not support Range

DOM Trees

Entire document is represented as a tree.
A tree contains nodes.
Some nodes may contain other nodes (depending on node type).
Each document node contains:
- zero or one doctype nodes
- one root element node
- zero or more comment and processing instruction nodes

org.w3c.dom

17 interfaces:
- Attr
- CDATASection
- CharacterData
- Comment
- Document
- DocumentFragment
- DocumentType
- DOMImplementation
- Element
- Entity
- EntityReference
- NamedNodeMap
- Node
- NodeList
- Notation
- ProcessingInstruction
- Text
plus one exception: DOMException
Plus a bunch of HTML stuff in org.w3c.dom.html and other packages we will ignore

The DOM Process

Library specific code creates a parser
The parser parses the document and returns a DOM org.w3c.dom.Document object.
The entire document is stored in memory.
DOM methods and interfaces are used to extract data from this object

Parsing documents with a DOM Parser Example

import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;

public class DOMParserMaker {

  public static void main(String[] args) {
     
    // This is simpler but less flexible than the SAX approach.
    // Perhaps a good creational design pattern is needed here?   
  
    DOMParser parser = new DOMParser();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document d = parser.getDocument();
        // work with the document...
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
   
  }

}

The JAXP Process

javax.xml.parsers.DocumentBuilderFactory.newInstance() creates a DocumentBuilderFactory
The factory's newBuilder() method creates a DocumentBuilder
The builder parses the document and returns a DOM org.w3c.dom.Document object.
The entire document is stored in memory.
DOM methods and interfaces are used to extract data from this object

Parsing documents with a JAXP DocumentBuilder

import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;

public class JAXPParserMaker {

  public static void main(String[] args) {
     
    try {       
      DocumentBuilderFactory builderFactory 
       = DocumentBuilderFactory.newInstance();
      DocumentBuilder parser 
       = builderFactory.newDocumentBuilder();
    
      for (int i = 0; i < args.length; i++) {
        try {
          // Read the entire document into memory
          Document d = parser.parse(args[i]); 
          // work with the document...
        }
        catch (SAXException e) {
        System.err.println(e); 
        }
        catch (IOException e) {
          System.err.println(e); 
        }
      
      } // end for
      
    }
    catch (ParserConfigurationException e) {
      System.err.println("You need to install a JAXP aware parser.");
    }
   
  }

}

Questions?

The Node Interface

package org.w3c.dom;

public interface Node {

  // NodeType
  public static final short ELEMENT_NODE                = 1;
  public static final short ATTRIBUTE_NODE              = 2;
  public static final short TEXT_NODE                   = 3;
  public static final short CDATA_SECTION_NODE          = 4;
  public static final short ENTITY_REFERENCE_NODE       = 5;
  public static final short ENTITY_NODE                 = 6;
  public static final short PROCESSING_INSTRUCTION_NODE = 7;
  public static final short COMMENT_NODE                = 8;
  public static final short DOCUMENT_NODE               = 9;
  public static final short DOCUMENT_TYPE_NODE          = 10;
  public static final short DOCUMENT_FRAGMENT_NODE      = 11;
  public static final short NOTATION_NODE               = 12;

  public String       getNodeName();
  public String       getNodeValue() throws DOMException;
  public void         setNodeValue(String nodeValue) throws DOMException;
  public short        getNodeType();
  public Node         getParentNode();
  public NodeList     getChildNodes();
  public Node         getFirstChild();
  public Node         getLastChild();
  public Node         getPreviousSibling();
  public Node         getNextSibling();
  public NamedNodeMap getAttributes();
  public Document     getOwnerDocument();
  public Node         insertBefore(Node newChild, Node refChild) throws DOMException;
  public Node         replaceChild(Node newChild, Node oldChild) throws DOMException;
  public Node         removeChild(Node oldChild) throws DOMException;
  public Node         appendChild(Node newChild) throws DOMException;
  public boolean      hasChildNodes();
  public Node         cloneNode(boolean deep);
  public void         normalize();
  public boolean      supports(String feature, String version);
  public String       getNamespaceURI();
  public String       getPrefix();
  public void         setPrefix(String prefix) throws DOMException;
  public String       getLocalName();
  
}

The NodeList Interface

package org.w3c.dom;

public interface NodeList {
  public Node item(int index);
  public int  getLength();
}

Now we're really ready to read a document

Node Reporter

import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;


public class NodeReporter {

  public static void main(String[] args) {
     
    DOMParser parser = new DOMParser();
    NodeReporter iterator = new NodeReporter();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document doc = parser.getDocument();
        iterator.followNode(doc);
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

  // note use of recursion
  public void followNode(Node node) {
    
    processNode(node);
    if (node.hasChildNodes()) {
      NodeList children = node.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        followNode(children.item(i));
      } 
    }
    
  }

  public void processNode(Node node) {
    
    String name = node.getNodeName();
    String type = getTypeName(node.getNodeType());
    System.out.println("Type " + type + ": " + name);
    
  }
  
  public static String getTypeName(int type) {
    
    switch (type) {
      case Node.ELEMENT_NODE: 
        return "Element";
      case Node.ATTRIBUTE_NODE: 
        return "Attribute";
      case Node.TEXT_NODE: 
        return "Text";
      case Node.CDATA_SECTION_NODE: 
        return "CDATA Section";
      case Node.ENTITY_REFERENCE_NODE: 
        return "Entity Reference";
      case Node.ENTITY_NODE: 
        return "Entity";
      case Node.PROCESSING_INSTRUCTION_NODE: 
        return "Processing Instruction";
      case Node.COMMENT_NODE : 
        return "Comment";
      case Node.DOCUMENT_NODE: 
        return "Document";
      case Node.DOCUMENT_TYPE_NODE: 
        return "Document Type Declaration";
      case Node.DOCUMENT_FRAGMENT_NODE: 
        return "Document Fragment";
      case Node.NOTATION_NODE: 
        return "Notation";
      default: 
        return "Unknown Type"; 
    }
    
  }

}

Node Reporter Output

% java NodeReporter hotcop.xml
Type Document: #document
Type Processing Instruction: xml-stylesheet
Type Document Type Declaration: SONG
Type Element: SONG
Type Text: #text
Type Element: TITLE
Type Text: #text
Type Text: #text
Type Element: PHOTO
Type Text: #text
Type Element: COMPOSER
Type Text: #text
Type Text: #text
Type Element: COMPOSER
Type Text: #text
Type Text: #text
Type Element: COMPOSER
Type Text: #text
Type Text: #text
Type Element: PRODUCER
Type Text: #text
Type Text: #text
Type Comment: #comment
Type Text: #text
Type Element: PUBLISHER
Type Text: #text
Type Text: #text
Type Element: LENGTH
Type Text: #text
Type Text: #text
Type Element: YEAR
Type Text: #text
Type Text: #text
Type Element: ARTIST
Type Text: #text
Type Text: #text
Type Comment: #comment

Attributes are missing from this output. They are not nodes. They are properties of nodes.

Node Values as returned by getNodeValue()

Node Type	Node Value
element node	null
attribute node	attribute value
text node	text of the node
CDATA section node	text of the section
entity reference node	null
entity node	null
processing instruction node	content of the processing instruction, not including the target
comment node	text of the comment
document node	null
document type declaration node	null
document fragment node	null
notation node	null

The Document Node

The root node representing the entire document; not the same as the root element
Contains:
- one element node
- zero or more processing instruction nodes
- zero or more comment nodes
- zero or one document type nodes

The Document Interface

package org.w3c.dom;

  public interface Document extends Node {
  
    public DocumentType      getDoctype();
    public DOMImplementation getImplementation();
    public Element           getDocumentElement();
    public Element           createElement(String tagName) throws DOMException;
    public Element           createElementNS(String namespaceURI, String qualifiedName) throws DOMException;
    public DocumentFragment  createDocumentFragment();
    public Text              createTextNode(String data);
    public Comment           createComment(String data);
    public CDATASection      createCDATASection(String data) throws DOMException;
    public ProcessingInstruction createProcessingInstruction(String target, String data)
     throws DOMException;
    public Attr            createAttribute(String name) throws DOMException;
    public Attr            createAttributeNS(String namespaceURI, String qualifiedName) throws DOMException;
    public EntityReference createEntityReference(String name) throws DOMException;
    public NodeList        getElementsByTagName(String tagname);
    public NodeList        getElementsByTagNameNS(String namespaceURI, String localName);
    public Element         getElementById(String elementId);
    public Node            importNode(Node importedNode, boolean deep) throws DOMException;
    
}

A Sample Application

UserLand's RSS based list of Web logs at http://static.userland.com/weblogMonitor/logs.xml:

<?xml version="1.0"?>
<!-- <!DOCTYPE foo SYSTEM "http://msdn.microsoft.com/xml/general/htmlentities.dtd"> -->
<weblogs>
	<log>
		<name>MozillaZine</name>
		<url>http://www.mozillazine.org</url>
		<changesUrl>http://www.mozillazine.org/contents.rdf</changesUrl>
		<ownerName>Jason Kersey</ownerName>
		<ownerEmail>kerz@en.com</ownerEmail>
		<description>THE source for news on the Mozilla Organization.  DevChats, Reviews, Chats, Builds, Demos, Screenshots, and more.</description>
		<imageUrl></imageUrl>
		<adImageUrl>http://static.userland.com/weblogMonitor/ads/kerz@en.com.gif</adImageUrl>
		</log>
	<log>
		<name>SalonHerringWiredFool</name>
		<url>http://www.salonherringwiredfool.com/</url>
		<ownerName>Some Random Herring</ownerName>
		<ownerEmail>salonfool@wiredherring.com</ownerEmail>
		<description></description>
		</log>
	<log>
		<name>Scripting News</name>
		<url>http://www.scripting.com/</url>
		<ownerName>Dave Winer</ownerName>
		<ownerEmail>dave@userland.com</ownerEmail>
		<description>News and commentary from the cross-platform scripting community.</description>
		<imageUrl>http://www.scripting.com/gifs/tinyScriptingNews.gif</imageUrl>
		<adImageUrl>http://static.userland.com/weblogMonitor/ads/dave@userland.com.gif</adImageUrl>
		</log>
	<log>
		<name>SlashDot.Org</name>
		<url>http://www.slashdot.org/</url>
		<ownerName>Simply a friend</ownerName>
		<ownerEmail>afriendofweblogs@weblogs.com</ownerEmail>
		<description>News for Nerds, Stuff that Matters.</description>
		</log>
	</weblogs>

Full list

DOM Design

We can easily find out how many URLs there will be when we start parsing, since they're all in memory.
Single threaded by nature; no benefit to multiple threads since no data will be available until the entire document has been read and parsed.
The character data of each url element needs to be read. Everything else can be ignored.
The getElementsByTagName() method in Document gives us a quick list of all the url elements.
The XML parsing is so straight-forward it can be done inside one method. No extra class is required.

Weblogs with DOM

import org.w3c.dom.*;
import org.xml.sax.SAXException;
import java.io.IOException;
import java.util.*;
import java.net.*;


public class WeblogsDOM {

  public static String DEFAULT_URL
   = "http://static.userland.com/weblogMonitor/logs.xml";

  public static List listChannels() throws DOMException {
    return listChannels(DEFAULT_URL);
  }

  public static List listChannels(String uri) throws DOMException {

    if (uri == null) {
      throw new NullPointerException("URL must be non-null");
    }

    org.apache.xerces.parsers.DOMParser parser
     = new org.apache.xerces.parsers.DOMParser();

    Vector urls = null;

    try {
      // Read the entire document into memory
      parser.parse(uri);
      Document doc = parser.getDocument();
      NodeList logs = doc.getElementsByTagName("url");

      urls = new Vector(logs.getLength());

      for (int i = 0; i < logs.getLength(); i++) {
        try {
          Node element = logs.item(i);
          Node text = element.getFirstChild();
          String content = text.getNodeValue();
          URL u = new URL(content);
          urls.addElement(u);
        }
        catch (MalformedURLException e) {
          // bad input data from one third party; just ignore it
        }
      }
    }
    catch (SAXException e) {
      System.err.println(e);
    }
    catch (IOException e) {
      System.err.println(e);
    }

    return urls;

  }

  public static void main(String[] args) {

    try {
      List urls;
      if (args.length > 0) {
        try {
          URL url = new URL(args[0]);
          urls = listChannels(args[0]);
        }
        catch (MalformedURLException e) {
          System.err.println("Usage: java WeblogsDOM url");
          return;
        }
      }
      else {
        urls = listChannels();
      }
      Iterator iterator = urls.iterator();
      while (iterator.hasNext()) {
        System.out.println(iterator.next());
      }
    }
    catch (/* Unexpected */ Exception e) {
      e.printStackTrace();
    }

  } // end main

}

Weblogs Output

% java WeblogsDOM
http://2020Hindsight.editthispage.com/
http://www.sff.net/people/mitchw/weblog/weblog.htp
http://nate.weblogs.com/
http://plugins.launchpoint.net
http://404.psistorm.net
http://home.att.net/~geek9000
http://daubnet.tzo.com/weblog
several hundred more...

Questions?

Element Nodes

Represents a complete element including its start-tag, end-tag, and content
Contains:
- Element nodes
- ProcessingInstruction nodes
- Comment nodes
- Text nodes
- CDATASection nodes
- EntityReference nodes

The Element Interface

package org.w3c.dom;

public interface Element extends Node {

  public String   getTagName();

  public NodeList getElementsByTagName(String name);
  public NodeList getElementsByTagNameNS(String namespaceURI, 
   String localName);

  public String   getAttribute(String name);
  public String   getAttributeNS(String namespaceURI, 
   String localName);
  public void     setAttribute(String name, String value) 
   throws DOMException;
  public void     setAttributeNS(String namespaceURI, 
   String qualifiedName, String value) throws DOMException;
  public void     removeAttribute(String name) throws DOMException;
  public void     removeAttributeNS(String namespaceURI, 
   String localName) throws DOMException;
  public Attr     getAttributeNode(String name);
  public Attr     getAttributeNodeNS(String namespaceURI, String localName);
  public Attr     setAttributeNode(Attr newAttr) throws DOMException;
  public Attr     setAttributeNodeNS(Attr newAttr) throws DOMException;
  public Attr     removeAttributeNode(Attr oldAttr) throws DOMException;

}

IDTagger

import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.IOException;
import org.apache.xml.serialize.*;


public class IDTagger {

  int id = 1;

  public void processNode(Node node) {
    
    if (node.getNodeType() == Node.ELEMENT_NODE) {
      
      Element element = (Element) node;
      String currentID = element.getAttribute("ID");
      if (currentID == null || currentID.equals("")) {
        element.setAttribute("ID", "_" + id);
        id = id + 1; 
      }
    }
    
  }

  // note use of recursion
  public void followNode(Node node) {
    
    processNode(node);
    if (node.hasChildNodes()) {
      NodeList children = node.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        followNode(children.item(i));
      } 
    }
    
  }

  public static void main(String[] args) {
     
    DOMParser parser  = new DOMParser();
    IDTagger iterator = new IDTagger();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document document = parser.getDocument();
        iterator.followNode(document);
        
        // now we serialize the document...
        OutputFormat format = new OutputFormat(document);
        XMLSerializer serializer 
         = new XMLSerializer(System.out, format);
        serializer.serialize(document);       
        
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

}

Output from IDTagger

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<?xml-stylesheet type="text/css" href="song.css"?><!-- This should be a four digit year like "1999",
     not a two-digit year like "99" --><SONG xmlns="http://www.cafeconleche.org/namespace/song" ID="_1" xmlns:xlink="http://www.w3.org/1999/xlink">   <TITLE ID="_2">Hot Cop</TITLE>   <PHOTO ALT="Victor Willis in Cop Outfit" HEIGHT="200" ID="_3" WIDTH="100" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="hotcop.jpg" xlink:show="onLoad" xlink:type="simple"/>   <COMPOSER ID="_4">Jacques Morali</COMPOSER>   <COMPOSER ID="_5">Henri Belolo</COMPOSER>   <COMPOSER ID="_6">Victor Willis</COMPOSER>   <PRODUCER ID="_7">Jacques Morali</PRODUCER>   <!-- The publisher is actually Polygram but I needed 
       an example of a general entity reference. -->   <PUBLISHER ID="_8" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.amrecords.com/" xlink:type="simple">     A &amp; M Records   </PUBLISHER>   <LENGTH ID="_9">6:20</LENGTH>   <YEAR ID="_10">1978</YEAR>   <ARTIST ID="_11">Village People</ARTIST> </SONG><!-- You can tell what album I was 
     listening to when I wrote this example -->

View Output in Browser

CharacterData interface

Represents things that are basically text holders
Super interface of Text, Comment, and CDATASection

The CharacterData Interface

package org.w3c.dom;

public interface CharacterData extends Node {

  public String getData() throws DOMException;
  public void   setData(String data) throws DOMException;
  public int    getLength();
  public String substringData(int offset, int count) 
   throws DOMException;
  public void   appendData(String arg) 
   throws DOMException;
  public void   insertData(int offset, String arg) 
   throws DOMException;
  public void   deleteData(int offset, int count) 
   throws DOMException;
  public void   replaceData(int offset, int count, String arg) 
   throws DOMException;
  
}

ROT13 XML Text

import org.apache.xerces.parsers.DOMParser;
import org.apache.xml.serialize.*;
import org.w3c.dom.*;
import org.xml.sax.SAXException;
import java.io.IOException;


public class ROT13XML {

  public void processNode(Node node) {
    
    if (node.getNodeType() == Node.TEXT_NODE
     || node.getNodeType() == Node.COMMENT_NODE
     || node.getNodeType() == Node.CDATA_SECTION_NODE) {
      CharacterData text = (CharacterData) node;
      String data = text.getData();
      text.setData(rot13(data));
    }
    
  }

  // note use of recursion
  public void followNode(Node node) {
    
    processNode(node);
    if (node.hasChildNodes()) {
      NodeList children = node.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        followNode(children.item(i));
      } 
    }
    
  }
  
  public static String rot13(String s) {
    
    StringBuffer result = new StringBuffer(s.length());
    for (int i = 0; i < s.length(); i++) {
      int c = s.charAt(i);
      if (c >= 'A' && c <= 'M') result.append((char) (c+13));
      else if (c >= 'N' && c <= 'Z') result.append((char) (c-13));
      else if (c >= 'a' && c <= 'm') result.append((char) (c+13));
      else if (c >= 'n' && c <= 'z') result.append((char) (c-13));
      else result.append((char) c);
      
    } 
    return result.toString();
    
  }

  public static void main(String[] args) {
     
    DOMParser parser   = new DOMParser();
    ROT13XML  iterator = new ROT13XML();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document document = parser.getDocument();
        iterator.followNode(document);
        
        // now we serialize the document...
        OutputFormat format = new OutputFormat(document);
        XMLSerializer serializer 
         = new XMLSerializer(System.out, format);
        serializer.serialize(document);
               
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

}

ROT13 XML Output

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<?xml-stylesheet type="text/css" href="song.css"?>
<SONG xmlns="http://metalab.unc.edu/xml/namespace/song"
xmlns:xlink="http://www.w3.org/1999/xlink">   <TITLE>Ubg Pbc</TITLE>
<PHOTO ALT="Victor Willis in Cop Outfit" HEIGHT="200" WIDTH="100"
xlink:href="hotcop.jpg" xlink:show="onLoad" xlink:type="simple"/>
<COMPOSER>Wnpdhrf Zbenyv</COMPOSER>   <COMPOSER>Uraev Orybyb</COMPOSER>
<COMPOSER>Ivpgbe Jvyyvf</COMPOSER>   <PRODUCER>Wnpdhrf Zbenyv</PRODUCER>
<!-- Gur choyvfure vf npghnyyl Cbyltenz ohg V arrqrq         na rknzcyr
bs n trareny ragvgl ersrerapr. -->   <PUBLISHER
xlink:href="http://www.amrecords.com/" xlink:type="simple">     N &amp;
Z Erpbeqf   </PUBLISHER>   <LENGTH>6:20</LENGTH>   <YEAR>1978</YEAR>
<ARTIST>Ivyyntr Crbcyr</ARTIST> </SONG>
<!-- Lbh pna gryy jung nyohz V jnf 
     yvfgravat gb jura V jebgr guvf rknzcyr -->

Text Nodes

Represents the text content of an element or attribute
Contains only pure text, no markup
Parsers will return a single maximal text node for each contiguous run of pure text
Editing may change this

The Text Interface

package org.w3c.dom;

public interface Text extends CharacterData {

  public Text splitText(int offset) throws DOMException;
  
}

CDATA section Nodes

Represents a CDATA section like this example from a hypothetical SVG tutorial:

<p>You can use a default <code>xmlns</code> attribute to avoid 
having to add the svg prefix to all your elements:</p>
     <![CDATA[
       <svg xmlns="http://www.w3.org/2000/svg" 
            width="12cm" height="10cm">
         <ellipse rx="110" ry="130" />
         <rect x="4cm" y="1cm" width="3cm" height="6cm" />
       </svg>
     ]]>

No children

The CDATASection Interface

package org.w3c.dom;

public interface CDATASection extends Text {
}

DocumentType Nodes

Represents a document type declaration
Has no children

The DocumentType Interface

package org.w3c.dom;

public interface DocumentType extends Node {

  public String       getName();
  public NamedNodeMap getEntities();
  public NamedNodeMap getNotations();
  public String       getPublicId();
  public String       getSystemId();
  public String       getInternalSubset();
  
}

Example of the DocumentType Interface

Verify that a document is correct XHTML
From the XHTML 1.0 spec:
1. It must validate against one of the three DTDs found in Appendix A.
2. The root element of the document must be <html>.
3. The root element of the document must designate the XHTML namespace using the xmlns attribute [XMLNAMES]. The namespace for XHTML is defined to be http://www.w3.org/1999/xhtml.
4. There must be a DOCTYPE declaration in the document prior to the root element. The public identifier included in the DOCTYPE declaration must reference one of the three DTDs found in Appendix A using the respective Formal Public Identifier. The system identifier may be changed to reflect local system conventions.
```
<!DOCTYPE html 
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "DTD/xhtml1-strict.dtd">

<!DOCTYPE html 
     PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "DTD/xhtml1-transitional.dtd">

<!DOCTYPE html 
     PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
     "DTD/xhtml1-frameset.dtd">
```

XHTMLValidator

import org.w3c.dom.*;
import org.apache.xerces.parsers.*; 
import java.io.*;
import org.xml.sax.*;


public class XHTMLValidator {

  public static void main(String[] args) {
    
    for (int i = 0; i < args.length; i++) {
      validate(args[i]);
    }   
    
  }

  private static DOMParser parser = new DOMParser();
  
  static {
    
    // turn on validation
    try {
      parser.setFeature(
       "http://xml.org/sax/features/validation", true);
      parser.setErrorHandler(new ValidityErrorReporter());
    }
    catch (SAXNotRecognizedException e) {
      System.err.println(
         "Installed XML parser cannot validate; "
       + "checking for well-formedness instead...");
    } 
    catch (SAXNotSupportedException e) {
      System.err.println(
       "Cannot turn on validation here; "
       + " checking for well-formedness instead...");
    }     
    
  }
  
  // not thread safe
  public static void validate(String source) {
        
    try {

      try {
        parser.parse(source); 
        // ValidityErrorReporter prints any validity errors detected
      }
      catch (SAXException e) {  
        System.out.println(source + " is not well formed."); 
        return; 
      }
      
      // If we get this far, then the document is well-formed XML.
      // Check to see whether the document is actually XHTML    
      Document document = parser.getDocument();
    
      DocumentType doctype = document.getDoctype();
    
      if (doctype == null) {
        System.out.println("No DOCTYPE"); 
        return;
      }

      String name     = doctype.getName();
      String systemID = doctype.getSystemId();
      String publicID = doctype.getPublicId();
      
      if (!name.equals("html")) {
        System.out.println("Incorrect root element name " + name); 
      }
    
      if (publicID == null
       || (!publicID.equals("-//W3C//DTD XHTML 1.0 Strict//EN")
           && !publicID.equals(
                "-//W3C//DTD XHTML 1.0 Transitional//EN")
           && !publicID.equals(
                "-//W3C//DTD XHTML 1.0 Frameset//EN"))) {
        System.out.println(source 
         + " does not seem to use an XHTML 1.0 DTD");
      }
    
      // Check the namespace on the root element
      Element root = document.getDocumentElement();
      String xmlnsValue = root.getAttribute("xmlns");
      if (!xmlnsValue.equals("http://www.w3.org/1999/xhtml")) {
        System.out.println(source 
         + " does not properly declare the"
         + " http://www.w3.org/1999/xhtml"
         + " namespace on the root element");        
      }
    
      // get ready for the next parse
      parser.reset();
      
    }
    catch (IOException e) {
      System.err.println("Could not read " + source);
    }
    catch (Exception e) {
      System.err.println(e);
      e.printStackTrace();
    }
    
  }

}

EntityReference Nodes

Represents an entity reference like & or &signature;
Optional: some parsers (including Xerces) just expand entities
Contains:
- Element nodes
- ProcessingInstruction nodes
- Comment nodes
- Text nodes
- CDATASection nodes
- EntityReference nodes

The EntityReference Interface

package org.w3c.dom;

public interface EntityReference extends Node {

}

Attr Nodes

Represents an attribute
Contains:
- Text nodes
- Entity reference nodes

The Attr Interface

package org.w3c.dom;

public interface Attr extends Node {

  public String   getName();
  public boolean  getSpecified();
  public String   getValue();
  public void     setValue(String value) throws DOMException;
  public Element  getOwnerElement();
  
}

XLinkSpider with DOM

import org.xml.sax.*;
import org.apache.xerces.parsers.*;
import java.io.*;
import java.util.*;
import org.w3c.dom.*;


public class DOMSpider {

  private static DOMParser parser = new DOMParser();
  
  // namespace support is turned off by default in Xerces
  static {
    try {
      parser.setFeature(
       "http://xml.org/sax/features/namespaces", true); 
    }
    catch (Exception e) {
      System.err.println(e);
    }
  }
  
  private static Vector visited = new Vector();
  
  private static int maxDepth = 5;
  private static int currentDepth = 0; 
  
  public static void listURIs(String systemId) {
    
    currentDepth++;
    try {
      if (currentDepth < maxDepth) {
        parser.parse(systemId);
     
        Document document = parser.getDocument();   
    
        Vector uris = new Vector();
        // search the document for uris, 
        // store them in vector, and print them
        searchForURIs(document.getDocumentElement(), uris);
    
    
        Enumeration e = uris.elements();
        while (e.hasMoreElements()) {
          String uri = (String) e.nextElement();
          visited.addElement(uri);
          listURIs(uri); 
        }
      
      }
    
    }
    catch (SAXException e) {
      // couldn't load the document, 
      // probably not well-formed XML, skip it 
    }
    catch (IOException e) {
      // couldn't load the document, 
      // likely network failure, skip it 
    }
    finally { 
      currentDepth--;
      System.out.flush();     
    }
      
  }
  
  // use recursion 
  public static void searchForURIs(Element element, Vector uris) {
    
    // look for XLinks in this element
    String uri = element.getAttributeNS("http://www.w3.org/1999/xlink", "href");

    if (uri != null && !uri.equals("") 
         && !visited.contains(uri) 
         && !uris.contains(uri)) {
      System.out.println(uri);
      uris.addElement(uri);
    }
    
    // process child elements recursively
    NodeList children = element.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node n = children.item(i);
      if (n instanceof Element) {
        searchForURIs((Element) n, uris);
      } 
    }
    
  }
  

  public static void main(String[] args) {
    
    if (args.length == 0) {
      System.out.println("Usage: java DOMSpider URL1 URL2..."); 
    } 
      
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      try {
        listURIs(args[i]);
      }
      catch (Exception e) {
        System.err.println(e);
        e.printStackTrace(); 
      }
      
    } // end for
  
  } // end main

} // end DOMSpider

ProcessingInstruction Nodes

Represents a processing instruction like
<?robots index="yes" follow="no"?>
No children

The ProcessingInstruction Interface

package org.w3c.dom;

public interface ProcessingInstruction extends Node {

  public String getTarget();
  public String getData();
  public void   setData(String data) throws DOMException;
  
}

XLinkSpider that Respects robots processing instruction

import org.xml.sax.*;
import org.apache.xerces.parsers.*;
import java.io.*;
import java.util.*;
import org.w3c.dom.*;


public class PoliteDOMSpider {

  private static DOMParser parser = new DOMParser();
  
  // namespace support is turned off by default in Xerces
  static {
    try {
      parser.setFeature("http://xml.org/sax/features/namespaces", 
       true); 
    }
    catch (Exception e) {
      System.err.println(e);
    }
  }
  
  private static Vector visited = new Vector();
  
  private static int maxDepth = 5;
  private static int currentDepth = 0; 
  
  public static void listURIs(String systemId) {
    
    currentDepth++;
    try {
      if (currentDepth < maxDepth) {
        parser.parse(systemId);
     
        Document document = parser.getDocument();   
    
        if (robotsAllowed(document)) {
          Vector uris = new Vector();
          // search the document for uris,
          // store them in vector, print them
          searchForURIs(document.getDocumentElement(), uris);
    
          Enumeration e = uris.elements();
          while (e.hasMoreElements()) {
            String uri = (String) e.nextElement();
            visited.addElement(uri);
            listURIs(uri); 
          }
          
        }
      
      }
    
    }
    catch (SAXException e) {
      // couldn't load the document, 
      // probably not well-formed XML, skip it 
    }
    catch (IOException e) {
      // couldn't load the document, 
      // likely network failure, skip it 
    }
    finally { 
      currentDepth--;
      System.out.flush();     
    }
      
  }
  
  public static boolean robotsAllowed(Document document) {
    
    NodeList children = document.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node n = children.item(i);
      if (n instanceof ProcessingInstruction) {
        ProcessingInstruction pi = (ProcessingInstruction) n;
        if (pi.getTarget().equals("robots")) {
          String data = pi.getData();
          if (data.indexOf("follow=\"no\"") >= 0) {
            return false; 
          } 
        }
      }
    }
    
    return true;
    
  }
  
  
  // use recursion 
  public static void searchForURIs(Element element, Vector uris) {
    
    // look for XLinks in this element
    String uri = element.getAttributeNS("http://www.w3.org/1999/xlink", "href");

    if (uri != null && !uri.equals("") 
         && !visited.contains(uri) 
         && !uris.contains(uri)) {
      System.out.println(uri);
      uris.addElement(uri);
    }
    
    // process child elements recursively
    NodeList children = element.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node n = children.item(i);
      if (n instanceof Element) {
        searchForURIs((Element) n, uris);
      } 
    }
    
  }
  

  public static void main(String[] args) {
    
    if (args.length == 0) {
      System.out.println("Usage: java PoliteDOMSpider URL1 URL2..."); 
    } 
      
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      try {
        listURIs(args[i]);
      }
      catch (Exception e) {
        System.err.println(e);
        e.printStackTrace(); 
      }
      
    } // end for
  
  } // end main

} // end PoliteDOMSpider

Comment Nodes

Represents a comment like this example from the XML 1.0 spec:

<!--* N.B. some readers (notably JC) find the following
paragraph awkward and redundant.  I agree it's logically redundant:
it *says* it is summarizing the logical implications of
matching the grammar, and that means by definition it's
logically redundant.  I don't think it's rhetorically
redundant or unnecessary, though, so I'm keeping it.  It
could however use some recasting when the editors are feeling
stronger. -MSM *-->

No children

The Comment Interface

package org.w3c.dom;

public interface Comment extends CharacterData {
}

Comment Example

import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;


public class DOMCommentReader {

  public static void main(String[] args) {
     
    DOMParser parser = new DOMParser();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document d = parser.getDocument();
        processNode(d);
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

  // note use of recursion
  public static void processNode(Node node) {
    
    int type = node.getNodeType();
    if (type == Node.COMMENT_NODE) {
      System.out.println(node.getNodeValue());
      System.out.println();
    }
    else {
      if (node.hasChildNodes()) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
          processNode(children.item(i));
        } 
      }
    }
    
  }

}

DOMCommentReader Output

% java DOMCommentReader hotcop.xml
 The publisher is actually Polygram but I needed
       an example of a general entity reference.

 You can tell what album I was
     listening to when I wrote this example

Or try http://www.w3.org/TR/1998/REC-xml-19980210.xml for more interesting output

Entity Nodes

Represents an actual entity, not an entity reference!
Contains:
- Element nodes
- ProcessingInstruction nodes
- Comment nodes
- Text nodes
- CDATASection nodes
- EntityReference nodes

The Entity Interface

package org.w3c.dom;

public interface Entity extends Node {

  public String  getPublicId();
  public String  getSystemId();
  public String  getNotationName();
  
}

DOMException

A runtime exception but you should catch it
Error code gives more detailed information:

DOMException.INDEX_SIZE_ERR
Index or size is negative, or greater than the allowed value

DOMException.DOMSTRING_SIZE_ERR
The specified range of text does not fit into a String

DOMException.HIERARCHY_REQUEST_ERR
Attempt to insert a node somewhere it doesn't belong

DOMException.WRONG_DOCUMENT_ERR
If a node is used in a different document than the one that created it (that doesn't support it)

DOMException.INVALID_CHARACTER_ERR
An invalid or illegal character is specified, such as in a name.

DOMException.NO_DATA_ALLOWED_ERR
Attempt to add data to a node which does not support data

DOMException.NO_MODIFICATION_ALLOWED_ERR
Attempt to modify a read-only object

DOMException.NOT_FOUND_ERR
Attempt to reference a node in a context where it does not exist

DOMException.NOT_SUPPORTED_ERR
The implementation does not support the type of object requested

DOMException.INUSE_ATTRIBUTE_ERR
Attempt to add an attribute to an element that already has that attribute

DOMException.INVALID_STATE_ERR
An attempt is made to use an object that is not, or no longer, usable.

DOMException.SYNTAX_ERR
An invalid or illegal string is specified.

DOMException.INVALID_MODIFICATION_ERR
An attempt to modify the type of the underlying object.

DOMException.NAMESPACE_ERR
An attempt is made to create or change an object in a way which is incorrect with regard to namespaces.

DOMException.INVALID_ACCESS_ERR
A parameter or an operation is not supported by the underlying object.
Current value accessible from the public code field

Questions?

The org.w3c.dom.traversal Package

Four interfaces:

DocumentTraversal
NodeFilter
NodeIterator
TreeWalker

NodeIterator

package org.w3c.dom.traversal;

public interface NodeIterator {

  public int        getWhatToShow();
  public NodeFilter getFilter();
  public boolean    getExpandEntityReferences();
  public Node       nextNode() throws DOMException;
  public Node       previousNode() throws DOMException;
  public void       detach();
    
}

ValueReporter

import org.apache.xerces.parsers.*;
import org.apache.xerces.dom.*;
import org.w3c.dom.*;
import org.w3c.dom.traversal.*;
import org.xml.sax.*;
import java.io.*;


public class ValueReporter {

  public static void main(String[] args) {
     
    DOMParser parser = new DOMParser();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document doc = parser.getDocument();
        DocumentImpl impl = (DocumentImpl) doc;
        NodeIterator iterator = impl.createNodeIterator(
         doc.getDocumentElement(), NodeFilter.SHOW_ALL, null, true
        );
        Node node;
        while ((node = iterator.nextNode()) != null) {
          processNode(node);      
        }
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

  public static void processNode(Node node) {
    
    String name = node.getNodeName();
    String type = getTypeName(node.getNodeType());
    String value = node.getNodeValue();
    System.out.println("Type " + type + ": " + name 
     + " \"" + value + "\"");
    
  }
  
  public static String getTypeName(int type) {
    
    switch (type) {
      case Node.ELEMENT_NODE: 
        return "Element";
      case Node.ATTRIBUTE_NODE: 
        return "Attribute";
      case Node.TEXT_NODE: 
        return "Text";
      case Node.CDATA_SECTION_NODE: 
        return "CDATA Section";
      case Node.ENTITY_REFERENCE_NODE: 
        return "Entity Reference";
      case Node.ENTITY_NODE: 
        return "Entity";
      case Node.PROCESSING_INSTRUCTION_NODE: 
        return "Processing Instruction";
      case Node.COMMENT_NODE: 
        return "Comment";
      case Node.DOCUMENT_NODE: 
        return "Document";
      case Node.DOCUMENT_TYPE_NODE: 
        return "Document Type Declaration";
      case Node.DOCUMENT_FRAGMENT_NODE: 
        return "Document Fragment";
      case Node.NOTATION_NODE: 
        return "Notation";
      default: 
        return "Unknown Type"; 
    }
    
  }

}

ValueReporter Output

% java ValueReporter hotcop.xml
Type Element: SONG "null"
Type Text: #text "
  "
Type Element: TITLE "null"
Type Text: #text "Hot Cop"
Type Text: #text "
  "
Type Element: PHOTO "null"
Type Text: #text "
  "
Type Element: COMPOSER "null"
Type Text: #text "Jacques Morali"
Type Text: #text "
  "
Type Element: COMPOSER "null"
Type Text: #text "Henri Belolo"
Type Text: #text "
  "
Type Element: COMPOSER "null"
Type Text: #text "Victor Willis"
Type Text: #text "
  "
Type Element: PRODUCER "null"
Type Text: #text "Jacques Morali"
Type Text: #text "
  "
Type Comment: #comment " The publisher is actually Polygram but I needed
       an example of a general entity reference. "
Type Text: #text "
  "
Type Element: PUBLISHER "null"
Type Text: #text "
    A & M Records
  "
Type Text: #text "
  "
Type Element: LENGTH "null"
Type Text: #text "6:20"
Type Text: #text "
  "
Type Element: YEAR "null"
Type Text: #text "1978"
Type Text: #text "
  "
Type Element: ARTIST "null"
Type Text: #text "Village People"
Type Text: #text "
"

Attributes are missing from this output. They are not children. They are properties of nodes.

NodeFilter

package org.w3c.dom.traversal;

public interface NodeFilter {

  // Constants returned by acceptNode
  public static final short FILTER_ACCEPT             = 1;
  public static final short FILTER_REJECT             = 2;
  public static final short FILTER_SKIP               = 3;

  // Constants for whatToShow
  public static final int   SHOW_ALL                  = 0x0000FFFF;
  public static final int   SHOW_ELEMENT              = 0x00000001;
  public static final int   SHOW_ATTRIBUTE            = 0x00000002;
  public static final int   SHOW_TEXT                 = 0x00000004;
  public static final int   SHOW_CDATA_SECTION        = 0x00000008;
  public static final int   SHOW_ENTITY_REFERENCE     = 0x00000010;
  public static final int   SHOW_ENTITY               = 0x00000020;
  public static final int   SHOW_PROCESSING_INSTRUCTION = 0x00000040;
  public static final int   SHOW_COMMENT              = 0x00000080;
  public static final int   SHOW_DOCUMENT             = 0x00000100;
  public static final int   SHOW_DOCUMENT_TYPE        = 0x00000200;
  public static final int   SHOW_DOCUMENT_FRAGMENT    = 0x00000400;
  public static final int   SHOW_NOTATION             = 0x00000800;

  public short        acceptNode(Node n);
    
}

DOM based TagStripper

import org.apache.xerces.parsers.*;
import org.apache.xerces.dom.*;
import org.w3c.dom.*;
import org.w3c.dom.traversal.*;
import org.xml.sax.SAXException;
import java.io.IOException;


public class DOMTagStripper {

  public static void main(String[] args) {
     
    DOMParser parser = new DOMParser();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document doc = parser.getDocument();
        DocumentImpl impl = (DocumentImpl) doc;
        NodeIterator iterator = impl.createNodeIterator(
         doc.getDocumentElement(), NodeFilter.SHOW_TEXT, null, true
        );
        Node node;
        while ((node = iterator.nextNode()) != null) {
          System.out.print(node.getNodeValue());      
        }
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

}

Output from a DOM based TagStripper

% java DOMTagStripper hotcop.xml

  Hot Cop
  Jacques Morali
  Henri Belolo
  Victor Willis
  Jacques Morali

  A & M Records
  6:20
  1978
  Village People

Questions?

Writing XML Documents with DOM

DOM is for both input and output
New documents are created with a parser-specific API or JAXP
A serializer + output format converts the DOM to a byte stream

org.apache.xerces.dom.DOMImplementationImpl

A Xerces-specific class used to create new DOM documents

package org.apache.xerces.dom;

public class DOMImplementationImpl implements DOMImplementation {

  public boolean hasFeature(String feature, String version) 
  
  public static DOMImplementation getDOMImplementation()
  
  public DocumentType createDocumentType(String qualifiedName, 
   String publicID, String systemID, String internalSubset)
                                          
  public Document createDocument(String namespaceURI, 
   String qualifiedName, DocumentType doctype)
   throws DOMException

}

A Xerces/DOM program that writes Fibonacci numbers into an XML document

import java.math.BigInteger;
import java.io.*;
import org.w3c.dom.*;
import org.apache.xerces.dom.*;


public class FibonacciDOM {

  public static void main(String[] args) {

    try {

      DOMImplementation impl 
       = DOMImplementationImpl.getDOMImplementation();

      Document fibonacci 
       = impl.createDocument(null, "Fibonacci_Numbers", null);

      BigInteger low  = BigInteger.ONE;
      BigInteger high = BigInteger.ONE;

      Element root = fibonacci.getDocumentElement();

      for (int i = 1; i <= 25; i++) {
        Element number = fibonacci.createElement("fibonacci");
        number.setAttribute("index", Integer.toString(i));
        Text text = fibonacci.createTextNode(low.toString());
        number.appendChild(text);
        root.appendChild(number);

        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }

      // Now the document has been created and exists in memory
    }
    catch (DOMException e) {
      e.printStackTrace();
    }

  }

}

A JAXP/DOM program that writes Fibonacci numbers into an XML document

import java.math.BigInteger;
import java.io.*;
import org.w3c.dom.*;
import javax.xml.parsers.*;


public class FibonacciJAXP {

  public static void main(String[] args) {

    try {       
      DocumentBuilderFactory factory 
       = DocumentBuilderFactory.newInstance();
      DocumentBuilder builder = factory.newDocumentBuilder();
      DOMImplementation impl = builder.getDOMImplementation();

      Document fibonacci 
       = impl.createDocument(null, "Fibonacci_Numbers", null);

      BigInteger low  = BigInteger.ONE;
      BigInteger high = BigInteger.ONE;

      Element root = fibonacci.getDocumentElement();

      for (int i = 1; i <= 25; i++) {
        Element number = fibonacci.createElement("fibonacci");
        number.setAttribute("index", Integer.toString(i));
        Text text = fibonacci.createTextNode(low.toString());
        number.appendChild(text);
        root.appendChild(number);

        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }

      // Now the document has been created and exists in memory
    }
    catch (DOMException e) {
      e.printStackTrace();
    }
    catch (ParserConfigurationException e) {
      System.err.println("You need to install a JAXP aware DOM implementation.");
    }
    
  }

}

Serialization

The process of taking an in-memory DOM tree and converting it to a stream of characters that can be written onto an output stream
Not a standard part of DOM Level 2
The org.apache.xml.serialize package:

A DOM program that writes Fibonacci numbers onto System.out

import java.math.BigInteger;
import java.io.*;
import org.w3c.dom.*;
import org.apache.xerces.dom.*;
import org.apache.xml.serialize.*; 


public class FibonacciDOMSerializer {

  public static void main(String[] args) {
   
    try {
      
      DOMImplementation impl 
       = DOMImplementationImpl.getDOMImplementation();

      Document fibonacci 
       = impl.createDocument(null, "Fibonacci_Numbers", null);
      
      BigInteger low  = BigInteger.ONE;
      BigInteger high = BigInteger.ONE;      
      
      Element root = fibonacci.getDocumentElement(); 

      for (int i = 1; i <= 25; i++) {
        Element number = fibonacci.createElement("fibonacci");
        number.setAttribute("index", Integer.toString(i));
        Text text = fibonacci.createTextNode(low.toString());
        number.appendChild(text);
        root.appendChild(number);
        
        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }
      
      try {
        // Now that the document is created we need to *serialize* it
        OutputFormat format = new OutputFormat(fibonacci);
        XMLSerializer serializer 
         = new XMLSerializer(System.out, format);
        serializer.serialize(fibonacci);
      }
      catch (IOException e) {
        System.err.println(e); 
      }
    }
    catch (DOMException e) {
      e.printStackTrace();
    }

  }

}

fibonacci.xml

<?xml version="1.0" encoding="UTF-8"?>
<Fibonacci_Numbers><fibonacci index="1">1</fibonacci><fibonacci index="2">1</fibonacci><fibonacci index="3">2</fibonacci><fibonacci index="4">3</fibonacci><fibonacci index="5">5</fibonacci><fibonacci index="6">8</fibonacci><fibonacci index="7">13</fibonacci><fibonacci index="8">21</fibonacci><fibonacci index="9">34</fibonacci><fibonacci index="10">55</fibonacci><fibonacci index="11">89</fibonacci><fibonacci index="12">144</fibonacci><fibonacci index="13">233</fibonacci><fibonacci index="14">377</fibonacci><fibonacci index="15">610</fibonacci><fibonacci index="16">987</fibonacci><fibonacci index="17">1597</fibonacci><fibonacci index="18">2584</fibonacci><fibonacci index="19">4181</fibonacci><fibonacci index="20">6765</fibonacci><fibonacci index="21">10946</fibonacci><fibonacci index="22">17711</fibonacci><fibonacci index="23">28657</fibonacci><fibonacci index="24">46368</fibonacci><fibonacci index="25">75025</fibonacci></Fibonacci_Numbers>

OutputFormat

package org.apache.xml.serialize;

public class OutputFormat extends Object {

  public OutputFormat()
  public OutputFormat(String method, 
   String encoding, boolean indenting)
  public OutputFormat(Document doc)
  public OutputFormat(Document doc, 
   String encoding, boolean indenting)
  
  public String   getMethod()
  public void     setMethod(String method)
  public String   getVersion()
  public void     setVersion(String version)
  public int      getIndent()
  public boolean  getIndenting()
  public void     setIndent(int indent)
  public void     setIndenting(boolean on)
  public String   getEncoding()
  public void     setEncoding(String encoding)
  public String   getMediaType()
  public void     setMediaType(String mediaType)
  public void     setDoctype(String publicID, String systemID)
  public String   getDoctypePublic()
  public String   getDoctypeSystem()
  public boolean  getOmitXMLDeclaration()
  public void     setOmitXMLDeclaration(boolean omit)
  public boolean  getStandalone()
  public void     setStandalone(boolean standalone)
  public String[] getCDataElements()
  public boolean  isCDataElement(String tagName)
  public void     setCDataElements(String[] cdataElements)
  public String[] getNonEscapingElements()
  public boolean  isNonEscapingElement(String tagName)
  public void     setNonEscapingElements(String[] nonEscapingElements)
  public String   getLineSeparator()
  public void     setLineSeparator(String lineSeparator)
  public boolean  getPreserveSpace()
  public void     setPreserveSpace(boolean preserve)
  public int      getLineWidth()
  public void     setLineWidth(int lineWidth)
  public char     getLastPrintable()
  
  public static String whichMethod(Document doc)
  public static String whichDoctypePublic(Document doc)
  public static String whichDoctypeSystem(Document doc)
  public static String whichMediaType(String method)
  
}

Better formatted output

Latin-1 encoding
Indentation
Word wrapping
Document type declaration

 try {
  // Now that the document is created we need to *serialize* it
  OutputFormat format = new OutputFormat(fibonacci, "8859_1", true);
  format.setLineSeparator("\r\n");
  format.setLineWidth(72);
  format.setDoctype(null, "fibonacci.dtd");
  XMLSerializer serializer = new XMLSerializer(System.out, format);
  serializer.serialize(root);
}
catch (IOException e) {
  System.err.println(e); 
}

Question: Why won't this let us add an xml-stylesheet directive?

formatted_fibonacci.xml

<?xml version="1.0" encoding="8859_1"?>
<!DOCTYPE Fibonacci_Numbers SYSTEM "fibonacci.dtd">
<Fibonacci_Numbers>
    <fibonacci index="1">1</fibonacci>
    <fibonacci index="2">1</fibonacci>
    <fibonacci index="3">2</fibonacci>
    <fibonacci index="4">3</fibonacci>
    <fibonacci index="5">5</fibonacci>
    <fibonacci index="6">8</fibonacci>
    <fibonacci index="7">13</fibonacci>
    <fibonacci index="8">21</fibonacci>
    <fibonacci index="9">34</fibonacci>
    <fibonacci index="10">55</fibonacci>
    <fibonacci index="11">89</fibonacci>
    <fibonacci index="12">144</fibonacci>
    <fibonacci index="13">233</fibonacci>
    <fibonacci index="14">377</fibonacci>
    <fibonacci index="15">610</fibonacci>
    <fibonacci index="16">987</fibonacci>
    <fibonacci index="17">1597</fibonacci>
    <fibonacci index="18">2584</fibonacci>
    <fibonacci index="19">4181</fibonacci>
    <fibonacci index="20">6765</fibonacci>
    <fibonacci index="21">10946</fibonacci>
    <fibonacci index="22">17711</fibonacci>
    <fibonacci index="23">28657</fibonacci>
    <fibonacci index="24">46368</fibonacci>
    <fibonacci index="25">75025</fibonacci>
</Fibonacci_Numbers>

DOM based XMLPrettyPrinter

import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;
import org.apache.xerces.dom.*;
import org.apache.xml.serialize.*; 


public class DOMPrettyPrinter {

  public static void main(String[] args) { 
     
    DOMParser parser = new DOMParser();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document document = parser.getDocument();
        
        OutputFormat format 
         = new OutputFormat(document, "UTF-8", true);
        format.setLineSeparator("\r\n");
        format.setIndenting(true);
        format.setIndent(2);
        format.setLineWidth(72);
        format.setPreserveSpace(false);
        XMLSerializer serializer 
         = new XMLSerializer(System.out, format);
        serializer.serialize(document);     
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

}

Output from a DOM based XMLPrettyPrinter

<?xml version="1.0" encoding="UTF-8"?>
<!-- <!DOCTYPE foo SYSTEM "http://msdn.microsoft.com/xml/general/htmlentities.dtd"> -->
<weblogs>
  <log>
    <name>MozillaZine</name>
    <url>http://www.mozillazine.org</url>
    <changesUrl>http://www.mozillazine.org/contents.rdf</changesUrl>
    <ownerName>Jason Kersey</ownerName>
    <ownerEmail>kerz@en.com</ownerEmail>
    <description>THE source for news on the Mozilla Organization.
      DevChats, Reviews, Chats, Builds, Demos, Screenshots, and more.</description>
    <imageUrl/>
    <adImageUrl>http://static.userland.com/weblogMonitor/ads/kerz@en.com.gif</adImageUrl>
  </log>
  <log>
    <name>SalonHerringWiredFool</name>
    <url>http://www.salonherringwiredfool.com/</url>
    <ownerName>Some Random Herring</ownerName>
    <ownerEmail>salonfool@wiredherring.com</ownerEmail>
    <description/>
  </log>
  <log>
    <name>Scripting News</name>
    <url>http://www.scripting.com/</url>
    <ownerName>Dave Winer</ownerName>
    <ownerEmail>dave@userland.com</ownerEmail>
    <description>News and commentary from the cross-platform scripting community.</description>
    <imageUrl>http://www.scripting.com/gifs/tinyScriptingNews.gif</imageUrl>
    <adImageUrl>http://static.userland.com/weblogMonitor/ads/dave@userland.com.gif</adImageUrl>
  </log>
  <log>
    <name>SlashDot.Org</name>
    <url>http://www.slashdot.org/</url>
    <ownerName>Simply a friend</ownerName>
    <ownerEmail>afriendofweblogs@weblogs.com</ownerEmail>
    <description>News for Nerds, Stuff that Matters.</description>
  </log>
</weblogs>

The point is this:

Using the DOM to write documents automatically maintains well-formedness constraints
Validity is not automatically maintained.

Questions?

To Learn More

This presentation: http://www.cafeconleche.org/slides/sd2002east/dom/
XML in a Nutshell, 2nd Edition
- Elliotte Rusty Harold and W. Scott Means
- O'Reilly & Associates, 2002
- ISBN 0-596-00292-0
Processing XML with Java
- Elliotte Rusty Harold
- Addison-Wesley, 2002
- ISBN 0-201-77186-1
- Online: http://www.cafeconleche.org/books/xmljava
Who's Jon Johansen? http://www.eff.org/IP/DeCSS_prosecutions/Johansen_DeCSS_case/

Questions?

Index | Cafe con Leche