DOM

DOM

Elliotte Rusty Harold

SD2000 East

Monday November 1, 2000

elharo@metalab.unc.edu

http://metalab.unc.edu/xml/


Where we're going


Relevant Standards


DOM is easy


Prerequisites


A simple example

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/css" href="song.css"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<SONG xmlns="http://metalab.unc.edu/xml/namespace/song"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <TITLE>Hot Cop</TITLE>
  <PHOTO 
    xlink:type="simple" xlink:show="onLoad" xlink:href="hotcop.jpg"
    ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200"/>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <!-- The publisher is actually Polygram but I needed 
       an example of a general entity reference. -->
  <PUBLISHER xlink:type="simple" xlink:href="http://www.amrecords.com/">
    A &amp; M Records
  </PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>
<!-- You can tell what album I was 
     listening to when I wrote this example -->
View in Browser

Markup and Character Data


Markup and Character Data Example

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/css" href="song.css"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<SONG xmlns="http://metalab.unc.edu/xml/namespace/song"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <TITLE>Hot Cop</TITLE>
  <PHOTO 
    xlink:type="simple" xlink:show="onLoad" xlink:href="hotcop.jpg"
    ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200"/>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <!-- The publisher is actually Polygram but I needed 
       an example of a general entity reference. -->
  <PUBLISHER xlink:type="simple" xlink:href="http://www.amrecords.com/">
    A &amp; M Records
  </PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>
<!-- You can tell what album I was 
     listening to when I wrote this example -->

Entities


Parsed Character Data


CDATA sections


Comments


Processing Instructions


The XML Declaration

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

Document Type Declaration

<!DOCTYPE SONG SYSTEM "song.dtd">


Document Type Definition

<!ELEMENT SONG (TITLE, COMPOSER+, PRODUCER*, 
 PUBLISHER*, YEAR?, LENGTH?, ARTIST+)>

<!ELEMENT TITLE (#PCDATA)>

<!ELEMENT COMPOSER (#PCDATA)>
<!ELEMENT PRODUCER (#PCDATA)>
<!ELEMENT PUBLISHER (#PCDATA)>
<!ELEMENT LENGTH (#PCDATA)>
<!-- This should be a four digit year like "1999",
     not a two-digit year like "99" -->
<!ELEMENT YEAR (#PCDATA)>

<!ELEMENT ARTIST (#PCDATA)>

XML Names


XML Namespaces


Namespace Syntax


Namespace URIs


Binding Prefixes to Namespace URIs


The Default Namespace


How Parsers Handle Namespaces


Canonical XML


Trees


Document Object Model


DOM Evolution


Eight Modules:


Which modules and features are supported?


Which modules are supported?

import org.apache.xerces.dom.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;


public class ModuleChecker {

  public static void main(String[] args) {
     
    // parser dependent
    DOMImplementation implementation 
     = DOMImplementationImpl.getDOMImplementation();
    String[] features = {"XML", "HTML", "Views", "StyleSheets",
     "CSS", "CSS2", "Events", "UIEvents", "MouseEvents", 
     "MutationEvents", "HTMLEvents", "Traversal", "Range"};
    
    for (int i = 0; i < features.length; i++) {
      if (implementation.hasFeature(features[i], "2.0")) {
        System.out.println("Implementation supports " 
         + features[i]);
      } 
      else {
        System.out.println("Implementation does not support " 
         + features[i]);
      } 
    }
  
  }

}

Which modules are supported? Results

D:\speaking\SD2000 East\dom\examples>java ModuleChecker
Implementation supports XML
Implementation does not support HTML
Implementation does not support Views
Implementation does not support StyleSheets
Implementation does not support CSS
Implementation does not support CSS2
Implementation supports Events
Implementation does not support UIEvents
Implementation does not support MouseEvents
Implementation supports MutationEvents
Implementation does not support HTMLEvents
Implementation supports Traversal
Implementation does not support Range

DOM Trees


org.w3c.dom


DOM Parsers for Java


The DOM Process


Parsing documents with a DOM Parser Example

import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;

public class DOMParserMaker {

  public static void main(String[] args) {
     
    // This is simpler but less flexible than the SAX approach.
    // Perhaps a good creational design pattern is needed here?   
  
    DOMParser parser = new DOMParser();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document d = parser.getDocument();
        // work with the document...
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
   
  }

}

The Node Interface

package org.w3c.dom;

public interface Node {

  // NodeType
  public static final short ELEMENT_NODE                = 1;
  public static final short ATTRIBUTE_NODE              = 2;
  public static final short TEXT_NODE                   = 3;
  public static final short CDATA_SECTION_NODE          = 4;
  public static final short ENTITY_REFERENCE_NODE       = 5;
  public static final short ENTITY_NODE                 = 6;
  public static final short PROCESSING_INSTRUCTION_NODE = 7;
  public static final short COMMENT_NODE                = 8;
  public static final short DOCUMENT_NODE               = 9;
  public static final short DOCUMENT_TYPE_NODE          = 10;
  public static final short DOCUMENT_FRAGMENT_NODE      = 11;
  public static final short NOTATION_NODE               = 12;

  public String       getNodeName();
  public String       getNodeValue() throws DOMException;
  public void         setNodeValue(String nodeValue) throws DOMException;
  public short        getNodeType();
  public Node         getParentNode();
  public NodeList     getChildNodes();
  public Node         getFirstChild();
  public Node         getLastChild();
  public Node         getPreviousSibling();
  public Node         getNextSibling();
  public NamedNodeMap getAttributes();
  public Document     getOwnerDocument();
  public Node         insertBefore(Node newChild, Node refChild) throws DOMException;
  public Node         replaceChild(Node newChild, Node oldChild) throws DOMException;
  public Node         removeChild(Node oldChild) throws DOMException;
  public Node         appendChild(Node newChild) throws DOMException;
  public boolean      hasChildNodes();
  public Node         cloneNode(boolean deep);
  public void         normalize();
  public boolean      supports(String feature, String version);
  public String       getNamespaceURI();
  public String       getPrefix();
  public void         setPrefix(String prefix) throws DOMException;
  public String       getLocalName();
}

The NodeList Interface

package org.w3c.dom;

public interface NodeList {
  public Node item(int index);
  public int  getLength();
}

Now we're really ready to read a document


Node Iterator

import org.w3c.dom.*;


// Depth first search of DOM Tree
public abstract class NodeIterator {

  // note use of recursion
  public void followNode(Node node) {
    
    processNode(node);
    if (node.hasChildNodes()) {
      NodeList children = node.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        followNode(children.item(i));
      } 
    }
    
  }
  
  // Override this method to do something as each node is visited
  protected abstract void processNode(Node node);
    // I could make processNode() a separate method in 
    // a NodeProcessor interface, and make followNode static
    // but I wanted to keep this example simple.

}

Node Reporter

import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;


public class NodeReporter extends NodeIterator {

  public static void main(String[] args) {
     
    DOMParser parser = new DOMParser();
    NodeIterator iterator = new NodeReporter();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document d = parser.getDocument();
        iterator.followNode(d);
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

  public void processNode(Node node) {
    
    String name = node.getNodeName();
    String type = getTypeName(node.getNodeType());
    System.out.println("Type " + type + ": " + name);
    
  }
  
  public static String getTypeName(int type) {
    
    switch (type) {
      case Node.ELEMENT_NODE: return "Element";
      case Node.ATTRIBUTE_NODE: return "Attribute";
      case Node.TEXT_NODE: return "Text";
      case Node.CDATA_SECTION_NODE: return "CDATA Section";
      case Node.ENTITY_REFERENCE_NODE: return "Entity Reference";
      case Node.ENTITY_NODE: return "Entity";
      case Node.PROCESSING_INSTRUCTION_NODE: return "processing Instruction";
      case Node.COMMENT_NODE : return "Comment";
      case Node.DOCUMENT_NODE: return "Document";
      case Node.DOCUMENT_TYPE_NODE: return "Document Type Declaration";
      case Node.DOCUMENT_FRAGMENT_NODE: return "Document Fragment";
      case Node.NOTATION_NODE: return "Notation";
      default: return "Unknown Type"; 
    }
    
  }

}

Node Reporter Output

D:\speaking\SD2000 East\dom\examples>java NodeReporter hotcop.xml
Type Document: #document
Type processing Instruction: xml-stylesheet
Type Document Type Declaration: SONG
Type Element: SONG
Type Text: #text
Type Element: TITLE
Type Text: #text
Type Text: #text
Type Element: COMPOSER
Type Text: #text
Type Text: #text
Type Element: COMPOSER
Type Text: #text
Type Text: #text
Type Element: COMPOSER
Type Text: #text
Type Text: #text
Type Element: PRODUCER
Type Text: #text
Type Text: #text
Type Comment: #comment
Type Text: #text
Type Element: PUBLISHER
Type Text: #text
Type Text: #text
Type Text: #text
Type Text: #text
Type Element: LENGTH
Type Text: #text
Type Text: #text
Type Element: YEAR
Type Text: #text
Type Text: #text
Type Element: ARTIST
Type Text: #text
Type Text: #text
Type Comment: #comment

Attributes are missing from this output. They are not nodes. They are properties of nodes.


Node Values as returned by getNodeValue()

Node TypeNode Value
element nodenull
attribute nodeattribute value
text nodetext of the node
CDATA section nodetext of the section
entity reference nodenull
entity node is null
processing instruction nodecontent of the processing instruction, not including the target
comment nodetext of the comment
document nodenull
document type declaration nodenull
document fragment nodenull
notation nodenull

DOM based TagStripper

import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;


public class DOMTagStripper extends NodeIterator {

  public static void main(String[] args) {
     
    DOMParser parser = new DOMParser();
    NodeIterator iterator = new DOMTagStripper();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document d = parser.getDocument();
        iterator.followNode(d);
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

  public void processNode(Node node) {
    
    int type = node.getNodeType();
    if (type == Node.TEXT_NODE) {
      System.out.print(node.getNodeValue());
    }
    
  }

}

Output from a DOM based TagStripper

D:\speaking\SD2000 East\dom\examples>java DOMTagStripper hotcop.xml

  Hot Cop
  Jacques Morali
  Henri Belolo
  Victor Willis
  Jacques Morali

  A & M Records
  6:20
  1978
  Village People

The Document Node


The Document Interface

package org.w3c.dom;

  public interface Document extends Node {
  
    public DocumentType      getDoctype();
    public DOMImplementation getImplementation();
    public Element           getDocumentElement();
    public Element           createElement(String tagName) throws DOMException;
    public Element           createElementNS(String namespaceURI, String qualifiedName) throws DOMException;
    public DocumentFragment  createDocumentFragment();
    public Text              createTextNode(String data);
    public Comment           createComment(String data);
    public CDATASection      createCDATASection(String data) throws DOMException;
    public ProcessingInstruction createProcessingInstruction(String target, String data)
     throws DOMException;
    public Attr            createAttribute(String name) throws DOMException;
    public Attr            createAttributeNS(String namespaceURI, String qualifiedName) throws DOMException;
    public EntityReference createEntityReference(String name) throws DOMException;
    public NodeList        getElementsByTagName(String tagname);
    public NodeList        getElementsByTagNameNS(String namespaceURI, String localName);
    public Element         getElementById(String elementId);
    public Node            importNode(Node importedNode, boolean deep) throws DOMException;
    
}

Grab HeadLines

import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;


public class HeadlineGrabber  {

  public static void main(String[] args) {
     
    DOMParser parser = new DOMParser();
    
    try {
      // Read the entire document into memory
      parser.parse("http://static.userland.com/myUserLandMisc/currentStories.xml"); 
     
      Document d = parser.getDocument();
      NodeList headlines = d.getElementsByTagName("storyText");
      for (int i = 0; i < headlines.getLength(); i++) {
        NodeList storyText = headlines.item(i).getChildNodes();
        for (int j = 0; j < storyText.getLength(); j++) {
          Node textContent = storyText.item(j);
          System.out.print(textContent.getNodeValue());
        }
        System.out.println();
        System.out.println();
        System.out.flush();
      }
    }
    catch (SAXException e) {
      System.err.println(e); 
    }
    catch (IOException e) {
      System.err.println(e); 
    }
  
  } // end main

}
View Output in Browser

Element Nodes


The Element Interface

package org.w3c.dom;

public interface Element extends Node {

  public String    getTagName();
  public String    getAttribute(String name);
  public void      setAttribute(String name, String value) throws DOMException;
  public void      removeAttribute(String name) throws DOMException;
  public Attr      getAttributeNode(String name);
  public Attr      setAttributeNode(Attr newAttr) throws DOMException;
  public Attr      removeAttributeNode(Attr oldAttr) throws DOMException;
  public NodeList  getElementsByTagName(String name);
  public String    getAttributeNS(String namespaceURI, String localName);
  public void      setAttributeNS(String namespaceURI, String qualifiedName, String value) throws DOMException;
  public void      removeAttributeNS(String namespaceURI, String localName) throws DOMException;
  public Attr      getAttributeNodeNS(String namespaceURI, String localName);
  public Attr      setAttributeNodeNS(Attr newAttr) throws DOMException;
  public NodeList  getElementsByTagNameNS(String namespaceURI, String localName);

}

IDTagger

import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.IOException;
import org.apache.xml.serialize.*;


public class IDTagger extends NodeIterator {

  int id = 1;

  public void processNode(Node node) {
    
    if (node instanceof Element) {
      
      Element element = (Element) node;
      String currentID = element.getAttribute("ID");
      if (currentID == null || currentID.equals("")) {
        element.setAttribute("ID", "_" + id);
        id = id + 1; 
      }
    }
    
  }

  public static void main(String[] args) {
     
    DOMParser parser = new DOMParser();
    NodeIterator iterator = new IDTagger();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document document = parser.getDocument();
        iterator.followNode(document);
        
        // now we serialize the document...
        OutputFormat format = new OutputFormat(document);
        XMLSerializer serializer 
         = new XMLSerializer(System.out, format);
        serializer.serialize(document);       
        
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main


}
View Output in Browser

CharacterData interface


The CharacterData Interface

package org.w3c.dom;

public interface CharacterData extends Node {

  public String  getData() throws DOMException;
  public void    setData(String data) throws DOMException;
  public int     getLength();
  public String  substringData(int offset, int count) throws DOMException;
  public void    appendData(String arg) throws DOMException;
  public void    insertData(int offset, String arg) throws DOMException;
  public void    deleteData(int offset, int count) throws DOMException;
  public void    replaceData(int offset, int count, String arg) throws DOMException;
  
}

ROT13 XML Text

import org.apache.xerces.parsers.DOMParser;
import org.apache.xml.serialize.*;
import org.w3c.dom.*;
import org.xml.sax.SAXException;
import java.io.IOException;


public class ROT13XML extends NodeIterator {

  int id = 1;

  public void processNode(Node node) {
    
    if (node instanceof CharacterData) {
      
      CharacterData text = (CharacterData) node;
      String data = text.getData();
      text.setData(rot13(data));
    }
    
  }
  
  public static String rot13(String s) {
    
    StringBuffer result = new StringBuffer(s.length());
    for (int i = 0; i < s.length(); i++) {
      int c = s.charAt(i);
      if (c >= 'A' && c <= 'M') result.append((char) (c+13));
      else if (c >= 'N' && c <= 'Z') result.append((char) (c-13));
      else if (c >= 'a' && c <= 'm') result.append((char) (c+13));
      else if (c >= 'n' && c <= 'z') result.append((char) (c-13));
      else result.append((char) c);
      
    } 
    return result.toString();
    
  }

  public static void main(String[] args) {
     
    DOMParser parser = new DOMParser();
    NodeIterator iterator = new ROT13XML();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document document = parser.getDocument();
        iterator.followNode(document);
        
        // now we serialize the document...
        OutputFormat format = new OutputFormat(document);
        XMLSerializer serializer 
         = new XMLSerializer(System.out, format);
        serializer.serialize(document);
               
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

}
View result in browser

Text Nodes


The Text Interface

package org.w3c.dom;

public interface Text extends CharacterData {

  public Text splitText(int offset) throws DOMException;
  
}

DocumentType Nodes


The DocumentType Interface

package org.w3c.dom;

public interface DocumentType extends Node {

  public String       getName();
  public NamedNodeMap getEntities();
  public NamedNodeMap getNotations();
  public String       getPublicId();
  public String       getSystemId();
  public String       getInternalSubset();
  
}

Example of the DocumentType Interface


XHTMLValidator

import org.w3c.dom.*;
import org.apache.xerces.parsers.*; 
import java.io.*;
import org.xml.sax.*;


public class XHTMLValidator {

  public static void main(String[] args) {
    
    for (int i = 0; i < args.length; i++) {
      validate(args[i]);
    }   
    
  }

  private static DOMParser parser = new DOMParser();
  
  static {
    
    // turn on validation
    try {
      parser.setFeature("http://xml.org/sax/features/validation", true);
      parser.setErrorHandler(new ValidityErrorReporter());
    }
    catch (SAXNotRecognizedException e) {
      System.err.println(
       "Installed XML parser cannot validate; checking for well-formedness instead...");
    } 
    catch (SAXNotSupportedException e) {
      System.err.println(
       "Cannot turn on validation here; checking for well-formedness instead...");
    }     
    
  }
  
  // not thread safe
  public static void validate(String source) {
        
    try {

      try {
        parser.parse(source); 
        // ValidityErrorReporter prints any validity errors detected
      }
      catch (SAXException e) {  
        System.out.println(source + " is not well formed."); 
        return; 
      }
      
      // If we get this far, then the document is well-formed XML.
      // Check to see whether the document is actually XHTML    
      Document document = parser.getDocument();
    
      DocumentType doctype = document.getDoctype();
    
      if (doctype == null) {
        System.out.println("No DOCTYPE"); 
        return;
      }

      String name     = doctype.getName();
      String systemID = doctype.getSystemId();
      String publicID = doctype.getPublicId();
      
      if (!name.equals("html")) {
        System.out.println("Incorrect root element name " + name); 
      }
    
      if (publicID == null
       || (!publicID.equals("-//W3C//DTD XHTML 1.0 Strict//EN")
           && !publicID.equals("-//W3C//DTD XHTML 1.0 Transitional//EN")
           && !publicID.equals("-//W3C//DTD XHTML 1.0 Frameset//EN"))) {
        System.out.println(source + " does not seem to use an XHTML 1.0 DTD");
      }
    
      // Check the namespace on the root element
      Element root = document.getDocumentElement();
      String xmlnsValue = root.getAttribute("xmlns");
      if (!xmlnsValue.equals("http://www.w3.org/1999/xhtml")) {
        System.out.println(source 
         + " does not properly declare the http://www.w3.org/1999/xhtml namespace on the root element");        
      }
    
      // get ready for the next parse
      parser.reset();
      
    }
    catch (IOException e) {
      System.err.println("Could not read " + source);
    }
    catch (Exception e) {
      System.err.println(e);
      e.printStackTrace();
    }
    
  }

}

EntityReference Nodes


The EntityReference Interface

package org.w3c.dom;

public interface EntityReference extends Node {

}

Attr Nodes


The Attr Interface

package org.w3c.dom;

public interface Attr extends Node {

  public String   getName();
  public boolean  getSpecified();
  public String   getValue();
  public void     setValue(String value) throws DOMException;
  public Element  getOwnerElement();
  
}

XLinkSpider with DOM

import org.xml.sax.*;
import org.apache.xerces.parsers.*;
import java.io.*;
import java.util.*;
import org.w3c.dom.*;


public class DOMSpider {

  private static DOMParser parser = new DOMParser();
  
  // namespace suport is turned off by default in Xerces
  static {
    try {
      parser.setFeature("http://xml.org/sax/features/namespaces", true); 
    }
    catch (Exception e) {
      System.err.println(e);
    }
  }
  
  private static Vector visited = new Vector();
  
  private static int maxDepth = 5;
  private static int currentDepth = 0; 
  
  public static void listURIs(String systemId) {
    
    currentDepth++;
    try {
      if (currentDepth < maxDepth) {
        parser.parse(systemId);
     
        Document document = parser.getDocument();   
    
        Vector uris = new Vector();
        // search the document for uris, 
        // store them in vector, and print them
        searchForURIs(document.getDocumentElement(), uris);
    
    
        Enumeration e = uris.elements();
        while (e.hasMoreElements()) {
          String uri = (String) e.nextElement();
          visited.addElement(uri);
          listURIs(uri); 
        }
      
      }
    
    }
    catch (SAXException e) {
      // couldn't load the document, 
      // probably not well-formed XML, skip it 
    }
    catch (IOException e) {
      // couldn't load the document, 
      // likely network failure, skip it 
    }
    finally { 
      currentDepth--;
      System.out.flush();     
    }
      
  }
  
  // use recursion 
  public static void searchForURIs(Element element, Vector uris) {
    
    // look for XLinks in this element
    String uri = element.getAttribute("xlink:href");

     // Namespace support seems buggy
//   String uri = element.getAttributeNS("href", "http://www.w3.org/1999/xlink");

    if (uri != null && !uri.equals("") 
         && !visited.contains(uri) 
         && !uris.contains(uri)) {
      System.out.println(uri);
      uris.addElement(uri);
    }
    
    // process child elements recursively
    NodeList children = element.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node n = children.item(i);
      if (n instanceof Element) {
        searchForURIs((Element) n, uris);
      } 
    }
    
  }
  

  public static void main(String[] args) {
    
    if (args.length == 0) {
      System.out.println("Usage: java DOMSpider URL1 URL2..."); 
    } 
      
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      try {
        listURIs(args[i]);
      }
      catch (Exception e) {
        System.err.println(e);
        e.printStackTrace(); 
      }
      
    } // end for
  
  } // end main

} // end DOMSpider

ProcessingInstruction Nodes


The ProcessingInstruction Interface

package org.w3c.dom;

public interface ProcessingInstruction extends Node {

  public String  getTarget();
  public String  getData();
  public void    setData(String data) throws DOMException;
  
}

XLinkSpider that Respects robots processing instruction

import org.xml.sax.*;
import org.apache.xerces.parsers.*;
import java.io.*;
import java.util.*;
import org.w3c.dom.*;


public class PoliteDOMSpider {

  private static DOMParser parser = new DOMParser();
  
  // namespace suport is turned off by default in Xerces
  static {
    try {
      parser.setFeature("http://xml.org/sax/features/namespaces", 
       true); 
    }
    catch (Exception e) {
      System.err.println(e);
    }
  }
  
  private static Vector visited = new Vector();
  
  private static int maxDepth = 5;
  private static int currentDepth = 0; 
  
  public static void listURIs(String systemId) {
    
    currentDepth++;
    try {
      if (currentDepth < maxDepth) {
        parser.parse(systemId);
     
        Document document = parser.getDocument();   
    
        if (robotsAllowed(document)) {
          Vector uris = new Vector();
          // search the document for uris,
          // store them in vector, print them
          searchForURIs(document.getDocumentElement(), uris);
    
          Enumeration e = uris.elements();
          while (e.hasMoreElements()) {
            String uri = (String) e.nextElement();
            visited.addElement(uri);
            listURIs(uri); 
          }
          
        }
      
      }
    
    }
    catch (SAXException e) {
      // couldn't load the document, 
      // probably not well-formed XML, skip it 
    }
    catch (IOException e) {
      // couldn't load the document, 
      // likely network failure, skip it 
    }
    finally { 
      currentDepth--;
      System.out.flush();     
    }
      
  }
  
  public static boolean robotsAllowed(Document document) {
    
    NodeList children = document.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node n = children.item(i);
      if (n instanceof ProcessingInstruction) {
        ProcessingInstruction pi = (ProcessingInstruction) n;
        if (pi.getTarget().equals("robots")) {
          String data = pi.getData();
          if (data.indexOf("follow=\"no\"") >= 0) {
            return false; 
          } 
        }
      }
    }
    
    return true;
    
  }
  
  
  // use recursion 
  public static void searchForURIs(Element element, Vector uris) {
    
    // look for XLinks in this element
    String uri = element.getAttribute("xlink:href");

     // Namespace support seems buggy
//    String uri = element.getAttributeNS("href", "http://www.w3.org/1999/xlink");

    if (uri != null && !uri.equals("") 
         && !visited.contains(uri) 
         && !uris.contains(uri)) {
      System.out.println(uri);
      uris.addElement(uri);
    }
    
    // process child elements recursively
    NodeList children = element.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node n = children.item(i);
      if (n instanceof Element) {
        searchForURIs((Element) n, uris);
      } 
    }
    
  }
  

  public static void main(String[] args) {
    
    if (args.length == 0) {
      System.out.println("Usage: java PoliteDOMSpider URL1 URL2..."); 
    } 
      
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      try {
        listURIs(args[i]);
      }
      catch (Exception e) {
        System.err.println(e);
        e.printStackTrace(); 
      }
      
    } // end for
  
  } // end main

} // end PoliteDOMSpider

Comment Nodes


The Comment Interface

package org.w3c.dom;

public interface Comment extends CharacterData {
}

Comment Example

import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;


public class CommentReader {

  public static void main(String[] args) {
     
    DOMParser parser = new DOMParser();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document d = parser.getDocument();
        processNode(d);
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

  // note use of recursion
  public static void processNode(Node node) {
    
    int type = node.getNodeType();
    if (type == Node.COMMENT_NODE) {
      System.out.println(node.getNodeValue());
      System.out.println();
    }
    else {
      if (node.hasChildNodes()) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
          processNode(children.item(i));
        } 
      }
    }
    
  }



}

CommentReader Output

D:\speaking\SD2000 East\dom\examples>java CommentReader hotcop.xml
 The publisher is actually Polygram but I needed
       an example of a general entity reference.

 You can tell what album I was
     listening to when I wrote this example

Or try http://www.w3.org/TR/1998/REC-xml-19980210.xml for more interesting output


CDATA section Nodes


The CDATASection Interface

package org.w3c.dom;

public interface CDATASection extends Text {
}

Entity Nodes


The Entity Interface

package org.w3c.dom;

public interface Entity extends Node {

  public String  getPublicId();
  public String  getSystemId();
  public String  getNotationName();
  
}

DOMException


Writing XML Documents with DOM


org.apache.xerces.dom.DOMImplementationImpl

package org.apache.xerces.dom;

public class DOMImplementationImpl implements DOMImplementation {

  public boolean hasFeature(String feature, String version) 
  
  public static DOMImplementation getDOMImplementation()
  
  public DocumentType  createDocumentType(String qualifiedName, 
                                          String publicID, 
                                          String systemID,
                                          String internalSubset)
                                          
  public Document createDocument(String namespaceURI, 
                                 String qualifiedName, 
                                 DocumentType doctype)
                                 throws DOMException

} 

A DOM program that writes Fibonacci numbers into an XML document

import java.math.*;
import java.io.*;
import org.w3c.dom.*;
import org.apache.xerces.dom.*;


public class FibonacciDOM {

  public static void main(String[] args) {
   
    try {
      
      DOMImplementationImpl impl = (DOMImplementationImpl) 
       DOMImplementationImpl.getDOMImplementation();
       
      DocumentType type = impl.createDocumentType("Fibonacci_Numbers",
       null, null);
      
      // type is supposed to be able to be null, 
      // but in practice that didn't work                     
      DocumentImpl fibonacci 
       = (DocumentImpl) impl.createDocument(null, "Fibonacci_Numbers", type);
      
      BigInteger low  = BigInteger.ZERO;
      BigInteger high = BigInteger.ONE;      
      
      Element root = fibonacci.createElement("Fibonacci_Numbers");
      // This not only creates the element; it also makes it the
      // root element of the document. 

      for (int i = 0; i < 101; i++) {
        Element number = fibonacci.createElement("fibonacci");
        number.setAttribute("index", Integer.toString(i));
        Text text = fibonacci.createTextNode(low.toString());
        number.appendChild(text);
        root.appendChild(number);
        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }
      
      // Now that the document is created we need to *serialize* it
      
      
 
    }
    catch (DOMException e) {
      e.printStackTrace();
    }

  }

}

Serialization


A DOM program that writes Fibonacci numbers onto System.out

import java.math.*;
import java.io.*;
import org.w3c.dom.*;
import org.apache.xerces.dom.*;
import org.apache.xml.serialize.*;


public class FibonacciDOMSerializer {

  public static void main(String[] args) {

    try {

      DOMImplementationImpl impl = (DOMImplementationImpl)
       DOMImplementationImpl.getDOMImplementation();

      DocumentType type = impl.createDocumentType("Fibonacci_Numbers",
       null, null);

      // type is supposed to be able to be null,
      // but in practice that didn't work
      DocumentImpl fibonacci
       = (DocumentImpl) impl.createDocument(null, "Fibonacci_Numbers", type);

      BigInteger low  = BigInteger.ZERO;
      BigInteger high = BigInteger.ONE;

      Element root = fibonacci.createElement("Fibonacci_Numbers");
      // This not only creates the element; it also makes it the
      // root element of the document.

      for (int i = 0; i <= 25; i++) {
        Element number = fibonacci.createElement("fibonacci");
        number.setAttribute("index", Integer.toString(i));
        Text text = fibonacci.createTextNode(low.toString());
        number.appendChild(text);
        root.appendChild(number);
        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }

      try {
        // Now that the document is created we need to *serialize* it
        OutputFormat format = new OutputFormat(fibonacci);
        XMLSerializer serializer = new XMLSerializer(System.out, format);
        serializer.serialize(root);
      }
      catch (IOException e) {
        System.err.println(e);
      }
    }
    catch (DOMException e) {
      e.printStackTrace();
    }

  }

}

fibonacci.xml

<?xml version="1.0" encoding="UTF-8"?>
<Fibonacci_Numbers><fibonacci index="0">0</fibonacci><fibonacci
index="1">1</fibonacci><fibonacci index="2">1</fibonacci><fibonacci
index="3">2</fibonacci><fibonacci index="4">3</fibonacci><fibonacci
index="5">5</fibonacci><fibonacci index="6">8</fibonacci><fibonacci
index="7">13</fibonacci><fibonacci index="8">21</fibonacci><fibonacci
index="9">34</fibonacci><fibonacci index="10">55</fibonacci><fibonacci
index="11">89</fibonacci><fibonacci index="12">144</fibonacci><fibonacci
index="13">233</fibonacci><fibonacci
index="14">377</fibonacci><fibonacci
index="15">610</fibonacci><fibonacci
index="16">987</fibonacci><fibonacci
index="17">1597</fibonacci><fibonacci
index="18">2584</fibonacci><fibonacci
index="19">4181</fibonacci><fibonacci
index="20">6765</fibonacci><fibonacci
index="21">10946</fibonacci><fibonacci
index="22">17711</fibonacci><fibonacci
index="23">28657</fibonacci><fibonacci
index="24">46368</fibonacci><fibonacci
index="25">75025</fibonacci>
</Fibonacci_Numbers>

OutputFormat

package org.apache.xml.serialize;

public class OutputFormat extends Object {

  public OutputFormat()
  public OutputFormat(String method, String encoding, boolean indenting)
  public OutputFormat(Document doc)
  public OutputFormat(Document doc, String encoding, boolean indenting)
  
  public String  getMethod()
  public void    setMethod(String method)
  public String  getVersion()
  public void    setVersion(String version)
  public int      getIndent()
  public boolean  getIndenting()
  public void     setIndent(int indent)
  public void     setIndenting(boolean on)
  public String   getEncoding()
  public void     setEncoding(String encoding)
  public String   getMediaType()
  public void     setMediaType(String mediaType)
  public void     setDoctype(String publicID, String systemID)
  public String   getDoctypePublic()
  public String   getDoctypeSystem()
  public boolean  getOmitXMLDeclaration()
  public void     setOmitXMLDeclaration(boolean omit)
  public boolean  getStandalone()
  public void     setStandalone(boolean standalone)
  public String[] getCDataElements()
  public boolean  isCDataElement(String tagName)
  public void     setCDataElements(String[] cdataElements)
  public String[] getNonEscapingElements()
  public boolean  isNonEscapingElement(String tagName)
  public void     setNonEscapingElements(String[] nonEscapingElements)
  public String   getLineSeparator()
  public void     setLineSeparator(String lineSeparator)
  public boolean  getPreserveSpace()
  public void     setPreserveSpace(boolean preserve)
  public int      getLineWidth()
  public void     setLineWidth(int lineWidth)
  public char     getLastPrintable()
  
  public static String whichMethod(Document doc)
  public static String whichDoctypePublic(Document doc)
  public static String whichDoctypeSystem(Document doc)
  public static String whichMediaType(String method)
  
}

Better formatted output

      try {
        // Now that the document is created we need to *serialize* it
        OutputFormat format = new OutputFormat(fibonacci, "8859_1", true);
        format.setLineSeparator("\r\n");
        format.setLineWidth(72);
        format.setDoctype(null, "fibonacci.dtd");
        XMLSerializer serializer = new XMLSerializer(System.out, format);
        serializer.serialize(root);
      }
      catch (IOException e) {
        System.err.println(e); 
      }

formatted_fibonacci.xml

<?xml version="1.0" encoding="8859_1"?>
<!DOCTYPE Fibonacci_Numbers SYSTEM "fibonacci.dtd">
<Fibonacci_Numbers>
    <fibonacci index="0">0</fibonacci>
    <fibonacci index="1">1</fibonacci>
    <fibonacci index="2">1</fibonacci>
    <fibonacci index="3">2</fibonacci>
    <fibonacci index="4">3</fibonacci>
    <fibonacci index="5">5</fibonacci>
    <fibonacci index="6">8</fibonacci>
    <fibonacci index="7">13</fibonacci>
    <fibonacci index="8">21</fibonacci>
    <fibonacci index="9">34</fibonacci>
    <fibonacci index="10">55</fibonacci>
    <fibonacci index="11">89</fibonacci>
    <fibonacci index="12">144</fibonacci>
    <fibonacci index="13">233</fibonacci>
    <fibonacci index="14">377</fibonacci>
    <fibonacci index="15">610</fibonacci>
    <fibonacci index="16">987</fibonacci>
    <fibonacci index="17">1597</fibonacci>
    <fibonacci index="18">2584</fibonacci>
    <fibonacci index="19">4181</fibonacci>
    <fibonacci index="20">6765</fibonacci>
    <fibonacci index="21">10946</fibonacci>
    <fibonacci index="22">17711</fibonacci>
    <fibonacci index="23">28657</fibonacci>
    <fibonacci index="24">46368</fibonacci>
    <fibonacci index="25">75025</fibonacci>
</Fibonacci_Numbers>

DOM based XMLPrettyPrinter

import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;
import org.apache.xerces.dom.*;
import org.apache.xml.serialize.*; 


public class DOMPrettyPrinter {

  public static void main(String[] args) { 
     
    DOMParser parser = new DOMParser();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document document = parser.getDocument();
        
        OutputFormat format 
         = new OutputFormat(document, "UTF-8", true);
        format.setLineSeparator("\r\n");
        format.setLineWidth(72);
        format.setIndent(2);
        format.setPreserveSpace(false);
        XMLSerializer serializer 
         = new XMLSerializer(System.out, format);
        serializer.serialize(document);     
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

}

Output from a DOM based XMLPrettyPrinter


The point is this:


To Learn More


Questions?


Index | Cafe con Leche

Copyright 2000 Elliotte Rusty Harold
elharo@metalab.unc.edu
Last Modified September 11, 2000