TreeWalker

TreeWalker
Prev	Chapter 12. The DOM Traversal Module	Next

The purpose of TreeWalker is much the same as that of NodeIterator, traversing a subtree of a document rooted at a particular node and filtered by both node type and custom logic. However, it differs from NodeIterator in that the traversal model is based on a tree with parents, children, and sibling nodes rather than a linear list with only previous and next nodes. Since the traversal model is very similar to what’s already available in the Node interface, tree walkers aren’t as commonly used as NodeIterator. However, the ability to filter the nodes that appear in the tree can be very useful on occasion.

Example 12.7 summarizes the TreeWalker interface. It has getter methods that return the configuration of the TreeWalker, methods to get and set the current node, and methods to move from the current node to its parent, first child, last child, previous sibling, next sibling, previous node, and next node. In all cases these methods return null if there is no such node; e.g. you ask for the last child of an empty element.

Example 12.7. The TreeWalker interface

package org.w3c.dom.traversal;

public interface TreeWalker {
  
  public Node       getRoot();
  public int        getWhatToShow();
  public NodeFilter getFilter();
  public boolean    getExpandEntityReferences();
  
  public Node       getCurrentNode();
  public void       setCurrentNode(Node currentNode) 
   throws DOMException;
  public Node       parentNode();
  public Node       firstChild();
  public Node       lastChild();
  public Node       previousSibling();
  public Node       nextSibling();
  public Node       previousNode();
  public Node       nextNode();

}

A TreeWalker object is always positioned at one of the nodes in its sub-tree. It begins its existence positioned at the first node in document order. From there you can change the walker’s position by invoking nextNode(), previousNode(), parentNode(), firstChild(), lastChild(), previousSibling() and nextSibling(). In the event that there is no parent, sibling, or child relative to the current node within the walker’s tree, these methods all return null. You can find out where the walker is positioned with currentNode().

TreeWalker objects are created in almost exactly the same way as NodeIterator objects. That is, you cast the Document object you want to walk to DocumentTraversal, and invoke its createTreeWalker() method. The createTreeWalker() method takes the same four arguments with the same four meanings as createNodeIterator(): the root node of the subtree to walk, an int constant specifying which types of nodes to display, a custom NodeFilter object or null, and a boolean indicating whether or not to expand entity references.

Note

If the root node is filtered out either by whatToShow or by the NodeFilter, then the sub-tree being walked may not have a single root. In other words, it’s more like a DocumentFragment than a Document. As long as you’re cognizant of this possibility, this is not a large problem.

TreeWalkers are called for whenever the hierarchy matters; that is, whenever what’s important is not just the node itself but its parent and other ancestor nodes. For example, suppose you want to generate a list of examples in a DocBook document in the following format:

Example 1.1: A plain text document indicating an order for 12 Birdsong Clocks, SKU 244
Example 1.2: An XML document indicating an order for 12 Birdsong Clocks, SKU 244
Example 1.3: A document indicating an order for 12 Birdsong Clocks, SKU 244?
…
Example 2.1: An XML document that labels elements with schema simple types
Example 2.2: URLGrabber
Example 2.3: URLGrabberTest
…

As you saw in the last chapter, DocBook documents are structured roughly like this:

<book>
  …
  <chapter>
    …
    <example id="filename.java">
      <title>Some Java Program</title>
      <programlisting>import javax.xml.parsers;
        // more Java code…      
      </programlisting>
    </example>  
    …
    <example id="filename.xml">
      <title>Some XML document</title>
      <programlisting><![CDATA[<?xml version="1.0"?>
<root>
  …     
</root>]]></programlisting>
     </example>  
    …
  </chapter>
  more chapters…
</book>

For maximum convenience we want a TreeWalker that only sees book, chapter, example, and title elements. However, title elements should only be allowed when they represent the title of an example, not a chapter or a figure or anything else. We can set whatToShow to NodeFilter.SHOW_ELEMENT to limit the walker to elements and design a NodeFilter that only picks out these four elements. Example 12.8 demonstrates this filter.

Example 12.8. The ExampleFilter class

import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.*;


public class ExampleFilter implements NodeFilter {

  public short acceptNode(Node node) {
     
    Element candidate = (Element) node;
    String name = candidate.getNodeName(); 
    if (name.equals("example")) return FILTER_ACCEPT;
    else if (name.equals("chapter")) return FILTER_ACCEPT;
    else if (name.equals("book")) return FILTER_ACCEPT;
    else if (name.equals("title")) {
      // Is this the title of an example, in which case we accept
      // it, or the title of something else, in which case we
      // reject it?
      Node parent = node.getParentNode();
      if ("example".equals(parent.getNodeName())) {
        return FILTER_ACCEPT;
      }
    }
    return FILTER_SKIP;
     
  }

}

In each case when an element is rejected, acceptNode() returns FILTER_SKIP, not FILTER_REJECT. For TreeWalker, unlike NodeIterator, the difference is important. By returning FILTER_SKIP, acceptNode() indicates that this node should not be reported but that its children should be. If acceptNode() returns FILTER_REJECT for a node, then neither that node nor any of its descendants would be covered.

The TreeWalker is simply a view of the document. It does not itself change the document or the nodes the document contains. For example, even though the ExampleFilter hides all text nodes, these can still be extracted from a title element. Example 12.9 walks the tree and pulls out these titles using this filter.

Example 12.9. Navigating a sub-tree with TreeWalker

import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.w3c.dom.traversal.*;
import org.xml.sax.SAXException;
import java.io.IOException;


public class ExampleList {

  public static void printExampleTitles(Document doc) {
  
    // Create the NodeIterator
    DocumentTraversal traversable = (DocumentTraversal) doc;
    TreeWalker walker = traversable.createTreeWalker(
     doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, 
     new ExampleFilter(), true);
    
    // The TreeWalker starts out positioned at the root     
    Node chapter = walker.firstChild();
    int chapterNumber = 0;
    while (chapter != null) {
      chapterNumber++;
      Node example = walker.firstChild();
      int exampleNumber = 0;
      while (example != null) {
        exampleNumber++;
        Node title = walker.firstChild();
        String titleText = TextExtractor.getText(title);
        titleText = "Example " + chapterNumber + "."
         + exampleNumber + ": " + titleText;
        System.out.println(titleText);
        // Back up to the example
        walker.parentNode();
        example = walker.nextSibling();
      }
      // Reposition the walker on the parent chapter
      walker.parentNode();
      // Go to the next chapter
      chapter = walker.nextSibling();
    }
    
  }
  
  public static void main(String[] args) {

    if (args.length <= 0) {
      System.out.println("Usage: java ExampleList URL");
      return;
    }
    
    String url = args[0];
    
    try {
      DocumentBuilderFactory factory 
       = DocumentBuilderFactory.newInstance();
      DocumentBuilder parser = factory.newDocumentBuilder();
      
      // Check for the traversal module
      DOMImplementation impl = parser.getDOMImplementation();
      if (!impl.hasFeature("traversal", "2.0")) {
        System.out.println(
         "A DOM implementation that supports traversal is required."
        );  
        return;
      }
      
      // Read the document
      Document doc = parser.parse(url); 
      printExampleTitles(doc);
       
    }
    catch (SAXException e) {
      System.out.println(url + " is not well-formed.");
    }
    catch (IOException e) { 
      System.out.println(
       "Due to an IOException, the parser could not check " + url
      ); 
    }
    catch (FactoryConfigurationError e) { 
      System.out.println("Could not locate a factory class"); 
    }
    catch (ParserConfigurationException e) { 
      System.out.println("Could not locate a JAXP parser"); 
    }
     
  } // end main  
  
}

The use of TreeWalker here and NodeIterator in TextExtractor make this task a lot simpler than it otherwise would be. Hiding all the irrelevant parts means, among other things, you don’t have to worry about the complexities of example elements that appear at different depths in the tree or insignificant white space that may sporadically add extra text nodes where you don’t expect them. The traversal package enables you to boil down a document to the minimum structure relevant to your problem.

Copyright 2001, 2002 Elliotte Rusty Harold	elharo@metalab.unc.edu	Last Modified January 19, 2002
	Up To Cafe con Leche