The CharacterData interface

The CharacterData interface is a generic super-interface for nodes that are composed mostly of text, including Text, Comment, and CDATASection nodes.

You almost never use the CharacterData interface directly. Instead you work with an instance of one of these three sub-interfaces. However, you almost always manipulate text, comment, and CDATA section nodes through the methods they inherit from CharacterData.

Example 11.7 summarizes the CharacterData interface. This interface has methods that manipulate the text content of this node. As usual, it also inherits all the methods of its super-interface Node such as getParentNode() and getNodeValue().

Example 11.7. The CharacterData interface

package org.w3c.dom;

public interface CharacterData extends Node {
  
  public String getData() throws DOMException;
  public void   setData(String data) throws DOMException;
  public int    getLength();
  public String substringData(int offset, int length)
   throws DOMException;
  public void   appendData(String data) throws DOMException;
  public void   insertData(int offset, String data)
   throws DOMException;
  public void   deleteData(int offset, int length)
   throws DOMException;
  public void   replaceData(int offset, int length, String data)
   throws DOMException;

}

The getData() method returns a String containing the complete content of the node. Any escaped characters such as &amp; or &#xA0; are replaced by the actual characters they represent. The setData() method replaces the entire text content of the node. There’s no need to escape the string you pass to this method. If the document is written out to a file or a stream, the serialization code is responsible for escaping these characters. In memory, the type of the object is enough to determine whether a less-than sign is the start of a tag or just a less-than sign.
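The following sketch illustrates both directions of this behavior. The class name EscapingDemo, the helper rootText(), and the sample strings are all hypothetical, chosen only for illustration. It assumes a JAXP parser and a TrAX transformer are available, as in any recent JDK.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EscapingDemo {

  // Hypothetical helper: parse a small in-memory document and return
  // the text node inside its root element. normalize() merges any
  // adjacent text nodes so the first child holds all the text.
  private static CharacterData rootText(String xml) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
     .newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    doc.getDocumentElement().normalize();
    return (CharacterData) doc.getDocumentElement().getFirstChild();
  }

  // getData() returns the actual characters, not the escapes
  public static String readData() {
    try {
      return rootText("<note>AT&amp;T &lt; IBM</note>").getData();
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // setData() takes raw text; the serializer escapes it on output
  public static String writeData() {
    try {
      CharacterData text = rootText("<note>placeholder</note>");
      text.setData("1 < 2 & 3");
      Transformer idTransform
       = TransformerFactory.newInstance().newTransformer();
      idTransform.setOutputProperty(
       OutputKeys.OMIT_XML_DECLARATION, "yes");
      StringWriter out = new StringWriter();
      idTransform.transform(
       new DOMSource(text.getOwnerDocument()), new StreamResult(out));
      return out.toString();
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(readData());   // AT&T < IBM
    System.out.println(writeData());  // serialized form, with escapes
  }

}
```

The string passed to setData() contains a raw less-than sign and a raw ampersand. Only when the identity transform serializes the document do they become &lt; and &amp;.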

There are also methods to read and write just parts of the text content. The offsets are all zero-based as in Java’s String class. For example, this code fragment deletes the first six characters from the CharacterData object text:

text.deleteData(0, 6);
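As a rough illustration of the remaining partial-content methods, the sketch below runs each of them in turn against a free-standing Text node. The class name PartialEditDemo, the helper newText(), and the sample strings are hypothetical and not part of the DOM API.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Text;

public class PartialEditDemo {

  // Hypothetical helper: create a Text node in a new, empty document
  public static Text newText(String s) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
       .newDocumentBuilder().newDocument();
      return doc.createTextNode(s);
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException("Could not locate a JAXP parser", e);
    }
  }

  // Applies each editing method once and returns the final content
  public static String edit() {
    Text text = newText("Hello, World!");
    text.deleteData(0, 7);              // "World!"
    text.insertData(0, "Brave New ");   // "Brave New World!"
    text.replaceData(0, 5, "Whole");    // "Whole New World!"
    text.appendData(" Indeed.");        // "Whole New World! Indeed."
    return text.getData();
  }

  public static void main(String[] args) {
    // substringData() reads without modifying the node
    System.out.println(newText("Hello, World!").substringData(0, 5));
    System.out.println(edit());
  }

}
```

In each call the first int argument is a zero-based offset and the second, where present, is a count of chars rather than an end index.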

Java’s String type is a very good match for DOM strings. Each char in a Java String is a single UTF-16 code unit. Most Unicode characters are represented by exactly one Java char. However, characters with code points greater than 65,535, such as many musical symbols, are represented by two chars each, one for each half of the surrogate pair representing the character in UTF-16. The getLength() method in this interface returns the number of UTF-16 code units, not the number of Unicode characters. This is also how the length() method in Java’s String class behaves.
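The difference is easy to see with a single musical symbol. This is a minimal sketch. The class name LengthDemo and its two methods are hypothetical, and codePointCount() assumes Java 5 or later.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Text;

public class LengthDemo {

  // Length as DOM reports it: the number of UTF-16 code units
  public static int domLength(String s) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
       .newDocumentBuilder().newDocument();
      Text text = doc.createTextNode(s);
      return text.getLength();
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException("Could not locate a JAXP parser", e);
    }
  }

  // Number of Unicode characters (code points) in the same string
  public static int characterCount(String s) {
    return s.codePointCount(0, s.length());
  }

  public static void main(String[] args) {
    // U+1D11E MUSICAL SYMBOL G CLEF, written as a surrogate pair
    String clef = "\uD834\uDD1E";
    System.out.println(domLength(clef));       // 2
    System.out.println(characterCount(clef));  // 1
  }

}
```

Offsets passed to methods such as substringData() and deleteData() count the same 16-bit units, so an offset can land between the two halves of a surrogate pair if you are not careful.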

On Usenet, jokes that some people are likely to find offensive are often obscured by rotating the letters of the ASCII alphabet 13 places. That is, the first letter of the alphabet, A, is transformed into the fourteenth letter of the alphabet, N. The second letter of the alphabet, B, is transformed into the fifteenth letter of the alphabet, O, and so forth through M, which becomes Z. Then N is transformed into A, O into B, and so on through Z, which becomes M. It’s not a particularly strong cipher, but it’s enough to prevent people from accidentally reading something they don’t want to read. It has the extra advantage of reversing itself. That is, running the cipher text through the ROT13 algorithm one more time restores the original text.

Example 11.8 is a simple program that obscures text nodes, comments, and CDATA sections by applying the ROT13 algorithm to them. The encoded documents are as well-formed and valid as the unencoded documents. Only the character data gets changed, not the markup. [3] This program can also decode documents that are already encoded.

Example 11.8. ROT13 encoder for XML documents

import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.dom.DOMSource;
import org.w3c.dom.*;
import org.xml.sax.SAXException;
import java.io.IOException;


public class ROT13XML {

  // note use of recursion
  public static void encode(Node node) {
    
    if (node instanceof CharacterData) {
      CharacterData text = (CharacterData) node;
      String data = text.getData();
      text.setData(rot13(data));
    }
    
    // recurse the children
    if (node.hasChildNodes()) {
      NodeList children = node.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        encode(children.item(i));
      } 
    }
    
  }
  
  public static String rot13(String s) {
    
    StringBuilder out = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
      int c = s.charAt(i);
      if (c >= 'A' && c <= 'M') out.append((char) (c+13));
      else if (c >= 'N' && c <= 'Z') out.append((char) (c-13));
      else if (c >= 'a' && c <= 'm') out.append((char) (c+13));
      else if (c >= 'n' && c <= 'z') out.append((char) (c-13));
      else out.append((char) c);
    } 
    return out.toString();
    
  }

  public static void main(String[] args) {

    if (args.length <= 0) {
      System.out.println("Usage: java ROT13XML URL");
      return;
    }
    
    String url = args[0];
    
    try {
      DocumentBuilderFactory factory 
       = DocumentBuilderFactory.newInstance();
      DocumentBuilder parser = factory.newDocumentBuilder();
      
      // Read the document
      Document document = parser.parse(url); 
      
      // Modify the document
      ROT13XML.encode(document);

      // Write it out again
      TransformerFactory xformFactory 
       = TransformerFactory.newInstance();
      Transformer idTransform = xformFactory.newTransformer();
      Source input = new DOMSource(document);
      Result output = new StreamResult(System.out);
      idTransform.transform(input, output);

    }
    catch (SAXException e) {
      System.out.println(url + " is not well-formed.");
    }
    catch (IOException e) { 
      System.out.println(
      "Due to an IOException, the parser could not encode " + url
      ); 
    }
    catch (FactoryConfigurationError e) { 
      System.out.println("Could not locate a factory class"); 
    }
    catch (ParserConfigurationException e) { 
      System.out.println("Could not locate a JAXP parser"); 
    }
    catch (TransformerConfigurationException e) { 
      System.out.println("Could not locate a TrAX transformer"); 
    }
    catch (TransformerException e) { 
      System.out.println("Could not transform"); 
    }
     
  } // end main

}

The encode() method recursively descends the tree applying the ROT13 algorithm to every CharacterData object it finds, whether a Comment, Text, or CDATASection. The algorithm itself is encapsulated in the rot13() method. Since both methods merely operate on their arguments but otherwise have no interaction with any state maintained in the class, I made them static. The main() method encodes a document at a URL typed on the command line, and then copies the result to System.out.
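The self-reversing behavior can be checked in memory without writing any files. The sketch below is not part of Example 11.8. It uses a compact re-implementation of rot13() and encode() so that it stands alone, and the class name RoundTripDemo and the sample document are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class RoundTripDemo {

  // Compact variant of the rot13() method in Example 11.8
  public static String rot13(String s) {
    StringBuilder out = new StringBuilder(s.length());
    for (char c : s.toCharArray()) {
      if ((c >= 'A' && c <= 'M') || (c >= 'a' && c <= 'm')) c += 13;
      else if ((c >= 'N' && c <= 'Z') || (c >= 'n' && c <= 'z')) c -= 13;
      out.append(c);
    }
    return out.toString();
  }

  // Same recursive walk as encode() in Example 11.8
  public static void encode(Node node) {
    if (node instanceof CharacterData) {
      CharacterData text = (CharacterData) node;
      text.setData(rot13(text.getData()));
    }
    for (Node child = node.getFirstChild(); child != null;
     child = child.getNextSibling()) {
      encode(child);
    }
  }

  // Returns the element name, the text after one pass,
  // and the text after two passes
  public static String[] roundTrip() {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
       .newDocumentBuilder().parse(new InputSource(
        new StringReader("<greeting>Hello, World!</greeting>")));
      Node root = doc.getDocumentElement();
      root.normalize();
      encode(doc);
      String once = root.getFirstChild().getNodeValue();
      encode(doc);
      String twice = root.getFirstChild().getNodeValue();
      return new String[] { root.getNodeName(), once, twice };
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    for (String s : roundTrip()) System.out.println(s);
  }

}
```

The element name comes back unchanged after both passes, which reflects the point that only character data is touched, never markup.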

Here’s a joke encoded by this program. You’ll have to run the program if you want to find out what it says. :-)

D:\books\XMLJAVA>java ROT13XML joke.xml
<?xml version="1.0" encoding="utf-8"?><joke>
  Erchoyvpna srghf-jbefuvccref jnag gb tvir srghfrf rdhny be
  fhcrevbe evtugf bire jbzra'f obqvrf, rira vs vg guerngraf n
  jbzna'f culfvpny urnygu -- rira jura gur srghf qbrfa'g lrg unir
  n shyyl shapgvbavat uhzna oenva, be nal oenva ng nyy. Lbh unir
  gb fnl bar guvat -- Erchoyvpnaf gnxr pner bs gurve bja.
</joke>


[3] ROT13XML could also encode attribute values and processing instructions without affecting well-formedness or validity, but since DOM does not represent these nodes as instances of CharacterData, I leave this as an exercise for the reader.


Copyright 2001, 2002 Elliotte Rusty Harold (elharo@metalab.unc.edu). Last Modified January 04, 2002