Processing XML with Java

Elliotte Rusty Harold

Tuesday, August 28, 2001

elharo@metalab.unc.edu

http://www.ibiblio.org/xml/

Where we're going

XML Infoset
Writing XML with Java
Reading XML through SAX2
Reading and Writing XML through the DOM
JDOM
dom4j?
TRAX?

Processing XML with Java is easy

You need a JDK
You need some free class libraries
You need a text editor
You need some data to process

Prerequisites

You are familiar with Java, including I/O, classes, objects, polymorphism, etc.
You know XML, including well-formedness, validity, namespaces, and so forth
I will briefly review proper terminology

Parser APIs

SAX, the Simple API for XML
- SAX1
- SAX2
DOM, the Document Object Model
- DOM Level 0
- DOM Level 1
- DOM Level 2
- DOM Level 3
JDOM
dom4j
TRAX
Proprietary APIs
- Parser specific APIs
- Sun's Java API for XML Parsing = SAX1 + DOM1 + a few factory classes
- JSR-000031 XML Data Binding Specification from Bluestone, Sun, webMethods et al.
  The proposed specification will define an XML data-binding facility for the Java™ Platform. Such a facility compiles an XML schema into one or more Java classes. These automatically-generated classes handle the translation between XML documents that follow the schema and interrelated instances of the derived classes. They also ensure that the constraints expressed in the schema are maintained as instances of the classes are manipulated.

Part I: XML Infoset

The Infoset is the unfortunate standard to which those in retreat from the radical and most useful implications of well-formedness have rallied. At its core the Infoset insists that there is 'more' to XML than the straightforward syntax of well-formedness. By imposing its canonical semantics the Infoset obviates the infinite other semantic outcomes which might be elaborated in particular unique circumstances from an instance of well-formed XML 1.0 syntax. The question we should be asking is not whether the Infoset has chosen the correct canonical semantics, but whether the syntactic possibilities of XML 1.0 should be curtailed in this way at all.

--Walter Perry on the xml-dev mailing list

A simple example

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/css" href="song.css"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<SONG xmlns="http://metalab.unc.edu/xml/namespace/song"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <TITLE>Hot Cop</TITLE>
  <PHOTO 
    xlink:type="simple" xlink:show="onLoad" xlink:href="hotcop.jpg"
    ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200"/>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <!-- The publisher is actually Polygram but I needed 
       an example of a general entity reference. -->
  <PUBLISHER xlink:type="simple" xlink:href="http://www.amrecords.com/">
    A &amp; M Records
  </PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>
<!-- You can tell what album I was 
     listening to when I wrote this example -->

Markup and Character Data

Markup includes:
- Tags
- Entity References
- Comments
- Processing Instructions
- Document Type Declarations
- XML Declaration
- CDATA Section Delimiters
Character data includes everything else

Markup and Character Data Example

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/css" href="song.css"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<SONG xmlns="http://www.ibiblio.org/xml/namespace/song"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <TITLE>Hot Cop</TITLE>
  <PHOTO 
    xlink:type="simple" xlink:show="onLoad" xlink:href="hotcop.jpg"
    ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200"/>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <!-- The publisher is actually Polygram but I needed 
       an example of a general entity reference. -->
  <PUBLISHER xlink:type="simple" xlink:href="http://www.amrecords.com/">
    A &amp; M Records
  </PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>
<!-- You can tell what album I was 
     listening to when I wrote this example -->

Entities

An XML document is made up of one or more physical storage units called entities
Entity references:
- Parsed internal general entity references like &amp;
- Parsed external general entity references
- Unparsed external general entity references
- External parameter entity references
- Internal parameter entity references
Reading an XML document is not the same thing as reading an XML file

The file contains entity references.
The document contains the entities' replacement text.
When you use a parser to read a document you'll get the text including characters like <. You will not see the entity references.

Parsed Character Data

Character data left after entity references are replaced with their text
Given the element
<PUBLISHER>A &amp; M Records</PUBLISHER>

The parsed character data is

A & M Records
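
A minimal sketch of the same point in code, assuming the JAXP classes bundled with current JDKs rather than a specific parser; the class and method names here are made up for illustration. The parser hands back the replacement text, not the entity reference:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ParsedCharacterData {

  // Parses the XML text and returns the character data of the root element
  public static String rootText(String xml) throws Exception {
    DocumentBuilder builder
     = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document doc = builder.parse(new InputSource(new StringReader(xml)));
    return doc.getDocumentElement().getTextContent();
  }

  public static void main(String[] args) throws Exception {
    String xml = "<PUBLISHER>A &amp; M Records</PUBLISHER>";
    System.out.println(rootText(xml));  // prints: A & M Records
  }

}
```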

CDATA sections

Used to include large blocks of text with lots of normally illegal literal characters like < and &, typically XML or HTML.

<p>You can use a default <code>xmlns</code>
attribute to avoid having to add the svg prefix to all
your elements:</p>
<![CDATA[
  <svg xmlns="http://www.w3.org/2000/svg" 
       width="12cm" height="10cm">
    <ellipse rx="110" ry="130" />
    <rect x="4cm" y="1cm" width="3cm" height="6cm" />
  </svg>
]]>

CDATA is for human authors, not for programs!
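
One way to see why programs should not care about CDATA sections: the sketch below (hypothetical class name, JAXP assumed) parses the same content written once with a CDATA section and once with escapes, then compares the resulting text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CDATAEquivalence {

  // Returns the text content of the root element
  public static String rootText(String xml) throws Exception {
    DocumentBuilder builder
     = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document doc = builder.parse(new InputSource(new StringReader(xml)));
    return doc.getDocumentElement().getTextContent();
  }

  public static void main(String[] args) throws Exception {
    String withCdata   = "<code><![CDATA[<rect x=\"4cm\"/> & more]]></code>";
    String withEscapes = "<code>&lt;rect x=\"4cm\"/&gt; &amp; more</code>";
    // Both forms carry the same character data
    System.out.println(rootText(withCdata).equals(rootText(withEscapes)));
  }

}
```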

Comments


Comments are for humans, not programs.

Processing Instructions

Divided into a target and data for the target
The target must be an XML name
The data can have an effectively arbitrary format

<?robots index="yes" follow="no"?>
<?xml-stylesheet href="pelicans.css" type="text/css"?>
<?php 
  mysql_connect("database.unc.edu", "clerk", "password"); 
  $result = mysql("CYNW", "SELECT LastName, FirstName FROM Employees 
    ORDER BY LastName, FirstName"); 
  $i = 0;
  while ($i < mysql_numrows ($result)) {
     $fields = mysql_fetch_row($result);
     echo "<person>$fields[1] $fields[0] </person>\r\n";
     $i++;
  }
  mysql_close();
?>

These are for programs
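
A rough sketch of how a program receives processing instructions, assuming a SAX2 parser obtained through JAXP. The class name and the "target|data" string format are invented for this illustration. Note that the XML declaration is not reported as a processing instruction.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PIReporter {

  // Returns one "target|data" string per processing instruction
  public static List<String> listPIs(String xml) throws Exception {
    final List<String> result = new ArrayList<String>();
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse(new InputSource(new StringReader(xml)),
     new DefaultHandler() {
      @Override
      public void processingInstruction(String target, String data) {
        result.add(target + "|" + data);
      }
    });
    return result;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<?xml version=\"1.0\"?>"
     + "<?robots index=\"yes\" follow=\"no\"?><doc/>";
    System.out.println(listPIs(xml));
  }

}
```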

The XML Declaration

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

Looks like a processing instruction but isn't.
version attribute
- required
- always has the value 1.0
encoding attribute
- UTF-8
- ISO-8859-1
- SJIS
- etc.
standalone attribute
- yes
- no

Document Type Declaration

<!DOCTYPE SONG SYSTEM "song.dtd">

Document Type Definition (DTD)

<!ELEMENT SONG (TITLE, PHOTO?, COMPOSER+, PRODUCER*, 
 PUBLISHER*, LENGTH?, YEAR?, ARTIST+)>

<!ELEMENT TITLE (#PCDATA)>

<!ELEMENT COMPOSER (#PCDATA)>
<!ELEMENT PRODUCER (#PCDATA)>
<!ELEMENT PUBLISHER (#PCDATA)>
<!ELEMENT LENGTH (#PCDATA)>
<!-- This should be a four digit year like "1999",
     not a two-digit year like "99" -->
<!ELEMENT YEAR (#PCDATA)>

<!ELEMENT ARTIST (#PCDATA)>
<!ELEMENT PHOTO EMPTY>
<!ATTLIST PHOTO xlink:type (simple) #FIXED "simple" 
                xlink:show (onLoad) #FIXED "onLoad" 
                xlink:href CDATA #REQUIRED
                ALT CDATA #REQUIRED
                WIDTH NMTOKEN #REQUIRED
                HEIGHT NMTOKEN #REQUIRED
>
<!ATTLIST PUBLISHER xlink:type (simple) #FIXED "simple" 
                    xlink:href CDATA #REQUIRED

>
<!ATTLIST SONG xmlns CDATA       #FIXED "http://metalab.unc.edu/xml/namespace/song"
               xmlns:xlink CDATA #FIXED "http://www.w3.org/1999/xlink"
>

XML Names

Used for element, attribute, and entity names
Can contain any Unicode 2.0 alphabetic, ideographic, or numeric character
Can contain hyphen, underscore, or period
Can also contain colons but these are reserved for namespaces
Can begin with any Unicode 2.0 alphabetic or ideographic character or the underscore but not digits or other punctuation marks

XML Namespaces

Raison d'etre:
1. To distinguish between elements and attributes from different vocabularies with different meanings.
2. To group all related elements and attributes together so that a parser can easily recognize them.
Each element is given a prefix
Each prefix (as well as the empty prefix) is associated with a URI
Elements with the same URI are in the same namespace
URIs are purely formal. They do not necessarily point to a page.

Namespace Syntax

Prefixed elements and attributes have names that contain exactly one colon. They look like this:
```
rdf:description
xlink:type
xsl:template
```
Everything before the colon is called the prefix
Everything after the colon is called the local part or local name.
The complete name including the colon is called the qualified name or raw name.

Namespace URIs

Each prefix in a qualified name is associated with a URI.
For example, all elements in XSLT 1.0 style sheets are associated with the http://www.w3.org/1999/XSL/Transform URI.
The customary prefix xsl is a shorthand for the longer URI http://www.w3.org/1999/XSL/Transform.
You can't use the URI in the element name directly.

Binding Prefixes to Namespace URIs

Prefixes are bound to namespace URIs by attaching an xmlns:prefix attribute to the prefixed element or one of its ancestors.

<svg:svg xmlns:svg="http://www.w3.org/2000/svg" 
 width="12cm" height="10cm">
  <svg:ellipse rx="110" ry="130" />
  <svg:rect x="4cm" y="1cm" width="3cm" height="6cm" />
</svg:svg>

Bindings have scope within the element where they're declared.
An SVG processor can recognize all three of these elements as SVG elements because they all have prefixes bound to the particular URI defined by the SVG specification.
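
As an illustration, a namespace-aware DOM parser exposes each of these pieces separately. This is only a sketch, assuming JAXP and DOM Level 2; the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class QualifiedNameParts {

  // Returns "prefix | local name | qualified name | namespace URI"
  // for the root element
  public static String describe(String xml) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);  // off by default in JAXP
    Document doc = factory.newDocumentBuilder()
     .parse(new InputSource(new StringReader(xml)));
    Element root = doc.getDocumentElement();
    return root.getPrefix() + " | " + root.getLocalName() + " | "
     + root.getNodeName() + " | " + root.getNamespaceURI();
  }

  public static void main(String[] args) throws Exception {
    String xml
     = "<svg:rect xmlns:svg=\"http://www.w3.org/2000/svg\" x=\"4cm\"/>";
    // prints: svg | rect | svg:rect | http://www.w3.org/2000/svg
    System.out.println(describe(xml));
  }

}
```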

The Default Namespace

Indicate that an unprefixed element and all its unprefixed descendant elements belong to a particular namespace by attaching an xmlns attribute with no prefix:

<DATASCHEMA xmlns="http://www.w3.org/2000/P3Pv1">
  <DATA name="vehicle.make" type="text" short="Make" 
        category="preference" size="31"/>
  <DATA name="vehicle.model" type="text" short="Model" 
        category="preference" size="31"/>
  <DATA name="vehicle.year" type="number" short="Year" 
        category="preference" size="4"/>
  <DATA name="vehicle.license.state." type="postal." short="State" 
        category="preference" size="2"/>
  <DATA name="vehicle.license.number" type="text" 
        short="License Plate Number" category="preference" size="12"/>
</DATASCHEMA>

Both the DATASCHEMA and DATA elements are in the http://www.w3.org/2000/P3Pv1 namespace.
Default namespaces apply only to elements, not to attributes. Thus in the above example the name, type, short, category, and size attributes are not in any namespace. Unprefixed attributes are never in any namespace.
You can change the default namespace within a particular element by adding an xmlns attribute to the element.
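
The element/attribute difference can be checked with a namespace-aware DOM parser. A small sketch, assuming JAXP; the class name is made up and the document is a cut-down version of the P3P example:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class DefaultNamespaceDemo {

  // Returns { namespace URI of the root element,
  //           namespace URI of its name attribute }
  public static String[] namespaces(String xml) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    Document doc = factory.newDocumentBuilder()
     .parse(new InputSource(new StringReader(xml)));
    Element root = doc.getDocumentElement();
    Attr name = root.getAttributeNode("name");
    return new String[] {root.getNamespaceURI(), name.getNamespaceURI()};
  }

  public static void main(String[] args) throws Exception {
    String xml = "<DATA xmlns=\"http://www.w3.org/2000/P3Pv1\" "
     + "name=\"vehicle.make\"/>";
    String[] result = namespaces(xml);
    System.out.println("element:   " + result[0]);
    System.out.println("attribute: " + result[1]);  // null: no namespace
  }

}
```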

How Parsers Handle Namespaces

Namespaces were added to XML 1.0 after the fact, but care was taken to ensure backwards compatibility.
An XML 1.0 parser that does not know about namespaces will most likely not have any troubles reading a document that uses namespaces.
A namespace aware parser also checks to see that all prefixes are mapped to URIs. Otherwise it behaves almost exactly like a non-namespace aware parser.
Other software that sits on top of the raw XML parser, an XSLT engine for example, may treat elements differently depending on what namespace they belong to. However, the XML parser itself mostly doesn't care as long as all well-formedness and namespace constraints are met.
A possible exception occurs in the unlikely event that elements with different prefixes belong to the same namespace or elements with the same prefix belong to different namespaces.
Many parsers have the option of whether to report namespace violations so that you can turn namespace processing on or off as you see fit.
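
For instance, with a JAXP SAX parser the switch is a single factory setting. The sketch below (hypothetical class name) feeds the same document, which uses an undeclared prefix, to a parser with namespace processing off and then on. It assumes the parser reports an unbound prefix as a fatal error, as the JDK's bundled parser does.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class NamespaceSwitch {

  // Returns true if the document parses without a fatal error
  public static boolean parses(String xml, boolean namespaceAware)
   throws Exception {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(namespaceAware);
    SAXParser parser = factory.newSAXParser();
    try {
      parser.parse(new InputSource(new StringReader(xml)),
       new DefaultHandler());
      return true;
    }
    catch (SAXException e) {
      return false;
    }
  }

  public static void main(String[] args) throws Exception {
    String xml = "<a:b/>";  // the prefix a is never bound to a URI
    System.out.println(parses(xml, false));  // well-formed XML 1.0
    System.out.println(parses(xml, true));   // namespace violation
  }

}
```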

Three Variations on a Theme

A normal XML document

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/css" href="song.css"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<SONG xmlns="http://metalab.unc.edu/xml/namespace/song"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <TITLE>Hot Cop</TITLE>
  <PHOTO 
    xlink:type="simple" xlink:show="onLoad" xlink:href="hotcop.jpg"
    ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200"/>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <!-- The publisher is actually Polygram but I needed 
       an example of a general entity reference. -->
  <PUBLISHER xlink:type="simple" xlink:href="http://www.amrecords.com/">
    A &amp; M Records
  </PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>
<!-- You can tell what album I was 
     listening to when I wrote this example -->

A canonical XML document

<?xml-stylesheet type="text/css" href="song.css"?>
<SONG xmlns="http://metalab.unc.edu/xml/namespace/song" xmlns:xlink="http://www.w3.org/1999/xlink">
  <TITLE>Hot Cop</TITLE>
  <PHOTO ALT="Victor Willis in Cop Outfit" HEIGHT="200" WIDTH="100" xlink:href="hotcop.jpg" xlink:show="onLoad" xlink:type="simple"></PHOTO>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  
  <PUBLISHER xlink:href="http://www.amrecords.com/" xlink:type="simple">
    A &amp; M Records
  </PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>

An org.w3c.dom.Document object formed by reading hotcop.xml

import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;

public class DOMHotCop {

  public static void main(String[] args) {
     
    DOMParser parser = new DOMParser();
    
    try {
      parser.parse("http://metalab.unc.edu/xml/examples/hot_cop.xml"); 
      Document d = parser.getDocument();
    }
    catch (SAXException e) {
      System.err.println(e); 
    }
    catch (IOException e) {
      System.err.println(e); 
    }
   
  }

}

Are these three the same thing or not?

Three forms:
- The customary form of an XML document
- The canonical form of an XML document
- The object form of an XML document
Do they contain the same information or not?

What is the XML InfoSet?

A W3C proposed standard for what is and is not significant in an XML document
Not everyone agrees that this is a good thing, or that this is the right list!

The InfoSet defines 11 kinds of Information Items

The Document Information Item
Element Information Items
Attribute Information Items
Processing instruction Information Items
Unexpanded Entity Reference Information Items
Character Information Items
Comment Information Items
The Document Type Declaration Information Item
Unparsed Entity Information Items
Notation Information Items
Namespace Declaration Information Items

The Document Information Item

Represents the entire document; not just the root element
Properties:
- Children
  - One Element Information Item for the root element
  - One Comment Information Item for each Comment
  - One Processing Instruction Information Item for each Processing Instruction
- Notation Declarations
- Unparsed Entities
- Base URI
- Standalone Declaration
- Version Declaration
- All declarations processed

Element Information Items

An Element Information Item Includes:

namespace name
local name
children: a list of element, processing instruction, unexpanded entity reference, character, and comment information items, one for each element, processing instruction, unexpanded entity reference, data character, and comment appearing immediately within the current element
attributes: an unordered set of attribute information items, one for each of the attributes (specified or defaulted from the DTD) of this element. xmlns attribute declarations are not included.
declared namespaces: an unordered set of namespace declaration information items, one for each of the namespaces declared either in the start-tag of this element or defaulted from the DTD.
in-scope namespaces: An unordered set of namespace declaration information items, one for each of the namespaces in effect for this element
base URI: The absolute URI of the external entity in which this element appears, as defined in XML Base. If this is not known, this property is null.
parent

Attributes

xlink:type="simple"
xlink:href="http://www.amrecords.com/"
xlink:type =  "simple"
xlink:show = "onLoad"
xlink:href="hotcop.jpg"
ALT="Victor Willis in Cop Outfit"
WIDTH=" 100 "
HEIGHT=' 200 '

An Attribute Information Item Includes:

namespace name
local name
normalized value
specified: A flag indicating whether this attribute was actually specified in the start-tag of its element, or was defaulted from the DTD
default: An ordered list of character information items, one for each character appearing in the default value specified for this attribute in the DTD, if any.
attribute type:
- ID
- IDREF
- IDREFS
- ENTITY
- ENTITIES
- NMTOKEN
- NMTOKENS
- NOTATION
- CDATA
- ENUMERATED
owner element
references: if the attribute type is IDREF, IDREFS, ENTITY, ENTITIES, or NOTATION, then the value of this property is an ordered list of the element, unparsed entity, or notation information items referred to in the attribute value
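
The specified flag has a direct counterpart in the DOM, Attr.getSpecified(). A sketch under the assumption that the parser reads the internal DTD subset (as JAXP parsers do); the class name and the UNITS attribute are hypothetical, added only so there is something to default.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class SpecifiedFlag {

  // Reports whether the named attribute of the root element was
  // written in the start-tag (true) or defaulted from the DTD (false)
  public static boolean isSpecified(String xml, String attributeName)
   throws Exception {
    DocumentBuilder builder
     = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document doc = builder.parse(new InputSource(new StringReader(xml)));
    Attr attr = doc.getDocumentElement().getAttributeNode(attributeName);
    return attr.getSpecified();
  }

  public static void main(String[] args) throws Exception {
    // UNITS is a hypothetical attribute with a default value
    String xml = "<!DOCTYPE PHOTO [\r\n"
     + "  <!ELEMENT PHOTO EMPTY>\r\n"
     + "  <!ATTLIST PHOTO ALT   CDATA #REQUIRED\r\n"
     + "                  UNITS CDATA \"pixels\">\r\n"
     + "]>\r\n"
     + "<PHOTO ALT=\"Victor Willis in Cop Outfit\"/>";
    System.out.println("ALT specified:   " + isSpecified(xml, "ALT"));
    System.out.println("UNITS specified: " + isSpecified(xml, "UNITS"));
  }

}
```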

Comments

  <!-- The publisher is actually Polygram but I needed 
       an example of a general entity reference. -->
<!--  <PUBLISHER xlink:type="simple" xlink:href="http://www.amrecords.com/">
    A &amp; M Records
  </PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG> -->
<!-- You can tell what album I was 
     listening to when I wrote this example -->

A comment Information Item includes:

content
parent

A Processing Instruction Information Item Includes:

<?robots index="yes" follow="no"?>
<?php 
  mysql_connect("database.unc.edu", "clerk", "password"); 
  $result = mysql("CYNW", "SELECT LastName, FirstName FROM Employees 
    ORDER BY LastName, FirstName"); 
  $i = 0;
  while ($i < mysql_numrows ($result)) {
     $fields = mysql_fetch_row($result);
     echo "<person>$fields[1] $fields[0] </person>\r\n";
     $i++;
  }
  mysql_close();
?>

target
content
base URI
parent
notation (named by the target)

Characters

A character is one Unicode character in the content of an element, attribute value, comment or processing instruction data.
A Character Information Item includes:

character code
The Unicode value in the range 0 to #x10FFFF of the character

element content whitespace
A flag indicating whether the character is whitespace appearing within element content

parent

Namespace Information Items

A Namespace Information Item includes:
- prefix
- namespace name (the namespace URI)
Namespace Information Items are attached to elements, one for each namespace in scope on the element

Document Type Declaration

<!DOCTYPE SONG SYSTEM "song.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
                      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

A Document Type Declaration Information Item includes:

system identifier
public identifier
children: Only the processing instruction information items in the internal DTD subset and external DTD subsets.
parent
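
In the DOM, the system and public identifiers are available from the DocumentType node. A hedged sketch using JAXP with a hypothetical class name. The entity resolver that returns an empty external subset is an assumption made here so the example does not fetch the DTD over the network.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class DoctypeIdentifiers {

  // Parses the document and returns its document type declaration node
  public static DocumentType doctypeOf(String xml) throws Exception {
    DocumentBuilder builder
     = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    // Substitute an empty external DTD subset instead of downloading it
    builder.setEntityResolver(new EntityResolver() {
      public InputSource resolveEntity(String publicId, String systemId) {
        return new InputSource(new StringReader(""));
      }
    });
    Document doc = builder.parse(new InputSource(new StringReader(xml)));
    return doc.getDoctype();
  }

  public static void main(String[] args) throws Exception {
    String xml = "<!DOCTYPE html PUBLIC "
     + "\"-//W3C//DTD XHTML 1.0 Strict//EN\" "
     + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
     + "<html/>";
    DocumentType doctype = doctypeOf(xml);
    System.out.println("public identifier: " + doctype.getPublicId());
    System.out.println("system identifier: " + doctype.getSystemId());
  }

}
```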

Unparsed Entity Information Items

Each unparsed entity information item includes

name
system identifier
public identifier
declaration base URI
notation name
notation

The InfoSet Omits:

The internal and external DTD subsets; especially ELEMENT and ATTLIST declarations
Whether an empty element uses two tags or one
What kind of quotes surround attributes
Insignificant white space in attributes
White space that occurs between attributes
Attribute order
CDATA sections
Parsed entities
Comments in the DTD
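
A rough way to observe some of these omissions: the sketch below parses two spellings of the same element that differ in quote style, attribute order, white space between attributes, and empty-element form, then compares them with the DOM Level 3 isEqualNode() method. DOM equality is only an approximation of InfoSet equality, and the class name is invented for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class SameInfoset {

  private static Element root(String xml) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    return factory.newDocumentBuilder()
     .parse(new InputSource(new StringReader(xml)))
     .getDocumentElement();
  }

  // Compares the root elements of two documents node by node
  public static boolean sameElement(String xml1, String xml2)
   throws Exception {
    return root(xml1).isEqualNode(root(xml2));
  }

  public static void main(String[] args) throws Exception {
    String one = "<PHOTO WIDTH=\"100\" HEIGHT=\"200\"/>";
    String two = "<PHOTO HEIGHT='200'   WIDTH='100'></PHOTO>";
    System.out.println(sameElement(one, two));  // prints: true
  }

}
```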

To Learn More

XML InfoSet Specification: http://www.w3.org/TR/xml-infoset

Part II: Writing XML Documents with Java

XML documents are text
Any Writer can produce an XML document

Unicode

XML documents and APIs are Unicode
Unicode encodings:
- UTF-8
- UTF-16 big endian
- UCS-4 big endian
- UTF-16 little endian
- UCS-4 little endian
Non-Unicode encodings:
- ASCII (subset of UTF-8)
- MacRoman
- Windows ANSI
- Latin 1 through Latin 15
- SJIS Japanese
- Big-5 Chinese
- KOI8-R Cyrillic
- Many others...

Readers and Writers

Java's InputStreamReader and OutputStreamWriter classes are very helpful

URL u = new URL(
 "http://www.fxis.co.jp/DMS/sgml/xml/charset/utf-8/weekly.xml");
InputStream in = u.openStream();
InputStreamReader reader = new InputStreamReader(in, "UTF-8");
int c;
// Read decoded characters from the reader, not raw bytes from the stream
while ((c = reader.read()) != -1) System.out.print((char) c);

A Java program that writes Fibonacci numbers into a text file

import java.math.*;
import java.io.*;


public class FibonacciText {

  public static void main(String[] args) {

    try {
      FileOutputStream fout = new FileOutputStream("fibonacci.txt");
      OutputStreamWriter out = new OutputStreamWriter(fout, "8859_1");

      BigInteger low  = BigInteger.ZERO;
      BigInteger high = BigInteger.ONE;

      for (int i = 0; i <= 25; i++) {
        out.write(low.toString() + "\r\n");
        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }
      out.write(low.toString() + "\r\n");

      out.close();
    }
    catch (IOException e) {
      System.err.println(e);
    }

  }

}

fibonacci.txt

A Java program that writes Fibonacci numbers into an XML file

import java.math.*;
import java.io.*;


public class FibonacciXML {

  public static void main(String[] args) {
   
    try {
      FileOutputStream  fout = new FileOutputStream("fibonacci.xml");
      OutputStreamWriter out = new OutputStreamWriter(fout, "UTF-8");
      
      BigInteger low  = BigInteger.ZERO;
      BigInteger high = BigInteger.ONE;      
      
      out.write("<?xml version=\"1.0\"?>\r\n");  
      out.write("<Fibonacci_Numbers>\r\n");  
      for (int i = 0; i < 25; i++) {
        out.write("  <fibonacci index=\"" + i + "\">");
        out.write(low.toString());
        out.write("</fibonacci>\r\n");
        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }
      out.write("</Fibonacci_Numbers>");  
 
      out.close();
    }
    catch (IOException e) {
      System.err.println(e);
    }

  }

}

fibonacci.xml

<?xml version="1.0"?>
<Fibonacci_Numbers>
  <fibonacci index="0">0</fibonacci>
  <fibonacci index="1">1</fibonacci>
  <fibonacci index="2">1</fibonacci>
  <fibonacci index="3">2</fibonacci>
  <fibonacci index="4">3</fibonacci>
  <fibonacci index="5">5</fibonacci>
  <fibonacci index="6">8</fibonacci>
  <fibonacci index="7">13</fibonacci>
  <fibonacci index="8">21</fibonacci>
  <fibonacci index="9">34</fibonacci>
  <fibonacci index="10">55</fibonacci>
  <fibonacci index="11">89</fibonacci>
  <fibonacci index="12">144</fibonacci>
  <fibonacci index="13">233</fibonacci>
  <fibonacci index="14">377</fibonacci>
  <fibonacci index="15">610</fibonacci>
  <fibonacci index="16">987</fibonacci>
  <fibonacci index="17">1597</fibonacci>
  <fibonacci index="18">2584</fibonacci>
  <fibonacci index="19">4181</fibonacci>
  <fibonacci index="20">6765</fibonacci>
  <fibonacci index="21">10946</fibonacci>
  <fibonacci index="22">17711</fibonacci>
  <fibonacci index="23">28657</fibonacci>
  <fibonacci index="24">46368</fibonacci>
</Fibonacci_Numbers>

Suppose we want to use a different encoding than UTF-8

import java.math.*;
import java.io.*;


public class FibonacciLatin1 {

  public static void main(String[] args) {
   
    try {
      FileOutputStream fout 
       = new FileOutputStream("fibonacci_8859_1.xml");
      OutputStreamWriter out = new OutputStreamWriter(fout, "8859_1");      
      
      BigInteger low  = BigInteger.ZERO;
      BigInteger high = BigInteger.ONE;      
      
      // The XML declaration needs the standard name ISO-8859-1,
      // not Java's alias 8859_1
      out.write("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\r\n");
      out.write("<Fibonacci_Numbers>\r\n");
      for (int i = 0; i < 25; i++) {
        out.write("  <fibonacci index=\"" + i + "\">");
        out.write(low.toString());
        out.write("</fibonacci>\r\n");

        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }
      out.write("</Fibonacci_Numbers>");

      out.close();
    }
    catch (IOException e) {
      System.err.println(e);
    }

  }

}

fibonacci_8859_1.xml

<?xml version="1.0" encoding="ISO-8859-1"?>
<Fibonacci_Numbers>
  <fibonacci index="0">0</fibonacci>
  <fibonacci index="1">1</fibonacci>
  <fibonacci index="2">1</fibonacci>
  <fibonacci index="3">2</fibonacci>
  <fibonacci index="4">3</fibonacci>
  <fibonacci index="5">5</fibonacci>
  <fibonacci index="6">8</fibonacci>
  <fibonacci index="7">13</fibonacci>
  <fibonacci index="8">21</fibonacci>
  <fibonacci index="9">34</fibonacci>
  <fibonacci index="10">55</fibonacci>
  <fibonacci index="11">89</fibonacci>
  <fibonacci index="12">144</fibonacci>
  <fibonacci index="13">233</fibonacci>
  <fibonacci index="14">377</fibonacci>
  <fibonacci index="15">610</fibonacci>
  <fibonacci index="16">987</fibonacci>
  <fibonacci index="17">1597</fibonacci>
  <fibonacci index="18">2584</fibonacci>
  <fibonacci index="19">4181</fibonacci>
  <fibonacci index="20">6765</fibonacci>
  <fibonacci index="21">10946</fibonacci>
  <fibonacci index="22">17711</fibonacci>
  <fibonacci index="23">28657</fibonacci>
  <fibonacci index="24">46368</fibonacci>
</Fibonacci_Numbers>
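
The name in the encoding declaration has to describe the bytes the Writer actually produces, and it should be a name XML parsers recognize rather than a Java-only alias such as 8859_1. One possible way to keep the two in step, sketched with java.nio.charset.Charset (class and method names are hypothetical). Java's canonical charset names usually, though not always, match the names parsers expect.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;

public class DeclaredEncoding {

  // Writes a tiny document in the requested encoding and returns its bytes
  public static byte[] write(String encodingName) throws IOException {
    Charset charset = Charset.forName(encodingName);
    ByteArrayOutputStream bout = new ByteArrayOutputStream();
    Writer out = new OutputStreamWriter(bout, charset);
    // charset.name() is the canonical name, for example ISO-8859-1
    // when the charset was looked up through the alias 8859_1
    out.write("<?xml version=\"1.0\" encoding=\""
     + charset.name() + "\"?>\r\n");
    out.write("<Fibonacci_Numbers/>\r\n");
    out.close();
    return bout.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    System.out.write(write("8859_1"));
    System.out.flush();
  }

}
```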

Suppose you want to include a DTD

import java.math.*;
import java.io.*;


public class FibonacciDTD {

  public static void main(String[] args) {
   
    try {
      FileOutputStream fout 
       = new FileOutputStream("valid_fibonacci.xml");
      OutputStreamWriter out = new OutputStreamWriter(fout, "UTF-8");      
      
      BigInteger low  = BigInteger.ZERO;
      BigInteger high = BigInteger.ONE;      
      
      out.write("<?xml version=\"1.0\"?>\r\n");  
      out.write("<!DOCTYPE Fibonacci_Numbers [\r\n");
      out.write("  <!ELEMENT Fibonacci_Numbers (fibonacci*)>\r\n");      
      out.write("  <!ELEMENT fibonacci (#PCDATA)>\r\n");      
      out.write("  <!ATTLIST fibonacci index CDATA #IMPLIED>\r\n");      
      out.write("]>\r\n");  
      out.write("<Fibonacci_Numbers>\r\n");  
      for (int i = 0; i < 25; i++) {
        out.write("  <fibonacci index=\"" + i + "\">");
        out.write(low.toString());
        out.write("</fibonacci>\r\n");
        
        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }
      out.write("</Fibonacci_Numbers>");  
 
      out.close();
    }
    catch (IOException e) {
      System.err.println(e);
    }

  }

}

valid_fibonacci.xml

<?xml version="1.0"?>
<!DOCTYPE Fibonacci_Numbers [
  <!ELEMENT Fibonacci_Numbers (fibonacci*)>
  <!ELEMENT fibonacci (#PCDATA)>
  <!ATTLIST fibonacci index CDATA #IMPLIED>
]>
<Fibonacci_Numbers>
  <fibonacci index="0">0</fibonacci>
  <fibonacci index="1">1</fibonacci>
  <fibonacci index="2">1</fibonacci>
  <fibonacci index="3">2</fibonacci>
  <fibonacci index="4">3</fibonacci>
  <fibonacci index="5">5</fibonacci>
  <fibonacci index="6">8</fibonacci>
  <fibonacci index="7">13</fibonacci>
  <fibonacci index="8">21</fibonacci>
  <fibonacci index="9">34</fibonacci>
  <fibonacci index="10">55</fibonacci>
  <fibonacci index="11">89</fibonacci>
  <fibonacci index="12">144</fibonacci>
  <fibonacci index="13">233</fibonacci>
  <fibonacci index="14">377</fibonacci>
  <fibonacci index="15">610</fibonacci>
  <fibonacci index="16">987</fibonacci>
  <fibonacci index="17">1597</fibonacci>
  <fibonacci index="18">2584</fibonacci>
  <fibonacci index="19">4181</fibonacci>
  <fibonacci index="20">6765</fibonacci>
  <fibonacci index="21">10946</fibonacci>
  <fibonacci index="22">17711</fibonacci>
  <fibonacci index="23">28657</fibonacci>
  <fibonacci index="24">46368</fibonacci>
</Fibonacci_Numbers>
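
Whether the generated document really is valid can be checked by reading it back with a validating parser. A minimal sketch, assuming JAXP; the class name is made up, and the prime element in the second test document is hypothetical, included only to trigger a validity error.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class ValidityCheck {

  // Returns the number of validity errors the parser reports
  public static int countErrors(String xml) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setValidating(true);
    DocumentBuilder builder = factory.newDocumentBuilder();
    final int[] errors = {0};
    builder.setErrorHandler(new ErrorHandler() {
      public void warning(SAXParseException e) {}
      public void error(SAXParseException e) { errors[0]++; }
      public void fatalError(SAXParseException e) throws SAXException {
        throw e;  // not well-formed; give up
      }
    });
    builder.parse(new InputSource(new StringReader(xml)));
    return errors[0];
  }

  public static void main(String[] args) throws Exception {
    String dtd = "<!DOCTYPE Fibonacci_Numbers [\r\n"
     + "  <!ELEMENT Fibonacci_Numbers (fibonacci*)>\r\n"
     + "  <!ELEMENT fibonacci (#PCDATA)>\r\n"
     + "  <!ATTLIST fibonacci index CDATA #IMPLIED>\r\n"
     + "]>\r\n";
    String valid = dtd + "<Fibonacci_Numbers>"
     + "<fibonacci index=\"0\">0</fibonacci></Fibonacci_Numbers>";
    String invalid = dtd + "<Fibonacci_Numbers>"
     + "<prime>2</prime></Fibonacci_Numbers>";
    System.out.println(countErrors(valid));        // no validity errors
    System.out.println(countErrors(invalid) > 0);  // prime is undeclared
  }

}
```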

Converting data to XML

Sample Tab Delimited Data: Baseball Statistics



Surname FirstName Team Position Games Played Games Started AtBats Runs Hits Doubles Triples Home runs RBI Stolen Bases Caught Stealing Sacrifice Hits Sacrifice Flies Errors PB Walks Strike outs Hit by pitch 
Anderson Garret ANA Outfield 156 151 622 62 183 41 7 15 79 8 3 3 3 6 0 29 80 1 
Baughman Justin ANA Second Base 62 54 196 24 50 9 1 1 20 10 4 5 3 8 0 6 36 1 
Bolick Frank ANA Third Base 21 11 45 3 7 2 0 1 2 0 0 0 0 0 0 11 8 0 
Disarcina Gary ANA Shortstop 157 155 551 73 158 39 3 3 56 12 7 12 3 14 0 21 51 8 
Edmonds Jim ANA Outfield 154 150 599 115 184 42 1 25 91 7 5 1 1 5 0 57 114 1 
Erstad Darin ANA Outfield 133 129 537 84 159 39 3 19 82 20 6 1 3 3 0 43 77 6 
Garcia Carlos ANA Second Base 19 10 35 4 5 1 0 0 0 2 0 1 0 1 0 3 11 1 
Glaus Troy ANA Third Base 48 45 165 19 36 9 0 1 23 1 0 0 2 7 0 15 51 0 
Greene Todd ANA Outfield 29 15 71 3 18 4 0 1 7 0 0 0 0 0 0 2 20 0 
Helfand Eric ANA Catcher 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
Hollins Dave ANA Third Base 101 98 363 60 88 16 2 11 39 11 3 2 2 17 0 44 69 7 
Jefferies Gregg ANA Outfield 19 18 72 7 25 6 0 1 10 1 0 0 0 0 0 0 5 0 
Johnson Mark ANA First Base 10 2 14 1 1 0 0 0 0 0 0 0 0 0 0 0 6 0 
Kreuter Chad ANA Catcher 96 74 252 27 63 10 1 2 33 1 0 5 1 9 5 33 49 3 
Martin Norberto ANA Second Base 79 50 195 20 42 2 0 1 13 3 1 3 2 4 0 6 29 0 
Mashore Damon ANA Outfield 43 24 98 13 23 6 0 2 11 1 0 1 0 0 0 9 22 3 
Molina Ben ANA Catcher 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
Nevin Phil ANA Catcher 75 65 237 27 54 8 1 8 27 0 0 0 2 5 20 17 67 5 
Obrien Charlie ANA Catcher 62 58 175 13 45 9 0 4 18 0 0 3 3 4 1 10 33 2 
Palmeiro Orlando ANA Outfield 74 34 165 28 53 7 2 0 21 5 4 7 0 0 0 20 11 0 
Pritchett Chris ANA First Base 31 19 80 12 23 2 1 2 8 2 0 0 0 1 0 4 16 0 
Salmon Tim ANA Designated Hitter 136 130 463 84 139 28 1 26 88 0 1 0 10 2 0 90 100 3 
Shipley Craig ANA Third Base 77 32 147 18 38 7 1 2 17 0 4 4 1 3 0 5 22 5 
Velarde Randy ANA Second Base 51 50 188 29 49 13 1 4 26 7 2 0 1 4 0 34 42 1 
Walbeck Matt ANA Catcher 108 91 338 41 87 15 2 6 46 1 1 5 5 7 8 30 68 2 
Williams Reggie ANA Outfield 29 7 36 7 13 1 0 1 5 3 3 1 0 0 0 7 11 1

A Program to convert tab delimited data to XML

import java.io.*;


public class BaseballTabToXML {

  public static void main(String[] args) {
     
    try {
      FileInputStream fin = new FileInputStream(args[0]);
      BufferedReader in 
       = new BufferedReader(new InputStreamReader(fin));
      
      FileOutputStream fout 
       = new FileOutputStream("baseballstats.xml");
      OutputStreamWriter out = new OutputStreamWriter(fout, "UTF-8");      
      out.write("<?xml version=\"1.0\"?>\r\n");  
      out.write("<players>\r\n");
      String playerStats;  
      while ((playerStats = in.readLine()) != null) {
        String[] stats = splitLine(playerStats);         
        out.write("  <player>\r\n");
          out.write("    <first_name>" + stats[1] + "</first_name>\r\n");
          out.write("    <surname>" + stats[0] + "</surname>\r\n");
          out.write("    <games_played>" + stats[4] + "</games_played>\r\n");
          out.write("    <at_bats>" + stats[6] + "</at_bats>\r\n");
          out.write("    <runs>" + stats[7] + "</runs>\r\n");
          out.write("    <hits>" + stats[8] + "</hits>\r\n");
          out.write("    <doubles>" + stats[9] + "</doubles>\r\n");
          out.write("    <triples>" + stats[10] + "</triples>\r\n");
          out.write("    <home_runs>" + stats[11] + "</home_runs>\r\n");
          out.write("    <stolen_bases>" + stats[13] + "</stolen_bases>\r\n");
          out.write("    <caught_stealing>" + stats[14] + "</caught_stealing>\r\n");
          out.write("    <sacrifice_hits>" + stats[15] + "</sacrifice_hits>\r\n");
          out.write("    <sacrifice_flies>" + stats[16] + "</sacrifice_flies>\r\n");
          out.write("    <errors>" + stats[17] + "</errors>\r\n");
          out.write("    <passed_by_ball>" + stats[18] + "</passed_by_ball>\r\n");
          out.write("    <walks>" + stats[19] + "</walks>\r\n");
          out.write("    <strike_outs>" + stats[20] + "</strike_outs>\r\n");
          out.write("    <hit_by_pitch>" + stats[21] + "</hit_by_pitch>\r\n");
        out.write("  </player>\r\n");
      }  
      out.write("</players>\r\n");  
      out.close();
      in.close();
    }
    catch (IOException e) {
      System.err.println(e);
    }
    catch (ArrayIndexOutOfBoundsException e) {
      System.out.println("Usage: java BaseballTabToXML input_file.tab");
    }

  }

  public static String[] splitLine(String playerStats) {
    
    // count the number of tabs
    int numTabs = 0;
    for (int i = 0; i < playerStats.length(); i++) {
      if (playerStats.charAt(i) == '\t') numTabs++;
    }
    int numFields = numTabs + 1;
    String[] fields = new String[numFields];
    int position = 0;
    for (int i = 0; i < numFields; i++) {
      StringBuffer field = new StringBuffer();
      while (position < playerStats.length() 
       && playerStats.charAt(position++) != '\t') {
        field.append(playerStats.charAt(position-1));
      }
      fields[i] = field.toString();
    }    
    return fields;
    
  }

}
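
On JDK 1.4 and later, the hand-written splitLine() method above can probably be replaced by String.split(). The following is a minimal sketch, assuming fields are separated by single tab characters; the class name SplitDemo is made up for illustration. The limit argument -1 keeps trailing empty fields, so the array has one entry per field just as splitLine() produces.

```java
public class SplitDemo {

  // Splits one tab delimited record into its fields.
  // The limit -1 keeps trailing empty fields instead of dropping them.
  public static String[] splitLine(String playerStats) {
    return playerStats.split("\t", -1);
  }

  public static void main(String[] args) {
    String[] stats = splitLine("Walbeck\tMatt\tANA\tCatcher\t108");
    // field 1 is the first name, field 0 the surname
    System.out.println(stats[1] + " " + stats[0]);
  }

}
```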

Baseball Stats in XML

<?xml version="1.0"?>
<players>
  <player>
    <first_name>FirstName</first_name>
    <surname>Surname</surname>
    <games_played>Games Played</games_played>
    <at_bats>AtBats</at_bats>
    <runs>Runs</runs>
    <hits>Hits</hits>
    <doubles>Doubles</doubles>
    <triples>Triples</triples>
    <home_runs>Home runs</home_runs>
    <stolen_bases>RBI</stolen_bases>
    <caught_stealing>Caught Stealing</caught_stealing>
    <sacrifice_hits>Sacrifice Hits</sacrifice_hits>
    <sacrifice_flies>Sacrifice Flies</sacrifice_flies>
    <errors>Errors</errors>
    <passed_by_ball>PB</passed_by_ball>
    <walks>Walks</walks>
    <strike_outs>Strike outs</strike_outs>
    <hit_by_pitch>Hit by pitch</hit_by_pitch>
  </player>
  <player>
    <first_name>Garret </first_name>
    <surname>Anderson</surname>
    <games_played>156</games_played>
    <at_bats>622</at_bats>
    <runs>62</runs>
    <hits>183</hits>
    <doubles>41</doubles>
    <triples>7</triples>
    <home_runs>15</home_runs>
    <stolen_bases>79</stolen_bases>
    <caught_stealing>3</caught_stealing>
    <sacrifice_hits>3</sacrifice_hits>
    <sacrifice_flies>3</sacrifice_flies>
    <errors>6</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>29</walks>
    <strike_outs>80</strike_outs>
    <hit_by_pitch>1</hit_by_pitch>
  </player>
  <player>
    <first_name>Justin </first_name>
    <surname>Baughman</surname>
    <games_played>62</games_played>
    <at_bats>196</at_bats>
    <runs>24</runs>
    <hits>50</hits>
    <doubles>9</doubles>
    <triples>1</triples>
    <home_runs>1</home_runs>
    <stolen_bases>20</stolen_bases>
    <caught_stealing>4</caught_stealing>
    <sacrifice_hits>5</sacrifice_hits>
    <sacrifice_flies>3</sacrifice_flies>
    <errors>8</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>6</walks>
    <strike_outs>36</strike_outs>
    <hit_by_pitch>1</hit_by_pitch>
  </player>
  <player>
    <first_name>Frank </first_name>
    <surname>Bolick</surname>
    <games_played>21</games_played>
    <at_bats>45</at_bats>
    <runs>3</runs>
    <hits>7</hits>
    <doubles>2</doubles>
    <triples>0</triples>
    <home_runs>1</home_runs>
    <stolen_bases>2</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>0</sacrifice_hits>
    <sacrifice_flies>0</sacrifice_flies>
    <errors>0</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>11</walks>
    <strike_outs>8</strike_outs>
    <hit_by_pitch>0</hit_by_pitch>
  </player>
  <player>
    <first_name>Gary </first_name>
    <surname>Disarcina</surname>
    <games_played>157</games_played>
    <at_bats>551</at_bats>
    <runs>73</runs>
    <hits>158</hits>
    <doubles>39</doubles>
    <triples>3</triples>
    <home_runs>3</home_runs>
    <stolen_bases>56</stolen_bases>
    <caught_stealing>7</caught_stealing>
    <sacrifice_hits>12</sacrifice_hits>
    <sacrifice_flies>3</sacrifice_flies>
    <errors>14</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>21</walks>
    <strike_outs>51</strike_outs>
    <hit_by_pitch>8</hit_by_pitch>
  </player>
  <player>
    <first_name>Jim </first_name>
    <surname>Edmonds</surname>
    <games_played>154</games_played>
    <at_bats>599</at_bats>
    <runs>115</runs>
    <hits>184</hits>
    <doubles>42</doubles>
    <triples>1</triples>
    <home_runs>25</home_runs>
    <stolen_bases>91</stolen_bases>
    <caught_stealing>5</caught_stealing>
    <sacrifice_hits>1</sacrifice_hits>
    <sacrifice_flies>1</sacrifice_flies>
    <errors>5</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>57</walks>
    <strike_outs>114</strike_outs>
    <hit_by_pitch>1</hit_by_pitch>
  </player>
  <player>
    <first_name>Darin </first_name>
    <surname>Erstad</surname>
    <games_played>133</games_played>
    <at_bats>537</at_bats>
    <runs>84</runs>
    <hits>159</hits>
    <doubles>39</doubles>
    <triples>3</triples>
    <home_runs>19</home_runs>
    <stolen_bases>82</stolen_bases>
    <caught_stealing>6</caught_stealing>
    <sacrifice_hits>1</sacrifice_hits>
    <sacrifice_flies>3</sacrifice_flies>
    <errors>3</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>43</walks>
    <strike_outs>77</strike_outs>
    <hit_by_pitch>6</hit_by_pitch>
  </player>
  <player>
    <first_name>Carlos </first_name>
    <surname>Garcia</surname>
    <games_played>19</games_played>
    <at_bats>35</at_bats>
    <runs>4</runs>
    <hits>5</hits>
    <doubles>1</doubles>
    <triples>0</triples>
    <home_runs>0</home_runs>
    <stolen_bases>0</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>1</sacrifice_hits>
    <sacrifice_flies>0</sacrifice_flies>
    <errors>1</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>3</walks>
    <strike_outs>11</strike_outs>
    <hit_by_pitch>1</hit_by_pitch>
  </player>
  <player>
    <first_name>Troy </first_name>
    <surname>Glaus</surname>
    <games_played>48</games_played>
    <at_bats>165</at_bats>
    <runs>19</runs>
    <hits>36</hits>
    <doubles>9</doubles>
    <triples>0</triples>
    <home_runs>1</home_runs>
    <stolen_bases>23</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>0</sacrifice_hits>
    <sacrifice_flies>2</sacrifice_flies>
    <errors>7</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>15</walks>
    <strike_outs>51</strike_outs>
    <hit_by_pitch>0</hit_by_pitch>
  </player>
  <player>
    <first_name>Todd </first_name>
    <surname>Greene</surname>
    <games_played>29</games_played>
    <at_bats>71</at_bats>
    <runs>3</runs>
    <hits>18</hits>
    <doubles>4</doubles>
    <triples>0</triples>
    <home_runs>1</home_runs>
    <stolen_bases>7</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>0</sacrifice_hits>
    <sacrifice_flies>0</sacrifice_flies>
    <errors>0</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>2</walks>
    <strike_outs>20</strike_outs>
    <hit_by_pitch>0</hit_by_pitch>
  </player>
  <player>
    <first_name>Eric </first_name>
    <surname>Helfand</surname>
    <games_played>0</games_played>
    <at_bats>0</at_bats>
    <runs>0</runs>
    <hits>0</hits>
    <doubles>0</doubles>
    <triples>0</triples>
    <home_runs>0</home_runs>
    <stolen_bases>0</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>0</sacrifice_hits>
    <sacrifice_flies>0</sacrifice_flies>
    <errors>0</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>0</walks>
    <strike_outs>0</strike_outs>
    <hit_by_pitch>0</hit_by_pitch>
  </player>
  <player>
    <first_name>Dave </first_name>
    <surname>Hollins</surname>
    <games_played>101</games_played>
    <at_bats>363</at_bats>
    <runs>60</runs>
    <hits>88</hits>
    <doubles>16</doubles>
    <triples>2</triples>
    <home_runs>11</home_runs>
    <stolen_bases>39</stolen_bases>
    <caught_stealing>3</caught_stealing>
    <sacrifice_hits>2</sacrifice_hits>
    <sacrifice_flies>2</sacrifice_flies>
    <errors>17</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>44</walks>
    <strike_outs>69</strike_outs>
    <hit_by_pitch>7</hit_by_pitch>
  </player>
  <player>
    <first_name>Gregg </first_name>
    <surname>Jefferies</surname>
    <games_played>19</games_played>
    <at_bats>72</at_bats>
    <runs>7</runs>
    <hits>25</hits>
    <doubles>6</doubles>
    <triples>0</triples>
    <home_runs>1</home_runs>
    <stolen_bases>10</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>0</sacrifice_hits>
    <sacrifice_flies>0</sacrifice_flies>
    <errors>0</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>0</walks>
    <strike_outs>5</strike_outs>
    <hit_by_pitch>0</hit_by_pitch>
  </player>
  <player>
    <first_name>Mark </first_name>
    <surname>Johnson</surname>
    <games_played>10</games_played>
    <at_bats>14</at_bats>
    <runs>1</runs>
    <hits>1</hits>
    <doubles>0</doubles>
    <triples>0</triples>
    <home_runs>0</home_runs>
    <stolen_bases>0</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>0</sacrifice_hits>
    <sacrifice_flies>0</sacrifice_flies>
    <errors>0</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>0</walks>
    <strike_outs>6</strike_outs>
    <hit_by_pitch>0</hit_by_pitch>
  </player>
  <player>
    <first_name>Chad </first_name>
    <surname>Kreuter</surname>
    <games_played>96</games_played>
    <at_bats>252</at_bats>
    <runs>27</runs>
    <hits>63</hits>
    <doubles>10</doubles>
    <triples>1</triples>
    <home_runs>2</home_runs>
    <stolen_bases>33</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>5</sacrifice_hits>
    <sacrifice_flies>1</sacrifice_flies>
    <errors>9</errors>
    <passed_by_ball>5</passed_by_ball>
    <walks>33</walks>
    <strike_outs>49</strike_outs>
    <hit_by_pitch>3</hit_by_pitch>
  </player>
  <player>
    <first_name>Norberto </first_name>
    <surname>Martin</surname>
    <games_played>79</games_played>
    <at_bats>195</at_bats>
    <runs>20</runs>
    <hits>42</hits>
    <doubles>2</doubles>
    <triples>0</triples>
    <home_runs>1</home_runs>
    <stolen_bases>13</stolen_bases>
    <caught_stealing>1</caught_stealing>
    <sacrifice_hits>3</sacrifice_hits>
    <sacrifice_flies>2</sacrifice_flies>
    <errors>4</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>6</walks>
    <strike_outs>29</strike_outs>
    <hit_by_pitch>0</hit_by_pitch>
  </player>
  <player>
    <first_name>Damon </first_name>
    <surname>Mashore</surname>
    <games_played>43</games_played>
    <at_bats>98</at_bats>
    <runs>13</runs>
    <hits>23</hits>
    <doubles>6</doubles>
    <triples>0</triples>
    <home_runs>2</home_runs>
    <stolen_bases>11</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>1</sacrifice_hits>
    <sacrifice_flies>0</sacrifice_flies>
    <errors>0</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>9</walks>
    <strike_outs>22</strike_outs>
    <hit_by_pitch>3</hit_by_pitch>
  </player>
  <player>
    <first_name>Ben </first_name>
    <surname>Molina</surname>
    <games_played>2</games_played>
    <at_bats>1</at_bats>
    <runs>0</runs>
    <hits>0</hits>
    <doubles>0</doubles>
    <triples>0</triples>
    <home_runs>0</home_runs>
    <stolen_bases>0</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>0</sacrifice_hits>
    <sacrifice_flies>0</sacrifice_flies>
    <errors>0</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>0</walks>
    <strike_outs>0</strike_outs>
    <hit_by_pitch>0</hit_by_pitch>
  </player>
  <player>
    <first_name>Phil </first_name>
    <surname>Nevin</surname>
    <games_played>75</games_played>
    <at_bats>237</at_bats>
    <runs>27</runs>
    <hits>54</hits>
    <doubles>8</doubles>
    <triples>1</triples>
    <home_runs>8</home_runs>
    <stolen_bases>27</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>0</sacrifice_hits>
    <sacrifice_flies>2</sacrifice_flies>
    <errors>5</errors>
    <passed_by_ball>20</passed_by_ball>
    <walks>17</walks>
    <strike_outs>67</strike_outs>
    <hit_by_pitch>5</hit_by_pitch>
  </player>
  <player>
    <first_name>Charlie </first_name>
    <surname>Obrien</surname>
    <games_played>62</games_played>
    <at_bats>175</at_bats>
    <runs>13</runs>
    <hits>45</hits>
    <doubles>9</doubles>
    <triples>0</triples>
    <home_runs>4</home_runs>
    <stolen_bases>18</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>3</sacrifice_hits>
    <sacrifice_flies>3</sacrifice_flies>
    <errors>4</errors>
    <passed_by_ball>1</passed_by_ball>
    <walks>10</walks>
    <strike_outs>33</strike_outs>
    <hit_by_pitch>2</hit_by_pitch>
  </player>
  <player>
    <first_name>Orlando </first_name>
    <surname>Palmeiro</surname>
    <games_played>74</games_played>
    <at_bats>165</at_bats>
    <runs>28</runs>
    <hits>53</hits>
    <doubles>7</doubles>
    <triples>2</triples>
    <home_runs>0</home_runs>
    <stolen_bases>21</stolen_bases>
    <caught_stealing>4</caught_stealing>
    <sacrifice_hits>7</sacrifice_hits>
    <sacrifice_flies>0</sacrifice_flies>
    <errors>0</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>20</walks>
    <strike_outs>11</strike_outs>
    <hit_by_pitch>0</hit_by_pitch>
  </player>
  <player>
    <first_name>Chris </first_name>
    <surname>Pritchett</surname>
    <games_played>31</games_played>
    <at_bats>80</at_bats>
    <runs>12</runs>
    <hits>23</hits>
    <doubles>2</doubles>
    <triples>1</triples>
    <home_runs>2</home_runs>
    <stolen_bases>8</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>0</sacrifice_hits>
    <sacrifice_flies>0</sacrifice_flies>
    <errors>1</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>4</walks>
    <strike_outs>16</strike_outs>
    <hit_by_pitch>0</hit_by_pitch>
  </player>
  <player>
    <first_name>Tim </first_name>
    <surname>Salmon</surname>
    <games_played>136</games_played>
    <at_bats>463</at_bats>
    <runs>84</runs>
    <hits>139</hits>
    <doubles>28</doubles>
    <triples>1</triples>
    <home_runs>26</home_runs>
    <stolen_bases>88</stolen_bases>
    <caught_stealing>1</caught_stealing>
    <sacrifice_hits>0</sacrifice_hits>
    <sacrifice_flies>10</sacrifice_flies>
    <errors>2</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>90</walks>
    <strike_outs>100</strike_outs>
    <hit_by_pitch>3</hit_by_pitch>
  </player>
  <player>
    <first_name>Craig </first_name>
    <surname>Shipley</surname>
    <games_played>77</games_played>
    <at_bats>147</at_bats>
    <runs>18</runs>
    <hits>38</hits>
    <doubles>7</doubles>
    <triples>1</triples>
    <home_runs>2</home_runs>
    <stolen_bases>17</stolen_bases>
    <caught_stealing>4</caught_stealing>
    <sacrifice_hits>4</sacrifice_hits>
    <sacrifice_flies>1</sacrifice_flies>
    <errors>3</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>5</walks>
    <strike_outs>22</strike_outs>
    <hit_by_pitch>5</hit_by_pitch>
  </player>
  <player>
    <first_name>Randy </first_name>
    <surname>Velarde</surname>
    <games_played>51</games_played>
    <at_bats>188</at_bats>
    <runs>29</runs>
    <hits>49</hits>
    <doubles>13</doubles>
    <triples>1</triples>
    <home_runs>4</home_runs>
    <stolen_bases>26</stolen_bases>
    <caught_stealing>2</caught_stealing>
    <sacrifice_hits>0</sacrifice_hits>
    <sacrifice_flies>1</sacrifice_flies>
    <errors>4</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>34</walks>
    <strike_outs>42</strike_outs>
    <hit_by_pitch>1</hit_by_pitch>
  </player>
  <player>
    <first_name>Matt </first_name>
    <surname>Walbeck</surname>
    <games_played>108</games_played>
    <at_bats>338</at_bats>
    <runs>41</runs>
    <hits>87</hits>
    <doubles>15</doubles>
    <triples>2</triples>
    <home_runs>6</home_runs>
    <stolen_bases>46</stolen_bases>
    <caught_stealing>1</caught_stealing>
    <sacrifice_hits>5</sacrifice_hits>
    <sacrifice_flies>5</sacrifice_flies>
    <errors>7</errors>
    <passed_by_ball>8</passed_by_ball>
    <walks>30</walks>
    <strike_outs>68</strike_outs>
    <hit_by_pitch>2</hit_by_pitch>
  </player>
  <player>
    <first_name>Reggie </first_name>
    <surname>Williams</surname>
    <games_played>29</games_played>
    <at_bats>36</at_bats>
    <runs>7</runs>
    <hits>13</hits>
    <doubles>1</doubles>
    <triples>0</triples>
    <home_runs>1</home_runs>
    <stolen_bases>5</stolen_bases>
    <caught_stealing>3</caught_stealing>
    <sacrifice_hits>1</sacrifice_hits>
    <sacrifice_flies>0</sacrifice_flies>
    <errors>0</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>7</walks>
    <strike_outs>11</strike_outs>
    <hit_by_pitch>1</hit_by_pitch>
  </player>
</players>

Converting data to XML while Processing it

import java.io.*;
import java.text.*;
import java.util.*;

public class BattingAverage {

  public static void main(String[] args) {

    try {
      FileInputStream fin = new FileInputStream(args[0]);
      BufferedReader in
       = new BufferedReader(new InputStreamReader(fin));

      FileOutputStream fout
       = new FileOutputStream("battingaverages.xml");
      OutputStreamWriter out
       = new OutputStreamWriter(fout, "UTF-8");
      out.write("<?xml version=\"1.0\"?>\r\n");
      out.write("<players>\r\n");
      String playerStats;

      // for formatting batting averages
      DecimalFormat averages = (DecimalFormat)
        NumberFormat.getNumberInstance(Locale.US);
      averages.setMaximumFractionDigits(3);
      averages.setMinimumFractionDigits(3);
      averages.setMinimumIntegerDigits(0);

      while ((playerStats = in.readLine()) != null) {
        String[] stats = splitLine(playerStats);

        String formattedAverage;
        try {
          int atBats         = Integer.parseInt(stats[6]);
          int hits           = Integer.parseInt(stats[8]);

          if (atBats <= 0) formattedAverage = "N/A";
          else {
            double average = hits / (double) atBats;
            formattedAverage = averages.format(average);
          }
        }
        catch (Exception e) {
          // skip this player
          continue;
        }

        out.write("  <player>\r\n");
        out.write("    <first_name>" + stats[1] + "</first_name>\r\n");
        out.write("    <surname>" + stats[0] + "</surname>\r\n");
        out.write("    <batting_average>" + formattedAverage
         + "</batting_average>\r\n");
        out.write("  </player>\r\n");
      }
      out.write("</players>\r\n");
      out.close();
      in.close();
    }
    catch (IOException e) {
      System.err.println(e);
    }
    catch (ArrayIndexOutOfBoundsException e) {
      System.out.println("Usage: java BattingAverage input_file.tab");
    }

  }

  public static String[] splitLine(String playerStats) {

    // count the number of tabs
    int numTabs = 0;
    for (int i = 0; i < playerStats.length(); i++) {
      if (playerStats.charAt(i) == '\t') numTabs++;
    }
    int numFields = numTabs + 1;
    String[] fields = new String[numFields];
    int position = 0;
    for (int i = 0; i < numFields; i++) {
      StringBuffer field = new StringBuffer();
      while (position < playerStats.length()
       && playerStats.charAt(position++) != '\t') {
        field.append(playerStats.charAt(position-1));
      }
      fields[i] = field.toString();
    }
    return fields;

  }

}

Batting Averages in XML

<?xml version="1.0"?>
<players>
  <player>
    <first_name>Garret </first_name>
    <surname>Anderson</surname>
    <batting_average>.294</batting_average>
  </player>
  <player>
    <first_name>Justin </first_name>
    <surname>Baughman</surname>
    <batting_average>.255</batting_average>
  </player>
  <player>
    <first_name>Frank </first_name>
    <surname>Bolick</surname>
    <batting_average>.156</batting_average>
  </player>
  <player>
    <first_name>Gary </first_name>
    <surname>Disarcina</surname>
    <batting_average>.287</batting_average>
  </player>
  <player>
    <first_name>Jim </first_name>
    <surname>Edmonds</surname>
    <batting_average>.307</batting_average>
  </player>
  <player>
    <first_name>Darin </first_name>
    <surname>Erstad</surname>
    <batting_average>.296</batting_average>
  </player>
  <player>
    <first_name>Carlos </first_name>
    <surname>Garcia</surname>
    <batting_average>.143</batting_average>
  </player>
  <player>
    <first_name>Troy </first_name>
    <surname>Glaus</surname>
    <batting_average>.218</batting_average>
  </player>
  <player>
    <first_name>Todd </first_name>
    <surname>Greene</surname>
    <batting_average>.254</batting_average>
  </player>
  <player>
    <first_name>Eric </first_name>
    <surname>Helfand</surname>
    <batting_average>N/A</batting_average>
  </player>
  <player>
    <first_name>Dave </first_name>
    <surname>Hollins</surname>
    <batting_average>.242</batting_average>
  </player>
  <player>
    <first_name>Gregg </first_name>
    <surname>Jefferies</surname>
    <batting_average>.347</batting_average>
  </player>
  <player>
    <first_name>Mark </first_name>
    <surname>Johnson</surname>
    <batting_average>.071</batting_average>
  </player>
  <player>
    <first_name>Chad </first_name>
    <surname>Kreuter</surname>
    <batting_average>.250</batting_average>
  </player>
  <player>
    <first_name>Norberto </first_name>
    <surname>Martin</surname>
    <batting_average>.215</batting_average>
  </player>
  <player>
    <first_name>Damon </first_name>
    <surname>Mashore</surname>
    <batting_average>.235</batting_average>
  </player>
  <player>
    <first_name>Ben </first_name>
    <surname>Molina</surname>
    <batting_average>.000</batting_average>
  </player>
  <player>
    <first_name>Phil </first_name>
    <surname>Nevin</surname>
    <batting_average>.228</batting_average>
  </player>
  <player>
    <first_name>Charlie </first_name>
    <surname>Obrien</surname>
    <batting_average>.257</batting_average>
  </player>
  <player>
    <first_name>Orlando </first_name>
    <surname>Palmeiro</surname>
    <batting_average>.321</batting_average>
  </player>
  <player>
    <first_name>Chris </first_name>
    <surname>Pritchett</surname>
    <batting_average>.288</batting_average>
  </player>
  <player>
    <first_name>Tim </first_name>
    <surname>Salmon</surname>
    <batting_average>.300</batting_average>
  </player>
  <player>
    <first_name>Craig </first_name>
    <surname>Shipley</surname>
    <batting_average>.259</batting_average>
  </player>
  <player>
    <first_name>Randy </first_name>
    <surname>Velarde</surname>
    <batting_average>.261</batting_average>
  </player>
  <player>
    <first_name>Matt </first_name>
    <surname>Walbeck</surname>
    <batting_average>.257</batting_average>
  </player>
  <player>
    <first_name>Reggie </first_name>
    <surname>Williams</surname>
    <batting_average>.361</batting_average>
  </player>
</players>

The point is this:

XML files are text files.
You can write XML files any way you can write a text file in Java, or in any other language for that matter.
You have to follow the well-formedness rules.
You have to use UTF-8 or UTF-16, or else specify a different encoding in the XML declaration.
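
The well-formedness rule that is easiest to break when writing XML by string concatenation is escaping: a literal & or < in the data produces a malformed document. A minimal sketch of a helper that could wrap each field before it is written; the class name XMLEscaper and the method name escapeText are hypothetical, not part of the programs above.

```java
public class XMLEscaper {

  // Escapes the characters that must not appear literally in element content.
  public static String escapeText(String s) {
    StringBuffer result = new StringBuffer(s.length());
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      switch (c) {
        case '&': result.append("&amp;"); break;
        case '<': result.append("&lt;"); break;
        // > only has to be escaped in the sequence ]]> but escaping it always is safe
        case '>': result.append("&gt;"); break;
        default:  result.append(c);
      }
    }
    return result.toString();
  }

  public static void main(String[] args) {
    // prints <surname>O'Brien &amp; Sons</surname>
    System.out.println("<surname>" + escapeText("O'Brien & Sons") + "</surname>");
  }

}
```

This handles element content only; attribute values would also need the quote character escaped.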

To Learn More

For streams and readers and writers:
- Java I/O
- Elliotte Rusty Harold
- O'Reilly & Associates, 1999
- ISBN: 1-56592-485-1
For well-formedness rules and such:
- XML in a Nutshell
- Elliotte Rusty Harold and Scott Means
- O'Reilly & Associates, 2001
- ISBN: 0-596-00058-8

Part III: Reading XML Documents with SAX

The stereotypical "Desperate Perl Hacker" (DPH) is supposed to be able to write an XML parser in a weekend.
The parser does the hard work for you.
Your code reads the document through the parser's API.

SAX

Public domain, developed on xml-dev mailing list
Maintained by David Megginson and David Brownell
org.xml.sax package
http://www.megginson.com/SAX/
http://sax.sourceforge.net/
Event based

SAX Parsers for Java

Parser	URL	Validating	Namespaces	DOM1	DOM2	SAX1	SAX2	License
Apache XML Project's Xerces Java	http://xml.apache.org/xerces-j/index.html	X	X	X	X	X	X	Apache Software License, Version 1.1
IBM's XML for Java	http://www.alphaworks.ibm.com/formula/xml	X	X	X	X	X	X	License
James Clark's XP	http://www.jclark.com/xml/xp/index.html					X		Modified BSD
Microstar's Ælfred	http://home.pacbell.net/david-b/xml/		X			X	X	GPL with library exception
Silfide's SXP	http://www.loria.fr/projets/XSilfide/EN/sxp/			X		X		Non-GPL viral open source license
Sun's Java API for XML	http://java.sun.com/products/xml	X	X	X		X		free beer
Oracle's XML Parser for Java	http://technet.oracle.com/	X	X	X		X		free beer

SAX1

SAX1 omits:
- Comments
- Lexical Information (CDATA sections, entity references, etc.)
- DTD declarations
- Validation
- Namespaces

SAX2

Adds:
- Namespace support
- Optional validation
- Optional lexical events for comments, CDATA sections, entity references
A lot more configurable
Deprecates a lot of SAX1
Adapter classes (ParserAdapter and XMLReaderAdapter in org.xml.sax.helpers) convert between SAX1 parsers and SAX2 XMLReaders.
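
Most SAX2 configuration goes through setFeature() and getFeature(), which take a feature name in the form of a URI. The sketch below assumes whatever parser XMLReaderFactory finds by default; the class name FeatureDemo and its helper methods are made up for illustration. The two feature names are the standard SAX2 ones. Note that recent JDKs mark XMLReaderFactory as deprecated, though it is still present.

```java
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class FeatureDemo {

  // Standard SAX2 feature names
  public static final String NAMESPACES
   = "http://xml.org/sax/features/namespaces";
  public static final String VALIDATION
   = "http://xml.org/sax/features/validation";

  // Returns true if the parser accepted the requested setting
  public static boolean trySetFeature(XMLReader parser, String name,
   boolean value) {
    try {
      parser.setFeature(name, value);
      return true;
    }
    catch (SAXException e) {
      // SAXNotRecognizedException and SAXNotSupportedException
      // are both subclasses of SAXException
      return false;
    }
  }

  // Describes which features the default parser accepted
  public static String report() {
    try {
      XMLReader parser = XMLReaderFactory.createXMLReader();
      return "namespaces=" + trySetFeature(parser, NAMESPACES, true)
       + ", validation=" + trySetFeature(parser, VALIDATION, true);
    }
    catch (SAXException e) {
      return "no parser: " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(report());
  }

}
```

Which features a given parser accepts varies, so the output depends on the parser installed.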

The SAX2 Process

Use the factory method XMLReaderFactory.createXMLReader() to retrieve a parser-specific implementation of the XMLReader interface
Your code registers a ContentHandler with the parser
An InputSource feeds the document into the parser
As the document is read, the parser calls back to the methods of the ContentHandler to tell it what it's seeing in the document.
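
The four steps above can be sketched in a few lines. This is an illustration only: the class name ElementCounter is hypothetical, it extends the DefaultHandler helper class so that only one callback has to be written, and it reads the document from a string rather than a URL.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class ElementCounter extends DefaultHandler {

  private int count = 0;

  // The parser calls this once for every start-tag it reads
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {
    count++;
  }

  // Returns the number of elements in the document,
  // or -1 if the document could not be parsed
  public static int countElements(String xml) {
    try {
      // 1. retrieve a parser
      XMLReader parser = XMLReaderFactory.createXMLReader();
      // 2. register the ContentHandler
      ElementCounter handler = new ElementCounter();
      parser.setContentHandler(handler);
      // 3. wrap the document in an InputSource
      InputSource source = new InputSource(new StringReader(xml));
      // 4. parse; the callbacks fire as the document is read
      parser.parse(source);
      return handler.count;
    }
    catch (SAXException e) {
      return -1;
    }
    catch (IOException e) {
      return -1;
    }
  }

  public static void main(String[] args) {
    // prints 3
    System.out.println(
     countElements("<players><player/><player/></players>"));
  }

}
```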

Making an XMLReader

The XMLReaderFactory.createXMLReader() method instantiates the XMLReader implementation class named by the org.xml.sax.driver system property:
```
try {
  XMLReader parser = XMLReaderFactory.createXMLReader();
} 
catch (SAXException e) {
  System.err.println(e);
}
```

The XMLReaderFactory.createXMLReader(String className) method instantiates the XMLReader implementation class named by its argument:

try {
  XMLReader parser 
   = XMLReaderFactory.createXMLReader(   
      "org.apache.xerces.parsers.SAXParser");
} 
catch (SAXException e) {
  System.err.println(e);
}

Or you can call the constructor of the parser-specific class directly (here the Xerces class org.apache.xerces.parsers.SAXParser):
```
XMLReader parser = new SAXParser();
```

Or all three:

    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    }
    catch (SAXException ex) {
      try {
        parser = XMLReaderFactory.createXMLReader(
         "org.apache.xerces.parsers.SAXParser");
      }
      catch (SAXException ex2) {
        parser = new SAXParser();
      }
    }

Parsing a Document with XMLReader

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.*;


public class SAX2Checker {

  public static void main(String[] args) {

    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    }
    catch (SAXException ex) {
      try {
        parser = XMLReaderFactory.createXMLReader(
         "org.apache.xerces.parsers.SAXParser");
      }
      catch (SAXException ex2) {
        System.out.println("Could not locate a parser. "
         + "Please set the org.xml.sax.driver property.");
        return;
      }
    }

    if (args.length == 0) {
      System.out.println("Usage: java SAX2Checker URL1 URL2...");
    }

    // start parsing...
    for (int i = 0; i < args.length; i++) {

      // command line should offer URIs or file names
      try {
        parser.parse(args[i]);
        // If there are no well-formedness errors
        // then no exception is thrown
        System.out.println(args[i] + " is well formed.");
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber()
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not check " + args[i]
         + " because of the IOException " + e);
      }

    }

  }

}

Sample Output from SAX2Checker

C:\>java SAX2Checker http://www.ibiblio.org/xml/
http://www.ibiblio.org/xml/ is not well formed.
The element type "dt" must be terminated by the 
matching end-tag "</dt>". 
at line 186, column 5

The ContentHandler interface

package org.xml.sax;


public interface ContentHandler {

    public void setDocumentLocator(Locator locator);
    
    public void startDocument() throws SAXException;
    
    public void endDocument() throws SAXException;
    
    public void startPrefixMapping(String prefix, String uri) 
     throws SAXException;

    public void endPrefixMapping(String prefix) throws SAXException;

    public void startElement(String namespaceURI, String localName,
     String qualifiedName, Attributes atts) throws SAXException;

    public void endElement(String namespaceURI, String localName,
     String qualifiedName) throws SAXException;

    public void characters(char[] text, int start, int length) 
     throws SAXException;

    public void ignorableWhitespace(char[] text, int start, int length)
     throws SAXException;

    public void processingInstruction(String target, String data)
     throws SAXException;

    public void skippedEntity(String name) throws SAXException;
     
}
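
Few programs need all eleven of these methods. One common approach is to extend org.xml.sax.helpers.DefaultHandler, which implements every method as a no-op, and override only what is needed. The sketch below (the class name TextExtractor is hypothetical) overrides just characters(). Because a parser is allowed to deliver one run of text in several characters() calls, the handler appends each piece to a buffer instead of treating any single call as the complete text.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class TextExtractor extends DefaultHandler {

  private StringBuffer text = new StringBuffer();

  // Only the slice from start to start+length belongs to this event;
  // the rest of the array may hold unrelated data
  public void characters(char[] ch, int start, int length) {
    text.append(ch, start, length);
  }

  // Returns all the text of the document with the tags removed,
  // or null if the document could not be parsed
  public static String extractText(String xml) {
    try {
      XMLReader parser = XMLReaderFactory.createXMLReader();
      TextExtractor handler = new TextExtractor();
      parser.setContentHandler(handler);
      parser.parse(new InputSource(new StringReader(xml)));
      return handler.text.toString();
    }
    catch (SAXException e) {
      return null;
    }
    catch (IOException e) {
      return null;
    }
  }

  public static void main(String[] args) {
    // prints Garret Anderson
    System.out.println(extractText("<player><first_name>Garret</first_name>"
     + " <surname>Anderson</surname></player>"));
  }

}
```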

SAX2 Event Reporter

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.*;

public class EventReporter implements ContentHandler {

  public void setDocumentLocator(Locator locator) {
    System.out.println("setDocumentLocator(" + locator + ")");
  }

  public void startDocument() throws SAXException {
    System.out.println("startDocument()");
  }

  public void endDocument() throws SAXException {
    System.out.println("endDocument()");
  }

  public void startElement(String namespaceURI, String localName, 
   String qualifiedName, Attributes atts)
   throws SAXException {
    namespaceURI = '"' + namespaceURI + '"';
    localName = '"' + localName + '"';
    qualifiedName = '"' + qualifiedName + '"';
    String attributeString = "{";
    for (int i = 0; i < atts.getLength(); i++) {
      attributeString += atts.getQName(i) + "=\"" 
       + atts.getValue(i) + "\"";
      if (i != atts.getLength()-1) attributeString += ", ";
    }
    attributeString += "}";
    System.out.println("startElement(" + namespaceURI + ", " 
     + localName + ", " + qualifiedName + ", " + attributeString + ")");
  }

  public void endElement(String namespaceURI, String localName, 
   String qualifiedName)
   throws SAXException {
    namespaceURI = '"' + namespaceURI + '"';
    localName = '"' + localName + '"';
    qualifiedName = '"' + qualifiedName + '"';
    System.out.println("endElement(" + namespaceURI + ", " 
     + localName + ", " + qualifiedName + ")");
  }

  public void characters(char[] text, int start, int length)
   throws SAXException {
    // use only the slice of the array that belongs to this event
    String textString = "[" + new String(text, start, length) + "]";
    System.out.println("characters(" + textString + ", " 
     + start + ", " +  length + ")");
  }

  public void ignorableWhitespace(char[] text, int start, int length)
   throws SAXException {
    System.out.println("ignorableWhitespace()");
  }

  public void processingInstruction(String target, String data)
   throws SAXException {
    System.out.println("processingInstruction(" + target + ", " 
     + data + ")");
  }

  public void startPrefixMapping(String prefix, String uri)
   throws SAXException {
    System.out.println("startPrefixMapping(\"" + prefix + "\", \"" 
     + uri + "\")");
  }

  public void endPrefixMapping(String prefix) throws SAXException {
    System.out.println("endPrefixMapping(\"" + prefix + "\")");
  }

  public void skippedEntity(String name) throws SAXException {
    System.out.println("skippedEntity(" + name + ")");
  }

  // Could easily have put main() method in a separate class
  public static void main(String[] args) {

    XMLReader parser;
    try {
     parser = XMLReaderFactory.createXMLReader();
    }
    catch (Exception e) {
      // fall back on Xerces parser by name
      try {
        parser = XMLReaderFactory.createXMLReader(
         "org.apache.xerces.parsers.SAXParser");
      }
      catch (Exception ee) {
        System.err.println("Couldn't locate a SAX parser");
        return;
      }
    }


    if (args.length == 0) {
      System.out.println(
       "Usage: java EventReporter URL1 URL2...");
    }

    // Install the content handler
    parser.setContentHandler(new EventReporter());

    // start parsing...
    for (int i = 0; i < args.length; i++) {

      // command line should offer URIs or file names
      try {
        parser.parse(args[i]);
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber()
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not report on " + args[i]
         + " because of the IOException " + e);
      }

    }

  }

}

Event Reporter Output

View in Browser

A Sample Application

UserLand's RSS-based list of Web logs at http://static.userland.com/weblogMonitor/logs.xml:

<?xml version="1.0"?>
<!-- <!DOCTYPE foo SYSTEM "http://msdn.microsoft.com/xml/general/htmlentities.dtd"> -->
<weblogs>
	<log>
		<name>MozillaZine</name>
		<url>http://www.mozillazine.org</url>
		<changesUrl>http://www.mozillazine.org/contents.rdf</changesUrl>
		<ownerName>Jason Kersey</ownerName>
		<ownerEmail>kerz@en.com</ownerEmail>
		<description>THE source for news on the Mozilla Organization.  DevChats, Reviews, Chats, Builds, Demos, Screenshots, and more.</description>
		<imageUrl></imageUrl>
		<adImageUrl>http://static.userland.com/weblogMonitor/ads/kerz@en.com.gif</adImageUrl>
		</log>
	<log>
		<name>SalonHerringWiredFool</name>
		<url>http://www.salonherringwiredfool.com/</url>
		<ownerName>Some Random Herring</ownerName>
		<ownerEmail>salonfool@wiredherring.com</ownerEmail>
		<description></description>
		</log>
	<log>
		<name>Scripting News</name>
		<url>http://www.scripting.com/</url>
		<ownerName>Dave Winer</ownerName>
		<ownerEmail>dave@userland.com</ownerEmail>
		<description>News and commentary from the cross-platform scripting community.</description>
		<imageUrl>http://www.scripting.com/gifs/tinyScriptingNews.gif</imageUrl>
		<adImageUrl>http://static.userland.com/weblogMonitor/ads/dave@userland.com.gif</adImageUrl>
		</log>
	<log>
		<name>SlashDot.Org</name>
		<url>http://www.slashdot.org/</url>
		<ownerName>Simply a friend</ownerName>
		<ownerEmail>afriendofweblogs@weblogs.com</ownerEmail>
		<description>News for Nerds, Stuff that Matters.</description>
		</log>
	</weblogs>

Full list

Goal: Return a list of all the URLs in this list as java.net.URL objects

Design Decisions

Should we return an array, an Enumeration, a List, or what?
Perhaps we should use multiple threads?

SAX Design

We do not know how many URLs there will be when we start parsing so let's use a Vector
Single threaded for simplicity but a real program would use multiple threads
- One to load and parse the data
- Another thread (probably the main thread) to serve the data
- Early data could be provided before the entire document had been read
The character data of each url element needs to be stored. Everything else can be ignored.
A startElement() with the name url indicates that we need to start storing this data.
An endElement() with the name url indicates that we need to stop storing this data, convert it to a URL and put it in the Vector
Should we hide the XML parsing inside a non-public class to avoid accidentally calling the methods from unexpected places or threads?

User Interface Class

import org.xml.sax.*;
import org.xml.sax.helpers.XMLReaderFactory;
import java.util.*;
import java.io.*;


public class WeblogsSAX {
     
  public static List listChannels() 
   throws IOException, SAXException {
    return listChannels(
     "http://static.userland.com/weblogMonitor/logs.xml"); 
  }
  
  public static List listChannels(String uri) 
   throws IOException, SAXException {
    
    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    }
    catch (SAXException ex) {
      parser = XMLReaderFactory.createXMLReader(
       "org.apache.xerces.parsers.SAXParser"
      );
    }
    Vector urls = new Vector(1000);
    ContentHandler handler = new URIGrabber(urls);
    parser.setContentHandler(handler);
    parser.parse(uri);
    return urls;
    
  }
  
  public static void main(String[] args) {
   
    try {
      List urls;
      if (args.length > 0) urls = listChannels(args[0]);
      else urls = listChannels();
      Iterator iterator = urls.iterator();
      while (iterator.hasNext()) {
        System.out.println(iterator.next()); 
      }
    }
    catch (IOException e) {
      System.err.println(e); 
    }
    catch (SAXParseException e) {
      System.err.println(e); 
      System.err.println("at line " + e.getLineNumber() 
       + ", column " + e.getColumnNumber()); 
    }
    catch (SAXException e) {
      System.err.println(e); 
    }
    catch (/* Unexpected */ Exception e) {
      e.printStackTrace(); 
    }
    
  }
  
}

ContentHandler Class

import org.xml.sax.*;
import java.net.*;
import java.util.Vector;

             // conflicts with java.net.ContentHandler
class URIGrabber implements org.xml.sax.ContentHandler {

  private Vector urls;

  URIGrabber(Vector urls) {
    this.urls = urls;
  }

  // do nothing methods
  public void setDocumentLocator(Locator locator) {}
  public void startDocument() throws SAXException {}
  public void endDocument() throws SAXException {}
  public void startPrefixMapping(String prefix, String uri)
   throws SAXException {}
  public void endPrefixMapping(String prefix) throws SAXException {}
  public void skippedEntity(String name) throws SAXException {}
  public void ignorableWhitespace(char[] text, int start, int length)
   throws SAXException {}
  public void processingInstruction(String target, String data)
   throws SAXException {}


  // Remember, there's no guarantee all the text of the
  // url element will be returned in a single call to characters
  private StringBuffer urlBuffer;
  private boolean collecting = false;

  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) throws SAXException {

    if (qualifiedName.equals("url")) {
      collecting = true;
      urlBuffer = new StringBuffer();
    }

  }

  public void characters(char[] text, int start, int length)
   throws SAXException {

    if (collecting) {
      urlBuffer.append(text, start, length);
    }

  }

  public void endElement(String namespaceURI, String localName,
   String qualifiedName) throws SAXException {

    if (qualifiedName.equals("url")) {
      collecting = false;
      String url = urlBuffer.toString();
      try {
        urls.addElement(new URL(url));
      }
      catch (MalformedURLException e) {
        // skip this url
      }
    }

  }

}

Weblogs Output

% java WeblogsSAX shortlogs.xml
http://www.mozillazine.org
http://www.salonherringwiredfool.com/
http://www.slashdot.org/

Features and Properties

SAX2 parsers--that is XMLReaders--are configured by features and properties
Feature and property names are absolute URIs
A feature is boolean, on or off, true or false; a property is an object

public boolean getFeature(String name)
 throws SAXNotRecognizedException, SAXNotSupportedException
public void setFeature(String name, boolean value)
 throws SAXNotRecognizedException, SAXNotSupportedException
public Object getProperty(String name)
 throws SAXNotRecognizedException, SAXNotSupportedException
public void setProperty(String name, Object value)
 throws SAXNotRecognizedException, SAXNotSupportedException

Features can be read-only or read/write.
Some features may be modifiable while parsing; others only before parsing starts

For example,

try {
  if (xmlReader.getFeature("http://xml.org/sax/features/validation")) {
    System.out.println("Parser is validating.");
  } 
  else {
    System.out.println("Parser is not validating.");
  }
} 
catch (SAXException e) {
  System.out.println("Do not know if parser validates");
}

Feature/Property SAXExceptions

SAXNotRecognizedException: the parser does not recognize a requested feature or property
SAXNotSupportedException: the parser does not support a requested feature/property or the feature/property is read-only
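
The two exceptions can be told apart by catching them separately. What follows is only a minimal sketch: the class name FeatureProbe, the helper trySetFeature(), and the feature URI http://example.com/no-such-feature are made up for illustration and are not part of SAX.

```java
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical helper class, not part of SAX
public class FeatureProbe {

  // Tries to set a feature and reports which of the three outcomes occurred
  public static String trySetFeature(XMLReader parser, String name,
   boolean value) {
    try {
      parser.setFeature(name, value);
      return "set";
    }
    catch (SAXNotRecognizedException e) {
      // the parser has never heard of this feature name
      return "not recognized";
    }
    catch (SAXNotSupportedException e) {
      // the parser knows the name but cannot set this value right now
      return "not supported";
    }
  }

  public static void main(String[] args) throws Exception {
    XMLReader parser = XMLReaderFactory.createXMLReader();
    // a required feature every SAX2 parser recognizes
    System.out.println(trySetFeature(parser,
     "http://xml.org/sax/features/namespaces", true));
    // an invented feature name (assumption: no parser defines it)
    System.out.println(trySetFeature(parser,
     "http://example.com/no-such-feature", true));
  }

}
```

Both exceptions are subclasses of SAXException, so a single catch of SAXException also works when the distinction does not matter.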

Required Features

http://xml.org/sax/features/namespaces
- If true, then perform namespace processing.
- If false, then, at parser option, do not perform namespace processing
- access: (parsing) read-only; (not parsing) read/write
- true by default
http://xml.org/sax/features/namespace-prefixes
- If true, then report the original prefixed names and attributes used for namespace declarations.
- If false, then do not report attributes used for namespace declarations, and optionally do not report original prefixed names.
- false by default
- access: (parsing) read-only; (not parsing) read/write
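
One way to see what namespace-prefixes changes is to count the attributes a parser reports with the feature off and then on. This is a sketch under assumptions: the class NamespacePrefixDemo, the method countAttributes(), and the example.com namespace URI are invented for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical demonstration class, not part of SAX
public class NamespacePrefixDemo extends DefaultHandler {

  private int attributeCount = 0;

  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) throws SAXException {
    // add up every attribute the parser chooses to report
    attributeCount += atts.getLength();
  }

  // Counts the attributes reported for the document with the
  // namespace-prefixes feature set to the given value
  public static int countAttributes(String xml, boolean reportPrefixes)
   throws SAXException, IOException {
    XMLReader parser = XMLReaderFactory.createXMLReader();
    parser.setFeature("http://xml.org/sax/features/namespaces", true);
    parser.setFeature(
     "http://xml.org/sax/features/namespace-prefixes", reportPrefixes);
    NamespacePrefixDemo handler = new NamespacePrefixDemo();
    parser.setContentHandler(handler);
    parser.parse(new InputSource(new StringReader(xml)));
    return handler.attributeCount;
  }

  public static void main(String[] args) throws Exception {
    // one ordinary attribute plus one namespace declaration
    String xml = "<a xmlns:x='http://example.com/x' x:b='1'/>";
    System.out.println("namespace-prefixes off: "
     + countAttributes(xml, false) + " attribute(s)");
    System.out.println("namespace-prefixes on:  "
     + countAttributes(xml, true) + " attribute(s)");
  }

}
```

With the feature off, the xmlns:x declaration is not reported as an attribute; with it on, the declaration appears in the Attributes object alongside x:b.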

Core Features

http://xml.org/sax/features/namespaces
http://xml.org/sax/features/namespace-prefixes
http://xml.org/sax/features/string-interning
- If true, then all element names, prefixes, attribute names, Namespace URIs, and local names are internalized using java.lang.String.intern().
- If false, then names are not necessarily internalized.
- access: (parsing) read-only; (not parsing) read/write
http://xml.org/sax/features/validation
- If true, then report all validation errors
- If false, then do not report validation errors.
- access: (parsing) read-only; (not parsing) read/write
http://xml.org/sax/features/external-general-entities
- If true, then include all external general (text) entities.
- false: Do not include external general entities.
- access: (parsing) read-only; (not parsing) read/write
http://xml.org/sax/features/external-parameter-entities
- If true, then include all external parameter entities, including the external DTD subset.
- false: Do not include any external parameter entities, even the external DTD subset.
- access: (parsing) read-only; (not parsing) read/write

adapted from SAX2 documentation by David Megginson

Turning on Validation

Not all parsers are validating but Xerces-J is.
Validity errors are not fatal; therefore they do not throw SAXParseExceptions
Must install an ErrorHandler as well as a ContentHandler
Must set the feature http://xml.org/sax/features/validation

Three Levels of Errors

In increasing order of severity:

A warning; e.g. ambiguous content model, a constraint for compatibility
A recoverable error: typically a validity error
A fatal error: typically a well-formedness error
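
The three levels map onto the three callbacks of ErrorHandler. The following is a rough sketch that simply counts how often each callback fires; the class ErrorLevelCounter and its check() method are hypothetical names, and it assumes the installed parser can validate.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical class that tallies errors by severity
public class ErrorLevelCounter implements ErrorHandler {

  private int warnings = 0;
  private int errors = 0;
  private int fatalErrors = 0;

  public void warning(SAXParseException ex) { warnings++; }
  public void error(SAXParseException ex) { errors++; }
  public void fatalError(SAXParseException ex) { fatalErrors++; }

  public int getWarnings() { return warnings; }
  public int getErrors() { return errors; }
  public int getFatalErrors() { return fatalErrors; }

  // Parses the document with validation requested and returns the tallies
  public static ErrorLevelCounter check(String xml)
   throws SAXException, IOException {
    XMLReader parser = XMLReaderFactory.createXMLReader();
    parser.setFeature("http://xml.org/sax/features/validation", true);
    ErrorLevelCounter counter = new ErrorLevelCounter();
    parser.setErrorHandler(counter);
    try {
      parser.parse(new InputSource(new StringReader(xml)));
    }
    catch (SAXParseException ex) {
      // a parser may give up after reporting a fatal error;
      // the tally has already been updated by fatalError()
    }
    return counter;
  }

  public static void main(String[] args) throws Exception {
    // well-formed but invalid: a is declared EMPTY yet contains b
    String invalid = "<!DOCTYPE a [<!ELEMENT a EMPTY>]><a><b/></a>";
    // not well-formed: b is never closed
    String malformed = "<a><b></a>";
    ErrorLevelCounter first = check(invalid);
    System.out.println("invalid:   " + first.getErrors()
     + " error(s), " + first.getFatalErrors() + " fatal");
    ErrorLevelCounter second = check(malformed);
    System.out.println("malformed: " + second.getErrors()
     + " error(s), " + second.getFatalErrors() + " fatal");
  }

}
```

Because error() returns normally here, the parser keeps going after a validity error. A handler that wants to stop at the first validity error can rethrow the exception it is given.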

The ErrorHandler interface

package org.xml.sax;

public interface ErrorHandler {
 
  public void warning(SAXParseException exception)
   throws SAXException;

  public void error(SAXParseException exception)
   throws SAXException;
    
  public void fatalError(SAXParseException exception)
   throws SAXException;
    
}

An ErrorHandler for Reporting Validity Errors

import org.xml.sax.*;
import java.io.*;


public class ValidityErrorReporter implements ErrorHandler {
 
  Writer out;
 
  public ValidityErrorReporter(Writer out) {
    this.out = out;
  }
 
  public ValidityErrorReporter() {
    this(new OutputStreamWriter(System.out));
  }
 
  public void warning(SAXParseException ex)
   throws SAXException {

    try {
      out.write(ex.getMessage() + "\r\n");
      out.write(" at line " + ex.getLineNumber() + ", column " 
       + ex.getColumnNumber() + "\r\n");
      out.flush();
    }
    catch (IOException e) {
      throw new SAXException(e); 
    }
    
  }

  public void error(SAXParseException ex)
   throws SAXException {
    
    try {
      out.write(ex.getMessage() + "\r\n");
      out.write(" at line " + ex.getLineNumber() + ", column " 
       + ex.getColumnNumber() + "\r\n");
      out.flush();
    }
    catch (IOException e) {
      throw new SAXException(e); 
    }
    
  }
    
  public void fatalError(SAXParseException ex)
   throws SAXException {
    
    try {
      out.write(ex.getMessage() + "\r\n");
      out.write(" at line " + ex.getLineNumber() + ", column " 
       + ex.getColumnNumber() + "\r\n");
      out.flush();
    }
    catch (IOException e) {
      throw new SAXException(e); 
    }
    
  }
    
}

Validating

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.*;


public class SAX2Validator {

  public static void main(String[] args) {
    
    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    }
    catch (SAXException ex) {
      try {
        parser = XMLReaderFactory.createXMLReader(
         "org.apache.xerces.parsers.SAXParser"
        );
      }
      catch (SAXException ex2) {
        System.err.println("Could not locate a SAX2 Parser");
        return;
      }
    }
     
    // turn on validation
    try {
      parser.setFeature(
       "http://xml.org/sax/features/validation", true);
      parser.setErrorHandler(new ValidityErrorReporter());
    }
    catch (SAXNotRecognizedException e) {
      System.err.println(
       "Installed XML parser cannot validate;"
       + " checking for well-formedness instead...");
    } 
    catch (SAXNotSupportedException e) {
      System.err.println(
       "Cannot turn on validation here; "
       + "checking for well-formedness instead...");
    } 
     
    if (args.length == 0) {
      System.out.println("Usage: java SAX2Validator URL1 URL2..."); 
    } 
      
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      // command line should offer URIs or file names
      try {
        parser.parse(args[i]);
        // If there are no well-formedness errors, 
        // then no exception is thrown
        System.out.println(args[i] + " is well formed.");
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber() 
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not check " + args[i] 
         + " because of the IOException " + e);
      }
      
    }  
  
  }

}

Core Properties

http://xml.org/sax/properties/lexical-handler
- data type: org.xml.sax.ext.LexicalHandler
- description: An optional extension handler for items like comments that are not part of the information set and may be omitted.
- access: read/write
http://xml.org/sax/properties/declaration-handler
- data type: org.xml.sax.ext.DeclHandler
- description: An optional extension handler for DTD declarations such as ELEMENT, ATTLIST, and parsed entity declarations (but not notations and unparsed entities).
- access: read/write
http://xml.org/sax/properties/dom-node
- data type: org.w3c.dom.Node
- description: When parsing, the current DOM node being visited if this is a DOM iterator; when not parsing, the root DOM node for iteration.
- access: (parsing) read-only; (not parsing) read/write
http://xml.org/sax/properties/xml-string
- data type: java.lang.String
- description: The literal string of characters that was the source for the current event.
- access: read-only

adapted from SAX2 documentation by David Megginson

Nonstandard Features in Xerces

http://apache.org/xml/features/validation/dynamic
- True: The parser will validate the document if a DTD is specified in a DOCTYPE declaration or using the appropriate schema attributes like xsi:noNamespaceSchemaLocation.
- False: Validation is determined by the state of the http://xml.org/sax/features/validation feature.
- Default is false
http://apache.org/xml/features/validation/warn-on-duplicate-attdef
- True: Warn on duplicate attribute declaration.
- False: Do not warn on duplicate attribute declaration.
- Default: true
http://apache.org/xml/features/validation/warn-on-undeclared-elemdef
- True: Warn if element referenced in content model is not declared.
- False: Do not warn if element referenced in content model is not declared.
- Default: true
http://apache.org/xml/features/allow-java-encodings
- True: Allow Java encoding names like 8859_1 in XML and text declarations.
- False: Do not allow Java encoding names in XML and text declarations.
- Default: false
http://apache.org/xml/features/continue-after-fatal-error
- True: Continue after fatal error.
- False: Stops parse on first fatal error.
- Default: false

Nonstandard Properties in Xerces

http://apache.org/xml/properties/schema/external-schemaLocation
http://apache.org/xml/properties/schema/external-noNamespaceSchemaLocation

Properties for Extension Handlers

Extension handlers are non-required interfaces in the org.xml.sax.ext package.
To set the LexicalHandler for an XML reader, set the property http://xml.org/sax/properties/lexical-handler.
To set the DeclHandler for an XML reader, set the property http://xml.org/sax/properties/declaration-handler.
If the reader does not support the requested property, it will throw a SAXNotRecognizedException or a SAXNotSupportedException.

Handling Attributes in SAX2

The startElement() method in ContentHandler receives as an argument an Attributes object containing all attributes on that tag.
public void startElement(String namespaceURI, String localName, String qualifiedName, Attributes atts) throws SAXException

The Attributes interface:

package org.xml.sax;

public interface Attributes {

  public int    getLength();

  /* Look up an attribute's Namespace URI by index.*/
  public String getURI(int index);
  public String getLocalName(int index);
  public String getQName(int index);
  public String getType(int index);
  public String getValue(int index);
  public int    getIndex(String uri, String localPart);
  public int    getIndex(String qualifiedName);
  public String getType(String uri, String localName);
  public String getType(String qualifiedName);
  public String getValue(String uri, String localName);
  public String getValue(String qualifiedName);

}
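
A short sketch of walking an Attributes object by index may help here. The class AttributeLister and its listAttributes() method are hypothetical names. The sketch uses local names rather than qualified names because local names are always available when namespace processing is on.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical class that records every attribute in a document
public class AttributeLister extends DefaultHandler {

  // keys look like "element/@attribute"; values are attribute values
  private final Map<String, String> found = new TreeMap<String, String>();

  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) throws SAXException {
    // attributes are numbered from 0 to getLength()-1;
    // SAX makes no promise about their order
    for (int i = 0; i < atts.getLength(); i++) {
      found.put(localName + "/@" + atts.getLocalName(i),
       atts.getValue(i));
    }
  }

  public static Map<String, String> listAttributes(String xml)
   throws SAXException, IOException {
    XMLReader parser = XMLReaderFactory.createXMLReader();
    AttributeLister lister = new AttributeLister();
    parser.setContentHandler(lister);
    parser.parse(new InputSource(new StringReader(xml)));
    return lister.found;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<doc id='d1'><p class='intro' lang='en'/></doc>";
    System.out.println(listAttributes(xml));
  }

}
```

The Attributes object is only guaranteed to be valid during the startElement() call, which is why the values are copied into the map rather than keeping a reference to atts.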

Attributes Example

import org.xml.sax.*;
import java.io.*;
import java.util.*;
import org.xml.sax.helpers.*;


public class XLinkSpider extends DefaultHandler {

  public static Enumeration listURIs(String systemId) 
   throws SAXException, IOException {
    
    // set up the parser 
    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    } 
    catch (SAXException e) {
      try {
        parser = XMLReaderFactory.createXMLReader(
         "org.apache.xerces.parsers.SAXParser");
      }
      catch (SAXException e2) {
        System.err.println("Error: could not locate a parser.");
        return null;
      }
    }
      
    // Install the Content Handler   
    XLinkSpider spider = new XLinkSpider();   
    parser.setContentHandler(spider);
    parser.parse(systemId);
    return spider.uris.elements();
      
  }
  
  private Vector uris = new Vector();

  public void startElement(String namespaceURI, String localName, 
   String rawName, Attributes atts) throws SAXException {
    
     String uri = atts.getValue(
      "http://www.w3.org/1999/xlink", "href");
     if (uri != null) uris.addElement(uri);
    
  }
  

  public static void main(String[] args) {
    
    if (args.length == 0) {
      System.out.println("Usage: java XLinkSpider URL1 URL2..."); 
    } 
      
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      try {
        Enumeration uris = listURIs(args[i]);
        while (uris.hasMoreElements()) {
          String s = (String) uris.nextElement();
          System.out.println(s);
        }
      }
      catch (Exception e) {
        System.err.println(e);
        e.printStackTrace(); 
      }
      
    } // end for
  
  } // end main

} // end XLinkSpider

Resolving Entities

The EntityResolver allows you to substitute your own URI lookup scheme for external entities
Especially useful for entities that use URL and URI schemes not supported by Java's protocol handlers; e.g. jdbc: or isbn:

The EntityResolver interface:

package org.xml.sax;

import java.io.IOException;

public interface EntityResolver {  

  public InputSource resolveEntity (String publicId,
   String systemId) throws SAXException, IOException;
    
}

EntityResolver Example

import org.xml.sax.*;

public class RSSResolver implements EntityResolver {

  public InputSource resolveEntity(String publicId, String systemId) {

    // publicId may be null, so compare from the constant side
    if ( "-//Netscape Communications//DTD RSS 0.91//EN".equals(publicId)
     || "http://my.netscape.com/publish/formats/rss-0.91.dtd"
         .equals(systemId)) {
      return new InputSource(
       "http://www.ibiblio.org/xml/dtds/rss.dtd");
    } 
    else {
      // use the default behaviour
      return null;
    }
    
  }
   
}
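
A resolver only takes effect once it is installed with setEntityResolver(). The sketch below shows one possible arrangement that needs no network access: it serves an external entity from a string held in memory. The class InMemoryResolver, the method parseText(), and the example.com system ID are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical resolver that answers for exactly one system ID
public class InMemoryResolver implements EntityResolver {

  private final String systemId;
  private final String replacement;

  public InMemoryResolver(String systemId, String replacement) {
    this.systemId = systemId;
    this.replacement = replacement;
  }

  public InputSource resolveEntity(String publicId, String systemId) {
    if (this.systemId.equals(systemId)) {
      InputSource source = new InputSource(new StringReader(replacement));
      source.setSystemId(systemId);
      return source;
    }
    return null; // fall back on the parser's default behaviour
  }

  // Parses the document with the resolver installed and
  // returns all the character data the parser reported
  public static String parseText(String xml, EntityResolver resolver)
   throws SAXException, IOException {
    final StringBuilder text = new StringBuilder();
    XMLReader parser = XMLReaderFactory.createXMLReader();
    parser.setEntityResolver(resolver);
    parser.setContentHandler(new DefaultHandler() {
      public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
      }
    });
    parser.parse(new InputSource(new StringReader(xml)));
    return text.toString();
  }

  public static void main(String[] args) throws Exception {
    String xml = "<!DOCTYPE a [<!ENTITY ext SYSTEM "
     + "'http://example.com/ext.txt'>]><a>&ext;</a>";
    EntityResolver resolver
     = new InMemoryResolver("http://example.com/ext.txt", "hello");
    System.out.println(parseText(xml, resolver));
  }

}
```

This assumes the parser includes external general entities, which is what the external-general-entities feature controls.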

Handling DTDs

The DTDHandler interface covers those aspects of DTDs a non-validating parser may care about and that are not handled by other interfaces:
- Notation Declarations
- Unparsed Entity Declarations
Attribute defaults are handled transparently by startElement() and the Attributes interface
Parsed entities are handled transparently by ContentHandler unless you install an EntityResolver

The DTDHandler interface:

package org.xml.sax;

public interface DTDHandler {
       
  public void notationDecl(String name, String publicId, 
   String systemId) throws SAXException;
 
  public void unparsedEntityDecl(String name, String publicId, 
   String systemId, String notationName) throws SAXException;
    
}
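
A DTDHandler is installed with setDTDHandler(). As a minimal sketch, the hypothetical class NotationCollector below just remembers the names it is told about; the collect() method and the sample DTD are likewise invented for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import org.xml.sax.DTDHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical handler that records notation and unparsed entity names
public class NotationCollector implements DTDHandler {

  private final Set<String> notationNames = new TreeSet<String>();
  // maps each unparsed entity name to the name of its notation
  private final Map<String, String> entityNotations
   = new TreeMap<String, String>();

  public void notationDecl(String name, String publicId,
   String systemId) {
    notationNames.add(name);
  }

  public void unparsedEntityDecl(String name, String publicId,
   String systemId, String notationName) {
    entityNotations.put(name, notationName);
  }

  public Set<String> getNotationNames() { return notationNames; }
  public Map<String, String> getEntityNotations() { return entityNotations; }

  public static NotationCollector collect(String xml)
   throws SAXException, IOException {
    XMLReader parser = XMLReaderFactory.createXMLReader();
    NotationCollector collector = new NotationCollector();
    parser.setDTDHandler(collector);
    parser.parse(new InputSource(new StringReader(xml)));
    return collector;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<!DOCTYPE a ["
     + "<!NOTATION txt SYSTEM 'text/plain'>"
     + "<!ENTITY note SYSTEM 'note.txt' NDATA txt>"
     + "<!ELEMENT a EMPTY>]><a/>";
    NotationCollector collector = collect(xml);
    System.out.println("Notations: " + collector.getNotationNames());
    System.out.println("Unparsed entities: "
     + collector.getEntityNotations());
  }

}
```

The sketch deliberately ignores the system IDs. A parser may resolve them to absolute URLs before passing them along, so their exact form depends on where the document was loaded from.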

DTDHandler Example

Program to map unparsed entities with notation "text/plain" to CDATA sections
AttributeHandler will have to make actual replacements
Will finish with XMLFilter

TextEntityReplacer

import org.xml.sax.*;
import java.util.*;
import java.net.*;
import java.io.*;


public class TextEntityReplacer implements DTDHandler {

  /* This class stores the notation and entity declarations 
     for a single document. It is not designed to be reused
     for multiple parses, though that would be a straightforward
     extension. The public and system IDs of the document
     being parsed are set in the constructor.    
  */ 
  
  private URL systemID;
  private String publicID;
  
  public TextEntityReplacer(String publicID, String systemID) 
   throws MalformedURLException {
    System.err.println("created");
    this.publicID = publicID;
    this.systemID = new URL(systemID);
  }

  // store all notations in a hashtable. We'll need them later
  private Hashtable notations = new Hashtable();

  // for the DTDHandler interface
  public void notationDecl(String name, String publicID, 
   String systemID)
   throws SAXException {
    
    Notation n = new Notation(name, publicID, systemID);
    notations.put(name, n);
    
  }
  
  private class Notation {
    
    String name;
    String publicID;
    String systemID;
    
    Notation(String name, String publicID, String systemID) {
      this.name = name;
      this.publicID = publicID;
      this.systemID = systemID;
    } 
    
  }
 
   
  // store all unparsed entities in a hashtable. We'll need them later
  private Hashtable unparsedEntities = new Hashtable();

  // for the DTDHandler interface
  public void unparsedEntityDecl(String name, String publicID, 
   String systemID, String notationName) throws SAXException {
    
    UnparsedEntity e = new UnparsedEntity(name, publicID, 
     systemID, notationName);
    unparsedEntities.put(name, e);
    
  }    

  private class UnparsedEntity {
    
    String name;
    String publicID;
    String systemID;
    String notationName;
    
    UnparsedEntity(String name, String publicID, 
     String systemID, String notationName) {
      this.name = name;
      this.notationName = notationName;
      this.publicID = publicID;
      this.systemID = systemID;
    } 
    
  }


  public boolean isText(String notationName) {
    
    Object o = notations.get(notationName);
    if (o == null) return false;
    Notation n = (Notation) o;
    // a notation declared with only a public ID has no system ID
    return n.systemID != null && n.systemID.startsWith("text/");
    
  }
  
  public String getText(String entityName) throws IOException {
    
    Object o = unparsedEntities.get(entityName);
    if (o == null) return "";
    UnparsedEntity entity = (UnparsedEntity) o;
    if (!isText(entity.notationName)) {
      return " binary data "; // could throw an exception instead
    }
    
    URL source;
    try {
      source = new URL(systemID, entity.systemID);     
    }
    catch (Exception e) {
      return " unresolvable entity "; // could throw an exception instead
    }
    
    // I'm not really handling character encodings here. 
    // A more detailed look at the MIME media type would allow that.
    Reader in = new BufferedReader(
      new InputStreamReader(source.openStream())
    );
    StringBuffer result = new StringBuffer();
    int c;
    while ((c = in.read()) != -1) {
      // Is this necessary or will the parser escape the string automatically?
   /*   switch (c) {
        case '<': 
          result.append("&lt;");
          break;
        case '>': 
          result.append("&gt;");
          break;
        case '"': 
          result.append("&quot;");
          break;
        case '\'': 
          result.append("&apos;");
          break;
        case '&': 
          result.append("&amp;");
          break;
        default:
          result.append((char) c); 
      }*/
      result.append((char) c);
    }
    
    return result.toString();
    
  }

}

Handling Declarations

The optional DeclHandler interface covers those aspects of DTDs only a validating parser cares about:
- Element declarations
- Attribute declarations
- Internal entity declarations
- External entity declarations
An optional extension that not all parsers (particularly non-validating parsers) support
To set the DeclHandler for a parser, set the "http://xml.org/sax/properties/declaration-handler" property. A SAXNotRecognizedException or SAXNotSupportedException will be thrown if the parser doesn't support DeclHandler

The DeclHandler interface:

package org.xml.sax.ext;

import org.xml.sax.SAXException;


public interface DeclHandler {

  public void elementDecl(String name, String model)
   throws SAXException;

  public void attributeDecl(String elementName, String attributeName, 
   String type, String defaultValue, String value) 
   throws SAXException;

  public void internalEntityDecl(String name, String value)
   throws SAXException;

  public void externalEntityDecl(String name, String publicId,
   String systemId) throws SAXException;

}
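
As a sketch of how a DeclHandler might be installed and used, the hypothetical class DeclarationLister below records one line per declaration it receives. The class name, the listDeclarations() method, and the sample DTD are assumptions for illustration, and the sketch presumes the installed parser supports the declaration-handler property.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DeclHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical handler that lists the declarations in a DTD
public class DeclarationLister implements DeclHandler {

  private final List<String> declarations = new ArrayList<String>();

  public void elementDecl(String name, String model) {
    declarations.add("ELEMENT " + name);
  }

  public void attributeDecl(String elementName, String attributeName,
   String type, String defaultValue, String value) {
    declarations.add("ATTLIST " + elementName + " " + attributeName);
  }

  public void internalEntityDecl(String name, String value) {
    declarations.add("ENTITY " + name + " = " + value);
  }

  public void externalEntityDecl(String name, String publicId,
   String systemId) {
    declarations.add("ENTITY " + name + " (external)");
  }

  public static List<String> listDeclarations(String xml)
   throws SAXException, IOException {
    XMLReader parser = XMLReaderFactory.createXMLReader();
    DeclarationLister lister = new DeclarationLister();
    // throws SAXNotRecognizedException or SAXNotSupportedException
    // if the parser does not report declarations
    parser.setProperty(
     "http://xml.org/sax/properties/declaration-handler", lister);
    parser.parse(new InputSource(new StringReader(xml)));
    return lister.declarations;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<!DOCTYPE a ["
     + "<!ELEMENT a EMPTY>"
     + "<!ATTLIST a id CDATA #IMPLIED>"
     + "<!ENTITY e 'text'>]><a/>";
    List<String> found = listDeclarations(xml);
    for (int i = 0; i < found.size(); i++) {
      System.out.println(found.get(i));
    }
  }

}
```

Validation does not have to be turned on for these callbacks to arrive; the parser only has to read the DTD.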

Handling Lexical Events

The LexicalHandler interface reports:
- Comments
- CDATA sections
- Document type declaration
- Entities
An optional extension that not all parsers support
To set the LexicalHandler for a parser, set the "http://xml.org/sax/properties/lexical-handler" property. A SAXNotRecognizedException or SAXNotSupportedException will be thrown if the parser doesn't report lexical events

The LexicalHandler interface

package org.xml.sax.ext;

import org.xml.sax.SAXException;


public interface LexicalHandler {

  public void startDTD(String name, String publicId, String systemId)
   throws SAXException;
  public void endDTD() throws SAXException;
  public void startEntity(String name) throws SAXException;
  public void endEntity(String name) throws SAXException;
  public void startCDATA() throws SAXException;
  public void endCDATA() throws SAXException;
  public void comment (char[] text, int start, int length) 
   throws SAXException;

}

LexicalHandler Example

import org.xml.sax.*;
import org.xml.sax.ext.*;
import org.xml.sax.helpers.*;
import java.io.IOException;


public class SAXCommentReader implements LexicalHandler {

  public void startDTD(String name, String publicId, String systemId)
   throws SAXException {}
  public void endDTD() throws SAXException {}
  public void startEntity(String name) throws SAXException {}
  public void endEntity(String name) throws SAXException {}
  public void startCDATA() throws SAXException {}
  public void endCDATA() throws SAXException {}

  public void comment (char[] text, int start, int length)
   throws SAXException {

    String comment = new String(text, start, length);
    System.out.println(comment);

  }

  public static void main(String[] args) {

    // set up the parser
    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    }
    catch (SAXException e) {
      try {
        parser = XMLReaderFactory.createXMLReader(
         "org.apache.xerces.parsers.SAXParser");
      }
      catch (SAXException e2) {
        System.err.println("Error: could not locate a parser.");
        return;
      }
    }

    // turn on comment handling
    try {
      parser.setProperty(
       "http://xml.org/sax/properties/lexical-handler",
       new SAXCommentReader()
      );
    }
    catch (SAXNotRecognizedException e) {
      System.err.println(
       "Installed XML parser does not provide lexical events...");
      return;
    }
    catch (SAXNotSupportedException e) {
      System.err.println(
       "Cannot turn on comment processing here");
      return;
    }

    if (args.length == 0) {
      System.out.println("Usage: java SAXCommentReader URL1 URL2...");
    }

    // start parsing...
    for (int i = 0; i < args.length; i++) {

      try {
        parser.parse(args[i]);
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber()
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not check " + args[i]
         + " because of the IOException " + e);
      }

    }

  }

}

SAXCommentReader Output

C:\EXAMPLES>java SAXCommentReader hotcop.xml
 This should be a four digit year like "1999",
     not a two-digit year like "99"
 The publisher is actually Polygram but I needed
       an example of a general entity reference.
 You can tell what album I was
     listening to when I wrote this example

Or try http://www.w3.org/TR/2000/REC-xml-20001006.xml

The Locator interface

Tells the callback class where in the document (line number, column number) a particular event took place
Optional but recommended
Parsers give the callback class a Locator by passing it to the setDocumentLocator() method of ContentHandler

The Locator interface:

package org.xml.sax;

public interface Locator {
    
  public String getPublicId();
  public String getSystemId();
  public int    getLineNumber();
  public int    getColumnNumber();
    
}

Locator Example

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.*;


public class LocationReporter implements ContentHandler {

  Locator locator = null;

  public void setDocumentLocator(Locator locator) {
    this.locator = locator;  
  }
  
  private String reportPosition() {
    
    if (locator != null) {
      
      String publicID = locator.getPublicId();
      String systemID = locator.getSystemId();
      int line        = locator.getLineNumber();
      int column      = locator.getColumnNumber();
      
      String name;
      if (publicID != null) name = publicID;
      else name = systemID;
      
      return " in " + name + " at line " + line 
       + ", column " + column;
    }
    return "";
    
  }
  
  public void startDocument() throws SAXException {
    System.out.println("Document started" + reportPosition()); 
  }

  public void endDocument() throws SAXException {
    System.out.println("Document ended" + reportPosition()); 
  }
  
  public void characters(char[] text, int start, int length) 
   throws SAXException {
    System.out.println("Got some characters" + reportPosition()); 
  }
  
  public void ignorableWhitespace(char[] text, int start, int length)
   throws SAXException {
    System.out.println("Got some ignorable white space" 
     + reportPosition()); 
  }
  
  public void processingInstruction(String target, String data)
   throws SAXException {
    System.out.println("Got a processing instruction" 
     + reportPosition()); 
  }
  
  // Changed methods for SAX2
  public void startElement(String namespaceURI, String localName,
	 String qualifiedName, Attributes atts) throws SAXException {
    System.out.println("Element " + qualifiedName + " started" 
     + reportPosition()); 
  }
  
  public void endElement(String namespaceURI, String localName,
	 String qualifiedName) throws SAXException {
    System.out.println("Element " + qualifiedName + " ended" 
     + reportPosition()); 
  } 

  // new methods for SAX2
  public void startPrefixMapping(String prefix, String uri) 
   throws SAXException {
    System.out.println("Started mapping prefix " + prefix 
     + " to URI " + uri + reportPosition());     
  }

  public void endPrefixMapping(String prefix) throws SAXException {
    System.out.println("Stopped mapping prefix " 
     + prefix + reportPosition());         
  }

  public void skippedEntity(String name) throws SAXException {
    System.out.println("Skipped entity " + name + reportPosition());         
  }  

  // Could easily have put main() method in a separate class
  public static void main(String[] args) {
    
    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    }
    catch (SAXException ex) {
      try {
        parser = XMLReaderFactory.createXMLReader(
         "org.apache.xerces.parsers.SAXParser");
      }
      catch (SAXException e2) {
        System.err.println("Error: no parser found!");
        return; 
      }
    }
     
    if (args.length == 0) {
      System.out.println(
       "Usage: java LocationReporter URL1 URL2..."); 
      return;
    } 
      
    // Install the Content Handler      
    parser.setContentHandler(new LocationReporter());
    
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      // command line should offer URIs or file names
      try {
        parser.parse(args[i]);
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber() 
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not report on " + args[i] 
         + " because of the IOException " + e);
      }
      
    }  
  
  }

}

Locator Example

Document started in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 1, column 1
Got a processing instruction in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 2, column 51
Started mapping prefix  to URI http://metalab.unc.edu/xml/namespace/song in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 5, column 50
Started mapping prefix xlink to URI http://www.w3.org/1999/xlink in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 5, column 50
Element SONG started in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 5, column 50
Got some ignorable white space in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 6, column 3
Element TITLE started in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 6, column 10
Got some characters in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 6, column 17
Element TITLE ended in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 6, column 26
Got some ignorable white space in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 7, column 3
Element PHOTO started in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 9, column 65
Element PHOTO ended in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 9, column 65
Got some ignorable white space in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 10, column 3
Element COMPOSER started in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 10, column 13
Got some characters in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 10, column 27
Element COMPOSER ended in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 10, column 39
Got some ignorable white space in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 11, column 3
Element COMPOSER started in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 11, column 13
Got some characters in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 11, column 25
Element COMPOSER ended in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 11, column 37
Got some ignorable white space in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 12, column 3
Element COMPOSER started in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 12, column 13
Got some characters in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 12, column 26
Element COMPOSER ended in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 12, column 38
Got some ignorable white space in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 13, column 3
Element PRODUCER started in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 13, column 13
Got some characters in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 13, column 27
Element PRODUCER ended in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 13, column 39
Got some ignorable white space in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 14, column 3
Got some ignorable white space in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 16, column 3
Element PUBLISHER started in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 16, column 73
Got some characters in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 17, column 7
Got some characters in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 17, column 12
Got some characters in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 18, column 3
Element PUBLISHER ended in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 18, column 16
Got some ignorable white space in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 19, column 3
Element LENGTH started in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 19, column 11
Got some characters in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 19, column 15
Element LENGTH ended in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 19, column 25
Got some ignorable white space in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 20, column 3
Element YEAR started in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 20, column 9
Got some characters in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 20, column 13
Element YEAR ended in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 20, column 21
Got some ignorable white space in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 21, column 3
Element ARTIST started in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 21, column 11
Got some characters in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 21, column 25
Element ARTIST ended in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 21, column 35
Got some ignorable white space in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 22, column 1
Element SONG ended in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 22, column 9
Stopped mapping prefix xlink in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 22, column 9
Stopped mapping prefix  in file:///J:/KENTUCKY/xmlandjava/EXAMPLES/hotcop.xml at line 22, column 9
Document ended in Null Entity at line -1, column -1

The DefaultHandler class

Implements the main interfaces with do-nothing methods
- EntityResolver
- DTDHandler
- ContentHandler
- ErrorHandler
Replaces HandlerBase from SAX1
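
Because DefaultHandler supplies an empty implementation of every callback, a handler only has to override the events it cares about. The following is a minimal sketch, not part of the original examples: the class name ElementCounter and the sample document are made up for illustration, and the parser is assumed to come from the JAXP SAXParserFactory bundled with the JDK rather than from Xerces.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: counts start-tags and ignores everything else
class ElementCounter extends DefaultHandler {

  private int count = 0;

  // The only callback overridden here; DefaultHandler's
  // do-nothing methods absorb all the other events
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {
    count++;
  }

  public int getCount() {
    return count;
  }

  // Parses a document held in a string and returns its element count
  public static int countElements(String xml) {
    try {
      // Assumption: a JAXP parser is available (true of current JDKs)
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true);
      XMLReader parser = factory.newSAXParser().getXMLReader();
      ElementCounter counter = new ElementCounter();
      parser.setContentHandler(counter);
      parser.parse(new InputSource(new StringReader(xml)));
      return counter.getCount();
    }
    catch (Exception e) {
      // malformed input or no parser available
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(countElements(
     "<SONG><TITLE>Hot Cop</TITLE><YEAR>1978</YEAR></SONG>"));
  }

}
```

The same handler could also be registered as the ErrorHandler, DTDHandler, or EntityResolver, since DefaultHandler implements all four interfaces.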

The NamespaceSupport class

Mostly for internal parser use
Occasionally useful for tasks like finding out whether a document contains any XLinks

The NamespaceSupport class:

package org.xml.sax.helpers;

public class NamespaceSupport {

  public final static String XMLNS = "http://www.w3.org/XML/1998/namespace";

  public NamespaceSupport();

  public void reset();
  public void        pushContext();
  public void        popContext();
  public boolean     declarePrefix(String prefix, String uri);
  public String      getURI(String prefix);
  public Enumeration getPrefixes();
  public Enumeration getDeclaredPrefixes();
  public String[]    processName(String qualifiedName, 
   String[] parts, boolean isAttribute);
   
}
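
As a rough illustration of how these methods fit together, the sketch below declares the xlink prefix by hand and asks NamespaceSupport to resolve a qualified name against it. The class name NamespaceSupportDemo and the method resolve() are invented for this example. In a real handler the declarations would more likely arrive through startPrefixMapping().

```java
import org.xml.sax.helpers.NamespaceSupport;

// Hypothetical example class, not part of SAX
class NamespaceSupportDemo {

  // Resolves a qualified name against one hard-coded xlink declaration.
  // Returns the namespace URI, or null if the prefix is undeclared.
  public static String resolve(String qualifiedName) {

    NamespaceSupport support = new NamespaceSupport();
    support.pushContext();  // entering an element's scope
    support.declarePrefix("xlink", "http://www.w3.org/1999/xlink");

    // processName() fills the array with
    // {namespace URI, local name, qualified name}
    String[] parts = new String[3];
    String[] result = support.processName(qualifiedName, parts, true);

    support.popContext();   // leaving the element's scope

    if (result == null) return null;
    return result[0];

  }

  public static void main(String[] args) {
    System.out.println(resolve("xlink:href"));
  }

}
```

A check for XLinks could then compare the returned URI with http://www.w3.org/1999/xlink instead of trusting the xlink prefix, which a document is free to bind to something else.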

Filtering XML

The XMLFilter interface is like an XML reader, "except that it obtains its events from another XML reader rather than a primary source like an XML document or database. Filters can modify a stream of events as they pass on to the final application."
The parent is the parser the filter gets the data from.

Only two methods in the interface:

public void      setParent(XMLReader parent)
public XMLReader getParent()

XMLFilterImpl is a default filter that simply passes along all events it receives:
public class XMLFilterImpl implements XMLFilter, EntityResolver, DTDHandler, ContentHandler, ErrorHandler

Only new methods are constructors:

public XMLFilterImpl()
public XMLFilterImpl(XMLReader parent)
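
A filter usually subclasses XMLFilterImpl and overrides only the events it wants to change, forwarding the modified event with a call to super. Here is a small sketch under stated assumptions: LowerCaseFilter and filteredNames() are hypothetical names, and the parent parser comes from JAXP rather than from Xerces directly.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: lower-cases element names on their way
// from the parent parser to the application's ContentHandler
class LowerCaseFilter extends XMLFilterImpl {

  public LowerCaseFilter(XMLReader parent) {
    super(parent);
  }

  public void startElement(String uri, String localName,
   String qualifiedName, Attributes atts) throws SAXException {
    // super passes the modified event on to the registered handler
    super.startElement(uri, localName.toLowerCase(Locale.ENGLISH),
     qualifiedName.toLowerCase(Locale.ENGLISH), atts);
  }

  public void endElement(String uri, String localName,
   String qualifiedName) throws SAXException {
    super.endElement(uri, localName.toLowerCase(Locale.ENGLISH),
     qualifiedName.toLowerCase(Locale.ENGLISH));
  }

  // Returns the element names the application sees, space separated
  public static String filteredNames(String xml) {
    final StringBuffer names = new StringBuffer();
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true);
      XMLReader base = factory.newSAXParser().getXMLReader();

      // The application talks to the filter exactly as it would
      // talk to a parser
      XMLReader parser = new LowerCaseFilter(base);
      parser.setContentHandler(new DefaultHandler() {
        public void startElement(String uri, String localName,
         String qualifiedName, Attributes atts) {
          names.append(qualifiedName).append(' ');
        }
      });
      parser.parse(new InputSource(new StringReader(xml)));
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
    return names.toString().trim();
  }

  public static void main(String[] args) {
    System.out.println(
     filteredNames("<SONG><TITLE>Hot Cop</TITLE></SONG>"));
  }

}
```

Since a filter is itself an XMLReader, filters can be chained by passing one filter as the parent of the next.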

XMLFilter Example

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.util.*;
import java.io.IOException;


public class UnparsedTextFilter extends XMLFilterImpl {

  private TextEntityReplacer replacer;

  public UnparsedTextFilter(XMLReader parent) {
    super(parent);
    System.err.println("created UnparsedTextFilter");
  }

  public void parse(InputSource input) 
   throws IOException, SAXException {
    System.err.println("parsing");
    replacer = new TextEntityReplacer(input.getPublicId(), 
     input.getSystemId());
    this.setDTDHandler(replacer); 
    // hand the document to the parent parser; 
    // without this call nothing is ever parsed
    super.parse(input);
  }
  // The other parse() method just calls this one 

  public void parse(String systemId) 
   throws IOException, SAXException {
    parse(new InputSource(systemId)); 
  }

  public void startElement(String uri, String localName, 
   String qualifiedName, Attributes attributes) throws SAXException {
    
    Vector extraText = new Vector();

    // Are there any unparsed entities in the attributes?
    for (int i = 0; i < attributes.getLength(); i++) {
      if (attributes.getType(i).equals("ENTITY")) {
        try {
          // diagnostics go to System.err so they don't mix with 
          // XML written to System.out
          System.err.println("replacing");
          String s = replacer.getText(attributes.getValue(i));
          if (s != null) extraText.addElement(s);
        }
        catch (IOException e) {
          System.err.println(e); 
        }
      } 
      
    }    

    super.startElement(uri, localName, qualifiedName, attributes);
    
    // Now spew out the values of the unparsed entities:
    Enumeration e = extraText.elements();
    while (e.hasMoreElements()) {
      Object o = e.nextElement();
      String s = (String) o;
      super.characters(s.toCharArray(), 0, s.length()); 
    }
    
  }

}

TextMerger

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.util.*;
import java.io.IOException;
import org.apache.xml.serialize.*;


public class TextMerger {

  public static void main(String[] args) {
  
    XMLReader base;
    try {
     base = XMLReaderFactory.createXMLReader();
    }
    catch (Exception e) {
      // fall back on Xerces parser by name
      try {
        base = XMLReaderFactory.createXMLReader(
         "org.apache.xerces.parsers.SAXParser");
      }
      catch (Exception ee) {
        System.err.println("Couldn't locate a SAX parser");
        return;          
      }
    }
    
    XMLReader parser = new UnparsedTextFilter(base);
    
    //essentially a pretty printer
    XMLSerializer printer 
     = new XMLSerializer(System.out, new OutputFormat());
    
    parser.setContentHandler(printer);
    
    for (int i = 0; i < args.length; i++) {
      try {
        parser.parse(args[i]);
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber() 
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not report on " + args[i] 
         + " because of the IOException " + e);
      }      
    }
  
  }

}

InputSource

Encapsulates access to data so that it looks the same whether it's coming from a
- URL
- file
- stream
- reader
- database
- something else
Used in SAX1 and SAX2
Allows the source to be changed

The InputSource class

package org.xml.sax;

import java.io.*;

public class InputSource {

  public InputSource();
  public InputSource(String systemID);
  public InputSource(InputStream in);
  public InputSource(Reader in);

  public void   setPublicId(String publicID);
  public String getPublicId();
  public void   setSystemId(String systemID);
  public String getSystemId();

  public void        setByteStream(InputStream byteStream);
  public InputStream getByteStream();
  public void        setEncoding(String encoding);
  public String      getEncoding();
  public void        setCharacterStream(Reader characterStream);
  public Reader      getCharacterStream();

}

Example of InputSource

import org.xml.sax.*;
import java.io.*;
import java.net.*;
import java.util.zip.*;
...
try {

  URL u = new URL(
   "http://www.ibiblio.org/xml/examples/1998validstats.xml.gz"); 
  InputStream raw = u.openStream();
  InputStream decompressed = new GZIPInputStream(raw);
  InputSource in = new InputSource(decompressed);
  // read the document... 

}
catch (IOException e) {
  System.err.println(e);
}
catch (SAXException e) {
  System.err.println(e);
}
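
The same idea can be tried without a network connection. The sketch below is an illustration only, with the hypothetical class name GzipSourceDemo: it compresses a document in memory, wraps the decompressing stream in an InputSource, and parses it. The parser is assumed to come from JAXP.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class
class GzipSourceDemo {

  // Compresses the document in memory, parses the compressed bytes,
  // and returns the character data the parser reports
  public static String roundTrip(String xml) {
    final StringBuffer text = new StringBuffer();
    try {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      GZIPOutputStream gzip = new GZIPOutputStream(buffer);
      gzip.write(xml.getBytes("UTF-8"));
      gzip.close();

      InputStream raw = new ByteArrayInputStream(buffer.toByteArray());
      InputStream decompressed = new GZIPInputStream(raw);
      InputSource in = new InputSource(decompressed);
      in.setEncoding("UTF-8"); // tells the parser how to decode the bytes

      XMLReader parser
       = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      parser.setContentHandler(new DefaultHandler() {
        public void characters(char[] ch, int start, int length) {
          text.append(ch, start, length);
        }
      });
      // The parser neither knows nor cares that the bytes were gzipped
      parser.parse(in);
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
    return text.toString();
  }

  public static void main(String[] args) {
    System.out.println(roundTrip("<YEAR>1978</YEAR>"));
  }

}
```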

What SAX2 doesn't do

ELEMENT, ATTLIST, ENTITY declarations are only optionally reported
Schema declarations aren't reported at all
Lexical events are only optionally reported
SAX2 can be configured on top of a lot of different parsers with different capabilities. What the parser does is more important than what SAX2 does.
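
"Optionally reported" means the application has to ask, and the parser may refuse. A sketch of asking for lexical events follows, assuming the standard SAX2 property name http://xml.org/sax/properties/lexical-handler and the DefaultHandler2 convenience class from org.xml.sax.ext. CommentCounter is a hypothetical name.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical example class: DefaultHandler2 adds do-nothing
// LexicalHandler methods to DefaultHandler
class CommentCounter extends DefaultHandler2 {

  private int comments = 0;

  // LexicalHandler callback; invoked only if the parser
  // accepted the lexical-handler property
  public void comment(char[] text, int start, int length) {
    comments++;
  }

  // Returns the number of comments, or -1 if this parser
  // does not report lexical events
  public static int countComments(String xml) {
    try {
      XMLReader parser
       = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      CommentCounter handler = new CommentCounter();
      try {
        parser.setProperty(
         "http://xml.org/sax/properties/lexical-handler", handler);
      }
      catch (SAXException e) {
        // SAXNotRecognizedException or SAXNotSupportedException
        return -1;
      }
      parser.setContentHandler(handler);
      parser.parse(new InputSource(new StringReader(xml)));
      return handler.comments;
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(
     countComments("<SONG><!-- one --><YEAR/><!-- two --></SONG>"));
  }

}
```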

Event Based API Caveats

You do not always have all the information you need at the time of a given callback
You may need to store information in various data structures (stacks, queues, vectors, arrays, etc.) and act on it at a later point
For example the characters() method is not guaranteed to give you the maximum number of contiguous characters. It may split a single run of characters over multiple method calls.
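
The usual way around the characters() caveat is to append each chunk to a buffer and to use the buffer only when the end-tag arrives. A minimal sketch, assuming a JAXP parser; the class name TitleExtractor is made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: collects the full text of the TITLE element
class TitleExtractor extends DefaultHandler {

  private StringBuffer buffer = null; // non-null only while inside TITLE
  private String title = null;

  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) {
    if (qualifiedName.equals("TITLE")) buffer = new StringBuffer();
  }

  // May be called several times for one run of text,
  // so all it does is append
  public void characters(char[] text, int start, int length) {
    if (buffer != null) buffer.append(text, start, length);
  }

  // Only here is the text known to be complete
  public void endElement(String namespaceURI, String localName,
   String qualifiedName) {
    if (qualifiedName.equals("TITLE")) {
      title = buffer.toString();
      buffer = null;
    }
  }

  // Returns the text of the TITLE element, or null if there is none
  public static String extractTitle(String xml) {
    try {
      XMLReader parser
       = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      TitleExtractor handler = new TitleExtractor();
      parser.setContentHandler(handler);
      parser.parse(new InputSource(new StringReader(xml)));
      return handler.title;
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // The entity reference makes many parsers split the text
    System.out.println(
     extractTitle("<SONG><TITLE>Hot &amp; Cold</TITLE></SONG>"));
  }

}
```

Whether the parser delivers the title in one call or three, the handler produces the same string.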

To Learn More

XML in a Nutshell
- Elliotte Rusty Harold and Scott Means
- O'Reilly & Associates, 2001
- ISBN: 0-596-00058-8

SAX website: http://www.megginson.com/SAX/

Part IV: DOM, The Document Object Model

Writing with DOM
Reading with DOM

Trees

An XML document is a tree.
It has a root.
It has nodes.
It is amenable to recursive processing.
Not all applications agree on what the root is.
Not all applications agree on what is and isn't a node.

Document Object Model

Defines how XML and HTML documents are represented as objects in programs
W3C Standard
Defined in IDL; thus language independent
HTML as well as XML
Writing as well as reading
More complete than SAX; covers everything except internal and external DTD subsets
DOM focuses more on the document; SAX focuses more on the parser.

DOM Evolution

DOM Parsers for Java

Apache XML Project's Xerces Java: http://xml.apache.org/xerces-j/index.html
IBM's XML for Java: http://www.alphaworks.ibm.com/formula/xml
Sun's Java API for XML: http://java.sun.com/products/xml

Eight Modules:

- Core: org.w3c.dom *
- HTML: org.w3c.dom.html
- Views: org.w3c.dom.views
- StyleSheets: org.w3c.dom.stylesheets
- CSS: org.w3c.dom.css
- Events: org.w3c.dom.events *
- Traversal: org.w3c.dom.traversal *
- Range: org.w3c.dom.range
Only the core and traversal modules really apply to XML. The other six are for HTML.
* indicates Xerces support

Which modules and features are supported?

A DOM application can use the hasFeature() method of the DOMImplementation interface to determine whether a module is supported or not.
- XML Module: "XML"
- HTML Module: "HTML"
- Views Module: "Views"
- StyleSheets Module: "StyleSheets"
- CSS Module: "CSS"
- CSS (extended interfaces) Module: "CSS2"
- Events Module: "Events"
- User Interface Events (UIEvent interface) Module: "UIEvents"
- Mouse Events Module: "MouseEvents"
- Mutation Events Module: "MutationEvents"
- HTML Events Module: "HTMLEvents"
- Traversal Module: "Traversal"
- Range Module: "Range"

Which modules are supported?

import org.apache.xerces.dom.*;
import org.w3c.dom.*;
import java.io.*;


public class ModuleChecker {

  public static void main(String[] args) {
     
    // parser dependent
    DOMImplementation implementation 
     = DOMImplementationImpl.getDOMImplementation();
    String[] features = {"XML", "HTML", "Views", "StyleSheets",
     "CSS", "CSS2", "Events", "UIEvents", "MouseEvents", 
     "MutationEvents", "HTMLEvents", "Traversal", "Range"};
    
    for (int i = 0; i < features.length; i++) {
      if (implementation.hasFeature(features[i], "2.0")) {
        System.out.println("Implementation supports " 
         + features[i]);
      } 
      else {
        System.out.println("Implementation does not support " 
         + features[i]);
      } 
    }
  
  }

}

Which modules are supported? Results

% java ModuleChecker
Implementation supports XML
Implementation does not support HTML
Implementation does not support Views
Implementation does not support StyleSheets
Implementation does not support CSS
Implementation does not support CSS2
Implementation supports Events
Implementation does not support UIEvents
Implementation does not support MouseEvents
Implementation supports MutationEvents
Implementation does not support HTMLEvents
Implementation supports Traversal
Implementation does not support Range

DOM Trees

Entire document is represented as a tree.
A tree contains nodes.
Some nodes may contain other nodes (depending on node type).
Each document node contains:
- zero or one doctype nodes
- one root element node
- zero or more comment and processing instruction nodes

org.w3c.dom

17 interfaces:
- Attr
- CDATASection
- CharacterData
- Comment
- Document
- DocumentFragment
- DocumentType
- DOMImplementation
- Element
- Entity
- EntityReference
- NamedNodeMap
- Node
- NodeList
- Notation
- ProcessingInstruction
- Text
plus one exception: DOMException
Plus a bunch of HTML stuff in org.w3c.dom.html and other packages we will ignore

The DOM Process

Library specific code creates a parser
The parser parses the document and returns a DOM org.w3c.dom.Document object.
The entire document is stored in memory.
DOM methods and interfaces are used to extract data from this object

Parsing documents with a DOM Parser Example

import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;

public class DOMParserMaker {

  public static void main(String[] args) {
     
    // This is simpler but less flexible than the SAX approach.
    // Perhaps a good creational design pattern is needed here?   
  
    DOMParser parser = new DOMParser();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document d = parser.getDocument();
        // work with the document...
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
   
  }

}
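
One possible answer to the design pattern question in the comment above is the factory approach taken by JAXP. The sketch below is an alternative to the Xerces-specific code, not a replacement the original slides propose. It assumes a JAXP 1.1 or later implementation, as bundled with current JDKs, and the class name JAXPParserMaker is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class
class JAXPParserMaker {

  // Parses a document held in a string without naming any parser class
  public static Document parse(String xml) {
    try {
      // The factory picks whichever DOM parser is installed
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      DocumentBuilder builder = factory.newDocumentBuilder();

      // Read the entire document into memory
      return builder.parse(new InputSource(new StringReader(xml)));
    }
    catch (Exception e) {
      // malformed input or no parser available
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Document d = parse("<SONG><TITLE>Hot Cop</TITLE></SONG>");
    System.out.println(d.getDocumentElement().getTagName());
  }

}
```

Everything after the parse() call uses only org.w3c.dom interfaces, so the rest of a program is unaffected by which approach created the Document.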

The Node Interface

package org.w3c.dom;

public interface Node {

  // NodeType
  public static final short ELEMENT_NODE                = 1;
  public static final short ATTRIBUTE_NODE              = 2;
  public static final short TEXT_NODE                   = 3;
  public static final short CDATA_SECTION_NODE          = 4;
  public static final short ENTITY_REFERENCE_NODE       = 5;
  public static final short ENTITY_NODE                 = 6;
  public static final short PROCESSING_INSTRUCTION_NODE = 7;
  public static final short COMMENT_NODE                = 8;
  public static final short DOCUMENT_NODE               = 9;
  public static final short DOCUMENT_TYPE_NODE          = 10;
  public static final short DOCUMENT_FRAGMENT_NODE      = 11;
  public static final short NOTATION_NODE               = 12;

  public String       getNodeName();
  public String       getNodeValue() throws DOMException;
  public void         setNodeValue(String nodeValue) throws DOMException;
  public short        getNodeType();
  public Node         getParentNode();
  public NodeList     getChildNodes();
  public Node         getFirstChild();
  public Node         getLastChild();
  public Node         getPreviousSibling();
  public Node         getNextSibling();
  public NamedNodeMap getAttributes();
  public Document     getOwnerDocument();
  public Node         insertBefore(Node newChild, Node refChild) throws DOMException;
  public Node         replaceChild(Node newChild, Node oldChild) throws DOMException;
  public Node         removeChild(Node oldChild) throws DOMException;
  public Node         appendChild(Node newChild) throws DOMException;
  public boolean      hasChildNodes();
  public Node         cloneNode(boolean deep);
  public void         normalize();
  public boolean      isSupported(String feature, String version);
  public String       getNamespaceURI();
  public String       getPrefix();
  public void         setPrefix(String prefix) throws DOMException;
  public String       getLocalName();
}

The NodeList Interface

package org.w3c.dom;

public interface NodeList {
  public Node item(int index);
  public int  getLength();
}

Now we're really ready to read a document

Node Reporter

import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;


public class NodeReporter {

  public static void main(String[] args) {
     
    DOMParser parser = new DOMParser();
    NodeReporter iterator = new NodeReporter();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document doc = parser.getDocument();
        iterator.followNode(doc);
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

  // note use of recursion
  public void followNode(Node node) {
    
    processNode(node);
    if (node.hasChildNodes()) {
      NodeList children = node.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        followNode(children.item(i));
      } 
    }
    
  }

  public void processNode(Node node) {
    
    String name = node.getNodeName();
    String type = getTypeName(node.getNodeType());
    System.out.println("Type " + type + ": " + name);
    
  }
  
  public static String getTypeName(int type) {
    
    switch (type) {
      case Node.ELEMENT_NODE: 
        return "Element";
      case Node.ATTRIBUTE_NODE: 
        return "Attribute";
      case Node.TEXT_NODE: 
        return "Text";
      case Node.CDATA_SECTION_NODE: 
        return "CDATA Section";
      case Node.ENTITY_REFERENCE_NODE: 
        return "Entity Reference";
      case Node.ENTITY_NODE: 
        return "Entity";
      case Node.PROCESSING_INSTRUCTION_NODE: 
        return "Processing Instruction";
      case Node.COMMENT_NODE : 
        return "Comment";
      case Node.DOCUMENT_NODE: 
        return "Document";
      case Node.DOCUMENT_TYPE_NODE: 
        return "Document Type Declaration";
      case Node.DOCUMENT_FRAGMENT_NODE: 
        return "Document Fragment";
      case Node.NOTATION_NODE: 
        return "Notation";
      default: 
        return "Unknown Type"; 
    }
    
  }

}

Node Reporter Output

% java NodeReporter hotcop.xml
Type Document: #document
Type Processing Instruction: xml-stylesheet
Type Document Type Declaration: SONG
Type Element: SONG
Type Text: #text
Type Element: TITLE
Type Text: #text
Type Text: #text
Type Element: PHOTO
Type Text: #text
Type Element: COMPOSER
Type Text: #text
Type Text: #text
Type Element: COMPOSER
Type Text: #text
Type Text: #text
Type Element: COMPOSER
Type Text: #text
Type Text: #text
Type Element: PRODUCER
Type Text: #text
Type Text: #text
Type Comment: #comment
Type Text: #text
Type Element: PUBLISHER
Type Text: #text
Type Text: #text
Type Element: LENGTH
Type Text: #text
Type Text: #text
Type Element: YEAR
Type Text: #text
Type Text: #text
Type Element: ARTIST
Type Text: #text
Type Text: #text
Type Comment: #comment

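Attributes are missing from this output. They are not children of the elements that carry them, so a walk over getChildNodes() never visits them. They are properties of elements, reached through getAttributes().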
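
To see the attributes, the recursive walk has to ask each node for them explicitly. A sketch under stated assumptions: AttributeReporter and its methods are hypothetical names, and the parser comes from JAXP rather than from Xerces directly.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class
class AttributeReporter {

  // Recursive walk over the tree that also looks at each node's attributes
  public static void collect(Node node, List<String> found) {

    // getAttributes() returns null for everything except elements
    NamedNodeMap attributes = node.getAttributes();
    if (attributes != null) {
      for (int i = 0; i < attributes.getLength(); i++) {
        Attr attribute = (Attr) attributes.item(i);
        found.add(node.getNodeName() + "/@" + attribute.getName()
         + "=" + attribute.getValue());
      }
    }

    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collect(children.item(i), found);
    }

  }

  // Lists every attribute in a document held in a string
  public static List<String> attributesOf(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
       .newDocumentBuilder()
       .parse(new InputSource(new StringReader(xml)));
      List<String> found = new ArrayList<String>();
      collect(doc, found);
      return found;
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(attributesOf(
     "<SONG><PHOTO ALT=\"Victor Willis\"/></SONG>"));
  }

}
```

The order of attributes within a NamedNodeMap is not significant, so code should not rely on it.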

Node Values as returned by getNodeValue()

Node Type	Node Value
element node	null
attribute node	attribute value
text node	text of the node
CDATA section node	text of the section
entity reference node	null
entity node	null
processing instruction node	content of the processing instruction, not including the target
comment node	text of the comment
document node	null
document type declaration node	null
document fragment node	null
notation node	null
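
A few rows of this table can be checked directly. The sketch below prints the name and value of a root element and each of its children; NodeValueDemo is a hypothetical class, and the parser is assumed to come from JAXP.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class
class NodeValueDemo {

  // Returns "name -> value" for the root element and each of its children
  public static String values(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
       .newDocumentBuilder()
       .parse(new InputSource(new StringReader(xml)));
      Element root = doc.getDocumentElement();

      StringBuffer result = new StringBuffer();
      // An element's own value is always null;
      // its text lives in child nodes
      result.append(root.getNodeName()).append(" -> ")
       .append(root.getNodeValue());

      NodeList children = root.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        Node child = children.item(i);
        result.append("; ").append(child.getNodeName())
         .append(" -> ").append(child.getNodeValue());
      }
      return result.toString();
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(
     values("<YEAR>1978<!--disco era--><?robots index?></YEAR>"));
  }

}
```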

The Document Node

The root node representing the entire document; not the same as the root element
Contains:
- one element node
- zero or more processing instruction nodes
- zero or more comment nodes
- zero or one document type nodes
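
The difference between the document node and the root element shows up as soon as the document node's children are listed. A brief sketch, with the hypothetical class name DocumentChildren and a parser assumed to come from JAXP:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class
class DocumentChildren {

  // Returns the names of the document node's children, space separated
  public static String childNames(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
       .newDocumentBuilder()
       .parse(new InputSource(new StringReader(xml)));

      StringBuffer names = new StringBuffer();
      // These are siblings of the root element, not its children
      NodeList children = doc.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        if (i > 0) names.append(' ');
        names.append(children.item(i).getNodeName());
      }
      return names.toString();
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(childNames(
     "<!--Hot Cop--><?robots index?><SONG/><!--the end-->"));
  }

}
```

The root element is just one entry in this list; getDocumentElement() is the shortcut that picks it out.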

The Document Interface

package org.w3c.dom;

public interface Document extends Node {
  
    public DocumentType      getDoctype();
    public DOMImplementation getImplementation();
    public Element           getDocumentElement();
    public Element           createElement(String tagName) throws DOMException;
    public Element           createElementNS(String namespaceURI, String qualifiedName) throws DOMException;
    public DocumentFragment  createDocumentFragment();
    public Text              createTextNode(String data);
    public Comment           createComment(String data);
    public CDATASection      createCDATASection(String data) throws DOMException;
    public ProcessingInstruction createProcessingInstruction(String target, String data)
     throws DOMException;
    public Attr            createAttribute(String name) throws DOMException;
    public Attr            createAttributeNS(String namespaceURI, String qualifiedName) throws DOMException;
    public EntityReference createEntityReference(String name) throws DOMException;
    public NodeList        getElementsByTagName(String tagname);
    public NodeList        getElementsByTagNameNS(String namespaceURI, String localName);
    public Element         getElementById(String elementId);
    public Node            importNode(Node importedNode, boolean deep) throws DOMException;
    
}

A Sample Application

UserLand's RSS based list of Web logs at http://static.userland.com/weblogMonitor/logs.xml:

<?xml version="1.0"?>
<!-- <!DOCTYPE foo SYSTEM "http://msdn.microsoft.com/xml/general/htmlentities.dtd"> -->
<weblogs>
	<log>
		<name>MozillaZine</name>
		<url>http://www.mozillazine.org</url>
		<changesUrl>http://www.mozillazine.org/contents.rdf</changesUrl>
		<ownerName>Jason Kersey</ownerName>
		<ownerEmail>kerz@en.com</ownerEmail>
		<description>THE source for news on the Mozilla Organization.  DevChats, Reviews, Chats, Builds, Demos, Screenshots, and more.</description>
		<imageUrl></imageUrl>
		<adImageUrl>http://static.userland.com/weblogMonitor/ads/kerz@en.com.gif</adImageUrl>
		</log>
	<log>
		<name>SalonHerringWiredFool</name>
		<url>http://www.salonherringwiredfool.com/</url>
		<ownerName>Some Random Herring</ownerName>
		<ownerEmail>salonfool@wiredherring.com</ownerEmail>
		<description></description>
		</log>
	<log>
		<name>Scripting News</name>
		<url>http://www.scripting.com/</url>
		<ownerName>Dave Winer</ownerName>
		<ownerEmail>dave@userland.com</ownerEmail>
		<description>News and commentary from the cross-platform scripting community.</description>
		<imageUrl>http://www.scripting.com/gifs/tinyScriptingNews.gif</imageUrl>
		<adImageUrl>http://static.userland.com/weblogMonitor/ads/dave@userland.com.gif</adImageUrl>
		</log>
	<log>
		<name>SlashDot.Org</name>
		<url>http://www.slashdot.org/</url>
		<ownerName>Simply a friend</ownerName>
		<ownerEmail>afriendofweblogs@weblogs.com</ownerEmail>
		<description>News for Nerds, Stuff that Matters.</description>
		</log>
	</weblogs>

DOM Design

Once the document has been parsed, we can easily find out how many URLs there are, since they're all in memory.
Single threaded by nature; no benefit to multiple threads since no data will be available until the entire document has been read and parsed.
The character data of each url element needs to be read. Everything else can be ignored.
The getElementsByTagName() method in Document gives us a quick list of all the url elements.
The XML parsing is so straightforward it can be done inside one method. No extra class is required.

Weblogs with DOM

import org.w3c.dom.*;
import org.xml.sax.SAXException;
import java.io.IOException;
import java.util.*;
import java.net.*;


public class WeblogsDOM {

  public static String DEFAULT_URL
   = "http://static.userland.com/weblogMonitor/logs.xml";

  public static List listChannels() throws DOMException {
    return listChannels(DEFAULT_URL);
  }

  public static List listChannels(String uri) throws DOMException {

    if (uri == null) {
      throw new NullPointerException("URL must be non-null");
    }

    org.apache.xerces.parsers.DOMParser parser
     = new org.apache.xerces.parsers.DOMParser();

    // stays empty, rather than null, if the document can't be parsed
    Vector urls = new Vector();

    try {
      // Read the entire document into memory
      parser.parse(uri);
      Document doc = parser.getDocument();
      NodeList logs = doc.getElementsByTagName("url");

      urls = new Vector(logs.getLength());

      for (int i = 0; i < logs.getLength(); i++) {
        try {
          Node element = logs.item(i);
          Node text = element.getFirstChild();
          // an empty <url></url> element has no child nodes
          if (text == null) continue;
          String content = text.getNodeValue();
          URL u = new URL(content);
          urls.addElement(u);
        }
        catch (MalformedURLException e) {
          // bad input data from one third party; just ignore it
        }
      }
    }
    catch (SAXException e) {
      System.err.println(e);
    }
    catch (IOException e) {
      System.err.println(e);
    }

    return urls;

  }

  public static void main(String[] args) {

    try {
      List urls;
      if (args.length > 0) {
        try {
          URL url = new URL(args[0]);
          urls = listChannels(args[0]);
        }
        catch (MalformedURLException e) {
          System.err.println("Usage: java WeblogsDOM url");
          return;
        }
      }
      else {
        urls = listChannels();
      }
      Iterator iterator = urls.iterator();
      while (iterator.hasNext()) {
        System.out.println(iterator.next());
      }
    }
    catch (/* Unexpected */ Exception e) {
      e.printStackTrace();
    }

  } // end main

}

Weblogs Output

% java WeblogsDOM
http://2020Hindsight.editthispage.com/
http://www.sff.net/people/mitchw/weblog/weblog.htp
http://nate.weblogs.com/
http://plugins.launchpoint.net
http://404.psistorm.net
http://home.att.net/~geek9000
http://daubnet.tzo.com/weblog
several hundred more...

Element Nodes

Represents a complete element including its start tag, end tag, and content
Contains:
- Element nodes
- ProcessingInstruction nodes
- Comment nodes
- Text nodes
- CDATASection nodes
- EntityReference nodes

The Element Interface

package org.w3c.dom;

public interface Element extends Node {

  public String   getTagName();

  public NodeList getElementsByTagName(String name);
  public NodeList getElementsByTagNameNS(String namespaceURI, 
   String localName);

  public String   getAttribute(String name);
  public String   getAttributeNS(String namespaceURI, 
   String localName);
  public void     setAttribute(String name, String value) 
   throws DOMException;
  public void     setAttributeNS(String namespaceURI, 
   String qualifiedName, String value) throws DOMException;
  public void     removeAttribute(String name) throws DOMException;
  public void     removeAttributeNS(String namespaceURI, 
   String localName) throws DOMException;
  public Attr     getAttributeNode(String name);
  public Attr     getAttributeNodeNS(String namespaceURI, String localName);
  public Attr     setAttributeNode(Attr newAttr) throws DOMException;
  public Attr     setAttributeNodeNS(Attr newAttr) throws DOMException;
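  public Attr     removeAttributeNode(Attr oldAttr) throws DOMException;
  public boolean  hasAttribute(String name);
  public boolean  hasAttributeNS(String namespaceURI, String localName);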

}
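
A minimal sketch of the attribute methods above. It assumes the JAXP DocumentBuilderFactory bundled with the JDK rather than the Xerces DOMParser class used in these slides; the class name ElementAttributeDemo and the sample document are invented for illustration. It shows that getAttribute() returns an empty string rather than null for a missing attribute, and that the NS variants look attributes up by namespace URI and local name. The hasAttribute() method used here is part of the DOM Level 2 Element interface.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class ElementAttributeDemo {

  public static final String XLINK = "http://www.w3.org/1999/xlink";

  // Hypothetical sample, loosely modeled on the PHOTO element of hotcop.xml
  private static final String SAMPLE
   = "<SONG xmlns:xlink='http://www.w3.org/1999/xlink'>"
   + "<PHOTO WIDTH='100' xlink:href='hotcop.jpg'/>"
   + "</SONG>";

  // Parses SAMPLE and returns its PHOTO element
  public static Element samplePhoto() {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true); // required for the NS methods
      Document doc = factory.newDocumentBuilder()
       .parse(new InputSource(new StringReader(SAMPLE)));
      return (Element) doc.getElementsByTagName("PHOTO").item(0);
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Element photo = samplePhoto();
    System.out.println(photo.getTagName());
    System.out.println(photo.getAttribute("WIDTH"));
    System.out.println(photo.getAttributeNS(XLINK, "href"));

    // A missing attribute comes back as "", not null,
    // so hasAttribute() is the clearer test for presence
    System.out.println(photo.getAttribute("HEIGHT").length());
    System.out.println(photo.hasAttribute("HEIGHT"));

    photo.setAttribute("HEIGHT", "200");
    photo.removeAttribute("WIDTH");
    System.out.println(photo.hasAttribute("HEIGHT"));
    System.out.println(photo.hasAttribute("WIDTH"));
  }

}
```

This is why IDTagger tests for both null and the empty string before assigning an ID.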

IDTagger

import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.IOException;
import org.apache.xml.serialize.*;


public class IDTagger {

  int id = 1;

  public void processNode(Node node) {
    
    if (node instanceof Element) {
      
      Element element = (Element) node;
      String currentID = element.getAttribute("ID");
      if (currentID == null || currentID.equals("")) {
        element.setAttribute("ID", "_" + id);
        id = id + 1; 
      }
    }
    
  }

  // note use of recursion
  public void followNode(Node node) {
    
    processNode(node);
    if (node.hasChildNodes()) {
      NodeList children = node.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        followNode(children.item(i));
      } 
    }
    
  }

  public static void main(String[] args) {
     
    DOMParser parser  = new DOMParser();
    IDTagger iterator = new IDTagger();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document document = parser.getDocument();
        iterator.followNode(document);
        
        // now we serialize the document...
        OutputFormat format = new OutputFormat(document);
        XMLSerializer serializer 
         = new XMLSerializer(System.out, format);
        serializer.serialize(document);       
        
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

}

Output from IDTagger

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<?xml-stylesheet type="text/css" href="song.css"?>
<SONG ID="_1" xmlns="http://metalab.unc.edu/xml/namespace/song"
xmlns:xlink="http://www.w3.org/1999/xlink">   <TITLE ID="_2">Hot
Cop</TITLE>   <PHOTO ALT="Victor Willis in Cop Outfit" HEIGHT="200"
ID="_3" WIDTH="100" xlink:href="hotcop.jpg" xlink:show="onLoad"
xlink:type="simple"/>   <COMPOSER ID="_4">Jacques Morali</COMPOSER>
<COMPOSER ID="_5">Henri Belolo</COMPOSER>   <COMPOSER ID="_6">Victor
Willis</COMPOSER>   <PRODUCER ID="_7">Jacques Morali</PRODUCER>   <!--
The publisher is actually Polygram but I needed         an example of a
general entity reference. -->   <PUBLISHER ID="_8"
xlink:href="http://www.amrecords.com/" xlink:type="simple">     A &amp;
M Records   </PUBLISHER>   <LENGTH ID="_9">6:20</LENGTH>   <YEAR
ID="_10">1978</YEAR>   <ARTIST ID="_11">Village People</ARTIST> </SONG>
<!-- You can tell what album I was 
     listening to when I wrote this example -->

CharacterData interface

Represents things that are basically text holders
Superinterface of Text, Comment, and CDATASection

The CharacterData Interface

package org.w3c.dom;

public interface CharacterData extends Node {

  public String getData() throws DOMException;
  public void   setData(String data) throws DOMException;
  public int    getLength();
  public String substringData(int offset, int count) 
   throws DOMException;
  public void   appendData(String arg) 
   throws DOMException;
  public void   insertData(int offset, String arg) 
   throws DOMException;
  public void   deleteData(int offset, int count) 
   throws DOMException;
  public void   replaceData(int offset, int count, String arg) 
   throws DOMException;
  
}
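
The editing methods above can be sketched on a single Text node. This example assumes the JAXP DocumentBuilderFactory from the JDK to obtain an empty Document; the class name CharacterDataDemo and the method edit() are hypothetical. Offsets and counts are measured in Java chars.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Text;

public class CharacterDataDemo {

  // Applies each CharacterData editing method once and returns the result
  public static String edit(String initial) {

    Document doc;
    try {
      doc = DocumentBuilderFactory.newInstance()
       .newDocumentBuilder().newDocument();
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }

    Text text = doc.createTextNode(initial);
    text.appendData(" (1978)");        // add to the end
    text.insertData(0, "Song: ");      // add at the front
    text.deleteData(0, 6);             // remove "Song: " again
    text.replaceData(0, 3, "Cool");    // swap the first three chars
    return text.getData();

  }

  public static void main(String[] args) {
    String result = edit("Hot Cop");
    System.out.println(result);
    System.out.println(result.length());
  }

}
```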

ROT13 XML Text

import org.apache.xerces.parsers.DOMParser;
import org.apache.xml.serialize.*;
import org.w3c.dom.*;
import org.xml.sax.SAXException;
import java.io.IOException;


public class ROT13XML {

  public void processNode(Node node) {
    
    if (node instanceof CharacterData) {
      CharacterData text = (CharacterData) node;
      String data = text.getData();
      text.setData(rot13(data));
    }
    
  }

  // note use of recursion
  public void followNode(Node node) {
    
    processNode(node);
    if (node.hasChildNodes()) {
      NodeList children = node.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        followNode(children.item(i));
      } 
    }
    
  }
  
  public static String rot13(String s) {
    
    StringBuffer result = new StringBuffer(s.length());
    for (int i = 0; i < s.length(); i++) {
      int c = s.charAt(i);
      if (c >= 'A' && c <= 'M') result.append((char) (c+13));
      else if (c >= 'N' && c <= 'Z') result.append((char) (c-13));
      else if (c >= 'a' && c <= 'm') result.append((char) (c+13));
      else if (c >= 'n' && c <= 'z') result.append((char) (c-13));
      else result.append((char) c);
      
    } 
    return result.toString();
    
  }

  public static void main(String[] args) {
     
    DOMParser parser   = new DOMParser();
    ROT13XML  iterator = new ROT13XML();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document document = parser.getDocument();
        iterator.followNode(document);
        
        // now we serialize the document...
        OutputFormat format = new OutputFormat(document);
        XMLSerializer serializer 
         = new XMLSerializer(System.out, format);
        serializer.serialize(document);
               
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

}

ROT13 XML Output

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<?xml-stylesheet type="text/css" href="song.css"?>
<SONG xmlns="http://metalab.unc.edu/xml/namespace/song"
xmlns:xlink="http://www.w3.org/1999/xlink">   <TITLE>Ubg Pbc</TITLE>
<PHOTO ALT="Victor Willis in Cop Outfit" HEIGHT="200" WIDTH="100"
xlink:href="hotcop.jpg" xlink:show="onLoad" xlink:type="simple"/>
<COMPOSER>Wnpdhrf Zbenyv</COMPOSER>   <COMPOSER>Uraev Orybyb</COMPOSER>
<COMPOSER>Ivpgbe Jvyyvf</COMPOSER>   <PRODUCER>Wnpdhrf Zbenyv</PRODUCER>
<!-- Gur choyvfure vf npghnyyl Cbyltenz ohg V arrqrq         na rknzcyr
bs n trareny ragvgl ersrerapr. -->   <PUBLISHER
xlink:href="http://www.amrecords.com/" xlink:type="simple">     N &amp;
Z Erpbeqf   </PUBLISHER>   <LENGTH>6:20</LENGTH>   <YEAR>1978</YEAR>
<ARTIST>Ivyyntr Crbcyr</ARTIST> </SONG>
<!-- Lbh pna gryy jung nyohz V jnf 
     yvfgravat gb jura V jebgr guvf rknzcyr -->

Text Nodes

Represents the text content of an element or attribute
Contains only pure text, no markup
Parsers will return a single maximal text node for each contiguous run of pure text
Editing may change this: splitText() and inserted text nodes leave adjacent text nodes until normalize() merges them

The Text Interface

package org.w3c.dom;

public interface Text extends CharacterData {

  public Text splitText(int offset) throws DOMException;
  
}
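
A short sketch of how editing can produce adjacent text nodes and how normalize() merges them again. It assumes the JAXP DocumentBuilderFactory from the JDK; SplitTextDemo and childCounts() are hypothetical names.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;

public class SplitTextDemo {

  // Returns the number of children of a TITLE element
  // before a split, after the split, and after normalize()
  public static int[] childCounts() {

    Document doc;
    try {
      doc = DocumentBuilderFactory.newInstance()
       .newDocumentBuilder().newDocument();
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }

    Element title = doc.createElement("TITLE");
    Text text = doc.createTextNode("Hot Cop");
    title.appendChild(text);
    int before = title.getChildNodes().getLength();

    // text keeps "Hot"; the returned node holds " Cop"
    // and is inserted as the next sibling of text
    Text tail = text.splitText(3);
    int afterSplit = title.getChildNodes().getLength();

    // normalize() merges adjacent text nodes back together
    title.normalize();
    int afterNormalize = title.getChildNodes().getLength();

    return new int[] {before, afterSplit, afterNormalize};

  }

  public static void main(String[] args) {
    int[] counts = childCounts();
    System.out.println(counts[0] + " " + counts[1] + " " + counts[2]);
  }

}
```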

CDATA section Nodes

Represents a CDATA section like this example from a hypothetical SVG tutorial:

<p>You can use a default <code>xmlns</code> attribute to avoid 
having to add the svg prefix to all your elements:</p>
     <![CDATA[
       <svg xmlns="http://www.w3.org/2000/svg" 
            width="12cm" height="10cm">
         <ellipse rx="110" ry="130" />
         <rect x="4cm" y="1cm" width="3cm" height="6cm" />
       </svg>
     ]]>

No children

The CDATASection Interface

package org.w3c.dom;

public interface CDATASection extends Text {
}
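
Because CDATASection extends Text, code that handles text nodes handles CDATA sections too. The sketch below assumes the JDK's default JAXP parser with coalescing turned off, in which case a CDATA section is reported as its own node; other parsers or settings may merge it into the surrounding text. CDATASectionDemo and its sample document are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.CDATASection;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

public class CDATASectionDemo {

  // Hypothetical sample: ordinary text followed by a CDATA section
  private static final String SAMPLE
   = "<p>Example: <![CDATA[<svg width='12cm'/>]]></p>";

  // Parses SAMPLE and returns the last child of the root element
  public static Node lastChildOfSample() {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setCoalescing(false); // keep CDATA sections as separate nodes
      Document doc = factory.newDocumentBuilder()
       .parse(new InputSource(new StringReader(SAMPLE)));
      return doc.getDocumentElement().getLastChild();
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Node node = lastChildOfSample();
    System.out.println(node.getNodeType() == Node.CDATA_SECTION_NODE);
    System.out.println(node instanceof Text);
    if (node instanceof CDATASection) {
      // The markup characters arrive unescaped, as plain character data
      System.out.println(((CDATASection) node).getData());
    }
    System.out.println(node.hasChildNodes());
  }

}
```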

DocumentType Nodes

Represents a document type declaration
Has no children

The DocumentType Interface

package org.w3c.dom;

public interface DocumentType extends Node {

  public String       getName();
  public NamedNodeMap getEntities();
  public NamedNodeMap getNotations();
  public String       getPublicId();
  public String       getSystemId();
  public String       getInternalSubset();
  
}
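
The getters above can be tried without touching the network by building a DocumentType in memory. This sketch assumes the JAXP DocumentBuilder from the JDK and the standard three-argument DOM Level 2 createDocumentType() method; DoctypeDemo and newXHTMLDocument() are hypothetical names.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

public class DoctypeDemo {

  public static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

  // Builds an empty XHTML document that carries the strict DOCTYPE
  public static Document newXHTMLDocument() {
    try {
      DOMImplementation impl = DocumentBuilderFactory.newInstance()
       .newDocumentBuilder().getDOMImplementation();
      DocumentType doctype = impl.createDocumentType(
       "html",
       "-//W3C//DTD XHTML 1.0 Strict//EN",
       "DTD/xhtml1-strict.dtd");
      return impl.createDocument(XHTML_NS, "html", doctype);
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    DocumentType doctype = newXHTMLDocument().getDoctype();
    System.out.println(doctype.getName());
    System.out.println(doctype.getPublicId());
    System.out.println(doctype.getSystemId());
  }

}
```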

Example of the DocumentType Interface

Verify that a document is correct XHTML
From the XHTML 1.0 spec:
1. It must validate against one of the three DTDs found in Appendix A.
2. The root element of the document must be <html>.
3. The root element of the document must designate the XHTML namespace using the xmlns attribute [XMLNAMES]. The namespace for XHTML is defined to be http://www.w3.org/1999/xhtml.
4. There must be a DOCTYPE declaration in the document prior to the root element. The public identifier included in the DOCTYPE declaration must reference one of the three DTDs found in Appendix A using the respective Formal Public Identifier. The system identifier may be changed to reflect local system conventions.
```
<!DOCTYPE html 
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "DTD/xhtml1-strict.dtd">

<!DOCTYPE html 
     PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "DTD/xhtml1-transitional.dtd">

<!DOCTYPE html 
     PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
     "DTD/xhtml1-frameset.dtd">
```

XHTMLValidator

import org.w3c.dom.*;
import org.apache.xerces.parsers.*; 
import java.io.*;
import org.xml.sax.*;


public class XHTMLValidator {

  public static void main(String[] args) {
    
    for (int i = 0; i < args.length; i++) {
      validate(args[i]);
    }   
    
  }

  private static DOMParser parser = new DOMParser();
  
  static {
    
    // turn on validation
    try {
      parser.setFeature(
       "http://xml.org/sax/features/validation", true);
      parser.setErrorHandler(new ValidityErrorReporter());
    }
    catch (SAXNotRecognizedException e) {
      System.err.println(
         "Installed XML parser cannot validate; "
       + "checking for well-formedness instead...");
    } 
    catch (SAXNotSupportedException e) {
      System.err.println(
       "Cannot turn on validation here; "
       + " checking for well-formedness instead...");
    }     
    
  }
  
  // not thread safe
  public static void validate(String source) {
        
    try {

      try {
        parser.parse(source); 
        // ValidityErrorReporter prints any validity errors detected
      }
      catch (SAXException e) {  
        System.out.println(source + " is not well formed."); 
        return; 
      }
      
      // If we get this far, then the document is well-formed XML.
      // Check to see whether the document is actually XHTML    
      Document document = parser.getDocument();
    
      DocumentType doctype = document.getDoctype();
    
      if (doctype == null) {
        System.out.println("No DOCTYPE"); 
        return;
      }

      String name     = doctype.getName();
      String systemID = doctype.getSystemId();
      String publicID = doctype.getPublicId();
      
      if (!name.equals("html")) {
        System.out.println("Incorrect root element name " + name); 
      }
    
      if (publicID == null
       || (!publicID.equals("-//W3C//DTD XHTML 1.0 Strict//EN")
           && !publicID.equals(
                "-//W3C//DTD XHTML 1.0 Transitional//EN")
           && !publicID.equals(
                "-//W3C//DTD XHTML 1.0 Frameset//EN"))) {
        System.out.println(source 
         + " does not seem to use an XHTML 1.0 DTD");
      }
    
      // Check the namespace on the root element
      Element root = document.getDocumentElement();
      String xmlnsValue = root.getAttribute("xmlns");
      if (!xmlnsValue.equals("http://www.w3.org/1999/xhtml")) {
        System.out.println(source 
         + " does not properly declare the"
         + " http://www.w3.org/1999/xhtml"
         + " namespace on the root element");        
      }
    
      // get ready for the next parse
      parser.reset();
      
    }
    catch (IOException e) {
      System.err.println("Could not read " + source);
    }
    catch (Exception e) {
      System.err.println(e);
      e.printStackTrace();
    }
    
  }

}

EntityReference Nodes

Represents an entity reference like &amp; or &signature;
Optional: some parsers (including Xerces) just expand entities
Contains:
- Element nodes
- ProcessingInstruction nodes
- Comment nodes
- Text nodes
- CDATASection nodes
- EntityReference nodes

The EntityReference Interface

package org.w3c.dom;

public interface EntityReference extends Node {

}

Attr Nodes

Represents an attribute
Contains:
- Text nodes
- EntityReference nodes

The Attr Interface

package org.w3c.dom;

public interface Attr extends Node {

  public String   getName();
  public boolean  getSpecified();
  public String   getValue();
  public void     setValue(String value) throws DOMException;
  public Element  getOwnerElement();
  
}
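
Attr objects are normally reached through an element's getAttributes() map rather than by walking child nodes. A minimal sketch, assuming the JAXP DocumentBuilderFactory from the JDK; AttrDemo, samplePhoto(), and describeAttributes() are hypothetical names. The results are sorted because NamedNodeMap makes no promise about order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;

public class AttrDemo {

  // Builds a PHOTO element with two attributes (illustrative data)
  public static Element samplePhoto() {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
       .newDocumentBuilder().newDocument();
      Element photo = doc.createElement("PHOTO");
      photo.setAttribute("WIDTH", "100");
      photo.setAttribute("HEIGHT", "200");
      return photo;
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  // Lists every attribute as name=value, sorted by name
  public static List<String> describeAttributes(Element element) {
    NamedNodeMap map = element.getAttributes();
    List<String> result = new ArrayList<String>();
    for (int i = 0; i < map.getLength(); i++) {
      Attr attribute = (Attr) map.item(i);
      result.add(attribute.getName() + "=" + attribute.getValue());
    }
    Collections.sort(result);
    return result;
  }

  public static void main(String[] args) {
    Element photo = samplePhoto();
    System.out.println(describeAttributes(photo));
    Attr width = photo.getAttributeNode("WIDTH");
    System.out.println(width.getOwnerElement() == photo);
    System.out.println(width.getSpecified());
    // Attributes are not children of their element
    System.out.println(photo.getChildNodes().getLength());
  }

}
```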

XLinkSpider with DOM

import org.xml.sax.*;
import org.apache.xerces.parsers.*;
import java.io.*;
import java.util.*;
import org.w3c.dom.*;


public class DOMSpider {

  private static DOMParser parser = new DOMParser();
  
  // namespace support is turned off by default in Xerces
  static {
    try {
      parser.setFeature(
       "http://xml.org/sax/features/namespaces", true); 
    }
    catch (Exception e) {
      System.err.println(e);
    }
  }
  
  private static Vector visited = new Vector();
  
  private static int maxDepth = 5;
  private static int currentDepth = 0; 
  
  public static void listURIs(String systemId) {
    
    currentDepth++;
    try {
      if (currentDepth < maxDepth) {
        parser.parse(systemId);
     
        Document document = parser.getDocument();   
    
        Vector uris = new Vector();
        // search the document for uris, 
        // store them in vector, and print them
        searchForURIs(document.getDocumentElement(), uris);
    
    
        Enumeration e = uris.elements();
        while (e.hasMoreElements()) {
          String uri = (String) e.nextElement();
          visited.addElement(uri);
          listURIs(uri); 
        }
      
      }
    
    }
    catch (SAXException e) {
      // couldn't load the document, 
      // probably not well-formed XML, skip it 
    }
    catch (IOException e) {
      // couldn't load the document, 
      // likely network failure, skip it 
    }
    finally { 
      currentDepth--;
      System.out.flush();     
    }
      
  }
  
  // use recursion 
  public static void searchForURIs(Element element, Vector uris) {
    
    // look for XLinks in this element
    String uri = element.getAttributeNS("http://www.w3.org/1999/xlink", "href");

    if (uri != null && !uri.equals("") 
         && !visited.contains(uri) 
         && !uris.contains(uri)) {
      System.out.println(uri);
      uris.addElement(uri);
    }
    
    // process child elements recursively
    NodeList children = element.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node n = children.item(i);
      if (n instanceof Element) {
        searchForURIs((Element) n, uris);
      } 
    }
    
  }
  

  public static void main(String[] args) {
    
    if (args.length == 0) {
      System.out.println("Usage: java DOMSpider URL1 URL2...");
      return;
    } 
      
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      try {
        listURIs(args[i]);
      }
      catch (Exception e) {
        System.err.println(e);
        e.printStackTrace(); 
      }
      
    } // end for
  
  } // end main

} // end DOMSpider

ProcessingInstruction Nodes

Represents a processing instruction like
<?robots index="yes" follow="no"?>
No children

The ProcessingInstruction Interface

package org.w3c.dom;

public interface ProcessingInstruction extends Node {

  public String  getTarget();
  public String  getData();
  public void    setData(String data) throws DOMException;
  
}
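
DOM hands back the data of a processing instruction as one opaque string; any pseudo-attributes inside it have to be picked apart by the application. The sketch below is one possible approach using a regular expression. PseudoAttributes and parse() are hypothetical names, not part of DOM, and the pattern handles only simple name="value" or name='value' pairs.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ProcessingInstruction;

public class PseudoAttributes {

  // name = "value" or name = 'value'
  private static final Pattern PAIR = Pattern.compile(
   "([A-Za-z_][\\w.-]*)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

  // Splits processing instruction data into name/value pairs
  public static Map<String, String> parse(String data) {
    Map<String, String> result = new LinkedHashMap<String, String>();
    Matcher matcher = PAIR.matcher(data);
    while (matcher.find()) {
      String value = matcher.group(2) != null
       ? matcher.group(2) : matcher.group(3);
      result.put(matcher.group(1), value);
    }
    return result;
  }

  public static void main(String[] args) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
     .newDocumentBuilder().newDocument();
    ProcessingInstruction pi = doc.createProcessingInstruction(
     "robots", "index=\"yes\" follow=\"no\"");
    System.out.println(pi.getTarget());
    Map<String, String> attributes = parse(pi.getData());
    System.out.println(attributes.get("index"));
    System.out.println(attributes.get("follow"));
  }

}
```

Compared with a plain indexOf() test on the data string, this tolerates single quotes and white space around the equals sign.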

XLinkSpider that Respects robots processing instruction

import org.xml.sax.*;
import org.apache.xerces.parsers.*;
import java.io.*;
import java.util.*;
import org.w3c.dom.*;


public class PoliteDOMSpider {

  private static DOMParser parser = new DOMParser();
  
  // namespace support is turned off by default in Xerces
  static {
    try {
      parser.setFeature("http://xml.org/sax/features/namespaces", 
       true); 
    }
    catch (Exception e) {
      System.err.println(e);
    }
  }
  
  private static Vector visited = new Vector();
  
  private static int maxDepth = 5;
  private static int currentDepth = 0; 
  
  public static void listURIs(String systemId) {
    
    currentDepth++;
    try {
      if (currentDepth < maxDepth) {
        parser.parse(systemId);
     
        Document document = parser.getDocument();   
    
        if (robotsAllowed(document)) {
          Vector uris = new Vector();
          // search the document for uris,
          // store them in vector, print them
          searchForURIs(document.getDocumentElement(), uris);
    
          Enumeration e = uris.elements();
          while (e.hasMoreElements()) {
            String uri = (String) e.nextElement();
            visited.addElement(uri);
            listURIs(uri); 
          }
          
        }
      
      }
    
    }
    catch (SAXException e) {
      // couldn't load the document, 
      // probably not well-formed XML, skip it 
    }
    catch (IOException e) {
      // couldn't load the document, 
      // likely network failure, skip it 
    }
    finally { 
      currentDepth--;
      System.out.flush();     
    }
      
  }
  
  public static boolean robotsAllowed(Document document) {
    
    NodeList children = document.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node n = children.item(i);
      if (n instanceof ProcessingInstruction) {
        ProcessingInstruction pi = (ProcessingInstruction) n;
        if (pi.getTarget().equals("robots")) {
          String data = pi.getData();
          if (data.indexOf("follow=\"no\"") >= 0) {
            return false; 
          } 
        }
      }
    }
    
    return true;
    
  }
  
  
  // use recursion 
  public static void searchForURIs(Element element, Vector uris) {
    
    // look for XLinks in this element
    String uri = element.getAttributeNS("http://www.w3.org/1999/xlink", "href");

    if (uri != null && !uri.equals("") 
         && !visited.contains(uri) 
         && !uris.contains(uri)) {
      System.out.println(uri);
      uris.addElement(uri);
    }
    
    // process child elements recursively
    NodeList children = element.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node n = children.item(i);
      if (n instanceof Element) {
        searchForURIs((Element) n, uris);
      } 
    }
    
  }
  

  public static void main(String[] args) {
    
    if (args.length == 0) {
      System.out.println("Usage: java PoliteDOMSpider URL1 URL2...");
      return;
    } 
      
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      try {
        listURIs(args[i]);
      }
      catch (Exception e) {
        System.err.println(e);
        e.printStackTrace(); 
      }
      
    } // end for
  
  } // end main

} // end PoliteDOMSpider

Comment Nodes

Represents a comment like this example from the XML 1.0 spec:

<!--* N.B. some readers (notably JC) find the following
paragraph awkward and redundant.  I agree it's logically redundant:
it *says* it is summarizing the logical implications of
matching the grammar, and that means by definition it's
logically redundant.  I don't think it's rhetorically
redundant or unnecessary, though, so I'm keeping it.  It
could however use some recasting when the editors are feeling
stronger. -MSM *-->

No children

The Comment Interface

package org.w3c.dom;

public interface Comment extends CharacterData {
}

Comment Example

import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;


public class DOMCommentReader {

  public static void main(String[] args) {
     
    DOMParser parser = new DOMParser();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document d = parser.getDocument();
        processNode(d);
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

  // note use of recursion
  public static void processNode(Node node) {
    
    int type = node.getNodeType();
    if (type == Node.COMMENT_NODE) {
      System.out.println(node.getNodeValue());
      System.out.println();
    }
    else {
      if (node.hasChildNodes()) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
          processNode(children.item(i));
        } 
      }
    }
    
  }

}

DOMCommentReader Output

% java DOMCommentReader hotcop.xml
 The publisher is actually Polygram but I needed
       an example of a general entity reference.

 You can tell what album I was
     listening to when I wrote this example

Or try http://www.w3.org/TR/1998/REC-xml-19980210.xml for more interesting output

Entity Nodes

Represents an actual entity, not an entity reference!
Contains:
- Element nodes
- ProcessingInstruction nodes
- Comment nodes
- Text nodes
- CDATASection nodes
- EntityReference nodes

The Entity Interface

package org.w3c.dom;

public interface Entity extends Node {

  public String  getPublicId();
  public String  getSystemId();
  public String  getNotationName();
  
}

DOMException

A runtime exception but you should catch it
Error code gives more detailed information:

DOMException.INDEX_SIZE_ERR
Index or size is negative, or greater than the allowed value

DOMException.DOMSTRING_SIZE_ERR
The specified range of text does not fit into a String

DOMException.HIERARCHY_REQUEST_ERR
Attempt to insert a node somewhere it doesn't belong

DOMException.WRONG_DOCUMENT_ERR
A node is used in a different document than the one that created it, and that document doesn't support it

DOMException.INVALID_CHARACTER_ERR
An invalid or illegal character is specified, such as in a name.

DOMException.NO_DATA_ALLOWED_ERR
Attempt to add data to a node which does not support data

DOMException.NO_MODIFICATION_ALLOWED_ERR
Attempt to modify a read-only object

DOMException.NOT_FOUND_ERR
Attempt to reference a node in a context where it does not exist

DOMException.NOT_SUPPORTED_ERR
The implementation does not support the type of object requested

DOMException.INUSE_ATTRIBUTE_ERR
Attempt to add an attribute to an element that already has that attribute

DOMException.INVALID_STATE_ERR
An attempt is made to use an object that is not, or no longer, usable.

DOMException.SYNTAX_ERR
An invalid or illegal string is specified.

DOMException.INVALID_MODIFICATION_ERR
An attempt to modify the type of the underlying object.

DOMException.NAMESPACE_ERR
An attempt is made to create or change an object in a way which is incorrect with regard to namespaces.

DOMException.INVALID_ACCESS_ERR
A parameter or an operation is not supported by the underlying object.

The error code of a given exception is available from its public code field
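
A minimal sketch of catching a DOMException and reading its code field. It assumes the DOM implementation bundled with the JDK, which checks names and offsets by default; DOMExceptionDemo and its two methods are hypothetical names. Each method returns 0 if no exception is thrown.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Text;

public class DOMExceptionDemo {

  private static Document newDocument() {
    try {
      return DocumentBuilderFactory.newInstance()
       .newDocumentBuilder().newDocument();
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  // An element name may not contain spaces
  public static short codeForBadName() {
    Document doc = newDocument();
    try {
      doc.createElement("not a name");
      return 0;
    }
    catch (DOMException e) {
      return e.code;
    }
  }

  // An offset past the end of the text is out of range
  public static short codeForBadOffset() {
    Text text = newDocument().createTextNode("Hot Cop");
    try {
      text.substringData(100, 1);
      return 0;
    }
    catch (DOMException e) {
      return e.code;
    }
  }

  public static void main(String[] args) {
    System.out.println(
     codeForBadName() == DOMException.INVALID_CHARACTER_ERR);
    System.out.println(
     codeForBadOffset() == DOMException.INDEX_SIZE_ERR);
  }

}
```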

The org.w3c.dom.traversal Package

Four interfaces:

DocumentTraversal
NodeFilter
NodeIterator
TreeWalker
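
TreeWalker presents a filtered view of the tree that can be navigated by parent, child, and sibling, where NodeIterator presents a flat list. The sketch below assumes the JDK's default JAXP parser, whose Document objects also implement the standard DocumentTraversal interface; that is not guaranteed for every DOM implementation. TreeWalkerDemo and childElementNames() are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.TreeWalker;
import org.xml.sax.InputSource;

public class TreeWalkerDemo {

  // Returns the names of the root element's child elements, in order
  public static List<String> childElementNames(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
       .newDocumentBuilder()
       .parse(new InputSource(new StringReader(xml)));

      // Assumption: this Document implements DocumentTraversal
      DocumentTraversal traversal = (DocumentTraversal) doc;
      TreeWalker walker = traversal.createTreeWalker(
       doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, null, true);

      List<String> names = new ArrayList<String>();
      // firstChild() and nextSibling() move the walker's current node
      // and return null when there is no visible node to move to
      for (Node n = walker.firstChild(); n != null;
       n = walker.nextSibling()) {
        names.add(n.getNodeName());
      }
      return names;
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String xml = "<SONG>\n  <TITLE>Hot Cop</TITLE>\n"
     + "  <YEAR>1978</YEAR>\n</SONG>";
    System.out.println(childElementNames(xml));
  }

}
```

Because the walker shows only elements, the white space text nodes between TITLE and YEAR never appear as siblings.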

NodeIterator

package org.w3c.dom.traversal;

public interface NodeIterator {

  public Node       getRoot();
  public int        getWhatToShow();
  public NodeFilter getFilter();
  public boolean    getExpandEntityReferences();
  public Node       nextNode() throws DOMException;
  public Node       previousNode() throws DOMException;
  public void       detach();
    
}

ValueReporter

import org.apache.xerces.parsers.*;
import org.apache.xerces.dom.*;
import org.w3c.dom.*;
import org.w3c.dom.traversal.*;
import org.xml.sax.*;
import java.io.*;


public class ValueReporter {

  public static void main(String[] args) {
     
    DOMParser parser = new DOMParser();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document doc = parser.getDocument();
        DocumentImpl impl = (DocumentImpl) doc;
        NodeIterator iterator = impl.createNodeIterator(
         doc.getDocumentElement(), NodeFilter.SHOW_ALL, null, true
        );
        Node node;
        while ((node = iterator.nextNode()) != null) {
          processNode(node);      
        }
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

  public static void processNode(Node node) {
    
    String name = node.getNodeName();
    String type = getTypeName(node.getNodeType());
    String value = node.getNodeValue();
    System.out.println("Type " + type + ": " + name 
     + " \"" + value + "\"");
    
  }
  
  public static String getTypeName(int type) {
    
    switch (type) {
      case Node.ELEMENT_NODE: 
        return "Element";
      case Node.ATTRIBUTE_NODE: 
        return "Attribute";
      case Node.TEXT_NODE: 
        return "Text";
      case Node.CDATA_SECTION_NODE: 
        return "CDATA Section";
      case Node.ENTITY_REFERENCE_NODE: 
        return "Entity Reference";
      case Node.ENTITY_NODE: 
        return "Entity";
      case Node.PROCESSING_INSTRUCTION_NODE: 
        return "Processing Instruction";
      case Node.COMMENT_NODE: 
        return "Comment";
      case Node.DOCUMENT_NODE: 
        return "Document";
      case Node.DOCUMENT_TYPE_NODE: 
        return "Document Type Declaration";
      case Node.DOCUMENT_FRAGMENT_NODE: 
        return "Document Fragment";
      case Node.NOTATION_NODE: 
        return "Notation";
      default: 
        return "Unknown Type"; 
    }
    
  }

}

ValueReporter Output

% java ValueReporter hotcop.xml
Type Element: SONG "null"
Type Text: #text "
  "
Type Element: TITLE "null"
Type Text: #text "Hot Cop"
Type Text: #text "
  "
Type Element: PHOTO "null"
Type Text: #text "
  "
Type Element: COMPOSER "null"
Type Text: #text "Jacques Morali"
Type Text: #text "
  "
Type Element: COMPOSER "null"
Type Text: #text "Henri Belolo"
Type Text: #text "
  "
Type Element: COMPOSER "null"
Type Text: #text "Victor Willis"
Type Text: #text "
  "
Type Element: PRODUCER "null"
Type Text: #text "Jacques Morali"
Type Text: #text "
  "
Type Comment: #comment " The publisher is actually Polygram but I needed
       an example of a general entity reference. "
Type Text: #text "
  "
Type Element: PUBLISHER "null"
Type Text: #text "
    A & M Records
  "
Type Text: #text "
  "
Type Element: LENGTH "null"
Type Text: #text "6:20"
Type Text: #text "
  "
Type Element: YEAR "null"
Type Text: #text "1978"
Type Text: #text "
  "
Type Element: ARTIST "null"
Type Text: #text "Village People"
Type Text: #text "
"

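Attributes are missing from this output. Attr objects are nodes, but they are not children of the elements that carry them, so an iterator over the tree never visits them. Reach them through getAttributes() or the Element attribute methods instead.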

NodeFilter

package org.w3c.dom.traversal;

public interface NodeFilter {

  // Constants returned by acceptNode
  public static final short FILTER_ACCEPT             = 1;
  public static final short FILTER_REJECT             = 2;
  public static final short FILTER_SKIP               = 3;

  // Constants for whatToShow
  public static final int   SHOW_ALL                  = 0xFFFFFFFF;
  public static final int   SHOW_ELEMENT              = 0x00000001;
  public static final int   SHOW_ATTRIBUTE            = 0x00000002;
  public static final int   SHOW_TEXT                 = 0x00000004;
  public static final int   SHOW_CDATA_SECTION        = 0x00000008;
  public static final int   SHOW_ENTITY_REFERENCE     = 0x00000010;
  public static final int   SHOW_ENTITY               = 0x00000020;
  public static final int   SHOW_PROCESSING_INSTRUCTION = 0x00000040;
  public static final int   SHOW_COMMENT              = 0x00000080;
  public static final int   SHOW_DOCUMENT             = 0x00000100;
  public static final int   SHOW_DOCUMENT_TYPE        = 0x00000200;
  public static final int   SHOW_DOCUMENT_FRAGMENT    = 0x00000400;
  public static final int   SHOW_NOTATION             = 0x00000800;

  public short        acceptNode(Node n);
    
}
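
The whatToShow constants select nodes by type only; anything finer needs an acceptNode() implementation. The sketch below filters out white-space-only text nodes. It assumes the JDK's default JAXP parser, whose Document objects implement DocumentTraversal; NodeFilterDemo, NonBlankTextFilter, and nonBlankText() are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

public class NodeFilterDemo {

  // Accepts only nodes whose value contains something besides white space
  static class NonBlankTextFilter implements NodeFilter {
    public short acceptNode(Node n) {
      String value = n.getNodeValue();
      if (value != null && value.trim().length() > 0) {
        return FILTER_ACCEPT;
      }
      return FILTER_SKIP;
    }
  }

  // Returns the trimmed content of every non-blank text node
  public static List<String> nonBlankText(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
       .newDocumentBuilder()
       .parse(new InputSource(new StringReader(xml)));

      // Assumption: this Document implements DocumentTraversal
      DocumentTraversal traversal = (DocumentTraversal) doc;
      NodeIterator iterator = traversal.createNodeIterator(
       doc.getDocumentElement(), NodeFilter.SHOW_TEXT,
       new NonBlankTextFilter(), true);

      List<String> result = new ArrayList<String>();
      Node node;
      while ((node = iterator.nextNode()) != null) {
        result.add(node.getNodeValue().trim());
      }
      iterator.detach(); // release the iterator when finished
      return result;
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String xml = "<SONG>\n  <TITLE>Hot Cop</TITLE>\n"
     + "  <YEAR>1978</YEAR>\n</SONG>";
    System.out.println(nonBlankText(xml));
  }

}
```

The whatToShow mask is applied first, so acceptNode() here only ever sees text nodes.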

DOM based TagStripper

import org.apache.xerces.parsers.*;
import org.apache.xerces.dom.*;
import org.w3c.dom.*;
import org.w3c.dom.traversal.*;
import org.xml.sax.SAXException;
import java.io.IOException;


public class DOMTagStripper {

  public static void main(String[] args) {
     
    DOMParser parser = new DOMParser();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document doc = parser.getDocument();
        DocumentImpl impl = (DocumentImpl) doc;
        NodeIterator iterator = impl.createNodeIterator(
         doc.getDocumentElement(), NodeFilter.SHOW_TEXT, null, true
        );
        Node node;
        while ((node = iterator.nextNode()) != null) {
          System.out.print(node.getNodeValue());      
        }
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

}

Output from a DOM based TagStripper

% java DOMTagStripper hotcop.xml

  Hot Cop
  Jacques Morali
  Henri Belolo
  Victor Willis
  Jacques Morali

  A & M Records
  6:20
  1978
  Village People

Writing XML Documents with DOM

DOM is for both input and output
New documents are created with a parser-specific API
A serializer + output format converts the DOM to a byte stream

org.apache.xerces.dom.DOMImplementationImpl

A Xerces-specific class used to create new DOM documents

package org.apache.xerces.dom;

public class DOMImplementationImpl implements DOMImplementation {

  public boolean hasFeature(String feature, String version) 
  
  public static DOMImplementation getDOMImplementation()
  
  public DocumentType createDocumentType(String qualifiedName, 
   String publicID, String systemID, String internalSubset)
                                          
  public Document createDocument(String namespaceURI, 
   String qualifiedName, DocumentType doctype)
   throws DOMException

}
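
The Xerces class above is one way to get a DOMImplementation. A parser-neutral alternative, sketched here as an assumption rather than as the approach these slides use, is to ask a JAXP DocumentBuilder for one. StandardDocumentCreation and newDocument() are hypothetical names.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class StandardDocumentCreation {

  // Creates a document whose root element has the given name
  public static Document newDocument(String rootName) {
    try {
      DocumentBuilder builder
       = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      // JAXP hides which parser supplies the implementation
      DOMImplementation impl = builder.getDOMImplementation();
      return impl.createDocument(null, rootName, null);
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Document doc = newDocument("Fibonacci_Numbers");
    Element root = doc.getDocumentElement();
    Element number = doc.createElement("fibonacci");
    number.setAttribute("index", "0");
    number.appendChild(doc.createTextNode("0"));
    root.appendChild(number);
    System.out.println(root.getTagName());
    System.out.println(root.getChildNodes().getLength());
  }

}
```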

A DOM program that writes Fibonacci numbers into an XML document

import java.math.*;
import java.io.*;
import org.w3c.dom.*;
import org.apache.xerces.dom.*;
import org.apache.xml.serialize.*;


public class FibonacciDOM {

  public static void main(String[] args) {

    try {

      DOMImplementation impl 
       = DOMImplementationImpl.getDOMImplementation();

      Document fibonacci 
       = impl.createDocument(null, "Fibonacci_Numbers", null);

      BigInteger low  = BigInteger.ZERO;
      BigInteger high = BigInteger.ONE;

      Element root = fibonacci.getDocumentElement();

      for (int i = 0; i <= 25; i++) {
        Element number = fibonacci.createElement("fibonacci");
        number.setAttribute("index", Integer.toString(i));
        Text text = fibonacci.createTextNode(low.toString());
        number.appendChild(text);
        root.appendChild(number);

        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }

      // Now the document has been created and exists in memory
    }
    catch (DOMException e) {
      e.printStackTrace();
    }

  }

}

Serialization

The process of taking an in-memory DOM tree and converting it to a stream of characters that can be written onto an output stream
Not a standard part of the DOM
Xerces supplies one in the org.apache.xml.serialize package
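
Serialization was not part of DOM Level 2, the level these slides cover. The later DOM Level 3 Load and Save module did define a serializer interface, LSSerializer. The following is a minimal sketch, assuming a DOM implementation that advertises the "LS" feature (the one bundled with current JDKs is assumed to); the class name LSSerializerDemo is invented for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;


public class LSSerializerDemo {

  // Builds a tiny Fibonacci document and serializes it to a string
  public static String serialize() {

    try {
      Document doc = DocumentBuilderFactory.newInstance()
       .newDocumentBuilder().newDocument();
      Element root = doc.createElement("Fibonacci_Numbers");
      doc.appendChild(root);

      Element number = doc.createElement("fibonacci");
      number.setAttribute("index", "0");
      number.appendChild(doc.createTextNode("0"));
      root.appendChild(number);

      // Ask the implementation for its Load and Save support
      Object feature = doc.getImplementation().getFeature("LS", "3.0");
      if (feature == null) {
        throw new IllegalStateException("No DOM Level 3 LS support");
      }
      DOMImplementationLS ls = (DOMImplementationLS) feature;
      LSSerializer serializer = ls.createLSSerializer();
      return serializer.writeToString(doc);
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }

  }

  public static void main(String[] args) {
    System.out.println(serialize());
  }

}
```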

A DOM program that writes Fibonacci numbers onto System.out

import java.math.*;
import java.io.*;
import org.w3c.dom.*;
import org.apache.xerces.dom.*;
import org.apache.xml.serialize.*; 


public class FibonacciDOMSerializer {

  public static void main(String[] args) {
   
    try {
      
      DOMImplementation impl 
       = DOMImplementationImpl.getDOMImplementation();

      Document fibonacci 
       = impl.createDocument(null, "Fibonacci_Numbers", null);
      
      BigInteger low  = BigInteger.ZERO;
      BigInteger high = BigInteger.ONE;      
      
      Element root = fibonacci.getDocumentElement(); 

      for (int i = 0; i <= 25; i++) {
        Element number = fibonacci.createElement("fibonacci");
        number.setAttribute("index", Integer.toString(i));
        Text text = fibonacci.createTextNode(low.toString());
        number.appendChild(text);
        root.appendChild(number);
        
        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }
      
      try {
        // Now that the document is created we need to *serialize* it
        OutputFormat format = new OutputFormat(fibonacci);
        XMLSerializer serializer 
         = new XMLSerializer(System.out, format);
        serializer.serialize(fibonacci);
      }
      catch (IOException e) {
        System.err.println(e); 
      }
    }
    catch (DOMException e) {
      e.printStackTrace();
    }

  }

}

fibonacci.xml

<?xml version="1.0" encoding="UTF-8"?>
<Fibonacci_Numbers><fibonacci index="0">0</fibonacci><fibonacci index="1">1</fibonacci><fibonacci index="2">1</fibonacci><fibonacci index="3">2</fibonacci><fibonacci index="4">3</fibonacci><fibonacci index="5">5</fibonacci><fibonacci index="6">8</fibonacci><fibonacci index="7">13</fibonacci><fibonacci index="8">21</fibonacci><fibonacci index="9">34</fibonacci><fibonacci index="10">55</fibonacci><fibonacci index="11">89</fibonacci><fibonacci index="12">144</fibonacci><fibonacci index="13">233</fibonacci><fibonacci index="14">377</fibonacci><fibonacci index="15">610</fibonacci><fibonacci index="16">987</fibonacci><fibonacci index="17">1597</fibonacci><fibonacci index="18">2584</fibonacci><fibonacci index="19">4181</fibonacci><fibonacci index="20">6765</fibonacci><fibonacci index="21">10946</fibonacci><fibonacci index="22">17711</fibonacci><fibonacci index="23">28657</fibonacci><fibonacci index="24">46368</fibonacci><fibonacci index="25">75025</fibonacci></Fibonacci_Numbers>

OutputFormat

package org.apache.xml.serialize;

public class OutputFormat extends Object {

  public OutputFormat()
  public OutputFormat(String method, 
   String encoding, boolean indenting)
  public OutputFormat(Document doc)
  public OutputFormat(Document doc, 
   String encoding, boolean indenting)
  
  public String   getMethod()
  public void     setMethod(String method)
  public String   getVersion()
  public void     setVersion(String version)
  public int      getIndent()
  public boolean  getIndenting()
  public void     setIndent(int indent)
  public void     setIndenting(boolean on)
  public String   getEncoding()
  public void     setEncoding(String encoding)
  public String   getMediaType()
  public void     setMediaType(String mediaType)
  public void     setDoctype(String publicID, String systemID)
  public String   getDoctypePublic()
  public String   getDoctypeSystem()
  public boolean  getOmitXMLDeclaration()
  public void     setOmitXMLDeclaration(boolean omit)
  public boolean  getStandalone()
  public void     setStandalone(boolean standalone)
  public String[] getCDataElements()
  public boolean  isCDataElement(String tagName)
  public void     setCDataElements(String[] cdataElements)
  public String[] getNonEscapingElements()
  public boolean  isNonEscapingElement(String tagName)
  public void     setNonEscapingElements(String[] nonEscapingElements)
  public String   getLineSeparator()
  public void     setLineSeparator(String lineSeparator)
  public boolean  getPreserveSpace()
  public void     setPreserveSpace(boolean preserve)
  public int      getLineWidth()
  public void     setLineWidth(int lineWidth)
  public char     getLastPrintable()
  
  public static String whichMethod(Document doc)
  public static String whichDoctypePublic(Document doc)
  public static String whichDoctypeSystem(Document doc)
  public static String whichMediaType(String method)
  
}

Better formatted output

Latin-1 encoding
Indentation
Word wrapping
Document type declaration

try {
  // Now that the document is created we need to *serialize* it
  // Use the IANA name ISO-8859-1; the Java alias 8859_1 is not a
  // legal XML encoding name because it begins with a digit
  OutputFormat format = new OutputFormat(fibonacci, "ISO-8859-1", true);
  format.setLineSeparator("\r\n");
  format.setLineWidth(72);
  format.setDoctype(null, "fibonacci.dtd");
  XMLSerializer serializer = new XMLSerializer(System.out, format);
  serializer.serialize(root);
}
catch (IOException e) {
  System.err.println(e); 
}

Question: Why won't this let us add an xml-stylesheet directive?

formatted_fibonacci.xml

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Fibonacci_Numbers SYSTEM "fibonacci.dtd">
<Fibonacci_Numbers>
    <fibonacci index="0">0</fibonacci>
    <fibonacci index="1">1</fibonacci>
    <fibonacci index="2">1</fibonacci>
    <fibonacci index="3">2</fibonacci>
    <fibonacci index="4">3</fibonacci>
    <fibonacci index="5">5</fibonacci>
    <fibonacci index="6">8</fibonacci>
    <fibonacci index="7">13</fibonacci>
    <fibonacci index="8">21</fibonacci>
    <fibonacci index="9">34</fibonacci>
    <fibonacci index="10">55</fibonacci>
    <fibonacci index="11">89</fibonacci>
    <fibonacci index="12">144</fibonacci>
    <fibonacci index="13">233</fibonacci>
    <fibonacci index="14">377</fibonacci>
    <fibonacci index="15">610</fibonacci>
    <fibonacci index="16">987</fibonacci>
    <fibonacci index="17">1597</fibonacci>
    <fibonacci index="18">2584</fibonacci>
    <fibonacci index="19">4181</fibonacci>
    <fibonacci index="20">6765</fibonacci>
    <fibonacci index="21">10946</fibonacci>
    <fibonacci index="22">17711</fibonacci>
    <fibonacci index="23">28657</fibonacci>
    <fibonacci index="24">46368</fibonacci>
    <fibonacci index="25">75025</fibonacci>
</Fibonacci_Numbers>
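
Three of the four formatting options (Latin-1 encoding, indentation, document type declaration) can also be requested through the standard TrAX output properties. This is a rough sketch assuming a JAXP 1.1 or later Transformer; the class name FormattedFibonacciJAXP is invented, only three numbers are written to keep it short, and Latin-1 is requested under its IANA name ISO-8859-1. Word wrapping and the line separator are left out because OutputKeys defines no property for them, and the exact indentation is up to the Transformer implementation.

```java
import java.io.ByteArrayOutputStream;
import java.math.BigInteger;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;


public class FormattedFibonacciJAXP {

  // Serializes the first three Fibonacci numbers with formatting options
  public static String format() {

    try {
      Document fibonacci = DocumentBuilderFactory.newInstance()
       .newDocumentBuilder().newDocument();
      Element root = fibonacci.createElement("Fibonacci_Numbers");
      fibonacci.appendChild(root);

      BigInteger low  = BigInteger.ZERO;
      BigInteger high = BigInteger.ONE;
      for (int i = 0; i <= 2; i++) {
        Element number = fibonacci.createElement("fibonacci");
        number.setAttribute("index", Integer.toString(i));
        number.appendChild(fibonacci.createTextNode(low.toString()));
        root.appendChild(number);

        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }

      Transformer serializer
       = TransformerFactory.newInstance().newTransformer();
      serializer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
      serializer.setOutputProperty(OutputKeys.INDENT, "yes");
      serializer.setOutputProperty(
       OutputKeys.DOCTYPE_SYSTEM, "fibonacci.dtd");

      // Write bytes, then decode them with the same encoding
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      serializer.transform(
       new DOMSource(fibonacci), new StreamResult(out));
      return out.toString("ISO-8859-1");
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }

  }

  public static void main(String[] args) {
    System.out.println(format());
  }

}
```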

DOM based XMLPrettyPrinter

import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;
import org.apache.xerces.dom.*;
import org.apache.xml.serialize.*; 


public class DOMPrettyPrinter {

  public static void main(String[] args) { 
     
    DOMParser parser = new DOMParser();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document document = parser.getDocument();
        
        OutputFormat format 
         = new OutputFormat(document, "UTF-8", true);
        format.setLineSeparator("\r\n");
        format.setIndenting(true);
        format.setIndent(2);
        format.setLineWidth(72);
        format.setPreserveSpace(false);
        XMLSerializer serializer 
         = new XMLSerializer(System.out, format);
        serializer.serialize(document);     
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

}

Output from a DOM based XMLPrettyPrinter

<?xml version="1.0" encoding="UTF-8"?>
<!-- <!DOCTYPE foo SYSTEM "http://msdn.microsoft.com/xml/general/htmlentities.dtd"> -->
<weblogs>
  <log>
    <name>MozillaZine</name>
    <url>http://www.mozillazine.org</url>
    <changesUrl>http://www.mozillazine.org/contents.rdf</changesUrl>
    <ownerName>Jason Kersey</ownerName>
    <ownerEmail>kerz@en.com</ownerEmail>
    <description>THE source for news on the Mozilla Organization.
      DevChats, Reviews, Chats, Builds, Demos, Screenshots, and more.</description>
    <imageUrl/>
    <adImageUrl>http://static.userland.com/weblogMonitor/ads/kerz@en.com.gif</adImageUrl>
  </log>
  <log>
    <name>SalonHerringWiredFool</name>
    <url>http://www.salonherringwiredfool.com/</url>
    <ownerName>Some Random Herring</ownerName>
    <ownerEmail>salonfool@wiredherring.com</ownerEmail>
    <description/>
  </log>
  <log>
    <name>Scripting News</name>
    <url>http://www.scripting.com/</url>
    <ownerName>Dave Winer</ownerName>
    <ownerEmail>dave@userland.com</ownerEmail>
    <description>News and commentary from the cross-platform scripting community.</description>
    <imageUrl>http://www.scripting.com/gifs/tinyScriptingNews.gif</imageUrl>
    <adImageUrl>http://static.userland.com/weblogMonitor/ads/dave@userland.com.gif</adImageUrl>
  </log>
  <log>
    <name>SlashDot.Org</name>
    <url>http://www.slashdot.org/</url>
    <ownerName>Simply a friend</ownerName>
    <ownerEmail>afriendofweblogs@weblogs.com</ownerEmail>
    <description>News for Nerds, Stuff that Matters.</description>
  </log>
</weblogs>

The point is this:

Using the DOM to write documents automatically maintains well-formedness constraints
Validity is not automatically maintained.
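
A small experiment illustrates both halves of this point. The sketch assumes the DOM implementation bundled with the JDK, which checks names when nodes are created; the class name WellFormedNotValid and the element name bogus are invented. An illegal element name is refused with a DOMException, while a child element that fibonacci.dtd would never allow is accepted without complaint.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Element;


public class WellFormedNotValid {

  // Creates an empty Fibonacci_Numbers document that names fibonacci.dtd
  public static Document newDocument() {
    try {
      DOMImplementation impl = DocumentBuilderFactory.newInstance()
       .newDocumentBuilder().getDOMImplementation();
      DocumentType type = impl.createDocumentType(
       "Fibonacci_Numbers", null, "fibonacci.dtd");
      return impl.createDocument(null, "Fibonacci_Numbers", type);
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Returns true if the DOM refuses to create an element with this name
  public static boolean rejectsName(Document doc, String name) {
    try {
      doc.createElement(name);
      return false;
    }
    catch (DOMException e) {
      return e.code == DOMException.INVALID_CHARACTER_ERR;
    }
  }

  // Returns true if an element the DTD does not declare can be added
  public static boolean acceptsUndeclaredChild(Document doc) {
    Element bogus = doc.createElement("bogus");
    doc.getDocumentElement().appendChild(bogus);
    return doc.getDocumentElement().getLastChild() == bogus;
  }

  public static void main(String[] args) {
    Document doc = newDocument();
    System.out.println(rejectsName(doc, "not a name"));
    System.out.println(acceptsUndeclaredChild(doc));
  }

}
```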

To Learn More

This presentation: http://www.ibiblio.org/xml/slides/sd2001east/xmlandjava/
XML in a Nutshell
- Elliotte Rusty Harold and Scott Means
- O'Reilly & Associates, 2001
- ISBN: 0-596-00058-8
DOM Level 2 Core Specification: http://www.w3.org/TR/DOM-Level-2-Core/
DOM Level 2 Traversal and Range Specification: http://www.w3.org/TR/DOM-Level-2-Traversal-Range/

Part V: JDOM

There is no compelling reason for a Java API to manipulate XML to be complex, tricky, unintuitive, or a pain in the neck.

--JDOM Mission Statement

Where we're going

Writing XML with JDOM
Reading XML through JDOM
The JDOM Classes

What is JDOM?

A Pure Java API for reading and writing XML Documents
A Java-oriented API for reading and writing XML Documents
A tree-oriented API for reading and writing XML Documents
A parser independent API for reading and writing XML Documents

About JDOM

Created by Brett McLaughlin and Jason Hunter. (James Duncan Davidson is an unindicted coconspirator.)
Alex Chaffee, Alex Rosen, Jools Enticknap, and Philip Nelson are also major contributors.
Open source with an Apache-like license
http://www.jdom.org/

JDOM versions

1.0 Beta 7 is the current tarball, from June 2001
Some functionality has been added to the API in the last month
This presentation is based on the June 22, 2001 CVS version
cvs.jdom.org

Five packages:

org.jdom: the classes that represent an XML document and its parts
org.jdom.input: classes for reading a document into memory
org.jdom.output: classes for writing a document onto a stream or other target (e.g. SAX or DOM app)
org.jdom.adapters: classes for hooking up to DOM implementations
org.jdom.transform: classes for XSLT support

The org.jdom package

The classes that represent an XML document and its parts

Attribute
Comment
DocType
Document
Element
Text (incomplete)
CDATA (may be going away)
EntityRef
ProcessingInstruction
plus Verifier
plus assorted exceptions

The org.jdom.input package

Classes for reading a document into memory from a file or other source

DOMBuilder
SAXBuilder
BuilderErrorHandler
DefaultJDOMFactory
SAXHandler

The org.jdom.output package

The classes for writing a document to a file or other target

XMLOutputter
SAXOutputter
DOMOutputter

The org.jdom.adapters package

Classes for hooking up JDOM to DOM implementations:
- AbstractDOMAdapter
- OracleV1DOMAdapter
- OracleV2DOMAdapter
- ProjectXDOMAdapter
- XercesDOMAdapter
- JAXPDOMAdapter
- CrimsonDOMAdapter
- XML4JDOMAdapter
You rarely need to access these directly.

The org.jdom.transform package

Classes for XSLT support:

JDOMResult
JDOMSource

Writing XML Documents with JDOM

JDOM is for both input and output
New documents can be read from a stream or constructed in memory
An org.jdom.output.XMLOutputter sends a document from memory to an OutputStream or Writer
A JDOM document can also be sent to a SAX ContentHandler or DOM org.w3c.dom.Document for further processing with a different API

A JDOM program that writes this XML document

<?xml version="1.0"?>
<GREETING>
  Hello JDOM!
</GREETING>

Hello JDOM

import org.jdom.*;
import org.jdom.output.XMLOutputter;


public class HelloJDOM {

  public static void main(String[] args) {
   
    Element root = new Element("GREETING");
    	
    root.setText("Hello JDOM!");
         
    Document doc = new Document(root);      
    
    // At this point the document only exists in memory.
    // We still need to serialize it
    XMLOutputter outputter = new XMLOutputter();
    try {
      outputter.output(doc, System.out);       
    }
    catch (Exception e) {
      System.err.println(e);
    }

  }

}

Actual Output

<?xml version="1.0" encoding="UTF-8"?>
<GREETING>Hello JDOM!</GREETING>

This is more or less what we wanted, modulo white space.

Hello DOM

Here's the same program using DOM instead of JDOM. Which is simpler?

import java.io.*;
import org.w3c.dom.*;
import org.apache.xerces.dom.*;
import org.apache.xml.serialize.*;


public class HelloDOM {

  public static void main(String[] args) {

    try {

      DOMImplementation impl = DOMImplementationImpl.getDOMImplementation();
      //                       ^^^^^^^^^^^^^^^^^^^^^
      //                       Xerces Specific class

      Document hello = impl.createDocument(null, "GREETING", null);
      //                                   ^^^^              ^^^^
      //                               Namespace URI       DocType

      Element root = hello.getDocumentElement();

      // We can't use a raw string. Instead we have to first create
      // a text node.
      Text text = hello.createTextNode("Hello DOM!");
      root.appendChild(text);

      // Now that the document is created we need to *serialize* it
      try {
        OutputFormat format = new OutputFormat(hello);
        XMLSerializer serializer 
         = new XMLSerializer(System.out, format);
        serializer.serialize(root);
      }
      catch (IOException e) {
        System.err.println(e);
      }
    }
    catch (DOMException e) {
      e.printStackTrace();
    }

  }

}

fibonacci.xml

Suppose we want data in an XML document that looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<Fibonacci_Numbers>
  <fibonacci index="0">0</fibonacci>
  <fibonacci index="1">1</fibonacci>
  <fibonacci index="2">1</fibonacci>
  <fibonacci index="3">2</fibonacci>
  <fibonacci index="4">3</fibonacci>
  <fibonacci index="5">5</fibonacci>
  <fibonacci index="6">8</fibonacci>
  <fibonacci index="7">13</fibonacci>
  <fibonacci index="8">21</fibonacci>
  <fibonacci index="9">34</fibonacci>
  <fibonacci index="10">55</fibonacci>
  <fibonacci index="11">89</fibonacci>
  <fibonacci index="12">144</fibonacci>
  <fibonacci index="13">233</fibonacci>
  <fibonacci index="14">377</fibonacci>
  <fibonacci index="15">610</fibonacci>
  <fibonacci index="16">987</fibonacci>
  <fibonacci index="17">1597</fibonacci>
  <fibonacci index="18">2584</fibonacci>
  <fibonacci index="19">4181</fibonacci>
  <fibonacci index="20">6765</fibonacci>
  <fibonacci index="21">10946</fibonacci>
  <fibonacci index="22">17711</fibonacci>
  <fibonacci index="23">28657</fibonacci>
  <fibonacci index="24">46368</fibonacci>
  <fibonacci index="25">75025</fibonacci>
</Fibonacci_Numbers>

A JDOM program that writes Fibonacci numbers into an XML file

import org.jdom.*;
import org.jdom.output.XMLOutputter;
import java.math.BigInteger;
import java.io.*;


public class FibonacciJDOM {

  public static void main(String[] args) {

    Element root = new Element("Fibonacci_Numbers");

    BigInteger low  = BigInteger.ZERO;
    BigInteger high = BigInteger.ONE;

    for (int i = 0; i <= 25; i++) {
      Element fibonacci = new Element("fibonacci");
      Attribute index = new Attribute("index", String.valueOf(i));
      fibonacci.setAttribute(index);
      fibonacci.setText(low.toString());
      root.addContent(fibonacci);

      BigInteger temp = high;
      high = high.add(low);
      low = temp;
    }

    Document doc = new Document(root);
    // serialize it into a file
    try {
      FileOutputStream out 
       = new FileOutputStream("fibonacci_jdom.xml");
      XMLOutputter serializer = new XMLOutputter();
      serializer.output(doc, out);
      out.flush();
      out.close();
    }
    catch (IOException e) {
      System.err.println(e);
    }

  }

}

Output

Again, modulo white space, this is correct:

<?xml version="1.0" encoding="UTF-8"?>
<Fibonacci_Numbers><fibonacci index="0">0</fibonacci><fibonacci index="1">1</fibonacci><fibonacci index="2">1</fibonacci><fibonacci index="3">2</fibonacci><fibonacci index="4">3</fibonacci><fibonacci index="5">5</fibonacci><fibonacci index="6">8</fibonacci><fibonacci index="7">13</fibonacci><fibonacci index="8">21</fibonacci><fibonacci index="9">34</fibonacci><fibonacci index="10">55</fibonacci><fibonacci index="11">89</fibonacci><fibonacci index="12">144</fibonacci><fibonacci index="13">233</fibonacci><fibonacci index="14">377</fibonacci><fibonacci index="15">610</fibonacci><fibonacci index="16">987</fibonacci><fibonacci index="17">1597</fibonacci><fibonacci index="18">2584</fibonacci><fibonacci index="19">4181</fibonacci><fibonacci index="20">6765</fibonacci><fibonacci index="21">10946</fibonacci><fibonacci index="22">17711</fibonacci><fibonacci index="23">28657</fibonacci><fibonacci index="24">46368</fibonacci><fibonacci index="25">75025</fibonacci></Fibonacci_Numbers>

Suppose you want to include a DTD

Suppose we have this DTD at the relative URL fibonacci.dtd:

<!ELEMENT Fibonacci_Numbers (fibonacci*)>
<!ELEMENT fibonacci (#PCDATA)>
<!ATTLIST fibonacci index CDATA #IMPLIED>

We need this DOCTYPE declaration:

<!DOCTYPE Fibonacci_Numbers SYSTEM "fibonacci.dtd">
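
One way to check what this DTD accepts is to run a validating parser over a candidate document. The sketch below is an illustration only: it assumes a JAXP parser that supports DTD validation, copies the three declarations into an internal DTD subset so that no fibonacci.dtd file is needed, and uses the invented class name FibonacciValidator.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;


public class FibonacciValidator {

  // The DTD from above, moved into an internal subset for the example
  private static final String DOCTYPE
   = "<!DOCTYPE Fibonacci_Numbers [\n"
   + "<!ELEMENT Fibonacci_Numbers (fibonacci*)>\n"
   + "<!ELEMENT fibonacci (#PCDATA)>\n"
   + "<!ATTLIST fibonacci index CDATA #IMPLIED>\n"
   + "]>\n";

  // Returns true if the given document body is valid against the DTD
  public static boolean isValid(String body) {

    try {
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      factory.setValidating(true);
      DocumentBuilder parser = factory.newDocumentBuilder();

      // Validity errors are only reported, so rethrow them
      parser.setErrorHandler(new ErrorHandler() {
        public void warning(SAXParseException e) {}
        public void error(SAXParseException e) throws SAXException {
          throw e;
        }
        public void fatalError(SAXParseException e) throws SAXException {
          throw e;
        }
      });

      parser.parse(new InputSource(new StringReader(DOCTYPE + body)));
      return true;
    }
    catch (SAXException e) {
      return false; // a validity or well-formedness error was reported
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }

  }

  public static void main(String[] args) {
    System.out.println(isValid("<Fibonacci_Numbers>"
     + "<fibonacci index=\"0\">0</fibonacci></Fibonacci_Numbers>"));
    System.out.println(isValid("<Fibonacci_Numbers>"
     + "<number>0</number></Fibonacci_Numbers>"));
  }

}
```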

ValidFibonacci

Use the DocType class to insert a document type declaration
JDOM does not support internal DTD subsets.
JDOM does not let you output a DTD.

import java.math.*;
import java.io.*;
import org.jdom.*;
import org.jdom.output.XMLOutputter;


public class ValidFibonacci {

  public static void main(String[] args) {
   
    Element root = new Element("Fibonacci_Numbers");	
  	      
    BigInteger low  = BigInteger.ZERO;
    BigInteger high = BigInteger.ONE;      
    
    for (int i = 0; i <= 25; i++) {
      Element fibonacci = new Element("fibonacci");
      Attribute index = new Attribute("index", String.valueOf(i));
      fibonacci.setAttribute(index);
      fibonacci.setText(low.toString());
      BigInteger temp = high;
      high = high.add(low);
      low = temp;
      root.addContent(fibonacci);
    }
 
    DocType type = new DocType("Fibonacci_Numbers", "fibonacci.dtd");
 
    Document doc = new Document(root, type);
    // serialize it into a file
    try {
      FileOutputStream out = new FileOutputStream("validfibonacci.xml");
      XMLOutputter serializer = new XMLOutputter(); 
      serializer.output(doc, out);
      out.flush();	
      out.close();
    }
    catch (IOException e) {
      System.err.println(e);
    }

  }

}

validfibonacci.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Fibonacci_Numbers SYSTEM "fibonacci.dtd">
<Fibonacci_Numbers><fibonacci index="0">0</fibonacci><fibonacci index="1">1</fibonacci><fibonacci index="2">1</fibonacci><fibonacci index="3">2</fibonacci><fibonacci index="4">3</fibonacci><fibonacci index="5">5</fibonacci><fibonacci index="6">8</fibonacci><fibonacci index="7">13</fibonacci><fibonacci index="8">21</fibonacci><fibonacci index="9">34</fibonacci><fibonacci index="10">55</fibonacci><fibonacci index="11">89</fibonacci><fibonacci index="12">144</fibonacci><fibonacci index="13">233</fibonacci><fibonacci index="14">377</fibonacci><fibonacci index="15">610</fibonacci><fibonacci index="16">987</fibonacci><fibonacci index="17">1597</fibonacci><fibonacci index="18">2584</fibonacci><fibonacci index="19">4181</fibonacci><fibonacci index="20">6765</fibonacci><fibonacci index="21">10946</fibonacci><fibonacci index="22">17711</fibonacci><fibonacci index="23">28657</fibonacci><fibonacci index="24">46368</fibonacci><fibonacci index="25">75025</fibonacci></Fibonacci_Numbers>

Using Namespaces

Suppose we want some MathML like this:

<?xml version="1.0" encoding="UTF-8"?>
<mathml:math xmlns:mathml="http://www.w3.org/1998/Math/MathML">
  <mathml:mrow>
    <mathml:mi>f(0)</mathml:mi>
    <mathml:mo>=</mathml:mo>
    <mathml:mn>0</mathml:mn>
  </mathml:mrow>
  <mathml:mrow>
    <mathml:mi>f(1)</mathml:mi>
    <mathml:mo>=</mathml:mo>
    <mathml:mn>1</mathml:mn>
  </mathml:mrow>
  <mathml:mrow>
    <mathml:mi>f(2)</mathml:mi>
    <mathml:mo>=</mathml:mo>
    <mathml:mn>1</mathml:mn>
  </mathml:mrow>
</mathml:math>

Rules for Using Namespaces

Do not use qualified names like mathml:mn.
Instead use prefixes like mathml, local names like mn, and URIs like http://www.w3.org/1998/Math/MathML to create the elements.
Do not include xmlns attributes like xmlns:mathml="http://www.w3.org/1998/Math/MathML".
XMLOutputter will decide where to put the xmlns attributes when the document is serialized.
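
For contrast, the standard DOM works the other way around: createElementNS() takes the namespace URI plus the qualified name, and the prefix and local name are derived from it. The sketch below uses only JAXP and DOM, not JDOM, to show the three parts the rules above refer to; the class name NamespaceParts is invented for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;


public class NamespaceParts {

  public static final String MATHML
   = "http://www.w3.org/1998/Math/MathML";

  // Creates a mathml:mn element through the standard DOM
  public static Element createMn() {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
       .newDocumentBuilder().newDocument();
      // DOM takes the URI and the qualified name; it derives the
      // prefix and the local name from the qualified name
      Element mn = doc.createElementNS(MATHML, "mathml:mn");
      mn.appendChild(doc.createTextNode("0"));
      return mn;
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Element mn = createMn();
    System.out.println("prefix:     " + mn.getPrefix());
    System.out.println("local name: " + mn.getLocalName());
    System.out.println("URI:        " + mn.getNamespaceURI());
  }

}
```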

With Namespace Prefixes

import org.jdom.Element;
import org.jdom.Document;
import org.jdom.output.XMLOutputter;
import java.math.BigInteger;
import java.io.*;


public class PrefixedFibonacci {

  public static void main(String[] args) {

    Element root = new Element("math", "mathml",
     "http://www.w3.org/1998/Math/MathML");

    BigInteger low  = BigInteger.ZERO;
    BigInteger high = BigInteger.ONE;

    for (int i = 0; i <= 25; i++) {

      Element mrow = new Element("mrow", "mathml",
       "http://www.w3.org/1998/Math/MathML");

      Element mi = new Element("mi", "mathml",
       "http://www.w3.org/1998/Math/MathML");
      mi.setText("f(" + i + ")");
      mrow.addContent(mi);

      Element mo = new Element("mo", "mathml",
       "http://www.w3.org/1998/Math/MathML");
      mo.setText("=");
      mrow.addContent(mo);

      Element mn = new Element("mn", "mathml",
       "http://www.w3.org/1998/Math/MathML");
      mn.setText(low.toString());
      mrow.addContent(mn);

      BigInteger temp = high;
      high = high.add(low);
      low = temp;
      root.addContent(mrow);

    }

    Document doc = new Document(root);
    // serialize it into a file
    try {
      FileOutputStream out 
       = new FileOutputStream("prefixed_fibonacci.xml");
      XMLOutputter serializer = new XMLOutputter();
      serializer.output(doc, out);
      out.flush();
      out.close();
    }
    catch (IOException e) {
      System.err.println(e);
    }

  }

}

The Default, Unprefixed Namespace

Suppose you want some MathML like this:

<?xml version="1.0" encoding="UTF-8"?>
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mrow>
    <mi>f(0)</mi>
    <mo>=</mo>
    <mn>0</mn>
  </mrow>
  <mrow>
    <mi>f(1)</mi>
    <mo>=</mo>
    <mn>1</mn>
  </mrow>
  <mrow>
    <mi>f(2)</mi>
    <mo>=</mo>
    <mn>1</mn>
  </mrow>
</math>

Rules for Using Default Namespace

Do not use local names like mn on their own.
Instead use local names like mn together with URIs like http://www.w3.org/1998/Math/MathML to create the elements.
Do not include xmlns attributes like xmlns="http://www.w3.org/1998/Math/MathML".
XMLOutputter will decide where to put the xmlns attribute when the document is serialized.
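
The reason the URI has to accompany the local name is that an unprefixed element in the default namespace still belongs to that namespace. A quick way to see this is to parse such a document with a namespace-aware parser. The sketch below assumes a JAXP parser and uses the invented class name DefaultNamespaceLookup; it reads a document rather than writing one.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;


public class DefaultNamespaceLookup {

  public static final String MATHML
   = "http://www.w3.org/1998/Math/MathML";

  // Parses a small unprefixed MathML document and returns its mn element
  public static Element firstMn() {
    try {
      String xml = "<math xmlns=\"" + MATHML + "\">"
       + "<mrow><mi>f(0)</mi><mo>=</mo><mn>0</mn></mrow></math>";

      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      Document doc = factory.newDocumentBuilder()
       .parse(new InputSource(new StringReader(xml)));

      // Look the element up by URI and local name, not by tag name
      return (Element) doc.getElementsByTagNameNS(MATHML, "mn").item(0);
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Element mn = firstMn();
    System.out.println("prefix:     " + mn.getPrefix());
    System.out.println("local name: " + mn.getLocalName());
    System.out.println("URI:        " + mn.getNamespaceURI());
  }

}
```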

With Default Namespace

import org.jdom.Element;
import org.jdom.Document;
import org.jdom.output.XMLOutputter;
import java.math.BigInteger;
import java.io.*;


public class UnprefixedFibonacci {

  public static void main(String[] args) {
   
    Element root = new Element("math", 
     "http://www.w3.org/1998/Math/MathML");	
  	      
    BigInteger low  = BigInteger.ZERO;
    BigInteger high = BigInteger.ONE;      
    
    for (int i = 0; i <= 25; i++) {
        
      Element mrow = new Element("mrow", 
       "http://www.w3.org/1998/Math/MathML");
      
      Element mi = new Element("mi", 
       "http://www.w3.org/1998/Math/MathML");
      mi.setText("f(" + i + ")"); 
      mrow.addContent(mi);
      
      Element mo = new Element("mo", 
       "http://www.w3.org/1998/Math/MathML");
      mo.setText("="); 
      mrow.addContent(mo);
      
      Element mn = new Element("mn", 
       "http://www.w3.org/1998/Math/MathML");
      mn.setText(low.toString());
      mrow.addContent(mn);

      BigInteger temp = high;
      high = high.add(low);
      low = temp;
      root.addContent(mrow);
      
    }
 
    Document doc = new Document(root);
    // serialize it into a file
    try {
      FileOutputStream out 
       = new FileOutputStream("unprefixed_fibonacci.xml");
      XMLOutputter serializer = new XMLOutputter(); 
      serializer.output(doc, out);
      out.flush();	
      out.close();
    }
    catch (IOException e) {
      System.err.println(e);
    }

  }

}

Converting data to XML

Sample Tab Delimited Data: Baseball Statistics



Surname FirstName Team Position Games Played Games Started AtBats Runs Hits Doubles Triples Home runs RBI Stolen Bases Caught Stealing Sacrifice Hits Sacrifice Flies Errors PB Walks Strike outs Hit by pitch 
Anderson Garret ANA Outfield 156 151 622 62 183 41 7 15 79 8 3 3 3 6 0 29 80 1 
Baughman Justin ANA Second Base 62 54 196 24 50 9 1 1 20 10 4 5 3 8 0 6 36 1 
Bolick Frank ANA Third Base 21 11 45 3 7 2 0 1 2 0 0 0 0 0 0 11 8 0 
Disarcina Gary ANA Shortstop 157 155 551 73 158 39 3 3 56 12 7 12 3 14 0 21 51 8 
Edmonds Jim ANA Outfield 154 150 599 115 184 42 1 25 91 7 5 1 1 5 0 57 114 1 
Erstad Darin ANA Outfield 133 129 537 84 159 39 3 19 82 20 6 1 3 3 0 43 77 6 
Garcia Carlos ANA Second Base 19 10 35 4 5 1 0 0 0 2 0 1 0 1 0 3 11 1 
Glaus Troy ANA Third Base 48 45 165 19 36 9 0 1 23 1 0 0 2 7 0 15 51 0 
Greene Todd ANA Outfield 29 15 71 3 18 4 0 1 7 0 0 0 0 0 0 2 20 0 
Helfand Eric ANA Catcher 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
Hollins Dave ANA Third Base 101 98 363 60 88 16 2 11 39 11 3 2 2 17 0 44 69 7 
Jefferies Gregg ANA Outfield 19 18 72 7 25 6 0 1 10 1 0 0 0 0 0 0 5 0 
Johnson Mark ANA First Base 10 2 14 1 1 0 0 0 0 0 0 0 0 0 0 0 6 0 
Kreuter Chad ANA Catcher 96 74 252 27 63 10 1 2 33 1 0 5 1 9 5 33 49 3 
Martin Norberto ANA Second Base 79 50 195 20 42 2 0 1 13 3 1 3 2 4 0 6 29 0 
Mashore Damon ANA Outfield 43 24 98 13 23 6 0 2 11 1 0 1 0 0 0 9 22 3 
Molina Ben ANA Catcher 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
Nevin Phil ANA Catcher 75 65 237 27 54 8 1 8 27 0 0 0 2 5 20 17 67 5 
O'Brien Charlie ANA Catcher 62 58 175 13 45 9 0 4 18 0 0 3 3 4 1 10 33 2 
Palmeiro Orlando ANA Outfield 74 34 165 28 53 7 2 0 21 5 4 7 0 0 0 20 11 0 
Pritchett Chris ANA First Base 31 19 80 12 23 2 1 2 8 2 0 0 0 1 0 4 16 0 
Salmon Tim ANA Designated Hitter 136 130 463 84 139 28 1 26 88 0 1 0 10 2 0 90 100 3 
Shipley Craig ANA Third Base 77 32 147 18 38 7 1 2 17 0 4 4 1 3 0 5 22 5 
Velarde Randy ANA Second Base 51 50 188 29 49 13 1 4 26 7 2 0 1 4 0 34 42 1 
Walbeck Matt ANA Catcher 108 91 338 41 87 15 2 6 46 1 1 5 5 7 8 30 68 2 
Williams Reggie ANA Outfield 29 7 36 7 13 1 0 1 5 3 3 1 0 0 0 7 11 1

A Program to convert tab delimited data to XML

import java.io.*;
import org.jdom.*;
import org.jdom.output.XMLOutputter;


public class JDOMBaseballTabToXML {

  public static void main(String[] args) {
     
    Element root = new Element("players");
    
    try {
      FileInputStream fin = new FileInputStream(args[0]);
      BufferedReader in 
       = new BufferedReader(new InputStreamReader(fin));    

      String playerStats;  
      while ((playerStats = in.readLine()) != null) {
        String[] stats = splitLine(playerStats);
        
        Element player = new Element("player");

        Element first_name = new Element("first_name");
        first_name.setText(stats[1]);
        player.addContent(first_name);
        
        Element surname = new Element("surname");
        surname.setText(stats[0]);
        player.addContent(surname);
       
        Element games_played = new Element("games_played");
        games_played.setText(stats[4]);
        player.addContent(games_played);
       
        Element at_bats = new Element("at_bats");
        at_bats.setText(stats[6]);
        player.addContent(at_bats);
       
        Element runs = new Element("runs");
        runs.setText(stats[7]);
        player.addContent(runs);
       
        Element hits = new Element("hits");
        hits.setText(stats[8]);
        player.addContent(hits);
       
        Element doubles = new Element("doubles");
        doubles.setText(stats[9]);
        player.addContent(doubles);
       
        Element triples = new Element("triples");
        triples.setText(stats[10]);
        player.addContent(triples); 

        Element home_runs = new Element("home_runs");
        home_runs.setText(stats[11]);
        player.addContent(home_runs); 

        Element runs_batted_in = new Element("runs_batted_in");
        runs_batted_in.setText(stats[12]);
        player.addContent(runs_batted_in); 

        Element stolen_bases = new Element("stolen_bases");
        stolen_bases.setText(stats[13]);
        player.addContent(stolen_bases); 

        Element caught_stealing = new Element("caught_stealing");
        caught_stealing.setText(stats[14]);
        player.addContent(caught_stealing); 

        Element sacrifice_hits = new Element("sacrifice_hits");
        sacrifice_hits.setText(stats[15]);
        player.addContent(sacrifice_hits); 

        Element sacrifice_flies = new Element("sacrifice_flies");
        sacrifice_flies.setText(stats[16]);
        player.addContent(sacrifice_flies); 

        Element errors = new Element("errors");
        errors.setText(stats[17]);
        player.addContent(errors); 

        Element passed_by_ball = new Element("passed_by_ball");
        passed_by_ball.setText(stats[18]);
        player.addContent(passed_by_ball); 

        Element walks = new Element("walks");
        walks.setText(stats[19]);
        player.addContent(walks); 

        Element strike_outs = new Element("strike_outs");
        strike_outs.setText(stats[20]);
        player.addContent(strike_outs); 

        Element hit_by_pitch = new Element("hit_by_pitch");
        hit_by_pitch.setText(stats[21]);
        player.addContent(hit_by_pitch); 
        
        root.addContent(player);
      }  
      
      Document doc = new Document(root);
      // serialize it into a file
      FileOutputStream fout 
       = new FileOutputStream("baseballstats.xml");
      
      XMLOutputter serializer = new XMLOutputter(); 
      serializer.output(doc, fout);
      fout.flush();	
      fout.close();
      in.close();
      
    }
    catch (IOException e) {
      System.err.println(e);
    }
    catch (ArrayIndexOutOfBoundsException e) {
      System.err.println(
       "Usage: java JDOMBaseballTabToXML input_file.tab");
    }

  }

  public static String[] splitLine(String playerStats) {
    
    // count the number of tabs
    int numTabs = 0;
    for (int i = 0; i < playerStats.length(); i++) {
      if (playerStats.charAt(i) == '\t') numTabs++;
    }
    int numFields = numTabs + 1;
    String[] fields = new String[numFields];
    int position = 0;
    for (int i = 0; i < numFields; i++) {
      StringBuffer field = new StringBuffer();
      while (position < playerStats.length() 
       && playerStats.charAt(position++) != '\t') {
        field.append(playerStats.charAt(position-1));
      }
      fields[i] = field.toString();
    }    
    return fields;
    
  }

}

Baseball Stats in XML

<?xml version="1.0"?>
<players>
  <player>
    <first_name>FirstName</first_name>
    <surname>Surname</surname>
    <games_played>Games Played</games_played>
    <at_bats>AtBats</at_bats>
    <runs>Runs</runs>
    <hits>Hits</hits>
    <doubles>Doubles</doubles>
    <triples>Triples</triples>
    <home_runs>Home runs</home_runs>
    <stolen_bases>RBI</stolen_bases>
    <caught_stealing>Caught Stealing</caught_stealing>
    <sacrifice_hits>Sacrifice Hits</sacrifice_hits>
    <sacrifice_flies>Sacrifice Flies</sacrifice_flies>
    <errors>Errors</errors>
    <passed_by_ball>PB</passed_by_ball>
    <walks>Walks</walks>
    <strike_outs>Strike outs</strike_outs>
    <hit_by_pitch>Hit by pitch</hit_by_pitch>
  </player>
  <player>
    <first_name>Garret </first_name>
    <surname>Anderson</surname>
    <games_played>156</games_played>
    <at_bats>622</at_bats>
    <runs>62</runs>
    <hits>183</hits>
    <doubles>41</doubles>
    <triples>7</triples>
    <home_runs>15</home_runs>
    <stolen_bases>79</stolen_bases>
    <caught_stealing>3</caught_stealing>
    <sacrifice_hits>3</sacrifice_hits>
    <sacrifice_flies>3</sacrifice_flies>
    <errors>6</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>29</walks>
    <strike_outs>80</strike_outs>
    <hit_by_pitch>1</hit_by_pitch>
  </player>
  <player>
    <first_name>Justin </first_name>
    <surname>Baughman</surname>
    <games_played>62</games_played>
    <at_bats>196</at_bats>
    <runs>24</runs>
    <hits>50</hits>
    <doubles>9</doubles>
    <triples>1</triples>
    <home_runs>1</home_runs>
    <stolen_bases>20</stolen_bases>
    <caught_stealing>4</caught_stealing>
    <sacrifice_hits>5</sacrifice_hits>
    <sacrifice_flies>3</sacrifice_flies>
    <errors>8</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>6</walks>
    <strike_outs>36</strike_outs>
    <hit_by_pitch>1</hit_by_pitch>
  </player>
  <player>
    <first_name>Frank </first_name>
    <surname>Bolick</surname>
    <games_played>21</games_played>
    <at_bats>45</at_bats>
    <runs>3</runs>
    <hits>7</hits>
    <doubles>2</doubles>
    <triples>0</triples>
    <home_runs>1</home_runs>
    <stolen_bases>2</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>0</sacrifice_hits>
    <sacrifice_flies>0</sacrifice_flies>
    <errors>0</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>11</walks>
    <strike_outs>8</strike_outs>
    <hit_by_pitch>0</hit_by_pitch>
  </player>
  <player>
    <first_name>Gary </first_name>
    <surname>Disarcina</surname>
    <games_played>157</games_played>
    <at_bats>551</at_bats>
    <runs>73</runs>
    <hits>158</hits>
    <doubles>39</doubles>
    <triples>3</triples>
    <home_runs>3</home_runs>
    <stolen_bases>56</stolen_bases>
    <caught_stealing>7</caught_stealing>
    <sacrifice_hits>12</sacrifice_hits>
    <sacrifice_flies>3</sacrifice_flies>
    <errors>14</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>21</walks>
    <strike_outs>51</strike_outs>
    <hit_by_pitch>8</hit_by_pitch>
  </player>
  <player>
    <first_name>Jim </first_name>
    <surname>Edmonds</surname>
    <games_played>154</games_played>
    <at_bats>599</at_bats>
    <runs>115</runs>
    <hits>184</hits>
    <doubles>42</doubles>
    <triples>1</triples>
    <home_runs>25</home_runs>
    <stolen_bases>91</stolen_bases>
    <caught_stealing>5</caught_stealing>
    <sacrifice_hits>1</sacrifice_hits>
    <sacrifice_flies>1</sacrifice_flies>
    <errors>5</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>57</walks>
    <strike_outs>114</strike_outs>
    <hit_by_pitch>1</hit_by_pitch>
  </player>
  <player>
    <first_name>Darin </first_name>
    <surname>Erstad</surname>
    <games_played>133</games_played>
    <at_bats>537</at_bats>
    <runs>84</runs>
    <hits>159</hits>
    <doubles>39</doubles>
    <triples>3</triples>
    <home_runs>19</home_runs>
    <stolen_bases>82</stolen_bases>
    <caught_stealing>6</caught_stealing>
    <sacrifice_hits>1</sacrifice_hits>
    <sacrifice_flies>3</sacrifice_flies>
    <errors>3</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>43</walks>
    <strike_outs>77</strike_outs>
    <hit_by_pitch>6</hit_by_pitch>
  </player>
  <player>
    <first_name>Carlos </first_name>
    <surname>Garcia</surname>
    <games_played>19</games_played>
    <at_bats>35</at_bats>
    <runs>4</runs>
    <hits>5</hits>
    <doubles>1</doubles>
    <triples>0</triples>
    <home_runs>0</home_runs>
    <stolen_bases>0</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>1</sacrifice_hits>
    <sacrifice_flies>0</sacrifice_flies>
    <errors>1</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>3</walks>
    <strike_outs>11</strike_outs>
    <hit_by_pitch>1</hit_by_pitch>
  </player>
  <player>
    <first_name>Troy </first_name>
    <surname>Glaus</surname>
    <games_played>48</games_played>
    <at_bats>165</at_bats>
    <runs>19</runs>
    <hits>36</hits>
    <doubles>9</doubles>
    <triples>0</triples>
    <home_runs>1</home_runs>
    <stolen_bases>23</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>0</sacrifice_hits>
    <sacrifice_flies>2</sacrifice_flies>
    <errors>7</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>15</walks>
    <strike_outs>51</strike_outs>
    <hit_by_pitch>0</hit_by_pitch>
  </player>
  <player>
    <first_name>Todd </first_name>
    <surname>Greene</surname>
    <games_played>29</games_played>
    <at_bats>71</at_bats>
    <runs>3</runs>
    <hits>18</hits>
    <doubles>4</doubles>
    <triples>0</triples>
    <home_runs>1</home_runs>
    <stolen_bases>7</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>0</sacrifice_hits>
    <sacrifice_flies>0</sacrifice_flies>
    <errors>0</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>2</walks>
    <strike_outs>20</strike_outs>
    <hit_by_pitch>0</hit_by_pitch>
  </player>
  <player>
    <first_name>Eric </first_name>
    <surname>Helfand</surname>
    <games_played>0</games_played>
    <at_bats>0</at_bats>
    <runs>0</runs>
    <hits>0</hits>
    <doubles>0</doubles>
    <triples>0</triples>
    <home_runs>0</home_runs>
    <stolen_bases>0</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>0</sacrifice_hits>
    <sacrifice_flies>0</sacrifice_flies>
    <errors>0</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>0</walks>
    <strike_outs>0</strike_outs>
    <hit_by_pitch>0</hit_by_pitch>
  </player>
  <player>
    <first_name>Dave </first_name>
    <surname>Hollins</surname>
    <games_played>101</games_played>
    <at_bats>363</at_bats>
    <runs>60</runs>
    <hits>88</hits>
    <doubles>16</doubles>
    <triples>2</triples>
    <home_runs>11</home_runs>
    <stolen_bases>39</stolen_bases>
    <caught_stealing>3</caught_stealing>
    <sacrifice_hits>2</sacrifice_hits>
    <sacrifice_flies>2</sacrifice_flies>
    <errors>17</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>44</walks>
    <strike_outs>69</strike_outs>
    <hit_by_pitch>7</hit_by_pitch>
  </player>
  <player>
    <first_name>Gregg </first_name>
    <surname>Jefferies</surname>
    <games_played>19</games_played>
    <at_bats>72</at_bats>
    <runs>7</runs>
    <hits>25</hits>
    <doubles>6</doubles>
    <triples>0</triples>
    <home_runs>1</home_runs>
    <stolen_bases>10</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>0</sacrifice_hits>
    <sacrifice_flies>0</sacrifice_flies>
    <errors>0</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>0</walks>
    <strike_outs>5</strike_outs>
    <hit_by_pitch>0</hit_by_pitch>
  </player>
  <player>
    <first_name>Mark </first_name>
    <surname>Johnson</surname>
    <games_played>10</games_played>
    <at_bats>14</at_bats>
    <runs>1</runs>
    <hits>1</hits>
    <doubles>0</doubles>
    <triples>0</triples>
    <home_runs>0</home_runs>
    <stolen_bases>0</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>0</sacrifice_hits>
    <sacrifice_flies>0</sacrifice_flies>
    <errors>0</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>0</walks>
    <strike_outs>6</strike_outs>
    <hit_by_pitch>0</hit_by_pitch>
  </player>
  <player>
    <first_name>Chad </first_name>
    <surname>Kreuter</surname>
    <games_played>96</games_played>
    <at_bats>252</at_bats>
    <runs>27</runs>
    <hits>63</hits>
    <doubles>10</doubles>
    <triples>1</triples>
    <home_runs>2</home_runs>
    <stolen_bases>33</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>5</sacrifice_hits>
    <sacrifice_flies>1</sacrifice_flies>
    <errors>9</errors>
    <passed_by_ball>5</passed_by_ball>
    <walks>33</walks>
    <strike_outs>49</strike_outs>
    <hit_by_pitch>3</hit_by_pitch>
  </player>
  <player>
    <first_name>Norberto </first_name>
    <surname>Martin</surname>
    <games_played>79</games_played>
    <at_bats>195</at_bats>
    <runs>20</runs>
    <hits>42</hits>
    <doubles>2</doubles>
    <triples>0</triples>
    <home_runs>1</home_runs>
    <stolen_bases>13</stolen_bases>
    <caught_stealing>1</caught_stealing>
    <sacrifice_hits>3</sacrifice_hits>
    <sacrifice_flies>2</sacrifice_flies>
    <errors>4</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>6</walks>
    <strike_outs>29</strike_outs>
    <hit_by_pitch>0</hit_by_pitch>
  </player>
  <player>
    <first_name>Damon </first_name>
    <surname>Mashore</surname>
    <games_played>43</games_played>
    <at_bats>98</at_bats>
    <runs>13</runs>
    <hits>23</hits>
    <doubles>6</doubles>
    <triples>0</triples>
    <home_runs>2</home_runs>
    <stolen_bases>11</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>1</sacrifice_hits>
    <sacrifice_flies>0</sacrifice_flies>
    <errors>0</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>9</walks>
    <strike_outs>22</strike_outs>
    <hit_by_pitch>3</hit_by_pitch>
  </player>
  <player>
    <first_name>Ben </first_name>
    <surname>Molina</surname>
    <games_played>2</games_played>
    <at_bats>1</at_bats>
    <runs>0</runs>
    <hits>0</hits>
    <doubles>0</doubles>
    <triples>0</triples>
    <home_runs>0</home_runs>
    <stolen_bases>0</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>0</sacrifice_hits>
    <sacrifice_flies>0</sacrifice_flies>
    <errors>0</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>0</walks>
    <strike_outs>0</strike_outs>
    <hit_by_pitch>0</hit_by_pitch>
  </player>
  <player>
    <first_name>Phil </first_name>
    <surname>Nevin</surname>
    <games_played>75</games_played>
    <at_bats>237</at_bats>
    <runs>27</runs>
    <hits>54</hits>
    <doubles>8</doubles>
    <triples>1</triples>
    <home_runs>8</home_runs>
    <stolen_bases>27</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>0</sacrifice_hits>
    <sacrifice_flies>2</sacrifice_flies>
    <errors>5</errors>
    <passed_by_ball>20</passed_by_ball>
    <walks>17</walks>
    <strike_outs>67</strike_outs>
    <hit_by_pitch>5</hit_by_pitch>
  </player>
  <player>
    <first_name>Charlie </first_name>
    <surname>Obrien</surname>
    <games_played>62</games_played>
    <at_bats>175</at_bats>
    <runs>13</runs>
    <hits>45</hits>
    <doubles>9</doubles>
    <triples>0</triples>
    <home_runs>4</home_runs>
    <stolen_bases>18</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>3</sacrifice_hits>
    <sacrifice_flies>3</sacrifice_flies>
    <errors>4</errors>
    <passed_by_ball>1</passed_by_ball>
    <walks>10</walks>
    <strike_outs>33</strike_outs>
    <hit_by_pitch>2</hit_by_pitch>
  </player>
  <player>
    <first_name>Orlando </first_name>
    <surname>Palmeiro</surname>
    <games_played>74</games_played>
    <at_bats>165</at_bats>
    <runs>28</runs>
    <hits>53</hits>
    <doubles>7</doubles>
    <triples>2</triples>
    <home_runs>0</home_runs>
    <stolen_bases>21</stolen_bases>
    <caught_stealing>4</caught_stealing>
    <sacrifice_hits>7</sacrifice_hits>
    <sacrifice_flies>0</sacrifice_flies>
    <errors>0</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>20</walks>
    <strike_outs>11</strike_outs>
    <hit_by_pitch>0</hit_by_pitch>
  </player>
  <player>
    <first_name>Chris </first_name>
    <surname>Pritchett</surname>
    <games_played>31</games_played>
    <at_bats>80</at_bats>
    <runs>12</runs>
    <hits>23</hits>
    <doubles>2</doubles>
    <triples>1</triples>
    <home_runs>2</home_runs>
    <stolen_bases>8</stolen_bases>
    <caught_stealing>0</caught_stealing>
    <sacrifice_hits>0</sacrifice_hits>
    <sacrifice_flies>0</sacrifice_flies>
    <errors>1</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>4</walks>
    <strike_outs>16</strike_outs>
    <hit_by_pitch>0</hit_by_pitch>
  </player>
  <player>
    <first_name>Tim </first_name>
    <surname>Salmon</surname>
    <games_played>136</games_played>
    <at_bats>463</at_bats>
    <runs>84</runs>
    <hits>139</hits>
    <doubles>28</doubles>
    <triples>1</triples>
    <home_runs>26</home_runs>
    <stolen_bases>88</stolen_bases>
    <caught_stealing>1</caught_stealing>
    <sacrifice_hits>0</sacrifice_hits>
    <sacrifice_flies>10</sacrifice_flies>
    <errors>2</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>90</walks>
    <strike_outs>100</strike_outs>
    <hit_by_pitch>3</hit_by_pitch>
  </player>
  <player>
    <first_name>Craig </first_name>
    <surname>Shipley</surname>
    <games_played>77</games_played>
    <at_bats>147</at_bats>
    <runs>18</runs>
    <hits>38</hits>
    <doubles>7</doubles>
    <triples>1</triples>
    <home_runs>2</home_runs>
    <stolen_bases>17</stolen_bases>
    <caught_stealing>4</caught_stealing>
    <sacrifice_hits>4</sacrifice_hits>
    <sacrifice_flies>1</sacrifice_flies>
    <errors>3</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>5</walks>
    <strike_outs>22</strike_outs>
    <hit_by_pitch>5</hit_by_pitch>
  </player>
  <player>
    <first_name>Randy </first_name>
    <surname>Velarde</surname>
    <games_played>51</games_played>
    <at_bats>188</at_bats>
    <runs>29</runs>
    <hits>49</hits>
    <doubles>13</doubles>
    <triples>1</triples>
    <home_runs>4</home_runs>
    <stolen_bases>26</stolen_bases>
    <caught_stealing>2</caught_stealing>
    <sacrifice_hits>0</sacrifice_hits>
    <sacrifice_flies>1</sacrifice_flies>
    <errors>4</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>34</walks>
    <strike_outs>42</strike_outs>
    <hit_by_pitch>1</hit_by_pitch>
  </player>
  <player>
    <first_name>Matt </first_name>
    <surname>Walbeck</surname>
    <games_played>108</games_played>
    <at_bats>338</at_bats>
    <runs>41</runs>
    <hits>87</hits>
    <doubles>15</doubles>
    <triples>2</triples>
    <home_runs>6</home_runs>
    <stolen_bases>46</stolen_bases>
    <caught_stealing>1</caught_stealing>
    <sacrifice_hits>5</sacrifice_hits>
    <sacrifice_flies>5</sacrifice_flies>
    <errors>7</errors>
    <passed_by_ball>8</passed_by_ball>
    <walks>30</walks>
    <strike_outs>68</strike_outs>
    <hit_by_pitch>2</hit_by_pitch>
  </player>
  <player>
    <first_name>Reggie </first_name>
    <surname>Williams</surname>
    <games_played>29</games_played>
    <at_bats>36</at_bats>
    <runs>7</runs>
    <hits>13</hits>
    <doubles>1</doubles>
    <triples>0</triples>
    <home_runs>1</home_runs>
    <stolen_bases>5</stolen_bases>
    <caught_stealing>3</caught_stealing>
    <sacrifice_hits>1</sacrifice_hits>
    <sacrifice_flies>0</sacrifice_flies>
    <errors>0</errors>
    <passed_by_ball>0</passed_by_ball>
    <walks>7</walks>
    <strike_outs>11</strike_outs>
    <hit_by_pitch>1</hit_by_pitch>
  </player>
</players>

A Shortcut

import java.io.*;
import org.jdom.*;
import org.jdom.output.XMLOutputter;


public class BaseballTabToXMLShortcut {

  public static void main(String[] args) {
     
    Element root = new Element("players");
    
    try {
      FileInputStream fin = new FileInputStream(args[0]);
      BufferedReader in 
       = new BufferedReader(new InputStreamReader(fin));    

      String playerStats;  
      while ((playerStats = in.readLine()) != null) {
        String[] stats = splitLine(playerStats);
        
        Element player = new Element("player");

        player.addContent((new Element("first_name")).setText(stats[1]));
        player.addContent((new Element("surname")).setText(stats[0]));
        player.addContent((new Element("games_played")).setText(stats[4]));
        player.addContent((new Element("at_bats")).setText(stats[6]));
        player.addContent((new Element("runs")).setText(stats[7]));
        player.addContent((new Element("hits")).setText(stats[8]));
        player.addContent((new Element("doubles")).setText(stats[9]));
        player.addContent((new Element("triples")).setText(stats[10]));
        player.addContent((new Element("home_runs")).setText(stats[11]));
        player.addContent((new Element("runs_batted_in")).setText(stats[12]));
        player.addContent((new Element("stolen_bases")).setText(stats[13]));
        player.addContent((new Element("caught_stealing")).setText(stats[14]));
        player.addContent((new Element("sacrifice_hits")).setText(stats[15]));
        player.addContent((new Element("sacrifice_flies")).setText(stats[16]));
        player.addContent((new Element("errors")).setText(stats[17]));
        player.addContent((new Element("passed_by_ball")).setText(stats[18]));
        player.addContent((new Element("walks")).setText(stats[19]));
        player.addContent((new Element("strike_outs")).setText(stats[20]));
        player.addContent((new Element("hit_by_pitch")).setText(stats[21]));
        
        root.addContent(player);
      }  
      
      Document doc = new Document(root);
      // serialize it into a file
      FileOutputStream fout 
       = new FileOutputStream("baseballstats.xml");
      
      XMLOutputter serializer = new XMLOutputter(); 
      serializer.output(doc, fout);
      fout.flush();	
      fout.close();
      in.close();
      
    }
    catch (IOException e) {
      System.err.println(e);
    }
    catch (ArrayIndexOutOfBoundsException e) {
      System.out.println(
       "Usage: java BaseballTabToXMLShortcut input_file.tab");
    }

  }

  public static String[] splitLine(String playerStats) {
    
    // count the number of tabs
    int numTabs = 0;
    for (int i = 0; i < playerStats.length(); i++) {
      if (playerStats.charAt(i) == '\t') numTabs++;
    }
    int numFields = numTabs + 1;
    String[] fields = new String[numFields];
    int position = 0;
    for (int i = 0; i < numFields; i++) {
      StringBuffer field = new StringBuffer();
      while (position < playerStats.length() 
       && playerStats.charAt(position++) != '\t') {
        field.append(playerStats.charAt(position-1));
      }
      fields[i] = field.toString();
    }    
    return fields;
    
  }

}
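
The splitLine() method in these listings counts tabs and copies characters by hand. As a minimal sketch, assuming Java 1.4 or later is available, the same split can be written with String.split(). The class name TabSplitSketch is hypothetical, and the -1 limit is there so that trailing empty fields are kept, as they are in the hand-written version.

```java
class TabSplitSketch {

  // Splits a tab-delimited line into its fields.
  // The limit of -1 keeps trailing empty fields, so the result always
  // has (number of tabs + 1) entries, like the hand-written splitLine().
  static String[] splitLine(String playerStats) {
    return playerStats.split("\t", -1);
  }

  public static void main(String[] args) {
    String[] fields = splitLine("Anderson\tGarret\t\t156\t");
    for (int i = 0; i < fields.length; i++) {
      System.out.println(i + ": [" + fields[i] + "]");
    }
  }

}
```

Without the -1 limit, split() silently drops trailing empty strings, and a line that ends in a tab would come back one field short.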

Converting data to XML while Processing it

import java.io.*;
import java.text.*;
import java.util.*;
import org.jdom.*;
import org.jdom.output.XMLOutputter;

public class JDOMBattingAverage {

  public static void main(String[] args) {
     
    Element root = new Element("players");
     
    try {
      FileInputStream fin = new FileInputStream(args[0]);
      BufferedReader in 
       = new BufferedReader(new InputStreamReader(fin));
      
      String playerStats;
      
      // for formatting batting averages
      DecimalFormat averages = (DecimalFormat) 
       NumberFormat.getNumberInstance(Locale.US);
      averages.setMaximumFractionDigits(3);
      averages.setMinimumFractionDigits(3);
      averages.setMinimumIntegerDigits(0);
      
      while ((playerStats = in.readLine()) != null) {
        String[] stats = splitLine(playerStats);
        
        String formattedAverage;
        try {
          int atBats         = Integer.parseInt(stats[6]);
          int hits           = Integer.parseInt(stats[8]);
        
          if (atBats <= 0) formattedAverage = "N/A";
          else {
            double average = hits / (double) atBats;
            formattedAverage = averages.format(average);
          }       
        }
        catch (Exception e) {
          // skip this player
          continue; 
        }

        Element player = new Element("player");

        Element first_name = new Element("first_name");
        first_name.setText(stats[1]);
        player.addContent(first_name);
             
        Element surname = new Element("surname");
        surname.setText(stats[0]);
        player.addContent(surname);
       
        Element battingAverage = new Element("batting_average");
        battingAverage.setText(formattedAverage);
        player.addContent(battingAverage);
   
        root.addContent(player);
        
      }  
      
      
      Document doc = new Document(root);
      // serialize it into a file
      FileOutputStream fout 
       = new FileOutputStream("battingaverages.xml");
      
      XMLOutputter serializer = new XMLOutputter(); 
      serializer.output(doc, fout);
      fout.flush();	
      fout.close();
      in.close();

    }
    catch (IOException e) {
      System.err.println(e);
    }
    catch (ArrayIndexOutOfBoundsException e) {
      System.out.println("Usage: java JDOMBattingAverage input_file.tab");
    }

  }

  public static String[] splitLine(String playerStats) {
    
    // count the number of tabs
    int numTabs = 0;
    for (int i = 0; i < playerStats.length(); i++) {
      if (playerStats.charAt(i) == '\t') numTabs++;
    }
    int numFields = numTabs + 1;
    String[] fields = new String[numFields];
    int position = 0;
    for (int i = 0; i < numFields; i++) {
      StringBuffer field = new StringBuffer();
      while (position < playerStats.length() 
       && playerStats.charAt(position++) != '\t') {
        field.append(playerStats.charAt(position-1));
      }
      fields[i] = field.toString();
    }    
    return fields;
    
  }

}
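
The batting average calculation can be looked at in isolation. The following is only a sketch that pulls the formatting logic of the listing above into one method; the class name BattingAverageSketch and the method name format are hypothetical, and the DecimalFormat settings are copied from the listing.

```java
import java.text.DecimalFormat;
import java.text.NumberFormat;
import java.util.Locale;

class BattingAverageSketch {

  // Returns hits/atBats with exactly three fraction digits and no
  // leading zero, or "N/A" when the player has no at bats.
  static String format(int hits, int atBats) {
    if (atBats <= 0) return "N/A";
    DecimalFormat averages = (DecimalFormat)
     NumberFormat.getNumberInstance(Locale.US);
    averages.setMaximumFractionDigits(3);
    averages.setMinimumFractionDigits(3);
    averages.setMinimumIntegerDigits(0);
    // cast before dividing so this is not integer division
    return averages.format(hits / (double) atBats);
  }

  public static void main(String[] args) {
    System.out.println(format(183, 622)); // Garret Anderson's line
    System.out.println(format(0, 0));     // no at bats
  }

}
```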

Batting Averages in XML

<?xml version="1.0"?>
<players>
  <player>
    <first_name>Garret </first_name>
    <surname>Anderson</surname>
    <batting_average>.294</batting_average>
  </player>
  <player>
    <first_name>Justin </first_name>
    <surname>Baughman</surname>
    <batting_average>.255</batting_average>
  </player>
  <player>
    <first_name>Frank </first_name>
    <surname>Bolick</surname>
    <batting_average>.156</batting_average>
  </player>
  <player>
    <first_name>Gary </first_name>
    <surname>Disarcina</surname>
    <batting_average>.287</batting_average>
  </player>
  <player>
    <first_name>Jim </first_name>
    <surname>Edmonds</surname>
    <batting_average>.307</batting_average>
  </player>
  <player>
    <first_name>Darin </first_name>
    <surname>Erstad</surname>
    <batting_average>.296</batting_average>
  </player>
  <player>
    <first_name>Carlos </first_name>
    <surname>Garcia</surname>
    <batting_average>.143</batting_average>
  </player>
  <player>
    <first_name>Troy </first_name>
    <surname>Glaus</surname>
    <batting_average>.218</batting_average>
  </player>
  <player>
    <first_name>Todd </first_name>
    <surname>Greene</surname>
    <batting_average>.254</batting_average>
  </player>
  <player>
    <first_name>Eric </first_name>
    <surname>Helfand</surname>
    <batting_average>N/A</batting_average>
  </player>
  <player>
    <first_name>Dave </first_name>
    <surname>Hollins</surname>
    <batting_average>.242</batting_average>
  </player>
  <player>
    <first_name>Gregg </first_name>
    <surname>Jefferies</surname>
    <batting_average>.347</batting_average>
  </player>
  <player>
    <first_name>Mark </first_name>
    <surname>Johnson</surname>
    <batting_average>.071</batting_average>
  </player>
  <player>
    <first_name>Chad </first_name>
    <surname>Kreuter</surname>
    <batting_average>.250</batting_average>
  </player>
  <player>
    <first_name>Norberto </first_name>
    <surname>Martin</surname>
    <batting_average>.215</batting_average>
  </player>
  <player>
    <first_name>Damon </first_name>
    <surname>Mashore</surname>
    <batting_average>.235</batting_average>
  </player>
  <player>
    <first_name>Ben </first_name>
    <surname>Molina</surname>
    <batting_average>.000</batting_average>
  </player>
  <player>
    <first_name>Phil </first_name>
    <surname>Nevin</surname>
    <batting_average>.228</batting_average>
  </player>
  <player>
    <first_name>Charlie </first_name>
    <surname>Obrien</surname>
    <batting_average>.257</batting_average>
  </player>
  <player>
    <first_name>Orlando </first_name>
    <surname>Palmeiro</surname>
    <batting_average>.321</batting_average>
  </player>
  <player>
    <first_name>Chris </first_name>
    <surname>Pritchett</surname>
    <batting_average>.288</batting_average>
  </player>
  <player>
    <first_name>Tim </first_name>
    <surname>Salmon</surname>
    <batting_average>.300</batting_average>
  </player>
  <player>
    <first_name>Craig </first_name>
    <surname>Shipley</surname>
    <batting_average>.259</batting_average>
  </player>
  <player>
    <first_name>Randy </first_name>
    <surname>Velarde</surname>
    <batting_average>.261</batting_average>
  </player>
  <player>
    <first_name>Matt </first_name>
    <surname>Walbeck</surname>
    <batting_average>.257</batting_average>
  </player>
  <player>
    <first_name>Reggie </first_name>
    <surname>Williams</surname>
    <batting_average>.361</batting_average>
  </player>
</players>

Advantages of JDOM for Writing Documents

You don't need to worry about well-formedness rules
Output is very configurable
You can pick any encoding Java supports
However, validity is not automatically maintained
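
To see what "not worrying about well-formedness rules" saves you, here is a rough sketch of the escaping you would have to do yourself when writing element content with plain string concatenation. This is an illustration only, not JDOM's implementation; the class name XmlEscapeSketch and the method escapeText are hypothetical.

```java
class XmlEscapeSketch {

  // Escapes the characters that must not appear literally
  // in element content.
  static String escapeText(String s) {
    StringBuffer result = new StringBuffer();
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      switch (c) {
        case '&': result.append("&amp;"); break;
        case '<': result.append("&lt;");  break;
        case '>': result.append("&gt;");  break;
        default:  result.append(c);
      }
    }
    return result.toString();
  }

  public static void main(String[] args) {
    // Without escaping, this text would make the output malformed
    System.out.println(
     "<surname>" + escapeText("Smith & <Sons>") + "</surname>");
  }

}
```

Forget one call to escapeText() and the document no longer parses. With JDOM you pass the raw string to setText() and leave the escaping to the outputter.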

Reading XML with JDOM

The stereotypical "Desperate Perl Hacker" (DPH) is supposed to be able to write an XML parser in a weekend.
The parser does the hard work for you.
Your code reads the document by hooking JDOM up to the parser.
JDOM can connect to any parser that supports SAX or DOM.

JDOM Compatible Parsers for Java

Any SAX or DOM compatible parser including:

Apache XML Project's Xerces Java: http://xml.apache.org/xerces-j/index.html
Oracle's XML Parser for Java: http://technet.oracle.com/tech/xml/parser_java2
Sun's Java API for XML Parsing: http://java.sun.com/products/xml

The Design of the DOM API

Parser independent interfaces; parser dependent implementation classes. Most programs must use the parser dependent classes. JAXP helps solve this, but so far only for DOM Level 1.
Everything's a Node:
- Extensive use of polymorphism
- Lots of casting
Language independence means there's very limited use of the Java class library; various features are reinvented
Language independence rules out method overloading because not all languages support it.
Several features are poor design in Java, if not in other languages:
- Named constants are often shorts
- Only one kind of exception; details provided by constants
- No Java-specific utility methods like equals(), hashCode(), clone(), or toString()
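
A small sketch of what "everything's a Node" looks like in practice. It uses the JAXP factory classes from the JDK to stay parser independent; the class name DomCastingSketch and the method childElementNames are hypothetical. Note the short-valued constant and the cast.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomCastingSketch {

  // Returns the tag names of the root element's child elements,
  // separated by single spaces.
  static String childElementNames(String xml) {
    try {
      DocumentBuilder builder
       = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      Document doc
       = builder.parse(new InputSource(new StringReader(xml)));
      NodeList children = doc.getDocumentElement().getChildNodes();
      StringBuffer names = new StringBuffer();
      for (int i = 0; i < children.getLength(); i++) {
        Node child = children.item(i);   // everything comes back as a Node
        // the node type is a short compared against a named constant
        if (child.getNodeType() == Node.ELEMENT_NODE) {
          Element element = (Element) child;   // cast to get Element methods
          if (names.length() > 0) names.append(' ');
          names.append(element.getTagName());
        }
      }
      return names.toString();
    }
    catch (Exception e) {
      // wrapped so callers of this sketch need no checked exceptions
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(childElementNames(
     "<player><first_name>Tim</first_name>"
     + "<surname>Salmon</surname></player>"));
  }

}
```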

The JDOM Process

Construct an org.jdom.input.SAXBuilder or an org.jdom.input.DOMBuilder; no parser specific code is needed!
Invoke the builder's build() method to build a Document object from a
- Reader
- InputStream
- URL
- File
- String containing a SYSTEM ID
If there's a problem building the document, a JDOMException is thrown
Work with the resulting Document object

Parsing a Document with JDOM

import org.jdom.JDOMException;
import org.jdom.input.SAXBuilder;


public class JDOMChecker {

  public static void main(String[] args) {
  
    if (args.length == 0) {
      System.out.println("Usage: java JDOMChecker URL1 URL2..."); 
    } 
      
    SAXBuilder builder = new SAXBuilder();
     
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      // command line should offer URIs or file names
      try {
        builder.build(args[i]);
        // If there are no well-formedness errors, 
        // then no exception is thrown
        System.out.println(args[i] + " is well formed.");
      }
      catch (JDOMException e) { // indicates a well-formedness or other error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage());
      }
      
    }   
  
  }

}

Parser Results

% java JDOMChecker shortlogs.xml HelloJDOM.java
shortlogs.xml is well formed.
HelloJDOM.java is not well formed.
The markup in the document preceding the root element must be well-formed.: 
Error on line 1 of XML document: The markup in the document preceding the 
root element must be well-formed.

Turning on Validation in JDOM

Not all parsers are validating, but Xerces-J is.
Validity errors are not fatal, so they do not necessarily cause a JDOMException.
However, you can tell the builder to validate by passing true to its constructor; validity errors are then reported as JDOMExceptions:
```
    SAXBuilder builder = new SAXBuilder(true);
```
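
Why are validity errors "not fatal"? At the SAX level the parser reports them to the error handler's error() method and then keeps parsing. The following sketch shows this with the JAXP classes in the JDK rather than with JDOM; the class name ValidityErrorCounter and the method countValidityErrors are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ValidityErrorCounter extends DefaultHandler {

  int validityErrors = 0;

  // Validity errors arrive here; since nothing is thrown,
  // the parser carries on with the rest of the document.
  public void error(SAXParseException e) {
    validityErrors++;
  }

  // Parses the document with validation on and returns
  // how many validity errors were reported.
  static int countValidityErrors(String xml) {
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setValidating(true);
      XMLReader reader = factory.newSAXParser().getXMLReader();
      ValidityErrorCounter counter = new ValidityErrorCounter();
      reader.setErrorHandler(counter);
      reader.parse(new InputSource(new StringReader(xml)));
      return counter.validityErrors;
    }
    catch (Exception e) {
      // well-formedness errors and I/O problems end up here
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    String dtd = "<!DOCTYPE a [<!ELEMENT a (#PCDATA)>]>";
    System.out.println(countValidityErrors(dtd + "<a>text</a>"));
    System.out.println(countValidityErrors(dtd + "<a><title/></a>"));
  }

}
```

An error handler that throws from error() instead of counting turns every validity error into an exception.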

JDOM Validator

import org.jdom.JDOMException;
import org.jdom.input.SAXBuilder;


public class JDOMValidator {

  public static void main(String[] args) {

    if (args.length == 0) {
      System.out.println("Usage: java JDOMValidator URL1 URL2...");
    }

    // passing true to the constructor turns on validation
    SAXBuilder builder = new SAXBuilder(true);

    // start parsing...
    for (int i = 0; i < args.length; i++) {

      // command line should offer URIs or file names
      try {
        builder.build(args[i]);
        // If there are no well-formedness or validity errors,
        // then no exception is thrown
        System.out.println(args[i] + " is valid.");
      }
      catch (JDOMException e) { // well-formedness or validity error
        System.out.println(args[i] + " is not valid.");
        System.out.println(e.getMessage());
      }

    }

  }

}

Validation Output

% java JDOMValidator invalid_fibonacci.xml
invalid_fibonacci.xml is not valid.
Element type "title" must be declared.: Error on line 8 of XML document: 
Element type "title" must be declared.

% java JDOMValidator validfibonacci.xml
validfibonacci.xml is valid.

Building with DOM instead of SAX

Use DOMBuilder instead of SAXBuilder
Must have an existing DOM tree, specifically an org.w3c.dom.Document (Note the name conflict with org.jdom.Document)
DOM validation is currently broken.
Approximately doubles the memory usage.
In general, SAX is easier and more efficient.

DOMBuilder Example

import org.jdom.*;
import org.jdom.input.DOMBuilder;
import org.apache.xerces.parsers.*;


public class DOMValidator {

  public static void main(String[] args) {
  
    if (args.length == 0) {
      System.out.println("Usage: java DOMValidator URL1 URL2..."); 
    }      
      
    DOMBuilder builder = new DOMBuilder(true);
                             /*         ^^^^       */
                             /* Turn on validation */
    // start parsing... 
    DOMParser parser = new DOMParser();  // Xerces specific class
    for (int i = 0; i < args.length; i++) {
        
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
    
        org.w3c.dom.Document domDoc  = parser.getDocument();
        org.jdom.Document    jdomDoc = builder.build(domDoc);

        // If there are no validity errors, 
        // then no exception is thrown
        System.out.println(args[i] + " is valid.");
      }
             // indicates a well-formedness or validity error
      catch (Exception e) { 
        System.out.println(args[i] + " is not valid.");
        System.out.println(e.getMessage());
      }
      
    }   
  
  }

}

Weblogs with JDOM

UserLand's RSS based list of Web logs at http://static.userland.com/weblogMonitor/logs.xml:

<?xml version="1.0"?>
<!-- <!DOCTYPE foo SYSTEM "http://msdn.microsoft.com/xml/general/htmlentities.dtd"> -->
<weblogs>
	<log>
		<name>MozillaZine</name>
		<url>http://www.mozillazine.org</url>
		<changesUrl>http://www.mozillazine.org/contents.rdf</changesUrl>
		<ownerName>Jason Kersey</ownerName>
		<ownerEmail>kerz@en.com</ownerEmail>
		<description>THE source for news on the Mozilla Organization.  DevChats, Reviews, Chats, Builds, Demos, Screenshots, and more.</description>
		<imageUrl></imageUrl>
		<adImageUrl>http://static.userland.com/weblogMonitor/ads/kerz@en.com.gif</adImageUrl>
		</log>
	<log>
		<name>SalonHerringWiredFool</name>
		<url>http://www.salonherringwiredfool.com/</url>
		<ownerName>Some Random Herring</ownerName>
		<ownerEmail>salonfool@wiredherring.com</ownerEmail>
		<description></description>
		</log>
	<log>
		<name>Scripting News</name>
		<url>http://www.scripting.com/</url>
		<ownerName>Dave Winer</ownerName>
		<ownerEmail>dave@userland.com</ownerEmail>
		<description>News and commentary from the cross-platform scripting community.</description>
		<imageUrl>http://www.scripting.com/gifs/tinyScriptingNews.gif</imageUrl>
		<adImageUrl>http://static.userland.com/weblogMonitor/ads/dave@userland.com.gif</adImageUrl>
		</log>
	<log>
		<name>SlashDot.Org</name>
		<url>http://www.slashdot.org/</url>
		<ownerName>Simply a friend</ownerName>
		<ownerEmail>afriendofweblogs@weblogs.com</ownerEmail>
		<description>News for Nerds, Stuff that Matters.</description>
		</log>
	</weblogs>

Goal: Return a list of all the URLs in this list as java.net.URL objects

Design Decisions

Should we return an array, an Enumeration, a List, or what?
Perhaps we should use multiple threads?

JDOM Design

We can easily find out how many log elements (and so at most how many URLs) there will be before we start extracting them, because the whole document is already in memory.
Single threaded by nature; no benefit to multiple threads since no data will be available until the entire document has been read and parsed.
The character data of each url element needs to be read. Everything else can be ignored.
The format is very straightforward so we don't need to traverse the entire tree.
The XML parsing is so straightforward it can be done inside one method. No extra class is required.

Weblogs with JDOM

import org.jdom.*;
import org.jdom.input.SAXBuilder;
import java.util.*;
import java.net.*;


public class WeblogsJDOM {
   
  public static String DEFAULT_SYSTEM_ID 
   = "http://static.userland.com/weblogMonitor/logs.xml"; 
     
  public static List listChannels() throws JDOMException {
    return listChannels(DEFAULT_SYSTEM_ID); 
  }
  
  public static List listChannels(String systemID) 
   throws JDOMException, NullPointerException {
    
    if (systemID == null) {
      throw new NullPointerException("URL must be non-null");   
    }
    
    SAXBuilder builder = new SAXBuilder();
    // Load the entire document into memory 
    // from the network or file system
    Document doc = builder.build(systemID);
    
    // Descend the tree and find the URLs. It helps that
    // the document has a very regular structure.
    Element weblogs = doc.getRootElement();
    List logs = weblogs.getChildren("log");
    Vector urls = new Vector(logs.size());
    Iterator iterator = logs.iterator();
    while (iterator.hasNext()) {
      Object o = iterator.next();
      Element log = (Element) o;
      try {
                         // This will probably be changed to 
                         //  getElement() or getChildElement() 
        Element url = log.getChild("url"); 
        if (url == null) continue;
        String content = url.getTextTrim();
        URL u = new URL(content);
        urls.addElement(u);
      }
      catch (MalformedURLException e) {
        // bad input data from one third party; just ignore it 
      }
    }
    return urls;
    
  }
  
  public static void main(String[] args) {
   
    try {
      List urls;
      if (args.length > 0) {
        urls = listChannels(args[0]);
      }
      else {
        urls = listChannels();
      }
      Iterator iterator = urls.iterator();
      while (iterator.hasNext()) {
        System.out.println(iterator.next()); 
      }
    }
    catch (/* Unexpected */ Exception e) {
      e.printStackTrace(); 
    }
    
  }
  
}

Weblogs Output

% java WeblogsJDOM
http://2020Hindsight.editthispage.com/
http://www.sff.net/people/mitchw/weblog/weblog.htp
http://nate.weblogs.com/
http://plugins.launchpoint.net
http://404.psistorm.net
http://home.att.net/~geek9000
http://daubnet.tzo.com/weblog
several hundred more...

The org.jdom Package

The classes that represent an XML document and its parts

Document
Element
Attribute
Comment
DocType
EntityRef
Text
CDATA
ProcessingInstruction
Verifier
plus assorted exceptions

The Document Node

The root node containing the entire document; not the same as the root element
Contains:
- one element
- zero or more processing instructions
- zero or more comments
- zero or one document type declarations

The Document Class

package org.jdom;

public class Document implements Serializable, Cloneable {

  protected List    content;
  protected DocType docType;

  protected Document() {}
  public    Document(Element rootElement) {}
  public    Document(Element rootElement, DocType docType) {}
  public    Document(List content) {}
  public    Document(List content, DocType doctype) {}

  public Element   getRootElement() {}
  public Document  setRootElement(Element rootElement) {}
  public DocType   getDocType() {}
  public Document  setDocType(DocType docType) {}
  public List      getMixedContent() {}
  public Document  addContent(ProcessingInstruction pi) {}
  public Document  addContent(Comment comment) {}
  public Document  setMixedContent(List mixedContent) {}
  
  // basic utility methods
  public final String  toString() {}
  public final boolean equals(Object ob) {}
  public final int     hashCode() {}
  public final Object  clone() {}

}

Document Example

import org.jdom.Document;
import org.jdom.JDOMException;
import org.jdom.input.SAXBuilder;
import org.jdom.output.XMLOutputter;
import java.io.IOException;


public class XMLPrinter {

  public static void main(String[] args) {
  
    if (args.length == 0) {
      System.out.println("Usage: java XMLPrinter URL1 URL2..."); 
    } 
      
    SAXBuilder builder = new SAXBuilder();
     
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      // command line should offer URIs or file names
      try {
        Document doc = builder.build(args[i]);
        System.out.println("*************" + args[i] 
         + "*************");
        XMLOutputter outputter = new XMLOutputter();
        outputter.output(doc, System.out);
      }
      // indicates a well-formedness or other error
      catch (JDOMException e) { 
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage());
      }
      // shouldn't happen because System.out eats exceptions
      catch (IOException e) { 
        System.out.println(e.getMessage());
      }
      
    }   
  
  }

}

Output from XMLPrinter

% java XMLPrinter shortlogs.xml
*************shortlogs.xml*************
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo SYSTEM "http://msdn.microsoft.com/xml/general/htmlentities.dtd"><weblogs>
        <log>
                <name>MozillaZine</name>
                <url>http://www.mozillazine.org</url>
                <changesUrl>http://www.mozillazine.org/contents.rdf</changesUrl>

                <ownerName>Jason Kersey</ownerName>
                <ownerEmail>kerz@en.com</ownerEmail>
                <description>THE source for news on the Mozilla Organization.  DevChats, Reviews, Chats, Builds, Demos, Screenshots, and more.</description>
                <imageUrl />
                <adImageUrl>http://static.userland.com/weblogMonitor/ads/kerz@en.com.gif</adImageUrl>
                </log>
        <log>
                <name>SalonHerringWiredFool</name>
                <url>http://www.salonherringwiredfool.com/</url>
                <ownerName>Some Random Herring</ownerName>
                <ownerEmail>salonfool@wiredherring.com</ownerEmail>
                <description />
                </log>
        <log>
                <name>SlashDot.Org</name>
                <url>http://www.slashdot.org/</url>
                <ownerName>Simply a friend</ownerName>
                <ownerEmail>afriendofweblogs@weblogs.com</ownerEmail>
                <description>News for Nerds, Stuff that Matters.</description>
                </log>
        </weblogs>

Element Nodes

Represents a complete element including its start tag, end tag, and content
Contains:
- Child Elements
- Processing Instructions
- Comments
- Text
- CDATA sections
- Entity references
JDOM enforces restrictions on element names and possibly values; e.g. a name cannot start with a digit.
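
To give a feel for what such a name restriction involves, here is a simplified sketch. It is not JDOM's own checking code: the class NameCheckSketch and its method are hypothetical, Character.isLetter() and Character.isLetterOrDigit() stand in for the exact character tables of the XML 1.0 specification, and colons are rejected on the assumption that namespace prefixes are handled separately.

```java
// Hypothetical helper class, for illustration only
class NameCheckSketch {

  // Returns null if the name looks legal, or a reason if it does not.
  // Simplification: the java.lang.Character tests only approximate
  // the Letter and Digit productions of XML 1.0.
  static String checkName(String name) {
    if (name == null || name.length() == 0) {
      return "Names cannot be null or empty";
    }
    char first = name.charAt(0);
    if (!Character.isLetter(first) && first != '_') {
      return "Names cannot start with '" + first + "'";
    }
    for (int i = 1; i < name.length(); i++) {
      char c = name.charAt(i);
      boolean legal = Character.isLetterOrDigit(c)
       || c == '_' || c == '-' || c == '.';
      if (!legal) {
        return "Names cannot contain '" + c + "'";
      }
    }
    return null;
  }

  public static void main(String[] args) {
    System.out.println(checkName("url"));       // a legal name
    System.out.println(checkName("2ndUrl"));    // starts with a digit
    System.out.println(checkName("log entry")); // contains a space
  }

}
```

Returning a reason string rather than a boolean makes it easy to build a useful exception message at the point where the bad name is rejected.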

Element Class Implementation

The content is stored as a java.util.List which contains
- One String (soon to be Text) object per text node
- One Element object per child element
- One Comment object per comment
- One CDATA object per CDATA section (Text?)
- One ProcessingInstruction object per processing instruction
Use the regular methods of java.util.List to add, remove, and inspect the contents of an element
Since the methods of java.util.List work with plain Object references, you frequently have to cast back to JDOM types and String
Various utility methods mean you don't always have to work with the full list.
Attributes and namespaces are available as separate lists since these are not children.
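
The list-of-mixed-objects design can be illustrated without JDOM at all. In this sketch the nested ChildElement class is a hypothetical stand-in for a child element, and plain String objects play the role of text nodes, as described above. The point is only the instanceof-then-cast pattern that working with such a list requires.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only
class MixedContentSketch {

  // Stand-in for a child element; not a JDOM class.
  static class ChildElement {
    final String name;
    ChildElement(String name) { this.name = name; }
  }

  // Concatenates only the text nodes (String objects) found
  // directly in the list, skipping every other kind of node.
  static String textOf(List<Object> content) {
    StringBuilder text = new StringBuilder();
    for (Object o : content) {
      if (o instanceof String) {
        text.append((String) o);
      }
    }
    return text.toString();
  }

  // Counts the child elements found directly in the list.
  static int countElements(List<Object> content) {
    int count = 0;
    for (Object o : content) {
      if (o instanceof ChildElement) count++;
    }
    return count;
  }

  public static void main(String[] args) {
    List<Object> content = new ArrayList<Object>();
    content.add("News for ");
    content.add(new ChildElement("em"));
    content.add("Nerds");
    System.out.println(textOf(content));
    System.out.println(countElements(content));
  }

}
```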

The Element Class

package org.jdom;

public class Element implements Serializable, Cloneable {

    protected           String    name;
    protected           Namespace namespace;
    protected           Object    parent;
    protected           List      attributes;
    protected transient ArrayList additionalNamespaces;
    protected           ArrayList content;

    protected Element() {}
    public    Element(String name, Namespace namespace) {}
    public    Element(String name) {}
    public    Element(String name, String uri) {}
    public    Element(String name, String prefix, String uri) {}

    public String     getName() {}
    public Namespace  getNamespace() {}
    public Namespace  getNamespace(String prefix) {}
    public String     getNamespacePrefix() {}
    public String     getNamespaceURI() {}
    public String     getQualifiedName() {}
    public Element    getParent() {}
    
    protected Element setParent(Element parent) {}
    public    boolean isRootElement() {}
    protected Element setIsRootElement(boolean isRootElement) {}
    protected Element setDocument(Document document) {}
    public    Element setName(String name) {}
    public    Element setNamespace(Namespace namespace) {}
    public    Element setText(String text) {}

    public String    getText() {} 
    public String    getTextTrim() {} 
    public String    getTextNormalize() {} 
    public List      getMixedContent() {}
    public String    getChildText(String name) {} 
    public String    getChildTextTrim(String name) {} 
    public String    getChildText(String name, Namespace ns) {} 

    public Element   setMixedContent(List mixedContent) {} 
    public List      getChildren() {} 
    public Element   setChildren(List children) {} 
    public List      getChildren(String name) {} 
    public List      getChildren(String name, Namespace ns) {} 
    public Element   getChild(String name, Namespace ns) {} 
    public Element   getChild(String name) {} 
    public boolean   removeChild(String name) {} 
    public boolean   removeChild(String name, Namespace ns) {} 
    public boolean   removeChildren(String name) {}
    public boolean   removeChildren(String name, Namespace ns) {} 
    public boolean   removeChildren() {} 
    
    public Element   addContent(String text) {}
    public Element   addContent(Element element) {} 
    public Element   addContent(ProcessingInstruction pi) {} 
    public Element   addContent(EntityRef entity) {} 
    public Element   addContent(Comment comment) {} 
    public Element   addContent(CDATA cdata) {} 
    public boolean   removeContent(Element element) {} 
    public boolean   removeContent(CDATA cdata) {} 
    public boolean   removeContent(ProcessingInstruction pi) {} 
    public boolean   removeContent(EntityRef entity) {} 
    public boolean   removeContent(Comment comment) {} 
    
    public List      getAttributes() {} 
    public Attribute getAttribute(String name) {} 
    public Attribute getAttribute(String name, Namespace ns) {} 
    public String    getAttributeValue(String name) {} 
    public String    getAttributeValue(String name, Namespace ns) {} 
    public Element   setAttribute(Attribute attribute) {} 
    public Element   setAttributes(List attributes) {} 
    public boolean   removeAttribute(String name) {} 
    public boolean   removeAttribute(String name, Namespace ns) {} 

    public void addNamespaceDeclaration(Namespace additionalNamespace) {}
    public void removeNamespaceDeclaration(Namespace additionalNamespace) {}
    public List getAdditionalNamespaces() {}

    public Element detach() {}
    
    ///////////////////////////////////////
    // Basic Utility Methods
    /////////////////////////////////////// 
    public final String  toString() {}
    public final boolean equals(Object ob) {}
    public final int     hashCode() {}
    public final Object  clone() {}
    
}

Element Example: XCount

import org.jdom.*;
import org.jdom.input.SAXBuilder;
import java.util.*;


public class XCount {

  public static void main(String[] args) {
  
    if (args.length == 0) {
      System.out.println("Usage: java XCount URL1 URL2..."); 
    } 
      
    SAXBuilder builder = new SAXBuilder();
     
    System.out.println(
     "File\tElements\tAttributes\tComments\tProcessing Instructions\tCharacters");
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      // command line should offer URIs or file names
      try {
        Document doc = builder.build(args[i]);
        System.out.print(args[i] + ":\t");
        String result = count(doc);
        System.out.println(result);
      }
             // indicates a well-formedness or other error
      catch (JDOMException e) { 
        System.out.println(args[i] 
         + " is not a well formed XML document.");
        System.out.println(e.getMessage());
      }
      
    }   
  
  }  

  private static int numCharacters             = 0;
  private static int numComments               = 0;
  private static int numElements               = 0;
  private static int numAttributes             = 0;
  private static int numProcessingInstructions = 0;
      
  public static String count(Document doc) {

    numCharacters = 0;
    numComments = 0;
    numElements = 0;
    numAttributes = 0;
    numProcessingInstructions = 0;  

    List children = doc.getMixedContent();
    Iterator iterator = children.iterator();
    while (iterator.hasNext()) {
      Object o = iterator.next();
      if (o instanceof Element) {
        numElements++;
        count((Element) o);
      }
      else if (o instanceof Comment) numComments++;
      else if (o instanceof ProcessingInstruction) {
        numProcessingInstructions++;   
      }
    }
    
    String result = numElements + "\t" + numAttributes + "\t" 
     + numComments + "\t" + numProcessingInstructions + "\t" 
     + numCharacters;
    return result;
       
  }     

  public static void count(Element element) {

    List attributes = element.getAttributes();
    numAttributes += attributes.size();
    List children = element.getMixedContent();
    Iterator iterator = children.iterator();
    while (iterator.hasNext()) {
      Object o = iterator.next();
      if (o instanceof Element) {
        numElements++;
        count((Element) o);
      }
      else if (o instanceof Comment) numComments++;
      else if (o instanceof ProcessingInstruction) {
        numProcessingInstructions++;   
      }
      else if (o instanceof String) {
        String s = (String) o;
        numCharacters += s.length();
      }   
    }
        
  }  

}

XCount Output

% java XCount shortlogs.xml hotcop.xml
File    Elements        Attributes      Comments        Processing Instructions Characters
shortlogs.xml:  30      0       0       0       736
hotcop.xml:     11      8       2       1       95

Handling Attributes in JDOM

Each attribute is represented as an Attribute object
Each Attribute has:
- A local name, a String
- A value, a String
- A Namespace object (which may be Namespace.NO_NAMESPACE)
Everything else can be determined from these three items.

Convenience methods can convert the attribute value to various types like int or double
JDOM enforces restrictions on attribute names and values; e.g. value may not contain < or >
Attributes are stored in a java.util.List in the Element that contains them
This list only contains Attribute objects.
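
A rough sketch of what converting an attribute value to int or double involves, assuming the string is parsed with the standard Integer.parseInt() and Double.parseDouble() methods. Whether JDOM does exactly this internally is not shown in these slides; the class AttributeValueSketch, its method names, and the trimming of white space are all assumptions made for illustration.

```java
// Hypothetical helper class, for illustration only
class AttributeValueSketch {

  // Converts the value to an int, falling back to the supplied
  // default when the value is missing or is not a legal int.
  static int intValue(String value, int defaultValue) {
    if (value == null) return defaultValue;
    try {
      return Integer.parseInt(value.trim());
    }
    catch (NumberFormatException e) {
      return defaultValue;
    }
  }

  // Same idea for double values.
  static double doubleValue(String value, double defaultValue) {
    if (value == null) return defaultValue;
    try {
      return Double.parseDouble(value.trim());
    }
    catch (NumberFormatException e) {
      return defaultValue;
    }
  }

  public static void main(String[] args) {
    System.out.println(intValue("20", -1));       // a legal int
    System.out.println(intValue("twenty", -1));   // falls back to -1
    System.out.println(doubleValue("1.5", 0.0));  // a legal double
  }

}
```

A variant that throws an exception instead of taking a default suits callers that need to distinguish a bad value from a real one.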

The Attribute Class

package org.jdom;

public class Attribute implements Serializable, Cloneable {

    protected String    name;
    protected Namespace namespace;
    protected String    value;
    protected Element   parent;

    protected Attribute() {}
    public    Attribute(String name, String value) {}
    public    Attribute(String name, String value, Namespace namespace) {}

    public String    getName() {}
    public Attribute setName(String name) {}
    public String    getQualifiedName() {}
    public String    getNamespacePrefix() {}
    public String    getNamespaceURI() {}
    public Namespace getNamespace() {}
    public String    getValue() {}
    public Attribute setValue(String value) {}
    protected Attribute setParent(Element parent) {}
    
    public Attribute detach() {}

    /////////////////////////////////////////////////////////////////
    // Basic Utility Methods
    /////////////////////////////////////////////////////////////////

    public final String  toString() {}
    public final boolean equals(Object ob) {}
    public final int     hashCode() {}
    public final Object  clone() {}

    /////////////////////////////////////////////////////////////////
    // Convenience Methods below here
    /////////////////////////////////////////////////////////////////

    public String  getValue(String defaultValue) {}
    public int     getIntValue(int defaultValue) {}
    public int     getIntValue() throws DataConversionException {}
    public long    getLongValue(long defaultValue) {}
    public long    getLongValue() throws DataConversionException {}
    public float   getFloatValue(float defaultValue) {}
    public float   getFloatValue() throws DataConversionException {}
    public double  getDoubleValue(double defaultValue) {}
    public double  getDoubleValue() throws DataConversionException {}
    public boolean getBooleanValue(boolean defaultValue) {}
    public boolean getBooleanValue() throws DataConversionException {}
    public char    getCharValue(char defaultValue) {}
    public char    getCharValue() throws DataConversionException {}

}

IDTagger

import java.io.IOException;
import org.jdom.*;
import org.jdom.input.SAXBuilder;
import org.jdom.output.XMLOutputter;
import java.util.*;


public class JDOMIDTagger {

  private static int id = 1;

  public static void processElement(Element element) {

    if (element.getAttribute("ID") == null) {
      element.setAttribute(new Attribute("ID", "_" + id));
      id = id + 1; 
    }
    
    // recursion
    List children = element.getChildren();
    Iterator iterator = children.iterator();
    while (iterator.hasNext()) {
      processElement((Element) iterator.next());   
    }
    
  }

  public static void main(String[] args) {
     
    SAXBuilder builder = new SAXBuilder();
    
    for (int i = 0; i < args.length; i++) {
        
      try {
        // Read the entire document into memory
        Document document = builder.build(args[i]); 
       
        processElement(document.getRootElement());
        
        // now we serialize the document...
        XMLOutputter serializer = new XMLOutputter(); 
        serializer.output(document, System.out);
        System.out.flush();	        
      }
      catch (JDOMException e) {
        System.err.println(e);
        continue; 
      }
      catch (IOException e) {
        System.err.println(e);
        continue; 
      }
      
    }
  
  } // end main

}

Before IDTagger

<?xml version="1.0"?><backslash
xmlns:backslash="http://slashdot.org/backslash.dtd">

 <story>
    <title>The Onion to buy the New York Times</title>
    <url>http://slashdot.org/articles/00/02/19/1128240.shtml</url>
    <time>2000-02-19 17:25:15</time>
    <author>CmdrTaco</author>
    <department>stuff-to-read</department>
    <topic>media</topic>
    <comments>20</comments>
    <section>articles</section>
    <image>topicmedia.gif</image>
  </story>
 <story>
    <title>Al Gore's Webmaster Answers Your Questions</title>
    <url>http://slashdot.org/interviews/00/02/19/0932207.shtml</url>
    <time>2000-02-19 17:00:52</time>
    <author>Roblimo</author>
    <department>political-process-online</department>
    <topic>usa</topic>
    <comments>49</comments>
    <section>interviews</section>
    <image>topicus.gif</image>
  </story>
 <story>
    <title>Open Source Africa</title>
    <url>http://slashdot.org/articles/00/02/19/1016216.shtml</url>
    <time>2000-02-19 16:05:58</time>
    <author>emmett</author>
    <department>songs-by-toto</department>
    <topic>linux</topic>
    <comments>50</comments>
    <section>articles</section>
    <image>topiclinux.gif</image>
  </story>
 <story>
    <title>Microsoft Funded by NSA, Helps Spy on Win Users?</title>
    <url>http://slashdot.org/articles/00/02/19/0750247.shtml</url>
    <time>2000-02-19 14:07:04</time>
    <author>Roblimo</author>
    <department>deep-dark-conspiracy-theories</department>
    <topic>microsoft</topic>
    <comments>154</comments>
    <section>articles</section>
    <image>topicms.gif</image>
  </story>
 <story>
    <title>X-Men Trailer Released</title>
    <url>http://slashdot.org/articles/00/02/18/0829209.shtml</url>
    <time>2000-02-19 13:47:06</time>
    <author>emmett</author>
    <department>mutant</department>
    <topic>movies</topic>
    <comments>70</comments>
    <section>articles</section>
    <image>topicmovies.gif</image>
  </story>
 <story>
    <title>Connell Replies to "Grok" Comments</title>
    <url>http://slashdot.org/articles/00/02/18/202240.shtml</url>
    <time>2000-02-19 05:01:37</time>
    <author>Hemos</author>
    <department>replying-to-things</department>
    <topic>linux</topic>
    <comments>197</comments>
    <section>articles</section>
    <image>topiclinux.gif</image>
  </story>
 <story>
    <title>etoy.com Returns</title>
    <url>http://slashdot.org/yro/00/02/18/1739216.shtml</url>
    <time>2000-02-19 02:35:06</time>
    <author>nik</author>
    <department>NP:-gimme-shelter</department>
    <topic>internet</topic>
    <comments>77</comments>
    <section>yro</section>
    <image>topicinternet.jpg</image>
  </story>
 <story>
    <title>New Propaganda Series: Rebirth</title>
    <url>http://slashdot.org/articles/00/02/18/205232.shtml</url>
    <time>2000-02-19 01:05:26</time>
    <author>Hemos</author>
    <department>as-pretty-as-always</department>
    <topic>graphics</topic>
    <comments>120</comments>
    <section>articles</section>
    <image>topicgraphics3.gif</image>
  </story>
 <story>
    <title>Giving Back</title>
    <url>http://slashdot.org/features/00/02/18/1631224.shtml</url>
    <time>2000-02-18 22:27:26</time>
    <author>emmett</author>
    <department>salvation-army</department>
    <topic>news</topic>
    <comments>122</comments>
    <section>features</section>
    <image>topicnews.gif</image>
  </story>
 <story>
    <title>Connectix Considering Open Sourcing VGS?</title>
    <url>http://slashdot.org/articles/00/02/18/1050225.shtml</url>
    <time>2000-02-18 20:46:20</time>
    <author>emmett</author>
    <department>grain-of-salt</department>
    <topic>news</topic>
    <comments>93</comments>
    <section>articles</section>
    <image>topicnews.gif</image>
  </story>
</backslash>

After IDTagger

<?xml version="1.0" encoding="UTF-8"?>
<backslash ID="_1">
  <story ID="_2">
    <title ID="_3">The Onion to buy the New York Times</title>
    <url ID="_4">http://slashdot.org/articles/00/02/19/1128240.shtml</url>
    <time ID="_5">2000-02-19 17:25:15</time>
    <author ID="_6">CmdrTaco</author>
    <department ID="_7">stuff-to-read</department>
    <topic ID="_8">media</topic>
    <comments ID="_9">20</comments>
    <section ID="_10">articles</section>
    <image ID="_11">topicmedia.gif</image>
  </story>
  <story ID="_12">
    <title ID="_13">Al Gore's Webmaster Answers Your Questions</title>
    <url ID="_14">http://slashdot.org/interviews/00/02/19/0932207.shtml</url>
    <time ID="_15">2000-02-19 17:00:52</time>
    <author ID="_16">Roblimo</author>
    <department ID="_17">political-process-online</department>
    <topic ID="_18">usa</topic>
    <comments ID="_19">49</comments>
    <section ID="_20">interviews</section>
    <image ID="_21">topicus.gif</image>
  </story>
  <story ID="_22">
    <title ID="_23">Open Source Africa</title>
    <url ID="_24">http://slashdot.org/articles/00/02/19/1016216.shtml</url>
    <time ID="_25">2000-02-19 16:05:58</time>
    <author ID="_26">emmett</author>
    <department ID="_27">songs-by-toto</department>
    <topic ID="_28">linux</topic>
    <comments ID="_29">50</comments>
    <section ID="_30">articles</section>
    <image ID="_31">topiclinux.gif</image>
  </story>
  <story ID="_32">
    <title ID="_33">Microsoft Funded by NSA, Helps Spy on Win Users?</title>
    <url ID="_34">http://slashdot.org/articles/00/02/19/0750247.shtml</url>
    <time ID="_35">2000-02-19 14:07:04</time>
    <author ID="_36">Roblimo</author>
    <department ID="_37">deep-dark-conspiracy-theories</department>
    <topic ID="_38">microsoft</topic>
    <comments ID="_39">154</comments>
    <section ID="_40">articles</section>
    <image ID="_41">topicms.gif</image>
  </story>
  <story ID="_42">
    <title ID="_43">X-Men Trailer Released</title>
    <url ID="_44">http://slashdot.org/articles/00/02/18/0829209.shtml</url>
    <time ID="_45">2000-02-19 13:47:06</time>
    <author ID="_46">emmett</author>
    <department ID="_47">mutant</department>
    <topic ID="_48">movies</topic>
    <comments ID="_49">70</comments>
    <section ID="_50">articles</section>
    <image ID="_51">topicmovies.gif</image>
  </story>
  <story ID="_52">
    <title ID="_53">Connell Replies to "Grok" Comments</title>
    <url ID="_54">http://slashdot.org/articles/00/02/18/202240.shtml</url>
    <time ID="_55">2000-02-19 05:01:37</time>
    <author ID="_56">Hemos</author>
    <department ID="_57">replying-to-things</department>
    <topic ID="_58">linux</topic>
    <comments ID="_59">197</comments>
    <section ID="_60">articles</section>
    <image ID="_61">topiclinux.gif</image>
  </story>
  <story ID="_62">
    <title ID="_63">etoy.com Returns</title>
    <url ID="_64">http://slashdot.org/yro/00/02/18/1739216.shtml</url>
    <time ID="_65">2000-02-19 02:35:06</time>
    <author ID="_66">nik</author>
    <department ID="_67">NP:-gimme-shelter</department>
    <topic ID="_68">internet</topic>
    <comments ID="_69">77</comments>
    <section ID="_70">yro</section>
    <image ID="_71">topicinternet.jpg</image>
  </story>
  <story ID="_72">
    <title ID="_73">New Propaganda Series: Rebirth</title>
    <url ID="_74">http://slashdot.org/articles/00/02/18/205232.shtml</url>
    <time ID="_75">2000-02-19 01:05:26</time>
    <author ID="_76">Hemos</author>
    <department ID="_77">as-pretty-as-always</department>
    <topic ID="_78">graphics</topic>
    <comments ID="_79">120</comments>
    <section ID="_80">articles</section>
    <image ID="_81">topicgraphics3.gif</image>
  </story>
  <story ID="_82">
    <title ID="_83">Giving Back</title>
    <url ID="_84">http://slashdot.org/features/00/02/18/1631224.shtml</url>
    <time ID="_85">2000-02-18 22:27:26</time>
    <author ID="_86">emmett</author>
    <department ID="_87">salvation-army</department>
    <topic ID="_88">news</topic>
    <comments ID="_89">122</comments>
    <section ID="_90">features</section>
    <image ID="_91">topicnews.gif</image>
  </story>
  <story ID="_92">
    <title ID="_93">Connectix Considering Open Sourcing VGS?</title>
    <url ID="_94">http://slashdot.org/articles/00/02/18/1050225.shtml</url>
    <time ID="_95">2000-02-18 20:46:20</time>
    <author ID="_96">emmett</author>
    <department ID="_97">grain-of-salt</department>
    <topic ID="_98">news</topic>
    <comments ID="_99">93</comments>
    <section ID="_100">articles</section>
    <image ID="_101">topicnews.gif</image>
  </story>
</backslash>

Handling Entities in JDOM

Unparsed entities really aren't handled at all.
Most of the time, the parser resolves general entity references and you never see them.
If the parser doesn't resolve a general entity reference, an EntityRef object will be left in the tree.
When writing, the outputter outputs entity references but not the entity's content.
This part of the API is still being thought out.

The EntityRef Class

package org.jdom;

public class EntityRef implements Serializable, Cloneable {

    protected String name;
    protected String publicID;
    protected String systemID;
    protected Element parent;
    protected Document document;

    protected EntityRef() {}
    public EntityRef(String name) {}
    public EntityRef(String name, String publicID, String systemID) {}
    
    public EntityRef detach() {}
    
    public Document  getDocument() {}
    public String    getName() {}
    public Element   getParent() {}
    public String    getPublicID()  {}
    public String    getSystemID() {}

    protected EntityRef setParent(Element parent) {}
    public    EntityRef setName(String newName) {}
    public    EntityRef setPublicID(String newPublicID) {}
    public    EntityRef setSystemID(String newSystemID) {}

    public Object clone() {}
    public final boolean equals(Object o) {}
    public final int hashCode() {}
    public String toString() {}
    
}

Handling Comments in JDOM

A Comment object represents a comment like this example from the XML 1.0 spec:

<!--* N.B. some readers (notably JC) find the following
paragraph awkward and redundant.  I agree it's logically redundant:
it *says* it is summarizing the logical implications of
matching the grammar, and that means by definition it's
logically redundant.  I don't think it's rhetorically
redundant or unnecessary, though, so I'm keeping it.  It
could however use some recasting when the editors are feeling
stronger. -MSM *-->

No children
JDOM checks the content to make sure it's legal (i.e. does not contain a double-hyphen)
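
The legality check described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class CommentCheck and its method are hypothetical stand-ins rather than JDOM's own code, and the null-means-legal return convention is an assumption. The sketch covers two rules from the XML 1.0 comment grammar: no double hyphen inside the text, and no hyphen as the last character.

```java
// Hypothetical helper, not part of JDOM: a sketch of comment content checking.
final class CommentCheck {

  // Returns null if the text is acceptable, or a reason if it is not.
  // (The null-means-legal convention is an assumption of this sketch.)
  static String checkCommentData(String text) {
    if (text == null) {
      return "Comment text is null";
    }
    if (text.indexOf("--") != -1) {
      return "Comment text contains a double hyphen (--)";
    }
    if (text.endsWith("-")) {
      // "<!-- text --->" is not well-formed either
      return "Comment text ends with a hyphen";
    }
    return null;
  }

  public static void main(String[] args) {
    System.out.println(checkCommentData(" a legal comment "));
    System.out.println(checkCommentData("not -- legal"));
  }
}
```

Returning a reason rather than a boolean lets the caller put the explanation straight into an exception message.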

The Comment Class

package org.jdom;

public class Comment implements Serializable, Cloneable {

    protected String text;

    protected Comment() {}
    public    Comment(String text) {}
    
    public String     getText() {}
    public void       setText(String text) {}
    public Comment    detach() {}
    public Document   getDocument() {}
    protected Comment setDocument(Document document) {}
    public Element    getParent() {}
    protected Comment setParent(Element parent){}
    
    public final String  toString() {}
    public final boolean equals(Object ob) {}
    public final int     hashCode() {}
    public final Object  clone() {}

}

Comment Example

import org.jdom.*;
import org.jdom.input.SAXBuilder;
import java.util.*;


public class CommentReader {

  public static void main(String[] args) {
     
    SAXBuilder builder = new SAXBuilder();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        Document doc = builder.build(args[i]);
        List content = doc.getMixedContent();
        Iterator iterator = content.iterator();
        while (iterator.hasNext()) {
          Object o = iterator.next();
          if (o instanceof Comment) {
            Comment c = (Comment) o;
            System.out.println(c.getText());     
            System.out.println();     
          }
          else if (o instanceof Element) {
            processElement((Element) o);   
          }
        }
      }
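      catch (JDOMException e) {
        System.err.println(e); 
        // the root cause may be null if no other exception was wrapped
        Throwable cause = e.getRootCause();
        if (cause != null) cause.printStackTrace(); 
      }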
      
    }
  
  } // end main

  // note use of recursion
  public static void processElement(Element element) {
    
    List content = element.getMixedContent();
    Iterator iterator = content.iterator();
    while (iterator.hasNext()) {
      Object o = iterator.next();
      if (o instanceof Comment) {
        Comment c = (Comment) o;
        System.out.println(c.getText());     
        System.out.println();     
      }
      else if (o instanceof Element) {
        processElement((Element) o);   
      }
    } // end while
    
  }

}

CommentReader Output

% java CommentReader hotcop.xml
 The publisher is actually Polygram but I needed
       an example of a general entity reference.

 You can tell what album I was
     listening to when I wrote this example

Or try http://www.w3.org/TR/1998/REC-xml-19980210.xml for more interesting output.

ProcessingInstruction Nodes

Represents a processing instruction like
<?robots index="yes" follow="no"?>
No children

Some have pseudo-attributes; some don't:

<?php 
  mysql_connect("database.unc.edu", "clerk", "password"); 
  $result = mysql("music", "SELECT LastName, FirstName  
    FROM Employees ORDER BY LastName, FirstName"); 
  $i = 0;
  while ($i < mysql_numrows ($result)) {
     $fields = mysql_fetch_row($result);
     echo "<person>$fields[1] $fields[0] </person>\r\n";
     $i++;
  }
  mysql_close();
?>

A ProcessingInstruction is represented as either
- Target and Value
- Target and Pseudo-attributes
As usual JDOM checks the contents of each ProcessingInstruction object for well-formedness
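
The "Target and Pseudo-attributes" view can be sketched as a small parser that turns processing instruction data into name/value pairs. This is a rough illustration, not JDOM's implementation: the class PseudoAttributes, its regular expression, and the simplified name syntax (ASCII letters, digits, underscore, period, hyphen) are all assumptions made for the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not JDOM's implementation.
final class PseudoAttributes {

  // Matches name="value" or name='value'; names are simplified to ASCII
  private static final Pattern PAIR = Pattern.compile(
      "([A-Za-z_][A-Za-z0-9_.-]*)\\s*=\\s*(\"([^\"]*)\"|'([^']*)')");

  // Returns the pseudo-attributes found in the data, in document order.
  // Data with no name="value" pairs yields an empty map.
  static Map<String, String> parse(String data) {
    Map<String, String> result = new LinkedHashMap<String, String>();
    Matcher matcher = PAIR.matcher(data);
    while (matcher.find()) {
      // group 3 holds a double-quoted value, group 4 a single-quoted one
      String value = matcher.group(3) != null ? matcher.group(3) : matcher.group(4);
      result.put(matcher.group(1), value);
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> robots = parse("index=\"yes\" follow=\"no\"");
    System.out.println(robots.get("index"));
    System.out.println(robots.get("follow"));
  }
}
```

Data that is plain code rather than pseudo-attributes simply produces an empty map, which is why the raw data string has to be kept as well.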

The ProcessingInstruction Class

package org.jdom;

public class ProcessingInstruction implements Serializable, Cloneable {

    protected String target;
    protected String rawData;
    protected Map    mapData;
    protected Document document;
    protected Element parent;
    
    protected ProcessingInstruction() {}
    public    ProcessingInstruction(String target, Map data) {}
    public    ProcessingInstruction(String target, String data) {}
    
    public String                getTarget() {}
    public String                getData() {}
    public ProcessingInstruction setData(String data) {}
    public ProcessingInstruction setData(Map data) {}
    public String                getValue(String name) {}
    public ProcessingInstruction setValue(String name, String value) {}
    public boolean               removeValue(String name) {}

    public    Document              getDocument() {}
    protected ProcessingInstruction setDocument(Document document) {}
    public    Element               getParent() {}
    protected ProcessingInstruction setParent(Element parent){}

    public final String  toString() {}
    public final boolean equals(Object ob) {}
    public final int     hashCode() {}
    public final Object  clone() {}
}

XLinkSpider that Respects the robots Processing Instruction

import java.io.*;
import java.util.*;
import org.jdom.*;
import org.jdom.input.SAXBuilder;


public class XLinkSpider {

  private static SAXBuilder builder = new SAXBuilder();
  
  private static Vector visited = new Vector();
  
  private static int maxDepth = 5;
  private static int currentDepth = 0; 
  
  public static void listURIs(String systemID) {
    
    currentDepth++;
    try {
      if (currentDepth < maxDepth) {

        Document document = builder.build(systemID); 
                
        // check to see if we're allowed to spider
        boolean index = true;
        boolean follow = true;
        ProcessingInstruction robots 
         = document.getProcessingInstruction("robots");
        if (robots != null) {
          String indexValue = robots.getValue("index");
          // compare from the literal so a missing pseudo-attribute
          // cannot cause a NullPointerException
          if ("no".equalsIgnoreCase(indexValue)) index = false;
          String followValue = robots.getValue("follow");
          if ("no".equalsIgnoreCase(followValue)) follow = false;
        }
        Vector uris = new Vector();
        // search the document for uris, 
        // store them in vector, and print them
        if (follow) searchForURIs(document.getRootElement(), uris);
    
        Enumeration e = uris.elements();
        while (e.hasMoreElements()) {
          String uri = (String) e.nextElement();
          visited.addElement(uri);
          if (index) listURIs(uri); 
        }
      
      }
    
    }
    catch (JDOMException e) {
      // couldn't load the document, 
      // probably not well-formed XML, skip it 
    }
    finally { 
      currentDepth--;
      System.out.flush();     
    }
      
  }
  
  private static Namespace xlink 
   = Namespace.getNamespace("http://www.w3.org/1999/xlink");
  
  // use recursion 
  public static void searchForURIs(Element element, Vector uris) {
    
    // look for XLinks in this element
    String uri = element.getAttributeValue("href", xlink);
    if (uri != null && !uri.equals("") 
     && !visited.contains(uri) && !uris.contains(uri)) {
      System.out.println(uri);
      uris.addElement(uri);
    }
    
    // process child elements recursively
    List children = element.getChildren();
    Iterator iterator = children.iterator();
    while (iterator.hasNext()) {
      searchForURIs((Element) iterator.next(), uris); 
    }
    
  }

  public static void main(String[] args) {
    
    if (args.length == 0) {
      System.out.println("Usage: java XLinkSpider URL1 URL2..."); 
    } 
      
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      System.err.println(args[i]); 
      listURIs(args[i]);
    } // end for
  
  } // end main

} // end XLinkSpider

Handling Namespaces

JDOM is fully namespace aware
Namespaces are represented by instances of the Namespace class rather than by attributes or raw strings
Always ask for elements and attributes by local names and namespace URIs
Elements and attributes that are not in any namespace can be asked for by local name alone
Never identify an element or attribute by qualified name
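
A sketch of why local name plus namespace URI is the right key. To stay runnable with nothing but the JDK, it uses the namespace-aware DOM parser bundled with Java rather than JDOM; that substitution, the class name NamespaceLookup, and the sample document are assumptions of this example. The document binds the XLink namespace to the prefix xl, so a lookup by the qualified name xlink:href would miss it, while a lookup by URI and local name finds it.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class; uses the JDK's DOM, not JDOM.
final class NamespaceLookup {

  static final String XLINK = "http://www.w3.org/1999/xlink";

  // Counts elements that carry an href attribute in the XLink namespace,
  // whatever prefix the document happens to use
  static int countXLinks(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true); // off by default
      DocumentBuilder builder = factory.newDocumentBuilder();
      Document doc = builder.parse(new InputSource(new StringReader(xml)));
      NodeList all = doc.getElementsByTagName("*"); // every element
      int count = 0;
      for (int i = 0; i < all.getLength(); i++) {
        Element element = (Element) all.item(i);
        if (element.hasAttributeNS(XLINK, "href")) count++;
      }
      return count;
    }
    catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalArgumentException("Could not parse document", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<doc xmlns:xl='http://www.w3.org/1999/xlink'>"
        + "<a xl:href='http://www.jdom.org/'/>"
        + "<b href='not-an-xlink'/></doc>";
    System.out.println(countXLinks(xml)); // prints 1
  }
}
```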

The Namespace Class

Mostly for internal parser use
Occasionally useful for tasks like finding out whether a document contains any XLinks

The Namespace Class

package org.jdom;

public final class Namespace {

  public static final Namespace NO_NAMESPACE = new Namespace("", "");
  public static final Namespace XML_NAMESPACE = 
   new Namespace("xml", "http://www.w3.org/XML/1998/namespace");

  // factory methods
  public static Namespace getNamespace(String prefix, String uri) {}
  public static Namespace getNamespace(String uri) {}

  // getter methods
  public String  getPrefix() {}
  public String  getURI() {}

  // utility methods
  public boolean equals(Object ob) {}
  public String  toString() {}
  public int     hashCode() {}

}

DocType Nodes

Represents a document type declaration
Has no children

The DocType class

package org.jdom;

public class DocType implements Serializable, Cloneable {

    protected String elementName;
    protected String publicID;
    protected String systemID;

    protected DocType() {}
    public    DocType(String rootElementName, String publicID, String systemID) {}
    public    DocType(String rootElementName, String systemID) {}
    public    DocType(String rootElementName) {}

    public String  getElementName() {}
    public String  getPublicID() {}
    public DocType setPublicID(String publicID) {}
    public String  getSystemID() {}
    public DocType setSystemID(String systemID) {}

    // Usual utility methods
    public final String  toString() {}
    public final boolean equals(Object ob) {}
    public final int     hashCode() {}
    public final Object  clone() {}
    
}

Example of the DocType Class

Verify that a document is correct XHTML
From the XHTML 1.0 spec:
1. It must validate against one of the three DTDs found in Appendix A.
2. The root element of the document must be <html>.
3. The root element of the document must designate the XHTML namespace using the xmlns attribute [XMLNAMES]. The namespace for XHTML is defined to be http://www.w3.org/1999/xhtml.
4. There must be a DOCTYPE declaration in the document prior to the root element. The public identifier included in the DOCTYPE declaration must reference one of the three DTDs found in Appendix A using the respective Formal Public Identifier. The system identifier may be changed to reflect local system conventions.
```
<!DOCTYPE html 
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "DTD/xhtml1-strict.dtd">

<!DOCTYPE html 
     PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "DTD/xhtml1-transitional.dtd">

<!DOCTYPE html 
     PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
     "DTD/xhtml1-frameset.dtd">
```

XHTMLValidator

import java.io.*;
import org.jdom.*;
import org.jdom.input.SAXBuilder;


public class JDOMXHTMLValidator {

  public static void main(String[] args) {
    
    for (int i = 0; i < args.length; i++) {
      validate(args[i]);
    }   
    
  }

  private static SAXBuilder builder = new SAXBuilder(true);
                                                 /*  ^^^^ */
                                              /* turn on validation  */
  
  // not thread safe
  public static void validate(String source) {
        
      Document document;
      try {
        document = builder.build(source); 
      }
      catch (JDOMException e) {  
        System.out.println("Error: " + e.getMessage()); 
        e.printStackTrace();
        return; 
      }
      
      // If we get this far, then the document is valid XML.
      // Check to see whether the document is actually XHTML        
      DocType doctype = document.getDocType();
    
      if (doctype == null) {
        System.out.println("No DOCTYPE"); 
        return;
      }

      String name     = doctype.getElementName();
      String systemID = doctype.getSystemID();
      String publicID = doctype.getPublicID();
      
      if (!name.equals("html")) {
        System.out.println("Incorrect root element name " + name); 
      }
    
      if (publicID == null
       || (!publicID.equals("-//W3C//DTD XHTML 1.0 Strict//EN")
           && !publicID.equals("-//W3C//DTD XHTML 1.0 Transitional//EN")
           && !publicID.equals("-//W3C//DTD XHTML 1.0 Frameset//EN"))) {
        System.out.println(source + " does not seem to use an XHTML 1.0 DTD");
      }
    
      // Check the namespace on the root element
      Element root = document.getRootElement();
      Namespace namespace = root.getNamespace();
      String prefix = namespace.getPrefix();
      String uri = namespace.getURI();
      if (!uri.equals("http://www.w3.org/1999/xhtml")) {
        System.out.println(source 
         + " does not properly declare the"
         + " http://www.w3.org/1999/xhtml namespace"
         + " on the root element");        
      }
      if (!prefix.equals("")) {
        System.out.println(source 
         + " does not use the empty prefix for XHTML");        
      }
    
  }

}

Using the XHTMLValidator

% java JDOMXHTMLValidator http://www.w3.org/TR/xhtml1
Error: File "http://www.w3.org/TR/DTD/xhtml1-strict.dtd" not found.: Error on 
line -1 of XML document: File "http://www.w3.org/TR/DTD/xhtml1-strict.dtd" not 
found.
org.jdom.JDOMException: File "http://www.w3.org/TR/DTD/xhtml1-strict.dtd" not 
found.: Error on line -1 of XML document: File 
"http://www.w3.org/TR/DTD/xhtml1-strict.dtd" not found.
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:227)
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:359)
        at XHTMLValidator.validate(XHTMLValidator.java:25)
        at XHTMLValidator.main(XHTMLValidator.java:11)
Root cause: org.jdom.JDOMException: Error on line -1 of XML document: File 
"http://www.w3.org/TR/DTD/xhtml1-strict.dtd" not found.
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:228)
        at org.jdom.input.SAXBuilder.build(SAXBuilder.java:359)
        at XHTMLValidator.validate(XHTMLValidator.java:25)
        at XHTMLValidator.main(XHTMLValidator.java:11)

The Verifier Class

Checks a variety of strings to see if they're legal for particular uses in XML as specified by XML 1.0 and Namespaces in XML.
Mostly for internal parser use
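
A rough idea of what such a check involves, written in plain Java. The class CharCheck is hypothetical and is not JDOM's Verifier; the null-means-legal return value is an assumption. The ranges come from the Char production of XML 1.0. Because the sketch tests one Java char at a time, it rejects surrogates and therefore does not handle characters above U+FFFF.

```java
// Hypothetical sketch, not JDOM's Verifier.
final class CharCheck {

  // XML 1.0 Char production, limited to what fits in a single Java char:
  // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
  static boolean isXMLCharacter(char c) {
    if (c == 0x9 || c == 0xA || c == 0xD) return true;
    if (c >= 0x20 && c <= 0xD7FF) return true;
    return c >= 0xE000 && c <= 0xFFFD;
  }

  // Returns null if every character is legal, otherwise a reason
  static String checkCharacterData(String text) {
    if (text == null) return "Text is null";
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (!isXMLCharacter(c)) {
        return "0x" + Integer.toHexString(c) + " is not a legal XML character";
      }
    }
    return null;
  }

  public static void main(String[] args) {
    System.out.println(checkCharacterData("Hot Cop"));
    System.out.println(checkCharacterData("bell" + (char) 7));
  }
}
```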

The Verifier Class

package org.jdom;

public final class Verifier {

    public static final String checkElementName(String name) {}
    public static final String checkAttributeName(String name) {}
    public static final String checkCharacterData(String text) {}
    public static final String checkNamespacePrefix(String prefix) {}
    public static final String checkNamespaceURI(String uri) {}
    public static final String checkProcessingInstructionTarget(String target) {}
    public static final String checkCommentData(String data) {}
 
    public static boolean isXMLCharacter(char c) {}
    public static boolean isXMLNameCharacter(char c) {}
    public static boolean isXMLNameStartCharacter(char c) {}
    public static boolean isXMLLetterOrDigit(char c) {}
    public static boolean isXMLLetter(char c) {}
    public static boolean isXMLCombiningChar(char c) {}
    public static boolean isXMLExtender(char c) {}
    public static boolean isXMLDigit(char c) {}

}

JDOMException

A checked exception so you must catch it
Wraps other exceptions that are thrown during JDOM operations like IOException or SAXException
Root cause of exception (if any) is accessible through the getRootCause() method:
public Throwable getRootCause()
Subclasses:
- DataConversionException
- NoSuchAttributeException
- NoSuchChildException
- NoSuchProcessingInstructionException
IllegalArgumentException subclasses:
- IllegalAddException
- IllegalDataException
- IllegalNameException
- IllegalTargetException
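
The wrapping pattern can be sketched without JDOM. WrappingException below is a hypothetical stand-in for a checked exception that carries a root cause; the class names, method names, and messages are assumptions made for the example. The caller checks for a null root cause before using it, since not every exception wraps another.

```java
import java.io.IOException;

// Hypothetical sketch of a checked exception that wraps a root cause.
final class RootCauseDemo {

  static class WrappingException extends Exception {
    private final Throwable rootCause;

    WrappingException(String message, Throwable rootCause) {
      super(message);
      this.rootCause = rootCause;
    }

    // May return null if nothing was wrapped
    Throwable getRootCause() {
      return rootCause;
    }
  }

  // Simulates a low-level failure and rethrows it wrapped
  static void load(String name) throws WrappingException {
    try {
      throw new IOException("cannot read " + name);
    }
    catch (IOException e) {
      throw new WrappingException("Error loading " + name, e);
    }
  }

  static String describe(String name) {
    try {
      load(name);
      return "ok";
    }
    catch (WrappingException e) {
      Throwable cause = e.getRootCause();
      if (cause == null) return e.getMessage();
      return e.getMessage() + ": " + cause.getMessage();
    }
  }

  public static void main(String[] args) {
    // prints "Error loading missing.xml: cannot read missing.xml"
    System.out.println(describe("missing.xml"));
  }
}
```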

JDOMException Class

package org.jdom;

public class JDOMException extends Exception {

    protected Throwable cause;

    public JDOMException() {}
    public JDOMException(String message)  {}
    public JDOMException(String message, Throwable rootCause)  {} 
       
    public String    getMessage() {}
    public void      printStackTrace() {}
    public void      printStackTrace(PrintStream s) {}
    public void      printStackTrace(PrintWriter w) {}
    public Throwable getCause()  {}

}

The org.jdom.output Package

DOMOutputter
SAXOutputter
XMLOutputter

Serialization

The process of taking an in-memory JDOM Document and converting it to a stream of characters that can be written onto an output stream
The org.jdom.output.XMLOutputter class

XMLOutputter

package org.jdom.output;

public class XMLOutputter implements Cloneable {

    protected static final String STANDARD_INDENT = "  ";
    
    public XMLOutputter() {}
    public XMLOutputter(String indent) {}
    public XMLOutputter(String indent, boolean newlines) {}
    public XMLOutputter(String indent, boolean newlines, String encoding) {}
    public XMLOutputter(XMLOutputter that) {}
    
    public void setLineSeparator(String separator) {}
    public void setNewlines(boolean newlines) {}
    public void setEncoding(String encoding) {}
    public void setOmitEncoding(boolean omitEncoding) {}
    public void setOmitDeclaration(boolean omitDeclaration) {}
    public void setExpandEmptyElements(boolean expandEmptyElements) {}
    public void setIndent(String indent) {}
    public void setIndent(boolean doIndent) {}
    public void setIndentSize(int indentSize) {}
    public void setTextNormalize(boolean textNormalize) {}

    protected String escapeAttributeEntities(String s) {} 
    protected String escapeElementEntities(String s) {}

    protected void indent(Writer out, int level) throws IOException {}
    protected void maybePrintln(Writer out) throws IOException  {}
    protected Writer makeWriter(OutputStream out) 
     throws java.io.UnsupportedEncodingException {}
    protected Writer makeWriter(OutputStream out, String encoding) 
     throws java.io.UnsupportedEncodingException {}
     
    public void output(Document doc, OutputStream out) throws IOException {}
    public void output(Document doc, Writer writer) throws IOException {}
    public void output(Element element, Writer out) throws IOException {}
    public void output(Element element, OutputStream out) throws IOException {}
    public void output(CDATA cdata, Writer out) throws IOException {}
    public void output(CDATA cdata, OutputStream out) throws IOException {}
    public void output(Comment comment, Writer out) throws IOException {}
    public void output(Comment comment, OutputStream out) throws IOException {}
    public void output(String string, Writer out) throws IOException {}
    public void output(String string, OutputStream out) throws IOException {}
    public void output(EntityRef entity, Writer out) throws IOException {}
    public void output(EntityRef entity, OutputStream out) throws IOException {}
    public void output(ProcessingInstruction processingInstruction, Writer out)
      throws IOException {}
    public void output(ProcessingInstruction processingInstruction, OutputStream out)
     throws IOException {}
     
    public void outputElementContent(Element element, OutputStream out)
     throws IOException {}
    public void outputElementContent(Element element, Writer out)
     throws IOException {}

    public String outputString(Document doc) throws IOException {}
    public String outputString(Element element) throws IOException {}
    public String outputString(CDATA cdata) {}
    public String outputString(Comment comment) {}
    public String outputString(DocType doctype) {}
    public String outputString(EntityRef entity) {}
    public String outputString(ProcessingInstruction pi) {}

    // internal printing methods
    protected void printDeclaration(Document doc, Writer out, String encoding) 
     throws IOException {}    
    protected void printDocType(DocType docType, Writer out) throws IOException {}
    protected void printComment(Comment comment, Writer out, int indentLevel) 
     throws IOException {}
    protected void printProcessingInstruction(ProcessingInstruction pi,
     Writer out, int indentLevel) throws IOException {}
    protected void printCDATASection(CDATA cdata, Writer out, int indentLevel) 
     throws IOException {}
    protected void printElement(Element element, Writer out,
     int indentLevel, NamespaceStack namespaces) throws IOException {}
    protected void printElementContent(Element element, Writer out,
     int indentLevel, NamespaceStack namespaces, List mixedContent) 
     throws IOException {}
    protected void printString(String s, Writer out) throws IOException {}
    protected void printEntity(EntityRef entity, Writer out) throws IOException {}
    protected void printNamespace(Namespace ns, Writer out) throws IOException {}
    protected void printAttributes(List attributes, Element parent, 
     Writer out, NamespaceStack namespaces)  
     throws IOException {}
    
    public int parseArgs(String[] args, int i) {} 
    
}

Using the XMLOutputter Class Directly

Configured with three arguments passed to the constructor:

indent
a String added at each level of output; e.g. two spaces or a tab

newlines
a boolean; whether to break lines in the output. The String used to break lines is set separately with setLineSeparator(); no line breaking is performed if that String is null or the empty string

encoding
the name of the encoding to use for output; e.g. UTF-16 or ISO-8859-1

Options can be set with these 10 methods:

    public void setLineSeparator(String separator) {}
    public void setNewlines(boolean newlines) {}
    public void setEncoding(String encoding) {}
    public void setOmitEncoding(boolean omitEncoding) {}
    public void setOmitDeclaration(boolean omitDeclaration) {}
    public void setExpandEmptyElements(boolean expandEmptyElements) {}
    public void setIndent(String indent) {}
    public void setIndent(boolean doIndent) {}
    public void setIndentSize(int indentSize) {}
    public void setTextNormalize(boolean textNormalize) {}

The output() method writes a Document onto a given OutputStream or Writer:

  public void output(Document doc, OutputStream out) throws IOException {}
  public void output(Document doc, Writer writer) throws IOException {}

There are also output() methods for other JDOM classes:

  public void output(Element element, Writer out) throws IOException {}
  public void output(Element element, OutputStream out) throws IOException {}
  public void outputElementContent(Element element, Writer out) throws IOException {}
  public void output(CDATA cdata, Writer out) throws IOException {}
  public void output(CDATA cdata, OutputStream out) throws IOException {}
  public void output(Comment comment, Writer out) throws IOException {}
  public void output(Comment comment, OutputStream out) throws IOException {}
  public void output(String string, Writer out) throws IOException {}
  public void output(String string, OutputStream out) throws IOException {}
  public void output(EntityRef entity, Writer out) throws IOException {}
  public void output(EntityRef entity, OutputStream out) throws IOException {}
  public void output(ProcessingInstruction processingInstruction, Writer out)
    throws IOException {}
  public void output(ProcessingInstruction processingInstruction, OutputStream out)
   throws IOException {}
  public String outputString(Document doc) throws IOException {}
  public String outputString(Element element) throws IOException {}

Use the outputString() methods to store a document in a string

Using the XMLOutputter Class Indirectly

Configured by overriding protected methods:

  protected void printDeclaration(Document doc, Writer out, String encoding) 
  throws IOException {}    
  protected void printDocType(DocType docType, Writer out) throws IOException {}
  protected void printComment(Comment comment, Writer out, int indentLevel) 
   throws IOException {}
  protected void printProcessingInstruction(ProcessingInstruction pi,
   Writer out, int indentLevel) throws IOException {}
  protected void printCDATASection(CDATA cdata, Writer out, int indentLevel) 
   throws IOException {}
  protected void printElement(Element element, Writer out,
   int indentLevel, NamespaceStack namespaces) throws IOException {}
  protected void printElementContent(Element element, Writer out,
   int indentLevel, NamespaceStack namespaces, List mixedContent) 
   throws IOException {}
  protected void printString(String s, Writer out) throws IOException {}
  protected void printEntity(EntityRef entity, Writer out) throws IOException {}
  protected void printNamespace(Namespace ns, Writer out) throws IOException {}
  protected void printAttributes(List attributes, Element parent, 
   Writer out, NamespaceStack namespaces)  
   throws IOException {}

JDOM based TagStripper

A bug in the current version of JDOM prevents this from working.

import org.jdom.*;
import org.jdom.output.XMLOutputter;
import org.jdom.input.SAXBuilder;
import java.io.*;
import java.util.*;


public class TagStripper extends XMLOutputter {

  public TagStripper() {
    super();
  }

  // Things we won't print at all
  protected void printDeclaration(Document doc, Writer out, String encoding) {}
  protected void printComment(Comment comment, Writer out, int indentLevel) {}
  protected void printDocType(DocType docType, Writer out) {}
  protected void printProcessingInstruction(ProcessingInstruction pi, 
   Writer out, int indentLevel) {}
  protected void printNamespace(Namespace ns, Writer out) {}
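  // signature must match the superclass method to override it
  protected void printAttributes(List attributes, Element parent, 
   Writer out, NamespaceStack namespaces) {}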
  
  protected void printElement(Element element, Writer out, 
   int indentLevel, NamespaceStack namespaces) throws IOException {
    
    List content = element.getMixedContent();
    Iterator iterator = content.iterator();
    while (iterator.hasNext()) {
      Object o = iterator.next();
      if (o instanceof String) {
        out.write((String) o);
        this.maybePrintln(out);
      }
      else if (o instanceof Element) {
        printElement((Element) o, out, indentLevel, namespaces);
      }
    }
          
  }

  // Could easily have put main() method in a separate class
  public static void main(String[] args) {
     
    if (args.length == 0) {
      System.out.println(
       "Usage: java TagStripper URL1 URL2..."); 
    } 
      
    TagStripper stripper = new TagStripper();
    SAXBuilder builder   = new SAXBuilder();
    
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      // command line should offer URIs or file names
      try {
        Document doc = builder.build(args[i]);
        stripper.output(doc, System.out);
      }
      catch (JDOMException e) { // a well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage());
      }
      catch (IOException e) { // an I/O error while reading or writing
        System.out.println(e.getMessage());
      }
      
    }  
  
  }

}

Output from a JDOM based TagStripper

% java TagStripper hotcop.xml
Hot Cop
Jacques Morali
Henri Belolo
Victor Willis
Jacques Morali
A & M Records
6:20
1978
Village People

Talking to DOM Programs

The process of taking an in-memory JDOM Document and converting it to an org.w3c.dom.Document object

The org.jdom.output.DOMOutputter class:

package org.jdom.output;

public class DOMOutputter {

  // Constructors
  public DOMOutputter() {}

  // Outputter methods
  public org.w3c.dom.Document output(Document document) {}
  public org.w3c.dom.Element  output(Element element) {}
  public org.w3c.dom.Element  output(Element element, String domAdapterClass) {}
  public org.w3c.dom.Document output(Document document, String domAdapterClass) {}

  // utility methods
  protected void buildDOMTree(Object content, org.w3c.dom.Document doc, 
   org.w3c.dom.Element current, boolean atRoot, LinkedList namespaces) {}
  public String getXmlnsTagFor(Namespace ns) {}
    
}

Talking to SAX Programs

The process of taking an in-memory JDOM Document and walking its tree while firing off SAX events
The org.jdom.output.SAXOutputter class
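
The idea of walking a tree and firing SAX events can be sketched with the JDK alone. This is not SAXOutputter: the class TreeToSax is hypothetical, it walks a JDK DOM tree instead of a JDOM tree, and it deliberately ignores attributes, namespaces, comments, CDATA sections, and processing instructions. Only the shape of the traversal is the point.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: walks a JDK DOM tree (not a JDOM tree) and fires SAX events.
final class TreeToSax {

  // Depth-first walk; each node is reported to the handler as it is visited
  static void fire(Node node, ContentHandler handler) throws SAXException {
    short type = node.getNodeType();
    if (type == Node.DOCUMENT_NODE) {
      handler.startDocument();
      fireChildren(node, handler);
      handler.endDocument();
    }
    else if (type == Node.ELEMENT_NODE) {
      String name = node.getNodeName();
      // attributes and namespaces are ignored in this sketch
      handler.startElement("", name, name, new AttributesImpl());
      fireChildren(node, handler);
      handler.endElement("", name, name);
    }
    else if (type == Node.TEXT_NODE) {
      char[] text = node.getNodeValue().toCharArray();
      handler.characters(text, 0, text.length);
    }
    // all other node types are skipped
  }

  private static void fireChildren(Node parent, ContentHandler handler)
   throws SAXException {
    for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
      fire(c, handler);
    }
  }

  // Parses the XML, replays it as SAX events, and records the events seen
  static String trace(String xml) {
    final StringBuilder events = new StringBuilder();
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      fire(doc, new DefaultHandler() {
        public void startElement(String uri, String local, String qName,
         Attributes atts) {
          events.append("start:").append(qName).append(' ');
        }
        public void endElement(String uri, String local, String qName) {
          events.append("end:").append(qName).append(' ');
        }
        public void characters(char[] ch, int start, int length) {
          events.append("text:").append(new String(ch, start, length)).append(' ');
        }
      });
    }
    catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalArgumentException("Could not process document", e);
    }
    return events.toString().trim();
  }

  public static void main(String[] args) {
    // prints "start:song start:title text:Hot Cop end:title end:song"
    System.out.println(trace("<song><title>Hot Cop</title></song>"));
  }
}
```

Any existing ContentHandler can be passed to fire(), which is what lets a tree-based program feed code written for SAX.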

What JDOM doesn't do

Documents larger than available memory
Byte-for-byte faithful round trips
DTDs
XPath Queries (may be added in 1.1)

To Learn More

JavaWorld: http://javaworld.com/javaworld/jw-05-2000/jw-0518-jdom.html
JDOM Web Site, http://www.jdom.org/
Java and XML, 2nd Edition, Brett McLaughlin, O'Reilly & Associates, 2001, ISBN 0-596-00197-5, http://www.oreilly.com/catalog/javaxml/

Part VI: dom4j

Created by James Strachan

To Learn More

dom4j Web site: http://dom4j.org

Part VII: TRAX

The TRansformation API for XSLT
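
A minimal sketch of the API in use, assuming the javax.xml.transform package in which it ships with the JDK. No stylesheet is supplied, so the Transformer performs an identity transformation and simply copies the input to the output. The class name IdentityTransform and the sample document are made up for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example class showing an identity transformation.
final class IdentityTransform {

  static String copy(String xml) {
    try {
      // newTransformer() with no stylesheet copies source to result
      Transformer transformer = TransformerFactory.newInstance().newTransformer();
      transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
      StringWriter out = new StringWriter();
      transformer.transform(new StreamSource(new StringReader(xml)),
                            new StreamResult(out));
      return out.toString();
    }
    catch (TransformerException e) {
      throw new IllegalArgumentException("Could not transform input", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(copy("<song><title>Hot Cop</title></song>"));
  }
}
```

To apply a real stylesheet, pass a Source for it to newTransformer(); the rest of the code stays the same.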

To Learn More

This presentation: http://www.ibiblio.org/xml/slides/sd2001east/xmlandjava/
XML in a Nutshell
- Elliotte Rusty Harold and Scott Means
- O'Reilly & Associates, 2001
- ISBN: 0-596-00058-8

Surname	FirstName	Team	Position	Games Played	Games Started	AtBats	Runs	Hits	Doubles	Triples	Home runs	RBI	Stolen Bases	Caught Stealing	Sacrifice Hits	Sacrifice Flies	Errors	PB	Walks	Strike outs	Hit by pitch
Anderson	Garret	ANA	Outfield	156	151	622	62	183	41	7	15	79	8	3	3	3	6	0	29	80	1
Baughman	Justin	ANA	Second Base	62	54	196	24	50	9	1	1	20	10	4	5	3	8	0	6	36	1
Bolick	Frank	ANA	Third Base	21	11	45	3	7	2	0	1	2	0	0	0	0	0	0	11	8	0
Disarcina	Gary	ANA	Shortstop	157	155	551	73	158	39	3	3	56	12	7	12	3	14	0	21	51	8
Edmonds	Jim	ANA	Outfield	154	150	599	115	184	42	1	25	91	7	5	1	1	5	0	57	114	1
Erstad	Darin	ANA	Outfield	133	129	537	84	159	39	3	19	82	20	6	1	3	3	0	43	77	6
Garcia	Carlos	ANA	Second Base	19	10	35	4	5	1	0	0	0	2	0	1	0	1	0	3	11	1
Glaus	Troy	ANA	Third Base	48	45	165	19	36	9	0	1	23	1	0	0	2	7	0	15	51	0
Greene	Todd	ANA	Outfield	29	15	71	3	18	4	0	1	7	0	0	0	0	0	0	2	20	0
Helfand	Eric	ANA	Catcher	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
Hollins	Dave	ANA	Third Base	101	98	363	60	88	16	2	11	39	11	3	2	2	17	0	44	69	7
Jefferies	Gregg	ANA	Outfield	19	18	72	7	25	6	0	1	10	1	0	0	0	0	0	0	5	0
Johnson	Mark	ANA	First Base	10	2	14	1	1	0	0	0	0	0	0	0	0	0	0	0	6	0
Kreuter	Chad	ANA	Catcher	96	74	252	27	63	10	1	2	33	1	0	5	1	9	5	33	49	3
Martin	Norberto	ANA	Second Base	79	50	195	20	42	2	0	1	13	3	1	3	2	4	0	6	29	0
Mashore	Damon	ANA	Outfield	43	24	98	13	23	6	0	2	11	1	0	1	0	0	0	9	22	3
Molina	Ben	ANA	Catcher	2	0	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
Nevin	Phil	ANA	Catcher	75	65	237	27	54	8	1	8	27	0	0	0	2	5	20	17	67	5
Obrien	Charlie	ANA	Catcher	62	58	175	13	45	9	0	4	18	0	0	3	3	4	1	10	33	2
Palmeiro	Orlando	ANA	Outfield	74	34	165	28	53	7	2	0	21	5	4	7	0	0	0	20	11	0
Pritchett	Chris	ANA	First Base	31	19	80	12	23	2	1	2	8	2	0	0	0	1	0	4	16	0
Salmon	Tim	ANA	Designated Hitter	136	130	463	84	139	28	1	26	88	0	1	0	10	2	0	90	100	3
Shipley	Craig	ANA	Third Base	77	32	147	18	38	7	1	2	17	0	4	4	1	3	0	5	22	5
Velarde	Randy	ANA	Second Base	51	50	188	29	49	13	1	4	26	7	2	0	1	4	0	34	42	1
Walbeck	Matt	ANA	Catcher	108	91	338	41	87	15	2	6	46	1	1	5	5	7	8	30	68	2
Williams	Reggie	ANA	Outfield	29	7	36	7	13	1	0	1	5	3	3	1	0	0	0	7	11	1