A Brief Review of XML Rules and Terminology
Reading XML through the DOM
Writing XML through the DOM
Active Standards
XML 1.0
Namespaces in XML
XPath
Under development:
Document Object Model Level 2
SAX 2.0
XML Information Set
Canonical XML
Deprecated and obsolete:
Document Object Model Level 1
SAX 1.0
You need a JDK
You need some free class libraries
You need a text editor
You need some data to process
Are familiar with Java including I/O, classes, objects, polymorphism, etc.
Know XML including well-formedness, validity, namespaces, and so forth
I will briefly review proper terminology
Markup includes:
Tags
Entity References
Comments
Processing Instructions
Document Type Declarations
XML Declaration
CDATA Section Delimiters
Character data includes everything else
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/css" href="song.css"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<SONG xmlns="http://metalab.unc.edu/xml/namespace/song"
xmlns:xlink="http://www.w3.org/1999/xlink">
<TITLE>Hot Cop</TITLE>
<PHOTO
xlink:type="simple" xlink:show="onLoad" xlink:href="hotcop.jpg"
ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200"/>
<COMPOSER>Jacques Morali</COMPOSER>
<COMPOSER>Henri Belolo</COMPOSER>
<COMPOSER>Victor Willis</COMPOSER>
<PRODUCER>Jacques Morali</PRODUCER>
<!-- The publisher is actually Polygram but I needed
an example of a general entity reference. -->
<PUBLISHER xlink:type="simple" xlink:href="http://www.amrecords.com/">
A &amp; M Records
</PUBLISHER>
<LENGTH>6:20</LENGTH>
<YEAR>1978</YEAR>
<ARTIST>Village People</ARTIST>
</SONG>
<!-- You can tell what album I was
listening to when I wrote this example -->
An XML document is made up of one or more physical storage units called entities
Entity references :
Parsed internal general entity references like &amp;
Parsed external general entity references
Unparsed external general entity references
External parameter entity references
Internal parameter entity references
Reading an XML document is not the same thing as reading an XML file
The file contains entity references.
The document contains the entities' replacement text.
When you use a parser to read a document you'll get the text including characters like <. You will not see the entity references.
Character data left after entity references are replaced with their text
Given the element
<PUBLISHER>A &amp; M Records</PUBLISHER>
The parsed character data is
A & M Records
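The replacement of entity references can be observed directly. The following is a minimal sketch, assuming the JAXP DocumentBuilderFactory bundled with the JDK rather than the Xerces DOMParser used later in these notes, and the DOM Level 3 getTextContent() method; the class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ParsedTextDemo {

  // Parses an XML string and returns the character data of its root element
  // as the parser reports it, after entity references have been replaced.
  static String rootText(String xml) throws Exception {
    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document document = builder.parse(new InputSource(new StringReader(xml)));
    // getTextContent() (DOM Level 3) concatenates all text descendants
    return document.getDocumentElement().getTextContent();
  }

  public static void main(String[] args) throws Exception {
    // The source text contains the entity reference &amp;
    String xml = "<PUBLISHER>A &amp; M Records</PUBLISHER>";
    System.out.println(rootText(xml)); // prints: A & M Records
  }
}
```

The program never sees the five characters of the entity reference, only the single ampersand they stand for.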
Used to include large blocks of text with lots of normally illegal literal characters like < and &, typically XML or HTML.
<p>You can use a default <code>xmlns</code> attribute to avoid having to add the svg prefix to all your elements:</p>
<![CDATA[
<svg xmlns="http://www.w3.org/Graphics/SVG/SVG-19991203.dtd"
width="12cm" height="10cm">
<ellipse rx="110" ry="130" />
<rect x="4cm" y="1cm" width="3cm" height="6cm" />
</svg>
]]>
CDATA is for human authors, not for programs!
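One way to see why CDATA matters only to authors: a program reading the document gets the same text whether the author used a CDATA section or escaped each character. This is a small sketch under the assumption that the JDK's JAXP parser and the DOM Level 3 getTextContent() method are available; CDataDemo and rootText are illustrative names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CDataDemo {

  // Returns the text content of the root element of the given XML string
  static String rootText(String xml) throws Exception {
    Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)));
    return document.getDocumentElement().getTextContent();
  }

  public static void main(String[] args) throws Exception {
    // Same content, written once with a CDATA section and once with escapes
    String withCData = "<p><![CDATA[<rect x=\"4cm\"/> & more]]></p>";
    String escaped = "<p>&lt;rect x=\"4cm\"/&gt; &amp; more</p>";

    System.out.println(rootText(withCData));
    System.out.println(rootText(withCData).equals(rootText(escaped))); // prints: true
  }
}
```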
<!-- Before posting this page, I need to double check the number
of pelicans in Louisiana in 1970 -->
Comments are for humans, not programs.
Divided into a target and data for the target
The target must be an XML name
The data can have an effectively arbitrary format
<?robots index="yes" follow="no"?>
<?xml-stylesheet href="pelicans.css" type="text/css"?>
<?php
mysql_connect("database.unc.edu", "clerk", "password");
$result = mysql("CYNW", "SELECT LastName, FirstName FROM Employees
ORDER BY LastName, FirstName");
$i = 0;
while ($i < mysql_numrows ($result)) {
$fields = mysql_fetch_row($result);
echo "<person>$fields[1] $fields[0] </person>\r\n";
$i++;
}
mysql_close();
?>
These are for programs
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
Looks like a processing instruction but isn't.
version attribute: required; always has the value 1.0
encoding attribute: UTF-8, ISO-8859-1, etc.
standalone attribute: yes or no
<!DOCTYPE SONG SYSTEM "song.dtd">
<!ELEMENT SONG (TITLE, COMPOSER+, PRODUCER*, PUBLISHER*, YEAR?, LENGTH?, ARTIST+)>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT COMPOSER (#PCDATA)>
<!ELEMENT PRODUCER (#PCDATA)>
<!ELEMENT PUBLISHER (#PCDATA)>
<!ELEMENT LENGTH (#PCDATA)>
<!-- This should be a four digit year like "1999",
not a two-digit year like "99" -->
<!ELEMENT YEAR (#PCDATA)>
<!ELEMENT ARTIST (#PCDATA)>
Used for element, attribute, and entity names
Can contain any alphabetic, ideographic, or numeric Unicode character
Can contain hyphen, underscore, or period
Can also contain colons but these are reserved for namespaces
Can begin with any alphabetic or ideographic character or the underscore but not digits or other punctuation marks
Raison d'etre:
To distinguish between elements and attributes from different vocabularies with different meanings.
To group all related elements and attributes together so that a parser can easily recognize them.
Each element is given a prefix
Each prefix (as well as the empty prefix) is associated with a URI
Elements with the same URI are in the same namespace
URIs are purely formal. They do not necessarily point to a page.
Elements and attributes that are in namespaces have names that contain exactly one colon. They look like this:
rdf:description
xlink:type
xsl:template
Everything before the colon is called the prefix
Everything after the colon is called the local part or local name.
The complete name including the colon is called the qualified name or raw name.
Each prefix in a qualified name is associated with a URI.
For example, all elements in XSLT 1.0 style sheets are associated with the http://www.w3.org/1999/XSL/Transform URI.
The customary prefix xsl is a shorthand for the longer URI http://www.w3.org/1999/XSL/Transform.
You can't use the URI in the element name directly.
Prefixes are bound to namespace URIs by attaching an xmlns:prefix attribute to the prefixed element or one of its ancestors.
<svg:svg xmlns:svg="http://www.w3.org/Graphics/SVG/SVG-19991203.dtd"
width="12cm" height="10cm">
<svg:ellipse rx="110" ry="130" />
<svg:rect x="4cm" y="1cm" width="3cm" height="6cm" />
</svg:svg>
Bindings have scope within the element where they're declared.
An SVG processor can recognize all three of these elements as SVG elements because they all have prefixes bound to the particular URI defined by the SVG specification.
Indicate that an unprefixed element and all its unprefixed descendant
elements belong to a particular namespace by attaching an xmlns
attribute with no prefix:
<DATASCHEMA xmlns="http://www.w3.org/2000/P3Pv1">
<DATA name="vehicle.make" type="text" short="Make"
category="preference" size="31"/>
<DATA name="vehicle.model" type="text" short="Model"
category="preference" size="31"/>
<DATA name="vehicle.year" type="number" short="Year"
category="preference" size="4"/>
<DATA name="vehicle.license.state." type="postal." short="State"
category="preference" size="2"/>
<DATA name="vehicle.license.number" type="text"
short="License Plate Number" category="preference" size="12"/>
</DATASCHEMA>
Both the DATASCHEMA and DATA elements are in the http://www.w3.org/2000/P3Pv1 namespace.
Default namespaces apply only to elements, not to attributes.
Thus in the above example the name, type, short, category, and size attributes are not in any namespace.
Unprefixed attributes are never in any namespace.
You can change the default namespace within a particular
element by adding an xmlns
attribute to the element.
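The rules for default namespaces can be checked with a namespace-aware parse. The sketch below assumes the JDK's JAXP parser (which must be told to be namespace aware) and uses a cut-down version of the P3P example above; NamespaceDemo and parse are names invented for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class NamespaceDemo {

  static Document parse(String xml) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true); // JAXP factories are not namespace aware by default
    return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
  }

  public static void main(String[] args) throws Exception {
    String xml = "<DATASCHEMA xmlns=\"http://www.w3.org/2000/P3Pv1\">"
               + "<DATA name=\"vehicle.make\" type=\"text\"/>"
               + "</DATASCHEMA>";
    Element root = parse(xml).getDocumentElement();
    Element data = (Element) root.getFirstChild();
    Attr name = data.getAttributeNode("name");

    // The unprefixed child element inherits the default namespace...
    System.out.println(data.getLocalName() + " is in " + data.getNamespaceURI());
    // ...but the unprefixed attribute does not; its namespace URI is null
    System.out.println(name.getName() + " is in " + name.getNamespaceURI());
  }
}
```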
Namespaces were added to XML 1.0 after the fact, but care was taken to ensure backwards compatibility.
An XML 1.0 parser that does not know about namespaces will most likely not have any troubles reading a document that uses namespaces.
A namespace aware parser also checks to see that all prefixes are mapped to URIs. Otherwise it behaves almost exactly like a non-namespace aware parser.
Other software that sits on top of the raw XML parser, an XSLT engine for example, may treat elements differently depending on what namespace they belong to. However, the XML parser itself mostly doesn't care as long as all well-formedness and namespace constraints are met.
A possible exception occurs in the unlikely event that elements with different prefixes belong to the same namespace or elements with the same prefix belong to different namespaces.
Many parsers have the option of whether to report namespace violations so that you can turn namespace processing on or off as you see fit.
A W3C standard for determining when two documents are the same after:
Entity references are resolved
Document is converted to Unicode
Unicode combining forms are combined
Comments are stripped
White space is normalized
Default attribute values are added
If at all possible, your programs should depend only on the canonical form of the document
Canonical form of hotcop.xml:
<?xml-stylesheet type="text/css" href="song.css"?><SONG>
<TITLE>Hot Cop</TITLE>
<COMPOSER>Jacques Morali</COMPOSER>
<COMPOSER>Henri Belolo</COMPOSER>
<COMPOSER>Victor Willis</COMPOSER>
<PRODUCER>Jacques Morali</PRODUCER>
<PUBLISHER>A &amp; M Records</PUBLISHER>
<LENGTH>6:20</LENGTH>
<YEAR>1978</YEAR>
<ARTIST>Village People</ARTIST>
</SONG>
An XML document is a tree.
It has a root.
It has nodes.
It is amenable to recursive processing.
Not all applications agree on what the root is.
Not all applications agree on what is and isn't a node.
Defines how XML and HTML documents are represented as objects in programs
Defined in IDL; thus language independent
HTML as well as XML
Writing as well as reading
More complete than SAX; covers everything except internal and external DTD subsets
DOM focuses more on the document; SAX focuses more on the parser.
DOM Level 0:
DOM Level 1, a W3C Standard
DOM Level 2, a W3C Standard
DOM Level 3:
Eight Modules:
Core: org.w3c.dom *
HTML: org.w3c.dom.html
Views: org.w3c.dom.views
StyleSheets: org.w3c.dom.stylesheets
CSS: org.w3c.dom.css
Events: org.w3c.dom.events *
Traversal: org.w3c.dom.traversal *
Range: org.w3c.dom.range
Only the core and traversal modules really apply to XML. The other six are for HTML.
* indicates Xerces support
A DOM application can use the hasFeature() method of the DOMImplementation interface to determine whether a module is supported or not.
XML Module: "XML"
HTML Module: "HTML"
Views Module: "Views"
StyleSheets Module: "StyleSheets"
CSS Module: "CSS"
CSS (extended interfaces) Module: "CSS2"
Events Module: "Events"
User Interface Events (UIEvent interface) Module: "UIEvents"
Mouse Events Module: "MouseEvents"
Mutation Events Module: "MutationEvents"
HTML Events Module: "HTMLEvents"
Traversal Module: "Traversal"
Range Module: "Range"
import org.apache.xerces.dom.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;

public class ModuleChecker {

  public static void main(String[] args) {

    // parser dependent
    DOMImplementation implementation = DOMImplementationImpl.getDOMImplementation();
    String[] features = {"XML", "HTML", "Views", "StyleSheets", "CSS",
     "CSS2", "Events", "UIEvents", "MouseEvents", "MutationEvents",
     "HTMLEvents", "Traversal", "Range"};
    for (int i = 0; i < features.length; i++) {
      if (implementation.hasFeature(features[i], "2.0")) {
        System.out.println("Implementation supports " + features[i]);
      }
      else {
        System.out.println("Implementation does not support " + features[i]);
      }
    }

  }

}
D:\speaking\SD2000 East\dom\examples>java ModuleChecker
Implementation supports XML
Implementation does not support HTML
Implementation does not support Views
Implementation does not support StyleSheets
Implementation does not support CSS
Implementation does not support CSS2
Implementation supports Events
Implementation does not support UIEvents
Implementation does not support MouseEvents
Implementation supports MutationEvents
Implementation does not support HTMLEvents
Implementation supports Traversal
Implementation does not support Range
Entire document is represented as a tree.
A tree contains nodes.
Some nodes may contain other nodes (depending on node type).
Each document node contains:
zero or one doctype nodes
one root element node
zero or more comment and processing instruction nodes
17 interfaces:
Attr
CDATASection
CharacterData
Comment
Document
DocumentFragment
DocumentType
DOMImplementation
Element
Entity
EntityReference
NamedNodeMap
Node
NodeList
Notation
ProcessingInstruction
Text
plus one exception: DOMException
Plus a bunch of HTML stuff in org.w3c.dom.html and other packages we will ignore
Apache XML Project's Xerces Java: http://xml.apache.org/xerces-j/index.html
IBM's XML for Java: http://www.alphaworks.ibm.com/formula/xml
Sun's Java API for XML http://java.sun.com/products/xml
Library specific code creates a parser
The parser parses the document and returns a DOM org.w3c.dom.Document object.
The entire document is stored in memory.
DOM methods and interfaces are used to extract data from this object
import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;

public class DOMParserMaker {

  public static void main(String[] args) {

    // This is simpler but less flexible than the SAX approach.
    // Perhaps a good creational design pattern is needed here?
    DOMParser parser = new DOMParser();

    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]);
        Document d = parser.getDocument();
        // work with the document...
      }
      catch (SAXException e) {
        System.err.println(e);
      }
      catch (IOException e) {
        System.err.println(e);
      }
    }

  }

}
package org.w3c.dom;
public interface Node {
// NodeType
public static final short ELEMENT_NODE = 1;
public static final short ATTRIBUTE_NODE = 2;
public static final short TEXT_NODE = 3;
public static final short CDATA_SECTION_NODE = 4;
public static final short ENTITY_REFERENCE_NODE = 5;
public static final short ENTITY_NODE = 6;
public static final short PROCESSING_INSTRUCTION_NODE = 7;
public static final short COMMENT_NODE = 8;
public static final short DOCUMENT_NODE = 9;
public static final short DOCUMENT_TYPE_NODE = 10;
public static final short DOCUMENT_FRAGMENT_NODE = 11;
public static final short NOTATION_NODE = 12;
public String getNodeName();
public String getNodeValue() throws DOMException;
public void setNodeValue(String nodeValue) throws DOMException;
public short getNodeType();
public Node getParentNode();
public NodeList getChildNodes();
public Node getFirstChild();
public Node getLastChild();
public Node getPreviousSibling();
public Node getNextSibling();
public NamedNodeMap getAttributes();
public Document getOwnerDocument();
public Node insertBefore(Node newChild, Node refChild) throws DOMException;
public Node replaceChild(Node newChild, Node oldChild) throws DOMException;
public Node removeChild(Node oldChild) throws DOMException;
public Node appendChild(Node newChild) throws DOMException;
public boolean hasChildNodes();
public Node cloneNode(boolean deep);
public void normalize();
public boolean supports(String feature, String version);
public String getNamespaceURI();
public String getPrefix();
public void setPrefix(String prefix) throws DOMException;
public String getLocalName();
}
package org.w3c.dom;
public interface NodeList {
public Node item(int index);
public int getLength();
}
Now we're really ready to read a document
import org.w3c.dom.*;

// Depth first search of DOM Tree
public abstract class NodeIterator {

  // note use of recursion
  public void followNode(Node node) {

    processNode(node);
    if (node.hasChildNodes()) {
      NodeList children = node.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        followNode(children.item(i));
      }
    }

  }

  // Override this method to do something as each node is visited
  protected abstract void processNode(Node node);

  // I could make processNode() a separate method in
  // a NodeProcessor interface, and make followNode static
  // but I wanted to keep this example simple.

}
import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;

public class NodeReporter extends NodeIterator {

  public static void main(String[] args) {

    DOMParser parser = new DOMParser();
    NodeIterator iterator = new NodeReporter();

    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]);
        Document d = parser.getDocument();
        iterator.followNode(d);
      }
      catch (SAXException e) {
        System.err.println(e);
      }
      catch (IOException e) {
        System.err.println(e);
      }
    }

  } // end main

  public void processNode(Node node) {
    String name = node.getNodeName();
    String type = getTypeName(node.getNodeType());
    System.out.println("Type " + type + ": " + name);
  }

  public static String getTypeName(int type) {

    switch (type) {
      case Node.ELEMENT_NODE: return "Element";
      case Node.ATTRIBUTE_NODE: return "Attribute";
      case Node.TEXT_NODE: return "Text";
      case Node.CDATA_SECTION_NODE: return "CDATA Section";
      case Node.ENTITY_REFERENCE_NODE: return "Entity Reference";
      case Node.ENTITY_NODE: return "Entity";
      case Node.PROCESSING_INSTRUCTION_NODE: return "processing Instruction";
      case Node.COMMENT_NODE : return "Comment";
      case Node.DOCUMENT_NODE: return "Document";
      case Node.DOCUMENT_TYPE_NODE: return "Document Type Declaration";
      case Node.DOCUMENT_FRAGMENT_NODE: return "Document Fragment";
      case Node.NOTATION_NODE: return "Notation";
      default: return "Unknown Type";
    }

  }

}
D:\speaking\SD2000 East\dom\examples>java NodeReporter hotcop.xml
Type Document: #document
Type processing Instruction: xml-stylesheet
Type Document Type Declaration: SONG
Type Element: SONG
Type Text: #text
Type Element: TITLE
Type Text: #text
Type Text: #text
Type Element: COMPOSER
Type Text: #text
Type Text: #text
Type Element: COMPOSER
Type Text: #text
Type Text: #text
Type Element: COMPOSER
Type Text: #text
Type Text: #text
Type Element: PRODUCER
Type Text: #text
Type Text: #text
Type Comment: #comment
Type Text: #text
Type Element: PUBLISHER
Type Text: #text
Type Text: #text
Type Text: #text
Type Text: #text
Type Element: LENGTH
Type Text: #text
Type Text: #text
Type Element: YEAR
Type Text: #text
Type Text: #text
Type Element: ARTIST
Type Text: #text
Type Text: #text
Type Comment: #comment
Attributes are missing from this output. They are not children of their elements in the tree. They are properties of element nodes, available through getAttributes().
Node Type | Node Value |
---|---|
element node | null |
attribute node | attribute value |
text node | text of the node |
CDATA section node | text of the section |
entity reference node | null |
entity node | null |
processing instruction node | content of the processing instruction, not including the target |
comment node | text of the comment |
document node | null |
document type declaration node | null |
document fragment node | null |
notation node | null |
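The table can be spot-checked by printing getNodeName() and getNodeValue() for every node in a tiny document. This is only a sketch: it assumes the JDK's JAXP parser instead of Xerces' DOMParser, and the NodeValueDemo class and its methods are hypothetical helpers, not part of any library.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeValueDemo {

  // Walks the tree depth first, recording "name = value" for each node
  static void collect(Node node, List<String> lines) {
    lines.add(node.getNodeName() + " = " + node.getNodeValue());
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collect(children.item(i), lines);
    }
  }

  static List<String> describe(String xml) throws Exception {
    Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)));
    List<String> lines = new ArrayList<>();
    collect(document, lines);
    return lines;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<?robots index=\"yes\" follow=\"no\"?>"
               + "<SONG><!--demo--><TITLE>Hot Cop</TITLE></SONG>";
    for (String line : describe(xml)) {
      System.out.println(line);
    }
  }
}
```

Document and element nodes report null values; the processing instruction, comment, and text nodes report their content.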
import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;

public class DOMTagStripper extends NodeIterator {

  public static void main(String[] args) {

    DOMParser parser = new DOMParser();
    NodeIterator iterator = new DOMTagStripper();

    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]);
        Document d = parser.getDocument();
        iterator.followNode(d);
      }
      catch (SAXException e) {
        System.err.println(e);
      }
      catch (IOException e) {
        System.err.println(e);
      }
    }

  } // end main

  public void processNode(Node node) {
    int type = node.getNodeType();
    if (type == Node.TEXT_NODE) {
      System.out.print(node.getNodeValue());
    }
  }

}
D:\speaking\SD2000 East\dom\examples>java DOMTagStripper hotcop.xml
Hot Cop
Jacques Morali
Henri Belolo
Victor Willis
Jacques Morali
A & M Records
6:20
1978
Village People
The root node representing the entire document; not the same as the root element
Contains:
one element node
zero or more processing instruction nodes
zero or more comment nodes
zero or one document type nodes
package org.w3c.dom;
public interface Document extends Node {
public DocumentType getDoctype();
public DOMImplementation getImplementation();
public Element getDocumentElement();
public Element createElement(String tagName) throws DOMException;
public Element createElementNS(String namespaceURI, String qualifiedName) throws DOMException;
public DocumentFragment createDocumentFragment();
public Text createTextNode(String data);
public Comment createComment(String data);
public CDATASection createCDATASection(String data) throws DOMException;
public ProcessingInstruction createProcessingInstruction(String target, String data)
throws DOMException;
public Attr createAttribute(String name) throws DOMException;
public Attr createAttributeNS(String namespaceURI, String qualifiedName) throws DOMException;
public EntityReference createEntityReference(String name) throws DOMException;
public NodeList getElementsByTagName(String tagname);
public NodeList getElementsByTagNameNS(String namespaceURI, String localName);
public Element getElementById(String elementId);
public Node importNode(Node importedNode, boolean deep) throws DOMException;
}
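The create methods in this interface are how a document is built from scratch. As a rough sketch only: the code below assumes JAXP's DocumentBuilder to obtain an empty Document and an identity Transformer to write it out, in place of the Xerces-specific serializer classes; SongBuilder and its helper methods are illustrative names.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class SongBuilder {

  // Builds <SONG><TITLE>...</TITLE><ARTIST>...</ARTIST></SONG> in memory
  static Document buildSong(String title, String artist) throws Exception {
    Document document = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().newDocument();
    Element song = document.createElement("SONG");
    document.appendChild(song); // becomes the root element
    song.appendChild(textElement(document, "TITLE", title));
    song.appendChild(textElement(document, "ARTIST", artist));
    return document;
  }

  // Nodes are always created by the Document that will own them
  static Element textElement(Document document, String name, String text) {
    Element element = document.createElement(name);
    element.appendChild(document.createTextNode(text));
    return element;
  }

  // Writes the tree back out as text using an identity transform
  static String serialize(Document document) throws Exception {
    Transformer identity = TransformerFactory.newInstance().newTransformer();
    identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    StringWriter out = new StringWriter();
    identity.transform(new DOMSource(document), new StreamResult(out));
    return out.toString();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(serialize(buildSong("Hot Cop", "Village People")));
  }
}
```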
http://static.userland.com/myUserLandMisc/currentStories.xml
We only want story text elements
import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;

public class HeadlineGrabber {

  public static void main(String[] args) {

    DOMParser parser = new DOMParser();

    try {
      // Read the entire document into memory
      parser.parse("http://static.userland.com/myUserLandMisc/currentStories.xml");
      Document d = parser.getDocument();
      NodeList headlines = d.getElementsByTagName("storyText");
      for (int i = 0; i < headlines.getLength(); i++) {
        NodeList storyText = headlines.item(i).getChildNodes();
        for (int j = 0; j < storyText.getLength(); j++) {
          Node textContent = storyText.item(j);
          System.out.print(textContent.getNodeValue());
        }
        System.out.println();
        System.out.println();
        System.out.flush();
      }
    }
    catch (SAXException e) {
      System.err.println(e);
    }
    catch (IOException e) {
      System.err.println(e);
    }

  } // end main

}
Represents a complete element including its start tag, end tag, and content
Contains:
Element nodes
ProcessingInstruction nodes
Comment nodes
Text nodes
CDATASection nodes
EntityReference nodes
package org.w3c.dom;
public interface Element extends Node {
public String getTagName();
public String getAttribute(String name);
public void setAttribute(String name, String value) throws DOMException;
public void removeAttribute(String name) throws DOMException;
public Attr getAttributeNode(String name);
public Attr setAttributeNode(Attr newAttr) throws DOMException;
public Attr removeAttributeNode(Attr oldAttr) throws DOMException;
public NodeList getElementsByTagName(String name);
public String getAttributeNS(String namespaceURI, String localName);
public void setAttributeNS(String namespaceURI, String qualifiedName, String value) throws DOMException;
public void removeAttributeNS(String namespaceURI, String localName) throws DOMException;
public Attr getAttributeNodeNS(String namespaceURI, String localName);
public Attr setAttributeNodeNS(Attr newAttr) throws DOMException;
public NodeList getElementsByTagNameNS(String namespaceURI, String localName);
}
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.IOException;
import org.apache.xml.serialize.*;

public class IDTagger extends NodeIterator {

  int id = 1;

  public void processNode(Node node) {

    if (node instanceof Element) {
      Element element = (Element) node;
      String currentID = element.getAttribute("ID");
      if (currentID == null || currentID.equals("")) {
        element.setAttribute("ID", "_" + id);
        id = id + 1;
      }
    }

  }

  public static void main(String[] args) {

    DOMParser parser = new DOMParser();
    NodeIterator iterator = new IDTagger();

    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]);
        Document document = parser.getDocument();
        iterator.followNode(document);

        // now we serialize the document...
        OutputFormat format = new OutputFormat(document);
        XMLSerializer serializer = new XMLSerializer(System.out, format);
        serializer.serialize(document);
      }
      catch (SAXException e) {
        System.err.println(e);
      }
      catch (IOException e) {
        System.err.println(e);
      }
    }

  } // end main

}
Represents things that are basically text holders
Super interface of Text, Comment, and CDATASection
package org.w3c.dom;
public interface CharacterData extends Node {
public String getData() throws DOMException;
public void setData(String data) throws DOMException;
public int getLength();
public String substringData(int offset, int count) throws DOMException;
public void appendData(String arg) throws DOMException;
public void insertData(int offset, String arg) throws DOMException;
public void deleteData(int offset, int count) throws DOMException;
public void replaceData(int offset, int count, String arg) throws DOMException;
}
import org.apache.xerces.parsers.DOMParser;
import org.apache.xml.serialize.*;
import org.w3c.dom.*;
import org.xml.sax.SAXException;
import java.io.IOException;

public class ROT13XML extends NodeIterator {

  public void processNode(Node node) {

    if (node instanceof CharacterData) {
      CharacterData text = (CharacterData) node;
      String data = text.getData();
      text.setData(rot13(data));
    }

  }

  // Rotates the ASCII letters by 13 places; leaves all other characters alone
  public static String rot13(String s) {

    StringBuffer result = new StringBuffer(s.length());
    for (int i = 0; i < s.length(); i++) {
      int c = s.charAt(i);
      if (c >= 'A' && c <= 'M') result.append((char) (c+13));
      else if (c >= 'N' && c <= 'Z') result.append((char) (c-13));
      else if (c >= 'a' && c <= 'm') result.append((char) (c+13));
      else if (c >= 'n' && c <= 'z') result.append((char) (c-13));
      else result.append((char) c);
    }
    return result.toString();

  }

  public static void main(String[] args) {

    DOMParser parser = new DOMParser();
    NodeIterator iterator = new ROT13XML();

    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]);
        Document document = parser.getDocument();
        iterator.followNode(document);

        // now we serialize the document...
        OutputFormat format = new OutputFormat(document);
        XMLSerializer serializer = new XMLSerializer(System.out, format);
        serializer.serialize(document);
      }
      catch (SAXException e) {
        System.err.println(e);
      }
      catch (IOException e) {
        System.err.println(e);
      }
    }

  } // end main

}
Represents the text content of an element or attribute
Contains only pure text, no markup
Parsers will return a single maximal text node for each contiguous run of pure text
Editing may change this
package org.w3c.dom;
public interface Text extends CharacterData {
public Text splitText(int offset) throws DOMException;
}
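How editing can break the "one maximal text node" rule, and how normalize() restores it, can be sketched as follows. The example assumes a Document created through JAXP; SplitTextDemo and childCounts are made-up names.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;

class SplitTextDemo {

  // Returns the number of children of a TITLE element before splitText(),
  // after splitText(), and after normalize().
  static int[] childCounts() throws Exception {
    Document document = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().newDocument();
    Element title = document.createElement("TITLE");
    document.appendChild(title);
    Text text = document.createTextNode("Hot Cop");
    title.appendChild(text);
    int before = title.getChildNodes().getLength();

    // "Hot" stays in the original node; " Cop" becomes its next sibling
    text.splitText(3);
    int afterSplit = title.getChildNodes().getLength();

    // normalize() merges adjacent text nodes back into one
    title.normalize();
    int afterNormalize = title.getChildNodes().getLength();

    return new int[] {before, afterSplit, afterNormalize};
  }

  public static void main(String[] args) throws Exception {
    int[] counts = childCounts();
    System.out.println(counts[0] + " " + counts[1] + " " + counts[2]); // prints: 1 2 1
  }
}
```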
Represents a document type declaration
Has no children
package org.w3c.dom;
public interface DocumentType extends Node {
public String getName();
public NamedNodeMap getEntities();
public NamedNodeMap getNotations();
public String getPublicId();
public String getSystemId();
public String getInternalSubset();
}
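A short sketch of reading these properties, assuming the JDK's JAXP parser and a document whose DOCTYPE has only an internal subset (so nothing has to be fetched); DoctypeDemo and doctypeOf are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.xml.sax.InputSource;

class DoctypeDemo {

  // Returns the document type declaration node, or null if there is no DOCTYPE
  static DocumentType doctypeOf(String xml) throws Exception {
    Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)));
    return document.getDoctype();
  }

  public static void main(String[] args) throws Exception {
    String xml = "<!DOCTYPE SONG [<!ELEMENT SONG (#PCDATA)>]><SONG>Hot Cop</SONG>";
    DocumentType doctype = doctypeOf(xml);
    System.out.println("Name: " + doctype.getName());
    // Both IDs are null here because the declaration has no external subset
    System.out.println("Public ID: " + doctype.getPublicId());
    System.out.println("System ID: " + doctype.getSystemId());
    System.out.println("Internal subset: " + doctype.getInternalSubset());
  }
}
```

Code that inspects the doctype should always allow for getDoctype() returning null, as the validator that follows does.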
Verify that a document is correct XHTML
From the XHTML 1.0 spec:
It must validate against one of the three DTDs found in Appendix A.
The root element of the document must be <html>.
The root element of the document must designate the XHTML namespace using the xmlns attribute [XMLNAMES]. The namespace for XHTML is defined to be http://www.w3.org/1999/xhtml.
There must be a DOCTYPE declaration in the document prior to the root element. The public identifier included in the DOCTYPE declaration must reference one of the three DTDs found in Appendix A using the respective Formal Public Identifier. The system identifier may be changed to reflect local system conventions.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "DTD/xhtml1-frameset.dtd">
import org.w3c.dom.*;
import org.apache.xerces.parsers.*;
import java.io.*;
import org.xml.sax.*;

public class XHTMLValidator {

  public static void main(String[] args) {

    for (int i = 0; i < args.length; i++) {
      validate(args[i]);
    }

  }

  private static DOMParser parser = new DOMParser();

  static {
    // turn on validation
    try {
      parser.setFeature("http://xml.org/sax/features/validation", true);
      parser.setErrorHandler(new ValidityErrorReporter());
    }
    catch (SAXNotRecognizedException e) {
      System.err.println(
       "Installed XML parser cannot validate; checking for well-formedness instead...");
    }
    catch (SAXNotSupportedException e) {
      System.err.println(
       "Cannot turn on validation here; checking for well-formedness instead...");
    }
  }

  // not thread safe
  public static void validate(String source) {

    try {
      try {
        parser.parse(source);
        // ValidityErrorReporter prints any validity errors detected
      }
      catch (SAXException e) {
        System.out.println(source + " is not well formed.");
        return;
      }

      // If we get this far, then the document is well-formed XML.
      // Check to see whether the document is actually XHTML
      Document document = parser.getDocument();
      DocumentType doctype = document.getDoctype();

      if (doctype == null) {
        System.out.println("No DOCTYPE");
        return;
      }

      String name = doctype.getName();
      String systemID = doctype.getSystemId();
      String publicID = doctype.getPublicId();

      if (!name.equals("html")) {
        System.out.println("Incorrect root element name " + name);
      }

      if (publicID == null
       || (!publicID.equals("-//W3C//DTD XHTML 1.0 Strict//EN")
        && !publicID.equals("-//W3C//DTD XHTML 1.0 Transitional//EN")
        && !publicID.equals("-//W3C//DTD XHTML 1.0 Frameset//EN"))) {
        System.out.println(source + " does not seem to use an XHTML 1.0 DTD");
      }

      // Check the namespace on the root element
      Element root = document.getDocumentElement();
      String xmlnsValue = root.getAttribute("xmlns");
      if (!xmlnsValue.equals("http://www.w3.org/1999/xhtml")) {
        System.out.println(source
         + " does not properly declare the http://www.w3.org/1999/xhtml namespace on the root element");
      }

      // get ready for the next parse
      parser.reset();

    }
    catch (IOException e) {
      System.err.println("Could not read " + source);
    }
    catch (Exception e) {
      System.err.println(e);
      e.printStackTrace();
    }

  }

}
Represents an entity reference like &amp; or &signature;
Optional: some parsers (including Xerces) just expand entities
Contains:
Element nodes
ProcessingInstruction nodes
Comment nodes
Text nodes
CDATASection nodes
EntityReference nodes
package org.w3c.dom;
public interface EntityReference extends Node {
}
Represents an attribute
Contains:
Text nodes
EntityReference nodes
package org.w3c.dom;
public interface Attr extends Node {
public String getName();
public boolean getSpecified();
public String getValue();
public void setValue(String value) throws DOMException;
public Element getOwnerElement();
}
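Attributes are reached through the element that owns them, not by walking the tree. The sketch below assumes a namespace-aware JAXP parser and a shortened PHOTO element from the Hot Cop example; AttrDemo and parseRoot are illustrative names. Note the argument order of getAttributeNS(): namespace URI first, local name second.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

class AttrDemo {

  static final String XLINK = "http://www.w3.org/1999/xlink";

  static Element parseRoot(String xml) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true); // needed for the NS methods to be meaningful
    return factory.newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml))).getDocumentElement();
  }

  public static void main(String[] args) throws Exception {
    String xml = "<PHOTO xmlns:xlink=\"http://www.w3.org/1999/xlink\""
               + " xlink:type=\"simple\" xlink:href=\"hotcop.jpg\" WIDTH=\"100\"/>";
    Element photo = parseRoot(xml);

    // A NamedNodeMap is unordered; do not rely on document order here
    NamedNodeMap attributes = photo.getAttributes();
    for (int i = 0; i < attributes.getLength(); i++) {
      Attr attribute = (Attr) attributes.item(i);
      System.out.println(attribute.getName() + "=" + attribute.getValue());
    }

    // Look up by namespace URI and local name, independent of the prefix
    System.out.println(photo.getAttributeNS(XLINK, "href")); // prints: hotcop.jpg
  }
}
```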
import org.xml.sax.*;
import org.apache.xerces.parsers.*;
import java.io.*;
import java.util.*;
import org.w3c.dom.*;

public class DOMSpider {

  private static DOMParser parser = new DOMParser();

  // namespace support is turned off by default in Xerces
  static {
    try {
      parser.setFeature("http://xml.org/sax/features/namespaces", true);
    }
    catch (Exception e) {
      System.err.println(e);
    }
  }

  private static Vector visited = new Vector();

  private static int maxDepth = 5;
  private static int currentDepth = 0;

  public static void listURIs(String systemId) {

    currentDepth++;
    try {
      if (currentDepth < maxDepth) {
        parser.parse(systemId);
        Document document = parser.getDocument();

        Vector uris = new Vector();
        // search the document for uris,
        // store them in vector, and print them
        searchForURIs(document.getDocumentElement(), uris);

        Enumeration e = uris.elements();
        while (e.hasMoreElements()) {
          String uri = (String) e.nextElement();
          visited.addElement(uri);
          listURIs(uri);
        }
      }
    }
    catch (SAXException e) {
      // couldn't load the document,
      // probably not well-formed XML, skip it
    }
    catch (IOException e) {
      // couldn't load the document,
      // likely network failure, skip it
    }
    finally {
      currentDepth--;
      System.out.flush();
    }

  }

  // use recursion
  public static void searchForURIs(Element element, Vector uris) {

    // look for XLinks in this element
    String uri = element.getAttribute("xlink:href");
    // Namespace support seems buggy
    // (the namespace URI is the first argument, the local name the second)
    // String uri = element.getAttributeNS("http://www.w3.org/1999/xlink", "href");

    if (uri != null && !uri.equals("")
     && !visited.contains(uri)
     && !uris.contains(uri)) {
      System.out.println(uri);
      uris.addElement(uri);
    }

    // process child elements recursively
    NodeList children = element.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node n = children.item(i);
      if (n instanceof Element) {
        searchForURIs((Element) n, uris);
      }
    }

  }

  public static void main(String[] args) {

    if (args.length == 0) {
      System.out.println("Usage: java DOMSpider URL1 URL2...");
    }

    // start parsing...
    for (int i = 0; i < args.length; i++) {
      try {
        listURIs(args[i]);
      }
      catch (Exception e) {
        System.err.println(e);
        e.printStackTrace();
      }
    } // end for

  } // end main

} // end DOMSpider
Represents a processing instruction like
<?robots index="yes" follow="no"?>
No children
package org.w3c.dom;
public interface ProcessingInstruction extends Node {
public String getTarget();
public String getData();
public void setData(String data) throws DOMException;
}
import org.xml.sax.*;
import org.apache.xerces.parsers.*;
import java.io.*;
import java.util.*;
import org.w3c.dom.*;

public class PoliteDOMSpider {

  private static DOMParser parser = new DOMParser();

  // namespace support is turned off by default in Xerces
  static {
    try {
      parser.setFeature("http://xml.org/sax/features/namespaces", true);
    }
    catch (Exception e) {
      System.err.println(e);
    }
  }

  private static Vector visited = new Vector();
  private static int maxDepth = 5;
  private static int currentDepth = 0;

  public static void listURIs(String systemId) {
    currentDepth++;
    try {
      if (currentDepth < maxDepth) {
        parser.parse(systemId);
        Document document = parser.getDocument();
        if (robotsAllowed(document)) {
          Vector uris = new Vector();
          // search the document for uris,
          // store them in vector, print them
          searchForURIs(document.getDocumentElement(), uris);
          Enumeration e = uris.elements();
          while (e.hasMoreElements()) {
            String uri = (String) e.nextElement();
            visited.addElement(uri);
            listURIs(uri);
          }
        }
      }
    }
    catch (SAXException e) {
      // couldn't load the document,
      // probably not well-formed XML, skip it
    }
    catch (IOException e) {
      // couldn't load the document,
      // likely network failure, skip it
    }
    finally {
      currentDepth--;
      System.out.flush();
    }
  }

  // Returns false if a top-level <?robots ...?> processing
  // instruction contains follow="no"
  public static boolean robotsAllowed(Document document) {
    NodeList children = document.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node n = children.item(i);
      if (n instanceof ProcessingInstruction) {
        ProcessingInstruction pi = (ProcessingInstruction) n;
        if (pi.getTarget().equals("robots")) {
          String data = pi.getData();
          if (data.indexOf("follow=\"no\"") >= 0) {
            return false;
          }
        }
      }
    }
    return true;
  }

  // use recursion
  public static void searchForURIs(Element element, Vector uris) {
    // look for XLinks in this element
    String uri = element.getAttribute("xlink:href");
    // Namespace support seems buggy
    // (getAttributeNS() takes the namespace URI first, then the local name)
    // String uri = element.getAttributeNS("http://www.w3.org/1999/xlink", "href");
    if (uri != null && !uri.equals("")
     && !visited.contains(uri) && !uris.contains(uri)) {
      System.out.println(uri);
      uris.addElement(uri);
    }
    // process child elements recursively
    NodeList children = element.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node n = children.item(i);
      if (n instanceof Element) {
        searchForURIs((Element) n, uris);
      }
    }
  }

  public static void main(String[] args) {
    if (args.length == 0) {
      System.out.println("Usage: java PoliteDOMSpider URL1 URL2...");
    }
    // start parsing...
    for (int i = 0; i < args.length; i++) {
      try {
        listURIs(args[i]);
      }
      catch (Exception e) {
        System.err.println(e);
        e.printStackTrace();
      }
    } // end for
  } // end main

} // end PoliteDOMSpider
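The DOM treats processing instruction data as one opaque string, so the follow="no" test above is a plain substring match and would miss follow='no' or extra whitespace around the equals sign. A minimal sketch of a more tolerant check, assuming the data follows the common pseudo-attribute convention; the PseudoAttributes class and its parse() method are hypothetical helpers, not part of the DOM or Xerces:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits processing instruction data such as
// index="yes" follow='no' into name/value pairs.
public class PseudoAttributes {

  // Matches name="value" or name='value', allowing whitespace around =
  private static final Pattern PAIR = Pattern.compile(
      "([A-Za-z_][\\w.-]*)\\s*=\\s*(\"([^\"]*)\"|'([^']*)')");

  public static Map<String, String> parse(String data) {
    Map<String, String> result = new LinkedHashMap<>();
    Matcher m = PAIR.matcher(data);
    while (m.find()) {
      // group 3 holds a double-quoted value, group 4 a single-quoted one
      String value = m.group(3) != null ? m.group(3) : m.group(4);
      result.put(m.group(1), value);
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> attrs = parse("index=\"yes\" follow='no'");
    System.out.println("index: " + attrs.get("index"));
    System.out.println("follow: " + attrs.get("follow"));
  }
}
```

With such a helper, robotsAllowed() could test "no".equals(PseudoAttributes.parse(pi.getData()).get("follow")) instead of searching for a literal substring.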
Represents a comment like this example from the XML 1.0 spec:
<!--* N.B. some readers (notably JC) find the following
paragraph awkward and redundant. I agree it's logically redundant:
it *says* it is summarizing the logical implications of
matching the grammar, and that means by definition it's
logically redundant. I don't think it's rhetorically
redundant or unnecessary, though, so I'm keeping it. It
could however use some recasting when the editors are feeling
stronger. -MSM *-->
No children
package org.w3c.dom;
public interface Comment extends CharacterData {
}
import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;

public class CommentReader {

  public static void main(String[] args) {
    DOMParser parser = new DOMParser();
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]);
        Document d = parser.getDocument();
        processNode(d);
      }
      catch (SAXException e) {
        System.err.println(e);
      }
      catch (IOException e) {
        System.err.println(e);
      }
    }
  } // end main

  // note use of recursion
  public static void processNode(Node node) {
    int type = node.getNodeType();
    if (type == Node.COMMENT_NODE) {
      System.out.println(node.getNodeValue());
      System.out.println();
    }
    else {
      if (node.hasChildNodes()) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
          processNode(children.item(i));
        }
      }
    }
  }

}
D:\speaking\SD2000 East\dom\examples>java CommentReader hotcop.xml
The publisher is actually Polygram but I needed
an example of a general entity reference.
You can tell what album I was
listening to when I wrote this example
Or try http://www.w3.org/TR/1998/REC-xml-19980210.xml for more interesting output
Represents a CDATA section like this example from a hypothetical SVG tutorial:
<p>You can use a default <code>xmlns</code> attribute to avoid
having to add the svg prefix to all your elements:</p>
<![CDATA[
<svg xmlns="http://www.w3.org/Graphics/SVG/SVG-19991203.dtd"
width="12cm" height="10cm">
<ellipse rx="110" ry="130" />
<rect x="4cm" y="1cm" width="3cm" height="6cm" />
</svg>
]]>
No children
package org.w3c.dom;
public interface CDATASection extends Text {
}
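A minimal sketch of how CDATA sections show up when reading a document. It uses the JAXP classes bundled with current JDKs (javax.xml.parsers) rather than the Xerces DOMParser shown elsewhere in these slides, and the CDATAFinder class name is an assumption made for this example. Coalescing is switched off explicitly so the parser reports each CDATA section as its own CDATASection node instead of merging it into the surrounding text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class: collects the text of every CDATA section
public class CDATAFinder {

  public static List<String> findCDATA(String xml) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    // keep CDATA sections as separate nodes rather than merging them into Text
    factory.setCoalescing(false);
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document doc = builder.parse(new InputSource(new StringReader(xml)));
    List<String> sections = new ArrayList<>();
    collect(doc, sections);
    return sections;
  }

  // note use of recursion
  private static void collect(Node node, List<String> sections) {
    if (node.getNodeType() == Node.CDATA_SECTION_NODE) {
      sections.add(node.getNodeValue());
      return; // CDATA sections have no children
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collect(children.item(i), sections);
    }
  }

  public static void main(String[] args) throws Exception {
    String xml = "<p>before<![CDATA[<svg/>]]>after</p>";
    for (String section : findCDATA(xml)) {
      System.out.println(section);
    }
  }
}
```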
Represents an actual entity, not an entity reference!
Contains:
Element nodes
ProcessingInstruction nodes
Comment nodes
Text nodes
CDATASection nodes
EntityReference nodes
package org.w3c.dom;
public interface Entity extends Node {
public String getPublicId();
public String getSystemId();
public String getNotationName();
}
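Entity nodes are not part of the main document tree; they are reached through the document type declaration. A minimal sketch, assuming the JAXP parser bundled with the JDK and a made-up internal DTD subset (the publisher and cover entities and the jpeg notation are invented for this example, as is the EntityLister class):

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Entity;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

// Hypothetical example class: lists the entities declared in a DTD
public class EntityLister {

  // Sample document with one parsed and one unparsed entity (assumed data)
  public static final String SAMPLE =
      "<!DOCTYPE SONG [\n"
    + "  <!NOTATION jpeg SYSTEM \"image/jpeg\">\n"
    + "  <!ENTITY publisher \"A &amp; M Records\">\n"
    + "  <!ENTITY cover SYSTEM \"hotcop.jpg\" NDATA jpeg>\n"
    + "]>\n"
    + "<SONG><PUBLISHER>&publisher;</PUBLISHER></SONG>";

  // Maps each entity name to its notation name, or "(parsed)" if it has none
  public static Map<String, String> notations(String xml) throws Exception {
    DocumentBuilder builder
        = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document doc = builder.parse(new InputSource(new StringReader(xml)));
    Map<String, String> result = new TreeMap<>();
    DocumentType doctype = doc.getDoctype();
    if (doctype == null) return result; // no DTD, so no entities
    NamedNodeMap entities = doctype.getEntities();
    for (int i = 0; i < entities.getLength(); i++) {
      Entity entity = (Entity) entities.item(i);
      String notation = entity.getNotationName();
      result.put(entity.getNodeName(), notation == null ? "(parsed)" : notation);
    }
    return result;
  }

  public static void main(String[] args) throws Exception {
    for (Map.Entry<String, String> entry : notations(SAMPLE).entrySet()) {
      System.out.println(entry.getKey() + ": " + entry.getValue());
    }
  }
}
```

getNotationName() returns null for parsed entities, which is why the sketch substitutes a placeholder string for them.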
A runtime exception but you should catch it
Error code gives more detailed information:
DOMException.INDEX_SIZE_ERR
DOMException.DOMSTRING_SIZE_ERR (the requested text does not fit into a String)
DOMException.HIERARCHY_REQUEST_ERR
DOMException.WRONG_DOCUMENT_ERR
DOMException.INVALID_CHARACTER_ERR
DOMException.NO_DATA_ALLOWED_ERR
DOMException.NO_MODIFICATION_ALLOWED_ERR
DOMException.NOT_FOUND_ERR
DOMException.NOT_SUPPORTED_ERR
DOMException.INUSE_ATTRIBUTE_ERR
DOMException.INVALID_STATE_ERR
DOMException.SYNTAX_ERR
DOMException.INVALID_MODIFICATION_ERR
DOMException.NAMESPACE_ERR
DOMException.INVALID_ACCESS_ERR
Current value accessible from the public code field
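A minimal sketch of catching a DOMException and inspecting its code field. It uses the JAXP DocumentBuilder bundled with the JDK rather than the Xerces-specific classes in these slides; the DOMExceptionDemo class name is an assumption. Appending a node created by one document to a different document is one of the easier ways to provoke an error.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example class
public class DOMExceptionDemo {

  // Returns the DOMException code raised by the append, or 0 if none was raised
  public static short appendForeignNode() throws ParserConfigurationException {
    DocumentBuilder builder
        = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document first = builder.newDocument();
    Document second = builder.newDocument();
    Element root = first.createElement("SONG");
    first.appendChild(root);
    // this element belongs to the second document
    Element stray = second.createElement("TITLE");
    try {
      root.appendChild(stray);
      return 0;
    }
    catch (DOMException e) {
      // the public code field says what went wrong
      return e.code;
    }
  }

  public static void main(String[] args) throws Exception {
    short code = appendForeignNode();
    if (code == DOMException.WRONG_DOCUMENT_ERR) {
      System.out.println("WRONG_DOCUMENT_ERR");
    }
    else {
      System.out.println("code " + code);
    }
  }
}
```

A node from another document has to be copied in first with Document.importNode() before it can be appended.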
DOM is for both input and output
New documents are created with a parser-specific API
A serializer + output format converts the DOM to a byte stream
A Xerces-specific class used to create new DOM documents
package org.apache.xerces.dom;
public class DOMImplementationImpl implements DOMImplementation {
public boolean hasFeature(String feature, String version)
public static DOMImplementation getDOMImplementation()
public DocumentType createDocumentType(String qualifiedName,
String publicID,
String systemID,
String internalSubset)
public Document createDocument(String namespaceURI,
String qualifiedName,
DocumentType doctype)
throws DOMException
}
import java.math.*;
import java.io.*;
import org.w3c.dom.*;
import org.apache.xerces.dom.*;

public class FibonacciDOM {

  public static void main(String[] args) {
    try {
      DOMImplementationImpl impl = (DOMImplementationImpl)
       DOMImplementationImpl.getDOMImplementation();
      DocumentType type = impl.createDocumentType("Fibonacci_Numbers", null, null);
      // type is supposed to be able to be null,
      // but in practice that didn't work
      // This not only creates the document; it also creates
      // the root element of the document.
      DocumentImpl fibonacci
       = (DocumentImpl) impl.createDocument(null, "Fibonacci_Numbers", type);
      BigInteger low = BigInteger.ZERO;
      BigInteger high = BigInteger.ONE;
      // Fetch the root element that createDocument() made; a second
      // element from createElement() would not be attached to the document
      Element root = fibonacci.getDocumentElement();
      for (int i = 0; i < 101; i++) {
        Element number = fibonacci.createElement("fibonacci");
        number.setAttribute("index", Integer.toString(i));
        Text text = fibonacci.createTextNode(low.toString());
        number.appendChild(text);
        root.appendChild(number);
        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }
      // Now that the document is created we need to *serialize* it
    }
    catch (DOMException e) {
      e.printStackTrace();
    }
  }

}
The process of taking an in-memory DOM tree and converting it to a stream of characters that can be written onto an output stream
Not a standard part of the DOM
The org.apache.xml.serialize package:
public interface DOMSerializer
public interface Serializer
public abstract class BaseMarkupSerializer
extends Object
implements DocumentHandler, org.xml.sax.misc.LexicalHandler, DTDHandler,
org.xml.sax.misc.DeclHandler, DOMSerializer, Serializer
public class HTMLSerializer
extends BaseMarkupSerializer
public final class TextSerializer
extends BaseMarkupSerializer
public final class XHTMLSerializer
extends HTMLSerializer
public final class XMLSerializer
extends BaseMarkupSerializer
import java.math.*;
import java.io.*;
import org.w3c.dom.*;
import org.apache.xerces.dom.*;
import org.apache.xml.serialize.*;

public class FibonacciDOMSerializer {

  public static void main(String[] args) {
    try {
      DOMImplementationImpl impl = (DOMImplementationImpl)
       DOMImplementationImpl.getDOMImplementation();
      DocumentType type = impl.createDocumentType("Fibonacci_Numbers", null, null);
      // type is supposed to be able to be null,
      // but in practice that didn't work
      // This not only creates the document; it also creates
      // the root element of the document.
      DocumentImpl fibonacci
       = (DocumentImpl) impl.createDocument(null, "Fibonacci_Numbers", type);
      BigInteger low = BigInteger.ZERO;
      BigInteger high = BigInteger.ONE;
      // Fetch the root element that createDocument() made; a second
      // element from createElement() would not be attached to the document
      Element root = fibonacci.getDocumentElement();
      for (int i = 0; i <= 25; i++) {
        Element number = fibonacci.createElement("fibonacci");
        number.setAttribute("index", Integer.toString(i));
        Text text = fibonacci.createTextNode(low.toString());
        number.appendChild(text);
        root.appendChild(number);
        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }
      try {
        // Now that the document is created we need to *serialize* it
        OutputFormat format = new OutputFormat(fibonacci);
        XMLSerializer serializer = new XMLSerializer(System.out, format);
        serializer.serialize(root);
      }
      catch (IOException e) {
        System.err.println(e);
      }
    }
    catch (DOMException e) {
      e.printStackTrace();
    }
  }

}
<?xml version="1.0" encoding="UTF-8"?> <Fibonacci_Numbers><fibonacci index="0">0</fibonacci><fibonacci index="1">1</fibonacci><fibonacci index="2">1</fibonacci><fibonacci index="3">2</fibonacci><fibonacci index="4">3</fibonacci><fibonacci index="5">5</fibonacci><fibonacci index="6">8</fibonacci><fibonacci index="7">13</fibonacci><fibonacci index="8">21</fibonacci><fibonacci index="9">34</fibonacci><fibonacci index="10">55</fibonacci><fibonacci index="11">89</fibonacci><fibonacci index="12">144</fibonacci><fibonacci index="13">233</fibonacci><fibonacci index="14">377</fibonacci><fibonacci index="15">610</fibonacci><fibonacci index="16">987</fibonacci><fibonacci index="17">1597</fibonacci><fibonacci index="18">2584</fibonacci><fibonacci index="19">4181</fibonacci><fibonacci index="20">6765</fibonacci><fibonacci index="21">10946</fibonacci><fibonacci index="22">17711</fibonacci><fibonacci index="23">28657</fibonacci><fibonacci index="24">46368</fibonacci><fibonacci index="25">75025</fibonacci> </Fibonacci_Numbers>
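For comparison, a minimal sketch that builds and serializes the same kind of document using only the JAXP classes bundled with current JDKs, with no Xerces-specific DOMImplementationImpl or org.apache.xml.serialize. This is an alternative to the approach in these slides, not the one they use. The FibonacciJAXP class and its build() method are names assumed for this example; the identity Transformer stands in for XMLSerializer.

```java
import java.io.StringWriter;
import java.math.BigInteger;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example class
public class FibonacciJAXP {

  // Builds a document holding the first count Fibonacci numbers
  // and returns it serialized as a string
  public static String build(int count) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().newDocument();
    Element root = doc.createElement("Fibonacci_Numbers");
    doc.appendChild(root); // attach the root element to the document

    BigInteger low = BigInteger.ZERO;
    BigInteger high = BigInteger.ONE;
    for (int i = 0; i < count; i++) {
      Element number = doc.createElement("fibonacci");
      number.setAttribute("index", Integer.toString(i));
      number.appendChild(doc.createTextNode(low.toString()));
      root.appendChild(number);
      BigInteger temp = high;
      high = high.add(low);
      low = temp;
    }

    // A Transformer created without a stylesheet copies its input to its output
    Transformer serializer = TransformerFactory.newInstance().newTransformer();
    serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    StringWriter out = new StringWriter();
    serializer.transform(new DOMSource(doc), new StreamResult(out));
    return out.toString();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(build(26));
  }
}
```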
package org.apache.xml.serialize;
public class OutputFormat extends Object {
public OutputFormat()
public OutputFormat(String method, String encoding, boolean indenting)
public OutputFormat(Document doc)
public OutputFormat(Document doc, String encoding, boolean indenting)
public String getMethod()
public void setMethod(String method)
public String getVersion()
public void setVersion(String version)
public int getIndent()
public boolean getIndenting()
public void setIndent(int indent)
public void setIndenting(boolean on)
public String getEncoding()
public void setEncoding(String encoding)
public String getMediaType()
public void setMediaType(String mediaType)
public void setDoctype(String publicID, String systemID)
public String getDoctypePublic()
public String getDoctypeSystem()
public boolean getOmitXMLDeclaration()
public void setOmitXMLDeclaration(boolean omit)
public boolean getStandalone()
public void setStandalone(boolean standalone)
public String[] getCDataElements()
public boolean isCDataElement(String tagName)
public void setCDataElements(String[] cdataElements)
public String[] getNonEscapingElements()
public boolean isNonEscapingElement(String tagName)
public void setNonEscapingElements(String[] nonEscapingElements)
public String getLineSeparator()
public void setLineSeparator(String lineSeparator)
public boolean getPreserveSpace()
public void setPreserveSpace(boolean preserve)
public int getLineWidth()
public void setLineWidth(int lineWidth)
public char getLastPrintable()
public static String whichMethod(Document doc)
public static String whichDoctypePublic(Document doc)
public static String whichDoctypeSystem(Document doc)
public static String whichMediaType(String method)
}
Latin-1 encoding
Indentation
Word wrapping
Document type declaration
try {
// Now that the document is created we need to *serialize* it
OutputFormat format = new OutputFormat(fibonacci, "8859_1", true);
format.setLineSeparator("\r\n");
format.setLineWidth(72);
format.setDoctype(null, "fibonacci.dtd");
XMLSerializer serializer = new XMLSerializer(System.out, format);
serializer.serialize(root);
}
catch (IOException e) {
System.err.println(e);
}
Question: Why won't this let us add an xml-stylesheet processing instruction?
<?xml version="1.0" encoding="8859_1"?>
<!DOCTYPE Fibonacci_Numbers SYSTEM "fibonacci.dtd">
<Fibonacci_Numbers>
    <fibonacci index="0">0</fibonacci>
    <fibonacci index="1">1</fibonacci>
    <fibonacci index="2">1</fibonacci>
    <fibonacci index="3">2</fibonacci>
    <fibonacci index="4">3</fibonacci>
    <fibonacci index="5">5</fibonacci>
    <fibonacci index="6">8</fibonacci>
    <fibonacci index="7">13</fibonacci>
    <fibonacci index="8">21</fibonacci>
    <fibonacci index="9">34</fibonacci>
    <fibonacci index="10">55</fibonacci>
    <fibonacci index="11">89</fibonacci>
    <fibonacci index="12">144</fibonacci>
    <fibonacci index="13">233</fibonacci>
    <fibonacci index="14">377</fibonacci>
    <fibonacci index="15">610</fibonacci>
    <fibonacci index="16">987</fibonacci>
    <fibonacci index="17">1597</fibonacci>
    <fibonacci index="18">2584</fibonacci>
    <fibonacci index="19">4181</fibonacci>
    <fibonacci index="20">6765</fibonacci>
    <fibonacci index="21">10946</fibonacci>
    <fibonacci index="22">17711</fibonacci>
    <fibonacci index="23">28657</fibonacci>
    <fibonacci index="24">46368</fibonacci>
    <fibonacci index="25">75025</fibonacci>
</Fibonacci_Numbers>
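One plausible answer to the xml-stylesheet question: the serializer is handed the root element, but a stylesheet processing instruction lives in the prolog, as a child of the Document node and a sibling of the root element. To emit it, the instruction has to be inserted before the root element and the whole Document serialized. A minimal sketch under those assumptions, using the JDK's JAXP classes instead of XMLSerializer; the StylesheetPI class and the fibonacci.css file name are hypothetical:

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ProcessingInstruction;

// Hypothetical example class
public class StylesheetPI {

  public static String build() throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().newDocument();
    Element root = doc.createElement("Fibonacci_Numbers");
    doc.appendChild(root);

    // The processing instruction is a child of the document, not of the root
    ProcessingInstruction pi = doc.createProcessingInstruction(
        "xml-stylesheet", "type=\"text/css\" href=\"fibonacci.css\"");
    doc.insertBefore(pi, root);

    // Serialize the Document node so that the prolog is included
    Transformer serializer = TransformerFactory.newInstance().newTransformer();
    StringWriter out = new StringWriter();
    serializer.transform(new DOMSource(doc), new StreamResult(out));
    return out.toString();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(build());
  }
}
```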
import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;
import org.apache.xerces.dom.*;
import org.apache.xml.serialize.*;

public class DOMPrettyPrinter {

  public static void main(String[] args) {
    DOMParser parser = new DOMParser();
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]);
        Document document = parser.getDocument();
        OutputFormat format = new OutputFormat(document, "UTF-8", true);
        format.setLineSeparator("\r\n");
        format.setLineWidth(72);
        format.setIndent(2);
        format.setPreserveSpace(false);
        XMLSerializer serializer = new XMLSerializer(System.out, format);
        serializer.serialize(document);
      }
      catch (SAXException e) {
        System.err.println(e);
      }
      catch (IOException e) {
        System.err.println(e);
      }
    }
  } // end main

}
Using the DOM to write documents automatically maintains well-formedness constraints
Validity is not automatically maintained.
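A minimal sketch of the two points above, using the JAXP DocumentBuilder bundled with the JDK (the WellFormednessChecks class is a name assumed for this example). The DOM implementation refuses an illegal element name and a second root element, both of which would break well-formedness. Nothing comparable happens for validity: the DOM does not consult a DTD when nodes are added, so an element the DTD never declared is accepted without complaint.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;

// Hypothetical example class
public class WellFormednessChecks {

  private static Document newDocument() throws Exception {
    return DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().newDocument();
  }

  // Tries to create an element whose name is not a legal XML name;
  // returns the DOMException code, or 0 if nothing was thrown
  public static short illegalName() throws Exception {
    Document doc = newDocument();
    try {
      doc.createElement("1 bad name");
      return 0;
    }
    catch (DOMException e) {
      return e.code;
    }
  }

  // Tries to give a document two root elements;
  // returns the DOMException code, or 0 if nothing was thrown
  public static short secondRoot() throws Exception {
    Document doc = newDocument();
    doc.appendChild(doc.createElement("first"));
    try {
      doc.appendChild(doc.createElement("second"));
      return 0;
    }
    catch (DOMException e) {
      return e.code;
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(illegalName() == DOMException.INVALID_CHARACTER_ERR);
    System.out.println(secondRoot() == DOMException.HIERARCHY_REQUEST_ERR);
  }
}
```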
This presentation: http://metalab.unc.edu/xml/slides/sd2000east/dom