The Document interface, summarized in Example 10.4, serves two purposes in DOM:
An abstract factory that creates instances of other nodes for that document
The representation of the document node
Example 10.4. The Document interface
package org.w3c.dom;

public interface Document extends Node {

  public Element createElement(String tagName)
   throws DOMException;
  public Element createElementNS(String namespaceURI,
   String qualifiedName) throws DOMException;
  public Text createTextNode(String data);
  public Comment createComment(String data);
  public CDATASection createCDATASection(String data)
   throws DOMException;
  public ProcessingInstruction createProcessingInstruction(
   String target, String data) throws DOMException;
  public Attr createAttribute(String name) throws DOMException;
  public Attr createAttributeNS(String namespaceURI,
   String qualifiedName) throws DOMException;
  public DocumentFragment createDocumentFragment();
  public EntityReference createEntityReference(String name)
   throws DOMException;

  public DocumentType getDoctype();
  public DOMImplementation getImplementation();
  public Element getDocumentElement();
  public Node importNode(Node importedNode, boolean deep)
   throws DOMException;
  public NodeList getElementsByTagName(String tagname);
  public NodeList getElementsByTagNameNS(
   String namespaceURI, String localName);
  public Element getElementById(String elementId);

}
Don’t forget that besides the methods listed here, each Document object also has all the methods of the Node interface discussed in the last chapter. These are key parts of the functionality of the interface.
I’ll begin with its use as an abstract factory. You’ll notice that the Document interface has ten separate createXXX() methods for creating eight different kinds of node objects. (There are two methods each for creating element and attribute nodes because you can create these with or without namespaces.) For example, given a Document object doc, this code fragment creates a new processing instruction and a comment:
ProcessingInstruction xmlstylesheet
 = doc.createProcessingInstruction("xml-stylesheet",
 "type=\"text/css\" href=\"standard.css\"");
Comment comment = doc.createComment(
 "An example from Chapter 10 of Processing XML with Java");
Although these two nodes are associated with the document, they are not yet parts of its tree. To add them, it’s necessary to use the insertBefore() method of the Node interface which Document extends. Specifically, I’ll insert each of these nodes before the root element of the document, which can be retrieved via getDocumentElement():
Node rootElement = doc.getDocumentElement();
doc.insertBefore(comment, rootElement);
doc.insertBefore(xmlstylesheet, rootElement);
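The difference between being associated with a document and being part of its tree can be checked directly. The following is a minimal sketch rather than part of the chapter's examples; the class name DetachedNodeDemo and the helper newSvgDocument() are hypothetical names introduced only for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Comment;
import org.w3c.dom.Document;

public class DetachedNodeDemo {

  // Hypothetical helper: creates an SVG document with an empty root element.
  static Document newSvgDocument() {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      return factory.newDocumentBuilder().getDOMImplementation()
       .createDocument("http://www.w3.org/2000/svg", "svg", null);
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Document doc = newSvgDocument();
    Comment comment = doc.createComment("An example");

    // Created but not yet inserted: the node has an owner but no parent.
    System.out.println(comment.getOwnerDocument() == doc); // true
    System.out.println(comment.getParentNode());           // null

    // After insertion the document itself is the comment's parent.
    doc.insertBefore(comment, doc.getDocumentElement());
    System.out.println(comment.getParentNode() == doc);    // true
    System.out.println(doc.getFirstChild() == comment);    // true
  }

}
```

The owner document never changes for the life of the node, while the parent is only set once the node is actually placed in the tree.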
To add content inside the root element, it’s necessary to use the Node methods on the root element. For example, this code fragment adds a desc child element to the root element:
Element desc = doc.createElementNS("http://www.w3.org/2000/svg", "desc");
rootElement.appendChild(desc);
Each node is created by the owner document. However, it is inserted using the parent node. For example, this code fragment adds a text node child containing the words “An example from Processing XML with Java” to the previous desc element node:
Text descText = doc.createTextNode("An example from Processing XML with Java");
desc.appendChild(descText);
Example 10.5 puts this all together to create a program that builds a complete, albeit very simple, SVG document in memory using DOM. JAXP loads the DOMImplementation so that the program is reasonably parser-independent. The JAXP ID-transform hack introduced in the last chapter dumps the document on System.out.
Example 10.5. Building an SVG document in memory using DOM
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.dom.DOMSource;
import org.w3c.dom.*;


public class SimpleSVG {

  public static void main(String[] args) {

    try {
      // Find the implementation
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      DocumentBuilder builder = factory.newDocumentBuilder();
      DOMImplementation impl = builder.getDOMImplementation();

      // Create the document
      DocumentType svgDOCTYPE = impl.createDocumentType(
       "svg", "-//W3C//DTD SVG 1.0//EN",
       "http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd"
      );
      Document doc = impl.createDocument(
       "http://www.w3.org/2000/svg", "svg", svgDOCTYPE);

      // Fill the document
      Node rootElement = doc.getDocumentElement();
      ProcessingInstruction xmlstylesheet
       = doc.createProcessingInstruction("xml-stylesheet",
       "type=\"text/css\" href=\"standard.css\"");
      Comment comment = doc.createComment(
       "An example from Chapter 10 of Processing XML with Java");
      doc.insertBefore(comment, rootElement);
      doc.insertBefore(xmlstylesheet, rootElement);
      Node desc = doc.createElementNS(
       "http://www.w3.org/2000/svg", "desc");
      rootElement.appendChild(desc);
      Text descText = doc.createTextNode(
       "An example from Processing XML with Java");
      desc.appendChild(descText);

      // Serialize the document onto System.out
      TransformerFactory xformFactory
       = TransformerFactory.newInstance();
      Transformer idTransform = xformFactory.newTransformer();
      Source input = new DOMSource(doc);
      Result output = new StreamResult(System.out);
      idTransform.transform(input, output);

    }
    catch (FactoryConfigurationError e) {
      System.out.println("Could not locate a JAXP factory class");
    }
    catch (ParserConfigurationException e) {
      System.out.println(
        "Could not locate a JAXP DocumentBuilder class"
      );
    }
    catch (DOMException e) {
      System.err.println(e);
    }
    catch (TransformerConfigurationException e) {
      System.err.println(e);
    }
    catch (TransformerException e) {
      System.err.println(e);
    }

  }

}
When this program is run, it produces the following output:
C:\XMLJAVA>java SimpleSVG
<?xml version="1.0" encoding="utf-8"?><!--An example from
Chapter 10 of Processing XML with Java--><?xml-stylesheet
type="text/css" href="standard.css"?><svg><desc>An example
from Processing XML with Java</desc></svg>
I’ve inserted line breaks to make the output fit on this page. However, the actual output doesn’t have any. In the prolog, that’s because the JAXP ID transform doesn’t include any. In the document, that’s because the program did not add any text nodes containing only white space. Many parser vendors include custom serialization packages that allow you to more closely manage the placement of white space and other syntax sugar in the output. In addition, this will be a standard part of DOM3. We’ll explore these options for prettifying the output in Chapter 13.
The lack of namespace declarations and possibly the lack of a DOCTYPE declaration is a result of bugs in JAXP implementations. I’ve reported the problem to several XSLT processor/XML parser vendors and am hopeful that at least some of them will fix this bug before the final draft of this book. As of May, 2002 GNU JAXP and Oracle include the namespace declaration while Xerces 2.0.1 leaves it out. So far no implementation I've seen includes the DOCTYPE declaration.
The same techniques can be used for all the nodes in the tree: text, comments, elements, processing instructions, and entity references. Attributes are not children though. Attribute nodes can only be set on element nodes and only by using the methods of the Element interface. I’ll take that up in the next chapter. However, Attr objects are created by Document objects, just like all the other DOM node objects.
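As a brief preview of that point, here is a minimal sketch of the pattern: the Document creates the Attr, but the Element attaches it. The class name AttributeDemo, the helper buildRect(), and the rect element with its width and height attributes are all illustrative assumptions, not part of the chapter's examples.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class AttributeDemo {

  static final String SVG_NS = "http://www.w3.org/2000/svg";

  // Hypothetical helper: builds a detached rect element with two attributes.
  static Element buildRect() {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      Document doc = factory.newDocumentBuilder().getDOMImplementation()
       .createDocument(SVG_NS, "svg", null);
      Element rect = doc.createElementNS(SVG_NS, "rect");

      // The document is the factory for the attribute node...
      Attr width = doc.createAttributeNS(null, "width");
      width.setValue("100");
      // ...but only an Element method can attach it. It is not a child.
      rect.setAttributeNodeNS(width);

      // Shortcut that creates and attaches the attribute in one call.
      rect.setAttributeNS(null, "height", "50");
      return rect;
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Element rect = buildRect();
    System.out.println(rect.getAttributeNS(null, "width"));  // 100
    System.out.println(rect.getChildNodes().getLength());    // 0
  }

}
```

The element reports no child nodes even though it carries two attributes, which is what "attributes are not children" means in practice.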
DOM is not picky about whether you work from the top down or the bottom up. You can start at the root, and add its children; then add the child nodes to these nodes, and continue on down the tree. Alternately, you can start by creating the deepest nodes in the tree, and then create their parents, and then the parents of the parents, and so on back up to the root. Or you can mix and match as seems appropriate in your program. DOM really doesn’t care as long as there’s always a root element.
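To illustrate, the sketch below builds the same small tree in both orders and compares the results. It assumes the DOM Level 3 isEqualNode() method, which the org.w3c.dom.Node interface bundled with Java 5 and later provides; the class and method names are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;

public class BuildOrderDemo {

  // Hypothetical helper: a new document with an empty root element.
  static Document newDocument(String rootName) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      return factory.newDocumentBuilder().getDOMImplementation()
       .createDocument(null, rootName, null);
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  // Top down: attach the element to the root, then give it content.
  static Document topDown() {
    Document doc = newDocument("list");
    Element item = doc.createElement("item");
    doc.getDocumentElement().appendChild(item);
    item.appendChild(doc.createTextNode("one"));
    return doc;
  }

  // Bottom up: build the deepest node first, then attach upward.
  static Document bottomUp() {
    Document doc = newDocument("list");
    Text text = doc.createTextNode("one");
    Element item = doc.createElement("item");
    item.appendChild(text);
    doc.getDocumentElement().appendChild(item);
    return doc;
  }

  public static void main(String[] args) {
    Element first = topDown().getDocumentElement();
    Element second = bottomUp().getDocumentElement();
    System.out.println(first.isEqualNode(second)); // true
  }

}
```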
Each node that’s created is firmly associated with the document that created it. If document A creates node X, then node X cannot be inserted into document B. A copy of node X can be imported into document B, but node X itself is always attached only to document A.
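A minimal sketch of this rule follows. It relies on the DOM specification's WRONG_DOCUMENT_ERR code and on the importNode() method listed in Example 10.4. The class name and helper are hypothetical, and the exact point at which an implementation rejects the foreign node is an assumption based on the specification.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class ImportDemo {

  // Hypothetical helper: a new document with an empty root element.
  static Document newDocument(String rootName) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      return factory.newDocumentBuilder().getDOMImplementation()
       .createDocument(null, rootName, null);
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Document a = newDocument("a");
    Document b = newDocument("b");
    Element x = a.createElement("x");

    // Inserting a's node directly into b should be rejected.
    try {
      b.getDocumentElement().appendChild(x);
    }
    catch (DOMException e) {
      System.out.println(e.code == DOMException.WRONG_DOCUMENT_ERR);
    }

    // Importing makes a copy owned by b; x itself still belongs to a.
    Node copy = b.importNode(x, true);
    b.getDocumentElement().appendChild(copy);
    System.out.println(copy.getOwnerDocument() == b);
    System.out.println(x.getOwnerDocument() == a);
  }

}
```

Passing true as the second argument to importNode() copies the node's descendants as well; false copies only the node itself.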
We’re now in a position to repeat some examples from Chapter 3 but this time using DOM to create the document rather than just writing strings onto a stream. Among other advantages, this means that many well-formedness constraints are automatically satisfied. Furthermore, the programs will have a much more object-oriented feel to them.
I’ll begin with the simple Fibonacci problem of Example 3.3. That program produced documents that look like this:
<?xml version="1.0"?>
<Fibonacci_Numbers>
  <fibonacci>1</fibonacci>
  <fibonacci>1</fibonacci>
  <fibonacci>2</fibonacci>
  <fibonacci>3</fibonacci>
  <fibonacci>5</fibonacci>
  <fibonacci>8</fibonacci>
  <fibonacci>13</fibonacci>
  <fibonacci>21</fibonacci>
  <fibonacci>34</fibonacci>
  <fibonacci>55</fibonacci>
</Fibonacci_Numbers>
This is a straightforward element-based hierarchy that does not use namespaces or document type declarations. Although simple, these sorts of documents are important. XML-RPC is just one of many real-world applications that do not use anything more than element, text, and document nodes.
Example 10.6 is a DOM-based program that generates documents of this form. It is at least superficially more complex than the equivalent program from Chapter 3. However, it has some advantages that program does not. In particular, well-formedness of the output is almost guaranteed. It’s a lot harder to produce incorrect XML with DOM than simply by writing strings on a stream. Secondly, the data structure is a lot more flexible. Here, the document is written more or less from beginning to end. However, if this were part of a larger program that ran for a longer period of time, nodes could be added and deleted in almost random order anywhere in the tree at any time. It’s not necessary to know all the information that will ever go into the document before you begin writing it. The downside to this is that DOM programs tend to eat substantially more RAM than the streaming equivalents because they have to keep the entire document in memory at all times. This can be a significant problem for large documents.
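Two of the safeguards behind that "almost guaranteed" can be seen in a short sketch: reserved characters in text nodes are escaped when the tree is serialized, and an illegal element name is rejected when the node is created. The class and helper names are hypothetical, and the serializer is assumed to be the same JAXP identity transform the chapter uses.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;

public class WellFormednessDemo {

  // Hypothetical helper: a new document with an empty root element.
  static Document newDocument(String rootName) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      return factory.newDocumentBuilder().getDOMImplementation()
       .createDocument(null, rootName, null);
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  // Hypothetical helper: identity transform into a String, no XML declaration.
  static String serialize(Document doc) {
    try {
      Transformer idTransform = TransformerFactory.newInstance().newTransformer();
      idTransform.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
      StringWriter out = new StringWriter();
      idTransform.transform(new DOMSource(doc), new StreamResult(out));
      return out.toString();
    }
    catch (TransformerException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Document doc = newDocument("note");

    // The raw < and & are stored as-is and escaped only on output.
    doc.getDocumentElement().appendChild(doc.createTextNode("1 < 2 & 3"));
    System.out.println(serialize(doc));

    // A name containing spaces is not a legal XML name.
    try {
      doc.createElement("not a name");
    }
    catch (DOMException e) {
      System.out.println(e.code == DOMException.INVALID_CHARACTER_ERR);
    }
  }

}
```

A program that writes strings directly to a stream has to remember both checks itself.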
Example 10.6. A DOM program that outputs the Fibonacci numbers as an XML document
import org.w3c.dom.*;
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.math.BigInteger;


public class FibonacciDOM {

  public static void main(String[] args) {

    try {

      // Find the implementation
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      DocumentBuilder builder = factory.newDocumentBuilder();
      DOMImplementation impl = builder.getDOMImplementation();

      // Create the document
      Document doc = impl.createDocument(null,
       "Fibonacci_Numbers", null);

      // Fill the document
      BigInteger low  = BigInteger.ONE;
      BigInteger high = BigInteger.ONE;

      Element root = doc.getDocumentElement();

      for (int i = 0; i < 10; i++) {
        Element number = doc.createElement("fibonacci");
        Text text = doc.createTextNode(low.toString());
        number.appendChild(text);
        root.appendChild(number);

        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }

      // Serialize the document onto System.out
      TransformerFactory xformFactory
       = TransformerFactory.newInstance();
      Transformer idTransform = xformFactory.newTransformer();
      Source input = new DOMSource(doc);
      Result output = new StreamResult(System.out);
      idTransform.transform(input, output);

    }
    catch (FactoryConfigurationError e) {
      System.out.println("Could not locate a JAXP factory class");
    }
    catch (ParserConfigurationException e) {
      System.out.println(
        "Could not locate a JAXP DocumentBuilder class"
      );
    }
    catch (DOMException e) {
      System.err.println(e);
    }
    catch (TransformerConfigurationException e) {
      System.err.println(e);
    }
    catch (TransformerException e) {
      System.err.println(e);
    }

  }

}
As usual, this code is broken up into the four main parts of creating a new XML document with DOM:
Locate a DOMImplementation.
Create a new Document object.
Fill the Document with various kinds of nodes.
Serialize the Document onto a stream.
Most DOM programs that create new documents follow this structure. They may hide the different parts in different methods, or use DOM3 to serialize instead of JAXP; but they all have to locate a DOMImplementation, use that to create a Document object, fill the document with other nodes created by the Document object, and then finally serialize the result. (A few programs may occasionally skip the serialization step.)
The only part that really changes from one program to the next is how the document is filled with content. This naturally depends on the structure of the document. A program that reads tables from a database to get the data will naturally look very different from a program like this one that algorithmically generates numbers. And both of these will look very different from a program that asks the user to type in information. However, all three and many more besides will use the same methods of the Document and Node interfaces to build the structures they need.
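One way to make that division of labor explicit is to fix the other three steps in a single method and pass the filling step in as a parameter. The following is only a sketch under that assumption: the class DomPipeline and its build() method are hypothetical, and it uses a Java 8 lambda and multi-catch for brevity.

```java
import java.io.StringWriter;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomPipeline {

  // Hypothetical helper: runs all four steps, delegating step 3 to the caller.
  static String build(String rootName, Consumer<Document> filler) {
    try {
      // 1. Locate a DOMImplementation
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      DOMImplementation impl
       = factory.newDocumentBuilder().getDOMImplementation();

      // 2. Create a new Document object
      Document doc = impl.createDocument(null, rootName, null);

      // 3. Fill the Document; the only program-specific part
      filler.accept(doc);

      // 4. Serialize the Document (into a String rather than System.out)
      Transformer idTransform = TransformerFactory.newInstance().newTransformer();
      idTransform.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
      StringWriter out = new StringWriter();
      idTransform.transform(new DOMSource(doc), new StreamResult(out));
      return out.toString();
    }
    catch (ParserConfigurationException | TransformerException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String xml = build("greetings", doc -> {
      Element greeting = doc.createElement("greeting");
      greeting.appendChild(doc.createTextNode("Hello"));
      doc.getDocumentElement().appendChild(greeting);
    });
    System.out.println(xml);
  }

}
```

A database-driven program and a number-generating program would then differ only in the lambda they pass to build().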
Here’s the output when this program is run:
C:\XMLJAVA>java FibonacciDOM
<?xml version="1.0" encoding="utf-8"?><Fibonacci_Numbers>
<fibonacci>1</fibonacci><fibonacci>1</fibonacci><fibonacci>2
</fibonacci><fibonacci>3</fibonacci><fibonacci>5</fibonacci>
<fibonacci>8</fibonacci><fibonacci>13</fibonacci><fibonacci>21
</fibonacci><fibonacci>34</fibonacci><fibonacci>55</fibonacci>
</Fibonacci_Numbers>
You see once again that the white space is not quite what was expected. One way to fix this is to add the extra text nodes that represent the white space. For example,
for (int i = 0; i < 10; i++) {
  Text space = doc.createTextNode("\n ");
  root.appendChild(space);
  Element number = doc.createElement("fibonacci");
  Text text = doc.createTextNode(low.toString());
  number.appendChild(text);
  root.appendChild(number);

  BigInteger temp = high;
  high = high.add(low);
  low = temp;
}
Text lineBreak = doc.createTextNode("\n");
root.appendChild(lineBreak);
Alternatively, you can use a more sophisticated serializer and tell it to add the extra white space. I prefer this approach because it’s much simpler and does not clutter up the code with basically insignificant white space. I’ll demonstrate this in Chapter 13. Of course, if you really do care about white space, then you need to manage the white-space-only text nodes explicitly and tell whichever serializer you use to leave the white space alone.
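Even the plain JAXP transformer offers a modest version of this through the standard OutputKeys.INDENT output property. The sketch below is not the Chapter 13 technique, just a hedged illustration; the class and helper names are hypothetical, and how much indentation (if any) accompanies the added line breaks varies between JAXP implementations and versions.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class IndentDemo {

  // Hypothetical helper: a small Fibonacci document with no white-space nodes.
  static Document threeNumbers() {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      Document doc = factory.newDocumentBuilder().getDOMImplementation()
       .createDocument(null, "Fibonacci_Numbers", null);
      for (String value : new String[] {"1", "1", "2"}) {
        Element number = doc.createElement("fibonacci");
        number.appendChild(doc.createTextNode(value));
        doc.getDocumentElement().appendChild(number);
      }
      return doc;
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  // Hypothetical helper: serializes with or without the INDENT property.
  static String serialize(Document doc, boolean indent) {
    try {
      Transformer idTransform = TransformerFactory.newInstance().newTransformer();
      idTransform.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
      idTransform.setOutputProperty(OutputKeys.INDENT, indent ? "yes" : "no");
      StringWriter out = new StringWriter();
      idTransform.transform(new DOMSource(doc), new StreamResult(out));
      return out.toString();
    }
    catch (TransformerException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Document doc = threeNumbers();
    System.out.println(serialize(doc, false)); // everything on one line
    System.out.println(serialize(doc, true));  // line breaks added by serializer
  }

}
```

The tree itself is untouched in both cases; the white space exists only in the serialized output.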
Adding namespaces or a DOCTYPE declaration pointing to an external DTD subset is not significantly harder. For example, suppose you want to generate valid MathML like Example 10.7:
Example 10.7. A valid MathML document containing Fibonacci numbers
<?xml version="1.0"?>
<!DOCTYPE math PUBLIC "-//W3C//DTD MathML 2.0//EN"
 "http://www.w3.org/TR/MathML2/dtd/mathml2.dtd">
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mrow><mi>f(1)</mi><mo>=</mo><mn>1</mn></mrow>
  <mrow><mi>f(2)</mi><mo>=</mo><mn>1</mn></mrow>
  <mrow><mi>f(3)</mi><mo>=</mo><mn>2</mn></mrow>
  <mrow><mi>f(4)</mi><mo>=</mo><mn>3</mn></mrow>
  <mrow><mi>f(5)</mi><mo>=</mo><mn>5</mn></mrow>
  <mrow><mi>f(6)</mi><mo>=</mo><mn>8</mn></mrow>
  <mrow><mi>f(7)</mi><mo>=</mo><mn>13</mn></mrow>
  <mrow><mi>f(8)</mi><mo>=</mo><mn>21</mn></mrow>
  <mrow><mi>f(9)</mi><mo>=</mo><mn>34</mn></mrow>
  <mrow><mi>f(10)</mi><mo>=</mo><mn>55</mn></mrow>
</math>
The markup is somewhat more complex, but the Java code is not significantly more so. You simply need to use the implementation to create a new DocumentType object, and include both that and the namespace URI in the call to createDocument(). Example 10.8 demonstrates.
Example 10.8. A DOM program that outputs the Fibonacci numbers as a MathML document
import org.w3c.dom.*;
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.math.BigInteger;


public class FibonacciMathMLDOM {

  public static void main(String[] args) {

    try {

      // Find the implementation
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      DocumentBuilder builder = factory.newDocumentBuilder();
      DOMImplementation impl = builder.getDOMImplementation();

      // Create the document
      String mathmlNS = "http://www.w3.org/1998/Math/MathML";
      DocumentType mathml = impl.createDocumentType("math",
       "-//W3C//DTD MathML 2.0//EN",
       "http://www.w3.org/TR/MathML2/dtd/mathml2.dtd");
      Document doc = impl.createDocument(mathmlNS, "math", mathml);

      // Fill the document
      BigInteger low  = BigInteger.ONE;
      BigInteger high = BigInteger.ONE;

      Element root = doc.getDocumentElement();

      for (int i = 1; i <= 10; i++) {
        // createElementNS() keeps each child in the MathML namespace;
        // plain createElement() would yield elements in no namespace
        Element mrow = doc.createElementNS(mathmlNS, "mrow");

        Element mi = doc.createElementNS(mathmlNS, "mi");
        Text function = doc.createTextNode("f(" + i + ")");
        mi.appendChild(function);

        Element mo = doc.createElementNS(mathmlNS, "mo");
        Text equals = doc.createTextNode("=");
        mo.appendChild(equals);

        Element mn = doc.createElementNS(mathmlNS, "mn");
        Text value = doc.createTextNode(low.toString());
        mn.appendChild(value);

        mrow.appendChild(mi);
        mrow.appendChild(mo);
        mrow.appendChild(mn);

        root.appendChild(mrow);

        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }

      // Serialize the document onto System.out
      TransformerFactory xformFactory
       = TransformerFactory.newInstance();
      Transformer idTransform = xformFactory.newTransformer();
      Source input = new DOMSource(doc);
      Result output = new StreamResult(System.out);
      idTransform.transform(input, output);

    }
    catch (FactoryConfigurationError e) {
      System.out.println("Could not locate a JAXP factory class");
    }
    catch (ParserConfigurationException e) {
      System.out.println(
        "Could not locate a JAXP DocumentBuilder class"
      );
    }
    catch (DOMException e) {
      System.err.println(e);
    }
    catch (TransformerConfigurationException e) {
      System.err.println(e);
    }
    catch (TransformerException e) {
      System.err.println(e);
    }

  }

}
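When building a namespaced vocabulary such as MathML, the choice between createElement() and createElementNS() matters, because the two methods produce nodes with different namespace properties. The following minimal sketch compares them; the class name NamespaceCheck and its helper are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class NamespaceCheck {

  static final String MATHML_NS = "http://www.w3.org/1998/Math/MathML";

  // Hypothetical helper: a MathML document with an empty math root element.
  static Document newMathDocument() {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      return factory.newDocumentBuilder().getDOMImplementation()
       .createDocument(MATHML_NS, "math", null);
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Document doc = newMathDocument();
    Element plain = doc.createElement("mrow");
    Element qualified = doc.createElementNS(MATHML_NS, "mrow");

    // A DOM Level 1 style element has neither a namespace nor a local name.
    System.out.println(plain.getNamespaceURI());     // null
    System.out.println(plain.getLocalName());        // null

    // The namespace-aware factory method records both.
    System.out.println(qualified.getNamespaceURI()); // http://www.w3.org/1998/Math/MathML
    System.out.println(qualified.getLocalName());    // mrow
  }

}
```

Both elements report the same node name, mrow, so the difference is easy to miss until a namespace-aware consumer reads the document.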
Internal DTD subsets are a little harder though, and not really supported at all in DOM2. For example, let’s suppose you want to use a namespace prefix on your MathML elements, but still have the document be valid MathML. The MathML DTD is designed in such a way that you can change the prefix and whether or not prefixes are used by redefining the MATHML.prefixed and MATHML.prefix parameter entities. For instance, Example 10.9 uses the prefix math:
Example 10.9. A valid MathML document using prefixed names
<?xml version="1.0"?>
<!DOCTYPE math:math PUBLIC "-//W3C//DTD MathML 2.0//EN"
 "http://www.w3.org/TR/MathML2/dtd/mathml2.dtd" [
  <!ENTITY % MATHML.prefixed "INCLUDE">
  <!ENTITY % MATHML.prefix "math">
]>
<math:math xmlns:math="http://www.w3.org/1998/Math/MathML">
  <math:mrow>
    <math:mi>f(1)</math:mi>
    <math:mo>=</math:mo>
    <math:mn>1</math:mn>
  </math:mrow>
  <math:mrow>
    <math:mi>f(2)</math:mi>
    <math:mo>=</math:mo>
    <math:mn>1</math:mn>
  </math:mrow>
  <math:mrow>
    <math:mi>f(3)</math:mi>
    <math:mo>=</math:mo>
    <math:mn>2</math:mn>
  </math:mrow>
  <math:mrow>
    <math:mi>f(4)</math:mi>
    <math:mo>=</math:mo>
    <math:mn>3</math:mn>
  </math:mrow>
</math:math>
Using prefixed names in DOM code is straightforward enough. However, there’s no way to override the entity definitions in the DTD to tell it to validate against the prefixed names. DOM does not provide any means to create a new internal DTD subset or change an existing one. This means that in order for the document you generate to be valid, it must use the same prefix the DTD does.
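The read-only nature of the internal subset shows up in the DocumentType interface itself, which offers getInternalSubset() but no corresponding setter. The sketch below is an illustration under assumptions: the class and method names are hypothetical, the note document and greeting entity are invented for the example, and the exact string a parser returns for the subset is implementation-dependent.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class InternalSubsetDemo {

  static DocumentBuilder newBuilder() {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      return factory.newDocumentBuilder();
    }
    catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  // A parsed document exposes its internal subset as a read-only string.
  static String parsedSubset() {
    String xml = "<!DOCTYPE note [ <!ENTITY greeting \"hello\"> ]>"
     + "<note>&greeting;</note>";
    try {
      Document doc = newBuilder().parse(new InputSource(new StringReader(xml)));
      return doc.getDoctype().getInternalSubset();
    }
    catch (SAXException | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  // createDocumentType() accepts only a name, public ID, and system ID,
  // so a DocumentType built in memory has no internal subset at all.
  static String createdSubset() {
    DocumentType type = newBuilder().getDOMImplementation()
     .createDocumentType("note", null, "note.dtd");
    return type.getInternalSubset();
  }

  public static void main(String[] args) {
    System.out.println(parsedSubset());
    System.out.println(createdSubset()); // null
  }

}
```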
There are some hacks that can work around this. Some of the concrete classes that implement the DocumentType interface such as Xerces’s org.apache.xerces.dom.DocumentTypeImpl include a non-standard setInternalSubset() method. Or instead of pointing to the normal DTD, you can point to an external DTD that overrides the namespace parameter entity references and then imports the usual DTD. You could even generate this DTD on the fly using a separate output stream that writes strings containing entity declarations into a file. However, the bottom line is that the internal DTD subset just isn’t well supported by DOM, and any program that needs access to it should use a different API.
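The generated-DTD workaround can be sketched as follows. The generated file declares the two parameter entities first and then pulls in the normal MathML DTD; because the first declaration of an entity wins, the overrides take effect. The class name, the method names, the file name prefixed-mathml.dtd, and the parameter entity name mathml-dtd are all hypothetical choices made for this illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PrefixedMathMLDriver {

  // Builds the text of a driver DTD that sets the prefix, then
  // includes the standard MathML 2.0 DTD by reference.
  static String driverDtd(String prefix) {
    return "<!ENTITY % MATHML.prefixed \"INCLUDE\">\n"
     + "<!ENTITY % MATHML.prefix \"" + prefix + "\">\n"
     + "<!ENTITY % mathml-dtd PUBLIC \"-//W3C//DTD MathML 2.0//EN\"\n"
     + " \"http://www.w3.org/TR/MathML2/dtd/mathml2.dtd\">\n"
     + "%mathml-dtd;\n";
  }

  // Writes the driver DTD to the given file and returns that path.
  static Path writeDriver(Path file, String prefix) {
    try {
      Files.write(file, driverDtd(prefix).getBytes(StandardCharsets.UTF_8));
      return file;
    }
    catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Path dtd = writeDriver(Paths.get("prefixed-mathml.dtd"), "math");
    System.out.println("Wrote " + dtd.toAbsolutePath());
  }

}
```

The DOM program would then pass the driver's location as the system ID, for example impl.createDocumentType("math:math", null, "prefixed-mathml.dtd"), instead of pointing at the MathML DTD directly.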
Copyright 2001, 2002 Elliotte Rusty Harold | elharo@metalab.unc.edu | Last Modified July 12, 2002 |