DOM is based on an implicit data model, which is similar to but not quite the same as the data models used by other XML technologies like XPath, the XML Infoset, and SAX. Before we delve too deeply into the nitty-gritty details of the DOM API, it’s helpful to have a higher level understanding of just what DOM thinks an XML document is.
According to DOM, an XML document is a tree made up of nodes of several types. The tree has a single root node, and all nodes in this tree except for root have a single parent node. Furthermore, each node has a list of child nodes. In some cases, this list of children may be empty, in which case the node is called a leaf node.
There can also be nodes that are not part of the tree structure. For instance, each attribute node belongs to one element node but is not considered to be a child of that element. Furthermore, nodes can be removed from the tree or created but not inserted in the tree. Thus a full DOM document is composed of a tree of nodes, various nodes that are somehow associated with other nodes in the tree but are not themselves part of the tree, and a random assortment of disconnected nodes.
DOM trees are not red-black trees, binary trees, B-trees, or any other sort of special-purpose tree. From a data structures point of view, they are just plain-vanilla trees. Recursion works very well on DOM data structures, as it does on any tree. You can use all the techniques you learned in Data Structures 201 for processing trees. Breadth-first search, depth-first search, inorder traversal, preorder traversal, postorder traversal, etc. all work with DOM data structures.
Besides its tree connections, each node has a local name, a namespace URI, and a prefix; though for several kinds of nodes, these may be null. For instance the local name, namespace URI, and prefix of a comment are always null. Each node also has a node name. For an element or attribute, the node name is the prefixed name. For other named things like notations or entities, the node name is the name of the thing. For nodes without names such as text nodes, the node name is the value from the following list matching the node type:
#document
#comment
#text
#cdata-section
#document-fragment
Finally each node has a string value. For text-ish things like text nodes and comments, this tends to be the text of the node. For attributes, it’s the normalized value of the attribute. For everything else, including elements and documents, the value is null.
DOM divides nodes into twelve types, seven of which can potentially be part of a DOM tree:
Document nodes
Element nodes
Text nodes
Attribute nodes
Processing instruction nodes
Comment nodes
Document type nodes
Document fragment nodes
Notation nodes
CDATA section nodes
Entity nodes
Entity reference nodes
However, of these twelve, the first seven are by far the most important; and a tree built by an XML parser will often contain only the first seven.
Each DOM tree has a single root document node. This node has children. Since all documents have root elements, a document node always has exactly one element node child. If the document has a document type declaration, then it will also have one document type node child. If the document contains any comments or processing instructions before or after the root element, then these will also be child nodes of the document node. The order of all children is maintained. For example, consider the simple XML-RPC document shown in Example 9.2.
Example 9.2. An XML-RPC request document
<?xml version="1.0"?> <?xml-stylesheet type="text/css" href="xml-rpc.css"?> <!-- It's unusual to have an xml-stylesheet processing instruction in an XML-RPC document but it is legal, unlike SOAP where processing instructions are forbidden. --> <!DOCTYPE methodCall SYSTEM "xml-rpc.dtd"> <methodCall> <methodName>getQuote</methodName> <params> <param> <value><string>RHAT</string></value> </param> </params> </methodCall>
The document node representing the root of this document has four child nodes in this order:
A processing instruction node for the xml-stylesheet processing instruction
A comment node for the comment
A document type node for the document type declaration
An element node for the root methodCall element
The XML declaration (including the version, standalone, and encoding declarations) and the white space between these nodes are not included in the tree. They are not part of the model, and the parser does not include them in the tree it builds.
Each element node has a name, a local name, a namespace URI (which may be null if the element is not in any namespace) and a prefix (which may also be null). It also contains children. For example, consider this value element:
<value><string>RHAT</string></value>
When represented in DOM, it becomes a single element node with the name value. This node has a single element node child for the string element. Thie string element node has a single text node child containing the text RHAT.
Or consider this para element:
<db:para xmlns:db="http://www.example.com/" xmlns="http://namespaces.cafeconleche.org/"> Or consider this <markup>para</markup> element: </db:para>
In DOM it’s represented as an element node with the name db:para, the local name para, the prefix db, and the namespace URI http://www.example.com/. It has three children:
A text node containing the text Or consider this
An element node with the name markup, the local name markup, the namespace URI http://namespaces.cafeconleche.org/, and a null prefix.
Another text node containing the text element:.
White space is included in text nodes, even if it’s ignorable. For example, consider this methodCall element:
<methodCall> <methodName>getQuote</methodName> <params> <param> <value><string>RHAT</string></value> </param> </params> </methodCall>
It is represented as an element node with the name methodCall and five child nodes:
A text node containing only white space
An element node with the name methodName
A text node containing only white space
An element node with the name params
A text node containing only white space
Of course, these element nodes also have their own child nodes.
As well as element and text nodes, an element node can also contain comment and processing instruction nodes. Depending on how the parser behaves, an element node might also contain some CDATA section nodes and/or entity reference nodes. However, many parsers resolve these automatically into their component text and element nodes and do not report them separately.
An attribute node has a name, a local name, a prefix, a namespace URI, and a string value. The value is normalized as required by the XML 1.0 specification. That is, entity and character references in the value are resolved, and all white space characters are converted to a single space. If the attribute has any type other than CDATA, then leading and trailing white space is stripped from its value, and all other runs of white space are converted to a single space. An attribute node also has children, all of which are text and entity reference nodes forming the value of the attribute. However, it’s unusual to access these directly instead of by the value.
If a validating parser builds an XML document from a file, then default attributes from the DTD are included in the DOM tree. If the parser supports schemas, then default attributes can be read from the schema as well.
DOM does not provide the type of the attribute as specified by the DTD or schema, or the list of values available for an enumerated type attribute. This is a major shortcoming.
Attributes are not considered to be children of the element they’re attached to. Instead they are part of a separate set of nodes. For example, consider this Quantity element:
<Quantity amount="17" />
This element has no children, but it does have a single attribute with the name amount and the value 17.
Attributes that declare namespaces are not treated specially in DOM. They are reported by DOM parsers like any other attribute. Furthermore, DOM always provides the fully qualified names and namespace URIs for all element and attribute nodes.
Only document, element, attribute, entity, and entity reference nodes can have children. The remaining node types are much simpler.
Text nodes contain character data from the document stored as a String. Any characters like 𝄢 from outside Unicode’s Basic Multilingual Plane are represented as surrogate pairs. Characters like & and < that are represented in the document by predefined entity or character references are replaced by the actual characters they represent. If these nodes are written out again into an XML document, these characters need to be reescaped.
When a parser reads an XML document to form a DOM Document, it will put as much text as possible into each text node before being interrupted by non-predefined entities, comments, tags, CDATA section delimiters or other markup. Thus no text node will immediately follow any other text node. There is always an intervening non-text node. However, if a DOM is created or modified in memory, then the client program may divide text between immediately adjacent text nodes, so it’s not guaranteed that each text node contains the maximum possible contiguous run of text, just that this is the case immediately after a document is parsed.
A comment node has a name (which is always #comment), a string value (the text of the comment) and a parent (the node that contains it). That’s all. For example, consider this comment:
<!-- Don't forget to fix this! -->
The value of this node is Don't forget to fix this! . The white space on either end is included.
A processing instruction node has a name (the target of the processing instruction), a string value (the data of the processing instruction) and a parent (the node that contains it). That’s all. For example, consider this processing instruction:
<?xml-stylesheet type="text/css" href="xml-rpc.css"?>
The name of this node is xml-stylesheet. The value is type="text/css" href="xml-rpc.css". The white space between the target and the data is not included. White space between the data and the closing ?> is included. Even if the processing instruction uses a pseudo-attribute format like this one does, it is not considered to actually have attributes or children. Its data is just a string that happens to have some equal signs and quote marks in suggestive positions.
A CDATA section node is a special text node that represents the contents of a CDATA section. Its name is #cdata-section. Its value is the text content of the section. For example, consider this CDATA section:
<![CDATA[<?xml-stylesheet type="text/css" href="xml-rpc.css"?>]]>
Its name is #cdata-section and its value is <?xml-stylesheet type="text/css" href="xml-rpc.css"?>.
When a parser encounters a general entity reference such as Æ or ©right_notice;, it may or may not replace it with the entity’s replacement text. Validating parsers always replace entity references. Non-validating parsers may do so at their option.
If a parser does not replace entity references, then the DOM tree will include entity reference nodes. Each entity reference node has a name. If the parser has read the DTD, then you should be able to look up the public and system IDs for this entity reference using the map of entity nodes available on the document type node. Furthermore, the child list of the entity will contain this entity reference’s replacement text. However, if the parser has not read the DTD and resolved external entity references, then the child list may be empty.
If a parser does replace entity references, then the DOM tree may or may not include entity reference nodes. Some parsers resolve all entity reference nodes completely and leave no trace of them in the parsed tree. However, some parsers instead include entity reference nodes in the DOM tree that have a list of children. The child list contains text nodes, element nodes, comment nodes, and so forth representing the replacement text of the entity.
For example, suppose an XML document contains this element:
<para>Ælfred is a very nice XML parser.</para>
If the parser is not resolving entity references, then the para element node contains two children, an entity reference node with the name AElig and a text node containing the text “lfred is a very nice XML parser.” The AElig entity reference node will not have any children.
Now suppose the parser is resolving entity references, and the replacement text for the AElig entity reference is the single ligature character Æ. Then the parser has a choice. It can represent the children of the para element as a single text node containing the full sentence, “Ælfred is a very nice XML parser.” Alternately, it can represent the children of the para element as an entity reference node with the name AElig followed by a text node containing the text, “lfred is a very nice XML parser.” If it chooses the second option, then the AElig entity reference node contains a single read-only text node child containing single ligature character Æ.
DOM never includes entity reference nodes for the five predefined entity references: &, <, >, ', and ". These are simply replaced by their respective characters and included in a text node.
Similarly, character references like   and   are not specially represented in DOM as any kind of node. The characters they represent are simply added to the relevant text node.
A document type node has a name (the name the document type declaration specifies for the root element), a public ID (which may be null), a system ID (required), an internal DTD subset (which may be null), a parent (the document that contains it), and lists of the notations and general entities declared in the DTD. The value of a document type node is always null. For example, consider this document type declaration:
<!DOCTYPE mml:math PUBLIC "-//W3C//DTD MathML 2.0//EN" "http://www.w3.org/TR/MathML2/dtd/mathml2.dtd" [ <!ENTITY % MATHML.prefixed "INCLUDE"> <!ENTITY % MATHML.prefix "mml"> ]>
The name of the corresponding node is mml:math. The public ID is -//W3C//DTD MathML 2.0//EN. The system ID is http://www.w3.org/TR/MathML2/dtd/mathml2.dtd. The internal DTD subset is the complete text between the [ and the ].
There are four kinds of DOM nodes that are part of the document but not the document’s tree: attribute nodes, entity nodes, notation nodes, and document fragment nodes. You’ve already seen that attribute nodes are attached to element nodes, but are not children of those nodes. Entity and notation nodes are available as properties of the document type node. Document fragment nodes are only used when building DOM trees in memory, not when reading them from a parsed file.
Entity nodes (not to be confused with entity reference nodes) represent the parsed and unparsed entities declared in the document’s DTD. If the parser reads the DTD, then it will attach a map of entity nodes to the document type node. This map is indexed by the entity names. You can use it to match entity reference nodes to entity nodes.
Each entity node has a name and a system ID. It can also have a public ID if one was used in the DTD. Furthermore, if the parser read the entity, then the entity node has a list of children containing the replacement text of the entity. However, these children are read-only and cannot be modified, unlike children of similar type elsewhere in the document. For example, suppose the following entity declaration appears in the document’s DTD:
<!ENTITY AElig "Æ">
If the parser reads the DTD, then it will create an entity node with the name AElig. This node will have a null public and system ID (because the entity’s purely internal) and one child, a read-only text node containing the single character Æ.
For another example, suppose this entity declaration appears in the document’s DTD:
<!ENTITY Copyright SYSTEM "copyright.xml">
If the parser reads the DTD, then it will create an entity node with the name Copyright, the system ID copyright.xml, and a null public ID. The children of this node will depend on what’s found at the relative URL copyright.xml. Suppose that document contains the following content:
<copyright> <year>2002</year> <person>Elliotte Rusty Harold</person> </copyright>
Then the child list of the Copyright entity node will contain a single read-only element child with the name copyright. The element will contain its own read-only element and text node children.
Notation nodes represent the notations declared in the document’s DTD. If the parser reads the DTD, then it will attach a map of notation nodes to the document type node. This map is indexed by the notation name. You can use it to look up the notation for each entity node that corresponds to an unparsed entity or the notations associated with particular processing instruction targets.
Besides its name, each notation node has a public ID or a system ID, whichever was used to declare it in the DTD. Notation nodes do not have any children. For example, suppose this notation declaration for PNG images were included in the DTD:
<!NOTATION PNG SYSTEM "http://www.w3.org/TR/REC-png">
This would produce a notation node with the name PNG and the system ID http://www.w3.org/TR/REC-png. The public ID would null.
For another example, suppose this notation declaration for TeX documents were included in the DTD:
<!NOTATION TEX PUBLIC "+//ISBN 0-201-13448-9::Knuth//NOTATION The TeXbook//EN">
This would produce a notation node with the name TEX and the public ID +//ISBN 0-201-13448-9::Knuth//NOTATION The TeXbook//EN. The system ID would null. (XML doesn’t allow notations to have both public and system IDs.)
The document fragment node is an alternative root node for a DOM tree. It can contain anything an element can contain (i.e. element nodes, text nodes, processing instruction nodes, comment nodes, etc.). A parser will never produce such a node. However, your own programs may create such a node when extracting part of an XML document in order to move it elsewhere.
In DOM, the non-root nodes never exist alone. That is, there’s never a text node or an element node or a comment node that’s not part of a document or a document fragment. They may be temporarily disconnected from the main tree, but they always know which document or fragment they belong to. The document fragment node enables you to work with pieces of a document that are composed of more than one node.
Table 9.1 summarizes the DOM data model with the name, value, parent, and possible children for each kind of node.
Table 9.1. Node properties
Node type | name | value | parent | children |
---|---|---|---|---|
Document | #document | null | null | Comment, Processing Instruction, 1 Element |
DocumentType | Root element name specified by the DOCTYPE declaration | null | Document | none |
Element | prefixed name | null | Element, Document, or Document fragment | Comment, Processing Instruction, Text, Element, Entity reference, CDATA section |
Text | #text | text of the node | Element, Attr, Entity, or Entity reference | none |
Attr | prefixed name | normalized attribute value | Element | Text, Entity reference |
Comment | #comment | text of comment | Element, Document, or Document fragment | none |
Processing Instruction | target | data | Element, Document, or Document fragment | none |
Entity Reference | name | null | Element or Document Fragment | Comment, Processing Instruction, Text, Element, Entity reference, CDATA section |
Entity | entity name | null | null | Comment, Processing Instruction, Text, Element, Entity Reference, CDATA section |
CDATA section | #cdata-section | text of the section | Element, Entity, or Entity reference | none |
Notation | notation name | null | null | none |
Document fragment | #document-fragment | null | null | Comment, Processing Instruction, Text, Element, Entity reference, CDATA section |
One thing to keep in mind is the parts of the XML document that are not exposed in this data model:
The XML declaration, including the version, standalone, and encoding declarations. These will be added as properties of the document node in DOM3, but they are not provided by current parsers.
Most information from the DTD and/or schema is not provided including element and attribute types and content models.
Any white space outside the root element.
Whether or not each character was provided by a character reference. Parsers may provide information about entity references, but are not required to do so.
A DOM program cannot manipulate any of these constructs. It cannot, for example, read in an XML document, then write it out again in the same encoding the original document used because it doesn’t know what encoding the original document used. It cannot treat $var differently than $var because it doesn’t know which was originally written.
Copyright 2001, 2002 Elliotte Rusty Harold | elharo@metalab.unc.edu | Last Modified September 08, 2002 |
Up To Cafe con Leche |