This is not an introductory book about XML. I expect that you already have some experience with XML documents. Nonetheless, when writing programs to process XML it’s even more important to be clear about the exact terminology used when discussing XML. Therefore I’d like to take a few pages to review the proper terminology, as well as to clarify a few points that are often confused or misunderstood.
The precise meaning of “XML document” is defined by the XML 1.0 specification published by the World Wide Web Consortium (W3C). This specification provides a detailed BNF grammar defining exactly what is and is not an XML document. Anything that satisfies the document production in that BNF grammar and adheres to the fifteen well-formedness constraints is an XML document.[2] Anything that does not is not an XML document.
Well-formedness is the minimum requirement for an XML document. A document that is not well-formed is not an XML document. Parsers cannot read it. A parser is not allowed to fix a malformed document. It cannot take a best guess at what the document author intended. When a parser encounters a malformed document, it stops parsing and reports the error. It will not read any further in the document.[3] Depending on which API you’re accessing the parser through, you may or may not have already received some information from the parts of the document before the error. However, under no circumstances will the parser give you any data from after the first well-formedness error in the document.
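The behavior described above can be observed with the SAX parser bundled with the JDK. The following is a minimal sketch, not a definitive test harness; the class name WellFormednessDemo and its helper methods are invented for illustration. It counts the start-tags reported before the parser either finishes or gives up.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormednessDemo {

    /** Records how many start-tags were reported and whether parsing failed. */
    static class Counter extends DefaultHandler {
        int startTags = 0;
        boolean failed = false;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            startTags++;
        }
    }

    /** Parses the text; a fatal error marks the counter as failed. */
    static Counter parse(String xml) {
        Counter counter = new Counter();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), counter);
        } catch (SAXException e) {
            // DefaultHandler rethrows fatal errors, so parsing has stopped here.
            counter.failed = true;
        } catch (Exception e) {
            // Parser configuration or I/O trouble, unrelated to the document.
            throw new IllegalStateException("could not run the parser", e);
        }
        return counter;
    }

    public static void main(String[] args) {
        Counter good = parse("<a><b/><c/></a>");
        // The end-tag </a> does not match <c>, so <d/> is never reported.
        Counter bad = parse("<a><b/><c></a><d/>");
        System.out.println("good: failed=" + good.failed + ", start-tags=" + good.startTags);
        System.out.println("bad:  failed=" + bad.failed + ", start-tags=" + bad.startTags);
    }
}
```

In the malformed case the handler has already seen the start-tags that precede the error, which is exactly the partial information mentioned above; nothing after the error is ever delivered.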
The detailed rules an XML document must follow aren’t so important here since the parser will check them for you. Very roughly an XML document must have a single root element. All start-tags must be matched by end-tags. All attribute values must be quoted. And only the Unicode characters that are legal in XML may be used in the document. (Almost all Unicode characters are legal in XML documents. The only ones really ruled out are the C0 controls like null, bell, and form feed.)
Occasionally developers ask how they can parse a document that is almost, but not quite a well-formed XML document. For example, it may end with a form feed inserted by some Unix text editor to separate documents. Or perhaps it’s part of an infinite stream of elements, the last of which is never seen so there’s no end-tag for the root element. Imagine, for example, weather observations or stock quotes being pushed across the Internet as XML elements.
The short answer is that you can’t parse these things because they are not XML documents, even if they use a lot of tags and attributes and other XML-like markup. The long answer is that you may be able to write a non-XML-aware program to preprocess the streams, fix up any well-formedness mistakes you see, and only then pass the fixed documents to the XML parser. However, the XML parser must receive a complete well-formed document. It cannot work with anything less.
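As a rough illustration of the "long answer", here is a sketch of a preprocessing step written in plain Java string handling, followed by a normal parse. The class FragmentPreprocessor and the synthetic stream root element are assumptions made up for this example. It also assumes the input carries no XML declaration or DOCTYPE, since those could not legally appear inside the wrapper element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class FragmentPreprocessor {

    /** Drops trailing form feeds and wraps the elements in a synthetic root. */
    static String repair(String almostXml) {
        String trimmed = almostXml.replaceAll("\\f+\\z", "");
        return "<stream>" + trimmed + "</stream>";
    }

    /** Repairs the text, parses it, and counts the elements directly under the root. */
    static int countTopLevelElements(String almostXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(repair(almostXml))));
        } catch (Exception e) {
            // Checked parser and I/O exceptions are wrapped to keep the sketch short.
            throw new IllegalArgumentException("still not well-formed after repair", e);
        }
        int count = 0;
        for (Node child = doc.getDocumentElement().getFirstChild();
             child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String feed = "<quote>31.25</quote><quote>31.50</quote>\f";
        System.out.println(repair(feed));
        System.out.println(countTopLevelElements(feed) + " top-level elements");
    }
}
```

The important point is the division of labor: the repair happens on raw text before the parser is involved, and the parser itself only ever sees a complete, well-formed document.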
There’s another way to look at XML documents besides simply as a sequence of characters that adheres to certain rules, and it’s one that sometimes makes sense, especially when writing programs that process XML documents. An XML document is a tree. It has a root node that contains various child nodes. Some of these child nodes have children of their own. Others are leaf nodes that have no children.
There are roughly five different kinds of nodes in an XML tree:
The root: Also known as the document node, this is the abstract node that contains the entire XML document. Its children include comments, processing instructions, and the root element of the document.
Elements: An XML element with a name, a list of attributes, a list of in-scope namespaces, and a list of children.
Text: The parsed character data between two tags (or any other kind of non-text node).
Comments: An XML comment such as <!-- This needs to be fixed. -->. The contents of the comment are its data. A comment does not have any children.
Processing instructions: A processing instruction such as <?xml-stylesheet type="text/css" href="order.css"?>. A processing instruction has a target and a value. It does not have any children.
Depending on context, some details of this tree structure can be understood differently. For example, some tree models consider parsed entities or CDATA sections to be additional kinds of nodes. Others simply merge them into the tree structure as elements and text nodes. Some models allow one text node to follow another. Others require each text node to be the maximum contiguous run of text not interrupted by some other kind of node. Some models include the document type declaration and/or the XML declaration as a node. Others ignore them. Probably the most hotly debated point is how to handle attributes and namespaces. I chose to not consider them as nodes in the tree in their own right, treating them instead as properties of elements. Generally even those tree models such as XPath that do treat them as separate nodes still don’t make them children of the element they belong to. For now the details aren’t too important. The broad outline is the same for pretty much all the tree models.
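To make the tree view concrete, the following sketch walks a small document with the DOM implementation that ships with the JDK and names the kind of each node it meets. DOM is only one of the tree models mentioned above, and the class name NodeKinds and its sample document are invented for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeKinds {

    static final String SAMPLE =
            "<?xml-stylesheet type=\"text/css\" href=\"order.css\"?>"
          + "<!-- This needs to be fixed. -->"
          + "<Order><Quantity>12</Quantity></Order>";

    /** Returns the kind of every node in document order, starting at the root. */
    static List<String> kinds(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Checked parser and I/O exceptions are wrapped to keep the sketch short.
            throw new IllegalArgumentException("could not parse document", e);
        }
        List<String> result = new ArrayList<>();
        walk(doc, result);
        return result;
    }

    /** Depth-first traversal: record this node, then visit its children. */
    private static void walk(Node node, List<String> out) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:               out.add("root"); break;
            case Node.ELEMENT_NODE:                out.add("element " + node.getNodeName()); break;
            case Node.TEXT_NODE:                   out.add("text"); break;
            case Node.COMMENT_NODE:                out.add("comment"); break;
            case Node.PROCESSING_INSTRUCTION_NODE: out.add("processing instruction"); break;
            default:                               out.add("other"); break;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, out);
        }
    }

    public static void main(String[] args) {
        for (String kind : kinds(SAMPLE)) {
            System.out.println(kind);
        }
    }
}
```

Attributes do not show up in this walk because DOM, like the model used in this chapter, does not make them children of their element.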
There’s some argument about whether it really makes sense to talk about an XML document as having any independent existence separate from the text that makes up the document. After all, the XML 1.0 specification only defines concepts like document and element in terms of text strings. Later W3C specifications like the XML Information Set (Infoset) and the Document Object Model (DOM) do suggest a more abstract understanding of the components of an XML document. However, these specifications are much more controversial than XML 1.0 itself, and not as broadly implemented or accepted. For the purposes of writing programs that process XML, I do find it useful to consider XML documents more abstractly; and I will do so in this book. However, even here there’s a split depending on which API you choose. DOM is a very abstract model of XML documents that defines classes representing elements, attributes, comments, and more. SAX defines almost no such classes, however. It presents the content of an XML document almost exclusively as strings and arrays of characters.
An XML application is a specific XML vocabulary that contains particular elements and attributes. It is not a software program that somehow uses XML like the EditML Pro XML editor or the Mozilla web browser. XML applications limit the very flexible rules of XML to a finite set of elements of certain types. For example, DocBook is an XML application designed for technical manuscripts such as the book you’re reading now. Elements it defines include book, chapter, para, sect1, sect2, programlisting, and several hundred others. When writing a DocBook document, you have to use these elements; and you have to use them in certain ways. For instance, a sect2 element can be a child of a sect1 but not a child of a sect3 or a chapter. Scalable Vector Graphics (SVG) is an XML application for line art. Elements it defines include line, circle, ellipse, polygon, polyline, and so forth. All SVG documents are XML documents, but not all XML documents are SVG documents.
An XML application can have a schema that defines what is and is not a legal document for that application. Schemas can be written in a variety of languages including Document Type Definitions (DTDs), the W3C XML Schema Language, RELAX NG, Schematron, and numerous others. Depending on the power of the schema language used, it may also be necessary to specify additional rules for the application in less-formal prose. For example, the XHTML 1.1 specification includes the requirement that “There must be a DOCTYPE declaration in the document prior to the root element. If present, the public identifier included in the DOCTYPE declaration must reference the DTD found in Appendix C using its Formal Public Identifier.” None of the common schema languages allow you to require anything about the DOCTYPE declaration.
An instance document is an instance of an XML application, whether formally defined or not. That is, it is an XML document with a root element and whatever other content it possesses that satisfies all the rules of some XML application. There are many possible instance documents for any one XML application, just as there are many programs that can be written in any one programming language.
The fundamental unit of XML is the element. You can write good XML documents without using any other XML construct. If for some reason you have a grudge against comments, processing instructions, attributes, or namespaces, you can pretend they don’t exist and still write well-formed XML documents. However, you must use elements. Every XML document has at least one element. You cannot write XML documents without using elements.
Logically every element has four key pieces:
A name
The attributes of the element
The namespaces in scope on the element
The content of the element
In addition, once schemas become more prevalent and parsers and APIs are revised to support them, it may also make sense to talk about the element’s type. For now, though, there’s not a lot of practical help to be gained by considering the type.
Furthermore, DOM and XPath also have mutually incompatible concepts of the value of an element. However, in both cases, the value is derived purely from the element content, so it’s not really a separate thing.
Syntactically, in the text form of an XML document, elements are delimited by tags. Start-tags begin with a < immediately followed by the element name. End-tags begin with a </ immediately followed by the element name. Both start and end-tags terminate with >. Everything in between the two tags is the content of the element. For example, this is a Quantity element with the content “12”:
<Quantity>12</Quantity>
Tags and elements are closely related, but they are not the same thing. Be wary of books that confuse them. An element is the whole sandwich including bread, meat, cheese, pickles, and mayonnaise, while the tags are just the bread. An element is composed of a start-tag, followed by content, followed by an end-tag.
It is possible that an element may have no content. In this case it is called an empty element. For example, this is an empty Quantity element:
<Quantity></Quantity>
The start-tag butts right up against the end-tag. There is not even a single space character between them. By contrast, this next element is not empty because it does contain some white space, even if it doesn’t contain anything else:
<Quantity> </Quantity>
Besides start-tags and end-tags, there is one other kind of tag, the empty-element tag. An empty-element tag begins with a < followed by an element name like a start-tag. However, it ends with a />. For example, this is an empty Quantity tag:
<Quantity/>
This tag both starts and ends a Quantity element. The content of this element is nothing, just like the content of <Quantity></Quantity>. Indeed <Quantity/> is just syntactic sugar for <Quantity></Quantity>. They mean exactly the same thing. No application should treat these two constructs as different in any way. Indeed, most XML parsers and APIs won’t even tell you which form the element took in the source document. In both cases, what’s reported is an empty element with the name “Quantity”. How that element was represented is not important.
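A quick way to convince yourself of this is to parse the different forms and compare the results. The sketch below uses the JDK's DOM parser and the DOM Level 3 isEqualNode method; the class name EmptyElements is made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class EmptyElements {

    /** Parses the text and returns its root element. */
    static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            // Checked parser and I/O exceptions are wrapped to keep the sketch short.
            throw new IllegalArgumentException("could not parse document", e);
        }
    }

    /** True if the two documents have structurally identical root elements. */
    static boolean sameElement(String first, String second) {
        return root(first).isEqualNode(root(second));
    }

    public static void main(String[] args) {
        System.out.println(sameElement("<Quantity/>", "<Quantity></Quantity>"));
        System.out.println(sameElement("<Quantity/>", "<Quantity> </Quantity>"));
        System.out.println(root("<Quantity/>").hasChildNodes());
    }
}
```

The first comparison succeeds because the parser reports the same empty element for both spellings. The second fails because the single space is content, so that element has a text node child.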
As well as text, an element can also contain one or more child elements. These are elements that are completely contained between the element’s start-tag and end-tag, and are not contained inside any other element also contained in the parent element. For example, this ShipTo element has four child elements: Street, City, State, and Zip:
<ShipTo> <Street>135 Airline Highway</Street > <City>Narragansett</City> <State>RI</State> <Zip>02882</Zip> </ShipTo>
In addition to the four child elements, this ShipTo element also contains some white space; for example, the single space character between </City> and <State>. These spaces form text nodes that are also counted among the element’s children. Text nodes like these that are composed of nothing but white space are sometimes called ignorable white space. This is an unfortunate turn of phrase. Sometimes you can ignore these nodes, but most of the time you can’t. The more proper term is white space in element content.[4]
All the elements contained inside an element are called the element’s descendants. Only the highest level are the children. The descendants include not only the children, but the children of the children, the children of the children’s children, and so forth. If you look at Example 1.2 again, you’ll see that the Order element has 15 descendant elements.
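The distinction between child nodes, child elements, and descendant elements is easy to check in code. This is a small sketch using the JDK's DOM parser; the class ChildrenAndDescendants and the abbreviated Order document are assumptions for illustration and are not Example 1.2 itself.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ChildrenAndDescendants {

    static final String SHIP_TO =
            "<ShipTo> <Street>135 Airline Highway</Street> <City>Narragansett</City>"
          + " <State>RI</State> <Zip>02882</Zip> </ShipTo>";
    static final String ORDER = "<Order>" + SHIP_TO + "</Order>";

    private static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            // Checked parser and I/O exceptions are wrapped to keep the sketch short.
            throw new IllegalArgumentException("could not parse document", e);
        }
    }

    /** All children of the root element, including white space text nodes. */
    static int childNodes(String xml) {
        return root(xml).getChildNodes().getLength();
    }

    /** Only those children of the root element that are themselves elements. */
    static int childElements(String xml) {
        int count = 0;
        for (Node n = root(xml).getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                count++;
            }
        }
        return count;
    }

    /** Every element nested at any depth inside the root element. */
    static int descendantElements(String xml) {
        return root(xml).getElementsByTagName("*").getLength();
    }

    public static void main(String[] args) {
        System.out.println("ShipTo child nodes:        " + childNodes(SHIP_TO));
        System.out.println("ShipTo child elements:     " + childElements(SHIP_TO));
        System.out.println("Order child elements:      " + childElements(ORDER));
        System.out.println("Order descendant elements: " + descendantElements(ORDER));
    }
}
```

The gap between the child node count and the child element count for ShipTo is the white space in element content discussed above.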
An element can also have mixed content. This is when an element contains both child elements and text nodes containing non-whitespace characters. For example, this variant ShipTo element has both the child elements you saw before as well as text nodes containing the strings “Chez Fred” and “Apt. 17D”:
<ShipTo> Chez Fred <Street>135 Airline Highway</Street > Apt. 17D <City>Narragansett</City> <State>RI</State> <Zip>02882</Zip> </ShipTo>
Mixed content is very useful, indeed almost essential, for XML applications that contain narratives such as books and stories. Such applications include XHTML, DocBook, TEI, and XSL Formatting Objects. Mixed content is much less useful and much more cumbersome for data-oriented applications. XML documents that are intended for computers to read, as opposed to XML documents that are intended for humans to read, should use mixed content sparingly, if at all.
XML documents are text. Each XML document is a sequence of characters. These characters are taken from the Unicode character set. However, XML documents can be written in any character set which your XML parser knows how to convert to Unicode, providing that it is properly specified in the document’s encoding declaration in the XML declaration.
Many developers have decided that they can make XML more efficient by defining a binary version. This tends to be based on some vague notion that binary formats are inherently smaller or faster than text formats. These developers rarely have any actual evidence to back up this claim, which is not surprising since it isn’t true. XML documents are routinely smaller and faster to read than the equivalent binary files in standard applications like Oracle, Microsoft Word, Microsoft Excel, and so forth. The fact is modern binary file formats are quite bloated, but disks have gotten so large that almost no one’s noticed or cared. Nonetheless, there seems to be a large pool of programmers who mistakenly believe:
File size matters.
They can compress better than gzip.
Human legible/human editable data doesn’t matter.
All three beliefs have been empirically proven false time and time again. Nonetheless, about once a month some developer somewhere announces that they’ve come up with yet another special purpose binary compression format for XML. These have proven completely pointless in practice. There is no actual benefit to these formats, and no one needs one. Worse yet, such a format substantially eliminates many of the existing benefits of XML.
Unicode is a character set with room for over one million different characters, though currently (Unicode 3.2) slightly fewer than 100,000 are defined. Scripts covered by Unicode include Latin, Cyrillic, Greek, Hebrew, Arabic, Devanagari, the Han ideographs, and many more.
Contrary to what you may have heard, Unicode is not a two-byte character set and really never has been. Since there are more than a million different spaces for characters in Unicode, an arbitrary Unicode character cannot be represented by a single two-byte unsigned integer such as Java’s char data type. Prior to Unicode 3.1 all defined Unicode characters had code points less than 65,536, which fooled some developers into thinking they could get away with using two-byte chars. However, it’s long been known that more than 65,536 characters are actually used on Earth today and that Unicode would have to assign characters outside the Basic Multilingual Plane (the first 65,536 code points) to accommodate them. Although characters were not actually assigned code points greater than 65,535 until Unicode 3.1, the space for them has long been reserved. XML was designed by forward-thinkers who saw the problems ahead, and prepared for the eventual expansion of Unicode. Consequently XML documents can use the full range of all million-plus characters available in Unicode. Java’s designers weren’t as prescient though, and restricted the char data type to two bytes. Consequently Java programmers need to go through some pretty nasty gyrations to adequately handle Unicode documents (including XML documents).
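The mismatch between Java's char and a Unicode character shows up as soon as a string contains a character outside the Basic Multilingual Plane. The sketch below assumes Java 5 or later, where the code point methods on String and Character are available; the class name BeyondTheBmp is invented, and U+1D11E (the musical G clef) is simply a convenient example character.

```java
class BeyondTheBmp {

    /** Builds a string holding exactly one Unicode character. */
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    /** Number of Unicode characters, as opposed to two-byte char units. */
    static int characterCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String clef = fromCodePoint(0x1D11E);
        System.out.println("char units:  " + clef.length());
        System.out.println("characters:  " + characterCount(clef));
        System.out.println("first char is a high surrogate: "
                + Character.isHighSurrogate(clef.charAt(0)));
        System.out.println("code point:  U+"
                + Integer.toHexString(clef.codePointAt(0)).toUpperCase());
    }
}
```

Code that counts or indexes characters with length() and charAt() alone will treat this one character as two, which is the kind of gyration referred to above.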
With a very few exceptions any character defined in Unicode can be used in the text content of an element or the value of an attribute. In brief, the exceptions are:
The non-printing characters such as null and formfeed, between code points 0 and 31 (decimal). The carriage return, linefeed, and the horizontal tab are allowed.
The surrogate blocks are two sets of 1024 code points each, which are used to extend Unicode beyond the Basic Multilingual Plane by allowing some characters to be represented as two surrogate characters. You can include surrogate pairs in an XML document in an encoding like UTF-16 that uses surrogate pairs. You just can’t treat an individual half of a surrogate pair as a character by itself.
The noncharacters at code points 65,534 and 65,535 (U+FFFE and U+FFFF). The byte order mark (U+FEFF), also known as the zero-width non-breaking space, is itself a legal character; it is normally used only at the very beginning of a document to indicate the encoding and endianness of the document.
All other characters are fair game, including some you probably shouldn’t be using anyway such as characters in the private use area and compatibility characters Unicode offers purely for interoperability with existing character sets.
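If you generate XML from arbitrary data, it can be useful to check characters before the parser rejects them. The following sketch encodes the Char production of XML 1.0 as a simple range test. The class name XmlChars and the firstIllegal helper are hypothetical, and the ranges should be checked against the specification before being relied upon.

```java
class XmlChars {

    /** True if the code point may appear in an XML 1.0 document. */
    static boolean isXmlChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD  // tab, linefeed, carriage return
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)               // up to the surrogate blocks
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)             // rest of the BMP
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);         // beyond the BMP
    }

    /** Index of the first illegal character in the string, or -1 if there is none. */
    static int firstIllegal(String text) {
        int i = 0;
        while (i < text.length()) {
            // codePointAt joins a valid surrogate pair and returns a lone
            // surrogate as itself, which the range test then rejects.
            int codePoint = text.codePointAt(i);
            if (!isXmlChar(codePoint)) {
                return i;
            }
            i += Character.charCount(codePoint);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(isXmlChar(0x0C));           // form feed
        System.out.println(isXmlChar('\t'));           // horizontal tab
        System.out.println(firstIllegal("12\f units"));
    }
}
```

A check like this belongs on the writing side. On the reading side the parser performs it for you and reports a well-formedness error.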
The rules for characters used in the names of things (elements, attributes, entities, etc.) are a little stricter. In brief, only letters, digits, and ideographs defined in Unicode 2.0 can be used. In addition the punctuation marks -, ., _, and : are also legal. Digits, the hyphen, and the period cannot be the first character in a name. Other punctuation marks as well as new characters first defined in Unicode 3.0 and later are not allowed anywhere in a name. These are essentially the same rules used for naming variables, methods, and classes in Java. The major difference is that XML allows the hyphen and Java doesn’t (it’s reserved for the minus sign) while Java allows the dollar sign and XML doesn’t. XML also allows the colon, unlike Java. However, XML reserves this for use with namespaces. It should not be used as an arbitrary name character.
XML parsers faithfully preserve white space. A string containing only white space is not the same as a string containing nothing at all. A string with leading and trailing white space is not the same as the equivalent string with white space trimmed. Some specific XML applications may decide that white space is not significant in certain contexts. However, in generic XML all white space is significant and must be accounted for.
Attributes are name-value pairs associated with elements. The name of an attribute may be any legal XML name. The value may be any string of text, even potentially including characters like < and ". The document author needs to escape such characters as &lt; and &quot;. However, the parser will resolve these references before passing the data to your application. The attribute value is enclosed in either single or double quotes, and the name is separated from the value by an equals sign. For example, this Subtotal element has a currency attribute with the value USD:
<Subtotal currency='USD'>393.85</Subtotal>
The quote marks are not part of the attribute value. Whether single or double quotes are used or whether there’s extra white space around the equals sign is not important. Most parsers don’t bother to report the difference. These two elements are also the same as the previous one:
<Subtotal currency="USD">393.85</Subtotal>
<Subtotal currency = "USD">393.85</Subtotal>
Attributes are unordered. There is no difference between these two elements:
<Tax rate="7.0" currency="USD">27.57</Tax>
<Tax currency="USD" rate="7.0">27.57</Tax>
When a parser tells you which attributes are attached to an element, it may or may not provide them in the same order they had in the input document. Some APIs report the attributes using an unordered data structure like a hash table. Others use an array or a list, but even in these cases there’s no guarantee that the order of the attributes in the list matches the order of the attributes in the start-tag.
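Because neither the quoting style nor the order of attributes is significant, code that compares elements should compare attributes by name, not by position. Here is a minimal sketch using the JDK's DOM parser; the class AttributeDemo is invented for the example, and the sorted map is just one convenient order-independent representation.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class AttributeDemo {

    /** Returns the root element's attributes as a name-to-value map. */
    static Map<String, String> attributes(String xml) {
        NamedNodeMap atts;
        try {
            atts = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getAttributes();
        } catch (Exception e) {
            // Checked parser and I/O exceptions are wrapped to keep the sketch short.
            throw new IllegalArgumentException("could not parse document", e);
        }
        Map<String, String> result = new TreeMap<>();
        for (int i = 0; i < atts.getLength(); i++) {
            Node attribute = atts.item(i);
            result.put(attribute.getNodeName(), attribute.getNodeValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> first = attributes("<Tax rate=\"7.0\" currency=\"USD\">27.57</Tax>");
        Map<String, String> second = attributes("<Tax currency = 'USD' rate='7.0'>27.57</Tax>");
        System.out.println(first.equals(second));
        // The parser resolves the references before the value reaches the program.
        System.out.println(attributes("<Note text=\"&lt;&quot;\"/>").get("text"));
    }
}
```

The second call in main shows the escaping rule from earlier in this section: the document contains references, but the program receives the literal characters.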
Perhaps most surprisingly, attribute values whose type is not CDATA are normalized. This means that all leading and trailing white space is stripped from the value, and runs of white space characters are compressed to a single space. This does not apply to any of the attributes in the examples seen so far because attributes that have no declaration are treated as CDATA and are not normalized in this way. However, once you add a DTD it is possible to declare that an attribute has type ID, IDREF, IDREFS, NMTOKEN, or one of several other types. Attributes of these types are always normalized before being passed to the client application.
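The effect is easiest to see by giving the same raw value to two attributes that differ only in declared type. This sketch relies on the JDK's default, non-validating DOM parser reading the internal DTD subset; the class NormalizationDemo, the element a, and the attributes t and u are all invented for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class NormalizationDemo {

    // t is declared NMTOKENS, u is declared CDATA; both receive the same raw value.
    static final String DOC =
            "<!DOCTYPE a [\n"
          + "  <!ELEMENT a EMPTY>\n"
          + "  <!ATTLIST a t NMTOKENS #IMPLIED\n"
          + "              u CDATA    #IMPLIED>\n"
          + "]>\n"
          + "<a t='  x   y  ' u='  x   y  '/>";

    /** Returns the named attribute of the root element as the parser reports it. */
    static String attribute(String xml, String name) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getAttribute(name);
        } catch (Exception e) {
            // Checked parser and I/O exceptions are wrapped to keep the sketch short.
            throw new IllegalArgumentException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("[" + attribute(DOC, "t") + "]");
        System.out.println("[" + attribute(DOC, "u") + "]");
    }
}
```

The brackets in the output make the surviving white space visible: the tokenized attribute is trimmed and collapsed, while the CDATA attribute keeps every space.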
Tim Bray, one of the primary authors of XML 1.0, has admitted that normalization of attribute values was a mistake. In his words, “Why the $#%%!@! should attribute values be ‘normalized’ anyhow? This was a pure process failure: at no point during the 18-month development cycle of XML 1.0 did anyone stand up and say ‘why are you doing this?’ I’d bet big bucks that if someone had, the silly thing would have died a well-deserved death.” [5]
Most XML documents begin with an XML declaration. An XML declaration has a version attribute with the value 1.0 and may have optional standalone and encoding attributes. For example, this XML declaration says that the document is written in XML 1.0 in the ISO-8859-1 (Latin-1) character set and does not require the parser to read the external DTD subset:
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
The version attribute always has the value 1.0. If XML 1.0 is ever revised, this may change to some other value. As I write this, there’s a hotly debated proposal at the W3C for a new version of XML code named “Blueberry” which would make XML marginally more compatible with Unicode 3.0 and later as well as making it easier to edit with some brain damaged IBM mainframe software that can’t handle files where lines end in carriage returns, line feeds, or both. If this gets adopted (and I for one hope it doesn’t) this may lead to a new value for the version attribute. However, for now, version is effectively fixed with the value 1.0.
The encoding attribute identifies the character set and encoding in which the document is written. Whatever the encoding is, one of the jobs of the parser is to convert the document to Unicode before passing it to the client application. Most APIs don’t offer any means of finding out what the original encoding was. You’ll simply receive Unicode strings from which all traces of the original encoding have been removed.
The standalone attribute specifies whether the XML parser may have to read parts of the DTD that are outside the instance document to correctly parse the file. This is mostly a hint for the parser. Some parser APIs may tell you what the value was, but you generally don’t need to worry about it. The parser either will or won’t read external entities as necessary. By the time your code gets hold of the document, all of this will have already been taken care of. You need not concern yourself with it.
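The following sketch shows the encoding declaration at work: the same text is stored in two different encodings, and the program receives identical Unicode strings either way. The class EncodingDemo and its helpers are made up for the example, and the text passed in is assumed to contain no markup characters.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

class EncodingDemo {

    /** Builds a tiny document, declared and stored in the given encoding. */
    static byte[] encode(String text, Charset charset) {
        String document =
                "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>"
              + "<name>" + text + "</name>";
        return document.getBytes(charset);
    }

    /** Parses the raw bytes and returns the text of the root element. */
    static String rootText(byte[] document) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(document))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Checked parser and I/O exceptions are wrapped to keep the sketch short.
            throw new IllegalArgumentException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String name = "caf\u00e9";
        byte[] latin1 = encode(name, StandardCharsets.ISO_8859_1);
        byte[] utf8 = encode(name, StandardCharsets.UTF_8);
        System.out.println("stored sizes differ: " + (latin1.length != utf8.length));
        System.out.println("decoded text equal:  " + rootText(latin1).equals(rootText(utf8)));
    }
}
```

Note that the parser is handed bytes, not a String or Reader. If you decode the bytes yourself before parsing, the encoding declaration no longer has any say in the matter.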
XML comments are almost exactly like HTML comments. They begin with <!-- and end with -->. For example, here’s a comment you might find in an order document:
<!-- Please make sure this order goes out ASAP! -->
Everything between the <!-- and the --> should be ignored. In fact, most parsers and APIs do make the comments available to you if you want them, mostly so you can round trip documents (read them in and then write them back out again with everything still intact). However, beyond this use case, you really shouldn’t pay much attention to comments in your programs. Some HTML systems abuse comments to support server side includes or editor specific extensions. Since XML is much more flexible than HTML, however, you can use elements, attributes, or, as a last resort, processing instructions for these use cases.
Processing instructions are used to tell particular software how it should handle an XML document after the document has been parsed. Generally, processing instructions are used for meta-information that may apply to documents from many different domains and XML vocabularies. For instance, the most common processing instruction, xml-stylesheet, tells a browser or other formatter where it can find the stylesheet it should apply to the document. This can be used with DocBook documents, XHTML documents, Human Resources Markup Language documents, or the custom XML application you invented last Tuesday to catalog your baseball card collection. For another example, the Apache XML Project’s Cocoon application server reads cocoon-process processing instructions to figure out what processes to apply to a document before sending it to a user. This processing instruction tells Cocoon to replace the XInclude include elements with the contents of the documents they reference:
<?cocoon-process type="xinclude"?>
The basic syntax of a processing instruction is <?, followed immediately by an XML name identifying the target of the processing instruction, followed by white space and any data at all, followed by ?>.
Unlike elements or attributes, processing instructions can be added to a document without considering whether or not the DTD or schema allows it. Most schema languages do not consider the presence, absence, or structure of processing instructions when determining validity. Furthermore, unlike elements, processing instructions can appear before, after, or inside the root element. They are frequently placed in the document prolog, though they can appear in the document body or after the root element as well.
Most of the time, the processing instruction is not associated with any one XML application. For instance, an XML application may describe gene sequences, 16th century Italian love poetry, financial records, or vector graphics. However, each of these might need to be loaded into a Web browser which would apply a stylesheet to it. Processing instructions can be inserted into a document to support this without changing or affecting the normal document structure. In essence, processing instructions provide an out-of-band channel for passing information to software other than the program that would normally read a document.
XML parsers report the target and contents of processing instructions to the client application. However, they provide no further support for interpreting the data in the processing instruction. For instance, many processing instructions use a pseudo-attribute format like this:
<?xml-stylesheet type="text/xml" href="limited.xsl"?>
However, as far as the XML parser is concerned, the data in this processing instruction is just a string that happens to contain some equals signs and quotation marks. These are not treated differently than any other character.[6] Both the syntax and semantics of the data are completely up to the application reading the document. Processing instructions are specifically for information that is not related to XML.
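In code, this means the parser hands you a target and one undifferentiated string, and any further structure is yours to extract. The sketch below reads the processing instruction through DOM and then picks out a pseudo-attribute with a regular expression. The class StylesheetPi and its pseudoAttribute helper are hypothetical, and the pattern is deliberately naive: it does not, for instance, resolve character references inside the quoted values.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

class StylesheetPi {

    /** Returns the first processing instruction at the top level, or null. */
    static ProcessingInstruction firstPi(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Checked parser and I/O exceptions are wrapped to keep the sketch short.
            throw new IllegalArgumentException("could not parse document", e);
        }
        for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                return (ProcessingInstruction) n;
            }
        }
        return null;
    }

    /** Application-level parsing of name="value" pairs; the XML parser does none of this. */
    static String pseudoAttribute(String data, String name) {
        Pattern pattern = Pattern.compile(
                "(?:^|\\s)" + Pattern.quote(name) + "\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");
        Matcher m = pattern.matcher(data);
        if (!m.find()) {
            return null;
        }
        return m.group(1) != null ? m.group(1) : m.group(2);
    }

    public static void main(String[] args) {
        ProcessingInstruction pi = firstPi(
                "<?xml-stylesheet type=\"text/xml\" href=\"limited.xsl\"?><Order/>");
        System.out.println("target: " + pi.getTarget());
        System.out.println("data:   " + pi.getData());
        System.out.println("href:   " + pseudoAttribute(pi.getData(), "href"));
    }
}
```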
XML documents are not necessarily the same thing as XML files. A single XML document may be composed of several different files. Indeed, the pieces that make up an XML document may not be files at all, but may instead be records in a database, data sent out over the Internet by a web server in response to a CGI query, a small part of a much larger file, or something stranger still.
The individual storage units that make up any one XML document are called entities. Every XML document has at least one entity, the document entity. This is the storage unit, be it a file or something else, that holds the root element of the document. Every other entity in a document has a name. There are five such kinds of named entities, and they are classified according to three criteria:
Internal or external: The replacement text of an internal entity is defined as a string literal in the document’s DTD. The replacement text of an external entity is read out of a different file located via a URL.
Parsed or unparsed: A parsed entity contains XML. It is itself well-formed, and may even be a complete XML document if it has a root element. (Some entities that are only intended to be used as parts of other documents do not have root elements.) You can think of a parsed entity as something that will be pasted right into the middle of an XML document, such that the resulting document would still be well-formed.
An unparsed entity can contain anything at all, including binary data. Unparsed entities are not pasted (even metaphorically) into XML documents. Instead a URL to the entity’s data is provided in an ENTITY declaration in the DTD. Then this entity is referenced in an attribute with the type ENTITY or ENTITIES in the document. An unparsed entity also has a notation that defines the type of the data in the unparsed entity (e.g. GIF image or C source code). Like the URL, the notation is also specified in the DTD rather than in the instance document. In practice, unparsed entities and notations are not much used.
General or parameter: A general entity is used within the instance document. A general entity reference begins with an &. A parameter entity is used within the DTD. A parameter entity reference begins with a %. Since this book focuses on processing instance documents, we’ll consider general entities primarily.
Not all combinations are possible. In fact, there are exactly five kinds of named entities:
Internal parsed general entities are the familiar entity references like &amp; and &copy; that are defined completely in the DTD. For example, this declaration defines the copy entity as the text “Copyright”:
<!ENTITY copy "Copyright">
These entities are used in element content and attribute values.
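From the program's point of view, references to internal parsed general entities simply disappear: by the time you ask for the text, the replacement text is already in place. A minimal sketch using the JDK's DOM parser follows; the class InternalEntityDemo and the notice element are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class InternalEntityDemo {

    // The copy entity is declared in the internal DTD subset; amp is predefined.
    static final String DOC =
            "<!DOCTYPE notice [\n"
          + "  <!ENTITY copy \"Copyright\">\n"
          + "]>\n"
          + "<notice>&copy; 2002 &amp; later</notice>";

    /** Returns the text content of the root element after entity resolution. */
    static String rootText(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Checked parser and I/O exceptions are wrapped to keep the sketch short.
            throw new IllegalArgumentException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootText(DOC));
    }
}
```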
External parsed general entities are just like internal parsed general entities except that their replacement text is read from a separate document rather than the DTD. The document is identified by a relative or absolute URL. For example, this declaration defines the legal entity as the content read from the URL http://www.example.com/legal.xml:
<!ENTITY legal SYSTEM "http://www.example.com/legal.xml">
The file such an entity is read from is just like another XML document except that it has a text declaration instead of an XML declaration, may not have a document type declaration, and might not have a single root element.
External unparsed general entities refer to files containing non-XML, binary data. They are declared similarly to external parsed entities, but they also have a notation. For example, these definitions identify an unparsed entity named logo at the URL http://www.example.com/logo.png with the notation image/png:
<!NOTATION PNG SYSTEM "image/png">
<!ENTITY logo SYSTEM "http://www.example.com/logo.png" NDATA PNG>
Unparsed entities are referenced by attributes with type ENTITY or ENTITIES rather than by entity references. For example, such an attribute might be declared like this:
<!ELEMENT figure EMPTY>
<!ATTLIST figure source ENTITY #REQUIRED>
Instances of the figure element would look like this:
<figure source="logo"/>
The parser does not actually provide you with the contents of an unparsed entity. Instead it tells you the URI from which the data can be retrieved and the notation for that data. However, you have to use Java’s networking and I/O classes to get the data at that URI.
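With SAX, this information arrives through the DTDHandler interface, which DefaultHandler implements. The sketch below collects each unparsed entity's system identifier and notation name. The class UnparsedEntityDemo is invented for the example, and the sample document places the declarations shown above in an internal DTD subset so that it is self-contained.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class UnparsedEntityDemo {

    static final String DOC =
            "<!DOCTYPE figure [\n"
          + "  <!NOTATION PNG SYSTEM \"image/png\">\n"
          + "  <!ENTITY logo SYSTEM \"http://www.example.com/logo.png\" NDATA PNG>\n"
          + "  <!ELEMENT figure EMPTY>\n"
          + "  <!ATTLIST figure source ENTITY #REQUIRED>\n"
          + "]>\n"
          + "<figure source=\"logo\"/>";

    /** Maps each unparsed entity name to "systemId (notation)". */
    static Map<String, String> unparsedEntities(String xml) {
        final Map<String, String> found = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void unparsedEntityDecl(String name, String publicId,
                                           String systemId, String notationName) {
                // Only the location and notation are reported, never the data itself.
                found.put(name, systemId + " (" + notationName + ")");
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Checked parser and I/O exceptions are wrapped to keep the sketch short.
            throw new IllegalArgumentException("could not parse document", e);
        }
        return found;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> entry : unparsedEntities(DOC).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

Nothing here opens a network connection. Fetching the image, if you want it, is a separate step using the URI the handler received.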
Internal parsed parameter entities are used purely within the DTD. The replacement text is provided by a string literal in the DTD. References to these entities begin with a percent sign. They’re often used to parameterize content models and attribute types. For example, the DocBook DTD defines the intermod.redecl.module parameter entity as the word IGNORE:
<!ENTITY % intermod.redecl.module "IGNORE">
Unlike a general entity reference, the %intermod.redecl.module; parameter entity reference can only be used in the DTD, not in the instance document. Since our focus is on instance documents, not DTDs, you won’t see a lot of these in this book.
External parsed parameter entities are used purely within the DTD. The replacement text is provided by a DTD fragment at a given URL. References to these entities begin with a percent sign. They often connect the different parts of a modular DTD into one coherent whole. For example, the DocBook DTD defines the dbpool parameter entity using a PUBLIC ID that loads the DTD fragment at the relative URL dbpoolx.mod:
<!ENTITY % dbpool PUBLIC "-//OASIS//ELEMENTS DocBook XML Information Pool V4.1.2//EN" "dbpoolx.mod">
Again, since our focus is on instance documents and not DTDs, you won’t see a lot of these in this book.
Namespaces are not part of XML 1.0. Namespaces were invented about a year after XML 1.0 was released to help sort out the rapidly expanding world of XML applications that all needed to be mixed together in the same documents. There are many good XML applications that don’t use them at all. For example, DocBook 4.1.2, the XML application in which this book was written, is completely namespace free as are XML-RPC and RSS 0.9.1. However, even if you can write very useful XML applications without thinking about namespaces, you’re going to encounter namespaces when you work with XML applications designed by other developers. Consequently it’s important to have a solid understanding of them.
The key idea of namespaces is that each element is bound to a Uniform Resource Identifier (URI) (a URL in practice). If IBM only uses URIs in the ibm.com domain and Sun only uses URIs in the sun.com domain, then there won’t be any confusion between Sun’s Book element and IBM’s Book element, even if they’re used in the same document. Just look at the URIs to tell which is which.
A URI identifies a resource, but it does not necessarily locate it. URIs include not only Uniform Resource Locators (URLs) but also Uniform Resource Names (URNs). For instance, a URN for this book based on its ISBN number is urn:isbn:0201771861; but this does not tell you where you can find a copy of the book. However, most developers agree that only absolute URLs should be used as namespace URIs, and most XML applications follow this suggestion.
The URIs are purely string identifiers. Even if the URI is a URL, the parser does not connect to the server and try to download the document that’s found there. Indeed there may not be any such document. When plugged into web browsers, namespace URLs often produce 404 Not Found errors. You can use namespaces in standalone systems without any network connection at all. You don’t even have to have access to DNS. For the same reason, two different URLs that point to the same page define two different namespaces. For example, the following URLs identify the same page but three different namespaces:
Since URIs contain many characters which are illegal in element names as well as being excessively long to type, short prefixes stand in for the URIs. The prefixes are separated from the local name by a colon. For instance, instead of the URI http://www.w3.org/2001/XInclude you might use the prefix xinclude or xi. An include element in the http://www.w3.org/2001/XInclude namespace would then be written as xi:include. This element has the prefix xi, the local name include, the qualified name xi:include, and the namespace URI http://www.w3.org/2001/XInclude.
xmlns:prefix attributes bind particular prefixes to particular URIs within the element where the attribute appears. For example, inside this Order element, the prefix xi is bound to the URI http://www.w3.org/2001/XInclude:
<Order xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="order_details.xml"/>
</Order>
Each prefix used in an element or attribute name must be bound to a URI. Failure to do this is a namespace well-formedness error. Although you can parse documents without considering namespaces, in practice most parsers and APIs check namespaces by default and a violation of namespace well-formedness is as serious as a violation of the rules of XML 1.0.
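To see the difference between the two modes, here is a minimal sketch using JAXP, which happens to be one API where namespace awareness has to be switched on explicitly with setNamespaceAware(true). The class name UnboundPrefixDemo, the parses() helper, and the sample string are assumptions made up for this illustration. The sample uses the xi prefix without ever binding it to a URI.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class UnboundPrefixDemo {

    // Hypothetical sample: the xi prefix is used but never declared.
    static final String UNBOUND =
        "<Order><xi:include href=\"order_details.xml\"/></Order>";

    // Returns true if the parser accepts the document, false if it
    // rejects it with a parse error.
    static boolean parses(String xml, boolean namespaceAware) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXParseException ex) {
            return false;
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println("Ignoring namespaces: " + parses(UNBOUND, false));
        System.out.println("Checking namespaces: " + parses(UNBOUND, true));
    }
}
```

With the JDK's bundled parser, the same string should be accepted when namespaces are ignored and rejected when they are checked.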
The prefix can change as long as the URI stays the same. For example, this element is the same as the previous one:
<Order xmlns:xinclude="http://www.w3.org/2001/XInclude">
  <xinclude:include href="order_details.xml"/>
</Order>
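The four parts of a name can be inspected directly through the DOM. The following is only a sketch: the class name QualifiedNameDemo and its two helper methods are hypothetical, and the two sample strings are whitespace-free copies of the Order elements shown above. It parses both forms with a namespace-aware parser and describes the include element in each.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class QualifiedNameDemo {

    static final String XI_PREFIX =
        "<Order xmlns:xi=\"http://www.w3.org/2001/XInclude\">"
      + "<xi:include href=\"order_details.xml\"/></Order>";

    static final String XINCLUDE_PREFIX =
        "<Order xmlns:xinclude=\"http://www.w3.org/2001/XInclude\">"
      + "<xinclude:include href=\"order_details.xml\"/></Order>";

    // Parses with namespaces enabled and returns the root's first child.
    // The samples contain no white space, so that child is the include element.
    static Element firstChild(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return (Element) doc.getDocumentElement().getFirstChild();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    // Lists the qualified name, prefix, local name, and namespace URI.
    static String describe(Element e) {
        return "qualified name " + e.getNodeName()
            + ", prefix " + e.getPrefix()
            + ", local name " + e.getLocalName()
            + ", namespace URI " + e.getNamespaceURI();
    }

    public static void main(String[] args) {
        System.out.println(describe(firstChild(XI_PREFIX)));
        System.out.println(describe(firstChild(XINCLUDE_PREFIX)));
    }
}
```

Only the prefix and the qualified name differ between the two descriptions; the local name and the namespace URI are identical.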
You can also define a default namespace that applies to elements without prefixes. For example, Example 1.6 places the Order element and all its descendants in the http://ns.cafeconleche.org/Orders/ namespace, even though none of them have prefixes.
Example 1.6. An XML document that uses a default namespace
<?xml version="1.0" encoding="ISO-8859-1"?>
<Order xmlns="http://ns.cafeconleche.org/Orders/">
  <Customer id="c32">Chez Fred</Customer>
  <Product>
    <Name>Birdsong Clock</Name>
    <SKU>244</SKU>
    <Quantity>12</Quantity>
    <Price currency="USD">21.95</Price >
    <ShipTo>
      <Street>135 Airline Highway</Street >
      <City>Narragansett</City>
      <State>RI</State>
      <Zip>02882</Zip>
    </ShipTo>
  </Product>
  <Subtotal currency='USD'>263.40</Subtotal>
  <Tax rate="7.0" currency='USD'>18.44</Tax>
  <Shipping method="USPS" currency='USD'>8.95</Shipping>
  <Total currency='USD' >290.79</Total>
</Order>
Although it’s most common to place the namespace binding attributes on the root element, they can appear on other elements deeper in the hierarchy. They can even override previous bindings in the ancestor elements. This is especially common with the binding of the default namespace. For instance, in Example 1.7 the Order, Customer, Product, Name, SKU, Quantity, Price, Subtotal, Tax, Shipping, and Total elements are all in the http://ns.cafeconleche.org/Orders/ namespace. However, the ShipTo, Street, City, State, and Zip elements are in the http://ns.cafeconleche.org/Address/ namespace.
Example 1.7. An XML document that uses two default namespaces
<?xml version="1.0" encoding="ISO-8859-1"?>
<Order xmlns="http://ns.cafeconleche.org/Orders/">
  <Customer id="c32">Chez Fred</Customer>
  <Product>
    <Name>Birdsong Clock</Name>
    <SKU>244</SKU>
    <Quantity>12</Quantity>
    <Price currency="USD">21.95</Price >
    <ShipTo xmlns="http://ns.cafeconleche.org/Address/">
      <Street>135 Airline Highway</Street >
      <City>Narragansett</City>
      <State>RI</State>
      <Zip>02882</Zip>
    </ShipTo>
  </Product>
  <Subtotal currency='USD'>263.40</Subtotal>
  <Tax rate="7.0" currency='USD'>18.44</Tax>
  <Shipping method="USPS" currency='USD'>8.95</Shipping>
  <Total currency='USD' >290.79</Total>
</Order>
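One way to convince yourself of which element lands in which namespace is to walk the tree and record each element's namespace URI. The sketch below does that for an abbreviated version of Example 1.7. The class name DefaultNamespaceDemo and its methods are hypothetical, and the sample string keeps only a handful of the elements from the example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DefaultNamespaceDemo {

    // Abbreviated form of Example 1.7, without white space between tags.
    static final String SAMPLE =
        "<Order xmlns=\"http://ns.cafeconleche.org/Orders/\">"
      + "<Customer id=\"c32\">Chez Fred</Customer>"
      + "<Product><Name>Birdsong Clock</Name>"
      + "<ShipTo xmlns=\"http://ns.cafeconleche.org/Address/\">"
      + "<Street>135 Airline Highway</Street><City>Narragansett</City>"
      + "</ShipTo></Product>"
      + "<Subtotal currency=\"USD\">263.40</Subtotal>"
      + "</Order>";

    // Maps each element's local name to its namespace URI, in document order.
    static Map<String, String> namespacesByElement(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Map<String, String> result = new LinkedHashMap<>();
            collect(doc.getDocumentElement(), result);
            return result;
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    // Depth-first walk that records element nodes only.
    private static void collect(Node node, Map<String, String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.put(node.getLocalName(), node.getNamespaceURI());
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, out);
        }
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : namespacesByElement(SAMPLE).entrySet()) {
            System.out.println(e.getKey() + " is in " + e.getValue());
        }
    }
}
```

The inner xmlns attribute on ShipTo applies to ShipTo and its descendants only; Subtotal, which follows the closing Product tag, is back in the Orders namespace.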
Although it’s less common, prefixes can also be attached to attribute names to indicate what namespace the attribute is in. For example, XLink uses this to distinguish between the XLink attributes such as type and href and attributes with the same names that might be used in elements that need to become XLinks. This ShipTo element is also a simple XLink to the recipient’s e-mail address:
<ShipTo xmlns="http://ns.cafeconleche.org/Address/"
        xmlns:xlink="http://www.w3.org/1999/xlink"
        xlink:type="simple" xlink:href="mailto:chezfred@yahoo.com" >
  <GiftRecipient>Samuel Johnson</GiftRecipient>
  <Street>271 Old Homestead Way</Street >
  <City>Woonsocket</City>
  <State>RI</State>
  <Zip>02895</Zip>
</ShipTo>
Unprefixed attributes are never in any namespace. Unlike elements, they cannot be in the default namespace. Furthermore, they are not in the same namespace as the element to which they are attached. If an attribute does not have a prefix, it is not in a namespace.
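A short DOM sketch makes the rule concrete. It assumes a ShipTo element like the one above with one extra unprefixed attribute, id, which I have added purely for illustration; the class name AttributeNamespaceDemo and its helper methods are likewise hypothetical. The element is in the default namespace, the xlink:href attribute is in the XLink namespace, and the unprefixed id attribute reports no namespace at all.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class AttributeNamespaceDemo {

    // Hypothetical variant of the ShipTo element with an unprefixed id attribute.
    static final String SAMPLE =
        "<ShipTo xmlns=\"http://ns.cafeconleche.org/Address/\""
      + " xmlns:xlink=\"http://www.w3.org/1999/xlink\""
      + " xlink:type=\"simple\" xlink:href=\"mailto:chezfred@yahoo.com\""
      + " id=\"s1\"/>";

    // Parses with namespaces enabled and returns the root element.
    static Element parseRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    // Returns the namespace URI of the attribute with the given qualified
    // name, or null if that attribute is not in any namespace.
    static String attributeNamespace(Element e, String qualifiedName) {
        Attr attr = e.getAttributeNode(qualifiedName);
        return attr.getNamespaceURI();
    }

    public static void main(String[] args) {
        Element shipTo = parseRoot(SAMPLE);
        System.out.println("ShipTo:     " + shipTo.getNamespaceURI());
        System.out.println("xlink:href: " + attributeNamespace(shipTo, "xlink:href"));
        System.out.println("id:         " + attributeNamespace(shipTo, "id"));
    }
}
```

In practice this means an unprefixed attribute is looked up with a null namespace URI, as in getAttributeNS(null, "id"), never with the URI of its element.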
On occasion namespace prefixes are used in attribute values, element content, and even in processing instructions. In these cases the nearest binding for that prefix, whether on the element itself or on its closest ancestor that declares one, establishes what URI the prefix is mapped to. Inside an element with an xmlns:prefix attribute, we say that the namespace is in scope even if it isn’t obviously used anywhere in that element. Namespaces in scope on an element include not only those that the element itself declares but also those that are declared on that element’s ancestors. An element can redeclare a namespace prefix so that it’s mapped to a different URI on the element and the element’s descendants than in the element’s parent. Slightly more commonly, an element can change the default namespace that applies within the element and its content.
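The parser does not resolve prefixes that appear inside attribute values, so your own code has to. One possible approach, sketched here under the assumption that you are working with a DOM Level 3 implementation, is the lookupNamespaceURI() method of Node, which searches the bindings in scope on that node. The sample vocabulary, its URIs, and the class name PrefixScopeDemo are all made up for this illustration. The t prefix is redeclared on the shipping element, so the same attribute value t:money resolves to two different URIs.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class PrefixScopeDemo {

    // Hypothetical document in which the t prefix is redeclared on shipping.
    static final String SAMPLE =
        "<order xmlns:t=\"http://www.example.com/types/v1\">"
      + "<price type=\"t:money\"/>"
      + "<shipping xmlns:t=\"http://www.example.com/types/v2\">"
      + "<cost type=\"t:money\"/>"
      + "</shipping>"
      + "</order>";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    // Resolves the prefix used in an attribute value such as "t:money"
    // against the bindings in scope on the element carrying the attribute.
    static String namespaceOfValue(Element e, String attributeName) {
        String value = e.getAttribute(attributeName);
        int colon = value.indexOf(':');
        if (colon < 0) {
            // How unprefixed values are treated is up to the vocabulary.
            return null;
        }
        return e.lookupNamespaceURI(value.substring(0, colon));
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        Element price = (Element) doc.getElementsByTagName("price").item(0);
        Element cost = (Element) doc.getElementsByTagName("cost").item(0);
        System.out.println("price type: " + namespaceOfValue(price, "type"));
        System.out.println("cost type:  " + namespaceOfValue(cost, "type"));
    }
}
```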
When writing software to process XML documents that use namespaces, you almost always want to make your code dependent on the URI, not the prefix. If you’re comparing two elements for equality, compare them by URI and local name, not prefix and local name. If you’re searching for an element of a certain type, look for an element with the right URI and local name, not the right prefix and local name.
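In DOM terms, this advice might translate into something like the following sketch. The class name NameMatchDemo and the sameName() and count() helpers are hypothetical; getElementsByTagNameNS() is the standard DOM method for searching by namespace URI and local name. The two sample documents bind different prefixes to the XInclude namespace URI.

```java
import java.io.StringReader;
import java.util.Objects;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NameMatchDemo {

    static final String XINCLUDE_NS = "http://www.w3.org/2001/XInclude";

    static final String DOC_A =
        "<Order xmlns:xi=\"" + XINCLUDE_NS + "\">"
      + "<xi:include href=\"order_details.xml\"/></Order>";

    static final String DOC_B =
        "<Order xmlns:xinclude=\"" + XINCLUDE_NS + "\">"
      + "<xinclude:include href=\"order_details.xml\"/></Order>";

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    // Two names are the same if their URIs and local names match;
    // the prefix is deliberately ignored.
    static boolean sameName(Node a, Node b) {
        return Objects.equals(a.getNamespaceURI(), b.getNamespaceURI())
            && Objects.equals(a.getLocalName(), b.getLocalName());
    }

    // Counts elements by namespace URI and local name, whatever their prefix.
    static int count(Document doc, String uri, String localName) {
        return doc.getElementsByTagNameNS(uri, localName).getLength();
    }

    public static void main(String[] args) {
        Document a = parse(DOC_A);
        Document b = parse(DOC_B);
        Node includeA = a.getDocumentElement().getFirstChild();
        Node includeB = b.getDocumentElement().getFirstChild();

        System.out.println("Same name? " + sameName(includeA, includeB));
        System.out.println("Includes in A: " + count(a, XINCLUDE_NS, "include"));
        System.out.println("Includes in B: " + count(b, XINCLUDE_NS, "include"));
    }
}
```

A search by qualified name, such as getElementsByTagName("xi:include"), would find the element in the first document but miss it in the second.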
[2] The well-formedness constraints specify requirements that are difficult or impossible to express in BNF form; for example, that “The Name in an element’s end-tag must match the element type in the start-tag.”
[3] A few parsers continue reading so they can report further errors after the first one. However, they only report errors, not content.
[4] Technically, whether or not white space only nodes are considered to be “white space in element content” depends on the content specification for the element given by the DTD. A white space only text node is only white space in element content when the content specification for the parent element in the DTD indicates that the parent element can only contain child elements but not mixed content. Since Example 1.2 doesn't have a DTD, this can't possibly be white space in element content.
[5] Re: Attribute normalisation and character entities, posted on the xml-dev mailing list, January 27, 2000
[6] JDOM and dom4j actually do provide special support for processing instructions written in this pseudo-attribute format. However, they both do a substantial amount of work in their own classes to support this interface, beyond what the parser provides.
Copyright 2001, 2002 Elliotte Rusty Harold | elharo@metalab.unc.edu | Last Modified July 22, 2002