Introduction
As I already stated in the preface, this is not an introductory book or an XML tutorial. I assume that you're familiar with the basic structure of an XML document as elements that contain text, that you know how to ask a parser to read an XML document in your language of choice, that you can attach a style sheet to a document as necessary and so forth.
However, I have noticed over the last few years that certain words and phrases have taken on a diverse set of meanings, and are often used inconsistently. Sometimes this just confuses people, but occasionally it has led to serious process failures. Some of this has been caused by authors and trainers (embarrassingly sometimes including the author of this book) who weren't sufficiently careful with their use of words such as element and tag. However, some of the confusion rests with the XML working groups at the W3C, which are often not consistent with each other or even within the same specification. Before we proceed with the detailed rules, it is worth taking the time to define our terms carefully, to make sure we agree on which words mean what, and to recognize those areas where there are genuine disagreements about the meaning of common technical terms.
Toward that end, I've prepared the following list of the most frequently confused XML terms:
Element vs. tag
Attribute vs. attribute value
Entity vs. entity reference
Entity reference vs. character reference
Children vs. child elements vs. content
Text vs. Character data vs. markup
Namespace vs. namespace name vs. namespace URI
XML document vs. XML file
XML application vs. XML software
Well-formed vs. valid
DTD vs. DOCTYPE
XML declaration vs. processing instruction
Character set vs. character encoding
URI vs. URI Reference vs. IRI
Schema vs. the W3C XML Schema Language
Confusing these terms often causes much misunderstanding regarding how various APIs and tools work. For instance, if you think that a character reference is an entity reference, you may find yourself wondering why the SAX startEntity method is never invoked for character references in your documents. When you ask a question about this on a mailing list, you may not phrase your question in a way that others can understand. You might even spend several hours carefully devising a test case and filing a bug report on a feature that's operating exactly as it should.
The answers to many apparently difficult questions become almost obvious when you're careful to state exactly what you mean. Thus it behooves us to define our terms carefully.
An element is not a tag and a tag is not an element. An element begins with a start-tag, includes some content, and then finishes with an end-tag. Tags delimit elements. They are part of elements, but not themselves elements, any more than a piece of bread is a sandwich. The tags are like slices of bread. The element is the entire sandwich made up of bread, mustard and mayonnaise, meat and/or cheese. The tags are just the bread. For example, <Headline> is a start-tag. </Headline> is an end-tag. <Headline>Record Crowd Hears Beth Giggle</Headline> is a complete element. Elements may contain other elements. Tags may not contain other tags.
There is one degenerate case. A single empty-element tag may represent an entire element. For instance, <Headline/> is both a headline tag and a headline element. However, this is a special case. It is not true in general. Semantically the empty-element tag is completely equivalent to the two-tag version <Headline></Headline>, and most APIs will not bother to inform you which of the two forms was actually present in the document.
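As a rough illustration of that equivalence, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The choice of Python and of this particular API is my assumption for the example; the text itself does not name one. It parses both forms and compares what the API reports.

```python
import xml.etree.ElementTree as ET

# Parse the empty-element tag and the start-tag/end-tag pair separately.
single = ET.fromstring("<Headline/>")
paired = ET.fromstring("<Headline></Headline>")

# Both yield an element named Headline with no text and no children,
# so nothing in the parsed result records which form was used.
same_name = single.tag == paired.tag
same_content = (single.text, len(single)) == (paired.text, len(paired))
print(same_name, same_content)
```

Nothing in the resulting objects distinguishes the two spellings, which is consistent with the claim that most APIs do not report the difference.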
In brief, the structure of an XML document is formed by nested elements. The individual elements are delimited by tags.
An attribute is a property of an element. It has a name and a value, and is normally a part of the element's start-tag. (It can also be defaulted in from the DTD.) For example, consider this element:
<Headline page="10">Record Crowd Hears Beth Giggle</Headline>
The headline element has a page attribute with the value 10. The attribute includes both the name and the value. The attribute value is simply the string 10. Either single or double quotes may surround the attribute value. The type of quote used is not significant. This element is exactly the same as the previous one:
<Headline page='10'>Record Crowd Hears Beth Giggle</Headline>
If an element has multiple attributes, their order is not important. These two elements are equivalent:
<Headline id="A3" page="10">Record Crowd Hears Beth Giggle</Headline>
<Headline page="10" id="A3">Record Crowd Hears Beth Giggle</Headline>
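One way to check both claims (quote style and attribute order are insignificant) is the following sketch, again assuming Python's standard xml.etree.ElementTree module rather than any API named in the text. It compares the attribute mappings of two differently written start-tags.

```python
import xml.etree.ElementTree as ET

# Same attributes, written with different quotes and in a different order.
double_quoted = ET.fromstring(
    '<Headline id="A3" page="10">Record Crowd Hears Beth Giggle</Headline>'
)
single_quoted = ET.fromstring(
    "<Headline page='10' id='A3'>Record Crowd Hears Beth Giggle</Headline>"
)

# The attribute value is just the string 10; the quotes are not part of it.
page_value = double_quoted.get("page")

# Attribute mappings compare equal regardless of quoting or order.
attributes_match = double_quoted.attrib == single_quoted.attrib
print(page_value, attributes_match)
```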
Parsers do not tell you which attribute came first. If order matters, you need to use child elements instead of attributes:
<Headline>
<id>A3</id> <page>10</page>
Record Crowd Hears Beth Giggle
</Headline>
It's not exactly a terminology confusion, but a few technologies (notably the W3C XML Schema Language) have recently dug themselves into deep holes by attempting to treat attributes and child elements as variations of the same thing. They are not. Order is only one of the differences between child elements and attributes. Other important differences include type, normalization, and the ability or inability to express substructure.
An entity is a storage unit that contains a piece of an XML document. This storage unit may be a file, a database record, an object in memory, a stream of bytes returned by a network server, or something else. It may contain an entire XML document or just a few elements or declarations.
Entity references point to these entities. There are two kinds of entity references, general entity references and parameter entity references. A general entity reference begins with an ampersand; for instance &amp; or &chapter1;. These normally appear in the instance document. For example, you might define the chapter1 entity in the DTD like this:
<!ENTITY chapter1 SYSTEM "http://www.example.com/chapter1.xml">
Then in the document you could reference it like this:
<book>
&chapter1;
...
</book>
&chapter1; is an entity reference. The actual content of the document found at http://www.example.com/chapter1.xml is an entity. They are related, but they are not the same thing.
Parameter entities and parameter entity references follow the same pattern. The difference is that parameter entities contain DTD fragments instead of instance document fragments and parameter entity references begin with a percent sign instead of an ampersand. However, it's still the case that the entity reference stands in for and points to the actual entity.
XML APIs are schizophrenic about whether they report entities, entity references, neither, or both. Some, like XOM, simply replace all entity references with their corresponding entities and don't tell you that anything has happened. Others, like JDOM, only report entities they have not resolved. Still others such as DOM and SAX can report both entities and entity references, though this often depends on user preferences and the abilities of the underlying parser; and normally the five predefined entity references &amp;, &lt;, &gt;, &quot; and &apos; are not reported.
Not everything that begins with an ampersand is an entity reference. Entity references are only used for named entities, including the five predefined entity references such as &lt; and any entities defined with ENTITY declarations in the DTD such as &chapter1; in the example above.
By contrast character references use a hexadecimal or decimal Unicode value to refer to a particular character, not a name. Each always refers to a single character, never to a group of characters. For example, &#xA0; is a hexadecimal character reference referring to the non-breaking space character. &#160; is a decimal character reference referring to that same character. However, XHTML's &nbsp; is an entity reference referring to that character.
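To make the silent-replacement behavior concrete, here is a small sketch using Python's standard xml.etree.ElementTree module, which is my choice for illustration and not one of the APIs discussed above. The publisher entity and its replacement text are hypothetical, and the sketch uses an internal entity so that it runs without network access.

```python
import xml.etree.ElementTree as ET

# A hypothetical internal general entity declared in the internal DTD subset.
document = """<!DOCTYPE book [
  <!ENTITY publisher "Example Press">
]>
<book>Published by &publisher;.</book>"""

book = ET.fromstring(document)

# The API hands back the replacement text; the reference itself is gone.
print(book.text)
```

In this respect ElementTree behaves much as the text describes XOM: the client sees only the entity's content and receives no notice that a reference was ever present.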
Almost always, even APIs that faithfully report all entity references do not report character references. Instead, the parser silently merges the referenced characters into the surrounding text. Your code should never depend on whether a character was typed literally or escaped with a character reference. Almost all of the time, it shouldn't depend on whether the character was escaped with an entity reference either.
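The following sketch, assuming Python's standard xml.etree.ElementTree module, shows why code cannot depend on how a character was written. It parses the non-breaking space written three ways and compares the results.

```python
import xml.etree.ElementTree as ET

# The same character written as a hexadecimal character reference,
# a decimal character reference, and a literal character.
hexadecimal = ET.fromstring("<p>&#xA0;</p>").text
decimal = ET.fromstring("<p>&#160;</p>").text
literal = ET.fromstring("<p>\u00a0</p>").text

# All three arrive as the single character U+00A0.
all_equal = hexadecimal == decimal == literal == "\u00a0"
print(all_equal)
```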
An element's content is everything between the element's start-tag and its end-tag. For example, consider this DocBook para element:
<para>
As far as we know, the Fibonacci series was first discovered by
Leonardo of Pisa around 1200 C.E. Leonardo was trying to answer the question,
<!-- Scritti di Leonardo Piasano. Rome: Baldassarre, 1857.
Volume I, pages 283 - 284.Fibonacci, Leonardo. -->
<quote lang="la"><foreignphrase>Quot paria coniculorum in
uno anno ex uno pario germinatur?</foreignphrase></quote>, or,
in English, <quote>How many pairs of rabbits are born in one
year from one pair?</quote> To solve Leonardo&rsquo;s problem, first
estimate that rabbits have a one month gestation period, and
can first mate at the age of one month, so that each female
rabbit has its first litter at two months. Then make the simplifying
assumption that each litter consists of exactly one male and one female.
</para>
The content of this para element consists of some text including white space, a comment, some more text, a quote child element, some more plain text, another quote child element, some more plain text, the &rsquo; entity reference, and finally some more text. All of that together, including all the content of child elements such as quote, is the para element's content.
The para element has two child elements, both named quote. However, these are not the only children of the element. This element also contains a comment, lots of character data, and an entity reference. These are considered to be children of the para element as well, though different APIs and systems differ in exactly how they represent these and how many text children there are. At one extreme, each separate character can be a separate child. At the other extreme, each text node contains the maximum contiguous run of text after all entity references are resolved so the para element has exactly four text node children.
On the flip side, the foreignphrase element and other content inside the quote elements are not children of the para element though they are descendants of it.
The most common reason for confusing children with child elements is forgetting about the very real possibility of mixed content. However, even when a document has a more record-like structure, the difference between children and child elements can be important. For example, consider this presentation element:
<presentation>
<title>DOM</title>
<date>Thursday, November 21, 2002</date>
<host>Software Development 2002 East</host>
<copyright>2000-2002 Elliotte Rusty Harold</copyright>
<last_modified>November 26, 2002</last_modified>
<author_name>Elliotte Rusty Harold</author_name>
<author_url>http://www.elharo.com/</author_url>
<author_email>elharo@metalab.unc.edu</author_email>
<abstract>Elliotte Rusty Harold's DOM tutorial</abstract>
</presentation>
It may look like this element only has child elements. However, if you're counting child nodes you have to count the white space too. There are at least ten text node children containing only white space. Furthermore, what about the title, date, host, and similar elements? Each of them has a child node containing character data but no child elements. Bottom line: elements are not the only kind of children.
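A short sketch can make the count visible. It assumes Python's standard xml.dom.minidom module as the DOM implementation and uses a trimmed, three-line version of the presentation element, so the numbers are smaller than in the full example.

```python
from xml.dom import minidom

# A shortened version of the presentation element, one child element per line.
document = minidom.parseString(
    "<presentation>\n"
    "<title>DOM</title>\n"
    "<date>Thursday, November 21, 2002</date>\n"
    "</presentation>"
)
presentation = document.documentElement
children = presentation.childNodes

# Separate the element children from the white-space-only text children.
child_elements = [n for n in children if n.nodeType == n.ELEMENT_NODE]
whitespace_text = [
    n for n in children if n.nodeType == n.TEXT_NODE and not n.data.strip()
]

# The title element has one child, and it is a text node, not an element.
title_children = child_elements[0].childNodes
print(len(children), len(child_elements), len(whitespace_text))
```

With two child elements there are three white-space text nodes (one before, one between, one after), which is the same pattern that produces at least ten such nodes in the nine-element original.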
XML documents are composed of text. You'll never find anything in an XML document that is not text. This text is divided into two non-intersecting sets, character data and markup. Markup consists of all the tags, comments, processing instructions, entity references, character references, CDATA section delimiters, XML declarations, text declarations, document type declarations, and white space outside the root element. Everything else is character data. For example, here's the DocBook para element with the markup identified by bold face and the character data in a plain font:
<para>
As far as we know, the Fibonacci series was first discovered by
Leonardo of Pisa around 1200 C.E. Leonardo was trying to answer the question,
<!-- Scritti di Leonardo Piasano. Rome: Baldassarre, 1857.
Volume I, pages 283 - 284.Fibonacci, Leonardo. -->
<quote lang="la"><foreignphrase>Quot paria coniculorum in
uno anno ex uno pario germinatur?</foreignphrase></quote>, or,
in English, <quote>How many pairs of rabbits are born in one
year from one pair?</quote> To solve Leonardo&rsquo;s problem, first
estimate that rabbits have a one month gestation period, and
can first mate at the age of one month, so that each female
rabbit has its first litter at two months. Then make the simplifying
assumption that each litter consists of exactly one male and one female.
</para>
The markup includes the <para> and </para> tags, the <quote> and </quote> tags, the <foreignphrase> and </foreignphrase> tags, the comment, and the &rsquo; entity reference. Everything else is character data.
Sometimes the everything else part is called PCDATA or parsed character data after the PCDATA keyword used in DTDs to declare elements like interfacename:
<!ELEMENT interfacename (#PCDATA)>
However, that's not perfectly accurate. Generally speaking, the parsed character data is what's left after the parser has replaced entity and character references by the characters they represent. It contains both character data and markup.
An XML namespace is a collection of names. For example, all the element names defined in XHTML (html, head, title, body, p, div, table, h1, etc.) form the XHTML namespace. The SVG namespace is the collection of element names used in SVG (svg, rect, polygon, polyline, etc.) Only the local parts of prefixed names belong to the namespace. The prefix and the prefixed names are not parts of the namespace.
Each such namespace is identified by a URI reference called the namespace name. For example the namespace name for XHTML is http://www.w3.org/1999/xhtml. The namespace name for SVG is http://www.w3.org/2000/svg. The namespace name identifies the namespace, but it is not the namespace.
The namespace name is supposed to be a URI reference, but it's not technically an error if it's not one. For instance, a namespace name may contain characters such as { or the Greek letter λ that are illegal in URIs. However, since in practice almost all actual namespace names are legal URI references, namespace names are often carelessly called namespace URIs. Actually, they are namespace URI references; but most developers don't bother to make this distinction.
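The distinction between prefix, local name, and namespace name can be seen in a small sketch. It assumes Python's standard xml.etree.ElementTree module, which reports each element name as the namespace name in braces followed by the local part. The prefixes h and xhtml are arbitrary choices for the example.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# The same XHTML p element written with two different prefixes
# and with a default namespace declaration.
prefixed = ET.fromstring(f'<h:p xmlns:h="{XHTML}">text</h:p>')
other_prefix = ET.fromstring(f'<xhtml:p xmlns:xhtml="{XHTML}">text</xhtml:p>')
defaulted = ET.fromstring(f'<p xmlns="{XHTML}">text</p>')

# Only the namespace name and the local part survive; the prefix does not.
names = {prefixed.tag, other_prefix.tag, defaulted.tag}
print(names)
```

All three elements report the same name, which fits the statement that the prefix is not part of the namespace.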
Technically, an XML document is any sequence of Unicode characters which is well-formed according to the rules laid out in the XML 1.0 specification. Such a document may or may not be stored in a file. Instead of being in a file, it can be stored in a database record, created in memory by a program, read from a network stream, printed in a book, painted on a billboard, or scratched into a subway car window. There is not necessarily a file anywhere in the picture. If the XML document is stored in a file, then it may be in a single file or split across multiple files using external entity references. It's even possible for multiple XML documents to be stored in a single file, though this is unusual in practice.
When discussing XML documents it is sometimes useful to distinguish the documents themselves from the DTDs or other forms of schemas. In these cases, the actual document that adheres to the schema is called an instance document. Here the document is an instance of a particular schema.
An XML application is a class of XML documents defined by a schema, specification, or some group of rules. For example, Scalable Vector Graphics (SVG), XHTML, MathML, GedML, XSL Formatting Objects and DocBook are all XML applications. The simple language I invented last Thursday to categorize my comic book collection is also an XML application even though it doesn't have a DTD, schema, or even a specification. An XML application is not a piece of application software that somehow processes XML such as the XML Spy editor, the Mozilla web browser, or the XEP XSL-FO to PDF converter.
There are two levels of "goodness" for an XML document. Well-formedness describes mandatory syntactic constraints. Validity describes optional structural and semantic constraints. There's a tendency to use the word valid in its common English usage to describe any correct document. However, in XML it has a much more specific meaning. Documents can be correct and processable, but not be valid.
Well-formedness is the minimum requirement necessary for an XML document. It includes various syntactic constraints such as every start-tag must have a matching end-tag and the document must have exactly one root element. If a document is not well-formed, it is not an XML document. Parsers that encounter a malformed document are required to report the error and stop parsing. They may not attempt to guess what the document author intended. They may not fix the error and continue. They have to drop the document on the floor.
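Here is a minimal sketch of that draconian behavior, assuming Python's standard xml.etree.ElementTree module. The helper name is_well_formed is hypothetical. The parser raises an error on malformed input and returns no partial document.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text, False if it reports an error."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # The parser reports the error and produces no document at all.
        return False
    return True


matched = is_well_formed("<a><b></b></a>")
missing_end_tag = is_well_formed("<a><b></a>")
two_roots = is_well_formed("<a/><b/>")
print(matched, missing_end_tag, two_roots)
```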
Validity is a stronger constraint than well-formedness, but it's not required in order to process XML documents. Validity describes which elements and attributes are allowed to appear where. It indicates whether a document adheres to the constraints listed in the document type definition (DTD) and the document type declaration (DOCTYPE). Even if a document does not adhere to these constraints, it may still be usefully processed in some cases. The decision of whether and how to reject invalid documents is made by the client application, not by the parser.
The word valid is also sometimes used to refer to validity with respect to a schema rather than a DTD. In cases where this seems likely to be confusing, particularly where one is likely to want to validate a document against a DTD and against some other schema, the word schema-valid is used. As with DTD validity, whether and how to handle a schema-invalid document is a decision for the client application. A schema-validating parser will inform the client application that a document is invalid. However, it will continue to parse it. The client application gets to decide whether to accept the document or not.
A document type definition is a collection of ELEMENT, ATTLIST, ENTITY and NOTATION declarations that describes a class of valid documents. A document type declaration is placed in the prolog of an XML document. It either contains or points to the document's document type definition (or both). The document type definition and the document type declaration are closely related but they are not the same thing. The acronym DTD refers exclusively to the document type definition, never to the document type declaration. The shorthand form DOCTYPE refers exclusively to the document type declaration, never to the document type definition.
For example, this is a document type declaration:
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
"docbook/docbookx.dtd" >
It points to the DTD with the public identifier -//OASIS//DTD DocBook XML V4.1.2//EN found at the relative URL docbook/docbookx.dtd.
This is also a document type declaration.
<!DOCTYPE book SYSTEM "http://www.example.com/docbook/docbookx.dtd">
It points to the DTD at the absolute URL http://www.example.com/docbook/docbookx.dtd.
This is a document type declaration that completely contains the DTD between the square brackets that delimit the internal DTD subset:
<!DOCTYPE book [
<!ELEMENT book (title, chapter+)>
<!ELEMENT chapter (title, paragraph+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT paragraph (#PCDATA)>
]>
Finally, this next document type declaration both points to an external DTD and contains an internal DTD subset. The full DTD is formed by combining the declarations in the external DTD subset with those in the internal DTD subset.
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
"docbook/docbookx.dtd" [
<!-- add XIncludes -->
<!ENTITY % local.para.char.mix " | xinclude:include">
<!ELEMENT xinclude:include EMPTY>
<!ATTLIST xinclude:include
xmlns:xinclude CDATA #FIXED "http://www.w3.org/2001/XInclude"
href CDATA #REQUIRED
parse (text | xml) "xml"
>
]>
Whether the DTD is internal, external, or both, it is never the same thing as the document type declaration. The document type declaration specifies the root element. The DTD does not. The DTD specifies the content models and attribute lists of the elements. The document type declaration does not. Most APIs routinely expose the contents of the document type declaration but not those of the document type definition.
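As one example of an API exposing the document type declaration, here is a sketch that assumes Python's standard xml.dom.minidom module. It reads the root element name and system identifier from the declaration. In this sketch the external DTD at that URL is not fetched, so nothing here touches the document type definition itself.

```python
from xml.dom import minidom

# A document type declaration that points to an external DTD,
# followed by a minimal root element.
document = minidom.parseString(
    '<!DOCTYPE book SYSTEM "http://www.example.com/docbook/docbookx.dtd">'
    "<book/>"
)
doctype = document.doctype

# The declaration names the root element and points to the DTD.
root_name = doctype.name
system_id = doctype.systemId
print(root_name, system_id)
```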
One of the more needlessly confusing aspects of the XML specification is that for various technical reasons the following construct, which appears at the top of most XML documents, is in fact not a processing instruction:
<?xml version="1.0"?>
It looks like a processing instruction, but it isn't one. Processing instruction targets are specifically forbidden from being xml, XML, Xml, or any other case combination of the word XML.
APIs may or may not expose the information in the XML declaration to the client application; but if one does, it will not use the same mechanism it uses to report processing instructions. For instance, in SAX 2.1 some of this information is optionally available through the Locator2 interface. However, the parser does not call the processingInstruction method in ContentHandler when it sees the XML declaration.
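The same behavior can be observed with Python's standard xml.sax module, which I am using here as a stand-in for the Java SAX API the text describes. The TargetCollector class and the style.css reference are hypothetical. The handler records every processing instruction target it is told about.

```python
import xml.sax


class TargetCollector(xml.sax.ContentHandler):
    """Records the target of each processing instruction the parser reports."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def processingInstruction(self, target, data):
        self.targets.append(target)


# An XML declaration followed by a genuine processing instruction.
document = (
    b'<?xml version="1.0"?>'
    b'<?xml-stylesheet type="text/css" href="style.css"?>'
    b"<Headline/>"
)
collector = TargetCollector()
xml.sax.parseString(document, collector)
print(collector.targets)
```

Only the xml-stylesheet target is collected; the XML declaration never reaches processingInstruction.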
XML is based on the Unicode character set. A character set is a collection of characters assigned to particular numbers called code points. Currently Unicode 3.2 defines more than 90,000 individual characters. Each character in the set is mapped to a number such as 64, 812, or 87000. These numbers are not ints, shorts, bytes, longs or any other numeric data type. They are simply numbers. Other character sets such as SJIS and Latin-1 contain different collections of characters which are assigned to different numbers, though there's often substantial overlap with the Unicode character set. That is, many character sets assign some or all of their characters to the same numbers Unicode assigns those characters to.
A character encoding represents the members of a character set as bytes in a particular way. There are multiple encodings of Unicode including UTF-8, UTF-16, UCS-2, UCS-4, UTF-32, and several other more obscure ones. Different encodings may encode the same code point using a different sequence of bytes and/or a different number of bytes. They may use big-endian or little-endian data. They can even use non-twos complement representations. They may use two bytes or four bytes for each character. They may even use different numbers of bytes for different characters.
Changing the character set changes which characters can be represented. For instance, the ISO-8859-7 set includes Greek letters. The ISO-8859-1 set does not. Changing the character encoding does not change which characters can be used. It merely changes how each character is encoded in bytes.
XML parsers always convert characters in other sets to Unicode before reporting them to the client application. In effect, they treat other character sets as different encodings of some subset of Unicode. Thus, XML doesn't ever really let you change the character set. This is always Unicode. It only lets you adjust how those characters are represented.
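The difference between a code point and its encoded bytes, and the parser's conversion to Unicode, can be sketched as follows. The sketch assumes Python and its standard xml.etree.ElementTree module. The parse_text helper and the sample word are hypothetical.

```python
import xml.etree.ElementTree as ET

character = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# One code point, independent of any encoding.
code_point = ord(character)

# The same code point represented as different byte sequences.
utf8_bytes = character.encode("utf-8")
latin1_bytes = character.encode("iso-8859-1")
utf16_bytes = character.encode("utf-16-be")


def parse_text(encoding):
    """Encode a small document in the given encoding and return its parsed text."""
    document = (
        f'<?xml version="1.0" encoding="{encoding}"?>'
        f"<word>caf{character}</word>"
    )
    return ET.fromstring(document.encode(encoding)).text


# Whatever the bytes on disk, the client sees the same Unicode string.
same_text = parse_text("UTF-8") == parse_text("ISO-8859-1")
print(code_point, len(utf8_bytes), len(latin1_bytes), len(utf16_bytes), same_text)
```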
A URI identifies a resource. A URI reference identifies a part of a resource. A URI reference may contain a fragment identifier separated from the URI by an octothorpe (#). A plain URI may not. For example, http://www.w3.org/TR/REC-xml-names/ is a URI. http://www.w3.org/TR/REC-xml-names/#Philosophy is a URI reference.
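Splitting a URI reference at the octothorpe is easy to demonstrate. This sketch assumes Python's standard urllib.parse module and uses its urldefrag function to separate the URI from the fragment identifier.

```python
from urllib.parse import urldefrag

reference = "http://www.w3.org/TR/REC-xml-names/#Philosophy"

# urldefrag splits a URI reference into the URI and the fragment identifier.
uri, fragment = urldefrag(reference)
print(uri)
print(fragment)
```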
Most XML related specifications such as Namespaces in XML are actually defined in terms of URI references rather than URIs. For example, the W3C XML Schema language simple type xsd:anyURI actually indicates that elements with that type are URI references. In casual conversation and writing, most people don't bother to make the distinction. Nonetheless, it can be important. For example, the system identifier in the document type declaration can be a URI but not a URI reference.
Note
I've heard it claimed that relative URIs are URI references, not true URIs, and the authors of the XML specification seem to have believed this. However, the URI specification, RFC 2396, does not support this belief. It clearly describes both relative URIs and relative URI references. Perhaps the authors intended to require all URIs to be absolute; but if this is the case, they failed to do so. The only difference between a URI and a URI reference is that the latter allows a fragment identifier while the former does not.
End Note
Currently, the IETF is working on Internationalized Resource Identifiers (IRIs). These are similar to URIs except that they allow non-ASCII characters such as ζ and é that must be percent escaped in URIs. The specification is not finished yet, but several XML specifications are already referring to this. For instance, the XLink href attribute actually contains an IRI, not a URI.
The word schema is a generic term for a document that specifies the layout and permissible content of a class of documents. It actually entered computer science in the context of database schemas. For XML, there are multiple different schema languages with their own strengths and weaknesses including DTDs, RELAX NG, Schematron, and, of course, the W3C XML Schema Language.
There is a tendency among developers to use only the word schemas, or perhaps the only slightly less generic XML Schemas, when referring to the W3C XML Schema Language. This needs to be resisted because the W3C XML Schema Language is neither the only such language, nor, in most people's opinions, the simplest, the most powerful, or the best designed. It is merely one language promulgated by one group of inventors. It has some good points and some bad points, but we should not implicitly ignore all the other languages (some of which are demonstrably simpler and/or more powerful than the W3C XML Schema Language) by using the generic term to refer to the specific.
Unfortunately, the W3C has not chosen to assign its schema language an appellation less cumbersome than "W3C XML Schema Language." Consequently, to avoid repeating this phrase incessantly, I will occasionally succumb to temptation and use the word schemas to mean the W3C XML Schema Language. However, I will only do this in those chapters that discuss this language exclusively, and I will make that very clear at the outset of the chapter. Think of the word schema as more of a pronoun for the schema language currently being discussed than as a proper noun for the W3C's entry into the field.
Words have meanings. XML is a very precisely defined language, so its words have very precise meanings. It pays to use those words correctly. There are indeed some confusing aspects to XML. It doesn't make sense to make the problem worse by adding to the confusion. Using the right words for the right concepts can simplify many unnecessarily complex problems to save the brain power for the things that are genuinely hard.