Effective XML
XML Developers Network of the Capital District
Elliotte Rusty Harold
elharo@metalab.unc.edu
http://www.cafeconleche.org/

Part I: Syntax
Item 1: Include an XML declaration
<?xml version="1.0" encoding="UTF-8"?>
Optional, but treat as required
Specifies version, character encoding, and standalone status
Very important for detecting encoding
Identifies XML when file and media type information is unavailable or unreliable
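A minimal sketch of why the declaration matters for encoding detection, assuming Python's standard-library ElementTree parser (built on expat). The sample document and the `read_name` helper are illustrative, not from the original slides.

```python
import xml.etree.ElementTree as ET

# The byte 0xE9 is "é" in ISO-8859-1 but is not valid UTF-8 on its own.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>Caf\xe9</name>'
undeclared = b'<name>Caf\xe9</name>'


def read_name(data):
    """Return the root element's text, or None if the parser rejects the bytes."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


with_declaration = read_name(declared)       # parser is told the encoding
without_declaration = read_name(undeclared)  # parser has to assume UTF-8
```

With the declaration the parser decodes the text correctly; without it the same bytes are rejected as malformed UTF-8.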

Item 3: Stay with XML 1.0
XML 1.1:
New name characters
C0 control characters
C1 control characters
NEL
Undeclare namespace prefixes
Incompatible with
Most XML parsers
W3C and RELAX NG schema languages
XOM, JDOM

Part II: Structure
The XML Stack
Item 14: Allow All XML syntax
CDATA sections
Entity references
Processing instructions
Comments
Numeric character references
Document type declarations
Different ways of representing the same core content; not different information
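As a rough illustration, the sketch below parses three spellings of the same content with Python's standard-library ElementTree. The element name `p` and the sample text are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

variants = [
    "<p>a &lt; b &amp; c</p>",       # predefined entity references
    "<p>a &#60; b &#x26; c</p>",     # numeric character references
    "<p><![CDATA[a < b & c]]></p>",  # CDATA section
]

# After parsing, the syntax used to write the text is no longer visible.
texts = [ET.fromstring(v).text for v in variants]
all_same = len(set(texts)) == 1
```

All three documents report the text `a < b & c`, so code that depends on which form was used is depending on something the parser does not promise to expose.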

Item 9: Distinguish text from markup
A DocBook element
<programlisting><![CDATA[<value>
  <double>28657</double>
</value>]]></programlisting>
The content is:
<value>
  <double>28657</double>
</value>
This is the same:
<programlisting>&lt;value&gt;
  &lt;double&gt;28657&lt;/double&gt;
&lt;/value&gt;</programlisting>

The reverse problem
Tools that create XML from strings:
Tree-based editors like <Oxygen/> or XML Spy
WYSIWYG applications like OpenOffice Writer
Programming APIs such as DOM, JDOM, and XOM
The tool automatically escapes reserved characters like <, >, or &.
Just because something looks like an XML tag does not mean it is an XML tag.
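A small sketch of the reverse problem, assuming Python's standard-library ElementTree as the API. The `para` element and its text are made up for the example.

```python
import xml.etree.ElementTree as ET

para = ET.Element("para")
# This string looks like markup, but the API treats it as plain text.
para.text = "Use the <b> tag for bold & <i> for italic."

serialized = ET.tostring(para, encoding="unicode")
round_tripped = ET.fromstring(serialized)

child_count = len(round_tripped)  # number of child elements after reparsing
```

The serialized form contains `&lt;b&gt;` and `&amp;`, and reparsing it yields one text node with no child elements. To create a real `b` element, the program has to build it as an element rather than paste it in as a string.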

Item 10: White space matters
Parsers report all white space in element content, including boundary white space
An xml:space attribute is for the client application only, not the parser
White space in attribute values is normalized
Parsers do not report white space in the prolog, the epilog, the document type declaration, or tags.
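The following sketch shows these rules in action with Python's standard-library ElementTree. The `list` and `item` names and the `kind` attribute are hypothetical.

```python
import xml.etree.ElementTree as ET

# The attribute value contains a literal line break.
doc = '<list kind="a\nb">\n  <item>one</item>\n</list>'
root = ET.fromstring(doc)
item = root[0]

leading = root.text      # boundary white space between <list> and <item>
trailing = item.tail     # boundary white space between </item> and </list>
kind = root.get("kind")  # the line break is normalized to a space
```

The boundary white space is reported exactly as written, while the line break inside the attribute value arrives as a single space.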

Item 11: Make structure explicit through markup
Bad
<Transaction>Withdrawal 2003 12 15 200.00</Transaction>
Better
<Transaction type="withdrawal">
  <Date>2003-12-15</Date>
  <Amount>200.00</Amount>
</Transaction>
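One payoff of the better form is that a program can read each field directly instead of splitting a string and guessing positions. A minimal sketch in Python, assuming the standard-library ElementTree, `datetime`, and `decimal` modules:

```python
import xml.etree.ElementTree as ET
from datetime import date
from decimal import Decimal

doc = """<Transaction type="withdrawal">
  <Date>2003-12-15</Date>
  <Amount>200.00</Amount>
</Transaction>"""

tx = ET.fromstring(doc)
kind = tx.get("type")                           # metadata from the attribute
when = date.fromisoformat(tx.findtext("Date"))  # ISO 8601 date string
amount = Decimal(tx.findtext("Amount"))         # exact decimal, not a float
```

Each value is addressed by name, so reordering the child elements or adding new ones does not break the reader.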

Item 12: Store metadata in attributes
Material the reader doesn't want to see
URLs
IDs
Styles
Revision dates
Author's name
No substructure
Revision tracking
Citations
No multiple elements

Item 13: Remember mixed content
Narrative documents
Record-like documents
The RSS problem
<item>
  <title>Xerlin 1.3 released</title>
  <description>
    Xerlin 1.3, an open source XML Editor written in
    Java, has been released. Users can extend the
    application via custom editor interfaces for
    specific DTDs. New features in version 1.3 include
    XML Schema support, WebDAV capabilities, and
    various user interface enhancements. Java 1.2
    or later is required.
  </description>
<link>http://www.cafeconleche.org/#news2003April7</link>
</item>

What you really want is this:
What people do is this:
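The contrast can be sketched in Python with the standard-library ElementTree. The `em` element and the shortened description text are assumptions made for this illustration; they are not the examples from the original slides.

```python
import xml.etree.ElementTree as ET

# Mixed content: the emphasis is real markup, a child element of <description>.
mixed = ET.fromstring(
    "<description>Xerlin 1.3, an <em>open source</em> XML editor.</description>"
)

# Escaped markup: the same tags are only text and are invisible to XML tools.
escaped = ET.fromstring(
    "<description>Xerlin 1.3, an &lt;em&gt;open source&lt;/em&gt; "
    "XML editor.</description>"
)

mixed_children = [child.tag for child in mixed]
escaped_children = [child.tag for child in escaped]
mixed_plain_text = "".join(mixed.itertext())
```

In the mixed-content version the parser reports an `em` child element. In the escaped version it reports one string that a consumer would have to parse a second time.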
Item 16: Prefer URLs to unparsed entities and notations
URLs are simple and well understood
Notations and unparsed entities are confusing and little used
URLs don't require the DTD to be read
Many APIs don't even support notations and unparsed entities

Part III: Semantics
Item 17: Use processing instructions for process-specific content
For a very particular, even local, process
Describes how a particular process acts on the data in the document
Does not describe or add to the content itself
A unit that can be treated in isolation
Content is not XML-like.
Applies to the entire document

Processing instructions are not appropriate when:
Content is closely related to the content of the document itself.
Structure extends beyond a single processing instruction
Needs to be validated.
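As an illustration, the sketch below reads a stylesheet processing instruction with Python's standard-library `xml.dom.minidom`. The file name `style.css` is hypothetical.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml-stylesheet type="text/css" href="style.css"?><doc>Hello</doc>'
)

# Processing instructions before the root element are children of the document node.
instructions = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]

# The document's own content is unaffected by the instruction.
text = doc.documentElement.firstChild.data
```

The instruction is delivered as a target plus an opaque string. The parser does not interpret the pseudo-attributes inside it, which is why it suits process-specific hints rather than structured or validated content.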

Item 18: Include all information in instance documents
Not all parsers read the DTD
Especially browsers
Beware
Default attribute values
Parsed entity references
XInclude
ID type dependence (XPath, DOM, etc.)

Item 19: Encode binary data using quoted printable and/or Base64
Quoted printable works well for mostly text
Base64 works well for non-text data
Can you link to the data with a URL instead?
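A minimal sketch of both encodings using Python's standard-library `base64` and `quopri` modules. The `image` element, its `encoding` attribute, and the sample bytes are assumptions for the example.

```python
import base64
import quopri
import xml.etree.ElementTree as ET

binary = bytes([0x89, 0x50, 0x4E, 0x47, 0x00, 0xFF])  # arbitrary non-text bytes

# Base64 turns arbitrary bytes into characters that are legal in XML text.
image = ET.Element("image", encoding="base64")
image.text = base64.b64encode(binary).decode("ascii")
serialized = ET.tostring(image, encoding="unicode")
decoded = base64.b64decode(ET.fromstring(serialized).text)

# Quoted printable leaves ASCII readable and escapes only the other bytes.
mostly_text = "Caf\u00e9 con leche".encode("iso-8859-1")
qp = quopri.encodestring(mostly_text).decode("ascii")
qp_decoded = quopri.decodestring(qp.encode("ascii"))
```

The Base64 text survives a serialize and parse round trip unchanged, and the quoted-printable string stays mostly human-readable.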

Items 20-22: Use namespaces for modularity and extensibility
Not hard; simple cases can use one default namespace
http URIs are normally preferred
DTD validation is tricky
Code to namespace URIs, not prefixes
Avoid namespace prefixes in element content and attribute values
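A short sketch of coding to the namespace URI rather than the prefix, assuming Python's standard-library ElementTree. The SVG namespace URI is real; the prefixes `s` and `graphic` are arbitrary choices for the example.

```python
import xml.etree.ElementTree as ET

SVG = "http://www.w3.org/2000/svg"

# The same vocabulary written with a default namespace and two different prefixes.
docs = [
    f'<svg xmlns="{SVG}"><rect/></svg>',
    f'<s:svg xmlns:s="{SVG}"><s:rect/></s:svg>',
    f'<graphic:svg xmlns:graphic="{SVG}"><graphic:rect/></graphic:svg>',
]

# ElementTree names elements as {namespace-uri}local-name, independent of prefix.
found = [len(ET.fromstring(d).findall(f"{{{SVG}}}rect")) for d in docs]
```

The lookup finds the `rect` element in every document because it never mentions a prefix.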

Item 23: Reuse XHTML for generic narrative content
Item 24: Choose the right schema language for the job
DTDs
The W3C XML Schema Language
RELAX NG
Schematron

Item 25: Pretend there's no such thing as the PSVI
Post Schema Validation Infoset
Adds types like int and gYear to elements
Often not specific enough
Element/attribute names are types

Item 28: Use only what you need
You need
Well-formed XML 1.0
A parser
You probably need:
Namespaces
You may not need:
DTDs
Schemas
XInclude
WS-Kitchen-Sink
etc.

Item 29: Always use a parser
Can't use regular expressions:
Detecting encoding
Comments and processing instructions that contain tags
CDATA sections
Unexpected placement of spaces and line breaks within tags
Default attribute values
Character and entity references
Malformed documents
Internal DTD Subset
Why not?
Unfamiliarity with parsers
Too slow

Item 30: Layer Functionality
Items 31-33: Program to standard APIs
Easier to deploy in Java 1.4/1.5
Different implementations have different performance characteristics
SAX is fast
DOM interoperates
Semi-standard:
JDOM
XOM
Bleeding edge
StAX
JAXB
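The slides discuss the Java APIs. As a rough analogue, Python's standard library ships bindings for the same two standard models, SAX and DOM, and the sketch below counts elements with each. The `ElementCounter` class and the sample document are hypothetical.

```python
import xml.sax
from collections import Counter
from xml.dom import minidom

DOC = b"<list><item/><item/><note/></list>"


class ElementCounter(xml.sax.ContentHandler):
    """SAX: receive events as the document streams past; no tree is built."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def startElement(self, name, attrs):
        self.counts[name] += 1


handler = ElementCounter()
xml.sax.parseString(DOC, handler)
sax_items = handler.counts["item"]

# DOM: load the whole document into a tree, then query it.
dom_items = len(minidom.parseString(DOC).getElementsByTagName("item"))
```

Both approaches give the same count. The SAX version never holds the document in memory, which is where its speed advantage tends to come from.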

Item 34: Read the complete DTD
Be conservative in what you generate; liberal in what you accept
Important content from DTD:
Default attribute values
Namespace declarations
Entity references
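A minimal sketch of content that exists only in the DTD, using an internal DTD subset and Python's standard-library ElementTree. The `memo` element, the `company` entity, and the `status` attribute are made up for the example.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE memo [
  <!ENTITY company "Example Corp">
  <!ATTLIST memo status CDATA "draft">
]>
<memo>From &company;</memo>"""

memo = ET.fromstring(doc)
text = memo.text             # entity reference replaced from the DTD
status = memo.get("status")  # attribute never written in the instance
```

Neither the company name nor the `status` value appears in the `memo` element as written. A reader that skipped the DTD would miss both.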

Item 35: Navigate with XPath
More robust against unexpected structure
Allow optimization by engine
Easier to code; enhanced programmer productivity
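A short sketch of path-based navigation, assuming Python's standard-library ElementTree, which supports only a limited subset of XPath 1.0. The catalog structure and the `lang` attribute are illustrative sample data.

```python
import xml.etree.ElementTree as ET

doc = """<catalog>
  <section>
    <book lang="en"><title>Effective XML</title></book>
  </section>
  <book lang="fr"><title>Le XML</title></book>
  <book lang="en"><title>Processing XML with Java</title></book>
</catalog>"""

root = ET.fromstring(doc)

# One expression finds matching books at any depth; no hand-written tree walk.
english_titles = [t.text for t in root.findall(".//book[@lang='en']/title")]
```

The expression keeps working when a `book` is wrapped in an unexpected `section`, which is the kind of robustness that hand-coded child-by-child navigation tends to lack.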

Item 36: Serialize XML with XML
Item 37: Validate inside your program with schemas
Part IV: Implementation
Item 38: Write documents in Unicode
Prefer UTF-8
Smaller in English
ASCII compatible
Normalization
É, ü, ì and so forth
NFC
ICU

Item 40: Avoid vendor lock-in; beware:
Opaque, binary data used in place of marked up text.
Over-abbreviated, non-obvious names like F17354 and grgyt
APIs that hide the XML
Products that focus on the "Infoset"
Alternate serializations of XML
Patented formats

Item 41: Hang on to your relational database
Item 42: Document Namespaces with RDDL
Item 43: Preprocess XSLT on the server side
Item 44: Serve XML+CSS to the client
Supported by
Safari
IE 5.0 and later
Mozilla
Netscape 6 and later
Konqueror
Opera
Firefox
OmniWeb

Item 45: Pick the correct MIME type
application/xml
Not text/xml!
Don't use charset
application/mathml+xml
image/svg+xml
application/xslt+xml

Item 46: TagSoup Your HTML
Item 47: Catalog common resources
<?xml version="1.0"?>
<catalog xmlns=
  "urn:oasis:names:tc:entity:xmlns:xml:catalog"
>
  <public publicId=
     "-//OASIS//DTD DocBook XML V4.2//EN"
          uri=
   "file:///opt/xml/docbook/docbookx.dtd"/>
</catalog>

Item 50: Compress if space is a problem
To Learn More
This Presentation: http://cafeconleche.org/slides/albany/effectivexml
Effective XML: 50 Specific Ways to Improve Your XML Documents
Elliotte Rusty Harold
Addison-Wesley, 2003
ISBN 0-321-15040-6
$44.99
http://cafeconleche.org/books/effectivexml