Effective XML


	Elliotte Rusty Harold
	elharo@metalab.unc.edu
	http://www.cafeconleche.org/

Part I: Syntax

Stay with XML 1.0


	XML 1.1:
		New name characters
		C0 control characters
		C1 control characters
		NEL
		Undeclare namespace prefixes
	Incompatible with
		Most XML parsers
		W3C and RELAX NG schema languages
		XOM, JDOM

Part II: Structure

The XML Stack

Allow All XML syntax


	CDATA sections
	Entity references
	Processing instructions
	Comments
	Numeric character references
	Document type declarations
	Different ways of representing the same core content; not different information

Distinguish text from markup


	A DocBook element
	<programlisting><![CDATA[<value> <double>28657</double> </value>]]></programlisting>
	The content is: <value> <double>28657</double> </value>
	This is the same: <programlisting>&lt;value&gt; &lt;double&gt;28657&lt;/double&gt; &lt;/value&gt;</programlisting>
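
A minimal sketch of the point above, assuming Python's standard xml.etree.ElementTree as the parser (the slides do not name one). The CDATA form and the escaped form should yield the same text, and neither should produce child elements.

```python
import xml.etree.ElementTree as ET

# Two spellings of the same DocBook element: CDATA section vs. escaped text.
cdata_form = (
    "<programlisting><![CDATA[<value> <double>28657</double> </value>]]>"
    "</programlisting>"
)
escaped_form = (
    "<programlisting>&lt;value&gt; &lt;double&gt;28657&lt;/double&gt; "
    "&lt;/value&gt;</programlisting>"
)

cdata_element = ET.fromstring(cdata_form)
escaped_element = ET.fromstring(escaped_form)

cdata_text = cdata_element.text
escaped_text = escaped_element.text

# The parser reports identical text; the angle brackets are data, not markup.
same_content = cdata_text == escaped_text
child_count = len(cdata_element)  # number of child elements
```

Here same_content is True and child_count is 0: the listing is text that happens to look like tags.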

The reverse problem


	Tools that create XML from strings:
		Tree-based editors like <Oxygen/> or XML Spy
		WYSIWYG applications like OpenOffice Writer
		Programming APIs such as DOM, JDOM, and XOM
	The tool automatically escapes reserved characters like <, >, or &.
	Just because something looks like an XML tag does not mean it is an XML tag.
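
As a rough illustration, here is the same behavior in Python's xml.etree.ElementTree, standing in for DOM, JDOM, and XOM (an assumption; the slides list Java APIs). The element name para is hypothetical.

```python
import xml.etree.ElementTree as ET

# Build an element from a plain string that happens to contain
# reserved characters.
para = ET.Element("para")
para.text = "Use <b> & </b> for bold"

# The serializer escapes the reserved characters automatically.
serialized = ET.tostring(para, encoding="unicode")

# Parsing the result gives back the original string and no child elements.
reparsed = ET.fromstring(serialized)
round_trip_text = reparsed.text
child_count = len(reparsed)
```

The string never becomes a b element; to create real markup you have to create real element nodes.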

White space matters


	Parsers report all white space in element content, including boundary white space
	An xml:space attribute is for the client application only, not the parser
	White space in attribute values is normalized
	Parsers do not report white space in the prolog, the epilog, the document type declaration, or tags.
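
A small sketch of these rules, assuming Python's xml.etree.ElementTree (backed by expat). The list, item, and note names are made up for the example.

```python
import xml.etree.ElementTree as ET

# The attribute value contains a literal newline and tab;
# the element content contains boundary white space.
doc = '<list note="a\n\tb">\n  <item> one </item>\n</list>'
root = ET.fromstring(doc)

boundary_before = root.text    # white space between <list> and <item>
item_text = root[0].text       # leading and trailing spaces are kept
boundary_after = root[0].tail  # white space between </item> and </list>

# Attribute value normalization: each newline or tab becomes one space.
note = root.get("note")
```

The parser hands over every white space character in content; deciding whether it matters is the application's job.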

Make structure explicit through markup


	Bad
		<Transaction>Withdrawal 2003 12 15 200.00</Transaction>
	Better
		<Transaction type="withdrawal">
		<Date>2003-12-15</Date>
		<Amount>200.00</Amount>
		</Transaction>
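
One way to see the payoff, sketched with Python's standard library (an assumption; any XML API would do). With the "Better" form, each field is addressed by name rather than guessed from its position in a string.

```python
import xml.etree.ElementTree as ET
from datetime import date
from decimal import Decimal

better = """<Transaction type="withdrawal">
<Date>2003-12-15</Date>
<Amount>200.00</Amount>
</Transaction>"""

transaction = ET.fromstring(better)

# Each piece of data is found by name and converted to a real type.
kind = transaction.get("type")
when = date.fromisoformat(transaction.findtext("Date"))
amount = Decimal(transaction.findtext("Amount"))
```

The "Bad" form would need string splitting and a private convention about field order to recover the same three values.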

Store metadata in attributes


	Material the reader doesn’t want to see
		URLs
		IDs
		Styles
		Revision dates
		Author's name
	No substructure
		Revision tracking
		Citations
	No multiple elements

Remember mixed content


	Narrative documents
	Record-like documents
	The RSS problem
	<item>
	<title>Xerlin 1.3 released</title>
	<description>
	Xerlin 1.3, an open source XML Editor written in
	Java, has been released. Users can extend the
	application via custom editor interfaces for
	specific DTDs. New features in version 1.3 include
	XML Schema support, WebDAV capabilities, and
	various user interface enhancements. Java 1.2
	or later is required.
	</description>
	<link>http://www.cafeconleche.org/#news2003April7</link>
	</item>

What you really want is this:

What people do is this:

Prefer URLs to unparsed entities and notations


	URLs are simple and well understood
	Notations and unparsed entities are confusing and little used
	URLs don’t require the DTD to be read
	Many APIs don’t even support notations and unparsed entities

Part III: Semantics

Use processing instructions for process-specific content


	For a very particular, even local, process
	Describes how a particular process acts on the data in the document
	Does not describe or add to the content itself
	A unit that can be treated in isolation
	Content is not XML-like.
	Applies to the entire document
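
For illustration, a sketch that reads a processing instruction with Python's xml.dom.minidom (an assumption; the slides do not name an API). The xml-stylesheet target is the standard one for associating style sheets; the file name style.css is hypothetical.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml-stylesheet type="text/css" href="style.css"?><doc>Hello</doc>'
)

# Processing instructions in the prolog are children of the document node.
instructions = [
    node for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]

target = instructions[0].target  # which process the instruction is for
data = instructions[0].data      # opaque string; its format is up to that process

# The document content itself is unaffected by the instruction.
root_text = doc.documentElement.firstChild.data
```

A process that does not recognize the target can simply skip the instruction.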

Processing instructions are not appropriate when:


	Content is closely related to the content of the document itself
	Structure extends beyond a single processing instruction
	Needs to be validated

Include all information in instance documents


	Not all parsers read the DTD
	Especially browsers
	Beware
		Default attribute values
		Parsed entity references
		XInclude
		ID type dependence (XPath, DOM, etc.)

Encode binary data using quoted printable and/or Base64


	Quoted printable works well for mostly text
	Base64 works well for non-text data
	Can you link to the data with a URL instead?
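
A minimal sketch of both encodings using Python's base64 and quopri modules. The element names data and note and the encoding attribute are hypothetical, not part of any standard vocabulary.

```python
import base64
import quopri
import xml.etree.ElementTree as ET

# Base64 for arbitrary non-text bytes.
raw = bytes([0x89, 0x50, 0x4E, 0x47, 0x00, 0xFF])
data_element = ET.Element("data", encoding="base64")
data_element.text = base64.b64encode(raw).decode("ascii")
xml_text = ET.tostring(data_element, encoding="unicode")

# The receiver reverses the encoding after parsing.
decoded = base64.b64decode(ET.fromstring(xml_text).text)

# Quoted printable for bytes that are mostly readable text.
mostly_text = b"caf\xe9 con leche"  # one non-ASCII byte among ASCII text
note_element = ET.Element("note", encoding="quoted-printable")
note_element.text = quopri.encodestring(mostly_text).decode("ascii")
restored = quopri.decodestring(note_element.text.encode("ascii"))
```

Both encodings produce plain ASCII, so the result is safe inside any XML document regardless of its declared encoding.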

Use namespaces for modularity and extensibility


	Not hard; simple cases can use one default namespace
	http URIs are normally preferred
	DTD validation is tricky
	Code to namespace URIs, not prefixes
	Avoid namespace prefixes in element content and attribute values
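
"Code to namespace URIs, not prefixes" can be sketched as follows, assuming Python's xml.etree.ElementTree and using the SVG namespace as the example. The helper count_rects is a hypothetical name.

```python
import xml.etree.ElementTree as ET

SVG = "http://www.w3.org/2000/svg"

# Same vocabulary, two different prefix choices.
prefixed = '<svg:svg xmlns:svg="http://www.w3.org/2000/svg"><svg:rect/></svg:svg>'
defaulted = '<svg xmlns="http://www.w3.org/2000/svg"><rect/></svg>'


def count_rects(text):
    """Count rect children by namespace URI, ignoring whatever prefix was used."""
    root = ET.fromstring(text)
    return len(root.findall(f"{{{SVG}}}rect"))


prefixed_count = count_rects(prefixed)
defaulted_count = count_rects(defaulted)
same_root_name = ET.fromstring(prefixed).tag == ET.fromstring(defaulted).tag
```

Code that matched on the literal string "svg:rect" would miss the second document entirely.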

Reuse XHTML for generic narrative content

Choose the right schema language for the job


	DTDs
	The W3C XML Schema Language
	RELAX NG
	Schematron

Use only what you need


	You need
		Well-formed XML 1.0
		A parser
	You probably need:
		Namespaces
	You may not need:
		DTDs
		Schemas
		XInclude
		SOAP
		WS-Kitchen-Sink
		etc.

Always use a parser


	Regular expressions can’t reliably handle:
		Detecting encoding
		Comments and processing instructions that contain tags
		CDATA sections
		Unexpected placement of spaces and line breaks within tags
		Default attribute values
		Character and entity references
		Malformed documents
		Internal DTD Subset
	Why do people skip the parser?
		Unfamiliarity with parsers
		Belief that parsers are too slow
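
A short sketch of how a regular expression goes wrong where a parser does not, using Python's re and xml.etree.ElementTree. The order document and its element names are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

doc = """<order>
  <!-- old value: <price>9.99</price> -->
  <note><![CDATA[<price>0</price>]]></note>
  <price
     currency="USD">12.50</price>
</order>"""

# The naive pattern matches the comment and the CDATA section,
# and misses the real element because of its attribute and line break.
regex_hits = re.findall(r"<price>(.*?)</price>", doc)

# The parser ignores the comment, treats CDATA as text,
# and finds the one real price element.
parsed_price = ET.fromstring(doc).findtext("price")
```

The pattern could be patched for this document, but each patch only covers one more item from the list above.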

Layer Functionality

Program to standard APIs


	Easier to deploy in Java 1.4/1.5
	Different implementations have different performance characteristics
	SAX is fast
	DOM interoperates
	Semi-standard:
		JDOM
		XOM
	Bleeding edge
		StAX
		JAXB
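
The slides are about Java, but the same standard interfaces exist elsewhere. As a hedged illustration, this sketch uses Python's xml.sax, which follows the SAX callback model; the ElementCounter class is a hypothetical name.

```python
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Count how many times each element name starts."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


handler = ElementCounter()
# The handler depends only on the SAX interface, not on a particular parser.
xml.sax.parseString(b"<list><item/><item/><item/></list>", handler)
counts = handler.counts
```

Because the handler is written against the standard interface, the underlying parser can be swapped without touching this code.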

Read the complete DTD


	Be conservative in what you generate; liberal in what you accept
	Important content from DTD:
		Default attribute values
		Namespace declarations
		Entity references
		ID types

Navigate with XPath


	More robust against unexpected structure
	Allow optimization by engine
	Easier to code; enhanced programmer productivity
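
A minimal sketch of the robustness claim, assuming Python's xml.etree.ElementTree, which supports only a subset of XPath. The Statement and Batch elements are hypothetical.

```python
import xml.etree.ElementTree as ET

doc = """<Statement>
  <Transaction type="deposit"><Amount>50.00</Amount></Transaction>
  <Batch>
    <Transaction type="withdrawal"><Amount>200.00</Amount></Transaction>
  </Batch>
  <Transaction type="withdrawal"><Amount>20.00</Amount></Transaction>
</Statement>"""

root = ET.fromstring(doc)

# One expression finds withdrawals at any depth, including the one
# nested inside the unexpected Batch wrapper.
withdrawals = [
    amount.text
    for amount in root.findall(".//Transaction[@type='withdrawal']/Amount")
]

# Hand-written navigation over direct children misses the nested one.
direct_only = [
    child.findtext("Amount")
    for child in root
    if child.tag == "Transaction" and child.get("type") == "withdrawal"
]
```

The XPath version keeps working when a wrapper element appears; the child-by-child loop silently drops data.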

Validate inside your program with schemas

Part IV: Implementation

Write documents in Unicode


	Prefer UTF-8
		Smaller in English
		ASCII compatible
	Normalization
		É, ü, ì and so forth
		NFC
		ICU
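
Both points can be checked quickly. This sketch uses Python's unicodedata module rather than ICU, which is a substitution made for the example.

```python
import unicodedata

# UTF-8 uses one byte per ASCII character; UTF-16 uses two.
title = "Effective XML"
utf8_size = len(title.encode("utf-8"))
utf16_size = len(title.encode("utf-16-le"))  # little-endian, no byte order mark

# The same visible character can be one code point or two.
composed = "\u00C9"     # É as a single code point
decomposed = "E\u0301"  # E followed by a combining acute accent
equal_before = composed == decomposed

# NFC normalization composes the pair into the single code point.
normalized = unicodedata.normalize("NFC", decomposed)
equal_after = composed == normalized
```

Without normalization, two names that look identical on screen can fail a string comparison.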

Avoid Vendor Lock-in; Beware


	Opaque, binary data used in place of marked up text.
	Over-abbreviated, non-obvious names like F17354 and grgyt
	APIs that hide the XML
	Products that focus on the "Infoset"
	Alternate serializations of XML
	Patented formats

Hang on to your relational database

Document Namespaces with RDDL

Pick the correct MIME type


	application/xml
	Not text/xml!
	Don't use charset
	application/mathml+xml
	image/svg+xml
	application/xslt+xml

TagSoup Your HTML

Catalog common resources


	<?xml version="1.0"?>
	<catalog xmlns=
	"urn:oasis:names:tc:entity:xmlns:xml:catalog"
	>

	<public publicId=
	"-//OASIS//DTD DocBook XML V4.2//EN"
	uri=
	"file:///opt/xml/docbook/docbookx.dtd"/>

	</catalog>
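
To show what a resolver does with such a catalog, here is a rough sketch in Python. It handles only public entries; real catalog resolvers support much more. The function name resolve_public is hypothetical.

```python
import xml.etree.ElementTree as ET

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"

catalog_xml = """<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//OASIS//DTD DocBook XML V4.2//EN"
          uri="file:///opt/xml/docbook/docbookx.dtd"/>
</catalog>"""


def resolve_public(catalog_text, public_id):
    """Return the local URI mapped to a public ID, or None if unmapped."""
    root = ET.fromstring(catalog_text)
    for entry in root.findall(f"{{{CATALOG_NS}}}public"):
        if entry.get("publicId") == public_id:
            return entry.get("uri")
    return None


local_dtd = resolve_public(catalog_xml, "-//OASIS//DTD DocBook XML V4.2//EN")
unknown = resolve_public(catalog_xml, "-//Example//DTD Unknown//EN")
```

The document keeps its public identifier, and the catalog decides where the local copy of the DTD lives.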

Compress if space is a problem
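
A minimal sketch using Python's gzip module, assuming general-purpose compression is what is wanted here. The log document is invented, and its heavy repetition makes it compress better than typical data.

```python
import gzip

# Repetitive markup is typical of record-like XML and compresses well.
xml_bytes = (
    "<log>" + "<entry level='info'>started</entry>" * 200 + "</log>"
).encode("utf-8")

compressed = gzip.compress(xml_bytes)
restored = gzip.decompress(compressed)

original_size = len(xml_bytes)
compressed_size = len(compressed)
```

The document stays ordinary XML for every tool that reads it; only storage and transport see the compressed form.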

To Learn More


	This Presentation: http://cafeconleche.org/slides/lxny/effectivexml
	Effective XML: 50 Specific Ways to Improve Your XML Documents
		Elliotte Rusty Harold
		Addison-Wesley, 2003
		ISBN 0-321-15040-6
		$44.99
		http://cafeconleche.org/books/effectivexml