Effective XML


	XML Developers Network of the Capital District
	Elliotte Rusty Harold
	elharo@metalab.unc.edu
	http://www.cafeconleche.org/

Part I: Syntax

Item 1: Include an XML declaration


	<?xml version="1.0" encoding="UTF-8"?>

	Optional, but treat as required
	Specifies the XML version and the character encoding
	Very important for detecting encoding
	Identifies XML when file and media type information is unavailable or unreliable
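
A minimal sketch of why the declaration matters for encoding detection, using Python's standard library (an assumption; the slides name no language). The same Latin-1 bytes parse correctly only when the declaration names the encoding; the sample element is hypothetical.

```python
import xml.etree.ElementTree as ET

# The same Latin-1 bytes, with and without an XML declaration.
body = "<name>Caf\u00e9</name>".encode("iso-8859-1")
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body


def parse_text(data):
    """Return the root element's text, or None if the bytes are not well-formed."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


with_declaration = parse_text(declared)   # decoded as ISO-8859-1
without_declaration = parse_text(body)    # parser assumes UTF-8 and rejects the bytes
```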

Item 3: Stay with XML 1.0


	XML 1.1:
		New name characters
		C0 control characters
		C1 control characters
		NEL
		Undeclare namespace prefixes
	Incompatible with
		Most XML parsers
		W3C and RELAX NG schema languages
		XOM, JDOM

Part II: Structure

The XML Stack

Item 14: Allow all XML syntax


	CDATA sections
	Entity references
	Processing instructions
	Comments
	Numeric character references
	Document type declarations
	Different ways of representing the same core content; not different information

Item 9: Distinguish text from markup


	A DocBook element
	<programlisting><![CDATA[<value> <double>28657</double> </value>]]></programlisting>
	The content is: <value> <double>28657</double> </value>
	This is the same: <programlisting>&lt;value&gt; &lt;double&gt;28657&lt;/double&gt; &lt;/value&gt;</programlisting>
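
As a rough check of the point above, the sketch below parses the DocBook element once with a CDATA section and once with escaped characters, using Python's ElementTree (an illustrative choice, not something the slides prescribe). Both forms should yield one text node and no child elements.

```python
import xml.etree.ElementTree as ET

cdata = ET.fromstring(
    "<programlisting><![CDATA[<value> <double>28657</double> </value>]]>"
    "</programlisting>"
)
escaped = ET.fromstring(
    "<programlisting>&lt;value&gt; &lt;double&gt;28657&lt;/double&gt; &lt;/value&gt;"
    "</programlisting>"
)

# Both forms carry the same text and contain no child elements.
same_text = cdata.text == escaped.text
child_counts = (len(cdata), len(escaped))
```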

The reverse problem


	Tools that create XML from strings:
		Tree-based editors like <oXygen/> or XML Spy
		WYSIWYG applications like OpenOffice Writer
		Programming APIs such as DOM, JDOM, and XOM
	The tool automatically escapes reserved characters like <, >, or &.
	Just because something looks like an XML tag does not mean it is an XML tag.
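
A small sketch of the reverse problem, assuming Python's ElementTree as the programming API (the slides mention DOM, JDOM, and XOM; the element name here is hypothetical). A string handed to the API is text, so the API escapes it on output.

```python
import xml.etree.ElementTree as ET

para = ET.Element("para")
para.text = "Use <b> for bold & <i> for italic"   # a string, not markup

serialized = ET.tostring(para, encoding="unicode")
round_trip = ET.fromstring(serialized)            # text comes back, no child elements
```

The serialized form contains &lt;b&gt; rather than a b element, and parsing it again returns the original string.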

Item 10: White space matters


	Parsers report all white space in element content, including boundary white space
	An xml:space attribute is for the client application only, not the parser
	White space in attribute values is normalized
	Parsers do not report white space in the prolog, the epilog, the document type declaration, or inside tags.
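
The sketch below illustrates the first and third bullets with Python's ElementTree (an assumed API; element and attribute names are hypothetical). Boundary white space is reported as text, while a tab and a line break inside an attribute value are each normalized to a space.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring('<list>\n  <item note="a\tb\nc">one</item>\n</list>')
item = doc[0]

boundary_before = doc.text    # white space between <list> and <item>
boundary_after = item.tail    # white space between </item> and </list>
note = item.get("note")       # tab and line break normalized to spaces
```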

Item 11: Make structure explicit through markup


	Bad
		<Transaction>Withdrawal 2003 12 15 200.00</Transaction>
	Better
		<Transaction type="withdrawal">
		<Date>2003-12-15</Date>
		<Amount>200.00</Amount>
		</Transaction>
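
With the structure explicit, a client needs no string splitting to recover the fields. A minimal sketch, assuming Python's standard library and the "Better" markup above:

```python
import xml.etree.ElementTree as ET
from datetime import date
from decimal import Decimal

transaction = ET.fromstring(
    '<Transaction type="withdrawal">'
    "<Date>2003-12-15</Date>"
    "<Amount>200.00</Amount>"
    "</Transaction>"
)

# Each field is addressed by name instead of by position in a string.
kind = transaction.get("type")
when = date.fromisoformat(transaction.findtext("Date"))
amount = Decimal(transaction.findtext("Amount"))
```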

Item 12: Store metadata in attributes


	Material the reader doesn't want to see
		URLs
		IDs
		Styles
		Revision dates
		Author names
	No substructure
		Revision tracking
		Citations
	No multiple elements

Item 13: Remember mixed content


	Narrative documents
	Record-like documents
	The RSS problem
	<item>
	<title>Xerlin 1.3 released</title>
	<description>
	Xerlin 1.3, an open source XML Editor written in
	Java, has been released. Users can extend the
	application via custom editor interfaces for
	specific DTDs. New features in version 1.3 include
	XML Schema support, WebDAV capabilities, and
	various user interface enhancements. Java 1.2
	or later is required.
	</description>
	<link>http://www.cafeconleche.org/#news2003April7</link>
	</item>

What you really want is this:

What people do is this:

Item 16: Prefer URLs to unparsed entities and notations


	URLs are simple and well understood
	Notations and unparsed entities are confusing and little used
	URLs don't require the DTD to be read
	Many APIs don't even support notations and unparsed entities

Part III: Semantics

Item 17: Use processing instructions for process-specific content


	For a very particular, even local, process
	Describes how a particular process acts on the data in the document
	Does not describe or add to the content itself
	A unit that can be treated in isolation
	Content is not XML-like.
	Applies to the entire document
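
One familiar case that fits these criteria is the xml-stylesheet processing instruction. The sketch below reads it with Python's minidom (an assumed API; the style.css reference is hypothetical) and shows that the instruction arrives as an isolated target and data string, separate from the document's content.

```python
from xml.dom import minidom, Node

doc = minidom.parseString(
    '<?xml-stylesheet type="text/css" href="style.css"?><doc>content</doc>'
)

# Processing instructions in the prolog are children of the document node.
instructions = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]
```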

Processing instructions are not appropriate when:


	Content is closely related to the content of the document itself.
	Structure extends beyond a single processing instruction
	Needs to be validated.

Item 18: Include all information in instance documents


	Not all parsers read the DTD
	Especially browsers
	Beware
		Default attribute values
		Parsed entity references
		XInclude
		ID type dependence (XPath, DOM, etc.)

Item 19: Encode binary data using quoted printable and/or Base64


	Quoted printable works well for mostly text
	Base64 for non-text data
	Can you link to the data with a URL instead?
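
A minimal sketch of both encodings, assuming Python's base64 and quopri modules. The element names and the encoding attribute are hypothetical, chosen only for illustration.

```python
import base64
import quopri
import xml.etree.ElementTree as ET

binary = bytes(range(256))                                   # arbitrary non-text data
mostly_text = "R\u00e9sum\u00e9 attached".encode("iso-8859-1")

doc = ET.Element("attachments")
image = ET.SubElement(doc, "image", encoding="base64")
image.text = base64.b64encode(binary).decode("ascii")
note = ET.SubElement(doc, "note", encoding="quoted-printable")
note.text = quopri.encodestring(mostly_text).decode("ascii")

# Round trip through serialized XML, then decode again.
parsed = ET.fromstring(ET.tostring(doc))
decoded_image = base64.b64decode(parsed.findtext("image"))
decoded_note = quopri.decodestring(parsed.findtext("note").encode("ascii"))
```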

Items 20-22: Use namespaces for modularity and extensibility


	Not hard; simple cases can use one default namespace
	http URIs are normally preferred
	DTD validation is tricky
	Code to namespace URIs, not prefixes
	Avoid namespace prefixes in element content and attribute values
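
"Code to namespace URIs, not prefixes" can be sketched as follows, assuming Python's ElementTree and a hypothetical namespace URI. The same lookup works whether the document uses a default namespace or a prefix.

```python
import xml.etree.ElementTree as ET

NS = "http://example.com/ns/inventory"   # hypothetical namespace URI

default_ns = ET.fromstring(f'<item xmlns="{NS}"><name>Widget</name></item>')
prefixed = ET.fromstring(
    f'<inv:item xmlns:inv="{NS}"><inv:name>Widget</inv:name></inv:item>'
)


def item_name(root):
    # Match on the namespace URI; the prefix used in the document is irrelevant.
    return root.findtext(f"{{{NS}}}name")
```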

Item 23: Reuse XHTML for generic narrative content

Item 24: Choose the right schema language for the job


	DTDs
	The W3C XML Schema Language
	RELAX NG
	Schematron

Item 25: Pretend there's no such thing as the PSVI


	Post Schema Validation Infoset
	Adds types like int and gYear to elements
	Often not specific enough
	Element/attribute names are types

Item 28: Use only what you need


	You need
		Well-formed XML 1.0
		A parser
	You probably need:
		Namespaces
	You may not need:
		DTDs
		Schemas
		XInclude
		WS-Kitchen-Sink
		etc.

Item 29: Always use a parser


	Can't use regular expressions:
		Detecting encoding
		Comments and processing instructions that contain tags
		CDATA sections
		Unexpected placement of spaces and line breaks within tags
		Default attribute values
		Character and entity references
		Malformed documents
		Internal DTD Subset
	Why not?
		Unfamiliarity with parsers
		Too slow
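
A short sketch of three of the failure modes listed above (tags inside comments, CDATA sections, and line breaks within tags), comparing a regular expression with Python's ElementTree parser. The order document is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

doc = """<order>
  <!-- <price>0.00</price> old value, kept for reference -->
  <note><![CDATA[<price>free</price>]]></note>
  <price
     >19.99</price>
</order>"""

# The regex matches text inside the comment and the CDATA section,
# and misses the real element because of the line break in its start-tag.
regex_hits = re.findall(r"<price>(.*?)</price>", doc)

# The parser reports only the actual price element.
parser_hits = [e.text for e in ET.fromstring(doc).iter("price")]
```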

Item 30: Layer Functionality

Items 31-33: Program to standard APIs


	Easier to deploy in Java 1.4/1.5
	Different implementations have different performance characteristics
	SAX is fast
	DOM interoperates
	Semi-standard:
		JDOM
		XOM
	Bleeding edge
		StAX
		JAXB

Item 34: Read the complete DTD


	Be conservative in what you generate; liberal in what you accept
	Important content from DTD:
		Default attribute values
		Namespace declarations
		Entity references

Item 35: Navigate with XPath


	More robust against unexpected structure
	Allow optimization by engine
	Easier to code; enhanced programmer productivity

Item 36: Serialize XML with XML

Item 37: Validate inside your program with schemas

Part IV: Implementation

Item 38: Write documents in Unicode


	Prefer UTF-8
		Smaller in English
		ASCII compatible
	Normalization
		É, ü, ì and so forth
		NFC
		ICU
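
A minimal sketch of NFC normalization and of the UTF-8 size advantage for English text. It uses Python's unicodedata module rather than ICU, purely as an assumption for illustration.

```python
import unicodedata

decomposed = "e\u0301"    # e followed by a combining acute accent
composed = "\u00e9"       # the precomposed character

# NFC combines the two code points into the single precomposed one.
nfc = unicodedata.normalize("NFC", decomposed)

english = "Effective XML"
utf8_size = len(english.encode("utf-8"))        # one byte per ASCII character
utf16_size = len(english.encode("utf-16-be"))   # two bytes per character, no BOM
```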

Item 40: Avoid Vendor Lock-in; Beware


	Opaque, binary data used in place of marked up text.
	Over-abbreviated, non-obvious names like F17354 and grgyt
	APIs that hide the XML
	Products that focus on the "Infoset"
	Alternate serializations of XML
	Patented formats

Item 41: Hang on to your relational database

Item 42: Document Namespaces with RDDL

Item 43: Preprocess XSLT on the server side

Item 44: Serve XML+CSS to the client


	Supported by
		Safari
		IE 5.0 and later
		Mozilla
		Netscape 6 and later
		Konqueror
		Opera
		Firefox
		OmniWeb

Item 45: Pick the correct MIME type


	application/xml
	Not text/xml!
	Don't use charset
	application/mathml+xml
	image/svg+xml
	application/xslt+xml

Item 46: TagSoup Your HTML

Item 47: Catalog common resources


	<?xml version="1.0"?>
	<catalog xmlns=
	"urn:oasis:names:tc:entity:xmlns:xml:catalog"
	>

	<public publicId=
	"-//OASIS//DTD DocBook XML V4.2//EN"
	uri=
	"file:///opt/xml/docbook/docbookx.dtd"/>

	</catalog>
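
To show what a catalog lookup amounts to, the sketch below loads the catalog above with Python's ElementTree and maps public IDs to local URIs. It is an illustration only: the load_public_ids helper is hypothetical, it handles just the public element, and it is not a full OASIS catalog resolver.

```python
import xml.etree.ElementTree as ET

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"

catalog_xml = """<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//OASIS//DTD DocBook XML V4.2//EN"
          uri="file:///opt/xml/docbook/docbookx.dtd"/>
</catalog>"""


def load_public_ids(text):
    """Map each public ID in a catalog to its local URI (public entries only)."""
    root = ET.fromstring(text)
    return {
        entry.get("publicId"): entry.get("uri")
        for entry in root.iter(f"{{{CATALOG_NS}}}public")
    }


local_copies = load_public_ids(catalog_xml)
```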

Item 50: Compress if space is a problem
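
A minimal sketch, assuming general-purpose gzip compression from Python's standard library and a hypothetical, highly repetitive document. The markup stays ordinary XML and is restored byte for byte.

```python
import gzip

record = (
    '<Transaction type="withdrawal">'
    "<Date>2003-12-15</Date><Amount>200.00</Amount>"
    "</Transaction>"
)
document = ("<Transactions>" + record * 500 + "</Transactions>").encode("utf-8")

compressed = gzip.compress(document)
restored = gzip.decompress(compressed)
ratio = len(compressed) / len(document)   # fraction of the original size
```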

To Learn More


	This Presentation: http://cafeconleche.org/slides/albany/effectivexml
	Effective XML: 50 Specific Ways to Improve Your XML Documents
		Elliotte Rusty Harold
		Addison-Wesley, 2003
		ISBN 0-321-15040-6
		$44.99
		http://cafeconleche.org/books/effectivexml