Effective XML |
Elliotte Rusty Harold | |
elharo@metalab.unc.edu | |
http://www.cafeconleche.org/ |
Part I: Syntax |
Stay with XML 1.0 |
XML 1.1: | ||
New name characters | ||
C0 control characters | ||
C1 control characters | ||
NEL | ||
Undeclare namespace prefixes | ||
Incompatible with | ||
Most XML parsers | ||
W3C and RELAX NG schema languages | ||
XOM, JDOM |
Part II: Structure |
The XML Stack |
Allow All XML syntax |
CDATA sections | |
Entity references | |
Processing instructions | |
Comments | |
Numeric character references | |
Document type declarations | |
Different ways of representing the same core content; not different information |
Distinguish text from markup |
A DocBook element | |
<programlisting><![CDATA[<value> <double>28657</double> </value>]]></programlisting> |
|
The content is: <value> <double>28657</double> </value> |
|
This is the same: <programlisting>&lt;value&gt; &lt;double&gt;28657&lt;/double&gt; &lt;/value&gt;</programlisting> |
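A minimal sketch of the equivalence, assuming Python's standard-library ElementTree parser stands in for whatever parser you use. The helper name `text_content` is an illustration only. Both spellings of the DocBook element produce the same character data and no child elements.

```python
import xml.etree.ElementTree as ET

# The same content written two ways: once in a CDATA section,
# once with the reserved characters escaped as entity references.
cdata_form = (
    "<programlisting><![CDATA[<value> <double>28657</double> </value>]]>"
    "</programlisting>"
)
escaped_form = (
    "<programlisting>&lt;value&gt; &lt;double&gt;28657&lt;/double&gt; "
    "&lt;/value&gt;</programlisting>"
)


def text_content(xml_source: str) -> str:
    """Parse the document and return the root element's character data."""
    return ET.fromstring(xml_source).text


# The parser reports identical text either way; the CDATA section is
# only a different way of writing that text.
print(text_content(cdata_form) == text_content(escaped_form))
```

Code that behaves differently depending on which spelling was used is relying on syntax sugar rather than content.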
The reverse problem |
Tools that create XML from strings: | ||
Tree-based editors like <Oxygen/> or XML Spy | ||
WYSIWYG applications like OpenOffice Writer | ||
Programming APIs such as DOM, JDOM, and XOM | ||
The tool automatically escapes reserved characters like <, >, or &. | ||
Just because something looks like an XML tag does not mean it is an XML tag. |
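A small illustration of the escaping, assuming Python's standard-library ElementTree as the API (the slide names DOM, JDOM, and XOM; ElementTree is swapped in here only for brevity). The element name `para` and the sample string are made up.

```python
import xml.etree.ElementTree as ET

# Text handed to the API as a string is treated as text, not markup.
para = ET.Element("para")
para.text = "Use <b> for bold & <i> for italic"

# On output the serializer escapes the reserved characters.
serialized = ET.tostring(para, encoding="unicode")
print(serialized)

# Reading it back yields the original string and no child elements.
reparsed = ET.fromstring(serialized)
```

To get a real child element, create it through the API rather than pasting angle brackets into a string.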
White space matters |
Parsers report all white space in element content, including boundary white space | |
An xml:space attribute is for the client application only, not the parser | |
White space in attribute values is normalized | |
Parsers do not report white space in the prolog, the epilog, the document type declaration, or tags. |
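A rough demonstration of these rules, assuming Python's standard-library ElementTree (which uses the expat parser). The `list`, `item`, and `note` names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Blank lines after the XML declaration (prolog) and after the root
# element (epilog), extra spaces inside the start-tag, a line break and
# a tab inside the attribute value, and boundary white space around <item>.
doc = (
    '<?xml version="1.0"?>\n\n'
    '<list   note="a\n\tb">\n'
    '  <item>one</item>\n'
    '</list>\n\n'
)

root = ET.fromstring(doc)

boundary_before = root.text     # white space between <list> and <item>
boundary_after = root[0].tail   # white space between </item> and </list>
note = root.get("note")         # line break and tab each become one space

print(repr(boundary_before), repr(boundary_after), repr(note))
```

The white space in the prolog, the epilog, and inside the start-tag never reaches the application, while the boundary white space inside the element does.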
Make structure explicit through markup |
Bad | ||
<Transaction>Withdrawal 2003 12 15 200.00</Transaction> | ||
Better | ||
<Transaction type="withdrawal"> | ||
<Date>2003-12-15</Date> | ||
<Amount>200.00</Amount> | ||
</Transaction> | ||
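A sketch of what the explicit markup buys the consumer, assuming Python's standard library. Each field is addressed by name and converted to a typed value, with no guessing about which space-separated token is the date and which is the amount.

```python
import xml.etree.ElementTree as ET
from datetime import date
from decimal import Decimal

source = """<Transaction type="withdrawal">
  <Date>2003-12-15</Date>
  <Amount>200.00</Amount>
</Transaction>"""

tx = ET.fromstring(source)

kind = tx.get("type")                           # metadata in an attribute
when = date.fromisoformat(tx.findtext("Date"))  # ISO 8601 date
amount = Decimal(tx.findtext("Amount"))         # exact decimal, not a float

print(kind, when, amount)
```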
Store metadata in attributes |
Material the reader doesn’t want to see | ||
URLs | ||
IDs | ||
Styles | ||
Revision dates | ||
Author's name | ||
No substructure | ||
Revision tracking | ||
Citations | ||
No multiple elements |
Remember mixed content |
Narrative documents | ||
Record-like documents | ||
The RSS problem | ||
<item> | ||
<title>Xerlin 1.3 released</title> | ||
<description> | ||
Xerlin 1.3, an open source XML Editor written in | ||
Java, has been released. Users can extend the | ||
application via custom editor interfaces for | ||
specific DTDs. New features in version 1.3 include | ||
XML Schema support, WebDAV capabilities, and | ||
various user interface enhancements. Java 1.2 | ||
or later is required. | ||
</description> | ||
<link>http://www.cafeconleche.org/#news2003April7</link> | ||
</item> | ||
What you really want is this: |
What people do is this: |
Prefer URLs to unparsed entities and notations |
URLs are simple and well understood | |
Notations and unparsed entities are confusing and little used | |
URLs don’t require the DTD to be read | |
Many APIs don’t even support notations and unparsed entities |
Part III: Semantics |
Use processing instructions for process-specific content |
For a very particular, even local, process | |
Describes how a particular process acts on the data in the document | |
Does not describe or add to the content itself | |
A unit that can be treated in isolation | |
Content is not XML-like. | |
Applies to the entire document |
Processing instructions are not appropriate when: |
Content is closely related to the content of the document itself | |
Structure extends beyond a single processing instruction | |
Needs to be validated |
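One familiar case that fits the criteria above is the xml-stylesheet processing instruction: it addresses one process (rendering), applies to the whole document, and adds nothing to the content. A minimal sketch of reading it, assuming Python's standard-library minidom; the file name `style.css` and the `doc` element are placeholders.

```python
from xml.dom import minidom

source = (
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/css" href="style.css"?>'
    '<doc>content</doc>'
)

document = minidom.parseString(source)

# Collect (target, data) for every processing instruction in the prolog.
instructions = [
    (node.target, node.data)
    for node in document.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]

print(instructions)
```

A process that does not recognize the target can skip the instruction without losing any of the document's content.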
Include all information in instance documents |
Not all parsers read the DTD | ||
Especially browsers | ||
Beware | ||
Default attribute values | ||
Parsed entity references | ||
XInclude | ||
ID type dependence (XPath, DOM, etc.) |
Encode binary data using quoted printable and/or Base64 |
Quoted-printable works well for data that is mostly text | |
Base64 for non-text data | |
Can you link to the data with a URL instead? |
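A sketch of both encodings, assuming Python's standard-library `base64` and `quopri` modules. The element names and the `encoding` attribute are hypothetical conventions for this example, not part of any standard vocabulary.

```python
import base64
import quopri
import xml.etree.ElementTree as ET

binary = bytes(range(256))                               # arbitrary non-text data
mostly_text = "Café con leche, 100% XML".encode("utf-8")  # mostly ASCII text

# Base64 for the non-text data.
image = ET.Element("image", {"encoding": "base64"})
image.text = base64.b64encode(binary).decode("ascii")

# Quoted-printable keeps the ASCII parts readable in the document.
note = ET.Element("note", {"encoding": "quoted-printable"})
note.text = quopri.encodestring(mostly_text).decode("ascii")


def decode(element):
    """Recover the original bytes using the element's encoding attribute."""
    raw = element.text.encode("ascii")
    if element.get("encoding") == "base64":
        return base64.b64decode(raw)
    return quopri.decodestring(raw)


# Round-trip each element through serialization and parsing.
image_back = decode(ET.fromstring(ET.tostring(image)))
note_back = decode(ET.fromstring(ET.tostring(note)))
```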
Use namespaces for modularity and extensibility |
Not hard; simple cases can use one default namespace | |
http URIs are normally preferred | |
DTD validation is tricky | |
Code to namespace URIs, not prefixes | |
Avoid namespace prefixes in element content and attribute values |
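A sketch of coding to the namespace URI, assuming Python's standard-library ElementTree, which exposes names in `{uri}local` form. The two sample documents differ only in prefix, and the helper `count_rects` is illustrative.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Same vocabulary, two prefix choices: a default namespace and an "s" prefix.
default_ns = '<svg xmlns="http://www.w3.org/2000/svg"><rect/></svg>'
prefixed = '<s:svg xmlns:s="http://www.w3.org/2000/svg"><s:rect/></s:svg>'


def count_rects(source: str) -> int:
    """Count rect children by namespace URI and local name, ignoring prefixes."""
    root = ET.fromstring(source)
    return len(root.findall(f"{{{SVG_NS}}}rect"))


print(count_rects(default_ns), count_rects(prefixed))
```

Code that matched on the literal string `s:rect` would work for only one of these documents.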
Reuse XHTML for generic narrative content |
Choose the right schema language for the job |
DTDs | |
The W3C XML Schema Language | |
RELAX NG | |
Schematron |
Use only what you need |
You need | ||
Well-formed XML 1.0 | ||
A parser | ||
You probably need: | ||
Namespaces | ||
You may not need: | ||
DTDs | ||
Schemas | ||
XInclude | ||
SOAP | ||
WS-Kitchen-Sink | ||
etc. |
Always use a parser |
Can’t use regular expressions: | ||
Detecting encoding | ||
Comments and processing instructions that contain tags | ||
CDATA sections | ||
Unexpected placement of spaces and line breaks within tags | ||
Default attribute values | ||
Character and entity references | ||
Malformed documents | ||
Internal DTD Subset | ||
Why not? | ||
Unfamiliarity with parsers | ||
Too slow |
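A small illustration of three of the failure modes listed above (a comment that contains tags, a CDATA section, and a line break inside a tag), assuming Python's standard-library `re` and ElementTree. The `order`, `note`, and `price` vocabulary is made up.

```python
import re
import xml.etree.ElementTree as ET

source = """<order>
  <!-- old price was <price>9.99</price> -->
  <note><![CDATA[quote <price>0.00</price> if asked]]></note>
  <price
     currency="USD">12.50</price>
</order>"""

# A naive pattern finds the tag-like text in the comment and the CDATA
# section, and misses the real element because of the attribute and the
# line break inside its start-tag.
regex_hits = re.findall(r"<price>(.*?)</price>", source)

# A parser reports only the one real price element.
parsed = [p.text for p in ET.fromstring(source).iter("price")]

print(regex_hits, parsed)
```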
Layer Functionality |
Program to standard APIs |
Easier to deploy in Java 1.4/1.5 | ||
Different implementations have different performance characteristics | ||
SAX is fast | ||
DOM interoperates | ||
Semi-standard: | ||
JDOM | ||
XOM | ||
Bleeding edge | ||
StAX | ||
JAXB |
Read the complete DTD |
Be conservative in what you generate; liberal in what you accept | ||
Important content from DTD: | ||
Default attribute values | ||
Namespace declarations | ||
Entity references | ||
ID types |
Navigate with XPath |
More robust against unexpected structure | |
Allow optimization by engine | |
Easier to code; enhanced programmer productivity |
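A sketch of path-based navigation, assuming the limited XPath subset built into Python's standard-library ElementTree (a full XPath engine offers much more). The `Account` and `Batch` wrappers are invented to show that the query does not care how deeply a transaction is nested.

```python
import xml.etree.ElementTree as ET

source = """<Account>
  <Transaction type="withdrawal"><Date>2003-12-15</Date><Amount>200.00</Amount></Transaction>
  <Batch>
    <Transaction type="deposit"><Date>2003-12-16</Date><Amount>75.00</Amount></Transaction>
  </Batch>
  <Transaction type="withdrawal"><Date>2003-12-17</Date><Amount>40.00</Amount></Transaction>
</Account>"""

account = ET.fromstring(source)

# One expression replaces nested loops and child-position bookkeeping.
withdrawals = [
    amount.text
    for amount in account.findall(".//Transaction[@type='withdrawal']/Amount")
]

# Descendant search keeps working if a new wrapper element is introduced.
all_amounts = [amount.text for amount in account.findall(".//Amount")]

print(withdrawals, all_amounts)
```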
Validate inside your program with schemas |
Part IV: Implementation |
Write documents in Unicode |
Prefer UTF-8 | ||
Smaller in English | ||
ASCII compatible | ||
Normalization | ||
É, ü, ì and so forth | ||
NFC | ||
ICU |
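A sketch of NFC normalization, using Python's standard-library `unicodedata` module in place of ICU, which the slide names. The helper `nfc` is illustrative.

```python
import unicodedata

composed = "\u00c9"       # É as a single precomposed code point
decomposed = "E\u0301"    # E followed by a combining acute accent


def nfc(text: str) -> str:
    """Return the text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


# The two strings render identically but compare unequal until normalized.
print(composed == decomposed, nfc(composed) == nfc(decomposed))

# The normalized form is also one byte shorter in UTF-8 for this character.
print(len(decomposed.encode("utf-8")), len(nfc(decomposed).encode("utf-8")))
```

Normalizing before writing the document means string comparisons on element names and content behave the way a reader would expect.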
Avoid Vendor Lock-in; Beware |
Opaque, binary data used in place of marked up text. | |
Over-abbreviated, non-obvious names like F17354 and grgyt | |
APIs that hide the XML | |
Products that focus on the "Infoset" | |
Alternate serializations of XML | |
Patented formats |
Hang on to your relational database |
Document Namespaces with RDDL |
Pick the correct MIME type |
application/xml | |
Not text/xml! | |
Don't use charset | |
application/mathml+xml | |
image/svg+xml | |
application/xslt+xml |
TagSoup Your HTML |
Catalog common resources |
<?xml version="1.0"?> | |
<catalog xmlns= | |
"urn:oasis:names:tc:entity:xmlns:xml:catalog" | |
> | |
<public publicId= | |
"-//OASIS//DTD DocBook XML V4.2//EN" | |
uri= | |
"file:///opt/xml/docbook/docbookx.dtd"/> | |
</catalog> |
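A rough sketch of what a catalog lookup amounts to, assuming Python's standard-library ElementTree. It only reads `public` entries into a dictionary; a real catalog resolver handles more entry types and is normally plugged into the parser. The helper `load_public_ids` is hypothetical.

```python
import xml.etree.ElementTree as ET

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"

catalog_source = """<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//OASIS//DTD DocBook XML V4.2//EN"
          uri="file:///opt/xml/docbook/docbookx.dtd"/>
</catalog>"""


def load_public_ids(source: str) -> dict:
    """Map each public identifier in the catalog to its local URI."""
    root = ET.fromstring(source)
    return {
        entry.get("publicId"): entry.get("uri")
        for entry in root.iter(f"{{{CATALOG_NS}}}public")
    }


public_ids = load_public_ids(catalog_source)

# A resolver would return the local copy instead of fetching from the network.
print(public_ids.get("-//OASIS//DTD DocBook XML V4.2//EN"))
```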
Compress if space is a problem |
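A sketch of general-purpose compression applied to ordinary XML text, assuming Python's standard-library `gzip` module. The repeated `Transaction` record and the `Account` wrapper are sample data only; real documents compress less dramatically than this deliberately repetitive one.

```python
import gzip

record = (
    '<Transaction type="withdrawal">'
    "<Date>2003-12-15</Date><Amount>200.00</Amount>"
    "</Transaction>"
)
document = ("<Account>" + record * 500 + "</Account>").encode("utf-8")

# Repeated tag names are exactly what a general-purpose compressor removes.
compressed = gzip.compress(document)
restored = gzip.decompress(compressed)

print(len(document), len(compressed))
```

The document on disk or on the wire shrinks, while every tool that reads it still sees plain XML after decompression.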
To Learn More |
This Presentation: http://cafeconleche.org/slides/lxny/effectivexml | ||
Effective XML: 50 Specific Ways to Improve Your XML Documents | ||
Elliotte Rusty Harold | ||
Addison-Wesley, 2003 | ||
ISBN 0-321-15040-6 | ||
$44.99 | ||
http://cafeconleche.org/books/effectivexml |