| Effective XML |
| XML Developers Network of the Capital District | |
| Elliotte Rusty Harold | |
| elharo@metalab.unc.edu | |
| http://www.cafeconleche.org/ |
| Part I: Syntax |
| Item 1: Include an XML declaration |
| <?xml version="1.0" encoding="UTF-8"?> | |
| Optional, but treat as required | |
| Specifies version, character set, and encoding | |
| Very important for detecting encoding | |
| Identifies XML when file and media type information is unavailable or unreliable |
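To illustrate why the declaration matters for encoding detection, here is a minimal sketch using Python's standard xml.etree.ElementTree. The sample documents and variable names are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# A Latin-1 document whose declaration names its encoding (sample text is illustrative).
declared = '<?xml version="1.0" encoding="ISO-8859-1"?><name>café</name>'.encode("iso-8859-1")

# Parsing from bytes lets the parser read the declaration and choose the decoder.
name = ET.fromstring(declared).text

# The same bytes without a declaration: the parser must assume UTF-8,
# and the lone 0xE9 byte is not valid UTF-8.
undeclared = "<name>café</name>".encode("iso-8859-1")
try:
    ET.fromstring(undeclared)
    undeclared_parsed = True
except ET.ParseError:
    undeclared_parsed = False
```

With the declaration the text comes back intact; without it the document is rejected as malformed.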
| Item 3: Stay with XML 1.0 |
| XML 1.1: | ||
| New name characters | ||
| C0 control characters | ||
| C1 control characters | ||
| NEL | ||
| Undeclare namespace prefixes | ||
| Incompatible with | ||
| Most XML parsers | ||
| W3C and RELAX NG schema languages | ||
| XOM, JDOM | ||
| Part II: Structure |
| The XML Stack |
| Item 14: Allow All XML syntax |
| CDATA sections | |
| Entity references | |
| Processing instructions | |
| Comments | |
| Numeric character references | |
| Document type declarations | |
| Different ways of representing the same core content; not different information |
| Item 9: Distinguish text from markup |
| A DocBook element | |
| <programlisting><![CDATA[<value> <double>28657</double> </value>]]></programlisting> |
|
| The content is: <value> <double>28657</double> </value> |
|
| This is the same: <programlisting>&lt;value&gt; &lt;double&gt;28657&lt;/double&gt; &lt;/value&gt;</programlisting> |
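A quick way to check that a CDATA section and escaped text carry the same content is to parse both and compare. This sketch assumes Python's xml.etree.ElementTree; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

with_cdata = ("<programlisting><![CDATA[<value> <double>28657</double> </value>]]>"
              "</programlisting>")
with_escapes = ("<programlisting>&lt;value&gt; &lt;double&gt;28657&lt;/double&gt; "
                "&lt;/value&gt;</programlisting>")

a = ET.fromstring(with_cdata)
b = ET.fromstring(with_escapes)

# Both forms yield one text node and no child elements.
same_text = a.text == b.text
children = (len(a), len(b))
```

Neither parse produces a value or double element; the programlisting contains only text.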
| The reverse problem |
| Tools that create XML from strings: | ||
| Tree-based editors like <Oxygen/> or XML Spy | ||
| WYSIWYG applications like OpenOffice Writer | ||
| Programming APIs such as DOM, JDOM, and XOM | ||
| The tool automatically escapes reserved characters like <, >, or &. | ||
| Just because something looks like an XML tag does not mean it is an XML tag. | ||
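The same behavior can be seen in a programming API. The following sketch uses Python's xml.etree.ElementTree as a stand-in for the DOM-style APIs listed above; the element name and text are made up for illustration.

```python
import xml.etree.ElementTree as ET

para = ET.Element("para")
# This is a string handed to the API, not markup.
para.text = "Use <b> & </b> for bold"

# The serializer escapes the reserved characters on output.
serialized = ET.tostring(para, encoding="unicode")

# Parsing it back gives the original string and still no child elements.
reparsed = ET.fromstring(serialized)
```

Here serialized contains `&lt;b&gt;` rather than a b element, so anything that looked like a tag in the string remains plain text.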
| Item 10: White space matters |
| Parsers report all white space in element content, including boundary white space | |
| An xml:space attribute is for the client application only, not the parser | |
| White space in attribute values is normalized | |
| Parsers do not report white space in the prolog, the epilog, the document type declaration, or tags. |
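These rules can be observed directly. The sketch below assumes Python's xml.etree.ElementTree (backed by expat); the document and the names in it are invented for the example.

```python
import xml.etree.ElementTree as ET

# Blank lines after the declaration are prolog white space;
# the attribute value contains a literal newline and tab.
doc = '<?xml version="1.0"?>\n\n<list note="a\n\tb">\n  <item>one</item>\n</list>\n'
root = ET.fromstring(doc)

boundary = root.text      # white space between <list ...> and <item>
after_item = root[0].tail # white space between </item> and </list>
note = root.get("note")   # newline and tab each normalized to a space
```

The boundary white space inside the element is reported, the white space in the prolog is not, and the attribute value arrives with its line break and tab replaced by spaces.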
| Item 11: Make structure explicit through markup |
| Bad | ||
| <Transaction>Withdrawal 2003 12 15 200.00</Transaction> | ||
| Better | ||
| <Transaction type="withdrawal"> | ||
| <Date>2003-12-15</Date> | ||
| <Amount>200.00</Amount> | ||
| </Transaction> | ||
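With the structure made explicit, client code needs no string splitting or guessing about field order. A minimal sketch in Python, assuming xml.etree.ElementTree and the "Better" markup above:

```python
import xml.etree.ElementTree as ET
from datetime import date
from decimal import Decimal

better = """<Transaction type="withdrawal">
  <Date>2003-12-15</Date>
  <Amount>200.00</Amount>
</Transaction>"""

tx = ET.fromstring(better)

# Each field is addressed by name instead of by position in a string.
kind = tx.get("type")
when = date.fromisoformat(tx.findtext("Date"))
amount = Decimal(tx.findtext("Amount"))
```

The "Bad" form would instead require the program to know that the first token is the type, the next three are the date, and the last is the amount.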
| Item 12: Store metadata in attributes |
| Material the reader doesn't want to see | ||
| URLs | ||
| IDs | ||
| Styles | ||
| Revision dates | ||
| Author's name | ||
| No substructure | ||
| Revision tracking | ||
| Citations | ||
| No multiple elements | ||
| Item 13: Remember mixed content |
| Narrative documents | ||
| Record-like documents | ||
| The RSS problem | ||
| <item> | ||
| <title>Xerlin 1.3 released</title> | ||
| <description> | ||
| Xerlin 1.3, an open source XML Editor written in | ||
| Java, has been released. Users can extend the | ||
| application via custom editor interfaces for | ||
| specific DTDs. New features in version 1.3 include | ||
| XML Schema support, WebDAV capabilities, and | ||
| various user interface enhancements. Java 1.2 | ||
| or later is required. | ||
| </description> | ||
| <link>http://www.cafeconleche.org/#news2003April7</link> | ||
| </item> | ||
| What you really want is this: |
| What people do is this: |
| Item 16: Prefer URLs to unparsed entities and notations |
| URLs are simple and well understood | |
| Notations and unparsed entities are confusing and little used | |
| URLs don't require the DTD to be read | |
| Many APIs don't even support notations and unparsed entities |
| Part III: Semantics |
| Item 17: Use processing instructions for process-specific content |
| For a very particular, even local, process | |
| Describes how a particular process acts on the data in the document | |
| Does not describe or add to the content itself | |
| A unit that can be treated in isolation | |
| Content is not XML-like. | |
| Applies to the entire document |
| Processing instructions are not appropriate when: |
| Content is closely related to the content of the document itself. | |
| Structure extends beyond a single processing instruction | |
| Needs to be validated. |
| Item 18: Include all information in instance documents |
| Not all parsers read the DTD | ||
| Especially browsers | ||
| Beware | ||
| Default attribute values | ||
| Parsed entity references | ||
| XInclude | ||
| ID type dependence (XPath, DOM, etc.) | ||
| Item 19: Encode binary data using quoted printable and/or Base64 |
| Quoted printable works well for mostly text | |
| Base64 for non-text data | |
| Can you link to the data with a URL instead? |
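As a rough sketch of the Base64 option, the following Python round-trips arbitrary bytes through an element. The element name `attachment` and its `encoding` attribute are hypothetical, chosen only for this example.

```python
import base64
import xml.etree.ElementTree as ET

# Arbitrary binary data, including byte values that are illegal in XML text.
payload = bytes(range(256))

# Hypothetical element and attribute names, for illustration only.
elem = ET.Element("attachment", {"encoding": "base64"})
elem.text = base64.b64encode(payload).decode("ascii")

xml_text = ET.tostring(elem, encoding="unicode")

# The receiver decodes the element content back to the original bytes.
decoded = base64.b64decode(ET.fromstring(xml_text).text)
```

The encoded text uses only ASCII letters, digits, and a few punctuation characters, so it survives any XML encoding.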
| Item 20-22: Use namespaces for modularity and extensibility |
| Not hard; simple cases can use one default namespace | |
| http URIs are normally preferred | |
| DTD validation is tricky | |
| Code to namespace URIs, not prefixes | |
| Avoid namespace prefixes in element content and attribute values |
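"Code to namespace URIs, not prefixes" can be sketched as follows, assuming Python's xml.etree.ElementTree. The two sample documents and the helper name `count_rects` are illustrative.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# The same vocabulary written with a prefix and with a default namespace.
prefixed = '<svg:svg xmlns:svg="http://www.w3.org/2000/svg"><svg:rect/></svg:svg>'
default = '<svg xmlns="http://www.w3.org/2000/svg"><rect/></svg>'

def count_rects(xml_text):
    """Count rect children by namespace URI plus local name, ignoring prefixes."""
    root = ET.fromstring(xml_text)
    return len(root.findall(f"{{{SVG_NS}}}rect"))

counts = (count_rects(prefixed), count_rects(default))
```

Code that matched on the string "svg:rect" would find the element in only one of these documents.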
| Item 23: Reuse XHTML for generic narrative content |
| Item 24: Choose the right schema language for the job |
| DTDs | |
| The W3C XML Schema Language | |
| RELAX NG | |
| Schematron |
| Item 25: Pretend there's no such thing as the PSVI |
| Post Schema Validation Infoset | |
| Adds types like int and gYear to elements | |
| Often not specific enough | |
| Element/attribute names are types |
| Item 28: Use only what you need |
| You need | ||
| Well-formed XML 1.0 | ||
| A parser | ||
| You probably need: | ||
| Namespaces | ||
| You may not need: | ||
| DTDs | ||
| Schemas | ||
| XInclude | ||
| WS-Kitchen-Sink | ||
| etc. | ||
| Item 29: Always use a parser |
| Can't use regular expressions: | ||
| Detecting encoding | ||
| Comments and processing instructions that contain tags | ||
| CDATA sections | ||
| Unexpected placement of spaces and line breaks within tags | ||
| Default attribute values | ||
| Character and entity references | ||
| Malformed documents | ||
| Internal DTD Subset | ||
| Why not? | ||
| Unfamiliarity with parsers | ||
| Too slow | ||
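A small demonstration of a few of these failure modes, sketched in Python with the standard re and xml.etree.ElementTree modules. The document and the regular expression are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

doc = """<prices>
  <!-- <price>0.00</price> old value, kept for reference -->
  <price
     currency="USD">19.99</price>
  <note><![CDATA[<price>n/a</price>]]></note>
</prices>"""

# A naive pattern matches the tag inside the comment and the CDATA section,
# and misses the real element because of the line break and attribute.
naive = re.findall(r"<price>(.*?)</price>", doc)

# A parser reports only the actual price element.
parsed = [p.text for p in ET.fromstring(doc).iter("price")]
```

The regular expression returns two wrong answers and no right one; the parser returns the single real price.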
| Item 30: Layer Functionality |
| Item 31-33: Program to standard APIs |
| Easier to deploy in Java 1.4/1.5 | ||
| Different implementations have different performance characteristics | ||
| SAX is fast | ||
| DOM interoperates | ||
| Semi-standard: | ||
| JDOM | ||
| XOM | ||
| Bleeding edge | ||
| StAX | ||
| JAXB | ||
| Item 34: Read the complete DTD |
| Be conservative in what you generate; liberal in what you accept | ||
| Important content from DTD: | ||
| Default attribute values | ||
| Namespace declarations | ||
| Entity references | ||
| Item 35: Navigate with XPath |
| More robust against unexpected structure | |
| Allow optimization by engine | |
| Easier to code; enhanced programmer productivity |
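To show the robustness point, here is a sketch comparing hand-written tree walking with a path expression. It assumes Python's xml.etree.ElementTree, which supports only a subset of XPath; the documents and function names are illustrative.

```python
import xml.etree.ElementTree as ET

flat = "<Statement><Transaction><Amount>200.00</Amount></Transaction></Statement>"
nested = ("<Statement><Account><Transaction><Amount>200.00</Amount>"
          "</Transaction></Account></Statement>")

def amounts_by_walking(root):
    """Assumes Transaction elements are direct children of the root."""
    return [a.text for t in root if t.tag == "Transaction"
            for a in t if a.tag == "Amount"]

def amounts_by_path(root):
    """Finds Transaction/Amount at any depth below the root."""
    return [a.text for a in root.findall(".//Transaction/Amount")]

walked = [amounts_by_walking(ET.fromstring(d)) for d in (flat, nested)]
pathed = [amounts_by_path(ET.fromstring(d)) for d in (flat, nested)]
```

When an unexpected Account wrapper appears, the hard-coded walk silently finds nothing while the path expression still works.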
| Item 36: Serialize XML with XML |
| Item 37: Validate inside your program with schemas |
| Part IV: Implementation |
| Item 38: Write documents in Unicode |
| Prefer UTF-8 | ||
| Smaller in English | ||
| ASCII compatible | ||
| Normalization | ||
| , , and so forth | ||
| NFC | ||
| ICU | ||
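The normalization issue can be sketched with Python's standard unicodedata module, used here in place of ICU purely for brevity. The sample characters are an assumption chosen for illustration.

```python
import unicodedata

composed = "\u00e9"     # e with acute accent as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent

# The two strings display identically but do not compare equal.
equal_before = composed == decomposed

# After normalizing to NFC, both use the composed form.
nfc = unicodedata.normalize("NFC", decomposed)
equal_after = nfc == composed
```

Normalizing text to NFC before writing it out means element names and content compare equal whenever they look equal.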
| Item 40: Avoid Vendor Lock-in; Beware |
| Opaque, binary data used in place of marked up text. | |
| Over-abbreviated, inobvious names like F17354 and grgyt | |
| APIs that hide the XML | |
| Products that focus on the "Infoset" | |
| Alternate serializations of XML | |
| Patented formats |
| Item 41: Hang on to your relational database |
| Item 42: Document Namespaces with RDDL |
| Item 43: Preprocess XSLT on the server side |
| Item 44: Serve XML+CSS to the client |
| Supported by | ||
| Safari | ||
| IE 5.0 and later | ||
| Mozilla | ||
| Netscape 6 and later | ||
| Konqueror | ||
| Opera | ||
| Firefox | ||
| Omniweb | ||
| Item 45: Pick the correct MIME type |
| application/xml | |
| Not text/xml! | |
| Don't use charset | |
| application/mathml+xml | |
| image/svg+xml | |
| application/xslt+xml |
| Item 46: TagSoup Your HTML |
| Item 47: Catalog common resources |
| <?xml version="1.0"?> | |
| <catalog xmlns= | |
| "urn:oasis:names:tc:entity:xmlns:xml:catalog" | |
| > | |
| <public publicId= | |
| "-//OASIS//DTD DocBook XML V4.2//EN" | |
| uri= | |
| "file:///opt/xml/docbook/docbookx.dtd"/> | |
| </catalog> |
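To make the mapping concrete, the sketch below reads a catalog like the one above and looks up a public identifier. It is only an illustration of the idea, not a conforming catalog resolver; the function name `resolve_public` is hypothetical, and a real resolver handles many more entry types.

```python
import xml.etree.ElementTree as ET

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"

catalog_xml = """<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//OASIS//DTD DocBook XML V4.2//EN"
          uri="file:///opt/xml/docbook/docbookx.dtd"/>
</catalog>"""

def resolve_public(catalog_text, public_id):
    """Return the local URI mapped to public_id, or None if there is no entry."""
    root = ET.fromstring(catalog_text)
    for entry in root.iter(f"{{{CATALOG_NS}}}public"):
        if entry.get("publicId") == public_id:
            return entry.get("uri")
    return None

local_dtd = resolve_public(catalog_xml, "-//OASIS//DTD DocBook XML V4.2//EN")
```

A parser configured with such a catalog can load the DocBook DTD from local disk instead of fetching it over the network.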
| Item 50: Compress if space is a problem |
| To Learn More |
| This Presentation: http://cafeconleche.org/slides/albany/effectivexml | ||
| Effective XML: 50 Specific Ways to Improve Your XML Documents | ||
| Elliotte Rusty Harold | ||
| Addison-Wesley, 2003 | ||
| ISBN 0-321-15040-6 | ||
| $44.99 | ||
| http://cafeconleche.org/books/effectivexml |