XML Fundamentals

Elliotte Rusty Harold

Software Development 2001 West

April 10, 2001

elharo@metalab.unc.edu

http://www.ibiblio.org/xml/

What is XML?

Extensible Markup Language
A syntax for documents
A Meta-Markup Language
A Structural and Semantic language, not a formatting language
Not just for Web pages

Extensible Markup Language

Language

It has a grammar
It has a vocabulary (sort of)
It can be parsed by machines

Markup Language

It says what things are; not what they do
It is not a programming language
It is not compiled

Extensible

You can add words to the language

XML is a Meta Markup Language

Not like HTML, troff, LaTeX
Make up the tags you need as you need them
The tags you create can be documented in a Document Type Definition (DTD)
A meta syntax for domain-specific markup languages like MusicML, MathML, and XHTML

XML Applications

A specific markup language that uses the XML meta-syntax is called an XML application
Different XML applications have their own more constricted syntaxes and vocabularies within the broader XML syntax
Further syntax can be layered on top of this; e.g. data typing through schemas

Some XML Applications

Clinical Trial Data Model for drug trials
National Library of Medicine (NLM) XML Data Formats for MEDLINE data over FTP replacing ELHILL Unit Record Format (EURF) on magnetic tape
Health Level Seven XML Patient Record Architecture
ASTM XML Document Type Definitions (DTDs) for Health Care
Molecular Dynamics Markup Language (MoDL)
BIOpolymer Markup Language (BIOML)
Gene Expression Markup Language (GEML)
Chemical Markup Language
Bioinformatic Sequence Markup Language
Open Financial Exchange
Human Resources Markup Language
XML Court Interface
and many more...

XML describes structure and semantics, not formatting

XML documents form a tree
Element and attribute names reflect the kind of the element
Formatting can be added with a style sheet

A Song Description in HTML

<dt>Hot Cop
<dd> by Jacques Morali, Henri Belolo, and Victor Willis
<ul>
<li>Producer: Jacques Morali
<li>Publisher: PolyGram Records
<li>Length: 6:20
<li>Written: 1978
<li>Artist: Village People
</ul>

View Document in Browser

A Song Description in XML

<SONG>
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>

View Document in Browser

Editing and Saving XML Files

Plain ASCII or UTF-8 text
.xml is standard file extension
Any standard text editor will work

Style Sheets provide formatting

SONG {display: block; font-family: New York, Times New Roman, serif}
TITLE {display: block; font-size: 24pt; 
       font-weight: bold; font-family: Helvetica, sans}
COMPOSER {display: block}
PRODUCER {display: block}
YEAR {display: block}
PUBLISHER {display: block}
LENGTH {display: block}
ARTIST {display: block; font-style: italic}

Attaching style sheets to documents

<?xml-stylesheet type="text/css" href="song1.css"?>

<?xml-stylesheet type="text/css" href="song.css"?>
<SONG>
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>

View Document in Browser

Style Sheet Languages

Cascading Style Sheets Level 1 (CSS1)
- Internet Explorer 5.0
- Mozilla 5.0
Cascading Style Sheets Level 2 (CSS2)
- Internet Explorer 5 (partial)
- Mozilla 5.0 (partial)
Extensible Stylesheet Language (XSL)
- Internet Explorer 5.0 (older draft, buggy)
- LotusXSL, XT, Xalan, Saxon, other non-browser converters
Document Style and Semantics Language (DSSSL)
- Jade

An XSLT stylesheet

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="/">
    <html>
      <head><title>Song</title></head>
      <body>
        <xsl:apply-templates select="SONG"/>    
      </body>
    </html>
  </xsl:template>
  
  <xsl:template match="SONG">
    <h1>
      <xsl:value-of select="TITLE"/> 
      by the 
      <xsl:value-of select="ARTIST"/>
    </h1>
    
    <ul>
      <li>Length: <xsl:value-of select="LENGTH"/></li>
      <li>Producer: <xsl:value-of select="PRODUCER"/></li>
      <li>Publisher: <xsl:value-of select="PUBLISHER"/></li>
      <li>Year: <xsl:value-of select="YEAR"/></li>
      <xsl:apply-templates select="COMPOSER"/>
    </ul>
  </xsl:template>

  <xsl:template match="COMPOSER">
    <li>Composer: <xsl:value-of select="."/></li>
  </xsl:template>

</xsl:stylesheet>

Transforming the Document

D:\fundamentals\examples> saxon hotcop.xml song3.xsl
<html>
<head>
<title>Song</title>
</head>
<body>
<h1>Hot Cop
      by the
      Village People</h1>
<ul>
<li>Length: 6:20</li>
<li>Producer: Jacques Morali</li>
<li>Publisher: PolyGram Records</li>
<li>Year: 1978</li>
<li>Composer: Jacques Morali</li>
<li>Composer: Henri Belolo</li>
<li>Composer: Victor Willis</li>
</ul>
</body>
</html>

Or alternately:

% java com.icl.saxon.StyleSheet -x org.apache.xerces.parsers.SAXParser xml_fundamentals.xml slides.xsl hotcop.xml song3.xsl <html> ...

View Document in Browser

CSS or XSL?

CSS has broader support
CSS is more stable
XSL is much more powerful
XSL can be used without browser support by transforming to HTML on the server side

Well-formedness

Rules:

Open and close all tags
Empty tags end with />
There is a unique root element
Elements may not overlap
Attribute values are quoted
< and & are only used to start tags and entities
Only the five predefined entity references are used
Plus more...

Validity

To be valid an XML document must be

Well-formed
Must have a Document Type Definition (DTD)
Must comply with the constraints specified in the DTD

A DTD for Songs

<!ELEMENT SONG (TITLE, COMPOSER+, PRODUCER*, 
 PUBLISHER*, LENGTH?, YEAR?, ARTIST+)>

<!ELEMENT TITLE (#PCDATA)>

<!ELEMENT COMPOSER (#PCDATA)>
<!ELEMENT PRODUCER (#PCDATA)>
<!ELEMENT PUBLISHER (#PCDATA)>
<!ELEMENT LENGTH (#PCDATA)>
<!-- This should be a four digit year like "1999",
     not a two-digit year like "99" -->
<!ELEMENT YEAR (#PCDATA)>

<!ELEMENT ARTIST (#PCDATA)>

A Valid Song Document

<!DOCTYPE SONG SYSTEM "song.dtd">
<SONG>
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>

Checking Validity

To check validity you pass the document through a validating parser which should report any errors it finds. For example,

% java dom.DOMCount -v validhotcop.xml
[Error] validhotcop.xml:13:9: The content of element type "SONG" must match "(TI
TLE,COMPOSER+,PRODUCER*,PUBLISHER*,LENGTH?,YEAR?)".
validhotcop.xml: 550 ms (10 elems, 0 attrs, 28 spaces, 98 chars)

A valid document:

% java dom.DOMCount -v validhotcop.xml
validhotcop.xml: 291 ms (10 elems, 0 attrs, 28 spaces, 98 chars)

A More Complex Example

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/css" href="song.css"?>
<!DOCTYPE SONG SYSTEM "expanded_song.dtd">
<SONG xmlns="http://metalab.unc.edu/xml/namespace/song"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <TITLE>Hot Cop</TITLE>
  <PHOTO 
    xlink:type="simple" xlink:show="onLoad" xlink:href="hotcop.jpg"
    ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200"/>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <!-- The publisher is actually Polygram but I needed 
       an example of a general entity reference. -->
  <PUBLISHER xlink:type="simple" xlink:href="http://www.amrecords.com/">
    A &amp; M Records
  </PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>
<!-- You can tell what album I was 
     listening to when I wrote this example -->

The XML Declaration

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

version attribute
- required
- always has the value 1.0
standalone attribute
- yes
- no
encoding attribute
- UTF-8
- 8859_1
- etc.

Attributes

<PHOTO xlink:type="simple" xlink:show="onLoad" xlink:href="hotcop.jpg" ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200" />

name="value" same as in HTML
Generally used for meta-information
Attribute values are quoted with either single or double quotes:
- Good:
- Bad:

Empty Element Tags

<PHOTO xlink:type="simple" xlink:show="onLoad" xlink:href="hotcop.jpg" ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200" />

Ends with /> instead of >
<PHOTO/> is semantically the same as <PHOTO></PHOTO>
Just syntactic sugar

Comments

Essentially the same as in HTML

Namespaces

Let you mix and match different XML vocabularies
URIs identify elements and attributes that belong to different XML applications
Prefixes can change if the URI stay the same

<SONG xmlns="http://www.ibiblio.org/xml/namespace/song"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <TITLE>Hot Cop</TITLE>
  <PHOTO 
    xlink:type="simple" xlink:show="onLoad" xlink:href="hotcop.jpg"
    ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200"/>
  <COMPOSER>Jacques Morali</COMPOSER>
  <PUBLISHER xlink:type="simple" xlink:href="http://www.amrecords.com/">
    A &amp; M Records
  </PUBLISHER>
  <ARTIST>Village People</ARTIST>
</SONG>

Entity References

A & M Records

< and & are only used to start tags and entities
- Good:
  <H1>O'Reilly & Associates</H1>
- Bad:
  <H1>O'Reilly & Associates</H1>
- Good:
  <CODE>for (int i = 0; i <= args.length; i++ ) { </CODE>
- Bad:
  <CODE>for (int i = 0; i <= args.length; i++ ) { </CODE>
Only the five predefined entity references are used
- Good:
- Bad:
Entity references must end with a semicolon.
- < is good
- &lt is bad

A More Complex DTD

<!ELEMENT SONG (TITLE, PHOTO?, COMPOSER+, PRODUCER*, 
 PUBLISHER*, LENGTH?, YEAR?, ARTIST+)>
<!ATTLIST SONG xmlns       CDATA #REQUIRED
               xmlns:xlink CDATA #REQUIRED>
<!ELEMENT TITLE (#PCDATA)>

<!ELEMENT PHOTO EMPTY>
<!ATTLIST PHOTO xlink:type CDATA #FIXED "simple"
                xlink:href CDATA #REQUIRED
                xlink:show CDATA #IMPLIED
                ALT        CDATA #REQUIRED
                WIDTH      CDATA #REQUIRED
                HEIGHT     CDATA #REQUIRED
>

<!ELEMENT COMPOSER (#PCDATA)>
<!ELEMENT PRODUCER (#PCDATA)>
<!ELEMENT PUBLISHER (#PCDATA)>
<!ATTLIST PUBLISHER xlink:type CDATA #IMPLIED
                    xlink:href CDATA #IMPLIED
>

<!ELEMENT LENGTH (#PCDATA)>
<!-- This should be a four digit year like "1999",
     not a two-digit year like "99" -->
<!ELEMENT YEAR (#PCDATA)>

<!ELEMENT ARTIST (#PCDATA)>

What is XML used for?

Domain-Specific Markup Languages
Self-Describing Data
Interchange of Data Among Applications

Domain-Specific Markup Languages

Non proprietary format
Don't pay for what you don't use
You don't have to answer the questions you don;t care about.
Many free tools
Huge support infrastructure

Self-Describing Data

Much data is lost due to format problems
XML is very simple
XML is self-describing
XML is well documented

An XML Fragment

<PERSON ID="p1100" SEX="M">
  <NAME>
    <GIVEN>Judson</GIVEN>
    <SURNAME>McDaniel</SURNAME>
  </NAME>
  <BIRTH>
    <DATE>21 Feb 1834</DATE>
  </BIRTH>
  <DEATH>
    <DATE>9 Dec 1905</DATE>
  </DEATH>
</PERSON>

Interchange of Data Among Applications

E-commerce
Syndication
EAI and EDI

Example XML Applications

Web Pages
Mathematical Equations
Music Notation
Vector Graphics
Metadata
and more...

Mathematical Markup Language

<?xml version="1.0"?>
<html xmlns="http://www.w3.org/TR/REC-html40"
      xmlns:m="http://www.w3.org/TR/REC-MathML/"
>
<head>
<title>Fiat Lux</title>
<meta name="GENERATOR" content="amaya V1.3b" />
</head>
<body>

<P>
And God said,
</P>

<math>
  <m:mrow>
    <m:msub>
      <m:mi>&delta;</m:mi>
      <m:mi>&alpha;</m:mi>
    </m:msub>
    <m:msup>
      <m:mi>F</m:mi>
      <m:mi>&alpha;&beta;</m:mi>
    </m:msup>
    <m:mi></m:mi>
    <m:mo>=</m:mo>
    <m:mi></m:mi>
    <m:mfrac>
      <m:mrow>
        <m:mn>4</m:mn>
        <m:mi>&pi;</m:mi>
      </m:mrow>
      <m:mi>c</m:mi>
    </m:mfrac>
    <m:mi></m:mi>
    <m:msup>
      <m:mi>J</m:mi>
      <m:mrow>
        <m:mi>&beta;</m:mi>
        <m:mo></m:mo>
      </m:mrow>
    </m:msup>
  </m:mrow>
</math>

<P>
and there was light
</P>
</body>
</html>

Channel Definition Format

<?xml version="1.0"?>
<CHANNEL HREF="http://www.ibiblio.org/xml/index.html">
  <TITLE>Cafe con Leche</TITLE>
  <ITEM HREF="http://www.ibiblio.org/xml/books.html">
    <TITLE>Books about XML</TITLE>
  </ITEM>
  <ITEM HREF="http://www.ibiblio.org/xml/tradeshows.html">
    <TITLE>Trade shows and conferences about XML</TITLE>
  </ITEM>
  <ITEM HREF="http://www.ibiblio.org/xml/lists.htm">
    <TITLE>Mailing Lists dedicated to XML</TITLE>
  </ITEM>
</CHANNEL>

Classic Literature

Vector Graphics

Vector Markup Language (VML)

Internet Explorer 5.0
Microsoft Office 2000

Scalable Vector Graphics (SVG)

A VML document

The Resource Description Framework (RDF)

Meta-data
Dublin Core
Better Web searching

An Example of RDF

<rdf:RDF 
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/DC/>
  <rdf:Description about="http://www.ibiblio.org/xml/>
    <dc:CREATOR>Elliotte Rusty Harold</dc:CREATOR>
    <dc:TITLE>Cafe con Leche</dc:TITLE>
  </rdf:Description>
</rdf:RDF>

File Formats, in-house applications, and other behind the scenes uses

Microsoft Office 2000
Netscape What's Related

XML for XML

XSL: The Extensible Stylesheet Language
XML Schema
XLinks: The Extensible Linking Language
XInclude

XSL: The Extensible Stylesheet Language

XSLT: XSL Transformations
XSL-FO XSL Formatting Objects

An XML document

<?xml version="1.0"?>
<?xml-stylesheet type="text/xml" href="17-2.xsl"?>
<PERIODIC_TABLE>

  <ATOM STATE="GAS">
    <NAME>Hydrogen</NAME>
    <SYMBOL>H</SYMBOL>
    <ATOMIC_NUMBER>1</ATOMIC_NUMBER>
    <ATOMIC_WEIGHT>1.00794</ATOMIC_WEIGHT>
    <BOILING_POINT UNITS="Kelvin">20.28</BOILING_POINT>
    <MELTING_POINT UNITS="Kelvin">13.81</MELTING_POINT>
    <DENSITY UNITS="grams/cubic centimeter">
      <!-- At 300K, 1 atm -->
      0.0000899
    </DENSITY>
  </ATOM>

  <ATOM STATE="GAS">
    <NAME>Helium</NAME>
    <SYMBOL>He</SYMBOL>
    <ATOMIC_NUMBER>2</ATOMIC_NUMBER>
    <ATOMIC_WEIGHT>4.0026</ATOMIC_WEIGHT>
    <BOILING_POINT UNITS="Kelvin">4.216</BOILING_POINT>
    <MELTING_POINT UNITS="Kelvin">0.95</MELTING_POINT>
    <DENSITY UNITS="grams/cubic centimeter"><!-- At 300K -->
      0.0001785
    </DENSITY>
  </ATOM>

</PERIODIC_TABLE>

An XSLT style sheet that converts to XSL-FO

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <xsl:output indent="yes"/>

  <xsl:template match="/">
    <fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">

      <fo:layout-master-set>
        <fo:simple-page-master master-name="only">
          <fo:region-body/>
        </fo:simple-page-master>
      </fo:layout-master-set>

      <fo:page-sequence master-name="only">

        <fo:flow flow-name="xsl-region-body">
          <xsl:apply-templates select="//ATOM"/>
        </fo:flow>

      </fo:page-sequence>

    </fo:root>
  </xsl:template>

  <xsl:template match="ATOM">
    <fo:block font-size="20pt" font-family="serif"
              line-height="30pt">
      <xsl:value-of select="NAME"/>
    </fo:block>
  </xsl:template>

</xsl:stylesheet>

The XSL-FO Output

<?xml version="1.0"?>
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <fo:layout-master-set>
    <fo:simple-page-master master-name="only">
      <fo:region-body/>
    </fo:simple-page-master>
  </fo:layout-master-set>

  <fo:page-sequence master-name="only">

    <fo:flow flow-name="xsl-region-body">
      <fo:block font-size="20pt" font-family="serif"
                line-height="30pt">
        Hydrogen
      </fo:block>
      <fo:block font-size="20pt" font-family="serif"
                line-height="30pt" >
        Helium
      </fo:block>
    </fo:flow>

  </fo:page-sequence>

</fo:root>

The PDF Result

W3C XML Schemas

An XML syntax to replace DTDs
Data typing of element and attribute content

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/1999/XMLSchema">
 
  <xsd:element name="SONG" type="SongType"/>

  <xsd:complexType name="SongType">
    <xsd:sequence>
      <xsd:element name="TITLE" type="xsd:string" 
                   minOccurs="1" maxOccurs="1"/>
      <xsd:element name="COMPOSER"  type="xsd:string" 
                   minOccurs="1" maxOccurs="unbounded"/>
      <xsd:element name="PRODUCER"  type="xsd:string" 
                   minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="PUBLISHER" type="xsd:string" 
                   minOccurs="0" maxOccurs="1"/>
    
      <xsd:element name="LENGTH" type="xsd:timeDuration" 
                   minOccurs="0" maxOccurs="1"/>
      <xsd:element name="YEAR"   type="xsd:year" 
                   minOccurs="1" maxOccurs="1"/>
  
      <xsd:element name="ARTIST" type="xsd:string" 
                   minOccurs="1" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>

</xsd:schema>

Not finished!

XML Hypertext

Linking in XML is divided into multiple parts:

A Uniform Resource Identifier (URI) names or locates a resource
An XLink defines connections between two or more documents identified by URIs
XPath identifies particular nodes within a document
An XPointer adds an XPath to a URI
XBase defines the URI against which relative URIs are resolved
XInclude embeds a document identified by a URI inside an XML document.

XML Hypertext Example

<?xml version="1.0"?>
<story date="January 9, 2001"
       xmlns:xlink="http://www.w3.org/1999/xlink"
       xmlns:xinclude="http://www.w3.org/1999/XML/xinclude"
       xml:base="http://www.cafeaulait.org/">

  <p>
    The W3C XML Linking Working Group has pushed the 
    <link xlink:href="http://www.w3.org/TR/2001/WD-xptr-20010108">
      XPointer specification
    </link> 
    back to working draft status. The specific issue that was 
    uncovered during Candidate Recommendation was some 
    <link xlink:type="simple"
      xlink:href="http://www.w3.org/TR/xptr#xpointer(//div[@class='div3'][7])">
      confusion
    </link> 
    over how to integrate XPointers, particularly those in non-XML documents, 
    with namespaces. 
   </p>

   <p>
     It's also come to light in this draft that Sun has 
     <link xlink:type="simple"
      xlink:href=
      "http://lists.w3.org/Archives/Public/www-xml-linking-comments/2000OctDec/0092.html"
      >
      claimed a patent</link> on some of the technologies needed to 
      implement XPointer. I think this is particularly offensive because Eve 
      L. Maler, a Sun employee, served as co-chair of the XML Linking 
      Working Group and a co-editor of the XPointer specification. As usual 
      Sun wants to use this as a club to lock implementers and users into a 
      licensing agreement that goes beyond what Sun and the W3C could 
      otherwise demand. The specific patent is <cite>United States Patent 
      No. 5,659,729, Method and system for implementing hypertext scroll 
      attributes</cite>, issued to Jakob Nielsen in 1997. The patent was 
      filed on February 1, 1996. It claims:
  </p>
  <blockquote>
    <xinclude:include 
      href=
      "http://www.delphion.com/details?&pn=US05659729__#xpointer(//abstract)"
      >
    </xinclude:include>
  </blockquote>
  
</story>

XLinks: The Extensible Linking Language

Any element can be a link
Links can be bi-directional
Links can even be multi-directional
Links can be separated from the documents they connect

<footnote xlink:type="simple" xlink:href="footnote7.xml">7</footnote>

Extended Links

Simple links are very similar to HTML links, one-directional, one-element-to-one-document links
Extended links are multi-directional, many-to-many links
An extended link is a list of nodes and a list of the connections between them

Extended Link Example

<?xml version="1.0"?>
<WEBSITE xmlns:xlink="http://www.w3.org/1999/xlink" 
         xlink:type="extended" xlink:title="Cafe au Lait">
         
  <NAME xlink:type="resource" xlink:label="source">
    Cafe au Lait
  </NAME>

  <HOMESITE xlink:type="locator" 
            xlink:href="http://ibiblio.org/javafaq/"
            xlink:label="us"/>
  
  <MIRROR xlink:type="locator" 
          xlink:title="Cafe au Lait Swedish Mirror"
          xlink:label="se"
          xlink:href="http://sunsite.kth.se/javafaq"/>
  
  <MIRROR xlink:type="locator" 
          xlink:title="Cafe au Lait German Mirror"
          xlink:label="de"
          xlink:href="http://sunsite.informatik.rwth-aachen.de/javafaq/"/>
  
  <MIRROR xlink:type="locator" 
          xlink:title="Cafe au Lait Swiss Mirror"
          xlink:label="ch"
          xlink:href="http://sunsite.cnlab-switch.ch/javafaq/"/>
  
  <CONNECTION xlink:type="arc" xlink:from="source" 
              xlink:to="ch"    xlink:show="replace" 
              xlink:actuate="onRequest"/>
  <CONNECTION xlink:type="arc" xlink:from="source" 
              xlink:to="us"    xlink:show="replace" 
              xlink:actuate="onRequest"/>
  <CONNECTION xlink:type="arc" xlink:from="source" 
              xlink:to="se"    xlink:show="replace" 
              xlink:actuate="onRequest"/>
  <CONNECTION xlink:type="arc" xlink:from="source" 
              xlink:to="sk"    xlink:show="replace" 
              xlink:actuate="onRequest"/>
  
</WEBSITE>

Diagram of an Extended Link

XInclude

A means of merging multiple XML document or parts thereof
Not yet finished

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE book SYSTEM "book.dtd" >
<book xmlns:xlink="http://www.w3.org/1999/xlink"
 xmlns:xinclude="http://www.w3.org/1999/XML/xinclude">
  <title>The Java Developer's Resource</title>
    <last_modified>December 3, 2000</last_modified>
    
<xinclude:include href="getting_started.xml"/>
<xinclude:include href="procedural_java.xml"/>
</book>

Non-XML for XML

XPath, the XML Path Language
XPointers
And MathML and XSL-FO are intended as an output format only. Other languages will be written and then transformed into these formats.
DTDs are only technically XML

XPath

A syntax for addressing into an XML document
Used in XPointer and XSLT
Basis for XML Query Language

descendant::language[position()=2]
/child::spec/child::body/child::*/child::language[2]
/spec/body/*/language[2]

XPointers

A syntax for addressing into an XML document
Extend XPath to support non-well-formed points and ranges
Used by XLink and XInclude

xpointer(id("ebnf"))
xpointer(descendant::language[position()=2])
ebnf
xpointer(/child::spec/child::body/child::*/child::language[2])
xpointer(/spec/body/*/language[2])
/1/14/2
xpointer(id("ebnf"))xpointer(id("EBNF"))

XPointers and URIs

XPointers are normally attached to the end of URIs as fragment identifiers
This is how they're used by XLink and XInclude

http://www.w3.org/TR/1998/REC-xml-19980210.xml#xpointer(id("ebnf")) http://www.w3.org/TR/1998/REC-xml-19980210.xml#xpointer(descendant::language[position()=2]) http://www.w3.org/TR/1998/REC-xml-19980210.xml#ebnf http://www.w3.org/TR/1998/REC-xml-19980210.xml#xpointer(/child::spec/child::body/child::*/child::language[2]) http://www.w3.org/TR/1998/REC-xml-19980210.xml#xpointer(/spec/body/*/language[2]) http://www.w3.org/TR/1998/REC-xml-19980210.xml#/1/14/2 http://www.w3.org/TR/1998/REC-xml-19980210.xml#xpointer(id("ebnf"))xpointer(id("EBNF"))

Programming with XML

Java works best
C, Perl, Python etc. can also be used
Unicode support is the biggest issue

Several APIs to choose from

SAX
DOM
JDOM
Parser specific APIs

SAX

Public domain, developed on xml-dev mailing list
Maintained by David Megginson
org.xml.sax package
Parser independent; programs can plug in different parsers
http://www.megginson.com/SAX/
Event based; the parser pushes data to your handler
Read-only
SAX omits DTD declarations

SAX2

Adds:
- Namespace support
- Optional Validation
- Optional Lexical events for comments, CDATA sections, entity references
A lot more configurable
Deprecates a lot of SAX1
Adapter classes convert between SAX2 and SAX1 parsers.

The SAX Process

Build a parser-specific implementation of the XMLReader interface using XMLReaderFactory
Your code registers a ContentHandler with the parser
An InputSource feeds the document into the parser
As the document is read, the parser calls back to the methods of the methods of the ContentHandler to tell it what it's seeing in the document.

Parsing a Document with XMLReader

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.*;


public class SAX2Checker {

  public static void main(String[] args) {
    
    if (args.length == 0) {
      System.out.println("Usage: java SAX2Checker URL1 URL2..."); 
    } 
    
    // set up the parser 
    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    } 
    catch (SAXException e) {
      try {
        parser = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
      }
      catch (SAXException e2) {
        System.err.println("Error: could not locate a parser.");
        return;
      }
    }
     
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      // command line should offer URIs or file names
      try {
        parser.parse(args[i]);
        // If there are no well-formedness errors
        // then no exception is thrown
        System.out.println(args[i] + " is well formed.");
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber() 
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not check " + args[i] 
         + " because of the IOException " + e);
      }
      
    }  
  
  }

}

The ContentHandler interface

package org.xml.sax;


public interface ContentHandler {

    public void setDocumentLocator(Locator locator);
    
    public void startDocument() throws SAXException;
    
    public void endDocument()	throws SAXException;
    
    public void startPrefixMapping(String prefix, String uri) 
     throws SAXException;

    public void endPrefixMapping(String prefix) throws SAXException;

    public void startElement(String namespaceURI, String localName,
		 String rawName, Attributes atts) throws SAXException;

    public void endElement(String namespaceURI, String localName,
     String rawName) throws SAXException;

    public void characters(char[] ch, int start, int length) 
     throws SAXException;

    public void ignorableWhitespace(char ch[], int start, int length)
     throws SAXException;

    public void processingInstruction(String target, String data)
     throws SAXException;

    public void skippedEntity(String name) throws SAXException;
     
}

SAX Example

import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.IOException;
import java.util.StringTokenizer;


public class SAXWordCount implements ContentHandler {

  private int numWords;
    
  public void startDocument() throws SAXException {
    this.numWords = 0; 
  }

  public void endDocument() throws SAXException {
    System.out.println(numWords + " words");
    System.out.flush();
  }
  
  private StringBuffer sb = new StringBuffer();
  
  public void characters(char[] text, int start, int length) 
   throws SAXException {
    
    sb.append(text, start, length);
    
  }
  
  private void flush() {
    numWords += countWords(sb.toString());
    sb = new StringBuffer();    
  }
  
  // methods that signify a word break
  public void startElement(String namespaceURI, String localName,
	 String rawName, Attributes atts) throws SAXException {
    this.flush(); 
  }
  
  public void endElement(String namespaceURI, String localName,
	 String rawName) throws SAXException {
    this.flush(); 
  }
  
  public void processingInstruction(String target, String data)
   throws SAXException {
    this.flush(); 
  }

  // methods that aren't necessary in this example
  public void startPrefixMapping(String prefix, String uri) 
   throws SAXException {
    // ignore; 
  }

  public void ignorableWhitespace(char[] text, int start, int length)
   throws SAXException {
    // ignore; 
  }
  
  public void endPrefixMapping(String prefix) throws SAXException {
    // ignore; 
  }

  public void skippedEntity(String name) throws SAXException {
    // ignore; 
  }   
  
  public void setDocumentLocator(Locator locator) {}

  private static int countWords(String s) {
    
    if (s == null) return 0;
    s = s.trim();
    if (s.length() == 0) return 0;
    
    StringTokenizer st = new StringTokenizer(s);
    return st.countTokens();
    
  } 

  public static void main(String[] args) {
     
    SAXParser parser = new SAXParser();
    SAXWordCount counter = new SAXWordCount();
    parser.setContentHandler(counter);
    
    for (int i = 0; i < args.length; i++) {
      try {
        parser.parse(args[i]); 
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

}

% java SAXWordCount hotcop.xml
16 words

Event Based API Caveats

You do not always have all the information you need at the time of a given callback
You may need to store information in various data structures (stacks, queues,vectors, arrays, etc.) and act on it at a later point
For example, the characters() method is not guaranteed to give you the maximum number of contiguous characters. It may split a single run of characters over multiple method calls.

Document Object Model

Defines how XML and HTML documents are represented as objects in programs
W3C Standard
Defined in IDL; thus language independent
HTML as well as XML
Writing as well as reading
More complete than SAX or JDOM; covers everything except internal and external DTD subsets
DOM focuses more on the document; SAX focuses more on the parser.

The Design of the DOM API

Parser independent interfaces; parser dependent implementation classes. Most programs must use the parser dependent classes. JAXP helps solve this, but so far only for DOM Level 1.
Everything's a Node:
- Extensive use of polymorphism
- Lots of casting
Language independence means there's very limited use of the Java class library; Various features are reinvented
Language independence requires no method overloading because not all languages support it.
Several features are poor design in Java, if not in other languages:
- Named constants are often shorts
- Only one kind of exception; details provided by constants
- No Java-specific utility methods like equals(), hashCode(), clone(), or toString()

DOM Evolution

DOM Level 0
DOM Level 1, a W3C Standard
DOM Level 2, a W3C Standard
DOM Level 3, several W3C Working Drafts

Eight Modules:

Eight Modules:
- Core: org.w3c.dom ^*
- HTML: org.w3c.dom.html
- Views: org.w3c.dom.views
- StyleSheets: org.w3c.dom.stylesheets
- CSS: org.w3c.dom.css
- Events: org.w3c.dom.events ^*
- Traversal: org.w3c.dom.traversal ^*
- Range: org.w3c.dom.range
Only the core and traversal modules really apply to XML. The other six are for HTML.
^* indicates Xerces support

DOM Trees

Each XML document should is a tree.
A tree contains nodes.
Some nodes may contain other nodes (depending on node type).
Each document node contains:
- zero or one doctype nodes
- one root element node
- zero or more comment and processing instruction nodes

org.w3c.dom

17 interfaces:
- Attr
- CDATASection
- CharacterData
- Comment
- Document
- DocumentFragment
- DocumentType
- DOMImplementation
- Element
- Entity
- EntityReference
- NamedNodeMap
- Node
- NodeList
- Notation
- ProcessingInstruction
- Text
plus one exception: DOMException
Plus a bunch of HTML stuff in org.w3c.dom.html and other packages

The DOM Process

Library specific code creates a parser
The parser parses the document and returns an org.w3c.dom.Document object.
The entire document is stored in memory.
DOM methods and interfaces are used to extract data from this object

Parsing documents with a DOM Parser Example

import org.apache.xerces.parsers.DOMParser;
import org.xml.sax.SAXException;
import java.io.IOException;
import org.w3c.dom.*;


public class DOMChecker {

  public static void main(String[] args) {
     
    // This is simpler but less flexible than the SAX approach.
    // Perhaps a good creational design pattern is needed here?   
  
    DOMParser parser = new DOMParser();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document d = parser.getDocument();
        // work with the document...
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  }

}

DOM Example

import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.IOException;
import java.util.StringTokenizer;


public class DOMWordCount {

  public static void main(String[] args) {
     
    DOMParser parser = new DOMParser();
    DOMWordCount counter = new DOMWordCount();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document d = parser.getDocument();
        int numWords = countWordsInNode(d);
        System.out.println(numWords + " words");

      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

  // note use of recursion
  public static int countWordsInNode(Node node) {
    
    int numWords = 0;
    
    if (node.hasChildNodes()) {
      NodeList children = node.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        numWords += countWordsInNode(children.item(i));
      } 
    }  

    int type = node.getNodeType();
    if (type == Node.TEXT_NODE) {
      String s = node.getNodeValue();
      numWords += countWordsInString(s);
    }
    
    return numWords;  
    
  }
  
  private static int countWordsInString(String s) {
    
    if (s == null) return 0;
    s = s.trim();
    if (s.length() == 0) return 0;
    
    StringTokenizer st = new StringTokenizer(s);
    return st.countTokens();
    
  } 

}

% java DOMWordCount hotcop.xml
16 words

JDOM

More Java like tree-based API
Parser independent classes sit on top of parsers and other APIs

The JDOM Process

Construct an org.jdom.input.SAXBuilder or an org.jdom.input.DOMBuilder
Invoke the builder's build() method to build a Document object from a
- Reader
- InputStream
- URL
- File
- SYSTEM ID String
If there's a problem building the document, a JDOMException is thrown
Work with the resulting Document object

Parsing a Document with JDOM

import org.jdom.JDOMException;
import org.jdom.input.SAXBuilder;


public class JDOMChecker {

  public static void main(String[] args) {
  
    if (args.length == 0) {
      System.out.println("Usage: java JDOMChecker URL1 URL2..."); 
    } 
      
    SAXBuilder builder = new SAXBuilder();
     
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      // command line should offer URIs or file names
      try {
        builder.build(args[i]);
        // If there are no well-formedness errors, 
        // then no exception is thrown
        System.out.println(args[i] + " is well formed.");
      }
      catch (JDOMException e) { // indicates a well-formedness or other error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage());
      }
      
    }   
  
  }

}

Parser Results

% java JDOMChecker shortlogs.xml HelloJDOM.java
shortlogs.xml is well formed.
HelloJDOM.java is not well formed.
The markup in the document preceding the root element must be well-formed.: 
Error on line 1 of XML document: The markup in the document preceding the 
root element must be well-formed.

JDOM Example

import org.jdom.*;
import org.jdom.input.SAXBuilder;
import java.util.*;


public class JDOMWordCount {

  public static void main(String[] args) {
  
    if (args.length == 0) {
      System.out.println("Usage: java JDOMWordCount URL1 URL2..."); 
    } 
      
    SAXBuilder builder = new SAXBuilder();
     
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      // command line should offer URIs or file names
      try {
        Document doc = builder.build(args[i]);
        Element root = doc.getRootElement();
        int numWords = countWordsInElement(root);
        System.out.println(numWords + " words");

      }
      catch (JDOMException e) { // indicates a well-formedness or other error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage());
      }
      
    }   
  
  }

  public static int countWordsInElement(Element element) {
    
    int numWords = 0;
    
    List children = element.getMixedContent();
    Iterator iterator = children.iterator();
    while (iterator.hasNext()) {
      Object o = iterator.next();
      if (o instanceof String) {
        numWords += countWordsInString((String) o);
      } 
      else if (o instanceof Element) {
        // note use of recursion
        numWords += countWordsInElement((Element) o); 
      } 
    }
    
    return numWords;  
    
  }

  private static int countWordsInString(String s) {
    
    if (s == null) return 0;
    s = s.trim();
    if (s.length() == 0) return 0;
    
    StringTokenizer st = new StringTokenizer(s);
    return st.countTokens();
    
  }

}

% java JDOMWordCount hotcop.xml
16 words

XML and Databases

XML is not a database, or a replacement for one.
Out of the box, XML doesn't know anything SQL; SQL doesn't know anything aobut XML.
But XML works very well with databases
You generally have to provide the integration code yourself, but this is not hard.
There are some third-party products and database add-ons that will do a lot of the work for you.

Integrating XML with Databases

Two approaches:

Store the document as a field
Store pieces of the document as fields

Middleware

There are a lot of products that will pull data out of a database and format it as XML:
- Stonebroom's ASP2XML
- Transparency's Beanstalk
- IBM's DatabaseDom
- infoteria's iConnector
- IBM's DB2 XML Extender
- FileMaker Web Companion
- Oracle XML SQL Utility for Java
- etc.
Or roll your own:
- Java + JDBC
- PHP
- ASP
- Visual Basic
- etc.

Database Exchange and Integration

XML is a very convenient format to move data between systems
It's all text; very platform independent.
Field and record boundaries are very clear
Schemas and DTDs can prevent you from accepting bad data

To Learn More

This presentation: http://www.ibiblio.org/xml/slides/sd2001west/fundamentals/
XML in a Nutshell
- Elliotte Rusty Harold and W. Scott Means
- O'Reilly & Associates, 2001
- ISBN 0-596-00058-8
- XPath: http://www.oreilly.com/catalog/xmlnut/chapter/ch09.html
XML Bible, second edition
- Elliotte Rusty Harold
- Hungry Minds, 2001
- ISBN 0-7645-4760-7
- XLinks: http://www.ibiblio.org/xml/books/bible/updates/16.html
- XPointers: http://www.ibiblio.org/xml/books/bible/updates/17.html

Index | Cafe con Leche