XML Fundamentals


XML Fundamentals

Elliotte Rusty Harold

OOP 2003

January 23, 2003

elharo@metalab.unc.edu

http://www.cafeconleche.org/


Outline


Part I: XML Overview

XML succeeded, and in ways that weren't expected - at least not by many. Originally it was conceived as a document-oriented technology for robust quality publishing of documents over networks. the original workplan had three pillars - XML syntax, XML link, and XML stylesheets. Schemas were not high on the agenda and XML was not seen as an infrastructure for middleware or glueware. It was expected that at some stage it would be necessary to manage data but there was little activity in this area in 1997. When developing Chemical Markup Language (which must be one of the first published XML applications), I found the lack of datatypes very frustrating!

Well, XML is now a basic infrastructure of much modern information. I doubt that anyone now designs a protocol, or operating system without including XML. Although this list sometimes complains that XML isn't as clean as we would like, it works, and it works pretty well.

--Peter Murray-Rust on the xml-dev mailing list, Thursday, February 7, 2002


What is XML?


XML is a Meta Markup Language


XML describes structure and semantics, not formatting


A Song Description in HTML

<dt>Hot Cop
<dd> by Jacques Morali, Henri Belolo, and Victor Willis
<ul>
<li>Jacques Morali
<li>PolyGram Records
<li>6:20
<li>1978
<li>Village People
</ul>
View Document in Browser

A Song Description in XML

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<SONG>
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>
View Document in Browser

The XML Declaration


Elements


Cascading Style Sheets


CSS Stylesheet for Songs

SONG     {display: block}
TITLE    {display: block; font-family: Helvetica, sans-serif;
          font-size: 20pt; font-weight: bold}
COMPOSER {display: block; 
          font-family: Times, Times New Roman, serif;
          font-size: 14pt;
          font-style: italic}
ARTIST   {display: block; 
          font-family: Times, Times New Roman, serif;
          font-size: 14pt; font-weight: bold;
          font-style: italic}
PUBLISHER {display: block; 
           font-family: Times, Times New Roman, serif;
           font-size: 14pt}
LENGTH    {display: block; 
           font-family: Times, Times New Roman, serif;
           font-size: 14pt}
YEAR      {display: block; 
           font-family: Times, Times New Roman, serif;
           font-size: 14pt}

Attaching style sheets to documents

<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="song1.css"?>
<SONG>
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>

View Document in Browser

song.xsl

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="SONG">
    <html>
      <body>
       <h1>
        <xsl:value-of select="TITLE"/> 
        by the 
        <xsl:value-of select="ARTIST"/> 
       </h1>
       <ul>
         <xsl:apply-templates select="COMPOSER"/>
         <li>Publisher: <xsl:value-of select="PUBLISHER"/></li>
         <li>Year: <xsl:value-of select="YEAR"/></li>
         <li>Producer: <xsl:value-of select="PRODUCER"/></li>
       </ul>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="COMPOSER">
    <li>Composer: <xsl:value-of select="."/></li>
  </xsl:template>

</xsl:stylesheet>

Applying an XSLT Style Sheet


Output

<html>
   <body>
      <h1>Hot Cop 
         by the 
         Village People
      </h1>
      <ul>
         <li>Composer: Jacques Morali</li>
         <li>Composer: Henri Belolo</li>
         <li>Composer: Victor Willis</li>
         <li>Publisher: PolyGram Records</li>
         <li>Year: 1978</li>
         <li>Producer: Jacques Morali</li>
      </ul>
   </body>
</html>
View in browser

Editing and Saving XML Documents


Well-formedness

Rules:


Validity

To be valid an XML document must be

  1. Well-formed

  2. Must have a Document Type Definition (DTD)

  3. Must comply with the constraints specified in the DTD


A DTD for Songs

<!ELEMENT SONG (TITLE, COMPOSER+, PRODUCER*, PUBLISHER*, 
                 LENGTH?, YEAR?, ARTIST+)>

<!ELEMENT TITLE (#PCDATA)>

<!ELEMENT COMPOSER (#PCDATA)>
<!ELEMENT PRODUCER (#PCDATA)>
<!ELEMENT PUBLISHER (#PCDATA)>
<!ELEMENT LENGTH (#PCDATA)>
<!-- This should be a four digit year like "1999",
     not a two-digit year like "99" -->
<!ELEMENT YEAR (#PCDATA)>

<!ELEMENT ARTIST (#PCDATA)>

A Valid Song Document

<?xml version="1.0"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<SONG>
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <PUBLISHER>PolyGram Records</PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>

Checking Validity

To check validity you pass the document through a validating parser which should report any errors it finds. For example,

% java dom.Counter -v invalidhotcop.xml
[Error] invalidhotcop.xml:10:8: The content of element type "SONG" must match 
"(TITLE,COMPOSER+,PRODUCER*,PUBLISHER*,LENGTH?,YEAR?,ARTIST+)".
invalidhotcop.xml: 862;70;0 ms (7 elems, 0 attrs, 19 spaces, 59 chars)

A valid document:

% java dom.Counter -v validhotcop.xml
validhotcop.xml: 671;70;0 ms (10 elems, 0 attrs, 28 spaces, 98 chars)

A More Complex Example

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/css" href="song.css"?>
<!DOCTYPE SONG SYSTEM "expanded_song.dtd">
<SONG xmlns="http://www.cafeconleche.org/namespace/song"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <TITLE>Hot Cop</TITLE>
  <PHOTO 
    xlink:type="simple" xlink:show="onLoad" xlink:href="hotcop.jpg"
    ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200"/>
  <COMPOSER>Jacques Morali</COMPOSER>
  <COMPOSER>Henri Belolo</COMPOSER>
  <COMPOSER>Victor Willis</COMPOSER>
  <PRODUCER>Jacques Morali</PRODUCER>
  <!-- The publisher is actually Polygram but I needed 
       an example of a general entity reference. -->
  <PUBLISHER xlink:type="simple" xlink:href="http://www.amrecords.com/">
    A &amp; M Records
  </PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Village People</ARTIST>
</SONG>
<!-- You can tell what album I was 
     listening to when I wrote this example -->

Attributes

<PHOTO xlink:type="simple" xlink:show="onLoad" xlink:href="hotcop.jpg" ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200" />


Empty-element Tags

<PHOTO xlink:type="simple" xlink:show="onLoad" xlink:href="hotcop.jpg" ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200" />


Comments

<!-- You can tell what album I was listening to when I wrote this example -->


Namespaces

<SONG xmlns="http://www.cafeconleche.org/namespace/song"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <TITLE>Hot Cop</TITLE>
  <PHOTO 
    xlink:type="simple" xlink:show="onLoad" xlink:href="hotcop.jpg"
    ALT="Victor Willis in Cop Outfit" WIDTH="100" HEIGHT="200"/>
  <COMPOSER>Jacques Morali</COMPOSER>
  <PUBLISHER xlink:type="simple" xlink:href="http://www.amrecords.com/">
    A &amp; M Records
  </PUBLISHER>
  <ARTIST>Village People</ARTIST>
</SONG>

Entity References

A & M Records


A More Complex DTD

<!ELEMENT SONG (TITLE, PHOTO?, COMPOSER+, PRODUCER*, 
 PUBLISHER*, LENGTH?, YEAR?, ARTIST+)>
<!ATTLIST SONG xmlns       CDATA #REQUIRED
               xmlns:xlink CDATA #REQUIRED>
<!ELEMENT TITLE (#PCDATA)>

<!ELEMENT PHOTO EMPTY>
<!ATTLIST PHOTO xlink:type CDATA #FIXED "simple"
                xlink:href CDATA #REQUIRED
                xlink:show CDATA #IMPLIED
                ALT        CDATA #REQUIRED
                WIDTH      CDATA #REQUIRED
                HEIGHT     CDATA #REQUIRED
>

<!ELEMENT COMPOSER  (#PCDATA)>
<!ELEMENT PRODUCER  (#PCDATA)>
<!ELEMENT PUBLISHER (#PCDATA)>
<!ATTLIST PUBLISHER xlink:type CDATA #IMPLIED
                    xlink:href CDATA #IMPLIED
>

<!ELEMENT LENGTH (#PCDATA)>
<!-- This should be a four digit year like "1999",
     not a two-digit year like "99" -->
<!ELEMENT YEAR (#PCDATA)>

<!ELEMENT ARTIST (#PCDATA)>

Part II: What is XML Good For?


What is XML used for?


Domain-Specific Markup Languages


Self-Describing Data


An XML Fragment

<PERSON ID="p1100" SEX="M">
  <NAME>
    <GIVEN>Judson</GIVEN>
    <SURNAME>McDaniel</SURNAME>
  </NAME>
  <BIRTH>
    <DATE>21 Feb 1834</DATE>
  </BIRTH>
  <DEATH>
    <DATE>9 Dec 1905</DATE>
  </DEATH>
</PERSON>

Interchange of Data Among Applications


Can assemble data from multiple sources


XML Applications


Example XML Applications


Mathematical Markup Language

<?xml version="1.0"?>
<html>
<head>
<title>Fiat Lux</title>
<meta name="GENERATOR" content="amaya V1.3b" />
</head>
<body>

<p>
And God said,
</p>

<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mrow>
    <msub>
      <mi>&delta;</mi>
      <mi>&alpha;</mi>
    </msub>
    <msup>
      <mi>F</mi>
      <mi>&alpha;&beta;</mi>
    </msup>
    <mo>=</mo>
    <mfrac>
      <mrow>
        <mn>4</mn>
        <mi>&pi;</mi>
      </mrow>
      <mi>c</mi>
    </mfrac>
    <msup>
      <mi>J</mi>
      <mrow>
        <mi>&beta;</mi>
      </mrow>
    </msup>
  </mrow>
</math>

<p>
and there was light.
</p>
</body>
</html>

View in Browser

Channel Definition Format

<?xml version="1.0"?>
<CHANNEL HREF="http://www.cafeconleche.org/index.html">
  <TITLE>Cafe con Leche</TITLE>
  <ITEM HREF="http://www.cafeconleche.org/books.html">
    <TITLE>Books about XML</TITLE>
  </ITEM>
  <ITEM HREF="http://www.cafeconleche.org/tradeshows.html">
    <TITLE>Trade shows and conferences about XML</TITLE>
  </ITEM>
  <ITEM HREF="http://www.cafeconleche.org/lists.htm">
    <TITLE>Mailing Lists dedicated to XML</TITLE>
  </ITEM>
</CHANNEL>

Classic Literature


Vector Graphics

An SVG document

XML for XML


XSL: The Extensible Stylesheet Language


Schemas


W3C XML Schema Language Example

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
 
  <xsd:element name="SONG" type="SongType"/>

  <xsd:complexType name="SongType">
    <xsd:sequence>
      <xsd:element name="TITLE" type="xsd:string" 
                   minOccurs="1" maxOccurs="1"/>
      <xsd:element name="COMPOSER"  type="xsd:string" 
                   minOccurs="1" maxOccurs="unbounded"/>
      <xsd:element name="PRODUCER"  type="xsd:string" 
                   minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="PUBLISHER" type="xsd:string" 
                   minOccurs="0" maxOccurs="1"/>
    
      <xsd:element name="LENGTH" type="xsd:duration" 
                   minOccurs="0" maxOccurs="1"/>
      <xsd:element name="YEAR"   type="xsd:gYear" 
                   minOccurs="1" maxOccurs="1"/>
  
      <xsd:element name="ARTIST" type="xsd:string" 
                   minOccurs="1" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>

</xsd:schema>

XLinks

<footnote xlink:type="simple" xlink:href="footnote7.xml">7</footnote>

File Formats, in-house applications, and other behind the scenes uses


Part III: Processing XML


Several APIs to choose from


SAX


SAX2


The SAX Process


Parsing a Document with XMLReader

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.*;


public class SAX2Checker {

  public static void main(String[] args) {
    
    if (args.length == 0) {
      System.out.println("Usage: java SAX2Checker URL1 URL2..."); 
    } 
    
    // set up the parser 
    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    } 
    catch (SAXException e) {
      try {
        parser = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
      }
      catch (SAXException e2) {
        System.err.println("Error: could not locate a parser.");
        return;
      }
    }
     
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      // command line should offer URIs or file names
      try {
        parser.parse(args[i]);
        // If there are no well-formedness errors
        // then no exception is thrown
        System.out.println(args[i] + " is well formed.");
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber() 
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not check " + args[i] 
         + " because of the IOException " + e);
      }
      
    }  
  
  }

}

The ContentHandler interface

package org.xml.sax;


public interface ContentHandler {

    public void setDocumentLocator(Locator locator);
    
    public void startDocument() throws SAXException;
    
    public void endDocument() throws SAXException;
    
    public void startPrefixMapping(String prefix, String uri) 
     throws SAXException;

    public void endPrefixMapping(String prefix) throws SAXException;

    public void startElement(String namespaceURI, String localName,
	 String qualifiedName, Attributes atts) throws SAXException;

    public void endElement(String namespaceURI, String localName,
     String rawName) throws SAXException;

    public void characters(char[] text, int start, int length) 
     throws SAXException;

    public void ignorableWhitespace(char[] text, int start, int length)
     throws SAXException;

    public void processingInstruction(String target, String data)
     throws SAXException;

    public void skippedEntity(String name) throws SAXException;
     
}

SAX Example

import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.IOException;
import java.util.StringTokenizer;


public class SAXWordCount implements ContentHandler {

  private int numWords;
    
  public void startDocument() throws SAXException {
    this.numWords = 0; 
  }

  public void endDocument() throws SAXException {
    System.out.println(numWords + " words");
    System.out.flush();
  }
  
  private StringBuffer sb = new StringBuffer();
  
  public void characters(char[] text, int start, int length) 
   throws SAXException {
    
    sb.append(text, start, length);
    
  }
  
  private void flush() {
    numWords += countWords(sb.toString());
    sb = new StringBuffer();    
  }
  
  // methods that signify a word break
  public void startElement(String namespaceURI, String localName,
	 String rawName, Attributes atts) throws SAXException {
    this.flush(); 
  }
  
  public void endElement(String namespaceURI, String localName,
	 String rawName) throws SAXException {
    this.flush(); 
  }
  
  public void processingInstruction(String target, String data)
   throws SAXException {
    this.flush(); 
  }

  // methods that aren't necessary in this example
  public void startPrefixMapping(String prefix, String uri) 
   throws SAXException {
    // ignore; 
  }

  public void ignorableWhitespace(char[] text, int start, int length)
   throws SAXException {
    // ignore; 
  }
  
  public void endPrefixMapping(String prefix) throws SAXException {
    // ignore; 
  }

  public void skippedEntity(String name) throws SAXException {
    // ignore; 
  }   
  
  public void setDocumentLocator(Locator locator) {}

  private static int countWords(String s) {
    
    if (s == null) return 0;
    s = s.trim();
    if (s.length() == 0) return 0;
    
    StringTokenizer st = new StringTokenizer(s);
    return st.countTokens();
    
  } 

  public static void main(String[] args) {
     
    SAXParser parser = new SAXParser();
    SAXWordCount counter = new SAXWordCount();
    parser.setContentHandler(counter);
    
    for (int i = 0; i < args.length; i++) {
      try {
        parser.parse(args[i]); 
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

}
% java SAXWordCount hotcop.xml
16 words

Event Based API Caveats


Document Object Model


The Design of the DOM API


DOM Evolution


Eight Modules:


DOM Trees


org.w3c.dom


The DOM Process


Parsing documents with a DOM Parser Example

import org.apache.xerces.parsers.DOMParser;
import org.xml.sax.SAXException;
import java.io.IOException;
import org.w3c.dom.*;


public class DOMChecker {

  public static void main(String[] args) {
     
    // This is simpler but less flexible than the SAX approach.
    // Perhaps a good creational design pattern is needed here?   
  
    DOMParser parser = new DOMParser();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document d = parser.getDocument();
        // work with the document...
      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  }

}

DOM Example

import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.IOException;
import java.util.StringTokenizer;


public class DOMWordCount {

  public static void main(String[] args) {
     
    DOMParser parser = new DOMParser();
    DOMWordCount counter = new DOMWordCount();
    
    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]); 
       
        Document d = parser.getDocument();
        int numWords = countWordsInNode(d);
        System.out.println(numWords + " words");

      }
      catch (SAXException e) {
        System.err.println(e); 
      }
      catch (IOException e) {
        System.err.println(e); 
      }
      
    }
  
  } // end main

  // note use of recursion
  public static int countWordsInNode(Node node) {
    
    int numWords = 0;
    
    if (node.hasChildNodes()) {
      NodeList children = node.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        numWords += countWordsInNode(children.item(i));
      } 
    }  

    int type = node.getNodeType();
    if (type == Node.TEXT_NODE) {
      String s = node.getNodeValue();
      numWords += countWordsInString(s);
    }
    
    return numWords;  
    
  }
  
  private static int countWordsInString(String s) {
    
    if (s == null) return 0;
    s = s.trim();
    if (s.length() == 0) return 0;
    
    StringTokenizer st = new StringTokenizer(s);
    return st.countTokens();
    
  } 

}
% java DOMWordCount hotcop.xml
16 words

JDOM


The JDOM Process


Parsing a Document with JDOM

import org.jdom.JDOMException;
import org.jdom.input.SAXBuilder;
import java.io.IOException;


public class JDOMChecker {

  public static void main(String[] args) {
  
    if (args.length == 0) {
      System.out.println("Usage: java JDOMChecker URL1 URL2..."); 
    } 
      
    SAXBuilder builder = new SAXBuilder();
     
    // start parsing... 
    for (int i = 0; i < args.length; i++) {
      
      // command line should offer URIs or file names
      try {
        builder.build(args[i]);
        // If there are no well-formedness errors, 
        // then no exception is thrown
        System.out.println(args[i] + " is well formed.");
      }
      catch (JDOMException ex) { // indicates a well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(ex.getMessage());
      }
      catch (IOException ex) { // indicates an I/O error
        System.out.println("Could not read " + args[i] + " due to an I/O error.");
        System.out.println(ex.getMessage());
      }
      
    }   
  
  }

}

Parser Results

% java JDOMChecker shortlogs.xml HelloJDOM.java
shortlogs.xml is well formed.
HelloJDOM.java is not well formed.
The markup in the document preceding the root element must be well-formed.: 
Error on line 1 of XML document: The markup in the document preceding the 
root element must be well-formed.

JDOM Example

import org.jdom.*;
import org.jdom.input.SAXBuilder;
import java.util.*;
import java.io.IOException;


public class JDOMWordCount {

  public static void main(String[] args) {
  
    if (args.length == 0) {
      System.out.println("Usage: java JDOMWordCount URL"); 
    } 
      
    SAXBuilder builder = new SAXBuilder();
     
    // command line should offer URIs or file names
    try {
      Document doc = builder.build(args[0]);
      Element root = doc.getRootElement();
      int numWords = countWordsInElement(root);
      System.out.println(numWords + " words");
    
    }
    catch (JDOMException ex) { // indicates a well-formedness error
      System.out.println(args[0] + " is not well formed.");
      System.out.println(ex.getMessage());
    }
    catch (IOException ex) { // indicates an I/O error
      System.out.println("Could not read " + args[0] + " due to an I/O error.");
      System.out.println(ex.getMessage());
    } 
 
  }

  public static int countWordsInElement(Element element) {
    
    int numWords = 0;
    
    List children = element.getContent();
    Iterator iterator = children.iterator();
    while (iterator.hasNext()) {
      Object o = iterator.next();
      if (o instanceof Text) {
        numWords += countWordsInString((Text) o);
      } 
      else if (o instanceof Element) {
        // note use of recursion
        numWords += countWordsInElement((Element) o); 
      } 
    }
    
    return numWords;  
    
  }

  private static int countWordsInString(Text text) {
    
    String s = text.getText();
    if (s == null) return 0;
    s = s.trim();
    if (s.length() == 0) return 0;
    
    StringTokenizer st = new StringTokenizer(s);
    return st.countTokens();
    
  }

}
% java JDOMWordCount hotcop.xml
16 words

XML and Databases


Integrating XML with Databases


Middleware


Database Exchange and Integration


To Learn More


Index | Cafe con Leche

Copyright 2002 Elliotte Rusty Harold
elharo@metalab.unc.edu
Last Modified December 4, 2002