XInclude

Elliotte Rusty Harold

XMLOne San Jose 2001

Wednesday, October 3, 2001

elharo@metalab.unc.edu

http://www.ibiblio.org/xml/

HTML Hypertext is Limited

The Web conquered gopher for one reason: HTML made it possible to embed hypertext links in documents.
HTML linking has limits

You can only link to one document at a time
You must link to the entire document.
Once the link is traversed the trail of where you've been is lost.

Includes are server dependent and don't work across domains
Links break

XML Hypertext

Hypertext in XML is divided into multiple parts:

A Uniform Resource Identifier (URI) names or locates a resource
An XLink defines connections between two or more documents identified by URIs
XPath identifies particular nodes within a document
An XPointer adds an XPath to a URI
XBase defines the URI against which relative URIs are resolved
XInclude embeds a document identified by a URI inside an XML document.
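
As a rough illustration of the XBase bullet: a relative reference is resolved against whatever base URI is in scope. The sketch below uses java.net.URI from the standard library. The class name BaseUriSketch and the relative path course/week1.xml are made up for the example.

```java
import java.net.URI;

final class BaseUriSketch {

    // Resolve a possibly relative reference against the base URI in scope,
    // which is what xml:base establishes for an element and its descendants.
    static String resolve(String base, String reference) {
        return URI.create(base).resolve(reference).toString();
    }

    public static void main(String[] args) {
        // "course/week1.xml" is a hypothetical relative reference
        System.out.println(resolve("http://www.cafeaulait.org/", "course/week1.xml"));
        // an absolute reference ignores the base entirely
        System.out.println(resolve("http://www.cafeaulait.org/", "http://www.ibiblio.org/xml/"));
    }
}
```

An absolute reference passes through unchanged, so xml:base only matters for relative URIs.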

XML Hypertext Example

<?xml version="1.0"?>
<story date="January 9, 2001"
       xmlns:xlink="http://www.w3.org/1999/xlink"
       xmlns:xinclude="http://www.w3.org/2001/XInclude"
       xml:base="http://www.cafeaulait.org/">

  <p>
    The W3C XML Linking Working Group has pushed the 
    <cite xlink:type="simple"
      xlink:href="http://www.w3.org/TR/2001/WD-xptr-20010108">
      XPointer specification
    </cite> 
    back to working draft status. The specific issue that was 
    uncovered during Candidate Recommendation was some 
    <span xlink:type="simple"
      xlink:href="http://www.w3.org/TR/xptr#xpointer(//div[@class='div3'][7])">
      confusion
    </span> 
    over how to integrate XPointers, particularly those in non-XML documents, 
    with namespaces. 
   </p>

   <p>
     It's also come to light in this draft that Sun has 
     <span xlink:type="simple"
      xlink:href=
      "http://lists.w3.org/Archives/Public/www-xml-linking-comments/2000OctDec/0092.html"
      >
      claimed a patent</span> on some of the technologies needed to 
      implement XPointer. I think this is particularly offensive because Eve 
      L. Maler, a Sun employee, served as co-chair of the XML Linking 
      Working Group and a co-editor of the XPointer specification. As usual 
      Sun wants to use this as a club to lock implementers and users into a 
      licensing agreement that goes beyond what Sun and the W3C could 
      otherwise demand. The specific patent is <cite>United States Patent 
      No. 5,659,729, Method and system for implementing hypertext scroll 
      attributes</cite>, issued to Jakob Nielsen in 1997. The patent was 
      filed on February 1, 1996. It claims:
  </p>
  <blockquote>
    <xinclude:include 
      href=
      "http://www.delphion.com/details?&amp;pn=US05659729__#xpointer(//abstract)"
      >
    </xinclude:include>
  </blockquote>
  
</story>

Versions

This talk is based on:

XLinks: June 27, 2001 Recommendation
XPointers: September 11, 2001 2nd Candidate Recommendation
XPath: November 16, 1999 1.0 Specification
XML Base: June 27, 2001 Recommendation

Part I: XInclude

The problem is that we're not providing the tools. We're providing the specs. That's a whole different ball game. If tools existed for actually making really interesting use of RDF and XLink and XInclude then people would use them. If IE and/or Mozilla supported the full gamut of specs, from XSLT 1.0 to XLink and XInclude (OK, so they're not quite REC's, but with time...) then you would find people using them more.

--Matt Sergeant on the xml-dev mailing list

What is XInclude?

A means of including one XML document inside another, irrespective of validation.
W3C Last Call Working Draft, May 16, 2001
Based on the XML Infoset; a source infoset is transformed into a result infoset

Alternatives (and why they don't work)

xlink:show="embed" only graphically includes, like the IMG element in HTML. It does not merge infosets.
External parsed entities:
- Require a DTD
- Can only handle very limited documents; i.e. not all well-formed XML documents are well-formed external parsed entities. In particular, XML declarations can be a problem and document type declarations always are.
- Don't allow unparsed text to be inserted as CDATA
XSLT document() function
- Only handles XSLT
- No unparsed, pure-text includes
Server side includes:
- HTML only
- Server dependent
Custom code or XSLT extension functions

The include element

href attribute identifies the document (or part thereof) to be included
In the http://www.w3.org/2001/XInclude namespace.
The prefixes xinclude or xi are customary.

<book xmlns:xinclude="http://www.w3.org/2001/XInclude">
  <title>Processing XML with Java</title>
  <chapter><xinclude:include href="dom.xml"/></chapter>
  <chapter><xinclude:include href="sax.xml"/></chapter>
  <chapter><xinclude:include href="jdom.xml"/></chapter>
</book>
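
One way to try the book example without writing an include processor is the XInclude support built into JAXP, which postdates this talk. This is a minimal sketch, assuming a JAXP 1.3 or later parser (any recent JDK). The class name, the temporary file layout, and the section element inside dom.xml are invented for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

final class IncludeElementSketch {

    static final String XINCLUDE_NS = "http://www.w3.org/2001/XInclude";

    // Writes a one-chapter book to a temporary directory and parses it
    // with XInclude processing switched on.
    static Document parseWithIncludes() {
        try {
            Path dir = Files.createTempDirectory("xinclude-sketch");
            Files.write(dir.resolve("dom.xml"),
                "<section><title>DOM</title></section>"
                    .getBytes(StandardCharsets.UTF_8));
            Path book = dir.resolve("book.xml");
            Files.write(book, (
                "<book xmlns:xinclude='" + XINCLUDE_NS + "'>"
              + "<title>Processing XML with Java</title>"
              + "<chapter><xinclude:include href='dom.xml'/></chapter>"
              + "</book>").getBytes(StandardCharsets.UTF_8));

            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);  // include elements are found by namespace
            factory.setXIncludeAware(true);   // merge included infosets while parsing
            // Parsing from a File gives the parser a base URI for href="dom.xml"
            return factory.newDocumentBuilder().parse(book.toFile());
        }
        catch (IOException | ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not build the sample", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseWithIncludes();
        System.out.println("section elements: "
          + doc.getElementsByTagName("section").getLength());
        System.out.println("include elements: "
          + doc.getElementsByTagNameNS(XINCLUDE_NS, "include").getLength());
    }
}
```

After parsing, the chapter holds the section element from dom.xml and no include element remains in the tree.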

The parse attribute

parse="xml": The resource must be parsed as XML and the infosets merged. This is the default.
parse="text": The resource must be treated as pure text and inserted as a text node. When serialized, this means that characters like < will change to &lt; and so forth.

<slide xmlns:xinclude="http://www.w3.org/2001/XInclude">
  <title>The href attribute</title>
  
<ul>
  <li>Identifies the document to be included with a URI</li>
  <li>The document at the URI replaces the <code>include</code> 
      element in the including document</li>
  <li>The <code>xinclude</code> prefix is bound to the http://www.w3.org/2001/XInclude
  namespace URI. 
  </li>
</ul>  

<pre><code><xinclude:include parse="text" href="processing_xml_with_java.xml"/>
</code></pre>
        
  <description>
      A slide from Elliotte Rusty Harold's XInclude seminar at
      <host_ref/>, <date_ref/>
    </description>
  <last_modified>October 26, 2000</last_modified>
</slide>
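
To see the difference parse="text" makes, the same JAXP switch can be pointed at a text include. This is a sketch under the same assumptions as any JAXP 1.3 or later parser. ParseTextSketch, snippet.xml, and its greeting element are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

final class ParseTextSketch {

    // Builds a slide whose code element pulls in an XML file as plain text.
    static Document parseSlide() {
        try {
            Path dir = Files.createTempDirectory("xinclude-text");
            Files.write(dir.resolve("snippet.xml"),
                "<greeting>Hello</greeting>".getBytes(StandardCharsets.UTF_8));
            Path slide = dir.resolve("slide.xml");
            Files.write(slide, (
                "<slide xmlns:xinclude='http://www.w3.org/2001/XInclude'>"
              + "<code><xinclude:include parse='text' href='snippet.xml'/></code>"
              + "</slide>").getBytes(StandardCharsets.UTF_8));

            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setXIncludeAware(true);
            return factory.newDocumentBuilder().parse(slide.toFile());
        }
        catch (IOException | ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not build the sample", e);
        }
    }

    // The included markup arrives as character data, not as elements.
    static String includedText() {
        return parseSlide().getElementsByTagName("code").item(0).getTextContent();
    }

    public static void main(String[] args) {
        System.out.println(includedText());
    }
}
```

The tree contains no greeting element. The tags are ordinary characters in a text node, which is why a serializer has to write them back out as escaped text.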

The encoding attribute

Used when parse="text"
Value is the name of the text file's character encoding, as in the encoding declaration in the XML declaration
e.g. ISO-8859-1, UTF-8, UTF-16, MacRoman, etc.

<slide xmlns:xinclude="http://www.w3.org/2001/XInclude">
  <title>The href attribute</title>
  
<ul>
  <li>Identifies the document to be included with a URI</li>
  <li>The document at the URI replaces the <code>include</code> 
      element in the including document</li>
  <li>The <code>xinclude</code> prefix is bound to the http://www.w3.org/2001/XInclude
  namespace URI. 
  </li>
</ul>  

<pre><code><xinclude:include parse="text" encoding="ISO-8859-1" 
                  href="processing_xml_with_java.xml"/>
</code></pre>
        
  <description>
      A slide from Elliotte Rusty Harold's XInclude seminar at
      <host_ref/>, <date_ref/>
    </description>
  <last_modified>October 26, 2000</last_modified>
</slide>
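
Why the encoding matters: the same bytes decode to different characters under different encodings. A small sketch using only java.io. EncodingSketch and decode are hypothetical names, not part of any XInclude processor.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;

final class EncodingSketch {

    // Decode raw bytes using the named character encoding, the way a
    // processor would for a resource included with parse="text".
    static String decode(byte[] raw, String encoding) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(raw), encoding)) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[1024];
            int charsRead;
            while ((charsRead = reader.read(buffer, 0, buffer.length)) != -1) {
                text.append(buffer, 0, charsRead);
            }
            return text.toString();
        }
        catch (IOException e) {
            // also covers UnsupportedEncodingException for unknown names
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // the four bytes of "café" in ISO-8859-1
        byte[] raw = {0x63, 0x61, 0x66, (byte) 0xE9};
        System.out.println(decode(raw, "ISO-8859-1"));
        System.out.println(decode(raw, "UTF-8"));
    }
}
```

Decoded as ISO-8859-1 the last byte is e-acute. Decoded as UTF-8 it is malformed input, so the result no longer matches the original text.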

Implementation as a SAX filter

/*--

 Copyright 2001 Elliotte Rusty Harold.
 All rights reserved.

 I haven't yet decided on a license.
 It will be some form of open source.

 THIS SOFTWARE IS PROVIDED "AS IS" AND ANY EXPRESSED OR IMPLIED
 WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 DISCLAIMED.  IN NO EVENT SHALL ELLIOTTE RUSTY HAROLD OR ANY
 OTHER CONTRIBUTORS TO THIS PACKAGE
 BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 SUCH DAMAGE.

 */

package com.macfaq.xml;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.Locator;
import org.xml.sax.helpers.XMLReaderFactory;
import org.xml.sax.helpers.XMLFilterImpl;
import org.xml.sax.helpers.NamespaceSupport;

import java.net.URL;
import java.net.URLConnection;
import java.net.MalformedURLException;
import java.io.UnsupportedEncodingException;
import java.io.IOException;
import java.io.InputStream;
import java.io.BufferedInputStream;
import java.io.InputStreamReader;
import java.util.Stack;

/**
 * <p>
 *  This is a SAX filter which resolves all XInclude include elements
 *  before passing them on to the client application. Currently this
 *  class has the following known deviation from the XInclude specification:
 * </p>
 *  <ol>
 *   <li>XPointer is not supported.</li>
 *  </ol>
 *
 *  <p>
 *    Furthermore, I would definitely use a new instance of this class
 *    for each document you want to process. I doubt it can be used
 *    successfully on multiple documents. Furthermore, I can virtually
 *    guarantee that this class is not thread safe. You have been
 *    warned.
 *  </p>
 *
 *  <p>
 *    Since this class is not designed to be subclassed, and since
 *    I have not yet considered how that might affect the methods 
 *    herein or what other protected methods might be needed to support 
 *    subclasses, I have declared this class final. I may remove this 
 *    restriction later, though the use-case for subclassing is weak.
 *    This class is designed to have its functionality extended via a
 *    horizontal chain of filters, not a 
 *    vertical hierarchy of sub and superclasses.
 *  </p>
 *
 *  <p>
 *    To use this class: 
 *  </p>
 *  <ol>
 *   <li>Construct an <code>XIncludeFilter</code> object</li>
 *   <li>Pass the <code>XMLReader</code> object from which the raw document will 
 *       be read to the <code>setParent()</code> method of this object. </li>
 *   <li>Pass your own <code>ContentHandler</code> object to the 
 *       <code>setContentHandler()</code> method of this object. This is the 
 *       object which will receive events from the parsed and included
 *       document.
 *   </li>
 *   <li>Optional: if you wish to receive comments, set your own 
 *       <code>LexicalHandler</code> object as the value of this object's
 *       http://xml.org/sax/properties/lexical-handler property.
 *       Also make sure your <code>LexicalHandler</code> asks this object 
 *       for the status of each comment using <code>insideIncludeElement</code>
 *       before doing anything with the comment. 
 *   </li>
 *   <li>Pass the URL of the document to read to this object's 
 *       <code>parse()</code> method</li>
 *  </ol>
 * 
 *  <p> e.g.</p>
 *  <pre><code>XIncludeFilter includer = new XIncludeFilter(); 
 *  includer.setParent(parser);
 *  includer.setContentHandler(new SAXXIncluder(System.out));
 *  includer.parse(args[i]);</code>
 *  </pre>
 *
 * @author Elliotte Rusty Harold
 * @version 1.0d8
 */
public final class XIncludeFilter extends XMLFilterImpl {

    public final static String XINCLUDE_NAMESPACE
     = "http://www.w3.org/2001/XInclude";

    private Stack bases = new Stack();
    private Stack locators = new Stack();
    
    // what if this isn't called????
    // do I need to check this in startDocument() and push something
    // there????
    public void setDocumentLocator(Locator locator) {
        locators.push(locator);
        String base = locator.getSystemId();
        try {
             bases.push(new URL(base));
        }
        catch (MalformedURLException e) {
            throw new UnsupportedOperationException("Unrecognized SYSTEM ID: " + base);
        }
        super.setDocumentLocator(locator);
    }
    
    
    // necessary to throw away contents of non-empty XInclude elements
    private int level = 0;

  /**
    * <p>
    * This utility method returns true if and only if this reader is 
    * currently inside a non-empty include element. (This is <strong>
    * not</strong> the same as being inside the node set which replaces
    * the include element.) This is primarily needed for comments
    * inside include elements. It must be checked by the actual
    * LexicalHandler to see whether a comment is passed or not.
    * </p>
    *
    * @return boolean  
    */
    public boolean insideIncludeElement() {
      
        return level != 0;
      
    }
    
    
    public void startElement(String uri, String localName,
      String qName, Attributes atts) throws SAXException {
    
        if (level == 0) { // We're not inside an xi:include element

            // Adjust bases stack by pushing either the new
            // value of xml:base or the base of the parent
            String base = atts.getValue(NamespaceSupport.XMLNS, "base");
            URL parentBase = (URL) bases.peek();
            URL currentBase = parentBase;
            if (base != null) {
                try {
                    currentBase = new URL(parentBase, base); 
                }
                catch (MalformedURLException e) {
                    throw new SAXException("Malformed base URL: " 
                     + base, e);
                }
            }
            bases.push(currentBase);
          
            if (uri.equals(XINCLUDE_NAMESPACE) && localName.equals("include")) {
                // include external document
                String href = atts.getValue("href");
                // Verify that there is an href attribute
                if (href==null) { 
                    throw new SAXException("Missing href attribute");
                }
                
                String parse = atts.getValue("parse");
                if (parse == null) parse = "xml";
                
                if (parse.equals("text")) {
                    String encoding = atts.getValue("encoding");
                    includeTextDocument(href, encoding); 
                }
                else if (parse.equals("xml")) {
                    includeXMLDocument(href); 
                }
                // Need to check this also in DOM and JDOM????
                else {
                    throw new SAXException(
                      "Illegal value for parse attribute: " + parse);
                }
                level++;
            }
            else {
                super.startElement(uri, localName, qName, atts);
            } 
        
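        }
        else if (uri.equals(XINCLUDE_NAMESPACE) && localName.equals("include")) {
            // count include elements nested inside a skipped include so
            // that endElement() decrements level symmetrically
            level++;
        }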
      
    }

    public void endElement (String uri, String localName, String qName)
      throws SAXException {
        
        if (uri.equals(XINCLUDE_NAMESPACE) 
           && localName.equals("include")) {
            level--;
            // pop the base that startElement() pushed for this include element
            if (level == 0) bases.pop();
        }
        else if (level == 0) {
            bases.pop();
            super.endElement(uri, localName, qName);
        }
        
    }

    private int depth = 0;
     
    public void startDocument() throws SAXException {
        level = 0;
        if (depth == 0) super.startDocument(); 
        depth++;        
    }
    
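    public void endDocument() throws SAXException {
      
        locators.pop();
        // discard the base that setDocumentLocator() pushed for this
        // document so that stale entries cannot trigger a false
        // circular include error
        bases.pop();
        depth--;
        if (depth == 0) super.endDocument();
                
    }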
    
    // how do prefix mappings move across documents????
    public void startPrefixMapping(String prefix, String uri)
      throws SAXException {
        if (level == 0) super.startPrefixMapping(prefix, uri);
    }
    
    public void endPrefixMapping(String prefix)
      throws SAXException {
        if (level == 0) super.endPrefixMapping(prefix);        
    }

    public void characters(char[] ch, int start, int length) 
      throws SAXException {
        
        if (level == 0) super.characters(ch, start, length);
    
    }

    public void ignorableWhitespace(char[] ch, int start, int length)
      throws SAXException {
        if (level == 0) super.ignorableWhitespace(ch, start, length);
    }

    public void processingInstruction(String target, String data)
      throws SAXException {
        if (level == 0) super.processingInstruction(target, data);
    }

    public void skippedEntity(String name) throws SAXException {
        if (level == 0) super.skippedEntity(name);
    }

    // convenience method for error messages
    private String getLocation() {
      
        String locationString = "";
        Locator locator = (Locator) locators.peek();
        String publicID = "";
        String systemID = "";
        int column = -1;
        int line = -1;
        if (locator != null) {
            publicID = locator.getPublicId();
            systemID = locator.getSystemId();
            line = locator.getLineNumber();
            column = locator.getColumnNumber();
        }
        locationString = " in document included from " + publicID
          + " at " + systemID 
          + " at line " + line + ", column " + column;

        return locationString;
        
    }
    
    
  /**
    * <p>
    * This utility method reads a document at a specified URL
    * and fires off calls to <code>characters()</code>.
    * It's used to include files with <code>parse="text"</code>
    * </p>
    *
    * @param  url          URL of the document that will be read
    * @param  encoding     Encoding of the document; e.g. UTF-8, 
    *                      ISO-8859-1, etc.
    * @throws SAXException if the requested document cannot
    *                      be downloaded from the specified URL
    *                      or if the encoding is not recognized
    */
    private void includeTextDocument(String url, String encoding) 
      throws SAXException {

        if (encoding == null || encoding.trim().equals("")) encoding = "UTF-8"; 
        URL source;
        try {
            URL base = (URL) bases.peek();
            source = new URL(base, url);
        }
        catch (MalformedURLException e) {
            UnavailableResourceException ex =
              new UnavailableResourceException("Unresolvable URL " + url
              + getLocation());
            ex.setRootCause(e);
            throw new SAXException("Unresolvable URL " + url + getLocation(), ex);
        }
        
        try {
            URLConnection uc = source.openConnection();
            InputStream in = new BufferedInputStream(uc.getInputStream());
            String encodingFromHeader = uc.getContentEncoding();
            String contentType = uc.getContentType();
            if (encodingFromHeader != null) encoding = encodingFromHeader;
            else {
                // What if file does not have a MIME type but name ends in .xml????
                // MIME types are case-insensitive
                // Java may be picking this up from file URL
                if (contentType != null) {
                    contentType = contentType.toLowerCase();
                    if (contentType.equals("text/xml") 
                      || contentType.equals("application/xml")   
                      || (contentType.startsWith("text/") && contentType.endsWith("+xml") ) 
                      || (contentType.startsWith("application/") && contentType.endsWith("+xml"))) {
                         encoding = EncodingHeuristics.readEncodingFromStream(in);
                    }
                }
            }
            InputStreamReader reader = new InputStreamReader(in, encoding);
            char[] c = new char[1024];
            while (true) {
                int charsRead = reader.read(c, 0, 1024);
                if (charsRead == -1) break;
                if (charsRead > 0) this.characters(c, 0, charsRead);
            }
        }
        catch (UnsupportedEncodingException e) {
            throw new SAXException("Unsupported encoding: " 
             + encoding + getLocation(), e);
        }
        catch (IOException e) {
            throw new SAXException("Document not found: " 
             + source.toExternalForm() + getLocation(), e);
        }

    }

  /**
    * <p>
    * This utility method reads a document at a specified URL
    * and fires off calls to various <code>ContentHandler</code> methods.
    * It's used to include files with <code>parse="xml"</code>
    * </p>
    *
    * @param  url          URL of the document that will be read
    * @throws SAXException if the requested document cannot
    *                      be downloaded from the specified URL.
    */
    private void includeXMLDocument(String url) 
      throws SAXException {

        URL source;
        try {
            URL base = (URL) bases.peek();
            source = new URL(base, url);
        }
        catch (MalformedURLException e) {
            UnavailableResourceException ex =
              new UnavailableResourceException("Unresolvable URL " + url
              + getLocation());
            ex.setRootCause(e);
            throw new SAXException("Unresolvable URL " + url + getLocation(), ex);
        }
        
        try {
            // make this more robust
            XMLReader parser; 
            try {
                parser = XMLReaderFactory.createXMLReader();
            } 
            catch (SAXException e) {
                try {
                    parser = XMLReaderFactory.createXMLReader(
                      "org.apache.xerces.parsers.SAXParser"
                    );
                }
                catch (SAXException e2) {
                    // fail loudly instead of silently skipping the include
                    throw new SAXException("Could not find an XML parser", e2);
                }
            }
            parser.setContentHandler(this);
            // save old level and base
            int previousLevel = level;
            this.level = 0;
            if (bases.contains(source)) {
                Exception e = new CircularIncludeException(
                  "Circular XInclude Reference to " + source + getLocation()
                );
                throw new SAXException("Circular XInclude Reference", e);
            }
            bases.push(source);
            parser.parse(source.toExternalForm());
            // restore old level and base
            this.level = previousLevel;
            bases.pop();
        }
        catch (IOException e) {
            throw new SAXException("Document not found: " 
             + source.toExternalForm() + getLocation(), e);
        }

    }
        
}
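
The text-include path in the filter takes its header override from URLConnection.getContentEncoding(). In HTTP the Content-Encoding header normally names a compression scheme such as gzip, while the character encoding travels in the charset parameter of Content-Type. A hedged sketch of pulling that parameter out follows. CharsetSketch and charsetFromContentType are hypothetical names, not part of the filter.

```java
import java.util.Locale;

final class CharsetSketch {

    // Returns the charset parameter of a Content-Type header value such as
    // "text/plain; charset=ISO-8859-1", or the fallback if there is none.
    static String charsetFromContentType(String contentType, String fallback) {
        if (contentType == null) return fallback;
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters follow it
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            // parameter names are case-insensitive
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // the value may be a quoted string
                if (value.length() >= 2 && value.startsWith("\"")
                  && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                if (!value.isEmpty()) return value;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(
          charsetFromContentType("text/plain; charset=ISO-8859-1", "UTF-8"));
        System.out.println(charsetFromContentType("text/plain", "UTF-8"));
    }
}
```

Splitting off the parameters also leaves the bare media type in parts[0], which is what a comparison against text/xml or application/xml needs when the header carries a charset.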

SAX XInclude Driver

/*--

 Copyright 2001 Elliotte Rusty Harold.
 All rights reserved.

 I haven't yet decided on a license.
 It will be some form of open source.

 THIS SOFTWARE IS PROVIDED "AS IS" AND ANY EXPRESSED OR IMPLIED
 WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 DISCLAIMED.  IN NO EVENT SHALL ELLIOTTE RUSTY HAROLD OR ANY
 OTHER CONTRIBUTORS TO THIS PACKAGE
 BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 SUCH DAMAGE.

 */

package com.macfaq.xml;

import org.xml.sax.SAXException;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.XMLReaderFactory;
import org.xml.sax.XMLReader;
import org.xml.sax.Locator;
import org.xml.sax.Attributes;
import org.xml.sax.ext.LexicalHandler;

import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.io.OutputStream;
import java.io.Writer;
import java.io.OutputStreamWriter;
import java.io.File;
import java.net.URL;
import java.net.MalformedURLException;
import java.util.Stack;

/**
 * <p><code>SAXXIncluder</code> is a simple <code>ContentHandler</code> that
 * writes its XML document onto an output stream after resolving
 * all <code>xinclude:include</code> elements.
 * </p>
 *
 * <p>
 *    The only current known bug is that the notation and
 *    unparsed entity information items are not included
 *    in the result infoset. Furthermore, processing 
 *    instructions in the DTD are not included. Note that this is 
 *    only relevant to the source infoset. The DOCTYPE declaration
 *    is specifically excluded from included infosets.
 * </p>
 *
 *  <p> 
 *     I also need to check how section 4.4.3.1 applies for inscope
 *     namespaces in included documents. Currently this is not an issue
 *     because I only include full documents, but it may become an
 *     issue when XPointer support is added. 
 *  </p>
 *
 *  <p> 
 *     There's no XPointer support yet. Only full documents are
 *     included.
 *  </p>
 *
 *  <p> 
 *     The parser used to drive this must support the <code>LexicalHandler</code>
 *     interface. It must also provide a <code>Locator</code> object. 
 *     These are optional in SAX, but Xerces-J does support these features.
 *  </p>
 *
 * @author Elliotte Rusty Harold
 * @version 1.0d8
 */
public class SAXXIncluder implements ContentHandler, LexicalHandler {

    private Writer out;
    private String encoding;
   
    // should try to combine two constructors so as not to duplicate
    // code
    public SAXXIncluder(OutputStream out, String encoding)
      throws UnsupportedEncodingException {
        this.out = new OutputStreamWriter(out, encoding);
        this.encoding = encoding;
    }

    public SAXXIncluder(OutputStream out) {
        try {
          this.out = new OutputStreamWriter(out, "UTF8");
          this.encoding="UTF-8";
        }
        catch (UnsupportedEncodingException e) {
          // This really shouldn't happen
        }    
    }

    public void setDocumentLocator(Locator locator) {}
    
    public void startDocument() throws SAXException {

        try {
            out.write("<?xml version='1.0' encoding='" 
              + encoding + "'?>\r\n");
        }
        catch (IOException e) {
            throw new SAXException("Write failed", e);       
        }        
        
    }
    
    public void endDocument() throws SAXException {
        
        try {
            out.flush();
        }
        catch (IOException e) {
            throw new SAXException("Flush failed", e);       
        }
        
    }
    
    public void startPrefixMapping(String prefix, String uri)
      throws SAXException {
        
    }
    
    public void endPrefixMapping(String prefix)
      throws SAXException {
        
    }

    public void startElement(String namespaceURI, String localName,
      String qualifiedName, Attributes atts)
      throws SAXException {

        try {
            out.write("<" + qualifiedName);
            for (int i = 0; i < atts.getLength(); i++) {
                out.write(" ");   
                out.write(atts.getQName(i));   
                out.write("='");
                String value = atts.getValue(i);
                // + 4 allows space for one entity reference.
                // If there's more than that, then the StringBuffer
                // will automatically expand
                // Need to use character references if the encoding
                // can't support the character
                StringBuffer encodedValue=new StringBuffer(value.length() + 4);
                for (int j = 0; j < value.length(); j++) {
                    char c = value.charAt(j);
                    if (c == '&') encodedValue.append("&amp;");
                    else if (c == '<') encodedValue.append("&lt;");
                    else if (c == '>') encodedValue.append("&gt;");
                    else if (c == '\'') encodedValue.append("&apos;");
                    else encodedValue.append(c);    
                }
                out.write(encodedValue.toString());   
                out.write("'");
            }
            out.write(">");
        }
        catch (IOException e) {
            throw new SAXException("Write failed", e);       
        }        
        
    }
      
    public void endElement(String namespaceURI, String localName,
      String qualifiedName) throws SAXException {
        
        try {
            out.write("</" + qualifiedName + ">");
        }
        catch (IOException e) {
            throw new SAXException("Write failed", e);       
        }
            
    }

    // need to escape characters that are not in the given 
    // encoding using character references????
    public void characters(char[] ch, int start, int length) 
      throws SAXException {
        
        try {
            for (int i = 0; i < length; i++) {
                char c = ch[start+i];
                if (c == '&') out.write("&amp;");
                else if (c == '<') out.write("&lt;");
                // escape > as well so that ]]> never appears in text
                else if (c == '>') out.write("&gt;");
                else out.write(c);
            }
        }
        catch (IOException e) {
            throw new SAXException("Write failed", e);       
        }
    
    }

    public void ignorableWhitespace(char[] ch, int start, int length)
      throws SAXException {
        this.characters(ch, start, length);   
    }

    // do I need to escape text in PI????
    public void processingInstruction(String target, String data)
      throws SAXException {

        try {
            out.write("<?" + target + " " + data + "?>");
        }
        catch (IOException e) {
            throw new SAXException("Write failed", e);       
        }
        
    }

    public void skippedEntity(String name) throws SAXException {
        
        try {
            out.write("&" + name + ";");
        }
        catch (IOException e) {
            throw new SAXException("Write failed", e);       
        }
        
    }

    // LexicalHandler methods
    private boolean inDTD = false;
    private Stack entities = new Stack();
    
    public void startDTD(String name, String publicId, String systemId)
      throws SAXException {
        inDTD = true;
        // if this is the source document, output a DOCTYPE declaration
        if (entities.size() == 0) {
            String id;
            if (publicId != null) id = "PUBLIC \"" + publicId + "\" \"" + systemId + '"';
            else id = "SYSTEM \"" + systemId + '"';
            try {
                out.write("<!DOCTYPE " + name + " " + id + ">\r\n");
            }
            catch (IOException e) {
                throw new SAXException("Error while writing DOCTYPE", e);   
            }
        }
    }
    public void endDTD() throws SAXException {
        // leave DTD mode so that later comments are passed through again
        inDTD = false;
    }
    
    public void startEntity(String name) throws SAXException {
        entities.push(name);
    }
    
    
    public void endEntity(String name) throws SAXException {
        entities.pop();
    }
    
    public void startCDATA() throws SAXException {}
    public void endCDATA() throws SAXException {}

    // Just need this reference so we can ask if a comment is 
    // inside an include element or not
    private XIncludeFilter filter = null;

    public void setFilter(XIncludeFilter filter) {
        this.filter = filter;
    } 
    
    public void comment(char[] ch, int start, int length)
      throws SAXException {
        
        // filter is null if setFilter() was never called
        if (!inDTD && (filter == null || !filter.insideIncludeElement())) {
            try {
                out.write("<!--");
                out.write(ch, start, length);
                out.write("-->");
            }
            catch (IOException e) {
                throw new SAXException("Write failed", e);       
            }
        }
      
    }    
    
    /**
      * <p>
      * The driver method for the SAXXIncluder program.
      * </p>
      *
      * @param args  contains the URLs and/or filenames
      *              of the documents to be processed.
      */
    public static void main(String[] args) {

        // make this more robust
        XMLReader parser; 
        try {
            parser = XMLReaderFactory.createXMLReader();
        } 
        catch (SAXException e) {
            try {
                parser = XMLReaderFactory.createXMLReader(
                  "org.apache.xerces.parsers.SAXParser");
            }
            catch (SAXException e2) {
                System.err.println("Could not find an XML parser");
                return;
            }
        }
        
        // Need better namespace handling
        try {
            parser.setFeature("http://xml.org/sax/features/namespace-prefixes", true);
        }
        catch (SAXException e) {
            System.err.println(e);
            return;
        }   
        
        for (int i = 0; i < args.length; i++) {
            try {
               /* URL base;
                try {
                    base = new URL(args[i]);
                }
                catch (MalformedURLException e) {
                    File f = new File(args[i]);
                    base = f.toURL();
                } */
                XIncludeFilter includer = new XIncludeFilter(); 
                includer.setParent(parser);
                SAXXIncluder s = new SAXXIncluder(System.out);
                includer.setContentHandler(s);
                try {
                    includer.setProperty(
                      "http://xml.org/sax/properties/lexical-handler",
                       s);
                    s.setFilter(includer);
                }
                catch (SAXException e) {
                    // Will not support comments
                } 
                includer.parse(args[i]);
            }
            catch (Exception e) { // be specific about exceptions????
                System.err.println(e);
                e.printStackTrace();
            }
        }

    }

}
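
The comment handling above depends on the optional SAX lexical-handler property. The following is a minimal sketch of that mechanism on its own, assuming the parser bundled with the JDK (obtained through javax.xml.parsers rather than by Xerces class name). CommentCollector and its collect method are hypothetical names used only for illustration; they are not part of SAXXIncluder.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper: collects the text of every comment the parser
// reports outside the DTD. DefaultHandler2 implements LexicalHandler.
final class CommentCollector extends DefaultHandler2 {

    private final List<String> comments = new ArrayList<String>();
    private boolean inDTD = false;

    @Override
    public void startDTD(String name, String publicId, String systemId) {
        inDTD = true;
    }

    @Override
    public void endDTD() {
        inDTD = false;
    }

    @Override
    public void comment(char[] ch, int start, int length) {
        // Comments inside the DTD are skipped, as in the listing above
        if (!inDTD) comments.add(new String(ch, start, length));
    }

    static List<String> collect(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        CommentCollector collector = new CommentCollector();
        reader.setContentHandler(collector);
        // Without this property the parser never calls comment()
        reader.setProperty(
          "http://xml.org/sax/properties/lexical-handler", collector);
        reader.parse(new InputSource(new StringReader(xml)));
        return collector.comments;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collect("<a><!-- one --><b/><!--two--></a>"));
    }
}
```

A parser that does not recognize the property throws a SAXException from setProperty, which is why the driver method above wraps that call in its own try block and simply goes without comments.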

Implementation as JDOM

/*--

 Copyright 2000, 2001 Elliotte Rusty Harold.
 All rights reserved.

 I haven't yet decided on a license.
 It will be some form of open source.

 THIS SOFTWARE IS PROVIDED "AS IS" AND ANY EXPRESSED OR IMPLIED
 WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 DISCLAIMED.  IN NO EVENT SHALL ELLIOTTE RUSTY HAROLD OR
 ANY OTHER CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 SUCH DAMAGE.

 */

package com.macfaq.xml;

import java.net.URL;
import java.net.URLConnection;
import java.net.MalformedURLException;
import java.util.Stack;
import java.util.Iterator;
import java.util.List;
import java.util.LinkedList;
import java.io.File;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.io.InputStreamReader;
import java.io.BufferedInputStream;
import java.io.InputStream;

import org.jdom.Namespace;
import org.jdom.Comment;
import org.jdom.CDATA;
import org.jdom.Text;
import org.jdom.JDOMException;
import org.jdom.Attribute;
import org.jdom.Element;
import org.jdom.ProcessingInstruction;
import org.jdom.Document;
import org.jdom.DocType;
import org.jdom.EntityRef;
import org.jdom.input.SAXBuilder;
import org.jdom.input.DOMBuilder;
import org.jdom.output.XMLOutputter;

/**
 * <p><code>JDOMXIncluder</code> provides methods to
 * resolve JDOM elements and documents to produce
 * a new <code>Document</code>, <code>Element</code>, 
 * or <code>List</code> of nodes with all
 * XInclude references resolved.
 * </p>
 *
 * <p>
 * Known bugs include:
 * </p>
 * <ul>
 *  <li>XPointer fragment identifiers are not handled</li>
 *  <li>Notations and unparsed entities from the included infosets
 *      are not merged into the final infoset</li>
 * </ul>
 *
 * @author Elliotte Rusty Harold
 * @version 1.0d8, September 5, 2001
 */
public class JDOMXIncluder {

  public final static Namespace XINCLUDE_NAMESPACE
    = Namespace.getNamespace("xi", "http://www.w3.org/2001/XInclude");

  // No instances allowed
  private JDOMXIncluder() {}

  private static SAXBuilder builder = new SAXBuilder();

  /**
    * <p>
    * This method resolves a JDOM <code>Document</code>
    * and merges in all XInclude references.
    * The <code>Document</code> object returned is a new document.
    * The original <code>Document</code> is not changed.
    * </p>
    *
    * @param original <code>Document</code> that will be processed
    * @param base     <code>String</code> form of the base URI against which
    *                 relative URLs will be resolved. This can be null if the
    *                 document includes an <code>xml:base</code> attribute.
    * @return Document new <code>Document</code> object in which all
    *                  XInclude elements have been replaced.
    * @throws MissingHrefException if an <code>xinclude:include</code> element does not have an href attribute.
    * @throws UnavailableResourceException if an included document cannot be located
    *                                  or cannot be read.
    * @throws MalformedResourceException if an included document is not namespace well-formed
    * @throws CircularIncludeException if this document possesses a cycle of
    *                                  XIncludes.
    * @throws XIncludeException if any of the rules of XInclude are violated
    */
    public static Document resolve(Document original, String base)
      throws XIncludeException {

        if (original == null) {
           throw new NullPointerException("Document must not be null");
        }
        
        Document result = (Document) original.clone();
        
        Element root = result.getRootElement();
        List resolved = resolve(root, base);
        
        // check that the list returned contains 
        // exactly one root element
        Element newRoot = null;
        Iterator iterator = resolved.iterator();
        while (iterator.hasNext()) {
            Object o = iterator.next();
            if (o instanceof Element) {
                if (newRoot != null) {
                    throw new XIncludeException("Tried to include multiple roots");       
                }
                newRoot = (Element) o;
            }
            else if (o instanceof Comment || o instanceof ProcessingInstruction) {
                // do nothing    
            }
            else if (o instanceof Text || o instanceof String) {
                throw new XIncludeException(
                  "Tried to include text node outside of root element"
                );    
            }
            else if (o instanceof EntityRef) {
                throw new XIncludeException(
                  "Tried to include a general entity reference outside of root element"
                );    
            }
            else {
                throw new XIncludeException(
                    "Unexpected type " + o.getClass()
                ); 
            }
                 
        }
        
        if (newRoot == null) {
            throw new XIncludeException("No root element");       
        }
  
        // Could probably combine two loops
        List newContent = result.getContent();
        // resolved contains list of new content
        // use it to replace old root element
        iterator = resolved.iterator();
        
        // put in nodes before root element
        int rootPosition = newContent.indexOf(result.getRootElement());
        while (iterator.hasNext()) {
            Object o = iterator.next();
            if (o instanceof Comment || o instanceof ProcessingInstruction) {
                newContent.add(rootPosition, o);
                rootPosition++;
            }
            else if (o instanceof Element) { // the root
                break;
            }
            else {
              // throw exception????   
            }
        }
        
        // put in root element
        result.setRootElement(newRoot);
        
        int addPosition = rootPosition+1;
        // put in nodes after root element
        while (iterator.hasNext()) {
            Object o = iterator.next();
            if (o instanceof Comment || o instanceof ProcessingInstruction) {
                newContent.add(addPosition, o);
                addPosition++;
            }
            else {
              // throw exception????   
            }
        }
                        
        return result;
  }

  /**
    * <p>
    * This method resolves a JDOM <code>Element</code>
    * and merges in all XInclude references. This process is recursive.
    * The list returned contains no XInclude elements.
    * If a referenced document cannot be found, an
    * <code>UnavailableResourceException</code> is thrown.
    * The nodes in the list returned are new objects.
    * The original <code>Element</code> is not changed.
    * </p>
    *
    * @param original <code>Element</code> that will be processed
    * @param base     <code>String</code> form of the base URI against which
    *                 relative URLs will be resolved. This can be null if the
    *                 element includes an <code>xml:base</code> attribute.
    * @return List  A List containing all nodes that replace this element.
    *               If this element is not an <code>xinclude:include</code>
    *               this list is guaranteed to contain a single <code>Element</code> object.
    * @throws MissingHrefException if an <code>xinclude:include</code> element does not have an href attribute.
    * @throws NullPointerException if <code>original</code> element is null.
    * @throws UnavailableResourceException if an included document cannot be located
    *                                  or cannot be read.
    * @throws MalformedResourceException if an included document is not namespace well-formed
    * @throws CircularIncludeException if this <code>Element</code> contains an XInclude element
    *                                  that attempts to include a document in which 
    *                                  this element is directly or indirectly included.
    */
    public static List resolve(Element original, String base)
     throws CircularIncludeException, XIncludeException, NullPointerException {

        if (original == null) {
          throw new NullPointerException("You can't XInclude a null element.");
        }
        Stack bases = new Stack();
        if (base != null) bases.push(base);
    
        List result = resolve(original, bases);
        // the stack is empty if no base was supplied
        if (!bases.isEmpty()) bases.pop();
        return result;

    }

    private static boolean isIncludeElement(Element element) {
        
        if (element.getName().equals("include") &&
            element.getNamespace().equals(XINCLUDE_NAMESPACE)) {
          return true;
        }
        return false;
        
    }


  /**
    * <p>
    * This method resolves a JDOM <code>Element</code>
    * and merges in all XInclude references. This process is recursive.
    * The list returned contains no XInclude elements.
    * The nodes in the list returned are new objects.
    * The original <code>Element</code> is not changed.
    * </p>
    *
    * @param original <code>Element</code> that will be processed
    * @param bases    <code>Stack</code> containing the string forms of
    *                 all the URIs of documents which contain this element
    *                 through XIncludes. This is used to detect if any circular 
    *                 references occur. 
    * @return List  A <code>List</code> containing all nodes that replace this element.
    *               If this element is not an <code>xinclude:include</code>
    *               this list is guaranteed to contain a single <code>Element</code> object.
    * @throws MissingHrefException if an <code>xinclude:include</code> element does not have an href attribute.
    * @throws UnavailableResourceException if an included document cannot be located
    *                                  or cannot be read.
    * @throws BadParseAttributeException if an <code>include</code> element has a <code>parse</code> attribute
    *                                  with any value other than <code>text</code> or <code>xml</code>
    * @throws MalformedResourceException if an included document is not namespace well-formed
    * @throws CircularIncludeException if this <code>Element</code> contains an XInclude element
    *                                  that attempts to include a document in which 
    *                                  this element is directly or indirectly included.
    */
    protected static List resolve(Element original, Stack bases)
      throws CircularIncludeException, MalformedResourceException, 
      UnavailableResourceException, BadParseAttributeException, XIncludeException {

        String base = "";
        if (bases.size() != 0) base = (String) bases.peek();
  
        if (isIncludeElement(original)) {
            return resolveXIncludeElement(original, bases);       
        }
        else {
            Element resolvedElement = resolveNonXIncludeElement(original, bases);        
            List resultList = new LinkedList();
            resultList.add(resolvedElement);
            return resultList;
        }
  
    }

    private static List resolveXIncludeElement(Element original, Stack bases)
      throws CircularIncludeException, MalformedResourceException, 
      UnavailableResourceException, XIncludeException {

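        // null rather than "" when no base is known, so that the
        // absolute URL branch below can be reached
        String base = null;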
        if (bases.size() != 0) base = (String) bases.peek();
  
        // These lines are probably unnecessary
        if (!isIncludeElement(original)) {
            throw new RuntimeException("Bad private Call");       
        }
            
        Attribute href = original.getAttribute("href");
        if (href == null) { 
            throw new MissingHrefException("Missing href attribute");
        }
          
        Attribute baseAttribute
          = original.getAttribute("base", Namespace.XML_NAMESPACE);
        if (baseAttribute != null) {
            base = baseAttribute.getValue();
        }
          
        URL remote;
        if (base != null) {
            try {
              URL context = new URL(base);
              remote = new URL(context, href.getValue());
            }
            catch (MalformedURLException ex) {
               XIncludeException xex = new UnavailableResourceException(
                 "Unresolvable URL " + base + "/" + href.getValue());
               xex.setRootCause(ex);
               throw xex;
            }
        }
        else { // base == null
            try {
                remote = new URL(href.getValue());
            }
            catch (MalformedURLException ex) {
                XIncludeException xex = new UnavailableResourceException(
                  "Unresolvable URL " + href.getValue());
                xex.setRootCause(ex);
                throw xex;
            }
        }
    
        boolean parse = true;
        Attribute parseAttribute = original.getAttribute("parse");
        if (parseAttribute != null) {
            String parseValue = parseAttribute.getValue();
            if (parseValue.equals("text")) parse = false;
            else if (!parseValue.equals("xml")) {
                throw new BadParseAttributeException(
                  parseValue + " is not a legal value for the parse attribute"
                );
            } 
        }
    
        if (parse) {
            // System.err.println("parsed");
                     // checks for equality (OK) or identity (not OK)????
            if (bases.contains(remote.toExternalForm())) {
                // need to figure out how to get file and number where
                // bad include occurs
                throw new CircularIncludeException(
                  "Circular XInclude Reference to "
                  + remote.toExternalForm() + " in " 
                  + (base == null || base.length() == 0
                      ? "the source document" : base)
                );
            }
    
            try {
                Document doc = builder.build(remote); // this Document object never leaves this method
                // System.err.println(doc);
                bases.push(remote.toExternalForm());
                // This is the point where I need to select out 
                // the nodes pointed to by the XPointer
                // I really need to push this out into a separate method
                // that returns a list of the nodes pointed to by the XPointer
                String fragment = remote.getRef();
                 
                 
                // I need to return the full document child list including comments and PIs, 
                // not just the resolved root
                Element root = doc.getRootElement();
                List topLevelNodes = doc.getContent();
                int rootPosition = topLevelNodes.indexOf(root);
                List beforeRoot = topLevelNodes.subList(0, rootPosition);
                List afterRoot = topLevelNodes.subList(rootPosition+1, topLevelNodes.size());
                List rootList = resolve(root, bases);
                List resultList = new LinkedList();
                resultList.addAll(beforeRoot);
                resultList.addAll(rootList);
                resultList.addAll(afterRoot);

                // the top-level things I return should be disconnected from their parents                
                for (int i = 0; i < resultList.size(); i++) {
                    Object o = resultList.get(i);
                    if (o instanceof Element) {
                      Element element = (Element) o;
                      List nodes = resolve(element, bases);
                      resultList.addAll(i, nodes);
                      i += nodes.size();
                      resultList.remove(i);
                      i--;
                      // System.err.println(element);
                      element.detach();     
                    } 
                    if (o instanceof Comment) {
                      Comment comment = (Comment) o;
                      comment.detach();     
                    } 
                    if (o instanceof ProcessingInstruction) {
                      ProcessingInstruction pi = (ProcessingInstruction) o;
                      pi.detach();     
                    } 
                }
                bases.pop();
                return resultList;
              }
              // should this be a MalformedResourceException????
              // probably; maybe check on why JDOMException was thrown
              catch (JDOMException e) {
                  XIncludeException xex = new UnavailableResourceException(
                    "Unresolvable URL " + href.getValue());
                  xex.setRootCause(e);
                  throw xex;
              }
          }
          else { // unparsed, insert text
            String encoding = original.getAttributeValue("encoding");
            String s = downloadTextDocument(remote, encoding);
            List resultList = new LinkedList();
            resultList.add(s);
            return resultList;
          }
  
    }

    private static Element resolveNonXIncludeElement(Element original, Stack bases)
      throws CircularIncludeException, MalformedResourceException, 
      UnavailableResourceException, XIncludeException {

        String base = "";
        if (bases.size() != 0) base = (String) bases.peek();

        // Not an include element; a copy of this element in which its
        // descendants have been resolved will be returned
        // recursively process children
        Element result = new Element(original.getName(), original.getNamespace());
        Iterator attributes = original.getAttributes().iterator();
        while (attributes.hasNext()) {
            Attribute a = (Attribute) attributes.next();
            result.setAttribute((Attribute) a.clone());
        }
        List newChildren = result.getContent(); // live list

        Iterator originalChildren = original.getContent().iterator();
        while (originalChildren.hasNext()) {
            Object o = originalChildren.next();
            if (o instanceof Element) {
                Element element = (Element) o;
                if (isIncludeElement(element)) {
                    newChildren.addAll(resolveXIncludeElement(element, bases));
                }
                else {
                    newChildren.add(resolveNonXIncludeElement(element, bases));
                }
            }
            else if (o instanceof String) {
                newChildren.add(o);
            }
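            else if (o instanceof Text) {
                // clone so the node does not end up with two parents
                Text text = (Text) o;
                newChildren.add(text.clone());
            }
            else if (o instanceof CDATA) {
                CDATA cdata = (CDATA) o;
                newChildren.add(cdata.clone());
            }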
            else if (o instanceof Comment) {
                Comment c = (Comment) o;
                newChildren.add(c.clone());
            }
            else if (o instanceof EntityRef) {
                EntityRef entity = (EntityRef) o;
                newChildren.add(entity.clone());
            }
            else if (o instanceof ProcessingInstruction) {
                ProcessingInstruction pi = (ProcessingInstruction) o;
                newChildren.add(pi.clone());
            }
            else {
                throw new XIncludeException("Unexpected Type " + o.getClass());
            }
        } // end while

        return result;
  
    }


  /**
    * <p>
    * This utility method reads a document at a specified URL
    * and returns the contents of that document as a <code>String</code>.
    * It's used to include files with <code>parse="text"</code>.
    * </p>
    *
    * @param source   <code>URL</code> of the document that will be stored in 
    *                 <code>String</code>. 
    * @param  encoding Encoding of the document; e.g. UTF-8,
    *                  ISO-8859-1, etc.
    * @return String  The document retrieved from the source <code>URL</code>.
    * @throws UnavailableResourceException if the source document cannot be located
    *                                  or cannot be read.
    */    
    public static String downloadTextDocument(URL source, String encoding) 
      throws UnavailableResourceException {
         
        if (encoding == null || encoding.equals("")) encoding = "UTF-8"; 
        try {
            StringBuffer s = new StringBuffer();
            URLConnection uc = source.openConnection();
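            String contentType = uc.getContentType();
            // The character encoding is the charset parameter of the
            // Content-Type header. getContentEncoding() reports
            // Content-Encoding values such as gzip, which are not charsets.
            String encodingFromHeader = null;
            if (contentType != null) {
                int charsetStart = contentType.toLowerCase().indexOf("charset=");
                if (charsetStart != -1) {
                    String value = contentType.substring(charsetStart + 8);
                    int end = value.indexOf(';');
                    if (end != -1) value = value.substring(0, end);
                    value = value.replace('"', ' ').trim();
                    if (value.length() > 0) encodingFromHeader = value;
                }
            }
            InputStream in = new BufferedInputStream(uc.getInputStream());
            if (encodingFromHeader != null) encoding = encodingFromHeader;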
            else {
                // What if file does not have a MIME type but name ends in .xml????
                // MIME types are case-insensitive
                // Java may be picking this up from file URL
                if (contentType != null) {
                    contentType = contentType.toLowerCase();
                    if (contentType.equals("text/xml") 
                      || contentType.equals("application/xml")   
                      || (contentType.startsWith("text/") && contentType.endsWith("+xml") ) 
                      || (contentType.startsWith("application/") && contentType.endsWith("+xml"))) {
                         encoding = EncodingHeuristics.readEncodingFromStream(in);
                    }
                }
            }
            InputStreamReader reader = new InputStreamReader(in, encoding);
            int c;
            // read decoded characters from the reader, not raw bytes from in
            while ((c = reader.read()) != -1) {
              if (c == '<') s.append("&lt;");
              else if (c == '&') s.append("&amp;");
              else s.append((char) c);
            }
            return s.toString();
        }
        catch (UnsupportedEncodingException e) {
            UnavailableResourceException ex = new UnavailableResourceException(
              "Encoding " + encoding + " not recognized for included document: " 
              + source.toExternalForm());
            ex.setRootCause(e);
            throw ex;
        }
        catch (IOException e) {
            UnavailableResourceException ex = new UnavailableResourceException(
              "Document not found: " + source.toExternalForm());
            ex.setRootCause(e);
            throw ex;
        }
      
    }

    /**
      * <p>
      * The driver method for the XIncluder program.
      * I'll probably move this to a separate class soon.
      * </p>
      *
      * @param args  <code>args[0]</code> contains the URL or file name 
      *              of the first document to be processed; <code>args[1]</code>
      *              contains the URL or file name 
      *              of the second document to be processed, etc. 
      */
    public static void main(String[] args) {
  
        SAXBuilder builder = new SAXBuilder();
        XMLOutputter outputter = new XMLOutputter();
        for (int i = 0; i < args.length; i++) {
            try {
                Document input = builder.build(args[i]);
                // absolutize URL
                String base = args[i];
                if (base.indexOf(':') < 0) {
                    File f = new File(base);
                    base = f.toURL().toExternalForm();
                }
                Document output = resolve(input, base);
                // need to set encoding on this to Latin-1 and check what
                // happens to UTF-8 curly quotes
                outputter.output(output, System.out);
            }
            catch (Exception e) {
                System.err.println(e);
                e.printStackTrace();
            }
        }
  
    }

}
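
The stack of base URIs in the listing above does two jobs: it supplies the base against which each href is resolved, and it detects circular includes. The following sketch isolates that bookkeeping using only the JDK. It is an illustration under assumptions, not the JDOMXIncluder code: IncludeStackSketch, enter, leave and resolve are hypothetical names, java.net.URI stands in for java.net.URL, and IllegalStateException stands in for CircularIncludeException.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper that tracks which documents are currently being included.
final class IncludeStackSketch {

    // String forms of the URIs of the documents currently open,
    // innermost document first
    private final Deque<String> bases = new ArrayDeque<String>();

    // Resolves an href against the innermost base URI. With an empty
    // stack the href is returned as is and must already be absolute.
    URI resolve(String href) {
        if (bases.isEmpty()) return URI.create(href);
        return URI.create(bases.peek()).resolve(href);
    }

    // Call before processing an included document
    void enter(URI target) {
        String form = target.toString();
        // contains() compares with equals(), so two distinct String
        // objects naming the same URI still count as a cycle
        if (bases.contains(form)) {
            throw new IllegalStateException("Circular include of " + form);
        }
        bases.push(form);
    }

    // Call after the included document has been fully processed
    void leave() {
        bases.pop();
    }

    public static void main(String[] args) {
        IncludeStackSketch stack = new IncludeStackSketch();
        stack.enter(URI.create("http://www.example.com/book/index.xml"));
        URI chapter = stack.resolve("chapter1.xml");
        System.out.println(chapter);
        stack.enter(chapter);
        try {
            // chapter1.xml tries to include index.xml again
            stack.enter(stack.resolve("index.xml"));
        }
        catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Pairing every enter with a leave keeps sibling includes of the same document legal; only a document that is still open further up the stack counts as a cycle.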

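For parse="text" includes, the bytes of the remote document have to be decoded with the right character encoding before markup characters are escaped. The sketch below shows one way this might look with JDK classes only. TextIncludeSketch, charsetFromContentType and readEscaped are hypothetical names chosen for illustration, and the handling of the Content-Type header is an assumption about what a server sends rather than a complete MIME parser.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.Locale;

// Hypothetical helper, not part of JDOMXIncluder
final class TextIncludeSketch {

    // Returns the charset parameter of a Content-Type value such as
    // "text/plain; charset=ISO-8859-1", or null if there is none.
    static String charsetFromContentType(String contentType) {
        if (contentType == null) return null;
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters follow it
        for (int i = 1; i < parts.length; i++) {
            String parameter = parts[i].trim();
            if (parameter.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = parameter.substring("charset=".length()).trim();
                if (value.length() >= 2
                  && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // Decodes the stream with the given encoding and escapes < and &
    // so the text can be inserted into an XML document.
    static String readEscaped(InputStream in, String encoding)
      throws IOException {
        StringBuilder s = new StringBuilder();
        Reader reader = new InputStreamReader(in, encoding);
        int c;
        // Characters come from the Reader; reading the raw stream
        // instead would bypass the decoding step entirely
        while ((c = reader.read()) != -1) {
            if (c == '<') s.append("&lt;");
            else if (c == '&') s.append("&amp;");
            else s.append((char) c);
        }
        return s.toString();
    }

    public static void main(String[] args) throws IOException {
        String encoding
          = charsetFromContentType("text/plain; charset=ISO-8859-1");
        byte[] data = {(byte) 0xE9, '<', '&'};
        System.out.println(
          readEscaped(new ByteArrayInputStream(data), encoding));
    }
}
```
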
Implementation as DOM

/*--

 Copyright 2001 Elliotte Rusty Harold.
 All rights reserved.

 I haven't yet decided on a license.
 It will be some form of open source.

 THIS SOFTWARE IS PROVIDED "AS IS" AND ANY EXPRESSED OR IMPLIED
 WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
 OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 DISCLAIMED.  IN NO EVENT SHALL ELLIOTTE RUSTY HAROLD OR ANY
 OTHER CONTRIBUTORS TO THIS PACKAGE
 BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
 USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 SUCH DAMAGE.

 */

package com.macfaq.xml;

import java.net.URL;
import java.net.URLConnection;
import java.net.MalformedURLException;
import java.util.Stack;
import java.util.List;
import java.util.ArrayList;

import java.io.File;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.io.InputStreamReader;
import java.io.BufferedInputStream;
import java.io.InputStream;

import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

import org.w3c.dom.Element;
import org.w3c.dom.Document;
import org.w3c.dom.Comment;
import org.w3c.dom.ProcessingInstruction;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Text;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.DOMImplementation;

import org.apache.xerces.parsers.DOMParser;
import org.apache.xml.serialize.OutputFormat;
import org.apache.xml.serialize.XMLSerializer;

/**
 * <p><code>DOMXIncluder</code> provides methods to
 * resolve DOM elements and documents to produce
 * a new <code>Document</code> or <code>Element</code> with all
 * XInclude references resolved.
 * </p>
 *
 * <p>
 * It does not yet handle the merging of unparsed entity
 * and notation information items from the included infosets.
 * Furthermore it does not include the source document's doctype 
 * declaration if that contains an internal DTD subset.
 * This may be the result of a Xerces bug. 
 * </p>
 *
 *
 * @author Elliotte Rusty Harold
 * @version 1.0d8
 */
public class DOMXIncluder {

  public final static String XINCLUDE_NAMESPACE
   = "http://www.w3.org/2001/XInclude";
  public final static String XML_NAMESPACE
   = "http://www.w3.org/XML/1998/namespace";

  // No instances allowed
  private DOMXIncluder() {}

  private static DOMParser parser = new DOMParser();

  /**
    * <p>
    * This method resolves a DOM <code>Document</code>
    * and merges in all XInclude references.
    * The <code>Document</code> object returned is a new document.
    * The original <code>Document</code> object is not changed.
    * </p>
    *
    * <p>
    * This method depends on the ability to clone a DOM <code>Document</code>
    * which not all DOM parsers may be able to do.
    * It definitely exercises a bug in Xerces-J 1.3.1.
    * This bug is fixed in Xerces-J 1.4.0.
    * </p>
    *
    * @param original <code>Document</code> that will be processed
    * @param base     <code>String</code> form of the base URI against which
    *                 relative URLs will be resolved. This can be null if the
    *                 document includes an <code>xml:base</code> attribute.
    * @return Document new <code>Document</code> object in which all
    *                  XInclude elements have been replaced.
    * @throws XIncludeException if this document, though namespace well-formed,
    *                           violates one of the rules of XInclude.
    * @throws NullPointerException  if the original argument is null.
    */
    public static Document resolve(Document original, String base)
      throws XIncludeException, NullPointerException {

        if (original == null) {
          throw new NullPointerException("Document must not be null");
        }

        Document resultDocument = (Document) original.cloneNode(true);
        // This clone doesn't seem to include the DOCTYPE 
        // if there's an internal DTD subset????
        // Is this the correct behavior? No, a bug in Xerces 1.4.3
        Element resultRoot = resultDocument.getDocumentElement();
 
        // Should this method return a DocumentFragment instead of a
        // NodeList????
        NodeList resolved = resolve(resultRoot, base, resultDocument);
        // Check that this contains exactly one root element
        // and no Text, DocumentType, or other nodes
        int numberRoots = 0;
        for (int i = 0; i < resolved.getLength(); i++) {
            if (resolved.item(i) instanceof Comment
              || resolved.item(i) instanceof ProcessingInstruction) {
                continue;
            }
            else if (resolved.item(i) instanceof Element) numberRoots++;
            else if (resolved.item(i) instanceof Text) {
                throw new XIncludeException(
                  "Tried to include text node outside document element");
            }
            else {
                throw new XIncludeException(
                  // convert type to a string????
                  "Cannot include a " + resolved.item(i).getNodeType() + " node");
            }
        }
        if (numberRoots == 0) {
            throw new XIncludeException("No root element");
        }
        if (numberRoots > 1) {
            throw new XIncludeException("Tried to include multiple roots");
        }

        // insert nodes before the root
        int nodeIndex = 0;
        while (nodeIndex < resolved.getLength()) {
            if (resolved.item(nodeIndex) instanceof Element) break;
            resultDocument.insertBefore(resolved.item(nodeIndex), resultRoot);
            nodeIndex++;
        }

        // insert new root
        resultDocument.replaceChild(
          resolved.item(nodeIndex), resultRoot
        );
        nodeIndex++;

        //insert nodes after new root
        Node refNode = resultDocument.getDocumentElement().getNextSibling();
        if (refNode == null) {
            while (nodeIndex < resolved.getLength()) {
                resultDocument.appendChild(resolved.item(nodeIndex));
                nodeIndex++;
            }
        }
        else {
            while (nodeIndex < resolved.getLength()) {
                resultDocument.insertBefore(resolved.item(nodeIndex), refNode);
                nodeIndex++;
            }
        }

        return resultDocument;

    }

  /**
    * <p>
    * This method resolves a DOM <code>Element</code>
    * and merges in all XInclude references. This process is recursive.
    * The element returned contains no XInclude elements.
    * If a referenced document cannot be found it is replaced with
    * an error message. The <code>Element</code> object returned is a new element.
    * The original <code>Element</code> is not changed.
    * </p>
    *
    * @param original <code>Element</code> that will be processed
    * @param base     <code>String</code> form of the base URI against which
    *                 relative URLs will be resolved. This can be null if the
    *                 element includes an <code>xml:base</code> attribute.
    * @param resolved <code>Document</code> into which the resolved element will be placed.
    * @return NodeList the infoset that this element resolves to
    * @throws CircularIncludeException if this <code>Element</code> contains an XInclude element
    *                                  that attempts to include a document in which
    *                                  this element is directly or indirectly included.
    * @throws NullPointerException  if the <code>original</code> argument is null.
    */
    public static NodeList resolve(Element original, String base, Document resolved)
      throws XIncludeException, NullPointerException {

        if (original == null) {
          throw new NullPointerException(
           "You can't XInclude a null element."
          );
        }
        Stack bases = new Stack();
        if (base != null) bases.push(base);

        NodeList result = resolve(original, bases, resolved);
        // only pop what was pushed; an empty Stack throws EmptyStackException
        if (base != null) bases.pop();
        return result;

    }

    private static boolean isIncludeElement(Element element) {

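        // Constant-first comparisons avoid a NullPointerException for elements
        // with no namespace or no local name (DOM Level 1 nodes)
        if ("include".equals(element.getLocalName()) &&
            XINCLUDE_NAMESPACE.equals(element.getNamespaceURI())) {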
            return true;
        }
        return false;

    }


  /**
    * <p>
    * This method resolves a DOM <code>Element</code> into an infoset
    * and merges in all XInclude references. This process is recursive.
    * The returned infoset contains no XInclude elements.
    * If a referenced document cannot be found it is replaced with
    * an error message. The <code>NodeList</code> object returned is new.
    * The original <code>Element</code> is not changed.
    * </p>
    *
    * @param original <code>Element</code> that will be processed
    * @param bases    <code>Stack</code> containing the string forms of
    *                 all the URIs of documents which contain this element
    *                 through XIncludes. This is used to detect whether a
    *                 circular reference is being used.
    * @param resolved <code>Document</code> into which the resolved element will be placed.
    * @return NodeList the infoset into which this element resolves. This is just a copy
    *                  of the element if the element is not an XInclude element and does
    *                  not contain any XInclude elements.
    * @throws CircularIncludeException if this <code>Element</code> contains an XInclude element
    *                                  that attempts to include a document in which
    *                                  this element is directly or indirectly included.
    * @throws MissingHrefException if the <code>href</code> attribute is missing from an include element.
    * @throws MalformedResourceException if an included document is not namespace well-formed
    * @throws BadParseAttributeException if an <code>include</code> element has a <code>parse</code> attribute
    *                                    with any value other than <code>text</code> or <code>xml</code>
    * @throws UnavailableResourceException if the URL in the include element's
    *                                      <code>href</code> attribute cannot be loaded.
    * @throws XIncludeException if this document, though namespace well-formed,
    *                           violates one of the rules of XInclude.
    */
    private static NodeList resolve(Element original, Stack bases, Document resolved)
      throws CircularIncludeException, MissingHrefException, MalformedResourceException,
      BadParseAttributeException, UnavailableResourceException, XIncludeException {

        XIncludeNodeList result = new XIncludeNodeList();
        String base = null;
        if (bases.size() != 0) base = (String) bases.peek();
  
        if (isIncludeElement(original)) {
  
          // Verify that there is an href attribute
          if (!original.hasAttribute("href")) {
            throw new MissingHrefException("Missing href attribute");
          }
          String href = original.getAttribute("href");
  
          // Check for a base attribute
          String baseAttribute
            = original.getAttributeNS(XML_NAMESPACE, "base");
          if (baseAttribute != null && !baseAttribute.equals("")) {
              base = baseAttribute;
          }
  
          String remote;
          if (base != null) {
              try {
                  URL context = new URL(base);
                  URL u = new URL(context, href);
                  remote = u.toExternalForm();
              }
              catch (MalformedURLException ex) {
                  XIncludeException xex = new UnavailableResourceException(
                    "Unresolvable URL " + base + "/" + href);
                  xex.setRootCause(ex);
                  throw xex;
              }
          }
          else {
              remote = href;
          }
  
          // check for parse attribute; default is true
          boolean parse = true;
          if (original.hasAttribute("parse")) {
              String parseAttribute = original.getAttribute("parse");
              if (parseAttribute.equals("text")) {
                  parse = false;
              }
              else if (!parseAttribute.equals("xml")) {
                  throw new BadParseAttributeException(
                    parseAttribute + " is not a legal value for the parse attribute"
                  );
              }
          }
  
          if (parse) {
              // Stack.contains() compares with equals(), not ==, so an equal
              // URL string is detected even if it is a different String object
              if (bases.contains(remote)) {
                // need to figure out how to get file and number where
                // bad include occurs????
                  throw new CircularIncludeException(
                    "Circular XInclude Reference to " + remote);
              }
  
              try {
                  parser.parse(remote);
                  Document doc = parser.getDocument();
                  bases.push(remote);
                  // this method needs to remove DocType node if any
                  NodeList docChildren = doc.getChildNodes();
                  for (int i = 0; i < docChildren.getLength(); i++) {
                      Node node = docChildren.item(i);
                      if (node instanceof Element) {
                          result.add(resolve((Element) node, bases, resolved));
                      }
                      else if (node instanceof DocumentType) continue;
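                      // comments and processing instructions belong to the included
                      // document, so import them into the result document first
                      else result.add(resolved.importNode(node, true));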
                  }
                  bases.pop();
              }
              catch (SAXParseException e) {
                  int line = e.getLineNumber();
                  int column = e.getColumnNumber();
                  if (line <= 0) {
                      XIncludeException ex = new UnavailableResourceException("Document "
                        + remote + " was not found.");
                      ex.setRootCause(e);
                      throw ex;                        
                  }
                  else {
                      XIncludeException ex = new MalformedResourceException("Document "
                        + remote + " is not well-formed at line " + line + ", column " + column);
                      ex.setRootCause(e);
                      throw ex;
                  }
              }
              catch (SAXException e) {
                 XIncludeException ex = new MalformedResourceException("Document "
                   + remote + " is not well-formed.");
                 ex.setRootCause(e);
                 throw ex;
              }
              catch (IOException e) {
                  XIncludeException ex
                    = new UnavailableResourceException("Document not found: "
                    + remote);
                  ex.setRootCause(e);
                  throw ex;
              }
          }
          else { // insert text
              String encoding = original.getAttribute("encoding");
              String s = downloadTextDocument(remote, encoding);
              result.add(resolved.createTextNode(s));
          }
  
        }
        // not an include element
        else { // recursively process children
           // still need to adjust bases here????
           // replace nodes instead
           // Do I need to explicitly attach attributes here or does
           // importing take care of that????
           Element copy = (Element) resolved.importNode(original, false);
           NodeList children = original.getChildNodes();
           for (int i = 0; i < children.getLength(); i++) {
             Node n = children.item(i);
             if (n instanceof Element) {
               Element e = (Element) n;
               NodeList kids = resolve(e, bases, resolved);
               for (int j = 0; j < kids.getLength(); j++) {
                   copy.appendChild(kids.item(j));
               }
             }
             else {
               copy.appendChild(resolved.importNode(n, true));
             }
           }
           result.add(copy);
        }
  
        return result;

    }

  /**
    * <p>
    * This utility method reads a document at a specified URL
    * and returns the contents of that document as a <code>String</code>.
    * It's used to include files with <code>parse="text"</code>.
    * </p>
    *
    * @param url      URL of the document whose contents will be
    *                 returned as a <code>String</code>.
    * @param  encoding Encoding of the document; e.g. UTF-8,
    *                  ISO-8859-1, etc. If this is null or the empty string
    *                  then UTF-8 is guessed. 
    * @return String  The document retrieved from the source <code>URL</code>
    * @throws UnavailableResourceException if the requested document cannot
    *                                      be downloaded from the specified URL.
    */
    private static String downloadTextDocument(String url, String encoding)
      throws UnavailableResourceException {

        if (encoding == null || encoding.equals("")) {
            encoding = "UTF-8";  
            // should try to read encoding from HTTP header
            // and XML declaration heuristics     
        }
        URL source;
        try {
            source = new URL(url);
        }
        catch (MalformedURLException e) {
            UnavailableResourceException ex =
              new UnavailableResourceException("Unresolvable URL " + url);
            ex.setRootCause(e);
            throw ex;
        }

        StringBuffer s = new StringBuffer();
        try {
            URLConnection uc = source.openConnection();
            InputStream in = new BufferedInputStream(uc.getInputStream());
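            // getContentEncoding() reports the Content-Encoding header (gzip, etc.),
            // not the character set; the charset is a parameter of Content-Type
            String contentType = uc.getContentType();
            String encodingFromHeader = null;
            if (contentType != null) {
                int start = contentType.toLowerCase().indexOf("charset=");
                if (start >= 0) {
                    String value = contentType.substring(start + 8);
                    int end = value.indexOf(';');
                    if (end >= 0) value = value.substring(0, end);
                    encodingFromHeader = value.replace('"', ' ').trim();
                }
            }
            if (encodingFromHeader != null && encodingFromHeader.length() > 0) {
                encoding = encodingFromHeader;
            }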
            else {
                // What if file does not have a MIME type but name ends in .xml????
                // MIME types are case-insensitive
                // Java may be picking this up from file URL
                if (contentType != null) {
                    contentType = contentType.toLowerCase();
                    if (contentType.equals("text/xml") 
                      || contentType.equals("application/xml")   
                      || (contentType.startsWith("text/") && contentType.endsWith("+xml") ) 
                      || (contentType.startsWith("application/") && contentType.endsWith("+xml"))) {
                         encoding = EncodingHeuristics.readEncodingFromStream(in);
                    }
                }
            }
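            // Read through the reader so the encoding is actually applied;
            // reading the raw stream would turn each byte into a char
            InputStreamReader reader = new InputStreamReader(in, encoding);
            int c;
            while ((c = reader.read()) != -1) {
                s.append((char) c);
            }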
            return s.toString();
        }
        catch (UnsupportedEncodingException e) {
            UnavailableResourceException ex = new UnavailableResourceException(
              "Encoding not recognized for document " + source.toExternalForm());
            ex.setRootCause(e);
            throw ex;
        }
        catch (IOException e) {
            UnavailableResourceException ex = new UnavailableResourceException(
              "Document not found: " + source.toExternalForm());
            ex.setRootCause(e);
            throw ex;
        }

    }

    /**
      * <p>
      * The driver method for the XIncluder program.
      * I'll probably move this to a separate class soon.
      * </p>
      *
      * @param args  contains the URLs and/or filenames
      *              of the documents to be processed.
      */
    public static void main(String[] args) {

        DOMParser parser = new DOMParser();
        for (int i = 0; i < args.length; i++) {
            try {
                parser.parse(args[i]);
                Document input = parser.getDocument();
                // absolutize URL
                String base = args[i];
                if (base.indexOf(':') < 0) {
                  File f = new File(base);
                  base = f.toURL().toExternalForm();
                }
                Document output = resolve(input, base);
                // need to set encoding on this to Latin-1 and check what
                // happens to UTF-8 curly quotes
                OutputFormat format = new OutputFormat("XML", "ISO-8859-1", false);
                format.setPreserveSpace(true);
                XMLSerializer serializer
                 = new XMLSerializer(System.out, format);
                serializer.serialize(output);
            }
            catch (Exception e) {
                System.err.println(e);
                e.printStackTrace();
            }
        }

    }

}


// I need to create NodeLists in a parser independent fashion
class XIncludeNodeList implements NodeList {

    private List data = new ArrayList();

// could easily expose more List methods if they seem useful
    public void add(int index, Node node) {
        data.add(index, node);
    }

    public void add(Node node) {
        data.add(node);
    }

    public void add(NodeList nodes) {
        for (int i = 0; i < nodes.getLength(); i++) {
            data.add(nodes.item(i));
        }
    }

    public Node item(int index) {
        return (Node) data.get(index);
    }

// copy DOM JavaDoc
    public int getLength() {
        return data.size();
    }

}
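
Circular include detection, sketched on its own

The listing above guards against circular references by keeping a stack of the URLs currently being included and checking it before each parse. The following is a minimal sketch of that one idea, not the XIncluder code itself. The class CircularIncludeDemo, its in-memory map of documents, and the use of IllegalStateException in place of CircularIncludeException are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CircularIncludeDemo {

    // Hypothetical stand-in for the network: maps a document URL
    // to the URLs that document includes.
    private final Map<String, List<String>> includes = new HashMap<String, List<String>>();

    public void addDocument(String url, String... included) {
        includes.put(url, Arrays.asList(included));
    }

    // Returns the URLs in the order they are visited. Throws
    // IllegalStateException if a document includes itself,
    // directly or indirectly.
    public List<String> resolve(String url) {
        List<String> visited = new ArrayList<String>();
        resolve(url, new ArrayDeque<String>(), visited);
        return visited;
    }

    private void resolve(String url, Deque<String> bases, List<String> visited) {
        // contains() compares with equals(), so equal strings match
        if (bases.contains(url)) {
            throw new IllegalStateException("Circular XInclude reference to " + url);
        }
        bases.push(url);      // this document is now "open"
        visited.add(url);
        List<String> children = includes.get(url);
        if (children == null) children = Collections.<String>emptyList();
        for (String child : children) {
            resolve(child, bases, visited);
        }
        bases.pop();          // done with this document
    }

    public static void main(String[] args) {
        CircularIncludeDemo demo = new CircularIncludeDemo();
        demo.addDocument("a.xml", "b.xml");
        demo.addDocument("b.xml", "a.xml");
        try {
            demo.resolve("a.xml");
        }
        catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because only the documents currently open are on the stack, including the same document twice from two different places is allowed; only a document that reaches back to one of its own ancestors is rejected. The sketch compares URL strings as given, so two different spellings of the same URL would not be recognized as a cycle.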

To Learn More

XInclude Specification: http://www.w3.org/TR/xinclude

Part II: XML Base

What is XML Base?

An in-band means of specifying the proper URI for a document that can succeed even if out-of-band mechanisms aren't available.
A means of specifying the base URI against which relative URLs are resolved, even if the document itself is copied to a different location.
An XML replacement for the HTML BASE element
W3C Recommendation, June 27, 2001

The xml:base attribute

<slide xml:base="http://www.ibiblio.org/xml/slides/xmloneaustin2001/xlinks/">
  <title>The xml:base attribute</title>
  ...
  <previous xlink:type="simple" xlink:href="What_Is_XBase.xml"/>
  <next xlink:type="simple" xlink:href="xbaseexample.xml"/>
</slide>

May be attached to any element to set the base URI for that element and its descendants
The xml prefix is automatically bound to the http://www.w3.org/XML/1998/namespace URI
The value should be an absolute URI
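
Finding the base URI in effect for an element

Since xml:base applies to an element and its descendants, a program can find the base in effect by walking from the element toward the root. This is a rough sketch using the standard JAXP and DOM APIs. The class name EffectiveBaseDemo, its methods, and the shortened sample document are assumptions for illustration, and the sketch simply returns the nearest xml:base value without combining relative values with outer ones.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class EffectiveBaseDemo {

    static final String XML_NAMESPACE = "http://www.w3.org/XML/1998/namespace";

    // Walks from the element toward the root and returns the nearest
    // xml:base value, or the fallback if no ancestor declares one.
    public static String effectiveBase(Element element, String fallback) {
        for (Node n = element; n instanceof Element; n = n.getParentNode()) {
            String base = ((Element) n).getAttributeNS(XML_NAMESPACE, "base");
            if (base != null && !base.isEmpty()) return base;
        }
        return fallback;
    }

    // Parses a string with namespace support turned on so that
    // xml:base can be looked up by namespace URI and local name.
    public static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)));
        }
        catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document: " + e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<slide xml:base='http://www.ibiblio.org/xml/slides/xmloneaustin2001/xlinks/'>"
          + "<title>The xml:base attribute</title>"
          + "<next href='xbaseexample.xml'/>"
          + "</slide>");
        Element next = (Element) doc.getElementsByTagName("next").item(0);
        System.out.println(effectiveBase(next, "file:///unknown/"));
    }
}
```

The fallback argument plays the role of the document's own URI, which is what applies when no xml:base attribute is found.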

XML Base Example

<COURSE xmlns:xlink="http://www.w3.org/1999/xlink"
         xml:base="http://www.ibiblio.org/javafaq/course/"
         xlink:type="extended">

  <TOC xlink:type="locator" xlink:href="index.html" xlink:label="index"/>

  <CLASS xlink:type="locator" xlink:label="class"
         xlink:href="week1.xml"/>
  <CLASS xlink:type="locator" xlink:label="class" 
         xlink:href="week2.xml"/>
  <CLASS xlink:type="locator" xlink:label="class" 
         xlink:href="week3.xml"/>
  <CLASS xlink:type="locator" xlink:label="class" 
         xlink:href="week4.xml"/>
  <CLASS xlink:type="locator" xlink:label="class" 
         xlink:href="week5.xml"/>
  <CLASS xlink:type="locator" xlink:label="class" 
         xlink:href="week6.xml"/>
  <CLASS xlink:type="locator" xlink:label="class" 
         xlink:href="week7.xml"/>
  <CLASS xlink:type="locator" xlink:label="class" 
         xlink:href="week8.xml"/>
  <CLASS xlink:type="locator" xlink:label="class" 
         xlink:href="week9.xml"/>
  <CLASS xlink:type="locator" xlink:label="class" 
         xlink:href="week10.xml"/> 
  <CLASS xlink:type="locator" xlink:label="class" 
         xlink:href="week11.xml"/> 
  <CLASS xlink:type="locator" xlink:label="class" 
         xlink:href="week12.xml"/>
  <CLASS xlink:type="locator" xlink:label="class" 
         xlink:href="week13.xml"/>
  
  <CONNECTION xlink:type="arc" xlink:from="index" xlink:to="class"/>
  <CONNECTION xlink:type="arc" xlink:from="class" xlink:to="index"/>
  
</COURSE>

"index.html" now resolves to the URI "http://www.ibiblio.org/javafaq/course/index.html"
"week1.xml" resolves to the URI "http://www.ibiblio.org/javafaq/course/week1.xml"
"week2.xml" resolves to the URI "http://www.ibiblio.org/javafaq/course/week2.xml"
"week3.xml" resolves to the URI "http://www.ibiblio.org/javafaq/course/week3.xml"
etc.
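
Checking the resolution in Java

The resolutions listed above can be reproduced with the standard java.net.URI class. This is a small sketch for illustration; the class name BaseResolutionDemo and its resolve helper are assumptions, not part of any XML Base API.

```java
import java.net.URI;

public class BaseResolutionDemo {

    // Resolves a relative reference against a base URI
    // using the rules implemented by java.net.URI.
    public static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.ibiblio.org/javafaq/course/";
        String[] hrefs = {"index.html", "week1.xml", "week2.xml", "week3.xml"};
        for (int i = 0; i < hrefs.length; i++) {
            System.out.println(hrefs[i] + " -> " + resolve(base, hrefs[i]));
        }
    }
}
```

The trailing slash on the base matters. Without it, the last path segment is treated as a file name and replaced, so "week1.xml" resolved against "http://www.ibiblio.org/javafaq/course" becomes "http://www.ibiblio.org/javafaq/week1.xml".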

Open Issues

How does it interact with XHTML, in particular the XHTML base element?
Browser and other application support?

To Learn More

XML Base Specification: http://www.w3.org/TR/xmlbase

To Learn More

This presentation: http://www.ibiblio.org/xml/slides/xmlonesanjose2001/xinclude
XML Base Specification: http://www.w3.org/TR/xmlbase
XInclude Specification: http://www.w3.org/TR/xinclude
XPath Specification: http://www.w3.org/TR/xpath
XPointer Specification: http://www.w3.org/TR/xptr
XML Bible, Gold edition
- Elliotte Rusty Harold
- Hungry Minds, 2001
- ISBN 0-7645-4819-0
