If you want all the attributes of an element, or you don’t know their names, you can use the getAttributes() method, which Element inherits from Node, to retrieve a NamedNodeMap. (Why getAttributes() is in Node instead of Element I have no idea. Elements are the only kind of node that can have attributes. For all other types of node, getAttributes() returns null.) The NamedNodeMap interface, summarized in Example 11.5, has methods to get and set the various named nodes as well as to iterate through the nodes like a list. Here it’s used for attributes, but soon you’ll see it used for notations and entities as well.
Example 11.5. The NamedNodeMap interface
package org.w3c.dom;

public interface NamedNodeMap {

  // for iterating through the map as a list
  public Node item(int index);
  public int  getLength();

  // For working with particular items in the list
  public Node getNamedItem(String name);
  public Node setNamedItem(Node arg) throws DOMException;
  public Node removeNamedItem(String name)
   throws DOMException;
  public Node getNamedItemNS(String namespaceURI,
   String localName);
  public Node setNamedItemNS(Node arg) throws DOMException;
  public Node removeNamedItemNS(String namespaceURI,
   String localName) throws DOMException;

}
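As a rough sketch of how the list-style methods item() and getLength() might be used, the following class collects every attribute of an element without knowing any names in advance. The class name AttributeLister and its helper methods are hypothetical, invented for this illustration; only the DOM and JAXP calls come from the standard API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the book's examples
class AttributeLister {

  // Returns one "name=value" string per attribute. DOM makes no
  // promise about attribute order, so callers should not rely on it.
  static List<String> listAttributes(Element element) {
    List<String> result = new ArrayList<String>();
    NamedNodeMap attributes = element.getAttributes();
    for (int i = 0; i < attributes.getLength(); i++) {
      // Every node in an element's attribute map is an Attr
      Attr attribute = (Attr) attributes.item(i);
      result.add(attribute.getName() + "=" + attribute.getValue());
    }
    return result;
  }

  // Parses a document held in a string and returns its root element.
  // Failures are rethrown unchecked to keep the sketch short.
  static Element parseRoot(String xml) {
    try {
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      return factory.newDocumentBuilder()
       .parse(new InputSource(new StringReader(xml)))
       .getDocumentElement();
    }
    catch (Exception e) {
      throw new IllegalArgumentException("Could not parse " + xml, e);
    }
  }

  public static void main(String[] args) {
    Element book = parseRoot(
     "<book id='b1' lang='en'>Processing XML with Java</book>");
    for (String attribute : listAttributes(book)) {
      System.out.println(attribute);
    }
  }

}
```

Note that namespace declarations such as xmlns:xlink are also reported as attributes by a DOM parser, so a real program may want to filter them out.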
I’ll demonstrate with an XLink spider program like the one you saw in Chapter 6. However, this time I’ll implement the program on top of DOM rather than SAX. You can judge for yourself which one is more natural.
Recall that XLink is an attribute-based syntax for denoting connections between documents. The element that is the link has an xlink:type attribute with the value simple and an xlink:href attribute whose value is the URL of the remote document. For example, this book element points to this book’s home page:
<book xlink:type="simple"
      xlink:href="http://www.cafeconleche.org/books/xmljava/"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  Processing XML with Java
</book>
The customary prefix xlink is bound to the namespace URI http://www.w3.org/1999/xlink. Most of the time you should depend on the specific URI and not the prefix, which may change.
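To make the difference concrete, here is a small sketch that reads the link target two ways from a document that binds the XLink namespace to the unusual prefix xl. The class name PrefixIndependentLink and its methods are hypothetical names chosen for this illustration. The lookup by namespace URI finds the attribute; the lookup by the qualified name xlink:href does not.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the book's examples
class PrefixIndependentLink {

  static final String XLINK_NAMESPACE
   = "http://www.w3.org/1999/xlink";

  // Parses a document held in a string and returns its root element.
  // Failures are rethrown unchecked to keep the sketch short.
  static Element parseRoot(String xml) {
    try {
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      // getAttributeNS() needs a namespace-aware parser
      factory.setNamespaceAware(true);
      return factory.newDocumentBuilder()
       .parse(new InputSource(new StringReader(xml)))
       .getDocumentElement();
    }
    catch (Exception e) {
      throw new IllegalArgumentException("Could not parse " + xml, e);
    }
  }

  // Depends only on the namespace URI and the local name
  static String hrefByNamespace(Element element) {
    return element.getAttributeNS(XLINK_NAMESPACE, "href");
  }

  // Depends on the author having chosen the prefix xlink
  static String hrefByPrefix(Element element) {
    return element.getAttribute("xlink:href");
  }

  public static void main(String[] args) {
    Element book = parseRoot(
     "<book xl:type='simple'"
     + " xl:href='http://www.cafeconleche.org/books/xmljava/'"
     + " xmlns:xl='http://www.w3.org/1999/xlink'/>");
    System.out.println("By namespace: " + hrefByNamespace(book));
    System.out.println("By prefix:    " + hrefByPrefix(book));
  }

}
```

Both methods return the empty string when the attribute is missing, which is why the second line of output shows no URL.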
Relative URLs are resolved against the nearest xml:base attribute, on the element itself or on one of its ancestors, if one is present, or against the location of the document otherwise. For example, the book element in this library element also points to http://www.cafeconleche.org/books/xmljava/:
<library xml:base="http://www.cafeconleche.org/"
         xmlns:xlink="http://www.w3.org/1999/xlink">
  <book xlink:type="simple" xlink:href="books/xmljava/">
    Processing XML with Java
  </book>
</library>
The prefix xml is bound to the namespace URI http://www.w3.org/XML/1998/namespace. This is a special case, however. The xml prefix cannot be changed, and does not need to be declared.
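The lookup rule for xml:base can be sketched in a few lines. The class name XmlBaseResolver and its methods are hypothetical, and the sketch makes a simplifying assumption: each xml:base value is itself an absolute URL. The XML Base specification also allows relative xml:base values, which this code does not handle.

```java
import java.io.StringReader;
import java.net.URI;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the book's examples
class XmlBaseResolver {

  static final String XML_NAMESPACE
   = "http://www.w3.org/XML/1998/namespace";
  static final String XLINK_NAMESPACE
   = "http://www.w3.org/1999/xlink";

  // Walks from the element up through its ancestors and returns the
  // first xml:base value found, or the document's own URI if none.
  static String nearestBase(Element element, String documentURI) {
    for (Node node = element; node instanceof Element;
         node = node.getParentNode()) {
      String base
       = ((Element) node).getAttributeNS(XML_NAMESPACE, "base");
      if (!base.isEmpty()) return base;
    }
    return documentURI;
  }

  // Resolves the element's xlink:href against its nearest base.
  // Assumes the base is an absolute URL.
  static String resolve(Element link, String documentURI) {
    String href = link.getAttributeNS(XLINK_NAMESPACE, "href");
    URI base = URI.create(nearestBase(link, documentURI));
    return base.resolve(href).toString();
  }

  // Parses a document held in a string; failures are rethrown
  // unchecked to keep the sketch short.
  static Document parse(String xml) {
    try {
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      return factory.newDocumentBuilder()
       .parse(new InputSource(new StringReader(xml)));
    }
    catch (Exception e) {
      throw new IllegalArgumentException("Could not parse " + xml, e);
    }
  }

  public static void main(String[] args) {
    Document document = parse(
     "<library xml:base='http://www.cafeconleche.org/'"
     + " xmlns:xlink='http://www.w3.org/1999/xlink'>"
     + "<book xlink:type='simple' xlink:href='books/xmljava/'>"
     + "Processing XML with Java</book></library>");
    Element book
     = (Element) document.getElementsByTagName("book").item(0);
    // The second argument stands in for the document's location
    // prints http://www.cafeconleche.org/books/xmljava/
    System.out.println(
     resolve(book, "http://www.example.com/library.xml"));
  }

}
```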
Attributes provide all the information needed to process the link. Consequently, a spider can follow XLinks without knowing any details about the rest of the markup in the document. Example 11.6 is such a program. Currently this spider does nothing more than follow the links and print their URLs. However, it would not be hard to add code to load the discovered documents into a database or perform some other useful operation. You’d just subclass DOMSpider while overriding the process() method.
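As one illustration of the kind of work an overriding process() method might delegate to, the sketch below counts the simple XLinks in a parsed document and builds a one-line summary. The class LinkSummary and its methods are hypothetical; a DOMSpider subclass could call summarize() from process() with the Document and URI it receives, but nothing here is part of the spider itself.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the book's examples
class LinkSummary {

  static final String XLINK_NAMESPACE
   = "http://www.w3.org/1999/xlink";

  // Counts the elements that carry xlink:type="simple"
  static int countSimpleLinks(Document document) {
    // "*" matches every element in the document
    NodeList elements = document.getElementsByTagName("*");
    int count = 0;
    for (int i = 0; i < elements.getLength(); i++) {
      Element element = (Element) elements.item(i);
      String type = element.getAttributeNS(XLINK_NAMESPACE, "type");
      if (type.equals("simple")) count++;
    }
    return count;
  }

  // Builds the line a process() override might print or store
  static String summarize(Document document, String uri) {
    return uri + " (" + countSimpleLinks(document)
     + " simple links)";
  }

  // Parses a document held in a string; failures are rethrown
  // unchecked to keep the sketch short.
  static Document parse(String xml) {
    try {
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      return factory.newDocumentBuilder()
       .parse(new InputSource(new StringReader(xml)));
    }
    catch (Exception e) {
      throw new IllegalArgumentException("Could not parse " + xml, e);
    }
  }

  public static void main(String[] args) {
    Document document = parse(
     "<library xmlns:xlink='http://www.w3.org/1999/xlink'>"
     + "<book xlink:type='simple' xlink:href='a.xml'/>"
     + "<book xlink:type='simple' xlink:href='b.xml'/>"
     + "<note/></library>");
    // The URI is a stand-in for the one the spider would pass
    System.out.println(
     summarize(document, "http://www.example.com/library.xml"));
  }

}
```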
Example 11.6. An XLink spider that uses DOM
import org.xml.sax.SAXException;
import javax.xml.parsers.*;
import java.io.*;
import java.util.*;
import java.net.*;
import org.w3c.dom.*;


public class DOMSpider {

  public static String XLINK_NAMESPACE
   = "http://www.w3.org/1999/xlink";

  // This will be used to read all the documents. We could use
  // multiple parsers in parallel. However, it's a lot easier
  // to work in a single thread, and doing so puts some real
  // limits on how much bandwidth this program will eat.
  private DocumentBuilder parser;

  // Builds the parser
  public DOMSpider() throws ParserConfigurationException {

    try {
      DocumentBuilderFactory factory
       = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      parser = factory.newDocumentBuilder();
    }
    catch (FactoryConfigurationError e) {
      // I don't absolutely need to catch this, but I hate to
      // throw an Error for no good reason.
      throw new ParserConfigurationException(
       "Could not locate a factory class");
    }

  }

  // store the URLs already visited
  private Vector visited = new Vector();

  // Limit the amount of bandwidth this program uses
  private int maxDepth = 5;
  private int currentDepth = 0;

  public void spider(String systemID) {

    currentDepth++;
    try {
      if (currentDepth < maxDepth) {
        Document document = parser.parse(systemID);
        process(document, systemID);

        Vector toBeVisited = new Vector();
        // search the document for uris,
        // store them in vector, and print them
        findLinks(document.getDocumentElement(),
         toBeVisited, systemID);

        Enumeration e = toBeVisited.elements();
        while (e.hasMoreElements()) {
          String uri = (String) e.nextElement();
          visited.add(uri);
          spider(uri);
        }

      }

    }
    catch (SAXException e) {
      // Couldn't load the document,
      // probably not well-formed XML, skip it
    }
    catch (IOException e) {
      // Couldn't load the document,
      // likely network failure, skip it
    }
    finally {
      currentDepth--;
      System.out.flush();
    }

  }

  public void process(Document document, String uri) {
    System.out.println(uri);
  }

  // Recursively descend the tree of one document
  private void findLinks(Element element, List uris,
   String base) {

    // Check for an xml:base attribute
    String baseAtt = element.getAttribute("xml:base");
    if (!baseAtt.equals("")) base = baseAtt;

    // look for XLinks in this element
    if (isSimpleLink(element)) {
      String uri
       = element.getAttributeNS(XLINK_NAMESPACE, "href");
      if (!uri.equals("")) {
        try {
          String wholePage = absolutize(base, uri);
          if (!visited.contains(wholePage)
           && !uris.contains(wholePage)) {
            uris.add(wholePage);
          }
        }
        catch (MalformedURLException e) {
          // If it's not a good URL, then we can't spider it
          // anyway, so just drop it on the floor.
        }
      } // end if
    } // end if

    // process child elements recursively
    NodeList children = element.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node node = children.item(i);
      int type = node.getNodeType();
      if (type == Node.ELEMENT_NODE) {
        findLinks((Element) node, uris, base);
      }
    } // end for

  }

  // If you're willing to require Java 1.4, you can do better
  // than this with the new java.net.URI class
  private static String absolutize(String context, String uri)
   throws MalformedURLException {

    URL contextURL = new URL(context);
    URL url = new URL(contextURL, uri);
    // Remove fragment identifier if any
    String wholePage = url.toExternalForm();
    int fragmentSeparator = wholePage.indexOf('#');
    if (fragmentSeparator != -1) {
      // There is a fragment identifier
      wholePage = wholePage.substring(0, fragmentSeparator);
    }
    return wholePage;

  }

  private static boolean isSimpleLink(Element element) {

    String type
     = element.getAttributeNS(XLINK_NAMESPACE, "type");
    if (type.equals("simple")) return true;
    return false;

  }

  public static void main(String[] args) {

    if (args.length == 0) {
      System.out.println("Usage: java DOMSpider topURL");
      return;
    }

    // start parsing...
    try {
      DOMSpider spider = new DOMSpider();
      spider.spider(args[0]);
    }
    catch (Exception e) {
      System.err.println(e);
      e.printStackTrace();
    }

  } // end main

} // end DOMSpider
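The comment above absolutize() mentions that java.net.URI offers a better route. One possible version is sketched below; it is an assumption about what such a rewrite might look like, not code from the spider, and the class name UriAbsolutizer is hypothetical. Unlike the URL-based version, it reports a malformed URI with an unchecked IllegalArgumentException.

```java
import java.net.URI;

// Hypothetical helper class, not part of the book's examples
class UriAbsolutizer {

  // Resolves uri against context and drops any fragment identifier.
  // URI.create() and resolve(String) throw IllegalArgumentException
  // if either string is not a syntactically valid URI.
  static String absolutize(String context, String uri) {
    URI resolved = URI.create(context).resolve(uri);
    String wholePage = resolved.toString();
    // In a URI string, '#' can only introduce the fragment
    int fragmentSeparator = wholePage.indexOf('#');
    if (fragmentSeparator != -1) {
      wholePage = wholePage.substring(0, fragmentSeparator);
    }
    return wholePage;
  }

  public static void main(String[] args) {
    // prints http://www.cafeconleche.org/books/xmljava/
    System.out.println(absolutize(
     "http://www.cafeconleche.org/", "books/xmljava/#contents"));
  }

}
```

Because resolve() also normalizes the path, a relative reference such as ../c.xml comes back without the dot segments, which makes the duplicate check against the visited list a little more reliable.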
There are two levels of recursion here. The spider() method recursively spiders documents. The findLinks() method recursively searches through the elements in a document looking for XLinks, and adds the URLs found in these links to a list of unvisited pages. After findLinks() finishes with a document, spider() retrieves each URL from that list and spiders it in turn. If it’s an XML document, then it is parsed and passed to the process() method. Non-XML documents found at the end of XLinks are ignored. The maxDepth field limits how many links deep the recursion goes.
I tested this program by pointing it at the Resource Directory Description Language specification, which is one of the few real-world documents I know of that uses XLinks. I was surprised to find out just how much XLinked XML there is out there in the world, though as of yet most of it is just more XML specifications. This must be what the Web felt like circa 1991. Here’s a sample of the more interesting output:
D:\books\XMLJAVA>java DOMSpider http://www.rddl.org/
http://www.rddl.org/
http://www.rddl.org/purposes
http://www.rddl.org/purposes/software
http://www.rddl.org/rddl.rdfs
http://www.rddl.org/rddl-integration.rxg
http://www.rddl.org/modules/rddl-1.rxm
…
http://www.w3.org/2001/XMLSchema
http://www.w3.org/2001/XMLSchema.xsd
http://www.examplotron.org
http://www.examplotron.org/compile.xsl
http://www.examplotron.org/examplotron.xsd
http://www.examplotron.org/0/1/
http://www.examplotron.org/0/2/
http://www.examplotron.org/0/3/
http://webns.net/rdfs/
http://www.w3.org/2000/01/rdf-schema
http://webns.net/rdfs/?format=rdf
http://webns.net/foaf/
http://xmlns.com/foaf/0.1/
http://webns.net/foaf/?format=rdf
http://webns.net/dc/
http://purl.org/dc/elements/1.1/
http://webns.net/dc/?format=rdf
http://openhealth.org/XSet
http://xsltunit.org/0/1/
http://xsltunit.org/0/1/xsltunit.xsl
http://xsltunit.org/0/1/tst_library.xsl
http://xsltunit.org/0/1/library.xml
http://xsltunit.org/0/1/library.xsl
http://venetica.com/venicebridgecontent/
http://www.venetica.com/VeniceBridgeContent
http://www.venetica.com/VeniceBridgeContent/VeniceBridgeContent40.xsd
http://www.venetica.com/VeniceBridgeContent/VeniceBridgeContent.biz
http://www.venetica.com/VeniceBridgeContent/rddl30.html
http://www.w3.org/TR/xhtml-basic
http://www.w3.org/TR/xml-infoset/
http://www.w3.org/TR/xhtml-modularization/
Copyright 2001, 2002 Elliotte Rusty Harold | elharo@metalab.unc.edu | Last Modified February 26, 2002 |