Although there are many different kinds of XPath expressions, the one that’s of primary use in Java programs is the location path. A location path selects a set of nodes from an XML document. Each location path is composed of one or more location steps. Each location step has an axis, a node test and, optionally, one or more predicates. Furthermore, each location step is evaluated with respect to a particular context node. A double colon (::) separates the axis from the node test, and each predicate is enclosed in square brackets.
Some examples will help explain all these terms. Consider the simple XML-RPC request document in Example 16.3.
Example 16.3. An XML-RPC request document
<?xml version="1.0"?>
<methodCall>
  <methodName>calculateFibonacci</methodName>
  <params>
    <param>
      <value>
        <int>23</int>
      </value>
    </param>
  </params>
</methodCall>
Exactly how the context node for a location step is determined depends on the environment in which the location step appears. When using XPath in Java code, you normally pass the context node as an argument to the method that evaluates the expression. In XSLT the context node is normally the currently matched node in the input document. In other environments, other means are provided to choose the context node. For now, let’s just pick the root methodCall element as the context node. Then child::methodName is a location step that selects a node-set containing the single methodName element. It moves along the child axis with the node test methodName. That is, it selects all the children of the context node named methodName. child::params returns a node-set containing the single params element.
Location paths are not guaranteed to return a node-set that contains exactly one node (and assuming they do is a very common mistake). child::* returns a node-set containing two element nodes, one for the methodName element and one for the params element. The asterisk is a wild card node test that matches any element, regardless of name.
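The following is a minimal sketch of these location steps in Java. It assumes the standard javax.xml.xpath API that ships with the JDK, which is not necessarily the API the rest of this chapter uses; the class name ChildStepDemo and the helper method select are hypothetical names chosen for illustration. The root methodCall element is passed in as the context node.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ChildStepDemo {
    // The XML-RPC request from Example 16.3
    static final String XML = String.join("\n",
        "<?xml version=\"1.0\"?>",
        "<methodCall>",
        "  <methodName>calculateFibonacci</methodName>",
        "  <params>",
        "    <param>",
        "      <value>",
        "        <int>23</int>",
        "      </value>",
        "    </param>",
        "  </params>",
        "</methodCall>");

    // Evaluates one location step with the root methodCall element as the
    // context node and returns the names of the selected nodes, space separated.
    static String select(String step) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            Node context = doc.getDocumentElement();
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(step, context, XPathConstants.NODESET);
            StringBuilder names = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                if (i > 0) names.append(' ');
                names.append(nodes.item(i).getNodeName());
            }
            return names.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select("child::methodName")); // methodName
        System.out.println(select("child::params"));     // params
        System.out.println(select("child::*"));          // methodName params
    }
}
```

Because the result is requested as a node-set, the code loops over it rather than assuming that exactly one node comes back.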
There are thirteen axes along which a location step can move. Each selects a different subset of the nodes in the document, depending on the context node. These are:
self
The node itself.
child
All child nodes of the context node. (Attributes and namespaces are not considered to be children of the node they belong to.)
descendant
All nodes completely contained inside the context node (between the end of its start-tag and the beginning of its end-tag); that is, all child nodes, plus all children of the child nodes, and all children of the children’s children, and so forth. This axis is empty if the context node is not an element node or a root node.
descendant-or-self
All descendants of the context node and the context node itself.
parent
The node which most immediately contains the context node. The root node has no parent. The parent of the root element and comments and processing instructions in the document’s prolog and epilog is the root node. The parent of every other node is an element node. The parent of a namespace or attribute node is the element node that contains it, even though namespaces and attributes aren’t children of their parent elements.
ancestor
The root node and all element nodes that contain the context node.
ancestor-or-self
All ancestors of the context node and the context node itself.
preceding
All non-attribute, non-namespace nodes which come before the context node in document order and which are not ancestors of the context node.
preceding-sibling
All non-attribute, non-namespace nodes which come before the context node in document order and have the same parent node.
following
All non-attribute, non-namespace nodes which follow the context node in document order and which are not descendants of the context node.
following-sibling
All non-attribute, non-namespace nodes which follow the context node in document order and have the same parent node.
attribute
Attributes of the context node. This axis is empty if the context node is not an element node.
namespace
Namespaces in scope on the context node. This axis is empty if the context node is not an element node.
For example, consider the slightly more complex SOAP request document in Example 16.4. Let us pick the middle Quote element (the one whose symbol is AAPL) as the context node and move along each of the axes from there.
Example 16.4. A SOAP request document
<?xml version="1.0"?>
<!-- XPath axes example -->
<SOAP-ENV:Envelope
 xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
 xmlns="http://namespaces.cafeconleche.org/xmljava/ch2/">
  <SOAP-ENV:Body>
    <Quote symbol="RHAT">
      <Price currency="USD">7.02</Price>
    </Quote>
    <Quote symbol="AAPL">
      <Price currency="USD">24.85</Price>
    </Quote>
    <Quote symbol="BAC">
      <Price currency="USD">68.59</Price>
    </Quote>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
The self axis contains one node, the middle Quote element that was chosen as the context node.
The child axis contains three nodes: a text node containing white space, an element node with the local name Price, and another text node containing only white space, in that order. (All the white space counts, though there are ways to get rid of it or ignore it if you want to, as you’ll see later.)
The descendant axis contains four nodes: a text node containing white space, an element node with the local name Price, a text node with the value "24.85", and another text node containing only white space, in that order.
The descendant-or-self axis contains five nodes: an element node with the local name Quote, a text node containing white space, an element node with the local name Price, a text node with the value "24.85", and another text node containing only white space, in that order.
The parent axis contains a single element node with the local name Body.
The ancestor axis contains three nodes: an element node with the local name Body, an element node with the local name Envelope, and the root node in that order.
The ancestor-or-self axis contains four nodes: an element node with the local name Quote, an element node with the local name Body, an element node with the local name Envelope, and the root node in that order.
The preceding axis contains nine nodes: a text node containing only white space, another text node containing only white space, a text node containing the string 7.02, an element node named Price, another text node containing only white space, an element node named Quote, two more text nodes containing only white space (the one before the first Quote element inside Body and the one before the Body element inside Envelope), and a comment node, in that order. Note that ancestor elements and attribute and namespace nodes are not counted along the preceding axis.
The preceding-sibling axis contains three nodes: a text node containing white space, an element node with the name Quote and the symbol RHAT, and another text node containing only white space.
The following axis contains eight nodes: a text node containing only white space, a Quote element node, a text node containing only white space, a Price element node, a text node containing the string 68.59, and three text nodes containing only white space. Descendants are not included in the following axis.
The following-sibling axis contains three nodes: a text node containing white space, an element node with the name Quote and the symbol BAC, and another text node containing only white space.
The attribute axis contains one attribute node with the name symbol and the value AAPL.
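The namespace axis contains two namespace nodes, one with the name SOAP-ENV and the value http://schemas.xmlsoap.org/soap/envelope/ and the other with an empty string name and the value http://namespaces.cafeconleche.org/xmljava/ch2/. (Strictly speaking, XPath 1.0 also puts a namespace node for the predefined xml prefix in scope on every element; the counts given here leave it out.)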
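One way to check these counts is to ask an XPath processor directly. The sketch below assumes the JDK's javax.xml.xpath API and a namespace-aware DOM parser that preserves white space; the class name AxisCounts and the method count are hypothetical. It uses the node() node test so that every node along an axis is counted, and it skips the namespace axis because the number reported there depends on whether the processor exposes the implicit xml namespace node. Counting by hand from the document, the preceding axis holds nine nodes once the white space between the Envelope and Body start-tags is included.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class AxisCounts {
    static final String SOAP = "http://schemas.xmlsoap.org/soap/envelope/";
    static final String STK = "http://namespaces.cafeconleche.org/xmljava/ch2/";
    // The SOAP request from Example 16.4, white space included
    static final String XML = String.join("\n",
        "<?xml version=\"1.0\"?>",
        "<!-- XPath axes example -->",
        "<SOAP-ENV:Envelope xmlns:SOAP-ENV=\"" + SOAP + "\" xmlns=\"" + STK + "\">",
        "  <SOAP-ENV:Body>",
        "    <Quote symbol=\"RHAT\">",
        "      <Price currency=\"USD\">7.02</Price>",
        "    </Quote>",
        "    <Quote symbol=\"AAPL\">",
        "      <Price currency=\"USD\">24.85</Price>",
        "    </Quote>",
        "    <Quote symbol=\"BAC\">",
        "      <Price currency=\"USD\">68.59</Price>",
        "    </Quote>",
        "  </SOAP-ENV:Body>",
        "</SOAP-ENV:Envelope>");

    // Counts every node along the given axis, using the AAPL Quote element
    // (the second Quote in the document) as the context node.
    static int count(String axis) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            Node context = doc.getElementsByTagNameNS(STK, "Quote").item(1);
            Double n = (Double) XPathFactory.newInstance().newXPath().evaluate(
                "count(" + axis + "::node())", context, XPathConstants.NUMBER);
            return n.intValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] axes = {"self", "child", "descendant", "descendant-or-self",
            "parent", "ancestor", "ancestor-or-self", "preceding",
            "preceding-sibling", "following", "following-sibling", "attribute"};
        for (String axis : axes) {
            System.out.println(axis + ": " + count(axis));
        }
    }
}
```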
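Generally these sets would be further subsetted via a node test. For example, if the location step preceding::stk:Quote were applied to this context node, with the prefix stk bound to the http://namespaces.cafeconleche.org/xmljava/ch2/ namespace URI, then the resulting node-set would contain only a single node, an element node named Quote. (The prefix is needed because an unprefixed name never matches an element that is in a namespace, as explained below.)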
The axis chooses the direction to move from the context node. The node test determines what kinds of nodes will be selected along that axis. The node tests are:
Name
Any element or attribute with the specified name. If the name is prefixed, then the local name and namespace URI are compared, not the qualified names. If the name is not prefixed, then the element must be in no namespace at all. An unprefixed name in an XPath expression never matches an element in a namespace, even in the default namespace. When using XPath to search for an unprefixed element like Quote that is in a namespace, you have to use a prefixed name instead such as stk:Quote. Exactly how the prefix is mapped to the namespace depends on the environment in which the XPath expression is used.
*
Along the attribute axis the asterisk matches all attribute nodes. Along the namespace axis the asterisk matches all namespace nodes. Along all other axes, this matches all element nodes.
prefix:*
Any element or attribute in the namespace mapped to the prefix.
comment()
Any comment.
text()
Any text node.
node()
Any node.
processing-instruction()
Any processing instruction.
processing-instruction('target')
Any processing instruction with the specified target.
For example, once again considering the SOAP request document in Example 16.4 and choosing the AAPL Quote element as the context node, consider these location steps:
self::* selects one node, the middle Quote element that serves as the context node.
child::* selects one node, an element node with the name Price and the value 24.85.
child::Price selects no nodes because there are no Price elements in this document that are not in any namespace.
child::stk:Price selects one node, an element node with the name Price and the value 24.85, provided that the prefix stk is bound to the http://namespaces.cafeconleche.org/xmljava/ch2/ namespace URI in the local environment.
descendant::text() selects three nodes: a text node containing white space, a text node with the value "24.85", and another text node containing only white space.
descendant-or-self::* selects two nodes: an element node with the name Quote and an element node with the name Price.
parent::SOAP-ENV:Envelope selects an empty node set because the parent of the context node is not SOAP-ENV:Envelope.
ancestor::SOAP-ENV:Envelope selects one node, the document element, assuming that the local environment maps the prefix SOAP-ENV to the namespace URI http://schemas.xmlsoap.org/soap/envelope/.
ancestor::SOAP-ENV:* selects two nodes, the SOAP-ENV:Body element and the SOAP-ENV:Envelope element, again assuming that the prefixes are properly mapped.
ancestor-or-self::* selects three nodes: an element node with the local name Quote, an element node with the local name Body, and an element node with the local name Envelope.
preceding::comment() selects the single comment in the prolog.
preceding-sibling::node() selects three nodes: a text node containing white space, an element node with the name Quote and the symbol RHAT, and another text node containing only white space, in that order.
following::* selects two nodes: a Quote element node and a Price element node.
following-sibling::processing-instruction() returns an empty node-set.
attribute::symbol selects the attribute node with the name symbol and the value AAPL.
namespace::SOAP-ENV returns a node-set containing a namespace node with name SOAP-ENV and the value http://schemas.xmlsoap.org/soap/envelope/.
namespace::* returns a node-set containing two namespace nodes, one with the name SOAP-ENV and the value http://schemas.xmlsoap.org/soap/envelope/ and the other with an empty string name and the value http://namespaces.cafeconleche.org/xmljava/ch2/.
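Several of the location steps above can be tried out with a short program. This sketch assumes the JDK's javax.xml.xpath API, where the "local environment" that binds prefixes is a NamespaceContext object; the class name NodeTestDemo and its methods are hypothetical, and the prefix stk is simply the one this section has been using. The namespace axis steps are left out because processors differ on whether they report the implicit xml namespace node.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeTestDemo {
    static final String SOAP = "http://schemas.xmlsoap.org/soap/envelope/";
    static final String STK = "http://namespaces.cafeconleche.org/xmljava/ch2/";
    // The SOAP request from Example 16.4, white space included
    static final String XML = String.join("\n",
        "<?xml version=\"1.0\"?>",
        "<!-- XPath axes example -->",
        "<SOAP-ENV:Envelope xmlns:SOAP-ENV=\"" + SOAP + "\" xmlns=\"" + STK + "\">",
        "  <SOAP-ENV:Body>",
        "    <Quote symbol=\"RHAT\">",
        "      <Price currency=\"USD\">7.02</Price>",
        "    </Quote>",
        "    <Quote symbol=\"AAPL\">",
        "      <Price currency=\"USD\">24.85</Price>",
        "    </Quote>",
        "    <Quote symbol=\"BAC\">",
        "      <Price currency=\"USD\">68.59</Price>",
        "    </Quote>",
        "  </SOAP-ENV:Body>",
        "</SOAP-ENV:Envelope>");

    // Creates an XPath object whose environment binds the stk and SOAP-ENV prefixes.
    static XPath newXPath() {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                if ("stk".equals(prefix)) return STK;
                if ("SOAP-ENV".equals(prefix)) return SOAP;
                return XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) { return null; }
            public Iterator<String> getPrefixes(String uri) {
                return Collections.emptyIterator();
            }
        });
        return xpath;
    }

    // Counts the nodes a location step selects from the AAPL Quote element.
    static int count(String step) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            Node context = doc.getElementsByTagNameNS(STK, "Quote").item(1);
            Double n = (Double) newXPath().evaluate(
                "count(" + step + ")", context, XPathConstants.NUMBER);
            return n.intValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] steps = {"child::*", "child::Price", "child::stk:Price",
            "descendant::text()", "descendant-or-self::*",
            "parent::SOAP-ENV:Envelope", "ancestor::SOAP-ENV:*",
            "preceding::comment()", "following::*", "attribute::symbol"};
        for (String step : steps) {
            System.out.println(step + " selects " + count(step) + " node(s)");
        }
    }
}
```

The unprefixed step child::Price selects nothing even though the document itself writes Price without a prefix, which is the most common surprise when querying documents that use a default namespace.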
Each location step can have zero or more predicates that further filter the node-set. A predicate is an XPath expression in square brackets that is evaluated for each node selected by the location step. If the predicate is true, then the node is kept in the node-set. If the predicate is false, then the node is removed from the node-set. For example, given the same SOAP request document, suppose the context node is now the SOAP-ENV:Body element and that the stk prefix is mapped to the http://namespaces.cafeconleche.org/xmljava/ch2/ namespace URI. This location step returns a node-set containing all the Quote elements whose price is less than ten:
child::stk:Quote[child::stk:Price < 10]
If this XPath expression were embedded in an XML document, you might need to escape the less than sign as &lt;. However, this is not necessary when using XPath expressions in Java programs.
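There can be more than one predicate. For example, this location step checks both that the price is greater than ten and that the currency is U.S. dollars. Because the currency attribute belongs to the Price element rather than to the Quote element, the second predicate has to reach it through the Price child:
child::stk:Quote[child::stk:Price > 10][child::stk:Price/attribute::currency = "USD"]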
If the predicate returns a number, then the node is kept in the set only if the number is equal to the position of the context node in the context node list. For example, this location step selects the third Quote child of the context node but not the first or second:
child::stk:Quote[3]
If the context node has fewer than three Quote children, then this returns an empty node-set.
If the predicate returns a string, then the context node is deleted from the set if the string is empty and kept otherwise. For example, this location step selects those Quote elements whose symbol attribute has a value:
child::stk:Quote[string(attribute::symbol)]
This is not quite the same as selecting the Quote elements that have a symbol attribute. This Quote element would not be matched by the above location step:
<Quote symbol="">
  <Price currency="USD">17.32</Price>
</Quote>
If the predicate returns a node-set, then the source node is kept in the returned set only if the predicate node-set is non-empty. It is deleted otherwise. For example, this location step finds those Quote children of the context node that have at least one Price child:
child::stk:Quote[child::stk:Price]
This location step finds those Quote children of the context node that have at least one Price child and at least one Quantity child:
child::stk:Quote[child::stk:Price][child::stk:Quantity]
When applied to the SOAP-ENV:Body element in Example 16.4, it returns an empty node-set because none of its Quote children have a Quantity child.
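The boolean, number, string and node-set predicates described above can be exercised with a sketch like the following. It assumes the JDK's javax.xml.xpath API; the class name PredicateDemo and the method symbols are hypothetical. The SOAP-ENV:Body element of Example 16.4 is the context node, and each result is reported as the symbol attributes of the Quote elements that survive the predicates.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PredicateDemo {
    static final String SOAP = "http://schemas.xmlsoap.org/soap/envelope/";
    static final String STK = "http://namespaces.cafeconleche.org/xmljava/ch2/";
    // The SOAP request from Example 16.4
    static final String XML = String.join("\n",
        "<?xml version=\"1.0\"?>",
        "<!-- XPath axes example -->",
        "<SOAP-ENV:Envelope xmlns:SOAP-ENV=\"" + SOAP + "\" xmlns=\"" + STK + "\">",
        "  <SOAP-ENV:Body>",
        "    <Quote symbol=\"RHAT\"><Price currency=\"USD\">7.02</Price></Quote>",
        "    <Quote symbol=\"AAPL\"><Price currency=\"USD\">24.85</Price></Quote>",
        "    <Quote symbol=\"BAC\"><Price currency=\"USD\">68.59</Price></Quote>",
        "  </SOAP-ENV:Body>",
        "</SOAP-ENV:Envelope>");

    // Returns the symbols of the Quote elements that the step selects
    // when SOAP-ENV:Body is the context node, space separated.
    static String symbols(String step) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            Node body = doc.getElementsByTagNameNS(SOAP, "Body").item(0);
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    if ("stk".equals(prefix)) return STK;
                    if ("SOAP-ENV".equals(prefix)) return SOAP;
                    return XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) { return null; }
                public Iterator<String> getPrefixes(String uri) {
                    return Collections.emptyIterator();
                }
            });
            NodeList quotes = (NodeList) xpath.evaluate(step, body, XPathConstants.NODESET);
            StringBuilder result = new StringBuilder();
            for (int i = 0; i < quotes.getLength(); i++) {
                if (i > 0) result.append(' ');
                result.append(((Element) quotes.item(i)).getAttribute("symbol"));
            }
            return result.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Boolean predicate: price below ten
        System.out.println(symbols("child::stk:Quote[child::stk:Price < 10]"));
        // Numeric predicate: position in the context node list
        System.out.println(symbols("child::stk:Quote[3]"));
        // String predicate: non-empty symbol attribute
        System.out.println(symbols("child::stk:Quote[string(attribute::symbol)]"));
        // Node-set predicates: both a Price and a Quantity child are required
        System.out.println(symbols("child::stk:Quote[child::stk:Price][child::stk:Quantity]"));
    }
}
```

Note that the less than sign appears unescaped inside the Java string literal.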
The forward slash (/) combines location steps into a location path. The node-set selected by the first step becomes the context node-set for the second step. The node-set identified by the second step becomes the context node-set for the third step, and so on.
Continuing with the same example in Example 16.4 and still using the second Quote element as the context node, consider these location paths (Here I assume that the environment for the XPath expressions maps the prefix stk to the namespace URI http://namespaces.cafeconleche.org/xmljava/ch2/ and the prefix SOAP-ENV to the namespace URI http://schemas.xmlsoap.org/soap/envelope/):
This selects the currency attribute node currency="USD"
This selects one node, the first Price element in the document.
This selects all three Quote element nodes in the document, including the context node itself.
This selects the AAPL and the BAC Quote element nodes, but not the RHAT Quote element node.
This selects all three Price element nodes in the document.
This selects the Price element node of the BAC Quote element.
This selects all three currency attribute nodes in the document.
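To make the step-by-step evaluation concrete, here is a sketch that evaluates one multi-step location path for each of the results described above, with the AAPL Quote element as the context node. The expressions are illustrative choices that produce those node-sets and should be read as assumptions, not as the only paths that would do so. The code assumes the JDK's javax.xml.xpath API; the class name RelativePathDemo and the method count are hypothetical.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class RelativePathDemo {
    static final String SOAP = "http://schemas.xmlsoap.org/soap/envelope/";
    static final String STK = "http://namespaces.cafeconleche.org/xmljava/ch2/";
    // The SOAP request from Example 16.4
    static final String XML = String.join("\n",
        "<?xml version=\"1.0\"?>",
        "<!-- XPath axes example -->",
        "<SOAP-ENV:Envelope xmlns:SOAP-ENV=\"" + SOAP + "\" xmlns=\"" + STK + "\">",
        "  <SOAP-ENV:Body>",
        "    <Quote symbol=\"RHAT\"><Price currency=\"USD\">7.02</Price></Quote>",
        "    <Quote symbol=\"AAPL\"><Price currency=\"USD\">24.85</Price></Quote>",
        "    <Quote symbol=\"BAC\"><Price currency=\"USD\">68.59</Price></Quote>",
        "  </SOAP-ENV:Body>",
        "</SOAP-ENV:Envelope>");

    // Counts the nodes a relative location path selects, starting from
    // the AAPL Quote element (the second Quote in the document).
    static int count(String path) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            Node context = doc.getElementsByTagNameNS(STK, "Quote").item(1);
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    if ("stk".equals(prefix)) return STK;
                    if ("SOAP-ENV".equals(prefix)) return SOAP;
                    return XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) { return null; }
                public Iterator<String> getPrefixes(String uri) {
                    return Collections.emptyIterator();
                }
            });
            Double n = (Double) xpath.evaluate(
                "count(" + path + ")", context, XPathConstants.NUMBER);
            return n.intValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] paths = {
            "child::stk:Price/attribute::currency",           // the context node's currency attribute
            "preceding-sibling::stk:Quote/descendant::*",     // the first Price element
            "parent::node()/child::stk:Quote",                // all three Quote elements
            "parent::node()/child::stk:Quote[child::stk:Price > 20]", // AAPL and BAC
            "parent::node()/child::stk:Quote/child::stk:Price",       // all three Price elements
            "following-sibling::stk:Quote/child::stk:Price",  // the BAC Price element
            "ancestor::SOAP-ENV:Envelope/descendant::stk:Price/attribute::currency"
        };
        for (String path : paths) {
            System.out.println(path + " selects " + count(path) + " node(s)");
        }
    }
}
```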
So far all the location paths have been relative to a specified context node. To date, I’ve just identified that context node in prose. When we begin discussing XPath APIs, you’ll see that most methods for evaluating an XPath expression have a context node argument. However, not all location paths require context nodes. In particular, a location path that begins with a forward slash (/) is an absolute path that starts at the root node of the document (not the root element but the root node).
Continuing with the same example in Example 16.4 and once again assuming that the environment binds the prefix stk to the namespace URI http://namespaces.cafeconleche.org/xmljava/ch2/ and the prefix SOAP-ENV to the namespace URI http://schemas.xmlsoap.org/soap/envelope/, consider these location paths:
This selects all three Quote element nodes.
This selects the single SOAP-ENV:Body element node.
This selects all three Price element nodes in the document.
This selects the Quote element nodes whose Price is greater than 20; i.e. it selects the AAPL and the BAC Quote element nodes, but not the RHAT Quote element node.
This returns an empty node-set because the root element of the document is SOAP-ENV:Envelope, not SOAP-ENV:Body.
This returns a node-set containing all attribute nodes in the document.
This returns a node-set containing all non-attribute, non-namespace nodes in the document.
This selects the root node of the document.
XPath location paths can use the abbreviations listed in Table 16.2. The semantics are the same. The syntax is just a little easier to type.
Table 16.2. Abbreviated syntax for XPath
Abbreviation | Expanded form |
---|---|
Name | child::Name |
@Name | attribute::Name |
// | /descendant-or-self::node()/ |
. | self::node() |
.. | parent::node() |
Using the abbreviated forms, the previous batch of relative XPaths selecting from Example 16.4 using the second Quote element as the context node can be rewritten like this:
This selects the currency attribute node currency="USD"
This isn’t an exact abbreviation for preceding-sibling::stk:Quote/descendant::* (// expands to /descendant-or-self::node()/, not /descendant::) but the node-set selected is the same, the first Price element in the document.
This selects all three Quote element nodes in the document, including the context node itself.
This also isn’t an exact abbreviation for the original expression, but again it selects the same node-set, which in this case contains all three Price element nodes in the document.
This selects the AAPL and the BAC Quote element nodes, but not the RHAT Quote element node.
This selects the Price child element node of the BAC Quote element.
This too isn’t an exact abbreviation for the original expression, but once again it selects the same node-set containing all three currency attribute nodes in the document.
Absolute location paths can also be abbreviated. In this case // is especially convenient because at the start of a location path it produces a node-set containing every non-attribute, non-namespace node in the document. However, you should be warned that it is quite inefficient in most XPath processors. If it’s possible to rewrite an expression so that it does not use // (or the unabbreviated descendant or descendant-or-self axes), you probably should.
Here are some examples of abbreviated absolute location paths that apply to Example 16.4:
This selects all three Quote element nodes.
This selects the single SOAP-ENV:Body element node.
This selects all three Price element nodes in the document.
This selects the Quote element nodes whose Price is greater than 20; i.e. it selects the AAPL and the BAC Quote element nodes, but not the RHAT Quote element node.
This returns an empty node-set because the root element of the document is SOAP-ENV:Envelope, not Price.
This returns a node-set containing all attribute nodes in the document.
This returns a node-set containing all non-attribute, non-namespace nodes in the document.
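As a rough check that the abbreviated and unabbreviated syntaxes agree, the sketch below evaluates a few absolute location paths against Example 16.4 in both forms and compares the sizes of the resulting node-sets. The specific paths are illustrative assumptions chosen to match some of the results described above. The code assumes the JDK's javax.xml.xpath API; the class name AbsolutePathDemo and the method count are hypothetical. Because the paths are absolute, the Document object itself is passed as the context.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AbsolutePathDemo {
    static final String SOAP = "http://schemas.xmlsoap.org/soap/envelope/";
    static final String STK = "http://namespaces.cafeconleche.org/xmljava/ch2/";
    // The SOAP request from Example 16.4
    static final String XML = String.join("\n",
        "<?xml version=\"1.0\"?>",
        "<!-- XPath axes example -->",
        "<SOAP-ENV:Envelope xmlns:SOAP-ENV=\"" + SOAP + "\" xmlns=\"" + STK + "\">",
        "  <SOAP-ENV:Body>",
        "    <Quote symbol=\"RHAT\"><Price currency=\"USD\">7.02</Price></Quote>",
        "    <Quote symbol=\"AAPL\"><Price currency=\"USD\">24.85</Price></Quote>",
        "    <Quote symbol=\"BAC\"><Price currency=\"USD\">68.59</Price></Quote>",
        "  </SOAP-ENV:Body>",
        "</SOAP-ENV:Envelope>");

    // Counts the nodes an absolute location path selects from the document.
    static int count(String path) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    if ("stk".equals(prefix)) return STK;
                    if ("SOAP-ENV".equals(prefix)) return SOAP;
                    return XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) { return null; }
                public Iterator<String> getPrefixes(String uri) {
                    return Collections.emptyIterator();
                }
            });
            Double n = (Double) xpath.evaluate(
                "count(" + path + ")", doc, XPathConstants.NUMBER);
            return n.intValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Each row pairs an unabbreviated path with its abbreviated counterpart.
        String[][] pairs = {
            {"/descendant::stk:Quote", "//stk:Quote"},
            {"/child::SOAP-ENV:Envelope/child::SOAP-ENV:Body",
             "/SOAP-ENV:Envelope/SOAP-ENV:Body"},
            {"/child::SOAP-ENV:Envelope/child::SOAP-ENV:Body/child::stk:Quote/child::stk:Price",
             "/SOAP-ENV:Envelope/SOAP-ENV:Body/stk:Quote/stk:Price"},
            {"/child::SOAP-ENV:Envelope/child::SOAP-ENV:Body/child::stk:Quote[child::stk:Price > 20]",
             "/SOAP-ENV:Envelope/SOAP-ENV:Body/stk:Quote[stk:Price > 20]"},
            {"/child::SOAP-ENV:Body", "/SOAP-ENV:Body"}
        };
        for (String[] pair : pairs) {
            System.out.println(pair[1] + ": " + count(pair[0]) + " vs " + count(pair[1]));
        }
    }
}
```

Equal counts do not prove that two expressions are equivalent in general, but for a document this small they are a quick sanity check.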
Occasionally it’s useful to select a node-set that’s built from multiple, more or less unrelated parts of an XML document. For example, you might want to select all the Price elements and all the Quote elements in a document. //stk:Price selects all the prices. //stk:Quote selects all the quotes. You can use the vertical bar, |, to combine these two node-sets into one.
//stk:Price | //stk:Quote selects all the Price element nodes and all the Quote element nodes in the document.
In the same way, a union of two location paths can select all the currency attribute nodes and all the Price element nodes.
Another union can select all the Price and Quantity child elements of all Quote elements.
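A union is evaluated like any other expression that returns a node-set. The sketch below counts the nodes selected by three unions over Example 16.4. Apart from //stk:Price | //stk:Quote, the expressions are illustrative assumptions that match the descriptions above. The code assumes the JDK's javax.xml.xpath API; the class name UnionDemo and the method count are hypothetical.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class UnionDemo {
    static final String SOAP = "http://schemas.xmlsoap.org/soap/envelope/";
    static final String STK = "http://namespaces.cafeconleche.org/xmljava/ch2/";
    // The SOAP request from Example 16.4
    static final String XML = String.join("\n",
        "<?xml version=\"1.0\"?>",
        "<!-- XPath axes example -->",
        "<SOAP-ENV:Envelope xmlns:SOAP-ENV=\"" + SOAP + "\" xmlns=\"" + STK + "\">",
        "  <SOAP-ENV:Body>",
        "    <Quote symbol=\"RHAT\"><Price currency=\"USD\">7.02</Price></Quote>",
        "    <Quote symbol=\"AAPL\"><Price currency=\"USD\">24.85</Price></Quote>",
        "    <Quote symbol=\"BAC\"><Price currency=\"USD\">68.59</Price></Quote>",
        "  </SOAP-ENV:Body>",
        "</SOAP-ENV:Envelope>");

    // Counts the nodes in the node-set an expression selects from the document.
    static int count(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    if ("stk".equals(prefix)) return STK;
                    if ("SOAP-ENV".equals(prefix)) return SOAP;
                    return XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) { return null; }
                public Iterator<String> getPrefixes(String uri) {
                    return Collections.emptyIterator();
                }
            });
            Double n = (Double) xpath.evaluate(
                "count(" + expression + ")", doc, XPathConstants.NUMBER);
            return n.intValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Three Price elements plus three Quote elements
        System.out.println(count("//stk:Price | //stk:Quote"));
        // Three currency attributes plus three Price elements
        System.out.println(count("//@currency | //stk:Price"));
        // Example 16.4 has no Quantity elements, so only the Price children remain
        System.out.println(count("//stk:Quote/stk:Price | //stk:Quote/stk:Quantity"));
    }
}
```

A node that both operands select appears only once in the union, because the result is a set.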
Copyright 2001, 2002 Elliotte Rusty Harold | elharo@metalab.unc.edu | Last Modified June 07, 2002 |