Writing XML with Java
Reading XML through SAX2
Reading and Writing XML through the DOM
You need a JDK
You need some free class libraries
You need a text editor
You need some data to process
Are familiar with Java including I/O, classes, objects, polymorphism, etc.
Know XML including well-formedness, validity, namespaces, and so forth
SAX, the Simple API for XML
SAX1
SAX2
DOM, the Document Object Model
DOM Level 0
DOM Level 1
DOM Level 2
JDOM
Proprietary APIs
Parser specific APIs
Sun's Java API for XML Parsing = SAX1 + DOM1 + a few factory classes
JSR-000031 XML Data Binding Specification from Bluestone, Sun, webMethods et al.
The proposed specification will define an XML data-binding facility for the Java™ Platform. Such a facility compiles an XML schema into one or more Java classes. These automatically-generated classes handle the translation between XML documents that follow the schema and interrelated instances of the derived classes. They also ensure that the constraints expressed in the schema are maintained as instances of the classes are manipulated.
XML documents are text
Any Writer can produce an XML document
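Because the target is only a Writer, the same code can fill a file, a socket, or an in-memory buffer. The following is a minimal sketch of that idea, not code from the original slides; the class name WriterXmlSketch and the greeting element are made up for illustration.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical example class; the names are illustrative only.
class WriterXmlSketch {

    // Writes a tiny XML document to whatever Writer the caller supplies.
    static void writeGreeting(Writer out) throws IOException {
        out.write("<?xml version=\"1.0\"?>\n");
        out.write("<greeting>Hello XML</greeting>\n");
    }

    // Convenience wrapper that collects the document in memory.
    static String toXmlString() {
        StringWriter buffer = new StringWriter();
        try {
            writeGreeting(buffer);
        } catch (IOException e) {
            // A StringWriter does not fail in practice; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.print(toXmlString());
    }
}
```

Passing a FileWriter or an OutputStreamWriter to writeGreeting() instead of the StringWriter would send the same characters to a file or a stream.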
XML documents and APIs are Unicode
Unicode encodings:
UTF-8
UTF-16 big endian
UCS-4 big endian
UTF-16 little endian
UCS-4 little endian
Non-Unicode encodings:
ASCII (subset of UTF-8)
MacRoman
Windows ANSI
Latin 1 through Latin 15
SJIS Japanese
Big-5 Chinese
KOI8-R Cyrillic
Many others...
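To see why the choice of encoding matters, it may help to count the bytes one short string occupies in a few of the encodings listed above. This is an illustrative sketch only; the class name EncodingSizes and the sample word are assumptions, not part of the original material.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical example class; the names are illustrative only.
class EncodingSizes {

    // Number of bytes the text occupies once encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "caf\u00E9"; // "cafe" with an acute accent on the e
        System.out.println("UTF-8:    " + encodedLength(text, StandardCharsets.UTF_8));
        System.out.println("Latin-1:  " + encodedLength(text, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-16BE: " + encodedLength(text, StandardCharsets.UTF_16BE));
    }
}
```

The accented letter takes two bytes in UTF-8 and one in Latin-1, so a reader that guesses the wrong encoding will misinterpret exactly the non-ASCII characters.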
Java's InputStreamReader and OutputStreamWriter classes are very helpful
URL u = new URL(
"http://www.fxis.co.jp/DMS/sgml/xml/charset/utf-8/weekly.xml");
InputStream in = u.openStream();
InputStreamReader reader = new InputStreamReader(in, "UTF-8");
int c;
// Read decoded characters from the reader, not raw bytes from the stream
while ((c = reader.read()) != -1) System.out.print((char) c);
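The reader and the writer have to agree on the encoding. A small in-memory round trip can make that visible without touching the network. This is a sketch under assumed names (EncodingRoundTrip, roundTrip), not part of the original example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical example class; the names are illustrative only.
class EncodingRoundTrip {

    // Encodes text with one charset, then decodes the bytes with another.
    static String roundTrip(String text, Charset writeWith, Charset readWith) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (Writer out = new OutputStreamWriter(bytes, writeWith)) {
                out.write(text);
            }
            StringBuilder result = new StringBuilder();
            try (Reader in = new InputStreamReader(
                    new ByteArrayInputStream(bytes.toByteArray()), readWith)) {
                int c;
                while ((c = in.read()) != -1) {
                    result.append((char) c);
                }
            }
            return result.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00E9";
        // Matching encodings give back the original text.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8));
        // Mismatched encodings turn the accented letter into two wrong characters.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```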
import java.math.BigInteger;
import java.io.*;

public class FibonacciText {

  public static void main(String[] args) {

    try {
      FileOutputStream fout = new FileOutputStream("fibonacci.txt");
      OutputStreamWriter out = new OutputStreamWriter(fout, "8859_1");

      BigInteger low = BigInteger.ZERO;
      BigInteger high = BigInteger.ONE;

      for (int i = 0; i <= 25; i++) {
        out.write(low.toString() + "\r\n");
        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }
      out.close();
    }
    catch (IOException e) {
      System.err.println(e);
    }

  }

}
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025
import java.math.*;
import java.io.*;

public class FibonacciXML {

  public static void main(String[] args) {

    try {
      FileOutputStream fout = new FileOutputStream("fibonacci.xml");
      // No encoding declaration is written below, so the file must be
      // UTF-8 (or UTF-16); do not rely on the platform default encoding.
      OutputStreamWriter out = new OutputStreamWriter(fout, "UTF-8");

      BigInteger low = BigInteger.ZERO;
      BigInteger high = BigInteger.ONE;

      out.write("<?xml version=\"1.0\"?>\r\n");
      out.write("<Fibonacci_Numbers>\r\n");
      for (int i = 0; i <= 25; i++) {
        out.write(" <fibonacci index=\"" + i + "\">");
        out.write(low.toString());
        out.write("</fibonacci>\r\n");
        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }
      out.write("</Fibonacci_Numbers>");
      out.close();
    }
    catch (IOException e) {
      System.err.println(e);
    }

  }

}
<?xml version="1.0"?>
<Fibonacci_Numbers>
 <fibonacci index="0">0</fibonacci>
 <fibonacci index="1">1</fibonacci>
 <fibonacci index="2">1</fibonacci>
 <fibonacci index="3">2</fibonacci>
 <fibonacci index="4">3</fibonacci>
 <fibonacci index="5">5</fibonacci>
 <fibonacci index="6">8</fibonacci>
 <fibonacci index="7">13</fibonacci>
 <fibonacci index="8">21</fibonacci>
 <fibonacci index="9">34</fibonacci>
 <fibonacci index="10">55</fibonacci>
 <fibonacci index="11">89</fibonacci>
 <fibonacci index="12">144</fibonacci>
 <fibonacci index="13">233</fibonacci>
 <fibonacci index="14">377</fibonacci>
 <fibonacci index="15">610</fibonacci>
 <fibonacci index="16">987</fibonacci>
 <fibonacci index="17">1597</fibonacci>
 <fibonacci index="18">2584</fibonacci>
 <fibonacci index="19">4181</fibonacci>
 <fibonacci index="20">6765</fibonacci>
 <fibonacci index="21">10946</fibonacci>
 <fibonacci index="22">17711</fibonacci>
 <fibonacci index="23">28657</fibonacci>
 <fibonacci index="24">46368</fibonacci>
 <fibonacci index="25">75025</fibonacci>
</Fibonacci_Numbers>
import java.math.BigInteger;
import java.io.*;

public class FibonacciLatin1 {

  public static void main(String[] args) {

    try {
      FileOutputStream fout = new FileOutputStream("fibonacci_8859_1.xml");
      OutputStreamWriter out = new OutputStreamWriter(fout, "ISO-8859-1");

      BigInteger low = BigInteger.ZERO;
      BigInteger high = BigInteger.ONE;

      // The XML declaration must use the registered name ISO-8859-1;
      // 8859_1 is a Java alias that XML parsers need not recognize.
      out.write("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\r\n");
      out.write("<Fibonacci_Numbers>\r\n");
      for (int i = 0; i <= 25; i++) {
        out.write(" <fibonacci index=\"" + i + "\">");
        out.write(low.toString());
        out.write("</fibonacci>\r\n");
        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }
      out.write("</Fibonacci_Numbers>");
      out.close();
    }
    catch (IOException e) {
      System.err.println(e);
    }

  }

}
<?xml version="1.0" encoding="ISO-8859-1"?>
<Fibonacci_Numbers>
 <fibonacci index="0">0</fibonacci>
 <fibonacci index="1">1</fibonacci>
 <fibonacci index="2">1</fibonacci>
 <fibonacci index="3">2</fibonacci>
 <fibonacci index="4">3</fibonacci>
 <fibonacci index="5">5</fibonacci>
 <fibonacci index="6">8</fibonacci>
 <fibonacci index="7">13</fibonacci>
 <fibonacci index="8">21</fibonacci>
 <fibonacci index="9">34</fibonacci>
 <fibonacci index="10">55</fibonacci>
 <fibonacci index="11">89</fibonacci>
 <fibonacci index="12">144</fibonacci>
 <fibonacci index="13">233</fibonacci>
 <fibonacci index="14">377</fibonacci>
 <fibonacci index="15">610</fibonacci>
 <fibonacci index="16">987</fibonacci>
 <fibonacci index="17">1597</fibonacci>
 <fibonacci index="18">2584</fibonacci>
 <fibonacci index="19">4181</fibonacci>
 <fibonacci index="20">6765</fibonacci>
 <fibonacci index="21">10946</fibonacci>
 <fibonacci index="22">17711</fibonacci>
 <fibonacci index="23">28657</fibonacci>
 <fibonacci index="24">46368</fibonacci>
 <fibonacci index="25">75025</fibonacci>
</Fibonacci_Numbers>
import java.math.BigInteger;
import java.io.*;

public class FibonacciDTD {

  public static void main(String[] args) {

    try {
      FileOutputStream fout = new FileOutputStream("valid_fibonacci.xml");
      OutputStreamWriter out = new OutputStreamWriter(fout, "UTF-8");

      BigInteger low = BigInteger.ZERO;
      BigInteger high = BigInteger.ONE;

      out.write("<?xml version=\"1.0\"?>\r\n");
      out.write("<!DOCTYPE Fibonacci_Numbers [\r\n");
      out.write(" <!ELEMENT Fibonacci_Numbers (fibonacci*)>\r\n");
      out.write(" <!ELEMENT fibonacci (#PCDATA)>\r\n");
      out.write(" <!ATTLIST fibonacci index CDATA #IMPLIED>\r\n");
      out.write("]>\r\n");
      out.write("<Fibonacci_Numbers>\r\n");
      for (int i = 0; i <= 25; i++) {
        out.write(" <fibonacci index=\"" + i + "\">");
        out.write(low.toString());
        out.write("</fibonacci>\r\n");
        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }
      out.write("</Fibonacci_Numbers>");
      out.close();
    }
    catch (IOException e) {
      System.err.println(e);
    }

  }

}
<?xml version="1.0"?>
<!DOCTYPE Fibonacci_Numbers [
 <!ELEMENT Fibonacci_Numbers (fibonacci*)>
 <!ELEMENT fibonacci (#PCDATA)>
 <!ATTLIST fibonacci index CDATA #IMPLIED>
]>
<Fibonacci_Numbers>
 <fibonacci index="0">0</fibonacci>
 <fibonacci index="1">1</fibonacci>
 <fibonacci index="2">1</fibonacci>
 <fibonacci index="3">2</fibonacci>
 <fibonacci index="4">3</fibonacci>
 <fibonacci index="5">5</fibonacci>
 <fibonacci index="6">8</fibonacci>
 <fibonacci index="7">13</fibonacci>
 <fibonacci index="8">21</fibonacci>
 <fibonacci index="9">34</fibonacci>
 <fibonacci index="10">55</fibonacci>
 <fibonacci index="11">89</fibonacci>
 <fibonacci index="12">144</fibonacci>
 <fibonacci index="13">233</fibonacci>
 <fibonacci index="14">377</fibonacci>
 <fibonacci index="15">610</fibonacci>
 <fibonacci index="16">987</fibonacci>
 <fibonacci index="17">1597</fibonacci>
 <fibonacci index="18">2584</fibonacci>
 <fibonacci index="19">4181</fibonacci>
 <fibonacci index="20">6765</fibonacci>
 <fibonacci index="21">10946</fibonacci>
 <fibonacci index="22">17711</fibonacci>
 <fibonacci index="23">28657</fibonacci>
 <fibonacci index="24">46368</fibonacci>
 <fibonacci index="25">75025</fibonacci>
</Fibonacci_Numbers>
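Writing a DOCTYPE does not by itself guarantee that the output matches it. One way to check, sketched here with the standard JAXP DocumentBuilderFactory in validating mode, is to parse the generated text and treat any reported error as a failure. The class DtdCheckSketch and its helper methods are hypothetical names; a validating SAX parser would serve equally well.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical example class; the names are illustrative only.
class DtdCheckSketch {

    // Wraps the given body in the DOCTYPE used by the Fibonacci example.
    static String document(String body) {
        return "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE Fibonacci_Numbers [\n"
            + "  <!ELEMENT Fibonacci_Numbers (fibonacci*)>\n"
            + "  <!ELEMENT fibonacci (#PCDATA)>\n"
            + "  <!ATTLIST fibonacci index CDATA #IMPLIED>\n"
            + "]>\n"
            + "<Fibonacci_Numbers>" + body + "</Fibonacci_Numbers>";
    }

    // Returns true only if the text parses and satisfies its DTD.
    // Any other failure (including parser configuration) also yields false.
    static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored */ }
                public void error(SAXParseException e) throws SAXParseException { throw e; }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(document("<fibonacci index=\"0\">0</fibonacci>")));
        System.out.println(isValid(document("<lucas>2</lucas>")));
    }
}
```

Validity errors are recoverable by default, which is why the error handler has to rethrow them for the check to fail.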
Surname FirstName Team Position Games Played Games Started AtBats Runs Hits Doubles Triples Home runs RBI Stolen Bases Caught Stealing Sacrifice Hits Sacrifice Flies Errors PB Walks Strike outs Hit by pitch
Anderson Garret ANA Outfield 156 151 622 62 183 41 7 15 79 8 3 3 3 6 0 29 80 1
Baughman Justin ANA Second Base 62 54 196 24 50 9 1 1 20 10 4 5 3 8 0 6 36 1
Bolick Frank ANA Third Base 21 11 45 3 7 2 0 1 2 0 0 0 0 0 0 11 8 0
Disarcina Gary ANA Shortstop 157 155 551 73 158 39 3 3 56 12 7 12 3 14 0 21 51 8
Edmonds Jim ANA Outfield 154 150 599 115 184 42 1 25 91 7 5 1 1 5 0 57 114 1
Erstad Darin ANA Outfield 133 129 537 84 159 39 3 19 82 20 6 1 3 3 0 43 77 6
Garcia Carlos ANA Second Base 19 10 35 4 5 1 0 0 0 2 0 1 0 1 0 3 11 1
Glaus Troy ANA Third Base 48 45 165 19 36 9 0 1 23 1 0 0 2 7 0 15 51 0
Greene Todd ANA Outfield 29 15 71 3 18 4 0 1 7 0 0 0 0 0 0 2 20 0
Helfand Eric ANA Catcher 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Hollins Dave ANA Third Base 101 98 363 60 88 16 2 11 39 11 3 2 2 17 0 44 69 7
Jefferies Gregg ANA Outfield 19 18 72 7 25 6 0 1 10 1 0 0 0 0 0 0 5 0
Johnson Mark ANA First Base 10 2 14 1 1 0 0 0 0 0 0 0 0 0 0 0 6 0
Kreuter Chad ANA Catcher 96 74 252 27 63 10 1 2 33 1 0 5 1 9 5 33 49 3
Martin Norberto ANA Second Base 79 50 195 20 42 2 0 1 13 3 1 3 2 4 0 6 29 0
Mashore Damon ANA Outfield 43 24 98 13 23 6 0 2 11 1 0 1 0 0 0 9 22 3
Molina Ben ANA Catcher 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Nevin Phil ANA Catcher 75 65 237 27 54 8 1 8 27 0 0 0 2 5 20 17 67 5
Obrien Charlie ANA Catcher 62 58 175 13 45 9 0 4 18 0 0 3 3 4 1 10 33 2
Palmeiro Orlando ANA Outfield 74 34 165 28 53 7 2 0 21 5 4 7 0 0 0 20 11 0
Pritchett Chris ANA First Base 31 19 80 12 23 2 1 2 8 2 0 0 0 1 0 4 16 0
Salmon Tim ANA Designated Hitter 136 130 463 84 139 28 1 26 88 0 1 0 10 2 0 90 100 3
Shipley Craig ANA Third Base 77 32 147 18 38 7 1 2 17 0 4 4 1 3 0 5 22 5
Velarde Randy ANA Second Base 51 50 188 29 49 13 1 4 26 7 2 0 1 4 0 34 42 1
Walbeck Matt ANA Catcher 108 91 338 41 87 15 2 6 46 1 1 5 5 7 8 30 68 2
Williams Reggie ANA Outfield 29 7 36 7 13 1 0 1 5 3 3 1 0 0 0 7 11 1
import java.io.*;

public class BaseballTabToXML {

  public static void main(String[] args) {

    try {
      FileInputStream fin = new FileInputStream(args[0]);
      BufferedReader in = new BufferedReader(new InputStreamReader(fin));

      FileOutputStream fout = new FileOutputStream("baseballstats.xml");
      OutputStreamWriter out = new OutputStreamWriter(fout, "UTF-8");

      out.write("<?xml version=\"1.0\"?>\r\n");
      out.write("<players>\r\n");

      String playerStats;
      while ((playerStats = in.readLine()) != null) {
        String[] stats = splitLine(playerStats);
        out.write(" <player>\r\n");
        out.write(" <first_name>" + stats[1] + "</first_name>\r\n");
        out.write(" <surname>" + stats[0] + "</surname>\r\n");
        out.write(" <games_played>" + stats[4] + "</games_played>\r\n");
        out.write(" <at_bats>" + stats[6] + "</at_bats>\r\n");
        out.write(" <runs>" + stats[7] + "</runs>\r\n");
        out.write(" <hits>" + stats[8] + "</hits>\r\n");
        out.write(" <doubles>" + stats[9] + "</doubles>\r\n");
        out.write(" <triples>" + stats[10] + "</triples>\r\n");
        out.write(" <home_runs>" + stats[11] + "</home_runs>\r\n");
        // Note: column 12 of the input is RBI; stolen bases are in
        // column 13. The sample output shown with this program reflects
        // index 12 as written here.
        out.write(" <stolen_bases>" + stats[12] + "</stolen_bases>\r\n");
        out.write(" <caught_stealing>" + stats[14] + "</caught_stealing>\r\n");
        out.write(" <sacrifice_hits>" + stats[15] + "</sacrifice_hits>\r\n");
        out.write(" <sacrifice_flies>" + stats[16] + "</sacrifice_flies>\r\n");
        out.write(" <errors>" + stats[17] + "</errors>\r\n");
        out.write(" <passed_by_ball>" + stats[18] + "</passed_by_ball>\r\n");
        out.write(" <walks>" + stats[19] + "</walks>\r\n");
        out.write(" <strike_outs>" + stats[20] + "</strike_outs>\r\n");
        out.write(" <hit_by_pitch>" + stats[21] + "</hit_by_pitch>\r\n");
        out.write(" </player>\r\n");
      }
      out.write("</players>\r\n");
      out.close();
      in.close();
    }
    catch (IOException e) {
      System.err.println(e);
    }
    catch (ArrayIndexOutOfBoundsException e) {
      System.out.println("Usage: java BaseballTabToXML input_file.tab");
    }

  }

  public static String[] splitLine(String playerStats) {

    // count the number of tabs
    int numTabs = 0;
    for (int i = 0; i < playerStats.length(); i++) {
      if (playerStats.charAt(i) == '\t') numTabs++;
    }
    int numFields = numTabs + 1;
    String[] fields = new String[numFields];
    int position = 0;
    for (int i = 0; i < numFields; i++) {
      // copy characters up to the next tab (or end of line) into this field
      StringBuffer field = new StringBuffer();
      while (position < playerStats.length()
       && playerStats.charAt(position++) != '\t') {
        field.append(playerStats.charAt(position-1));
      }
      fields[i] = field.toString();
    }
    return fields;

  }

}
<?xml version="1.0"?> <players> <player> <first_name>FirstName</first_name> <surname>Surname</surname> <games_played>Games Played</games_played> <at_bats>AtBats</at_bats> <runs>Runs</runs> <hits>Hits</hits> <doubles>Doubles</doubles> <triples>Triples</triples> <home_runs>Home runs</home_runs> <stolen_bases>RBI</stolen_bases> <caught_stealing>Caught Stealing</caught_stealing> <sacrifice_hits>Sacrifice Hits</sacrifice_hits> <sacrifice_flies>Sacrifice Flies</sacrifice_flies> <errors>Errors</errors> <passed_by_ball>PB</passed_by_ball> <walks>Walks</walks> <strike_outs>Strike outs</strike_outs> <hit_by_pitch>Hit by pitch</hit_by_pitch> </player> <player> <first_name>Garret </first_name> <surname>Anderson</surname> <games_played>156</games_played> <at_bats>622</at_bats> <runs>62</runs> <hits>183</hits> <doubles>41</doubles> <triples>7</triples> <home_runs>15</home_runs> <stolen_bases>79</stolen_bases> <caught_stealing>3</caught_stealing> <sacrifice_hits>3</sacrifice_hits> <sacrifice_flies>3</sacrifice_flies> <errors>6</errors> <passed_by_ball>0</passed_by_ball> <walks>29</walks> <strike_outs>80</strike_outs> <hit_by_pitch>1</hit_by_pitch> </player> <player> <first_name>Justin </first_name> <surname>Baughman</surname> <games_played>62</games_played> <at_bats>196</at_bats> <runs>24</runs> <hits>50</hits> <doubles>9</doubles> <triples>1</triples> <home_runs>1</home_runs> <stolen_bases>20</stolen_bases> <caught_stealing>4</caught_stealing> <sacrifice_hits>5</sacrifice_hits> <sacrifice_flies>3</sacrifice_flies> <errors>8</errors> <passed_by_ball>0</passed_by_ball> <walks>6</walks> <strike_outs>36</strike_outs> <hit_by_pitch>1</hit_by_pitch> </player> <player> <first_name>Frank </first_name> <surname>Bolick</surname> <games_played>21</games_played> <at_bats>45</at_bats> <runs>3</runs> <hits>7</hits> <doubles>2</doubles> <triples>0</triples> <home_runs>1</home_runs> <stolen_bases>2</stolen_bases> <caught_stealing>0</caught_stealing> <sacrifice_hits>0</sacrifice_hits> 
<sacrifice_flies>0</sacrifice_flies> <errors>0</errors> <passed_by_ball>0</passed_by_ball> <walks>11</walks> <strike_outs>8</strike_outs> <hit_by_pitch>0</hit_by_pitch> </player> <player> <first_name>Gary </first_name> <surname>Disarcina</surname> <games_played>157</games_played> <at_bats>551</at_bats> <runs>73</runs> <hits>158</hits> <doubles>39</doubles> <triples>3</triples> <home_runs>3</home_runs> <stolen_bases>56</stolen_bases> <caught_stealing>7</caught_stealing> <sacrifice_hits>12</sacrifice_hits> <sacrifice_flies>3</sacrifice_flies> <errors>14</errors> <passed_by_ball>0</passed_by_ball> <walks>21</walks> <strike_outs>51</strike_outs> <hit_by_pitch>8</hit_by_pitch> </player> <player> <first_name>Jim </first_name> <surname>Edmonds</surname> <games_played>154</games_played> <at_bats>599</at_bats> <runs>115</runs> <hits>184</hits> <doubles>42</doubles> <triples>1</triples> <home_runs>25</home_runs> <stolen_bases>91</stolen_bases> <caught_stealing>5</caught_stealing> <sacrifice_hits>1</sacrifice_hits> <sacrifice_flies>1</sacrifice_flies> <errors>5</errors> <passed_by_ball>0</passed_by_ball> <walks>57</walks> <strike_outs>114</strike_outs> <hit_by_pitch>1</hit_by_pitch> </player> <player> <first_name>Darin </first_name> <surname>Erstad</surname> <games_played>133</games_played> <at_bats>537</at_bats> <runs>84</runs> <hits>159</hits> <doubles>39</doubles> <triples>3</triples> <home_runs>19</home_runs> <stolen_bases>82</stolen_bases> <caught_stealing>6</caught_stealing> <sacrifice_hits>1</sacrifice_hits> <sacrifice_flies>3</sacrifice_flies> <errors>3</errors> <passed_by_ball>0</passed_by_ball> <walks>43</walks> <strike_outs>77</strike_outs> <hit_by_pitch>6</hit_by_pitch> </player> <player> <first_name>Carlos </first_name> <surname>Garcia</surname> <games_played>19</games_played> <at_bats>35</at_bats> <runs>4</runs> <hits>5</hits> <doubles>1</doubles> <triples>0</triples> <home_runs>0</home_runs> <stolen_bases>0</stolen_bases> <caught_stealing>0</caught_stealing> 
<sacrifice_hits>1</sacrifice_hits> <sacrifice_flies>0</sacrifice_flies> <errors>1</errors> <passed_by_ball>0</passed_by_ball> <walks>3</walks> <strike_outs>11</strike_outs> <hit_by_pitch>1</hit_by_pitch> </player> <player> <first_name>Troy </first_name> <surname>Glaus</surname> <games_played>48</games_played> <at_bats>165</at_bats> <runs>19</runs> <hits>36</hits> <doubles>9</doubles> <triples>0</triples> <home_runs>1</home_runs> <stolen_bases>23</stolen_bases> <caught_stealing>0</caught_stealing> <sacrifice_hits>0</sacrifice_hits> <sacrifice_flies>2</sacrifice_flies> <errors>7</errors> <passed_by_ball>0</passed_by_ball> <walks>15</walks> <strike_outs>51</strike_outs> <hit_by_pitch>0</hit_by_pitch> </player> <player> <first_name>Todd </first_name> <surname>Greene</surname> <games_played>29</games_played> <at_bats>71</at_bats> <runs>3</runs> <hits>18</hits> <doubles>4</doubles> <triples>0</triples> <home_runs>1</home_runs> <stolen_bases>7</stolen_bases> <caught_stealing>0</caught_stealing> <sacrifice_hits>0</sacrifice_hits> <sacrifice_flies>0</sacrifice_flies> <errors>0</errors> <passed_by_ball>0</passed_by_ball> <walks>2</walks> <strike_outs>20</strike_outs> <hit_by_pitch>0</hit_by_pitch> </player> <player> <first_name>Eric </first_name> <surname>Helfand</surname> <games_played>0</games_played> <at_bats>0</at_bats> <runs>0</runs> <hits>0</hits> <doubles>0</doubles> <triples>0</triples> <home_runs>0</home_runs> <stolen_bases>0</stolen_bases> <caught_stealing>0</caught_stealing> <sacrifice_hits>0</sacrifice_hits> <sacrifice_flies>0</sacrifice_flies> <errors>0</errors> <passed_by_ball>0</passed_by_ball> <walks>0</walks> <strike_outs>0</strike_outs> <hit_by_pitch>0</hit_by_pitch> </player> <player> <first_name>Dave </first_name> <surname>Hollins</surname> <games_played>101</games_played> <at_bats>363</at_bats> <runs>60</runs> <hits>88</hits> <doubles>16</doubles> <triples>2</triples> <home_runs>11</home_runs> <stolen_bases>39</stolen_bases> 
<caught_stealing>3</caught_stealing> <sacrifice_hits>2</sacrifice_hits> <sacrifice_flies>2</sacrifice_flies> <errors>17</errors> <passed_by_ball>0</passed_by_ball> <walks>44</walks> <strike_outs>69</strike_outs> <hit_by_pitch>7</hit_by_pitch> </player> <player> <first_name>Gregg </first_name> <surname>Jefferies</surname> <games_played>19</games_played> <at_bats>72</at_bats> <runs>7</runs> <hits>25</hits> <doubles>6</doubles> <triples>0</triples> <home_runs>1</home_runs> <stolen_bases>10</stolen_bases> <caught_stealing>0</caught_stealing> <sacrifice_hits>0</sacrifice_hits> <sacrifice_flies>0</sacrifice_flies> <errors>0</errors> <passed_by_ball>0</passed_by_ball> <walks>0</walks> <strike_outs>5</strike_outs> <hit_by_pitch>0</hit_by_pitch> </player> <player> <first_name>Mark </first_name> <surname>Johnson</surname> <games_played>10</games_played> <at_bats>14</at_bats> <runs>1</runs> <hits>1</hits> <doubles>0</doubles> <triples>0</triples> <home_runs>0</home_runs> <stolen_bases>0</stolen_bases> <caught_stealing>0</caught_stealing> <sacrifice_hits>0</sacrifice_hits> <sacrifice_flies>0</sacrifice_flies> <errors>0</errors> <passed_by_ball>0</passed_by_ball> <walks>0</walks> <strike_outs>6</strike_outs> <hit_by_pitch>0</hit_by_pitch> </player> <player> <first_name>Chad </first_name> <surname>Kreuter</surname> <games_played>96</games_played> <at_bats>252</at_bats> <runs>27</runs> <hits>63</hits> <doubles>10</doubles> <triples>1</triples> <home_runs>2</home_runs> <stolen_bases>33</stolen_bases> <caught_stealing>0</caught_stealing> <sacrifice_hits>5</sacrifice_hits> <sacrifice_flies>1</sacrifice_flies> <errors>9</errors> <passed_by_ball>5</passed_by_ball> <walks>33</walks> <strike_outs>49</strike_outs> <hit_by_pitch>3</hit_by_pitch> </player> <player> <first_name>Norberto </first_name> <surname>Martin</surname> <games_played>79</games_played> <at_bats>195</at_bats> <runs>20</runs> <hits>42</hits> <doubles>2</doubles> <triples>0</triples> <home_runs>1</home_runs> 
<stolen_bases>13</stolen_bases> <caught_stealing>1</caught_stealing> <sacrifice_hits>3</sacrifice_hits> <sacrifice_flies>2</sacrifice_flies> <errors>4</errors> <passed_by_ball>0</passed_by_ball> <walks>6</walks> <strike_outs>29</strike_outs> <hit_by_pitch>0</hit_by_pitch> </player> <player> <first_name>Damon </first_name> <surname>Mashore</surname> <games_played>43</games_played> <at_bats>98</at_bats> <runs>13</runs> <hits>23</hits> <doubles>6</doubles> <triples>0</triples> <home_runs>2</home_runs> <stolen_bases>11</stolen_bases> <caught_stealing>0</caught_stealing> <sacrifice_hits>1</sacrifice_hits> <sacrifice_flies>0</sacrifice_flies> <errors>0</errors> <passed_by_ball>0</passed_by_ball> <walks>9</walks> <strike_outs>22</strike_outs> <hit_by_pitch>3</hit_by_pitch> </player> <player> <first_name>Ben </first_name> <surname>Molina</surname> <games_played>2</games_played> <at_bats>1</at_bats> <runs>0</runs> <hits>0</hits> <doubles>0</doubles> <triples>0</triples> <home_runs>0</home_runs> <stolen_bases>0</stolen_bases> <caught_stealing>0</caught_stealing> <sacrifice_hits>0</sacrifice_hits> <sacrifice_flies>0</sacrifice_flies> <errors>0</errors> <passed_by_ball>0</passed_by_ball> <walks>0</walks> <strike_outs>0</strike_outs> <hit_by_pitch>0</hit_by_pitch> </player> <player> <first_name>Phil </first_name> <surname>Nevin</surname> <games_played>75</games_played> <at_bats>237</at_bats> <runs>27</runs> <hits>54</hits> <doubles>8</doubles> <triples>1</triples> <home_runs>8</home_runs> <stolen_bases>27</stolen_bases> <caught_stealing>0</caught_stealing> <sacrifice_hits>0</sacrifice_hits> <sacrifice_flies>2</sacrifice_flies> <errors>5</errors> <passed_by_ball>20</passed_by_ball> <walks>17</walks> <strike_outs>67</strike_outs> <hit_by_pitch>5</hit_by_pitch> </player> <player> <first_name>Charlie </first_name> <surname>Obrien</surname> <games_played>62</games_played> <at_bats>175</at_bats> <runs>13</runs> <hits>45</hits> <doubles>9</doubles> <triples>0</triples> 
<home_runs>4</home_runs> <stolen_bases>18</stolen_bases> <caught_stealing>0</caught_stealing> <sacrifice_hits>3</sacrifice_hits> <sacrifice_flies>3</sacrifice_flies> <errors>4</errors> <passed_by_ball>1</passed_by_ball> <walks>10</walks> <strike_outs>33</strike_outs> <hit_by_pitch>2</hit_by_pitch> </player> <player> <first_name>Orlando </first_name> <surname>Palmeiro</surname> <games_played>74</games_played> <at_bats>165</at_bats> <runs>28</runs> <hits>53</hits> <doubles>7</doubles> <triples>2</triples> <home_runs>0</home_runs> <stolen_bases>21</stolen_bases> <caught_stealing>4</caught_stealing> <sacrifice_hits>7</sacrifice_hits> <sacrifice_flies>0</sacrifice_flies> <errors>0</errors> <passed_by_ball>0</passed_by_ball> <walks>20</walks> <strike_outs>11</strike_outs> <hit_by_pitch>0</hit_by_pitch> </player> <player> <first_name>Chris </first_name> <surname>Pritchett</surname> <games_played>31</games_played> <at_bats>80</at_bats> <runs>12</runs> <hits>23</hits> <doubles>2</doubles> <triples>1</triples> <home_runs>2</home_runs> <stolen_bases>8</stolen_bases> <caught_stealing>0</caught_stealing> <sacrifice_hits>0</sacrifice_hits> <sacrifice_flies>0</sacrifice_flies> <errors>1</errors> <passed_by_ball>0</passed_by_ball> <walks>4</walks> <strike_outs>16</strike_outs> <hit_by_pitch>0</hit_by_pitch> </player> <player> <first_name>Tim </first_name> <surname>Salmon</surname> <games_played>136</games_played> <at_bats>463</at_bats> <runs>84</runs> <hits>139</hits> <doubles>28</doubles> <triples>1</triples> <home_runs>26</home_runs> <stolen_bases>88</stolen_bases> <caught_stealing>1</caught_stealing> <sacrifice_hits>0</sacrifice_hits> <sacrifice_flies>10</sacrifice_flies> <errors>2</errors> <passed_by_ball>0</passed_by_ball> <walks>90</walks> <strike_outs>100</strike_outs> <hit_by_pitch>3</hit_by_pitch> </player> <player> <first_name>Craig </first_name> <surname>Shipley</surname> <games_played>77</games_played> <at_bats>147</at_bats> <runs>18</runs> <hits>38</hits> 
<doubles>7</doubles> <triples>1</triples> <home_runs>2</home_runs> <stolen_bases>17</stolen_bases> <caught_stealing>4</caught_stealing> <sacrifice_hits>4</sacrifice_hits> <sacrifice_flies>1</sacrifice_flies> <errors>3</errors> <passed_by_ball>0</passed_by_ball> <walks>5</walks> <strike_outs>22</strike_outs> <hit_by_pitch>5</hit_by_pitch> </player> <player> <first_name>Randy </first_name> <surname>Velarde</surname> <games_played>51</games_played> <at_bats>188</at_bats> <runs>29</runs> <hits>49</hits> <doubles>13</doubles> <triples>1</triples> <home_runs>4</home_runs> <stolen_bases>26</stolen_bases> <caught_stealing>2</caught_stealing> <sacrifice_hits>0</sacrifice_hits> <sacrifice_flies>1</sacrifice_flies> <errors>4</errors> <passed_by_ball>0</passed_by_ball> <walks>34</walks> <strike_outs>42</strike_outs> <hit_by_pitch>1</hit_by_pitch> </player> <player> <first_name>Matt </first_name> <surname>Walbeck</surname> <games_played>108</games_played> <at_bats>338</at_bats> <runs>41</runs> <hits>87</hits> <doubles>15</doubles> <triples>2</triples> <home_runs>6</home_runs> <stolen_bases>46</stolen_bases> <caught_stealing>1</caught_stealing> <sacrifice_hits>5</sacrifice_hits> <sacrifice_flies>5</sacrifice_flies> <errors>7</errors> <passed_by_ball>8</passed_by_ball> <walks>30</walks> <strike_outs>68</strike_outs> <hit_by_pitch>2</hit_by_pitch> </player> <player> <first_name>Reggie </first_name> <surname>Williams</surname> <games_played>29</games_played> <at_bats>36</at_bats> <runs>7</runs> <hits>13</hits> <doubles>1</doubles> <triples>0</triples> <home_runs>1</home_runs> <stolen_bases>5</stolen_bases> <caught_stealing>3</caught_stealing> <sacrifice_hits>1</sacrifice_hits> <sacrifice_flies>0</sacrifice_flies> <errors>0</errors> <passed_by_ball>0</passed_by_ball> <walks>7</walks> <strike_outs>11</strike_outs> <hit_by_pitch>1</hit_by_pitch> </player> </players>
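The hand-written splitLine() method in the program above predates String.split(), which was added in Java 1.4. On a current JDK the same tab splitting can probably be reduced to one call, as in this sketch; TabSplitSketch is a hypothetical name, and the limit of -1 is what keeps trailing empty fields so the result has the same number of fields as the original method.

```java
// Hypothetical example class; the names are illustrative only.
class TabSplitSketch {

    // Splits a tab-delimited record into fields, keeping empty fields.
    static String[] splitLine(String playerStats) {
        return playerStats.split("\t", -1);
    }

    public static void main(String[] args) {
        String[] fields = splitLine("Anderson\tGarret\tANA\tOutfield");
        for (String field : fields) {
            System.out.println(field);
        }
    }
}
```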
import java.io.*;
import java.text.*;
import java.util.*;

public class BattingAverage {

  public static void main(String[] args) {

    try {
      FileInputStream fin = new FileInputStream(args[0]);
      BufferedReader in = new BufferedReader(new InputStreamReader(fin));

      FileOutputStream fout = new FileOutputStream("battingaverages.xml");
      OutputStreamWriter out = new OutputStreamWriter(fout, "UTF-8");

      out.write("<?xml version=\"1.0\"?>\r\n");
      out.write("<players>\r\n");

      String playerStats;

      // for formatting batting averages
      DecimalFormat averages = (DecimalFormat)
       NumberFormat.getNumberInstance(Locale.US);
      averages.setMaximumFractionDigits(3);
      averages.setMinimumFractionDigits(3);
      averages.setMinimumIntegerDigits(0);

      while ((playerStats = in.readLine()) != null) {
        String[] stats = splitLine(playerStats);
        String formattedAverage;
        try {
          int atBats = Integer.parseInt(stats[6]);
          int hits = Integer.parseInt(stats[8]);
          int walks = Integer.parseInt(stats[19]);
          int hitByPitch = Integer.parseInt(stats[21]);
          int sacrificeFlies = Integer.parseInt(stats[16]);
          int sacrificeHits = Integer.parseInt(stats[15]);
          int officialAtBats
           = atBats - walks - hitByPitch - sacrificeHits;
          if (officialAtBats <= 0) formattedAverage = "N/A";
          else {
            double average = hits / (double) officialAtBats;
            formattedAverage = averages.format(average);
          }
        }
        catch (Exception e) {
          // skip this player
          continue;
        }
        out.write(" <player>\r\n");
        out.write(" <first_name>" + stats[1] + "</first_name>\r\n");
        out.write(" <surname>" + stats[0] + "</surname>\r\n");
        out.write(" <batting_average>" + formattedAverage
         + "</batting_average>\r\n");
        out.write(" </player>\r\n");
      }
      out.write("</players>\r\n");
      out.close();
      in.close();
    }
    catch (IOException e) {
      System.err.println(e);
    }
    catch (ArrayIndexOutOfBoundsException e) {
      System.out.println("Usage: java BattingAverage input_file.tab");
    }

  }

  public static String[] splitLine(String playerStats) {

    // count the number of tabs
    int numTabs = 0;
    for (int i = 0; i < playerStats.length(); i++) {
      if (playerStats.charAt(i) == '\t') numTabs++;
    }
    int numFields = numTabs + 1;
    String[] fields = new String[numFields];
    int position = 0;
    for (int i = 0; i < numFields; i++) {
      StringBuffer field = new StringBuffer();
      while (position < playerStats.length()
       && playerStats.charAt(position++) != '\t') {
        field.append(playerStats.charAt(position-1));
      }
      fields[i] = field.toString();
    }
    return fields;

  }

}
<?xml version="1.0"?> <players> <player> <first_name>Garret </first_name> <surname>Anderson</surname> <batting_average>.311</batting_average> </player> <player> <first_name>Justin </first_name> <surname>Baughman</surname> <batting_average>.272</batting_average> </player> <player> <first_name>Frank </first_name> <surname>Bolick</surname> <batting_average>.206</batting_average> </player> <player> <first_name>Gary </first_name> <surname>Disarcina</surname> <batting_average>.310</batting_average> </player> <player> <first_name>Jim </first_name> <surname>Edmonds</surname> <batting_average>.341</batting_average> </player> <player> <first_name>Darin </first_name> <surname>Erstad</surname> <batting_average>.326</batting_average> </player> <player> <first_name>Carlos </first_name> <surname>Garcia</surname> <batting_average>.167</batting_average> </player> <player> <first_name>Troy </first_name> <surname>Glaus</surname> <batting_average>.240</batting_average> </player> <player> <first_name>Todd </first_name> <surname>Greene</surname> <batting_average>.261</batting_average> </player> <player> <first_name>Eric </first_name> <surname>Helfand</surname> <batting_average>N/A</batting_average> </player> <player> <first_name>Dave </first_name> <surname>Hollins</surname> <batting_average>.284</batting_average> </player> <player> <first_name>Gregg </first_name> <surname>Jefferies</surname> <batting_average>.347</batting_average> </player> <player> <first_name>Mark </first_name> <surname>Johnson</surname> <batting_average>.071</batting_average> </player> <player> <first_name>Chad </first_name> <surname>Kreuter</surname> <batting_average>.299</batting_average> </player> <player> <first_name>Norberto </first_name> <surname>Martin</surname> <batting_average>.226</batting_average> </player> <player> <first_name>Damon </first_name> <surname>Mashore</surname> <batting_average>.271</batting_average> </player> <player> <first_name>Ben </first_name> <surname>Molina</surname> 
<batting_average>.000</batting_average> </player> <player> <first_name>Phil </first_name> <surname>Nevin</surname> <batting_average>.251</batting_average> </player> <player> <first_name>Charlie </first_name> <surname>Obrien</surname> <batting_average>.281</batting_average> </player> <player> <first_name>Orlando </first_name> <surname>Palmeiro</surname> <batting_average>.384</batting_average> </player> <player> <first_name>Chris </first_name> <surname>Pritchett</surname> <batting_average>.303</batting_average> </player> <player> <first_name>Tim </first_name> <surname>Salmon</surname> <batting_average>.376</batting_average> </player> <player> <first_name>Craig </first_name> <surname>Shipley</surname> <batting_average>.286</batting_average> </player> <player> <first_name>Randy </first_name> <surname>Velarde</surname> <batting_average>.320</batting_average> </player> <player> <first_name>Matt </first_name> <surname>Walbeck</surname> <batting_average>.289</batting_average> </player> <player> <first_name>Reggie </first_name> <surname>Williams</surname> <batting_average>.481</batting_average> </player> </players>
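The leading-dot style of the averages above comes from the DecimalFormat settings in the program: exactly three fraction digits and a minimum of zero integer digits. The sketch below isolates just that formatting step so it can be tried on its own; the class and method names are hypothetical.

```java
import java.text.DecimalFormat;
import java.text.NumberFormat;
import java.util.Locale;

// Hypothetical example class; the names are illustrative only.
class AverageFormatSketch {

    // Formats a ratio the way the BattingAverage program does.
    static String formatAverage(double value) {
        DecimalFormat averages =
            (DecimalFormat) NumberFormat.getNumberInstance(Locale.US);
        averages.setMaximumFractionDigits(3);
        averages.setMinimumFractionDigits(3);
        averages.setMinimumIntegerDigits(0); // drop the leading zero
        return averages.format(value);
    }

    public static void main(String[] args) {
        // 183 hits over the 589 adjusted at-bats computed for the first player
        System.out.println(formatAverage(183 / 589.0));
    }
}
```

Fixing the locale to Locale.US keeps the decimal separator a period regardless of the machine's default locale, which matters when the XML is meant to be read by other programs.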
XML files are text files.
You can write XML files any way you can write a text file in Java or any other language for that matter.
You have to follow well-formedness rules.
You do have to use UTF-8 or specify a different encoding in the XML declaration.
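One well-formedness rule that is easy to break when writing XML by hand is that literal & and < characters may not appear in element content or attribute values. The example programs get away without escaping because their data contains neither. A minimal escaping helper, offered as a sketch with hypothetical names rather than as part of the original programs, might look like this:

```java
// Hypothetical example class; the names are illustrative only.
class XmlEscapeSketch {

    // Replaces the characters that are special in XML text and
    // double-quoted attribute values with predefined entity references.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("<team>" + escape("Smith & Sons <Reserve>") + "</team>");
    }
}
```

Each value would pass through escape() before being concatenated into a write() call, for example out.write("<surname>" + escape(stats[0]) + "</surname>").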
Java I/O
Elliotte Rusty Harold
O'Reilly & Associates, 1999
ISBN: 1-56592-485-1
The stereotypical "Desperate Perl Hacker" (DPH) is supposed to be able to write an XML parser in a weekend.
The parser does the hard work for you.
Your code reads the document through the parser's API.
Public domain, developed on xml-dev mailing list
Maintained by David Megginson
org.xml.sax package
Event based
Parser | URL | Validating | Namespaces | DOM1 | DOM2 | SAX1 | SAX2 | License |
---|---|---|---|---|---|---|---|---|
Apache XML Project's Xerces Java | http://xml.apache.org/xerces-j/index.html | X | X | X | X | X | X | Apache Software License, Version 1.1 |
IBM's XML for Java | http://www.alphaworks.ibm.com/formula/xml | X | X | X | X | X | X | License |
James Clark's XP | http://www.jclark.com/xml/xp/index.html | X | Modified BSD | |||||
Microstar's Ælfred | http://home.pacbell.net/david-b/xml/ | Namespaces | DOM1 | DOM2 | SAX1 | SAX2 | open source | |
Silfide's SXP | http://www.loria.fr/projets/XSilfide/EN/sxp/ | X | X | Non-GPL viral open source license | ||||
Sun's Java API for XML | http://java.sun.com/products/xml | X | X | X | X | free beer | ||
Oracle's XML Parser for Java | http://technet.oracle.com/ | X | X | X | X | free beer |
SAX1 omits:
Comments
Lexical Information (CDATA sections, entity references, etc.)
DTD declarations
Validation
Namespaces
SAX2 adds:
Namespace support
Optional validation
Optional lexical events for comments, CDATA sections, entity references
A lot more configurable
Deprecates a lot of SAX1
Adapter classes convert between SAX1 parsers and SAX2 XMLReaders.
Use the factory method XMLReaderFactory.createXMLReader() to retrieve a parser-specific implementation of the XMLReader interface
Your code registers a ContentHandler with the parser
An InputSource feeds the document into the parser
As the document is read, the parser calls back to the methods of the ContentHandler to tell it what it's seeing in the document.
The XMLReaderFactory.createXMLReader() method instantiates the XMLReader implementation named by the org.xml.sax.driver system property:
try {
XMLReader parser = XMLReaderFactory.createXMLReader();
}
catch (SAXException e) {
System.err.println(e);
}
The XMLReaderFactory.createXMLReader(String className) method instantiates the XMLReader implementation named by its argument:
try {
XMLReader parser
= XMLReaderFactory.createXMLReader(
"org.apache.xerces.parsers.SAXParser");
}
catch (SAXException e) {
System.err.println(e);
}
Or you can use the constructor in the package-specific class:
XMLReader parser = new SAXParser();
Or all three:
XMLReader parser;
try {
parser = XMLReaderFactory.createXMLReader();
}
catch (SAXException ex) {
try {
parser = XMLReaderFactory.createXMLReader(
"org.apache.xerces.parsers.SAXParser");
}
catch (SAXException ex2) {
parser = new SAXParser();
}
}
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.*;

public class SAX2Checker {

  public static void main(String[] args) {

    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    }
    catch (SAXException ex) {
      try {
        parser = XMLReaderFactory.createXMLReader(
         "org.apache.xerces.parsers.SAXParser");
      }
      catch (SAXException ex2) {
        System.out.println("Could not locate a parser. "
         + "Please set the org.xml.sax.driver property.");
        return;
      }
    }

    if (args.length == 0) {
      System.out.println("Usage: java SAX2Checker URL1 URL2...");
    }

    // start parsing...
    for (int i = 0; i < args.length; i++) {

      // command line should offer URIs or file names
      try {
        parser.parse(args[i]);
        // If there are no well-formedness errors
        // then no exception is thrown
        System.out.println(args[i] + " is well formed.");
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber()
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not check " + args[i]
         + " because of the IOException " + e);
      }

    }

  }

}
C:\>java SAX2Checker http://www.ibiblio.org/xml/
http://www.ibiblio.org/xml/ is not well formed.
The element type "dt" must be terminated by the
matching end-tag "</dt>".
at line 186, column 5
package org.xml.sax;

public interface ContentHandler {

  public void setDocumentLocator(Locator locator);
  public void startDocument() throws SAXException;
  public void endDocument() throws SAXException;
  public void startPrefixMapping(String prefix, String uri)
   throws SAXException;
  public void endPrefixMapping(String prefix) throws SAXException;
  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) throws SAXException;
  public void endElement(String namespaceURI, String localName,
   String qualifiedName) throws SAXException;
  public void characters(char[] text, int start, int length)
   throws SAXException;
  public void ignorableWhitespace(char[] text, int start, int length)
   throws SAXException;
  public void processingInstruction(String target, String data)
   throws SAXException;
  public void skippedEntity(String name) throws SAXException;

}
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.*;

public class EventReporter implements ContentHandler {

  public void setDocumentLocator(Locator locator) {
    System.out.println("setDocumentLocator(" + locator + ")");
  }

  public void startDocument() throws SAXException {
    System.out.println("startDocument()");
  }

  public void endDocument() throws SAXException {
    System.out.println("endDocument()");
  }

  public void startElement(String namespaceURI, String localName,
   String qName, Attributes atts) throws SAXException {

    namespaceURI = '"' + namespaceURI + '"';
    localName = '"' + localName + '"';
    qName = '"' + qName + '"';
    String attributeString = "{";
    for (int i = 0; i < atts.getLength(); i++) {
      attributeString += atts.getQName(i) + "=\"" + atts.getValue(i) + "\"";
      if (i != atts.getLength()-1) attributeString += ", ";
    }
    attributeString += "}";
    System.out.println("startElement(" + namespaceURI + ", " + localName
     + ", " + qName + ", " + attributeString + ")");

  }

  public void endElement(String namespaceURI, String localName,
   String qName) throws SAXException {
    namespaceURI = '"' + namespaceURI + '"';
    localName = '"' + localName + '"';
    qName = '"' + qName + '"';
    System.out.println("endElement(" + namespaceURI + ", " + localName
     + ", " + qName + ")");
  }

  public void characters(char[] text, int start, int length)
   throws SAXException {
    // Only the slice from start to start+length belongs to this event;
    // the rest of the array is the parser's own buffer.
    String textString = "[" + new String(text, start, length) + "]";
    System.out.println("characters(" + textString + ", " + start
     + ", " + length + ")");
  }

  public void ignorableWhitespace(char[] text, int start, int length)
   throws SAXException {
    System.out.println("ignorableWhitespace()");
  }

  public void processingInstruction(String target, String data)
   throws SAXException {
    System.out.println("processingInstruction(" + target + ", "
     + data + ")");
  }

  public void startPrefixMapping(String prefix, String uri)
   throws SAXException {
    System.out.println("startPrefixMapping(\"" + prefix + "\", \""
     + uri + "\")");
  }

  public void endPrefixMapping(String prefix) throws SAXException {
    System.out.println("endPrefixMapping(\"" + prefix + "\")");
  }

  public void skippedEntity(String name) throws SAXException {
    System.out.println("skippedEntity(" + name + ")");
  }

  // Could easily have put main() method in a separate class
  public static void main(String[] args) {

    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    }
    catch (Exception e) {
      // fall back on Xerces parser by name
      try {
        parser = XMLReaderFactory.createXMLReader(
         "org.apache.xerces.parsers.SAXParser");
      }
      catch (Exception ee) {
        System.err.println("Couldn't locate a SAX parser");
        return;
      }
    }

    if (args.length == 0) {
      System.out.println(
       "Usage: java EventReporter URL1 URL2...");
    }

    // Install the Content Handler
    parser.setContentHandler(new EventReporter());

    // start parsing...
    for (int i = 0; i < args.length; i++) {

      // command line should offer URIs or file names
      try {
        parser.parse(args[i]);
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber()
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not report on " + args[i]
         + " because of the IOException " + e);
      }

    }

  }

}
UserLand's RSS based list of Web logs at http://static.userland.com/weblogMonitor/logs.xml:
<?xml version="1.0"?>
<!DOCTYPE foo SYSTEM "http://msdn.microsoft.com/xml/general/htmlentities.dtd">
<weblogs>
  <log>
    <name>MozillaZine</name>
    <url>http://www.mozillazine.org</url>
    <changesUrl>http://www.mozillazine.org/contents.rdf</changesUrl>
    <ownerName>Jason Kersey</ownerName>
    <ownerEmail>kerz@en.com</ownerEmail>
    <description>THE source for news on the Mozilla Organization. DevChats, Reviews, Chats, Builds, Demos, Screenshots, and more.</description>
    <imageUrl></imageUrl>
    <adImageUrl>http://static.userland.com/weblogMonitor/ads/kerz@en.com.gif</adImageUrl>
  </log>
  <log>
    <name>SalonHerringWiredFool</name>
    <url>http://www.salonherringwiredfool.com/</url>
    <ownerName>Some Random Herring</ownerName>
    <ownerEmail>salonfool@wiredherring.com</ownerEmail>
    <description></description>
  </log>
  <log>
    <name>Scripting News</name>
    <url>http://www.scripting.com/</url>
    <ownerName>Dave Winer</ownerName>
    <ownerEmail>dave@userland.com</ownerEmail>
    <description>News and commentary from the cross-platform scripting community.</description>
    <imageUrl>http://www.scripting.com/gifs/tinyScriptingNews.gif</imageUrl>
    <adImageUrl>http://static.userland.com/weblogMonitor/ads/dave@userland.com.gif</adImageUrl>
  </log>
  <log>
    <name>SlashDot.Org</name>
    <url>http://www.slashdot.org/</url>
    <ownerName>Simply a friend</ownerName>
    <ownerEmail>afriendofweblogs@weblogs.com</ownerEmail>
    <description>News for Nerds, Stuff that Matters.</description>
  </log>
</weblogs>
Design Decisions
Should we return an array, an Enumeration, a List, or what?
Perhaps we should use multiple threads?
We do not know how many URLs there will be when we start parsing, so let's use a Vector
Single threaded for simplicity but a real program would use multiple threads
One to load and parse the data
Another thread (probably the main thread) to serve the data
Early data could be provided before the entire document had been read
The character data of each url element needs to be stored.
Everything else can be ignored.
A startElement() with the name url indicates that we need to start storing this data.
An endElement() with the name url indicates that we need to stop storing this data, convert it to a URL, and put it in the Vector
Should we hide the XML parsing inside a non-public class to avoid accidentally calling the methods from unexpected places or threads?
import org.xml.sax.*;
import org.xml.sax.helpers.XMLReaderFactory;
import java.util.*;
import java.io.*;

public class WeblogsSAX {

  public static List listChannels() throws IOException, SAXException {
    return listChannels(
     "http://static.userland.com/weblogMonitor/logs.xml");
  }

  public static List listChannels(String uri)
   throws IOException, SAXException {

    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    }
    catch (SAXException ex) {
      parser = XMLReaderFactory.createXMLReader(
       "org.apache.xerces.parsers.SAXParser"
      );
    }
    Vector urls = new Vector(1000);
    ContentHandler handler = new URIGrabber(urls);
    parser.setContentHandler(handler);
    parser.parse(uri);
    return urls;

  }

  public static void main(String[] args) {

    try {
      List urls;
      if (args.length > 0) urls = listChannels(args[0]);
      else urls = listChannels();
      Iterator iterator = urls.iterator();
      while (iterator.hasNext()) {
        System.out.println(iterator.next());
      }
    }
    catch (IOException e) {
      System.err.println(e);
    }
    catch (SAXParseException e) {
      System.err.println(e);
      System.err.println("at line " + e.getLineNumber()
       + ", column " + e.getColumnNumber());
    }
    catch (SAXException e) {
      System.err.println(e);
    }
    catch (/* Unexpected */ Exception e) {
      e.printStackTrace();
    }

  }

}
import org.xml.sax.*;
import java.net.*;
import java.util.Vector;

// conflicts with java.net.ContentHandler
class URIGrabber implements org.xml.sax.ContentHandler {

  private Vector urls;

  URIGrabber(Vector urls) {
    this.urls = urls;
  }

  // do nothing methods
  public void setDocumentLocator(Locator locator) {}
  public void startDocument() throws SAXException {}
  public void endDocument() throws SAXException {}
  public void startPrefixMapping(String prefix, String uri)
   throws SAXException {}
  public void endPrefixMapping(String prefix) throws SAXException {}
  public void skippedEntity(String name) throws SAXException {}
  public void ignorableWhitespace(char[] text, int start, int length)
   throws SAXException {}
  public void processingInstruction(String target, String data)
   throws SAXException {}

  // Remember, there's no guarantee all the text of the
  // url element will be returned in a single call to characters
  private StringBuffer urlBuffer;
  private boolean collecting = false;

  public void startElement(String namespaceURI, String localName,
   String qualifiedName, Attributes atts) throws SAXException {

    if (qualifiedName.equals("url")) {
      collecting = true;
      urlBuffer = new StringBuffer();
    }

  }

  public void characters(char[] text, int start, int length)
   throws SAXException {

    if (collecting) {
      urlBuffer.append(text, start, length);
    }

  }

  public void endElement(String namespaceURI, String localName,
   String qualifiedName) throws SAXException {

    if (qualifiedName.equals("url")) {
      collecting = false;
      String url = urlBuffer.toString();
      try {
        urls.addElement(new URL(url));
      }
      catch (MalformedURLException e) {
        // skip this url
      }
    }

  }

}
% java WeblogsSAX shortlogs.xml
http://www.mozillazine.org
http://www.salonherringwiredfool.com/
http://www.slashdot.org/
SAX2 parsers--that is XMLReaders--are configured by features and properties
Feature and property names are absolute URIs
A feature is boolean, on or off, true or false; a property is an object
public boolean getFeature(String name)
throws SAXNotRecognizedException, SAXNotSupportedException
public void setFeature(String name, boolean value)
throws SAXNotRecognizedException, SAXNotSupportedException
public Object getProperty(String name)
throws SAXNotRecognizedException, SAXNotSupportedException
public void setProperty(String name, Object value)
throws SAXNotRecognizedException, SAXNotSupportedException
Features can be read-only or read/write.
Some features may be modifiable while parsing; others only before parsing starts
For example,
try {
if (xmlReader.getFeature("http://xml.org/sax/features/validation")) {
System.out.println("Parser is validating.");
}
else {
System.out.println("Parser is not validating.");
}
}
catch (SAXException e) {
System.out.println("Do not know if parser validates");
}
SAXNotRecognizedException
SAXNotSupportedException
http://xml.org/sax/features/namespaces
If true, then perform namespace processing.
If false, then, at parser option, do not perform namespace processing
access: (parsing) read-only; (not parsing) read/write
true by default
http://xml.org/sax/features/namespace-prefixes
If true, then report the original prefixed names and attributes used for namespace declarations.
If false, then do not report attributes used for namespace declarations, and optionally do not report original prefixed names.
false by default
access: (parsing) read-only; (not parsing) read/write
http://xml.org/sax/features/namespaces
http://xml.org/sax/features/namespace-prefixes
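To make the two namespace features concrete, here is a minimal sketch that turns namespaces on and namespace-prefixes off, then records the namespace URI and local name of each start-tag. It assumes the parser bundled with the JDK, obtained through JAXP's SAXParserFactory rather than through XMLReaderFactory as in the rest of this tutorial; NamespaceFeatureSketch and expandedNames() are names invented for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; uses the JDK's bundled parser (an assumption).
class NamespaceFeatureSketch {

  static final String NAMESPACES =
      "http://xml.org/sax/features/namespaces";
  static final String NAMESPACE_PREFIXES =
      "http://xml.org/sax/features/namespace-prefixes";

  // Returns "{namespaceURI}localName" for each start-tag in the document.
  static List<String> expandedNames(String xml) {
    List<String> names = new ArrayList<>();
    try {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true);
      XMLReader reader = factory.newSAXParser().getXMLReader();
      // Both features are read/write only while the reader is not parsing.
      reader.setFeature(NAMESPACES, true);
      reader.setFeature(NAMESPACE_PREFIXES, false);
      reader.setContentHandler(new DefaultHandler() {
        @Override
        public void startElement(String namespaceURI, String localName,
            String qualifiedName, Attributes atts) {
          names.add("{" + namespaceURI + "}" + localName);
        }
      });
      reader.parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      // Wrapped only to keep the sketch short.
      throw new IllegalStateException(e);
    }
    return names;
  }
}
```

With these settings the handler should rely on the namespace URI and local name. The qualified name may or may not be supplied, so code that needs prefixes should set namespace-prefixes to true instead.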
http://xml.org/sax/features/string-interning
If true, then all element names, prefixes, attribute names, Namespace URIs, and local names are internalized using java.lang.String.intern().
If false, then names are not necessarily internalized.
access: (parsing) read-only; (not parsing) read/write
http://xml.org/sax/features/validation
If true, then report all validation errors
If false, then do not report validation errors.
access: (parsing) read-only; (not parsing) read/write
http://xml.org/sax/features/external-general-entities
If true, then include all external general (text) entities.
false: Do not include external general entities.
access: (parsing) read-only; (not parsing) read/write
http://xml.org/sax/features/external-parameter-entities
If true, then include all external parameter entities, including the external DTD subset.
false: Do not include any external parameter entities, even the external DTD subset.
access: (parsing) read-only; (not parsing) read/write
adapted from SAX2 documentation by David Megginson
Not all parsers are validating but Xerces-J is.
Validity errors are not fatal; therefore they do not throw SAXParseExceptions
Must install an ErrorHandler as well as a ContentHandler
Must set the feature http://xml.org/sax/features/validation
In increasing order of severity:
A warning; e.g. ambiguous content model, a constraint for compatibility
A recoverable error: typically a validity error
A fatal error: typically a well-formedness error
package org.xml.sax;
public interface ErrorHandler {
public void warning(SAXParseException exception)
throws SAXException;
public void error(SAXParseException exception)
throws SAXException;
public void fatalError(SAXParseException exception)
throws SAXException;
}
import org.xml.sax.*;
import java.io.*;

public class ValidityErrorReporter implements ErrorHandler {

  Writer out;

  public ValidityErrorReporter(Writer out) {
    this.out = out;
  }

  public ValidityErrorReporter() {
    this(new OutputStreamWriter(System.out));
  }

  public void warning(SAXParseException ex) throws SAXException {
    try {
      out.write(ex.getMessage() + "\r\n");
      out.write(" at line " + ex.getLineNumber() + ", column "
       + ex.getColumnNumber() + "\r\n");
      out.flush();
    }
    catch (IOException e) {
      throw new SAXException(e);
    }
  }

  public void error(SAXParseException ex) throws SAXException {
    try {
      out.write(ex.getMessage() + "\r\n");
      out.write(" at line " + ex.getLineNumber() + ", column "
       + ex.getColumnNumber() + "\r\n");
      out.flush();
    }
    catch (IOException e) {
      throw new SAXException(e);
    }
  }

  public void fatalError(SAXParseException ex) throws SAXException {
    try {
      out.write(ex.getMessage() + "\r\n");
      out.write(" at line " + ex.getLineNumber() + ", column "
       + ex.getColumnNumber() + "\r\n");
      out.flush();
    }
    catch (IOException e) {
      throw new SAXException(e);
    }
  }

}
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import org.apache.xerces.parsers.*;
import java.io.*;

public class SAX2Validator {

  public static void main(String[] args) {

    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    }
    catch (SAXException ex) {
      try {
        parser = XMLReaderFactory.createXMLReader(
         "org.apache.xerces.parsers.SAXParser"
        );
      }
      catch (SAXException ex2) {
        System.err.println("Could not locate a SAX2 Parser");
        return;
      }
    }

    // turn on validation
    try {
      parser.setFeature(
       "http://xml.org/sax/features/validation", true);
      parser.setErrorHandler(new ValidityErrorReporter());
    }
    catch (SAXNotRecognizedException e) {
      System.err.println(
       "Installed XML parser cannot validate;"
       + " checking for well-formedness instead...");
    }
    catch (SAXNotSupportedException e) {
      System.err.println(
       "Cannot turn on validation here; "
       + "checking for well-formedness instead...");
    }

    if (args.length == 0) {
      System.out.println("Usage: java SAX2Validator URL1 URL2...");
    }

    // start parsing...
    for (int i = 0; i < args.length; i++) {

      // command line should offer URIs or file names
      try {
        parser.parse(args[i]);
        // If there are no well-formedness errors,
        // then no exception is thrown
        System.out.println(args[i] + " is well formed.");
      }
      catch (SAXParseException e) { // well-formedness error
        System.out.println(args[i] + " is not well formed.");
        System.out.println(e.getMessage()
         + " at line " + e.getLineNumber()
         + ", column " + e.getColumnNumber());
      }
      catch (SAXException e) { // some other kind of error
        System.out.println(e.getMessage());
      }
      catch (IOException e) {
        System.out.println("Could not check " + args[i]
         + " because of the IOException " + e);
      }

    }

  }

}
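The three severity levels can also be collected rather than printed. The sketch below is one possible way to do that, assuming the JDK's bundled parser obtained through JAXP; ValidityCollectorSketch and problems() are hypothetical names. It shows that a validity error arrives through error() and lets parsing continue, while a well-formedness error arrives through fatalError() and ends the parse.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical example class; uses the JDK's bundled parser (an assumption).
class ValidityCollectorSketch {

  // Parses with validation on and returns one line per reported problem,
  // prefixed with "warning: ", "error: ", or "fatal: ".
  static List<String> problems(String xml) {
    List<String> found = new ArrayList<>();
    try {
      XMLReader reader =
          SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      reader.setFeature("http://xml.org/sax/features/validation", true);
      reader.setErrorHandler(new ErrorHandler() {
        public void warning(SAXParseException ex) {
          found.add("warning: " + ex.getMessage());
        }
        public void error(SAXParseException ex) {
          // Validity errors land here; parsing continues afterwards.
          found.add("error: " + ex.getMessage());
        }
        public void fatalError(SAXParseException ex)
            throws SAXParseException {
          found.add("fatal: " + ex.getMessage());
          throw ex; // nothing useful can be parsed after this point
        }
      });
      reader.parse(new InputSource(new StringReader(xml)));
    } catch (SAXParseException e) {
      // Already recorded by fatalError().
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    return found;
  }
}
```

For example, a document whose internal DTD subset declares the root element EMPTY but whose root contains text should produce at least one entry beginning with "error: ".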
http://xml.org/sax/properties/lexical-handler
data type: org.xml.sax.ext.LexicalHandler
description: An optional extension handler for items like comments that are not part of the information set and may be omitted.
access: read/write
http://xml.org/sax/properties/declaration-handler
data type: org.xml.sax.ext.DeclHandler
description: An optional extension handler for ATTLIST and ELEMENT declarations (but not notations and unparsed entities).
access: read/write
http://xml.org/sax/properties/dom-node
data type: org.w3c.dom.Node
description: When parsing, the current DOM node being visited if this is a DOM iterator; when not parsing, the root DOM node for iteration.
access: (parsing) read-only; (not parsing) read/write
http://xml.org/sax/properties/xml-string
data type: java.lang.String
description: The literal string of characters that was the source for the current event.
access: read-only
adapted from SAX2 documentation by David Megginson
http://apache.org/xml/features/validation/dynamic
True: The parser will validate the document if a DTD is specified in a DOCTYPE declaration or using the appropriate schema attributes like xsi:noNamespaceSchemaLocation.
False: Validation is determined by the state of the http://xml.org/sax/features/validation feature.
Default is false
http://apache.org/xml/features/validation/warn-on-duplicate-attdef
True: Warn on duplicate attribute declaration.
False: Do not warn on duplicate attribute declaration.
Default: true
http://apache.org/xml/features/validation/warn-on-undeclared-elemdef
True: Warn if element referenced in content model is not declared.
False: Do not warn if element referenced in content model is not declared.
Default: true
http://apache.org/xml/features/allow-java-encodings
True: Allow Java encoding names like 8859_1 in XML and text declarations.
False: Do not allow Java encoding names in XML and text declarations.
Default: false
http://apache.org/xml/features/continue-after-fatal-error
True: Continue after fatal error.
False: Stops parse on first fatal error.
Default: false
None for the SAX parser
The DOM parser has a couple
Extension handlers are non-required interfaces in the org.xml.sax.ext package.
To set the LexicalHandler for an XML reader, set the property http://xml.org/sax/properties/lexical-handler.
To set the DeclHandler for an XML reader, set the property http://xml.org/sax/properties/declaration-handler.
If the reader does not support the requested property, it will throw a SAXNotRecognizedException or a SAXNotSupportedException.
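As an illustration of installing an extension handler, the sketch below collects the comments in a document. It sets the http://xml.org/sax/properties/lexical-handler property listed earlier and falls back to reporting no comments if the reader rejects it. The handler extends org.xml.sax.ext.DefaultHandler2, a convenience class from a later SAX release that ships with the JDK; that class, the JAXP-obtained parser, and the names CommentCollectorSketch and comments() are assumptions of this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical example class; uses the JDK's bundled parser (an assumption).
class CommentCollectorSketch {

  // Returns the text of every comment the parser reports, in order.
  static List<String> comments(String xml) {
    List<String> comments = new ArrayList<>();
    try {
      XMLReader reader =
          SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      // DefaultHandler2 implements LexicalHandler with empty methods.
      DefaultHandler2 handler = new DefaultHandler2() {
        @Override
        public void comment(char[] text, int start, int length) {
          comments.add(new String(text, start, length));
        }
      };
      reader.setContentHandler(handler);
      try {
        reader.setProperty(
            "http://xml.org/sax/properties/lexical-handler", handler);
      } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
        // This reader offers no lexical events; the list stays empty.
      }
      reader.parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    return comments;
  }
}
```

A DeclHandler is installed the same way, using the declaration-handler property instead.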
The startElement() method in ContentHandler receives as an argument an Attributes object containing all attributes on that tag.
public void startElement(String namespaceURI,
String localName, String qualifiedName, Attributes atts) throws SAXException
The Attributes interface:
package org.xml.sax;
public interface Attributes {
public int getLength();
/* Look up an attribute's Namespace URI by index.*/
public String getURI(int index);
public String getLocalName(int index);
public String getQName(int index);
public String getType(int index);
public String getValue(int index);
public int getIndex(String uri, String localPart);
public int getIndex(String qualifiedName);
public String getType(String uri, String localName);
public String getType(String qualifiedName);
public String getValue(String uri, String localName);
public String getValue(String qualifiedName);
}
import org.xml.sax.*;
import org.apache.xerces.parsers.*;
import java.io.*;
import java.util.*;
import org.xml.sax.helpers.*;

public class XLinkSpider extends DefaultHandler {

  public static Enumeration listURIs(String systemId)
   throws SAXException, IOException {

    // set up the parser
    XMLReader parser;
    try {
      parser = XMLReaderFactory.createXMLReader();
    }
    catch (SAXException e) {
      try {
        parser = XMLReaderFactory.createXMLReader(
         "org.apache.xerces.parsers.SAXParser");
      }
      catch (SAXException e2) {
        System.err.println("Error: could not locate a parser.");
        return null;
      }
    }

    // Install the Content Handler
    XLinkSpider spider = new XLinkSpider();
    parser.setContentHandler(spider);
    parser.parse(systemId);
    return spider.uris.elements();

  }

  private Vector uris = new Vector();

  public void startElement(String namespaceURI, String localName,
   String rawName, Attributes atts) throws SAXException {

    String uri = atts.getValue(
     "http://www.w3.org/1999/xlink", "href");
    if (uri != null) uris.addElement(uri);

  }

  public static void main(String[] args) {

    if (args.length == 0) {
      System.out.println("Usage: java XLinkSpider URL1 URL2...");
    }

    // start parsing...
    for (int i = 0; i < args.length; i++) {

      try {
        Enumeration uris = listURIs(args[i]);
        while (uris.hasMoreElements()) {
          String s = (String) uris.nextElement();
          System.out.println(s);
        }
      }
      catch (Exception e) {
        System.err.println(e);
        e.printStackTrace();
      }

    } // end for

  } // end main

} // end XLinkSpider
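Attributes can also be walked by index when the names are not known in advance. The following sketch records every attribute it sees, using getLength(), getQName(int), and getValue(int). It assumes the JDK's bundled parser obtained through JAXP with its default, non-namespace-aware settings, so qualified names are always reported; AttributeListerSketch and list() are illustrative names.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; uses the JDK's bundled parser (an assumption).
class AttributeListerSketch extends DefaultHandler {

  // Maps "elementName/@attributeName" to the attribute value.
  private final Map<String, String> found = new TreeMap<>();

  @Override
  public void startElement(String namespaceURI, String localName,
      String qualifiedName, Attributes atts) {
    // Attribute order is not significant, so treat the index
    // only as a way to visit each attribute once.
    for (int i = 0; i < atts.getLength(); i++) {
      found.put(qualifiedName + "/@" + atts.getQName(i), atts.getValue(i));
    }
  }

  static Map<String, String> list(String xml) {
    AttributeListerSketch handler = new AttributeListerSketch();
    try {
      XMLReader reader =
          SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      reader.setContentHandler(handler);
      reader.parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    return handler.found;
  }
}
```

The Attributes object is only valid during the startElement() call, which is why the values are copied into the map immediately.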
Encapsulates access to data so that it looks the same whether it's coming from a
URL
file
stream
reader
database
something else
Used in SAX1 and SAX2
Allows the source to be changed
package org.xml.sax;
import java.io.*;
public class InputSource {
public InputSource()
public InputSource(String systemID)
public InputSource(InputStream in)
public InputSource(Reader in)
public void setPublicId(String publicID)
public String getPublicId()
public void setSystemId(String systemID)
public String getSystemId()
public void setByteStream(InputStream byteStream)
public InputStream getByteStream()
public void setEncoding(String encoding)
public String getEncoding()
public void setCharacterStream(Reader characterStream)
public Reader getCharacterStream()
}
import org.xml.sax.*;
import java.io.*;
import java.net.*;
import java.util.zip.*;
...
try {
  URL u = new URL(
   "http://www.ibiblio.org/xml/examples/1998validstats.xml.gz");
  InputStream raw = u.openStream();
  // wrap the raw stream, not the InputSource declared on the next line
  InputStream decompressed = new GZIPInputStream(raw);
  InputSource in = new InputSource(decompressed);
  // read the document...
}
catch (IOException e) {
  System.err.println(e);
}
catch (SAXException e) {
  System.err.println(e);
}
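The same idea can be tried without a network connection. In this sketch the document is gzipped in memory, wrapped in a GZIPInputStream, and handed to the parser through an InputSource. It assumes the JDK's bundled parser obtained through JAXP; GzipInputSourceSketch, gzip(), and countElements() are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; uses the JDK's bundled parser (an assumption).
class GzipInputSourceSketch {

  // Compress a string so the example needs no network access.
  static byte[] gzip(String text) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
        out.write(text.getBytes(StandardCharsets.UTF_8));
      }
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new IllegalStateException(e);
    }
  }

  // Count the start-tags in a gzipped XML document.
  static int countElements(byte[] gzipped) {
    int[] count = {0};
    try (InputStream raw = new ByteArrayInputStream(gzipped);
         InputStream decompressed = new GZIPInputStream(raw)) {
      InputSource in = new InputSource(decompressed);
      in.setEncoding("UTF-8"); // what gzip() above wrote
      XMLReader reader =
          SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      reader.setContentHandler(new DefaultHandler() {
        @Override
        public void startElement(String namespaceURI, String localName,
            String qualifiedName, Attributes atts) {
          count[0]++;
        }
      });
      reader.parse(in);
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    return count[0];
  }
}
```

The parser never learns that the bytes were compressed; the InputSource hides where the data came from.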
ELEMENT, ATTLIST, ENTITY declarations are only optionally reported
Schema declarations aren't reported at all
Lexical events are only optionally reported
SAX2 can be configured on top of a lot of different parsers with different capabilities. What the parser does is more important than what SAX2 does.
You do not always have all the information you need at the time of a given callback
You may need to store information in various data structures (stacks, queues, vectors, arrays, etc.) and act on it at a later point
For example, the characters() method is not guaranteed to give you the maximum number of contiguous characters. It may split a single run of characters over multiple method calls.
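One common way to cope with both problems is to keep a stack of open elements and a text buffer, and to act only in endElement(). The sketch below does that to produce "path=text" pairs. It is a simplified illustration that ignores mixed content and assumes the JDK's bundled parser obtained through JAXP; PathTextSketch and collect() are names invented here.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; uses the JDK's bundled parser (an assumption).
class PathTextSketch extends DefaultHandler {

  private final Deque<String> open = new ArrayDeque<>(); // open elements
  private final StringBuilder text = new StringBuilder();
  private final List<String> results = new ArrayList<>();

  @Override
  public void startElement(String namespaceURI, String localName,
      String qualifiedName, Attributes atts) {
    open.push(qualifiedName);
    text.setLength(0);
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    // May be called several times for one run of text, so accumulate.
    text.append(ch, start, length);
  }

  @Override
  public void endElement(String namespaceURI, String localName,
      String qualifiedName) {
    String value = text.toString().trim();
    if (!value.isEmpty()) results.add(path() + "=" + value);
    text.setLength(0);
    open.pop();
  }

  // Builds "/outer/inner" from the stack, outermost element first.
  private String path() {
    StringBuilder sb = new StringBuilder();
    Iterator<String> it = open.descendingIterator();
    while (it.hasNext()) sb.append('/').append(it.next());
    return sb.toString();
  }

  static List<String> collect(String xml) {
    PathTextSketch handler = new PathTextSketch();
    try {
      XMLReader reader =
          SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      reader.setContentHandler(handler);
      reader.parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    return handler.results;
  }
}
```

Because the text is only examined in endElement(), it does not matter how many characters() calls the parser used to deliver it.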
XML in a Nutshell
Elliotte Rusty Harold and Scott Means
O'Reilly & Associates, 2001
ISBN: 0-596-00058-8
SAX website: http://www.megginson.com/SAX/
Writing with DOM
Reading with DOM
An XML document is a tree.
It has a root.
It has nodes.
It is amenable to recursive processing.
Not all applications agree on what the root is.
Not all applications agree on what is and isn't a node.
Defines how XML and HTML documents are represented as objects in programs
Defined in IDL; thus language independent
HTML as well as XML
Writing as well as reading
More complete than SAX; covers everything except internal and external DTD subsets
DOM focuses more on the document; SAX focuses more on the parser.
DOM Level 0:
DOM Level 1, a W3C Standard
DOM Level 2, a W3C Standard
DOM Level 3: Several Working Drafts:
Apache XML Project's Xerces Java: http://xml.apache.org/xerces-j/index.html
IBM's XML for Java: http://www.alphaworks.ibm.com/formula/xml
Sun's Java API for XML http://java.sun.com/products/xml
Eight Modules:
Core: org.w3c.dom *
HTML: org.w3c.dom.html
Views: org.w3c.dom.views
StyleSheets: org.w3c.dom.stylesheets
CSS: org.w3c.dom.css
Events: org.w3c.dom.events *
Traversal: org.w3c.dom.traversal *
Range: org.w3c.dom.range
Only the core and traversal modules really apply to XML. The other six are for HTML.
* indicates Xerces support
A DOM application can use the hasFeature() method of the DOMImplementation interface to determine whether a module is supported or not.
XML Module: "XML"
HTML Module: "HTML"
Views Module: "Views"
StyleSheets Module: "StyleSheets"
CSS Module: "CSS"
CSS (extended interfaces) Module: "CSS2"
Events Module: "Events"
User Interface Events (UIEvent interface) Module: "UIEvents"
Mouse Events Module: "MouseEvents"
Mutation Events Module: "MutationEvents"
HTML Events Module: "HTMLEvents"
Traversal Module: "Traversal"
Range Module: "Range"
import org.apache.xerces.dom.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;

public class ModuleChecker {

  public static void main(String[] args) {

    // parser dependent
    DOMImplementation implementation
     = DOMImplementationImpl.getDOMImplementation();
    String[] features = {"XML", "HTML", "Views", "StyleSheets",
     "CSS", "CSS2", "Events", "UIEvents", "MouseEvents",
     "MutationEvents", "HTMLEvents", "Traversal", "Range"};

    for (int i = 0; i < features.length; i++) {
      if (implementation.hasFeature(features[i], "2.0")) {
        System.out.println("Implementation supports "
         + features[i]);
      }
      else {
        System.out.println("Implementation does not support "
         + features[i]);
      }
    }

  }

}
% java ModuleChecker
Implementation supports XML
Implementation does not support HTML
Implementation does not support Views
Implementation does not support StyleSheets
Implementation does not support CSS
Implementation does not support CSS2
Implementation supports Events
Implementation does not support UIEvents
Implementation does not support MouseEvents
Implementation supports MutationEvents
Implementation does not support HTMLEvents
Implementation supports Traversal
Implementation does not support Range
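The check above depends on a Xerces-specific class. A parser-independent variant can be sketched with JAXP's DocumentBuilderFactory, which is an assumption here since the tutorial itself uses Xerces directly; JaxpModuleCheckerSketch and supports() are hypothetical names. Which modules are reported depends entirely on the DOM implementation that JAXP finds at run time.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;

// Hypothetical example class; relies on JAXP (an assumption).
class JaxpModuleCheckerSketch {

  // Asks whichever DOM implementation JAXP locates about one module.
  static boolean supports(String module, String version) {
    try {
      DOMImplementation implementation = DocumentBuilderFactory
          .newInstance().newDocumentBuilder().getDOMImplementation();
      return implementation.hasFeature(module, version);
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String[] features = {"XML", "HTML", "Views", "StyleSheets",
        "CSS", "CSS2", "Events", "UIEvents", "MouseEvents",
        "MutationEvents", "HTMLEvents", "Traversal", "Range"};
    for (String feature : features) {
      System.out.println(feature + ": " + supports(feature, "2.0"));
    }
  }
}
```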
Entire document is represented as a tree.
A tree contains nodes.
Some nodes may contain other nodes (depending on node type).
Each document node contains:
zero or one doctype nodes
one root element node
zero or more comment and processing instruction nodes
17 interfaces:
Attr
CDATASection
CharacterData
Comment
Document
DocumentFragment
DocumentType
DOMImplementation
Element
Entity
EntityReference
NamedNodeMap
Node
NodeList
Notation
ProcessingInstruction
Text
plus one exception:
DOMException
Plus a bunch of HTML stuff in org.w3c.dom.html and other packages we will ignore
Library specific code creates a parser
The parser parses the document and returns a DOM org.w3c.dom.Document object.
The entire document is stored in memory.
DOM methods and interfaces are used to extract data from this object
import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;

public class DOMParserMaker {

  public static void main(String[] args) {

    // This is simpler but less flexible than the SAX approach.
    // Perhaps a good creational design pattern is needed here?
    DOMParser parser = new DOMParser();

    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]);

        Document d = parser.getDocument();
        // work with the document...
      }
      catch (SAXException e) {
        System.err.println(e);
      }
      catch (IOException e) {
        System.err.println(e);
      }

    }

  }

}
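For comparison, the library-specific step can be hidden behind a factory. The sketch below uses JAXP's DocumentBuilderFactory and DocumentBuilder to parse a document held in a string; this is an assumption of the example rather than the approach used in the listing above, and JaxpDomSketch and parse() are illustrative names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class; relies on JAXP (an assumption).
class JaxpDomSketch {

  // Reads the entire document into memory and returns its DOM tree.
  static Document parse(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      DocumentBuilder builder = factory.newDocumentBuilder();
      return builder.parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      // Wrapped only to keep the sketch short.
      throw new IllegalStateException(e);
    }
  }
}
```

Once parse() returns, only org.w3c.dom interfaces are needed; for instance, parse(xml).getDocumentElement().getNodeName() gives the name of the root element.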
package org.w3c.dom;
public interface Node {
// NodeType
public static final short ELEMENT_NODE = 1;
public static final short ATTRIBUTE_NODE = 2;
public static final short TEXT_NODE = 3;
public static final short CDATA_SECTION_NODE = 4;
public static final short ENTITY_REFERENCE_NODE = 5;
public static final short ENTITY_NODE = 6;
public static final short PROCESSING_INSTRUCTION_NODE = 7;
public static final short COMMENT_NODE = 8;
public static final short DOCUMENT_NODE = 9;
public static final short DOCUMENT_TYPE_NODE = 10;
public static final short DOCUMENT_FRAGMENT_NODE = 11;
public static final short NOTATION_NODE = 12;
public String getNodeName();
public String getNodeValue() throws DOMException;
public void setNodeValue(String nodeValue) throws DOMException;
public short getNodeType();
public Node getParentNode();
public NodeList getChildNodes();
public Node getFirstChild();
public Node getLastChild();
public Node getPreviousSibling();
public Node getNextSibling();
public NamedNodeMap getAttributes();
public Document getOwnerDocument();
public Node insertBefore(Node newChild, Node refChild) throws DOMException;
public Node replaceChild(Node newChild, Node oldChild) throws DOMException;
public Node removeChild(Node oldChild) throws DOMException;
public Node appendChild(Node newChild) throws DOMException;
public boolean hasChildNodes();
public Node cloneNode(boolean deep);
public void normalize();
public boolean supports(String feature, String version);
public String getNamespaceURI();
public String getPrefix();
public void setPrefix(String prefix) throws DOMException;
public String getLocalName();
}
package org.w3c.dom;
public interface NodeList {
public Node item(int index);
public int getLength();
}
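A NodeList is a live view of the tree, not a snapshot. The following is a minimal sketch of that behavior, assuming a JAXP DocumentBuilder is used to create the empty document; the class and method names are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical demo class; JAXP is assumed only as a way to get an empty Document.
class LiveNodeListSketch {

    // Returns the length reported by one NodeList before and after the tree changes.
    static int[] lengthsBeforeAndAfterAppend() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("SONG");
        doc.appendChild(root);
        root.appendChild(doc.createElement("TITLE"));
        root.appendChild(doc.createElement("ARTIST"));

        NodeList children = root.getChildNodes();
        int before = children.getLength();

        // Change the tree; the NodeList obtained earlier reflects the change.
        root.appendChild(doc.createElement("YEAR"));
        int after = children.getLength();

        return new int[] {before, after};
    }

    public static void main(String[] args) throws Exception {
        int[] lengths = lengthsBeforeAndAfterAppend();
        System.out.println(lengths[0] + " then " + lengths[1]);
    }
}
```

Because the list is live, a loop that removes children while counting up with item(i) can skip nodes; iterating backwards or re-reading getLength() on each pass avoids that.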
Now we're really ready to read a document
import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;

public class NodeReporter {

  public static void main(String[] args) {

    DOMParser parser = new DOMParser();
    NodeReporter iterator = new NodeReporter();

    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]);
        Document doc = parser.getDocument();
        iterator.followNode(doc);
      }
      catch (SAXException e) {
        System.err.println(e);
      }
      catch (IOException e) {
        System.err.println(e);
      }
    }

  } // end main

  // note use of recursion
  public void followNode(Node node) {
    processNode(node);
    if (node.hasChildNodes()) {
      NodeList children = node.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        followNode(children.item(i));
      }
    }
  }

  public void processNode(Node node) {
    String name = node.getNodeName();
    String type = getTypeName(node.getNodeType());
    System.out.println("Type " + type + ": " + name);
  }

  public static String getTypeName(int type) {
    switch (type) {
      case Node.ELEMENT_NODE: return "Element";
      case Node.ATTRIBUTE_NODE: return "Attribute";
      case Node.TEXT_NODE: return "Text";
      case Node.CDATA_SECTION_NODE: return "CDATA Section";
      case Node.ENTITY_REFERENCE_NODE: return "Entity Reference";
      case Node.ENTITY_NODE: return "Entity";
      case Node.PROCESSING_INSTRUCTION_NODE: return "Processing Instruction";
      case Node.COMMENT_NODE: return "Comment";
      case Node.DOCUMENT_NODE: return "Document";
      case Node.DOCUMENT_TYPE_NODE: return "Document Type Declaration";
      case Node.DOCUMENT_FRAGMENT_NODE: return "Document Fragment";
      case Node.NOTATION_NODE: return "Notation";
      default: return "Unknown Type";
    }
  }

}
% java NodeReporter hotcop.xml
Type Document: #document
Type Processing Instruction: xml-stylesheet
Type Document Type Declaration: SONG
Type Element: SONG
Type Text: #text
Type Element: TITLE
Type Text: #text
Type Text: #text
Type Element: PHOTO
Type Text: #text
Type Element: COMPOSER
Type Text: #text
Type Text: #text
Type Element: COMPOSER
Type Text: #text
Type Text: #text
Type Element: COMPOSER
Type Text: #text
Type Text: #text
Type Element: PRODUCER
Type Text: #text
Type Text: #text
Type Comment: #comment
Type Text: #text
Type Element: PUBLISHER
Type Text: #text
Type Text: #text
Type Element: LENGTH
Type Text: #text
Type Text: #text
Type Element: YEAR
Type Text: #text
Type Text: #text
Type Element: ARTIST
Type Text: #text
Type Text: #text
Type Comment: #comment
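Attributes are missing from this output. Attr objects are nodes, but they are not children of the element that carries them, so walking the child lists never reaches them. They are reached through the element's getAttributes() method instead.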
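To see the attributes, ask each element for its NamedNodeMap. A minimal sketch, assuming JAXP is used for parsing in place of the Xerces DOMParser; the class name AttributeListerSketch is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper; JAXP is an assumption standing in for Xerces' DOMParser.
class AttributeListerSketch {

    // Returns "name=value" for each attribute of the root element, sorted by name.
    static List<String> rootAttributes(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();

        NamedNodeMap attributes = root.getAttributes();
        List<String> result = new ArrayList<>();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node attr = attributes.item(i); // getNodeType() is Node.ATTRIBUTE_NODE
            result.add(attr.getNodeName() + "=" + attr.getNodeValue());
        }
        // The DOM does not define an order for a NamedNodeMap, so sort explicitly.
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rootAttributes("<PHOTO WIDTH=\"100\" HEIGHT=\"200\"/>"));
    }
}
```

For non-element nodes getAttributes() returns null, so a general tree walker should check the node type (or test for null) before using the map.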
Node Type | Node Value |
---|---|
element node | null |
attribute node | attribute value |
text node | text of the node |
CDATA section node | text of the section |
entity reference node | null |
entity node | null |
processing instruction node | content of the processing instruction, not including the target |
comment node | text of the comment |
document node | null |
document type declaration node | null |
document fragment node | null |
notation node | null |
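The table can be checked against a small document. This is a sketch only, assuming JAXP for parsing; the class name NodeValueSketch and the sample document are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper; JAXP is an assumption standing in for Xerces' DOMParser.
class NodeValueSketch {

    // Walks the tree in document order and records "nodeName -> nodeValue".
    static List<String> values(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<String> out = new ArrayList<>();
        collect(doc, out);
        return out;
    }

    private static void collect(Node node, List<String> out) {
        // String concatenation turns a null node value into the text "null".
        out.add(node.getNodeName() + " -> " + node.getNodeValue());
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>"
                + "<?xml-stylesheet type=\"text/css\" href=\"song.css\"?>"
                + "<SONG><!--note--><TITLE>Hot Cop</TITLE></SONG>";
        for (String line : values(xml)) {
            System.out.println(line);
        }
    }
}
```

The document and element nodes report null, while the processing instruction, comment, and text nodes report their content, matching the rows above.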
The root node representing the entire document; not the same as the root element
Contains:
one element node
zero or more processing instruction nodes
zero or more comment nodes
zero or one document type nodes
package org.w3c.dom;
public interface Document extends Node {
public DocumentType getDoctype();
public DOMImplementation getImplementation();
public Element getDocumentElement();
public Element createElement(String tagName) throws DOMException;
public Element createElementNS(String namespaceURI, String qualifiedName) throws DOMException;
public DocumentFragment createDocumentFragment();
public Text createTextNode(String data);
public Comment createComment(String data);
public CDATASection createCDATASection(String data) throws DOMException;
public ProcessingInstruction createProcessingInstruction(String target, String data)
throws DOMException;
public Attr createAttribute(String name) throws DOMException;
public Attr createAttributeNS(String namespaceURI, String qualifiedName) throws DOMException;
public EntityReference createEntityReference(String name) throws DOMException;
public NodeList getElementsByTagName(String tagname);
public NodeList getElementsByTagNameNS(String namespaceURI, String localName);
public Element getElementById(String elementId);
public Node importNode(Node importedNode, boolean deep) throws DOMException;
}
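The difference between the document node and the root element is easy to see in code. A minimal sketch, assuming JAXP for parsing; DocumentRootSketch and its method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; JAXP is an assumption standing in for Xerces' DOMParser.
class DocumentRootSketch {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Names of the children of the Document node itself.
    static List<String> topLevelNodeNames(String xml) throws Exception {
        NodeList children = parse(xml).getChildNodes();
        List<String> names = new ArrayList<>();
        for (int i = 0; i < children.getLength(); i++) {
            names.add(children.item(i).getNodeName());
        }
        return names;
    }

    // The single root element, reached directly.
    static String rootElementName(String xml) throws Exception {
        return parse(xml).getDocumentElement().getTagName();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!--header--><SONG><TITLE>Hot Cop</TITLE></SONG><!--footer-->";
        System.out.println(topLevelNodeNames(xml));
        System.out.println(rootElementName(xml));
    }
}
```

Comments and processing instructions outside the root element are children of the Document, which is why getDocumentElement() is the reliable way to reach the root element rather than getFirstChild().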
UserLand's RSS based list of Web logs at http://static.userland.com/weblogMonitor/logs.xml:
<?xml version="1.0"?>
<!DOCTYPE foo SYSTEM "http://msdn.microsoft.com/xml/general/htmlentities.dtd">
<weblogs>
  <log>
    <name>MozillaZine</name>
    <url>http://www.mozillazine.org</url>
    <changesUrl>http://www.mozillazine.org/contents.rdf</changesUrl>
    <ownerName>Jason Kersey</ownerName>
    <ownerEmail>kerz@en.com</ownerEmail>
    <description>THE source for news on the Mozilla Organization. DevChats, Reviews, Chats, Builds, Demos, Screenshots, and more.</description>
    <imageUrl></imageUrl>
    <adImageUrl>http://static.userland.com/weblogMonitor/ads/kerz@en.com.gif</adImageUrl>
  </log>
  <log>
    <name>SalonHerringWiredFool</name>
    <url>http://www.salonherringwiredfool.com/</url>
    <ownerName>Some Random Herring</ownerName>
    <ownerEmail>salonfool@wiredherring.com</ownerEmail>
    <description></description>
  </log>
  <log>
    <name>Scripting News</name>
    <url>http://www.scripting.com/</url>
    <ownerName>Dave Winer</ownerName>
    <ownerEmail>dave@userland.com</ownerEmail>
    <description>News and commentary from the cross-platform scripting community.</description>
    <imageUrl>http://www.scripting.com/gifs/tinyScriptingNews.gif</imageUrl>
    <adImageUrl>http://static.userland.com/weblogMonitor/ads/dave@userland.com.gif</adImageUrl>
  </log>
  <log>
    <name>SlashDot.Org</name>
    <url>http://www.slashdot.org/</url>
    <ownerName>Simply a friend</ownerName>
    <ownerEmail>afriendofweblogs@weblogs.com</ownerEmail>
    <description>News for Nerds, Stuff that Matters.</description>
  </log>
</weblogs>
Once parsing finishes, we can easily find out how many URLs there are, since they're all in memory.
Single threaded by nature; no benefit to multiple threads since no data will be available until the entire document has been read and parsed.
The character data of each url element needs to be read.
Everything else can be ignored.
The getElementsByTagName() method in Document gives us a quick list of all the url elements.
The XML parsing is so straightforward it can be done inside one method. No extra class is required.
import org.w3c.dom.*;
import org.xml.sax.SAXException;
import java.io.IOException;
import java.util.*;
import java.net.*;

public class WeblogsDOM {

  public static String DEFAULT_URL
   = "http://static.userland.com/weblogMonitor/logs.xml";

  public static List listChannels() throws DOMException {
    return listChannels(DEFAULT_URL);
  }

  public static List listChannels(String uri) throws DOMException {

    if (uri == null) {
      throw new NullPointerException("URL must be non-null");
    }

    org.apache.xerces.parsers.DOMParser parser
     = new org.apache.xerces.parsers.DOMParser();

    // Start with an empty list so that callers never receive null,
    // even if parsing fails
    Vector urls = new Vector();

    try {
      // Read the entire document into memory
      parser.parse(uri);

      Document doc = parser.getDocument();
      NodeList logs = doc.getElementsByTagName("url");
      urls = new Vector(logs.getLength());
      for (int i = 0; i < logs.getLength(); i++) {
        try {
          Node element = logs.item(i);
          Node text = element.getFirstChild();
          // an empty url element has no text child; skip it
          if (text == null) continue;
          String content = text.getNodeValue();
          URL u = new URL(content);
          urls.addElement(u);
        }
        catch (MalformedURLException e) {
          // bad input data from one third party; just ignore it
        }
      }
    }
    catch (SAXException e) {
      System.err.println(e);
    }
    catch (IOException e) {
      System.err.println(e);
    }

    return urls;

  }

  public static void main(String[] args) {

    try {
      List urls;
      if (args.length > 0) {
        try {
          // constructed only to check that the argument is a legal URL
          new URL(args[0]);
          urls = listChannels(args[0]);
        }
        catch (MalformedURLException e) {
          System.err.println("Usage: java WeblogsDOM url");
          return;
        }
      }
      else {
        urls = listChannels();
      }
      Iterator iterator = urls.iterator();
      while (iterator.hasNext()) {
        System.out.println(iterator.next());
      }
    }
    catch (/* Unexpected */ Exception e) {
      e.printStackTrace();
    }

  } // end main

}
% java WeblogsDOM
http://2020Hindsight.editthispage.com/
http://www.sff.net/people/mitchw/weblog/weblog.htp
http://nate.weblogs.com/
http://plugins.launchpoint.net
http://404.psistorm.net
http://home.att.net/~geek9000
http://daubnet.tzo.com/weblog
several hundred more...
Represents a complete element including its start tag, end tag, and content
Contains:
Element nodes
ProcessingInstruction nodes
Comment nodes
Text nodes
CDATASection nodes
EntityReference nodes
package org.w3c.dom;
public interface Element extends Node {
public String getTagName();
public NodeList getElementsByTagName(String name);
public NodeList getElementsByTagNameNS(String namespaceURI, String localName);
public String getAttribute(String name);
public String getAttributeNS(String namespaceURI, String localName);
public void setAttribute(String name, String value) throws DOMException;
public void setAttributeNS(String namespaceURI, String qualifiedName, String value) throws DOMException;
public void removeAttribute(String name) throws DOMException;
public void removeAttributeNS(String namespaceURI, String localName) throws DOMException;
public Attr getAttributeNode(String name);
public Attr getAttributeNodeNS(String namespaceURI, String localName);
public Attr setAttributeNode(Attr newAttr) throws DOMException;
public Attr setAttributeNodeNS(Attr newAttr) throws DOMException;
public Attr removeAttributeNode(Attr oldAttr) throws DOMException;
}
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.IOException;
import org.apache.xml.serialize.*;

public class IDTagger {

  int id = 1;

  public void processNode(Node node) {
    if (node instanceof Element) {
      Element element = (Element) node;
      String currentID = element.getAttribute("ID");
      if (currentID == null || currentID.equals("")) {
        element.setAttribute("ID", "_" + id);
        id = id + 1;
      }
    }
  }

  // note use of recursion
  public void followNode(Node node) {
    processNode(node);
    if (node.hasChildNodes()) {
      NodeList children = node.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        followNode(children.item(i));
      }
    }
  }

  public static void main(String[] args) {

    DOMParser parser = new DOMParser();
    IDTagger iterator = new IDTagger();

    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]);
        Document document = parser.getDocument();
        iterator.followNode(document);

        // now we serialize the document...
        OutputFormat format = new OutputFormat(document);
        XMLSerializer serializer = new XMLSerializer(System.out, format);
        serializer.serialize(document);
      }
      catch (SAXException e) {
        System.err.println(e);
      }
      catch (IOException e) {
        System.err.println(e);
      }
    }

  } // end main

}
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<?xml-stylesheet type="text/css" href="song.css"?>
<SONG ID="_1" xmlns="http://metalab.unc.edu/xml/namespace/song" xmlns:xlink="http://www.w3.org/1999/xlink">
  <TITLE ID="_2">Hot Cop</TITLE>
  <PHOTO ALT="Victor Willis in Cop Outfit" HEIGHT="200" ID="_3" WIDTH="100" xlink:href="hotcop.jpg" xlink:show="onLoad" xlink:type="simple"/>
  <COMPOSER ID="_4">Jacques Morali</COMPOSER>
  <COMPOSER ID="_5">Henri Belolo</COMPOSER>
  <COMPOSER ID="_6">Victor Willis</COMPOSER>
  <PRODUCER ID="_7">Jacques Morali</PRODUCER>
  <!-- The publisher is actually Polygram but I needed an example of a general entity reference. -->
  <PUBLISHER ID="_8" xlink:href="http://www.amrecords.com/" xlink:type="simple">
    A &amp; M Records
  </PUBLISHER>
  <LENGTH ID="_9">6:20</LENGTH>
  <YEAR ID="_10">1978</YEAR>
  <ARTIST ID="_11">Village People</ARTIST>
</SONG>
<!-- You can tell what album I was listening to when I wrote this example -->
Represents things that are basically text holders
Superinterface of Text, Comment, and CDATASection
package org.w3c.dom;
public interface CharacterData extends Node {
public String getData() throws DOMException;
public void setData(String data) throws DOMException;
public int getLength();
public String substringData(int offset, int count) throws DOMException;
public void appendData(String arg) throws DOMException;
public void insertData(int offset, String arg) throws DOMException;
public void deleteData(int offset, int count) throws DOMException;
public void replaceData(int offset, int count, String arg)
throws DOMException;
}
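The editing methods work on offsets and counts measured in characters. A small sketch of each one, assuming a JAXP DocumentBuilder is used to create the owning document; the class name CharacterDataSketch is hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Text;

// Hypothetical demo class; JAXP is assumed only as a way to get an empty Document.
class CharacterDataSketch {

    // Applies each CharacterData editing method in turn and returns the result.
    static String editedTitle() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Text text = doc.createTextNode("Hot Cop");

        text.appendData(" (1978)");       // "Hot Cop (1978)"
        text.insertData(0, "Song: ");     // "Song: Hot Cop (1978)"
        text.replaceData(6, 3, "Cool");   // "Song: Cool Cop (1978)"
        // remove the trailing " (1978)", which is 7 characters long
        text.deleteData(text.getLength() - 7, 7);

        return text.getData();            // "Song: Cool Cop"
    }

    // Reads part of the data without changing the node.
    static String middleWord() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Text text = doc.createTextNode("Song: Cool Cop");
        return text.substringData(6, 4);  // offset 6, 4 characters
    }

    public static void main(String[] args) throws Exception {
        System.out.println(editedTitle());
        System.out.println(middleWord());
    }
}
```

An offset beyond the end of the data makes these methods throw a DOMException with code INDEX_SIZE_ERR.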
import org.apache.xerces.parsers.DOMParser;
import org.apache.xml.serialize.*;
import org.w3c.dom.*;
import org.xml.sax.SAXException;
import java.io.IOException;

public class ROT13XML {

  public void processNode(Node node) {
    if (node instanceof CharacterData) {
      CharacterData text = (CharacterData) node;
      String data = text.getData();
      text.setData(rot13(data));
    }
  }

  // note use of recursion
  public void followNode(Node node) {
    processNode(node);
    if (node.hasChildNodes()) {
      NodeList children = node.getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        followNode(children.item(i));
      }
    }
  }

  public static String rot13(String s) {
    StringBuffer result = new StringBuffer(s.length());
    for (int i = 0; i < s.length(); i++) {
      int c = s.charAt(i);
      if (c >= 'A' && c <= 'M') result.append((char) (c+13));
      else if (c >= 'N' && c <= 'Z') result.append((char) (c-13));
      else if (c >= 'a' && c <= 'm') result.append((char) (c+13));
      else if (c >= 'n' && c <= 'z') result.append((char) (c-13));
      else result.append((char) c);
    }
    return result.toString();
  }

  public static void main(String[] args) {

    DOMParser parser = new DOMParser();
    ROT13XML iterator = new ROT13XML();

    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]);
        Document document = parser.getDocument();
        iterator.followNode(document);

        // now we serialize the document...
        OutputFormat format = new OutputFormat(document);
        XMLSerializer serializer = new XMLSerializer(System.out, format);
        serializer.serialize(document);
      }
      catch (SAXException e) {
        System.err.println(e);
      }
      catch (IOException e) {
        System.err.println(e);
      }
    }

  } // end main

}
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE SONG SYSTEM "song.dtd">
<?xml-stylesheet type="text/css" href="song.css"?>
<SONG xmlns="http://metalab.unc.edu/xml/namespace/song" xmlns:xlink="http://www.w3.org/1999/xlink">
  <TITLE>Ubg Pbc</TITLE>
  <PHOTO ALT="Victor Willis in Cop Outfit" HEIGHT="200" WIDTH="100" xlink:href="hotcop.jpg" xlink:show="onLoad" xlink:type="simple"/>
  <COMPOSER>Wnpdhrf Zbenyv</COMPOSER>
  <COMPOSER>Uraev Orybyb</COMPOSER>
  <COMPOSER>Ivpgbe Jvyyvf</COMPOSER>
  <PRODUCER>Wnpdhrf Zbenyv</PRODUCER>
  <!-- Gur choyvfure vf npghnyyl Cbyltenz ohg V arrqrq na rknzcyr bs n trareny ragvgl ersrerapr. -->
  <PUBLISHER xlink:href="http://www.amrecords.com/" xlink:type="simple">
    N &amp; Z Erpbeqf
  </PUBLISHER>
  <LENGTH>6:20</LENGTH>
  <YEAR>1978</YEAR>
  <ARTIST>Ivyyntr Crbcyr</ARTIST>
</SONG>
<!-- Lbh pna gryy jung nyohz V jnf yvfgravat gb jura V jebgr guvf rknzcyr -->
Represents the text content of an element or attribute
Contains only pure text, no markup
Parsers will return a single maximal text node for each contiguous run of pure text
Editing may change this
package org.w3c.dom;
public interface Text extends CharacterData {
public Text splitText(int offset) throws DOMException;
}
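Editing can leave several adjacent text nodes where a parser would have produced one. A minimal sketch of splitText() followed by Node.normalize(), assuming a JAXP DocumentBuilder creates the document; the class name SplitTextSketch is hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;

// Hypothetical demo class; JAXP is assumed only as a way to get an empty Document.
class SplitTextSketch {

    // Returns the number of text children after splitting and after normalizing.
    static int[] childCounts() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element title = doc.createElement("TITLE");
        doc.appendChild(title);
        Text text = doc.createTextNode("Hot Cop");
        title.appendChild(text);

        // text keeps "Hot"; the returned node holds " Cop" and becomes
        // the next sibling of text
        Text tail = text.splitText(3);
        int afterSplit = title.getChildNodes().getLength();

        // normalize() merges adjacent text nodes back into one
        title.normalize();
        int afterNormalize = title.getChildNodes().getLength();

        return new int[] {afterSplit, afterNormalize};
    }

    public static void main(String[] args) throws Exception {
        int[] counts = childCounts();
        System.out.println(counts[0] + " text nodes, then " + counts[1]);
    }
}
```

Calling normalize() before comparing or serializing an edited tree restores the one-node-per-run shape that freshly parsed documents have.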
Represents a CDATA section like this example from a hypothetical SVG tutorial:
<p>You can use a default <code>xmlns</code> attribute to avoid
having to add the svg prefix to all your elements:</p>
<![CDATA[
<svg xmlns="http://www.w3.org/2000/svg"
width="12cm" height="10cm">
<ellipse rx="110" ry="130" />
<rect x="4cm" y="1cm" width="3cm" height="6cm" />
</svg>
]]>
No children
package org.w3c.dom;
public interface CDATASection extends Text {
}
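Whether a program sees CDATASection nodes at all depends on how the parser is configured. The sketch below assumes JAXP, whose DocumentBuilderFactory has a coalescing switch; with the default setting the section is expected to arrive as its own node between two ordinary text nodes. The class name CDataSketch is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper; JAXP is an assumption standing in for Xerces' DOMParser.
class CDataSketch {

    // Lists "nodeType:nodeValue" for each child of the root element.
    static List<String> rootChildren(String xml, boolean coalescing) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // When true, CDATA sections are folded into the surrounding text.
        factory.setCoalescing(coalescing);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        List<String> result = new ArrayList<>();
        for (Node child = doc.getDocumentElement().getFirstChild();
                child != null; child = child.getNextSibling()) {
            result.add(child.getNodeType() + ":" + child.getNodeValue());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<p>a<![CDATA[<b>]]>c</p>";
        System.out.println(rootChildren(xml, false));
        System.out.println(rootChildren(xml, true));
    }
}
```

Because CDATASection extends Text, code that tests `node instanceof Text` handles both kinds of node; code that switches on getNodeType() needs a case for Node.CDATA_SECTION_NODE as well as Node.TEXT_NODE.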
Represents a document type declaration
Has no children
package org.w3c.dom;
public interface DocumentType extends Node {
public String getName();
public NamedNodeMap getEntities();
public NamedNodeMap getNotations();
public String getPublicId();
public String getSystemId();
public String getInternalSubset();
}
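The DocumentType getters can be tried without fetching a DTD by building the declaration in memory. This is a sketch under the assumption that a JAXP DocumentBuilder supplies the DOMImplementation; the class name DoctypeSketch is hypothetical, and the identifiers are those of XHTML 1.0 Strict.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

// Hypothetical demo class; JAXP is assumed as the source of the DOMImplementation.
class DoctypeSketch {

    // Builds an empty XHTML document with a DOCTYPE and describes that DOCTYPE.
    static String describeDoctype() throws Exception {
        DOMImplementation impl = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().getDOMImplementation();

        DocumentType doctype = impl.createDocumentType(
                "html",
                "-//W3C//DTD XHTML 1.0 Strict//EN",
                "DTD/xhtml1-strict.dtd");
        Document doc = impl.createDocument(
                "http://www.w3.org/1999/xhtml", "html", doctype);

        // getDoctype() returns null when a document has no DOCTYPE declaration.
        DocumentType found = doc.getDoctype();
        return found.getName() + " | " + found.getPublicId()
                + " | " + found.getSystemId();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(describeDoctype());
    }
}
```

getName() is the name that follows `<!DOCTYPE`, which in a valid document is also the name of the root element.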
Verify that a document is correct XHTML
From the XHTML 1.0 spec:
It must validate against one of the three DTDs found in Appendix A.
The root element of the document must be <html>.
The root element of the document must designate the XHTML namespace using the xmlns attribute [XMLNAMES]. The namespace for XHTML is defined to be http://www.w3.org/1999/xhtml.
There must be a DOCTYPE declaration in the document prior to the root element. The public identifier included in the DOCTYPE declaration must reference one of the three DTDs found in Appendix A using the respective Formal Public Identifier. The system identifier may be changed to reflect local system conventions.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "DTD/xhtml1-frameset.dtd">
import org.w3c.dom.*;
import org.apache.xerces.parsers.*;
import java.io.*;
import org.xml.sax.*;

public class XHTMLValidator {

  public static void main(String[] args) {
    for (int i = 0; i < args.length; i++) {
      validate(args[i]);
    }
  }

  private static DOMParser parser = new DOMParser();

  static {
    // turn on validation
    try {
      parser.setFeature(
       "http://xml.org/sax/features/validation", true);
      parser.setErrorHandler(new ValidityErrorReporter());
    }
    catch (SAXNotRecognizedException e) {
      System.err.println(
       "Installed XML parser cannot validate; "
       + "checking for well-formedness instead...");
    }
    catch (SAXNotSupportedException e) {
      System.err.println(
       "Cannot turn on validation here; "
       + " checking for well-formedness instead...");
    }
  }

  // not thread safe
  public static void validate(String source) {

    try {
      try {
        parser.parse(source);
        // ValidityErrorReporter prints any validity errors detected
      }
      catch (SAXException e) {
        System.out.println(source + " is not well formed.");
        return;
      }

      // If we get this far, then the document is well-formed XML.
      // Check to see whether the document is actually XHTML
      Document document = parser.getDocument();
      DocumentType doctype = document.getDoctype();

      if (doctype == null) {
        System.out.println("No DOCTYPE");
        return;
      }

      String name = doctype.getName();
      String systemID = doctype.getSystemId();
      String publicID = doctype.getPublicId();

      if (!name.equals("html")) {
        System.out.println("Incorrect root element name " + name);
      }

      if (publicID == null
       || (!publicID.equals("-//W3C//DTD XHTML 1.0 Strict//EN")
       && !publicID.equals("-//W3C//DTD XHTML 1.0 Transitional//EN")
       && !publicID.equals("-//W3C//DTD XHTML 1.0 Frameset//EN"))) {
        System.out.println(source
         + " does not seem to use an XHTML 1.0 DTD");
      }

      // Check the namespace on the root element
      Element root = document.getDocumentElement();
      String xmlnsValue = root.getAttribute("xmlns");
      if (!xmlnsValue.equals("http://www.w3.org/1999/xhtml")) {
        System.out.println(source
         + " does not properly declare the"
         + " http://www.w3.org/1999/xhtml"
         + " namespace on the root element");
      }

      // get ready for the next parse
      parser.reset();

    }
    catch (IOException e) {
      System.err.println("Could not read " + source);
    }
    catch (Exception e) {
      System.err.println(e);
      e.printStackTrace();
    }

  }

}
Represents an attribute
Contains:
Text nodes
Entity reference nodes
package org.w3c.dom;
public interface Attr extends Node {
public String getName();
public boolean getSpecified();
public String getValue();
public void setValue(String value) throws DOMException;
public Element getOwnerElement();
}
import org.xml.sax.*;
import org.apache.xerces.parsers.*;
import java.io.*;
import java.util.*;
import org.w3c.dom.*;

public class DOMSpider {

  private static DOMParser parser = new DOMParser();

  // namespace support is turned off by default in Xerces
  static {
    try {
      parser.setFeature(
       "http://xml.org/sax/features/namespaces", true);
    }
    catch (Exception e) {
      System.err.println(e);
    }
  }

  private static Vector visited = new Vector();

  private static int maxDepth = 5;
  private static int currentDepth = 0;

  public static void listURIs(String systemId) {

    currentDepth++;
    try {
      if (currentDepth < maxDepth) {
        parser.parse(systemId);
        Document document = parser.getDocument();

        Vector uris = new Vector();
        // search the document for uris,
        // store them in vector, and print them
        searchForURIs(document.getDocumentElement(), uris);

        Enumeration e = uris.elements();
        while (e.hasMoreElements()) {
          String uri = (String) e.nextElement();
          visited.addElement(uri);
          listURIs(uri);
        }
      }
    }
    catch (SAXException e) {
      // couldn't load the document,
      // probably not well-formed XML, skip it
    }
    catch (IOException e) {
      // couldn't load the document,
      // likely network failure, skip it
    }
    finally {
      currentDepth--;
      System.out.flush();
    }

  }

  // use recursion
  public static void searchForURIs(Element element, Vector uris) {

    // look for XLinks in this element
    String uri = element.getAttributeNS("http://www.w3.org/1999/xlink", "href");
    if (uri != null && !uri.equals("")
     && !visited.contains(uri)
     && !uris.contains(uri)) {
      System.out.println(uri);
      uris.addElement(uri);
    }

    // process child elements recursively
    NodeList children = element.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node n = children.item(i);
      if (n instanceof Element) {
        searchForURIs((Element) n, uris);
      }
    }

  }

  public static void main(String[] args) {

    if (args.length == 0) {
      System.out.println("Usage: java DOMSpider URL1 URL2...");
      return;
    }

    // start parsing...
    for (int i = 0; i < args.length; i++) {
      try {
        listURIs(args[i]);
      }
      catch (Exception e) {
        System.err.println(e);
        e.printStackTrace();
      }
    } // end for

  } // end main

} // end DOMSpider
Represents a processing instruction like
<?robots index="yes" follow="no"?>
No children
package org.w3c.dom;
public interface ProcessingInstruction extends Node {
public String getTarget();
public String getData();
public void setData(String data) throws DOMException;
}
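The DOM treats everything after the target as one opaque string, so pseudo-attributes such as follow="no" have to be picked apart by the application. A minimal sketch, assuming JAXP for parsing and a simple regular expression for the name="value" pairs; the class name RobotsPISketch is hypothetical, and the pattern only handles double-quoted values.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

// Hypothetical helper; JAXP is an assumption standing in for Xerces' DOMParser.
class RobotsPISketch {

    // Matches name="value" pairs; single-quoted values are deliberately ignored.
    private static final Pattern PSEUDO_ATTRIBUTE =
            Pattern.compile("(\\w+)\\s*=\\s*\"([^\"]*)\"");

    // Returns the pseudo-attributes of the first robots processing instruction
    // among the children of the document node, or an empty map if there is none.
    static Map<String, String> robotsSettings(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Map<String, String> settings = new LinkedHashMap<>();
        for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof ProcessingInstruction) {
                ProcessingInstruction pi = (ProcessingInstruction) n;
                if (pi.getTarget().equals("robots")) {
                    Matcher m = PSEUDO_ATTRIBUTE.matcher(pi.getData());
                    while (m.find()) {
                        settings.put(m.group(1), m.group(2));
                    }
                    break;
                }
            }
        }
        return settings;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(robotsSettings(
                "<?robots index=\"yes\" follow=\"no\"?><doc/>"));
    }
}
```

Splitting the data into pairs is more tolerant of extra whitespace around the equals sign than searching the raw string for an exact substring.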
import org.xml.sax.*;
import org.apache.xerces.parsers.*;
import java.io.*;
import java.util.*;
import org.w3c.dom.*;

public class PoliteDOMSpider {

  private static DOMParser parser = new DOMParser();

  // namespace support is turned off by default in Xerces
  static {
    try {
      parser.setFeature("http://xml.org/sax/features/namespaces", true);
    }
    catch (Exception e) {
      System.err.println(e);
    }
  }

  private static Vector visited = new Vector();

  private static int maxDepth = 5;
  private static int currentDepth = 0;

  public static void listURIs(String systemId) {

    currentDepth++;
    try {
      if (currentDepth < maxDepth) {
        parser.parse(systemId);
        Document document = parser.getDocument();

        if (robotsAllowed(document)) {
          Vector uris = new Vector();
          // search the document for uris,
          // store them in vector, print them
          searchForURIs(document.getDocumentElement(), uris);

          Enumeration e = uris.elements();
          while (e.hasMoreElements()) {
            String uri = (String) e.nextElement();
            visited.addElement(uri);
            listURIs(uri);
          }
        }
      }
    }
    catch (SAXException e) {
      // couldn't load the document,
      // probably not well-formed XML, skip it
    }
    catch (IOException e) {
      // couldn't load the document,
      // likely network failure, skip it
    }
    finally {
      currentDepth--;
      System.out.flush();
    }

  }

  public static boolean robotsAllowed(Document document) {

    NodeList children = document.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node n = children.item(i);
      if (n instanceof ProcessingInstruction) {
        ProcessingInstruction pi = (ProcessingInstruction) n;
        if (pi.getTarget().equals("robots")) {
          String data = pi.getData();
          if (data.indexOf("follow=\"no\"") >= 0) {
            return false;
          }
        }
      }
    }

    return true;

  }

  // use recursion
  public static void searchForURIs(Element element, Vector uris) {

    // look for XLinks in this element
    String uri = element.getAttributeNS("http://www.w3.org/1999/xlink", "href");
    if (uri != null && !uri.equals("")
     && !visited.contains(uri)
     && !uris.contains(uri)) {
      System.out.println(uri);
      uris.addElement(uri);
    }

    // process child elements recursively
    NodeList children = element.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      Node n = children.item(i);
      if (n instanceof Element) {
        searchForURIs((Element) n, uris);
      }
    }

  }

  public static void main(String[] args) {

    if (args.length == 0) {
      System.out.println("Usage: java PoliteDOMSpider URL1 URL2...");
      return;
    }

    // start parsing...
    for (int i = 0; i < args.length; i++) {
      try {
        listURIs(args[i]);
      }
      catch (Exception e) {
        System.err.println(e);
        e.printStackTrace();
      }
    } // end for

  } // end main

} // end PoliteDOMSpider
Represents a comment like this example from the XML 1.0 spec:
<!--* N.B. some readers (notably JC) find the following
paragraph awkward and redundant. I agree it's logically redundant:
it *says* it is summarizing the logical implications of
matching the grammar, and that means by definition it's
logically redundant. I don't think it's rhetorically
redundant or unnecessary, though, so I'm keeping it. It
could however use some recasting when the editors are feeling
stronger. -MSM *-->
No children
package org.w3c.dom;
public interface Comment extends CharacterData {
}
import org.apache.xerces.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import java.io.*;

public class DOMCommentReader {

  public static void main(String[] args) {

    DOMParser parser = new DOMParser();

    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]);
        Document d = parser.getDocument();
        processNode(d);
      }
      catch (SAXException e) {
        System.err.println(e);
      }
      catch (IOException e) {
        System.err.println(e);
      }
    }

  } // end main

  // note use of recursion
  public static void processNode(Node node) {

    int type = node.getNodeType();
    if (type == Node.COMMENT_NODE) {
      System.out.println(node.getNodeValue());
      System.out.println();
    }
    else {
      if (node.hasChildNodes()) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
          processNode(children.item(i));
        }
      }
    }

  }

}
% java DOMCommentReader hotcop.xml
The publisher is actually Polygram but I needed
an example of a general entity reference.
You can tell what album I was
listening to when I wrote this example
Or try http://www.w3.org/TR/1998/REC-xml-19980210.xml for more interesting output
A runtime exception but you should catch it
Error code gives more detailed information:
DOMException.INDEX_SIZE_ERR
DOMException.DOMSTRING_SIZE_ERR (the requested text does not fit into a String)
DOMException.HIERARCHY_REQUEST_ERR
DOMException.WRONG_DOCUMENT_ERR
DOMException.INVALID_CHARACTER_ERR
DOMException.NO_DATA_ALLOWED_ERR
DOMException.NO_MODIFICATION_ALLOWED_ERR
DOMException.NOT_FOUND_ERR
DOMException.NOT_SUPPORTED_ERR
DOMException.INUSE_ATTRIBUTE_ERR
DOMException.INVALID_STATE_ERR
DOMException.SYNTAX_ERR
DOMException.INVALID_MODIFICATION_ERR
DOMException.NAMESPACE_ERR
DOMException.INVALID_ACCESS_ERR
Current value accessible from the public code field
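Reading the code field might look like the following sketch, which provokes two of the errors on purpose. It assumes a JAXP DocumentBuilder creates the documents and that the implementation has error checking enabled, as the JDK's does by default; the class and method names are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;

// Hypothetical demo class; JAXP is assumed only as a way to get empty Documents.
class DOMExceptionSketch {

    private static Document newDocument() throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
    }

    // A document may have only one root element, so the second append should fail.
    static short codeForSecondRoot() throws Exception {
        Document doc = newDocument();
        doc.appendChild(doc.createElement("first"));
        try {
            doc.appendChild(doc.createElement("second"));
            return 0; // no exception was thrown
        } catch (DOMException e) {
            return e.code;
        }
    }

    // A node created by one document cannot be appended directly into another.
    static short codeForForeignNode() throws Exception {
        Document host = newDocument();
        Document other = newDocument();
        host.appendChild(host.createElement("host"));
        try {
            host.getDocumentElement().appendChild(other.createElement("guest"));
            return 0; // no exception was thrown
        } catch (DOMException e) {
            return e.code;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(codeForSecondRoot() == DOMException.HIERARCHY_REQUEST_ERR);
        System.out.println(codeForForeignNode() == DOMException.WRONG_DOCUMENT_ERR);
    }
}
```

The second case is what Document.importNode() is for: it makes a copy owned by the target document, which can then be appended.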
Four interfaces:
DocumentTraversal
NodeFilter
NodeIterator
TreeWalker
package org.w3c.dom.traversal;

public interface NodeIterator {
  public int getWhatToShow();
  public NodeFilter getFilter();
  public boolean getExpandEntityReferences();
  public Node nextNode() throws DOMException;
  public Node previousNode() throws DOMException;
  public void detach();
}
import org.apache.xerces.parsers.*;
import org.apache.xerces.dom.*;
import org.w3c.dom.*;
import org.w3c.dom.traversal.*;
import org.xml.sax.*;
import java.io.*;

public class ValueReporter {

  public static void main(String[] args) {

    DOMParser parser = new DOMParser();

    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]);
        Document doc = parser.getDocument();
        DocumentImpl impl = (DocumentImpl) doc;
        NodeIterator iterator = impl.createNodeIterator(doc.getDocumentElement(),
         NodeFilter.SHOW_ALL, null, true);
        Node node;
        while ((node = iterator.nextNode()) != null) {
          processNode(node);
        }
      }
      catch (SAXException e) {
        System.err.println(e);
      }
      catch (IOException e) {
        System.err.println(e);
      }
    }

  } // end main

  public static void processNode(Node node) {
    String name = node.getNodeName();
    String type = getTypeName(node.getNodeType());
    String value = node.getNodeValue();
    System.out.println("Type " + type + ": " + name + " \"" + value + "\"");
  }

  public static String getTypeName(int type) {
    switch (type) {
      case Node.ELEMENT_NODE: return "Element";
      case Node.ATTRIBUTE_NODE: return "Attribute";
      case Node.TEXT_NODE: return "Text";
      case Node.CDATA_SECTION_NODE: return "CDATA Section";
      case Node.ENTITY_REFERENCE_NODE: return "Entity Reference";
      case Node.ENTITY_NODE: return "Entity";
      case Node.PROCESSING_INSTRUCTION_NODE: return "Processing Instruction";
      case Node.COMMENT_NODE: return "Comment";
      case Node.DOCUMENT_NODE: return "Document";
      case Node.DOCUMENT_TYPE_NODE: return "Document Type Declaration";
      case Node.DOCUMENT_FRAGMENT_NODE: return "Document Fragment";
      case Node.NOTATION_NODE: return "Notation";
      default: return "Unknown Type";
    }
  }

}
% java ValueReporter hotcop.xml
Type Element: SONG "null"
Type Text: #text " "
Type Element: TITLE "null"
Type Text: #text "Hot Cop"
Type Text: #text " "
Type Element: PHOTO "null"
Type Text: #text " "
Type Element: COMPOSER "null"
Type Text: #text "Jacques Morali"
Type Text: #text " "
Type Element: COMPOSER "null"
Type Text: #text "Henri Belolo"
Type Text: #text " "
Type Element: COMPOSER "null"
Type Text: #text "Victor Willis"
Type Text: #text " "
Type Element: PRODUCER "null"
Type Text: #text "Jacques Morali"
Type Text: #text " "
Type Comment: #comment " The publisher is actually Polygram but I needed an example of a general entity reference. "
Type Text: #text " "
Type Element: PUBLISHER "null"
Type Text: #text " A & M Records "
Type Text: #text " "
Type Element: LENGTH "null"
Type Text: #text "6:20"
Type Text: #text " "
Type Element: YEAR "null"
Type Text: #text "1978"
Type Text: #text " "
Type Element: ARTIST "null"
Type Text: #text "Village People"
Type Text: #text " "
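Attributes are missing from this output too. Attr objects are nodes, but they are not part of the tree of children that a NodeIterator walks, so even SHOW_ALL does not reach them from an element.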
package org.w3c.dom.traversal;

public interface NodeFilter {

  // Constants returned by acceptNode
  public static final short FILTER_ACCEPT = 1;
  public static final short FILTER_REJECT = 2;
  public static final short FILTER_SKIP = 3;

  // Constants for whatToShow
  // (SHOW_ALL is 0xFFFFFFFF in the final DOM Level 2 Traversal
  // Recommendation; earlier drafts used 0x0000FFFF)
  public static final int SHOW_ALL = 0xFFFFFFFF;
  public static final int SHOW_ELEMENT = 0x00000001;
  public static final int SHOW_ATTRIBUTE = 0x00000002;
  public static final int SHOW_TEXT = 0x00000004;
  public static final int SHOW_CDATA_SECTION = 0x00000008;
  public static final int SHOW_ENTITY_REFERENCE = 0x00000010;
  public static final int SHOW_ENTITY = 0x00000020;
  public static final int SHOW_PROCESSING_INSTRUCTION = 0x00000040;
  public static final int SHOW_COMMENT = 0x00000080;
  public static final int SHOW_DOCUMENT = 0x00000100;
  public static final int SHOW_DOCUMENT_TYPE = 0x00000200;
  public static final int SHOW_DOCUMENT_FRAGMENT = 0x00000400;
  public static final int SHOW_NOTATION = 0x00000800;

  public short acceptNode(Node n);

}
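The whatToShow flags select nodes by type; a NodeFilter implementation can then narrow the selection by content. The sketch below skips whitespace-only text nodes. It assumes JAXP for parsing and assumes the returned Document can be cast to the standard DocumentTraversal interface, which holds for the JDK's bundled implementation but is not guaranteed for every DOM. The class name NodeFilterSketch is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

// Hypothetical helper; JAXP and the DocumentTraversal cast are assumptions.
class NodeFilterSketch {

    // Collects the value of every text node that is not pure white space.
    static List<String> nonBlankText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        NodeFilter skipBlank = new NodeFilter() {
            public short acceptNode(Node n) {
                // Only text nodes reach the filter because of SHOW_TEXT below.
                return n.getNodeValue().trim().isEmpty()
                        ? NodeFilter.FILTER_SKIP
                        : NodeFilter.FILTER_ACCEPT;
            }
        };

        DocumentTraversal traversal = (DocumentTraversal) doc;
        NodeIterator iterator = traversal.createNodeIterator(
                doc.getDocumentElement(), NodeFilter.SHOW_TEXT, skipBlank, true);

        List<String> result = new ArrayList<>();
        Node node;
        while ((node = iterator.nextNode()) != null) {
            result.add(node.getNodeValue());
        }
        iterator.detach(); // release the iterator when finished
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(nonBlankText(
                "<SONG>\n  <TITLE>Hot Cop</TITLE>\n  <YEAR>1978</YEAR>\n</SONG>"));
    }
}
```

Casting to DocumentTraversal rather than to a parser's own document class keeps the traversal code independent of the parser.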
import org.apache.xerces.parsers.*;
import org.apache.xerces.dom.*;
import org.w3c.dom.*;
import org.w3c.dom.traversal.*;
import org.xml.sax.SAXException;
import java.io.IOException;

public class DOMTagStripper {

  public static void main(String[] args) {

    DOMParser parser = new DOMParser();

    for (int i = 0; i < args.length; i++) {
      try {
        // Read the entire document into memory
        parser.parse(args[i]);
        Document doc = parser.getDocument();
        DocumentImpl impl = (DocumentImpl) doc;
        NodeIterator iterator = impl.createNodeIterator(doc.getDocumentElement(),
         NodeFilter.SHOW_TEXT, null, true);
        Node node;
        while ((node = iterator.nextNode()) != null) {
          System.out.print(node.getNodeValue());
        }
      }
      catch (SAXException e) {
        System.err.println(e);
      }
      catch (IOException e) {
        System.err.println(e);
      }
    }

  } // end main

}
% java DOMTagStripper hotcop.xml
Hot Cop Jacques Morali Henri Belolo Victor Willis Jacques Morali A & M Records 6:20 1978 Village People
DOM is for both input and output
New documents are created with a parser-specific API
A serializer + output format converts the DOM to a byte stream
A Xerces-specific class used to create new DOM documents
package org.apache.xerces.dom;
public class DOMImplementationImpl implements DOMImplementation {
public boolean hasFeature(String feature, String version)
public static DOMImplementation getDOMImplementation()
public DocumentType createDocumentType(String qualifiedName,
String publicID, String systemID, String internalSubset)
public Document createDocument(String namespaceURI,
String qualifiedName, DocumentType doctype)
throws DOMException
}
import java.math.BigInteger;
import java.io.*;
import org.w3c.dom.*;
import org.apache.xerces.dom.*;
import org.apache.xml.serialize.*;

public class FibonacciDOM {

  public static void main(String[] args) {

    try {
      DOMImplementation impl = DOMImplementationImpl.getDOMImplementation();
      Document fibonacci = impl.createDocument(
        null,                // no namespace URI
        "Fibonacci_Numbers", // root element
        null                 // no DOCTYPE declaration
      );

      BigInteger low  = BigInteger.ZERO;
      BigInteger high = BigInteger.ONE;

      Element root = fibonacci.getDocumentElement();

      for (int i = 0; i <= 25; i++) {
        Element number = fibonacci.createElement("fibonacci");
        number.setAttribute("index", Integer.toString(i));
        Text text = fibonacci.createTextNode(low.toString());
        number.appendChild(text);
        root.appendChild(number);

        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }

      // Now the document has been created and exists in memory

    }
    catch (DOMException e) {
      e.printStackTrace();
    }

  }

}
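The same DOMImplementation object can also be reached without naming a Xerces class. The sketch below assumes a JAXP release that provides DocumentBuilder.getDOMImplementation(), which is newer than the JAXP described earlier in these slides. The class name JAXPDocumentFactory is invented for illustration.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;

public class JAXPDocumentFactory {

  // Creates an empty document whose root element is Fibonacci_Numbers
  public static Document newFibonacciDocument()
   throws ParserConfigurationException {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    DocumentBuilder builder = factory.newDocumentBuilder();
    // Parser-neutral route to the DOMImplementation
    DOMImplementation impl = builder.getDOMImplementation();
    return impl.createDocument(
      null,                // no namespace URI
      "Fibonacci_Numbers", // root element
      null                 // no DOCTYPE declaration
    );
  }

  public static void main(String[] args) throws Exception {
    Document fibonacci = newFibonacciDocument();
    System.out.println(fibonacci.getDocumentElement().getTagName());
  }

}
```

From this point on, the loop that adds the fibonacci elements needs only org.w3c.dom interfaces and is unchanged.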
Serialization is the process of taking an in-memory DOM tree and converting it to a stream of characters that can be written onto an output stream
Not a standard part of the DOM
The org.apache.xml.serialize package:
public interface DOMSerializer
public interface Serializer
public abstract class BaseMarkupSerializer
  extends Object
  implements DocumentHandler, org.xml.sax.misc.LexicalHandler, DTDHandler,
  org.xml.sax.misc.DeclHandler, DOMSerializer, Serializer
public class HTMLSerializer
  extends BaseMarkupSerializer
public final class TextSerializer
  extends BaseMarkupSerializer
public final class XHTMLSerializer
  extends HTMLSerializer
public final class XMLSerializer
  extends BaseMarkupSerializer
import java.math.BigInteger;
import java.io.*;
import org.w3c.dom.*;
import org.apache.xerces.dom.*;
import org.apache.xml.serialize.*;

public class FibonacciDOMSerializer {

  public static void main(String[] args) {

    try {
      DOMImplementation impl = DOMImplementationImpl.getDOMImplementation();
      Document fibonacci = impl.createDocument(
        null,                // no namespace URI
        "Fibonacci_Numbers", // root element
        null                 // no DOCTYPE declaration
      );

      BigInteger low  = BigInteger.ZERO;
      BigInteger high = BigInteger.ONE;

      Element root = fibonacci.getDocumentElement();

      for (int i = 0; i <= 25; i++) {
        Element number = fibonacci.createElement("fibonacci");
        number.setAttribute("index", Integer.toString(i));
        Text text = fibonacci.createTextNode(low.toString());
        number.appendChild(text);
        root.appendChild(number);

        BigInteger temp = high;
        high = high.add(low);
        low = temp;
      }

      try {
        // Now that the document is created we need to *serialize* it
        OutputFormat format = new OutputFormat(fibonacci);
        XMLSerializer serializer = new XMLSerializer(System.out, format);
        serializer.serialize(fibonacci);
      }
      catch (IOException e) {
        System.err.println(e);
      }

    }
    catch (DOMException e) {
      e.printStackTrace();
    }

  }

}
<?xml version="1.0" encoding="UTF-8"?> <Fibonacci_Numbers><fibonacci index="0">0</fibonacci><fibonacci index="1">1</fibonacci><fibonacci index="2">1</fibonacci><fibonacci index="3">2</fibonacci><fibonacci index="4">3</fibonacci><fibonacci index="5">5</fibonacci><fibonacci index="6">8</fibonacci><fibonacci index="7">13</fibonacci><fibonacci index="8">21</fibonacci><fibonacci index="9">34</fibonacci><fibonacci index="10">55</fibonacci><fibonacci index="11">89</fibonacci><fibonacci index="12">144</fibonacci><fibonacci index="13">233</fibonacci><fibonacci index="14">377</fibonacci><fibonacci index="15">610</fibonacci><fibonacci index="16">987</fibonacci><fibonacci index="17">1597</fibonacci><fibonacci index="18">2584</fibonacci><fibonacci index="19">4181</fibonacci><fibonacci index="20">6765</fibonacci><fibonacci index="21">10946</fibonacci><fibonacci index="22">17711</fibonacci><fibonacci index="23">28657</fibonacci><fibonacci index="24">46368</fibonacci><fibonacci index="25">75025</fibonacci></Fibonacci_Numbers>
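Serialization was later standardized in the DOM Level 3 Load and Save module. The sketch below shows that route as an alternative to the Xerces XMLSerializer, assuming a DOM implementation that reports the "LS" 3.0 feature and a JAXP DocumentBuilder to create the empty document. The class name LSSerializerDemo is invented for illustration.

```java
import java.math.BigInteger;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

public class LSSerializerDemo {

  // Builds the first count+1 Fibonacci numbers as a DOM document
  public static Document build(int count) throws Exception {
    Document fibonacci = DocumentBuilderFactory.newInstance()
      .newDocumentBuilder().newDocument();
    Element root = fibonacci.createElement("Fibonacci_Numbers");
    fibonacci.appendChild(root);
    BigInteger low  = BigInteger.ZERO;
    BigInteger high = BigInteger.ONE;
    for (int i = 0; i <= count; i++) {
      Element number = fibonacci.createElement("fibonacci");
      number.setAttribute("index", Integer.toString(i));
      number.appendChild(fibonacci.createTextNode(low.toString()));
      root.appendChild(number);
      BigInteger temp = high;
      high = high.add(low);
      low = temp;
    }
    return fibonacci;
  }

  // Serializes with the DOM Level 3 Load and Save API
  public static String serialize(Document doc) {
    // Assumption: the implementation supports the LS feature
    DOMImplementationLS ls = (DOMImplementationLS)
      doc.getImplementation().getFeature("LS", "3.0");
    if (ls == null) {
      throw new UnsupportedOperationException("DOM Load and Save not available");
    }
    LSSerializer serializer = ls.createLSSerializer();
    return serializer.writeToString(doc);
  }

  public static void main(String[] args) throws Exception {
    System.out.println(serialize(build(25)));
  }

}
```

Because writeToString() returns a Java String, the encoding named in the XML declaration it produces may differ from the UTF-8 shown above.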
package org.apache.xml.serialize;
public class OutputFormat extends Object {
public OutputFormat()
public OutputFormat(String method, String encoding, boolean indenting)
public OutputFormat(Document doc)
public OutputFormat(Document doc, String encoding, boolean indenting)
public String getMethod()
public void setMethod(String method)
public String getVersion()
public void setVersion(String version)
public int getIndent()
public boolean getIndenting()
public void setIndent(int indent)
public void setIndenting(boolean on)
public String getEncoding()
public void setEncoding(String encoding)
public String getMediaType()
public void setMediaType(String mediaType)
public void setDoctype(String publicID, String systemID)
public String getDoctypePublic()
public String getDoctypeSystem()
public boolean getOmitXMLDeclaration()
public void setOmitXMLDeclaration(boolean omit)
public boolean getStandalone()
public void setStandalone(boolean standalone)
public String[] getCDataElements()
public boolean isCDataElement(String tagName)
public void setCDataElements(String[] cdataElements)
public String[] getNonEscapingElements()
public boolean isNonEscapingElement(String tagName)
public void setNonEscapingElements(String[] nonEscapingElements)
public String getLineSeparator()
public void setLineSeparator(String lineSeparator)
public boolean getPreserveSpace()
public void setPreserveSpace(boolean preserve)
public int getLineWidth()
public void setLineWidth(int lineWidth)
public char getLastPrintable()
public static String whichMethod(Document doc)
public static String whichDoctypePublic(Document doc)
public static String whichDoctypeSystem(Document doc)
public static String whichMediaType(String method)
}
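Several of these settings have rough counterparts in the standard javax.xml.transform output properties. The sketch below is an assumption-laden alternative, not the Xerces API: it uses an identity Transformer in place of XMLSerializer, and the class name OutputPropertiesDemo is invented. The amount of indentation, line width, and line separator are not covered by the standard properties and depend on the transformer in use.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class OutputPropertiesDemo {

  // Serializes a document with an encoding, indentation, and a DOCTYPE
  public static String serialize(Document doc, String encoding,
   String systemID) throws Exception {
    // A Transformer created without a stylesheet copies input to output
    Transformer identity = TransformerFactory.newInstance().newTransformer();
    identity.setOutputProperty(OutputKeys.METHOD, "xml");
    identity.setOutputProperty(OutputKeys.ENCODING, encoding);        // cf. setEncoding()
    identity.setOutputProperty(OutputKeys.INDENT, "yes");             // cf. setIndenting()
    identity.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, systemID);  // cf. setDoctype()
    StringWriter out = new StringWriter();
    identity.transform(new DOMSource(doc), new StreamResult(out));
    return out.toString();
  }

  public static void main(String[] args) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
      .newDocumentBuilder().newDocument();
    Element root = doc.createElement("Fibonacci_Numbers");
    doc.appendChild(root);
    Element number = doc.createElement("fibonacci");
    number.setAttribute("index", "0");
    number.appendChild(doc.createTextNode("0"));
    root.appendChild(number);
    System.out.println(serialize(doc, "ISO-8859-1", "fibonacci.dtd"));
  }

}
```

ISO-8859-1 is the name for Latin-1 that XML parsers are most likely to recognize in an encoding declaration.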
Latin-1 encoding
Indentation
Word wrapping
Document type declaration
try {
  // Now that the document is created we need to *serialize* it
  OutputFormat format = new OutputFormat(fibonacci, "8859_1", true);
  format.setLineSeparator("\r\n");
  format.setLineWidth(72);
  format.setDoctype(null, "fibonacci.dtd");
  XMLSerializer serializer = new XMLSerializer(System.out, format);
  serializer.serialize(root);
}
catch (IOException e) {
  System.err.println(e);
}
Question: Why won't this let us add an xml-stylesheet processing instruction?
<?xml version="1.0" encoding="8859_1"?>
<!DOCTYPE Fibonacci_Numbers SYSTEM "fibonacci.dtd">
<Fibonacci_Numbers>
    <fibonacci index="0">0</fibonacci>
    <fibonacci index="1">1</fibonacci>
    <fibonacci index="2">1</fibonacci>
    <fibonacci index="3">2</fibonacci>
    <fibonacci index="4">3</fibonacci>
    <fibonacci index="5">5</fibonacci>
    <fibonacci index="6">8</fibonacci>
    <fibonacci index="7">13</fibonacci>
    <fibonacci index="8">21</fibonacci>
    <fibonacci index="9">34</fibonacci>
    <fibonacci index="10">55</fibonacci>
    <fibonacci index="11">89</fibonacci>
    <fibonacci index="12">144</fibonacci>
    <fibonacci index="13">233</fibonacci>
    <fibonacci index="14">377</fibonacci>
    <fibonacci index="15">610</fibonacci>
    <fibonacci index="16">987</fibonacci>
    <fibonacci index="17">1597</fibonacci>
    <fibonacci index="18">2584</fibonacci>
    <fibonacci index="19">4181</fibonacci>
    <fibonacci index="20">6765</fibonacci>
    <fibonacci index="21">10946</fibonacci>
    <fibonacci index="22">17711</fibonacci>
    <fibonacci index="23">28657</fibonacci>
    <fibonacci index="24">46368</fibonacci>
    <fibonacci index="25">75025</fibonacci>
</Fibonacci_Numbers>
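One plausible reading of the question: a processing instruction in the prolog is a child of the Document node, not of the root element, so serializing only the root element never visits it. The sketch below illustrates that reading under stated assumptions. It uses JAXP and an identity Transformer rather than the Xerces serializer, and the class name StylesheetPI and the stylesheet name fibonacci.css are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;

public class StylesheetPI {

  // Builds a document with an xml-stylesheet PI ahead of the root element
  public static Document build() throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
      .newDocumentBuilder().newDocument();
    Element root = doc.createElement("Fibonacci_Numbers");
    doc.appendChild(root);
    // fibonacci.css is a hypothetical stylesheet name
    ProcessingInstruction pi = doc.createProcessingInstruction(
      "xml-stylesheet", "type=\"text/css\" href=\"fibonacci.css\"");
    // The PI becomes a child of the Document, placed before the root
    doc.insertBefore(pi, root);
    return doc;
  }

  // Serializes any node, a whole Document or a single Element
  public static String serialize(Node node) throws Exception {
    Transformer identity = TransformerFactory.newInstance().newTransformer();
    StringWriter out = new StringWriter();
    identity.transform(new DOMSource(node), new StreamResult(out));
    return out.toString();
  }

  public static void main(String[] args) throws Exception {
    Document doc = build();
    // Serializing the Document includes the processing instruction
    System.out.println(serialize(doc));
    // Serializing only the root element leaves it out
    System.out.println(serialize(doc.getDocumentElement()));
  }

}
```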
Using the DOM to write documents automatically maintains well-formedness constraints
Validity is not automatically maintained.
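A small sketch of what that checking looks like in practice, assuming a JAXP-supplied DOM implementation with error checking left on. The class name WellFormednessCheck and its helper methods are invented for illustration, and how thoroughly such checks are applied varies between implementations.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;

public class WellFormednessCheck {

  public static Document newDocument() throws Exception {
    return DocumentBuilderFactory.newInstance()
      .newDocumentBuilder().newDocument();
  }

  // Returns the DOMException code raised by createElement(), or 0 if none
  public static short tryCreateElement(Document doc, String name) {
    try {
      doc.createElement(name);
      return 0;
    }
    catch (DOMException e) {
      return e.code;
    }
  }

  // Returns the DOMException code raised by adding a second root, or 0
  public static short tryAppendSecondRoot(Document doc) {
    try {
      doc.appendChild(doc.createElement("Fibonacci_Numbers"));
      doc.appendChild(doc.createElement("Another_Root"));
      return 0;
    }
    catch (DOMException e) {
      return e.code;
    }
  }

  public static void main(String[] args) throws Exception {
    // An element name may not contain a space
    System.out.println(tryCreateElement(newDocument(), "fibonacci number")
      == DOMException.INVALID_CHARACTER_ERR);
    // A document may not have two root elements
    System.out.println(tryAppendSecondRoot(newDocument())
      == DOMException.HIERARCHY_REQUEST_ERR);
  }

}
```

By contrast, nothing in these calls consults a DTD. An element that fibonacci.dtd does not declare can be appended without complaint, which is why validity has to be checked separately.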
This presentation: http://www.ibiblio.org/xml/slides/xmloneaustin2001/xmlandjava/
XML in a Nutshell, Elliotte Rusty Harold and Scott Means
O'Reilly & Associates, 2001
ISBN: 0-596-00058-8
DOM Level 2 Core Specification: http://www.w3.org/TR/DOM-Level-2-Core/
DOM Level 2 Traversal and Range Specification: http://www.w3.org/TR/DOM-Level-2-Traversal-Range/