11 Make structure explicit through markup

All structure in an XML document should be indicated through XML tags, not through other means. XML parsers are designed to process tags. They do not see any other form of structure in the data, explicit or implicit. Using anything except tags and their attributes to delineate structure makes the data much harder to read. In essence, programmers who write software to process such documents must invent mini-parsers for the non-XML structures in the data.

For example, in the bank statement XML application, a transaction could theoretically be represented like this:

<Transaction>Withdrawal 2003 12 15 200.00</Transaction>

This would require applications reading the data to know that the first field is the kind of transaction, the second field is the year, the third is the month, the fourth is the day, and the last is the amount. In most cases a client application would have to split this data along the white space to further process it. On the other hand, the client application has less work to do and can operate more smoothly if it's presented with data marked up like this:

<Transaction type="withdrawal">
  <Date>2003-12-15</Date>
  <Amount>200.00</Amount>
</Transaction>

Here each useful unit of information can be seen as the complete content of an element or an attribute.
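The difference shows up as soon as you write code against the two forms. The following is a minimal Python sketch using the standard library's xml.etree.ElementTree; the function names read_flat and read_marked_up are hypothetical, chosen only for illustration. The first function has to hard-code the field order, while the second relies on nothing but the markup.

```python
import xml.etree.ElementTree as ET

FLAT = "<Transaction>Withdrawal 2003 12 15 200.00</Transaction>"
MARKED_UP = """<Transaction type="withdrawal">
  <Date>2003-12-15</Date>
  <Amount>200.00</Amount>
</Transaction>"""


def read_flat(xml_text):
    """Mini-parser: the field order is a convention the XML parser cannot see."""
    kind, year, month, day, amount = ET.fromstring(xml_text).text.split()
    return {"type": kind.lower(), "date": f"{year}-{month}-{day}", "amount": amount}


def read_marked_up(xml_text):
    """The XML parser alone delivers each unit of information."""
    txn = ET.fromstring(xml_text)
    return {
        "type": txn.get("type"),
        "date": txn.findtext("Date"),
        "amount": txn.findtext("Amount"),
    }
```

Both functions return the same dictionary for these two inputs, but read_flat breaks as soon as a field is added, dropped, or reordered.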

Tag Each Unit of Information

The key idea is that what's between two tags should be the minimum unit of text that can usefully be processed as a whole. It should not need to be further subdivided for the common use-cases. An Amount element contains a complete amount, and nothing else. The amount is a single thing, a whole unit of information, in this case a number. It does not have internal structure that any application is likely to care about.

Occasionally the question of what constitutes a unit may depend on where and how the data is used. For example, consider the Date element in the above Transaction. It contains implicit markup based on the hyphen. It could instead be written like this:

<Date>
  <Year>2003</Year>
  <Month>12</Month>
  <Day>15</Day>
</Date>

Whether this is useful or not depends on how the dates will be used. If they're merely formatted on a page as is, or passed to an API that knows how to create Date objects from strings like 2003-12-15, then you may not need to separate out the month, day, and year as separate elements. Generally, whether to further subdivide data depends on the use to which the information is put and the operations that will be performed on it. If dates are intended purely to notate a particular moment in history, then a format like <Date>2003-12-15</Date> is appropriate. This would be useful for figuring out if it's time to drink a bottle of wine, determining whether a worker is eligible for retirement benefits, or calculating how much time remains on a car's warranty, for example. In none of these cases is the individual day of month, month, or even year very significant. Only the combination of these quantities matters. That dates are even divided into these quantities in the first place is mostly a fluke of astronomy and the planet we live on, not something intrinsic to the nature of time.

On the other hand, consider weather data. Since weather varies with the seasons and has a roughly periodic structure tied to the years and the months, it does make sense to compare weather from one February to the next, without necessarily considering the year. Other real world data tied to annual and monthly cycles includes birthdays, pay periods, and financial results. If you're modeling this sort of data you will want to be able to separate months, days, and years from each other. In this case, more structured markup such as <date><year>2003</year><month>12</month><day>15</day></date> is appropriate. The question is really whether processes manipulating this data are likely to want to treat the text as a single unit of information or a composite of more fundamental data.

Still, just because you don't need to extract the individual components of a date does not mean that no one who works with the data will need to do that. Generally, I prefer to err on the side of too much markup rather than too little. Larger chunks of data can normally be formed by manipulating the parent or ancestor elements when necessary. It is easier to remove structure when processing than to add it.
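As a rough illustration of how little work it takes to remove structure, the Python sketch below collapses the Year, Month, and Day children of a Date element back into a single value. It uses only the standard library; the helper name to_date is an assumption made for this example.

```python
import datetime
import xml.etree.ElementTree as ET

STRUCTURED = "<Date><Year>2003</Year><Month>12</Month><Day>15</Day></Date>"


def to_date(date_element):
    """Combine the Year, Month, and Day children into one date value."""
    year, month, day = (
        int(date_element.findtext(tag)) for tag in ("Year", "Month", "Day")
    )
    return datetime.date(year, month, day)


date = to_date(ET.fromstring(STRUCTURED))
print(date.isoformat())  # 2003-12-15
```

Going the other way, from the single string to the three parts, requires the consumer to know the format and to handle every string that fails to follow it.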

The classic example of what not to do is Scalable Vector Graphics (SVG). SVG uses huge amounts of non-XML based mark-up. For example, consider this polygon element:

<polygon points="350,75 379,161 469,161 397,215 423,301 
                 350,250 277,301 303,215 231,161 321,161" />

In particular, look at the value of the points attribute. That's not just a string of characters. Instead it's a sequence of x, y coordinates. An SVG processor cannot simply work with the attribute value. Instead it first has to divide the attribute value into matching pairs and decide which are x's and which are y's. The proper approach would have been to define the coordinates as child elements, like this:

<polygon>
  <point x="350" y="75"/> 
  <point x="379" y="161"/> 
  <point x="469" y="161"/> 
  <point x="397" y="215"/> 
  <point x="423" y="301"/> 
  <point x="350" y="250"/> 
  <point x="277" y="301"/> 
  <point x="303" y="215"/> 
  <point x="231" y="161"/> 
  <point x="321" y="161"/> 
</polygon>

This way the XML processor would present the coordinates to the application already nicely parsed. This also demonstrates the important point that attributes don't support structure very well (See Item 12). Structured data normally needs to be stored in element hierarchies. Only the lowest, most unstructured pieces should be put in attributes.
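To make the extra work concrete, here is a hedged Python sketch that reads the coordinates from both forms. The function names are hypothetical, and points_from_attribute is a deliberate simplification: it assumes the numbers are separated only by commas and white space, which covers the example above but not everything the SVG grammar allows.

```python
import re
import xml.etree.ElementTree as ET

ATTRIBUTE_FORM = '<polygon points="350,75 379,161 469,161" />'
ELEMENT_FORM = """<polygon>
  <point x="350" y="75"/>
  <point x="379" y="161"/>
  <point x="469" y="161"/>
</polygon>"""


def points_from_attribute(polygon):
    """Hand-written mini-parser for the points attribute value."""
    numbers = [float(n) for n in re.split(r"[\s,]+", polygon.get("points").strip())]
    if len(numbers) % 2:
        raise ValueError("odd number of coordinates")
    # Pair up alternating values: even positions are x's, odd positions are y's.
    return list(zip(numbers[0::2], numbers[1::2]))


def points_from_elements(polygon):
    """The XML parser has already separated each point and each coordinate."""
    return [(float(p.get("x")), float(p.get("y"))) for p in polygon.iter("point")]
```

In the element form, the pairing of each x with its y is carried by the markup itself, so there is no odd-count case to detect.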

The reasoning behind this bad decision was to avoid excessive file-size and verbosity. However, terseness of markup is an explicit non-goal of XML. If you really care that much about how many characters a user must type, you shouldn't be using XML in the first place. In this case, however, terseness truly has no benefits. Almost all practical SVG is either generated by a computer program or drawn in a WYSIWYG application such as Adobe Illustrator. Software can easily handle a more verbose, pure XML format. Indeed, it would be considerably easier to write such SVG-processing and generating software if all the structures were based on XML. File size is even less important. SVG documents are routinely gzipped in practice anyway, which rapidly eliminates any significant differences between the less and more verbose formats. (See Item 53, Compress if space is a problem).

SVG goes even further in the wrong direction by incorporating the non-XML CSS format. For example, a polygon can be filled, stroked and colored like this:

<polygon style="fill: red; stroke: blue; stroke-width: 10" 
         points="350,75 379,161 469,161 397,215 423,301 
                 350,250 277,301 303,215 231,161 321,161" />

Fortunately for the most important and common styles, SVG also allows an attribute based alternative. For example, this is an equivalent polygon:

<polygon fill="red" stroke="blue" stroke-width="10" 
         points="350,75 379,161 469,161 397,215 423,301 
                 350,250 277,301 303,215 231,161 321,161" />

Nonetheless, because the CSS style attribute is allowed, an SVG renderer needs both an XML parser and a CSS parser. It's easier to write a CSS parser than an XML parser, but it's still a non-trivial amount of work. Furthermore, it's much harder to detect violations of CSS. Its less draconian error handling makes it easier to produce incorrect SVG documents that may not be noticed by authors. SVG is less interoperable and reliable than it would be if it were pure XML.
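The sketch below hints at the second parser a renderer needs for the style attribute. It is a very rough stand-in written for this example, not a real CSS parser: it ignores comments, quoted strings, and escapes, and the name style_properties is hypothetical. Note how it silently skips anything it does not understand, which is the kind of lenient behavior that lets mistakes go unnoticed.

```python
import xml.etree.ElementTree as ET

SVG = (
    '<polygon style="fill: red; stroke: blue; stroke-width: 10" '
    'points="350,75 379,161 469,161" />'
)


def style_properties(element):
    """Rough stand-in for a CSS parser: split declarations on ';' and ':'."""
    properties = {}
    for declaration in element.get("style", "").split(";"):
        name, separator, value = declaration.partition(":")
        if separator:  # silently skip anything that is not "name: value"
            properties[name.strip()] = value.strip()
    return properties
```

With the attribute-based alternative, the same three values are available directly through element.get("fill"), element.get("stroke"), and element.get("stroke-width").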

XSL Formatting Objects (XSL-FO), by contrast, is an example of how to properly integrate XML formats with legacy formats such as CSS. It maintains the CSS property names, values, and meanings. However, it replaces CSS's native structure with an XML equivalent. XSL-FO doesn't have polygons, but here's a paragraph whose color is blue, whose background color is red, and whose border is ten pixels wide:

<fo:block color="blue" background-color="red" border="10px">
  The text of the paragraph goes here.
</fo:block>

This has all the advantages of familiarity with CSS but none of the disadvantages of non-XML structure. The semantics of CSS are retained while the syntax is changed to more convenient XML.

Avoid Implicit Structure

You need to be especially wary of implicit markup, often indicated by white space. For example, consider the simple case of a name:

<Name>Lenny Bruce</Name>

The name is sometimes treated as a single thing, but quite often you need to extract the first name and last name separately, most commonly to sort by last name. This seems easy enough to do: just split the string on the white space. The first name is everything before the space. The last name is everything after the space. Of course this algorithm falls apart as soon as you add middle names:

<Name>Lenny Alfred Bruce</Name>

You may decide that you don't really care about middle names, that they can just be appended to the first name. You're just going to sort by last name anyway. However, now consider what happens when the last name contains white space:

<Name>Stefania de Kennessey</Name>

The obvious algorithm assigns people the wrong last name. This can be quite offensive to the person whose name you've butchered, yet I have seen a lot of naïve software that does exactly this.

What about titles? For example, consider these names:

<Name>Mr. Lenny Bruce</Name>
<Name>Dr. Benjamin Spock</Name>
<Name>Timothy Leary, PhD</Name>
<Name>William Kunstler, Esq.</Name>
<Name>Ms. Anita Hoffman</Name>
<Name>Prof. John H. Exton, M.D., PhD</Name>

Given a large list of likely titles you can probably design an algorithm that accounts for these, but what seemed like a simple operation is rapidly complexifying in the face of real-world data.

Finally, let's recall that not all cultures put the family name last. For example, in Japan the family name normally comes first:

<Name>Kawabata Yasunari</Name>

Thus when sorting Japanese names you sort by the name that comes first rather than the one that comes last. Do you really want to try to design a system that can guess whether a string is a Japanese name or an English one? To make matters worse, often, but not always, when Japanese names are translated into English the order of the names is reversed:

<Name>Yasunari Kawabata</Name>

In fact, Japanese written in Kanji normally doesn't even use white space between the family and given name:

<Name>川端康成</Name>

The problem is a lot messier than it looks at first glance.

All of this goes away as soon as you use explicit markup to identify the different components of a name, instead of relying on software to sort it out:

<Name><Given>Lenny</Given> <Family>Bruce</Family></Name>
<Name><Given>Lenny</Given> <Middle>Alfred</Middle> <Family>Bruce</Family></Name>
<Name><Given>Stefania</Given> <Family>de Kennessey</Family></Name>
<Name><Title>Mr.</Title> <Given>Lenny</Given> <Family>Bruce</Family></Name>
<Name><Title>Dr.</Title> <Given>Benjamin</Given> <Family>Spock</Family></Name>
<Name><Given>Timothy</Given> <Family>Leary</Family>, <Title>PhD</Title></Name>
<Name>
  <Given>William</Given> <Family>Kunstler</Family>, <Title>Esq.</Title>
</Name>
<Name><Title>Ms.</Title> <Given>Anita</Given> <Family>Hoffman</Family></Name>
<Name>
  <Title>Prof.</Title> 
  <Given>John</Given> <MiddleInitial>H.</MiddleInitial>
  <Family>Exton</Family>, 
  <Title>M.D.</Title>, <Title>PhD</Title>
</Name>
<Name><Family>川端</Family><Given>康成</Given></Name>
<Name><Family>Kawabata</Family> <Given>Yasunari</Given></Name>
<Name><Given>Yasunari</Given> <Family>Kawabata</Family></Name>
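A short Python sketch contrasts the guessing approach with reading the markup. The wrapping Names element and both function names are assumptions made for this example; only the Name, Given, Middle, and Family elements come from the markup above.

```python
import xml.etree.ElementTree as ET

NAMES = """<Names>
  <Name><Given>Lenny</Given> <Middle>Alfred</Middle> <Family>Bruce</Family></Name>
  <Name><Given>Stefania</Given> <Family>de Kennessey</Family></Name>
  <Name><Family>Kawabata</Family> <Given>Yasunari</Given></Name>
</Names>"""


def naive_family_name(full_name):
    """Guess the family name by taking everything after the last space."""
    return full_name.split()[-1]


def family_names(xml_text):
    """Read the family names straight from the markup; no guessing needed."""
    return [name.findtext("Family") for name in ET.fromstring(xml_text).iter("Name")]
```

Here naive_family_name("Stefania de Kennessey") yields "Kennessey" and naive_family_name("Kawabata Yasunari") yields "Yasunari", both wrong, while family_names(NAMES) returns "Bruce", "de Kennessey", and "Kawabata" regardless of the order in which the parts were written.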

Another example of abuse of white space occurs in narrative documents that attempt to treat white space as significant, as in this poem:

<poem type="sonnet" poet="Eleanor Alexander">
  For me, my friend, no grave-side vigil keep
  With tears that memory and remorse might fill;
  Give me your tenderest laughter earth-bound still,
  And when I die you shall not want to weep.
  No epitaph for me with virtues deep
  Punctured in marble pitiless and chill:
  But when play time is over, if you will,
  The songs that soothe beloved babes to sleep.

  No lenten lilies on my breast and brow
  Be laid when I am silent; roses red,
  And golden roses bring me here instead,
  That if you love or bear me I may know;
  I may not know, nor care, when I am dead:
  Give me your songs, and flowers, and laughter now.
</poem>

Here the line breaks indicate the end of a verse, and the blank lines indicate the end of a stanza. However, this can be problematic when the content is displayed in an environment where the lines are wrapped or the white space is otherwise adjusted for typographical reasons. Furthermore, these white space based constraints can't be validated, either with respect to XML (Every stanza contains one or more verses) or poetry (the first stanza of a sonnet has eight verses; the second has six). Authors are likely to make mistakes when the white space is too significant. It's much better to make the stanza and verse division explicit like this:

<poem type="sonnet" poet="Eleanor Alexander">
  <stanza>
    <line>For me, my friend, no grave-side vigil keep</line>
    <line>With tears that memory and remorse might fill;</line>
    <line>Give me your tenderest laughter earth-bound still,</line>
    <line>And when I die you shall not want to weep.</line>
    <line>No epitaph for me with virtues deep</line>
    <line>Punctured in marble pitiless and chill:</line>
    <line>But when play time is over, if you will,</line>
    <line>The songs that soothe beloved babes to sleep.</line>
  </stanza>

  <stanza>
    <line>No lenten lilies on my breast and brow</line>
    <line>Be laid when I am silent; roses red,</line>
    <line>And golden roses bring me here instead,</line>
    <line>That if you love or bear me I may know;</line>
    <line>I may not know, nor care, when I am dead:</line>
    <line>Give me your songs, and flowers, and laughter now.</line>
  </stanza>
</poem>
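Once the divisions are explicit, the constraints mentioned above become checkable. In practice a DTD or schema would normally enforce them; the Python sketch below is only an illustration of the idea, with hypothetical function names, and it checks nothing beyond the eight-line and six-line stanza shape.

```python
import xml.etree.ElementTree as ET


def stanza_lengths(poem_xml):
    """Return the number of line elements in each stanza, in document order."""
    poem = ET.fromstring(poem_xml)
    return [len(stanza.findall("line")) for stanza in poem.findall("stanza")]


def looks_like_sonnet(poem_xml):
    """Check for one eight-line stanza followed by one six-line stanza."""
    return stanza_lengths(poem_xml) == [8, 6]
```

No comparable check is possible on the white space version without first deciding which line breaks and blank lines were intended by the author and which were introduced by an editor or a formatter.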

I think the only time you should insist on exact white space preservation is when the white space is actually a significant component of the content, as in the poetry of e.e. cummings or Python source code.

Computer source code, whether in Python or in other languages, is a special case. It has a huge amount of structure that just does not lend itself to expression in XML. Furthermore parsers for this structure exist and are as common and useful as parsers for XML. (They're generally bundled as parts of compilers.) Most importantly, there are only two normal uses for source code embedded in XML documents:

Passing the code to a compiler
Displaying the complete, unformatted code to an end user, as in a programming tutorial

In neither of these cases is the process reading the XML likely to want to subdivide the data into smaller parts and treat them individually, even though these parts demonstrably exist. Thus it makes sense to leave the structure in source code implicit.

Where to Stop?

At the absolute extreme I've even seen it suggested (facetiously) that an integer such as 6587 should be written like this:

<integer>
  <thousands>6</thousands>
  <hundreds>5</hundreds>
  <tens>8</tens>
  <ones>7</ones>
</integer>

Obviously, this is going too far. It would be far more troublesome to process than a simple, unmarked up number. After all, almost everyone who wants to use a number treats it as an atomic quantity rather than a composition of four single digits. However, this does suggest a good rule of thumb for where to stop inserting tags. Anything that will normally be treated as a single atomic value should not be further divided by mark up. However, if a value is composed of smaller parts that will need to be addressed individually, they should be marked up.

Here are a few other common edge cases, and my thoughts on why I would or wouldn't further divide them:

Numbers with units such as 7px, 8.5kg, or 108dB: Neither the unit nor the number means anything in isolation. It doesn't help much to know that a mass is denoted in kilograms without knowing how many kilograms. Similarly, there's not much point to knowing that the mass is 3.2 if you don't know whether that's 3.2 grams, 3.2 kilograms, or 3.2 metric tons. Thus I prefer to write such quantities as <mass>7.5kg</mass> and <speed>32mph</speed>.
Time: The division of time into hours, minutes, and seconds is very similar to the date case. Indeed a date is just a somewhat more coarsely grained measure of time, and times can be appended to dates to more precisely identify a moment. However, durations of time are a different story. These include quantities such as the flight time from San Jose to New York or the number of minutes that can be recorded on a video tape in SP mode. Here it is the total time that matters, not the beginning point and end point. The division of time into 24 hours per day, 60 minutes per hour, and 60 seconds per minute is a historical relic of Babylonian astronomy and their base-60 number system, not anything fundamentally related to natural quantities, a point which is proved by the fact that durations can be flattened to a total number of minutes or seconds rather than using three different units. Thus I tend to treat a duration as a single quantity and write it using a form like <FlightTime>6h32m</FlightTime> instead of a more structured form such as <FlightTime><hours>6</hours><minutes>32</minutes></FlightTime>.
Lists: Both DTDs and schemas define list data types that can describe content separated by white space. In DTDs, these include attributes declared to have type IDREFS and ENTITIES. In schemas this includes any element or attribute declared with a list type. I really don't like this. This may be the only way to store plural quantities such as a list of entities or numbers in attributes. However, when faced with potentially plural things I prefer to use child elements. Overuse of attributes leads to markup that's hard to manage.
URLs: A URL (or URI) has a lot of internal structure. For instance, the URL http://www.cafeconleche.org:80/books/xmljava/chapters/ch09s07.html#d0e15480 has a protocol, a host (which itself has a host name, a domain name, and a top-level domain), a port, a file path, and a fragment identifier. Theoretically, you could mark this up like so:

<url>
  <protocol>http</protocol>
  <host>www.cafeconleche.org</host>
  <port>80</port>
  <file>/books/xmljava/chapters/ch09s07.html</file>
  <fragment>d0e15480</fragment>
</url>

However, in practice this is almost never done, and with good reason. Almost every use of a URL, from passing it to a method in a programming API to copying it and pasting into the browser location bar to painting it on the side of a building, expects to receive an entire URL, not a piece of one. In those rare cases where you need to divide a URL into its component parts, most APIs provide adequate support. Thus it's best not to subdivide the URL beyond what everyone expects.
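As one example of such API support, Python's standard urllib.parse module can split the URL above into its parts on demand, so nothing is lost by storing it as a single string. This is a minimal sketch; other languages offer comparable facilities under different names.

```python
from urllib.parse import urlsplit

URL = "http://www.cafeconleche.org:80/books/xmljava/chapters/ch09s07.html#d0e15480"

parts = urlsplit(URL)
print(parts.scheme)    # http
print(parts.hostname)  # www.cafeconleche.org
print(parts.port)      # 80
print(parts.path)      # /books/xmljava/chapters/ch09s07.html
print(parts.fragment)  # d0e15480
```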

In general, however, if I suspect that an element might usefully be further divided, I will divide it. XML has the opposite of the Humpty-Dumpty problem: it's much easier to put the pieces back together again when content is split by tags than it is to break it apart when there aren't enough tags. Having too much markup in your data is rarely a practical problem. Having too little markup is much more cumbersome.
