Cafe con Leche News Sunday, January 6, 2008

John Cowan has released TagSoup 1.2, an open source, Java-language SAX parser for nasty, ugly HTML. Version 1.2 changes the license to Apache 2.0. In addition:

The default content model for bogons (unknown elements) is now ANY rather than EMPTY. This is a breaking change, which I have made only because there was so much demand for it. It can be undone on the command line with the --emptybogons switch, or programmatically with "parser.setFeature(Parser.emptyBogonsFeature, true)".
The processing of entity references in attribute values has finally been fixed to do what browsers do. That is, a reference is only recognized if it is properly terminated by a semicolon; otherwise it is treated as plain text. This means that URIs like "foo?cdown=32&cup=42" are no longer seen as containing an instance of the cup character (∪, U+222A), because "&cup" there is not followed by a semicolon.
Several new switches have been added:
- --doctype-system and --doctype-public force a DOCTYPE declaration to be output and allow setting the system and public identifiers.
- --standalone and --version allow control of the XML declaration that is output. (Note that TagSoup's XML output is always version 1.0, even if you use --version=1.1.)
- --norootbogons prevents unknown elements from serving as the document root element. Instead, they are made children of the default root element (the html element for HTML).
The TagSoup core now supports character entities with values above U+FFFF. As a consequence, the HTML schema now supports all 2,210 standard character entities from the 2007-12-14 draft of XML Entity Definitions for Characters, except the 94 which require more than one Unicode character to represent.
The SAX events startPrefixMapping and endPrefixMapping are now being reported for all cases of foreign elements and attributes.
All bugs around newline processing on Windows should now be gone.
A number of content models have been loosened to allow elements to appear in new and non-standard (but commonly found) places. In particular, tables are now allowed inside paragraphs, against the letter of the W3C specification.
Since the span element is intended for fine control of appearance using CSS, it should never have been a restartable element. This very long-standing bug has now been fixed.
The following non-standard elements are now at least partly supported: bgsound, blink, canvas, comment, listing, marquee, nobr, rbc, rb, rp, rtc, rt, ruby, wbr, xmp.
In HTML output mode, boolean attributes like checked are now output as such, rather than in XML style as checked="checked".
Runs of < characters such as << and <<< are now handled correctly in text rather than being transformed into extremely bogus start-tags.
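
The attribute-value rule described above (a reference counts only when a semicolon terminates it) and the support for entities above U+FFFF can be illustrated with a small sketch. This is not TagSoup's actual code: the class name AttributeEntitySketch, the expand method and the three-entry entity table are assumptions made for illustration, and numeric character references are deliberately left out.

```java
import java.util.Map;

/** Illustrative sketch only; not TagSoup's implementation. */
final class AttributeEntitySketch {

    // Hypothetical three-entry table; a real HTML schema holds thousands of names.
    private static final Map<String, Integer> ENTITIES = Map.of(
            "amp", 0x26,      // ampersand
            "cup", 0x222A,    // the "cup" (union) character
            "Afr", 0x1D504);  // above U+FFFF, so Java stores it as a surrogate pair

    /**
     * Expands named entity references in an attribute value.
     * A reference is recognized only if it has the form &name; with a
     * known name; anything else is copied through as plain text.
     */
    static String expand(String value) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < value.length()) {
            char c = value.charAt(i);
            if (c == '&') {
                int semi = value.indexOf(';', i + 1);
                if (semi > i + 1) {
                    Integer codePoint = ENTITIES.get(value.substring(i + 1, semi));
                    if (codePoint != null) {
                        // appendCodePoint emits two chars for values above U+FFFF.
                        out.appendCodePoint(codePoint);
                        i = semi + 1;
                        continue;
                    }
                }
            }
            // Not a terminated, known reference: keep the character as text.
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The URI from the release notes: "&cup" has no semicolon, so nothing changes.
        System.out.println(expand("foo?cdown=32&cup=42"));
        // A properly terminated reference is replaced by its character.
        System.out.println(expand("A &cup; B"));
    }
}
```

The sketch looks up the whole text between the ampersand and the next semicolon, so a stray semicolon later in the value (as in "a&cup=42;b") does not cause "cup" to be recognized either.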

XML News from Sunday, January 6, 2008