John Cowan has released TagSoup 1.2, an open source, Java-language SAX parser for nasty, ugly HTML. Version 1.2 changes the license to Apache 2.0. In addition:
- The default content model for bogons (unknown elements) is now ANY rather than EMPTY. This is a breaking change, made only because there was so much demand for it. It can be undone on the command line with the --emptybogons switch, or programmatically with "parser.setFeature(Parser.emptyBogonsFeature, true)".
- The processing of entity references in attribute values has finally been fixed to do what browsers do. That is, a reference is only recognized if it is properly terminated by a semicolon; otherwise it is treated as plain text. This means that URIs like "foo?cdown=32&cup=42" are no longer misread as containing the cup character (∪), which the unterminated "&cup" used to produce.
- Several new switches have been added:
  - --doctype-system and --doctype-public force a DOCTYPE declaration to be output and allow setting the system and public identifiers.
  - --standalone and --version allow control of the XML declaration that is output. (Note that TagSoup's XML output is always version 1.0, even if you use --version=1.1.)
  - --norootbogons causes unknown elements not to be allowed as the document root element. Instead, they are made children of the default root element (the html element for HTML).
- The TagSoup core now supports character entities with values above U+FFFF. As a consequence, the HTML schema now supports all 2,210 standard character entities from the 2007-12-14 draft of XML Entity Definitions for Characters, except the 94 which require more than one Unicode character to represent.
- The SAX events startPrefixMapping and endPrefixMapping are now reported for all cases of foreign elements and attributes.
- All bugs around newline processing on Windows should now be gone.
- A number of content models have been loosened to allow elements to appear in new and non-standard (but commonly found) places. In particular, tables are now allowed inside paragraphs, against the letter of the W3C specification.
- Since the span element is intended for fine control of appearance using CSS, it should never have been a restartable element. This very long-standing bug has now been fixed.
- The following non-standard elements are now at least partly supported: bgsound, blink, canvas, comment, listing, marquee, nobr, rbc, rb, rp, rtc, rt, ruby, wbr, xmp.
- In HTML output mode, boolean attributes like checked are now output as such, rather than in XML style as checked="checked".
- Runs of < characters such as << and <<< are now handled correctly in text rather than being transformed into extremely bogus start-tags.
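
To make the new attribute-value rule concrete, here is a minimal sketch of "recognize a reference only when it ends in a semicolon". It is not TagSoup's code: the class name AttributeEntities, the method decodeAttributeValue, and the five-entry entity table are all hypothetical and exist only for illustration.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the rule; not part of TagSoup.
class AttributeEntities {
    // Tiny sample table. A real HTML schema defines far more entities.
    private static final Map<String, String> ENTITIES = Map.of(
        "amp", "&",
        "lt", "<",
        "gt", ">",
        "quot", "\"",
        "cup", "\u222A"   // U+222A UNION
    );

    // The trailing semicolon is mandatory, so "&cup=42" never matches.
    private static final Pattern REFERENCE =
        Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    static String decodeAttributeValue(String raw) {
        Matcher m = REFERENCE.matcher(raw);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = ENTITIES.get(m.group(1));
            // Names missing from the table are copied through unchanged.
            String replacement = (value != null) ? value : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Unterminated "&cup" stays plain text.
        System.out.println(decodeAttributeValue("foo?cdown=32&cup=42"));
        // Properly terminated "&amp;" is decoded to "&".
        System.out.println(decodeAttributeValue("foo?cdown=32&amp;cup=42"));
    }
}
```

Both calls in main print foo?cdown=32&cup=42: the first because nothing is recognized, the second because only the terminated &amp; is decoded.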
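
Entities with values above U+FFFF are a special case in Java because a char holds only 16 bits, so such a character occupies two chars (a surrogate pair) in any String or SAX character buffer. The sketch below shows that effect using only the standard library; the class name SupplementaryEntities and the helper expand are hypothetical, and nothing here reflects how TagSoup stores its entity table.

```java
// Hypothetical illustration; uses only java.lang.
class SupplementaryEntities {
    // Converts a Unicode code point to Java's UTF-16 representation.
    static String expand(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String cup = expand(0x222A);   // value of &cup;, inside the 16-bit range
        String afr = expand(0x1D504);  // value of &Afr;, above U+FFFF

        System.out.println(cup.length());                        // 1
        System.out.println(afr.length());                        // 2
        System.out.println(afr.codePointCount(0, afr.length())); // 1
    }
}
```

Code that consumes the parser's output should therefore count code points rather than chars if it needs the number of characters in a run of text.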