XML News from Monday, September 27, 2004

Note to Dare (posted here because your comments are broken): The basic rule I had for creating an RSS feed for this site was that it couldn't require me to do more work than I was doing already, at least not on an ongoing basis. That's why I use XSLT driven by a cron job to generate the feed. I can edit the same way I always have, and the RSS happens automatically. The tools adjust to fit me rather than me adjusting to fit the tools. Adding individual URLs for each story would require bending myself around the tools, and I have this silly idea that computers were meant to serve people rather than the other way around. Actually the solution for the problem you note is to use XPointers to identify the individual news items. It would be easy enough to generate them automatically using XSLT. I haven't actually tried that. Maybe it would work, but I sort of expect it might run into some problems with browser compatibility.
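The XPointer idea can be sketched outside XSLT as well. The following is a minimal illustration in Python rather than the XSLT the site actually uses, and it assumes (hypothetically) that each news item is an element carrying `class="item"`; the real page markup may differ. It builds one `element()` scheme pointer per item from the item's child-sequence position in the document.

```python
import xml.etree.ElementTree as ET


def item_xpointers(page_xml, page_url, item_class="item"):
    """Return one element() XPointer URL per news item on the page.

    item_class is a hypothetical marker for news items; it is an
    assumption of this sketch, not something taken from the real site.
    """
    root = ET.fromstring(page_xml)
    pointers = []

    def walk(element, child_sequence):
        if element.get("class") == item_class:
            steps = "/".join(str(step) for step in child_sequence)
            pointers.append(f"{page_url}#element(/{steps})")
        # The element() scheme counts child elements starting from 1.
        for position, child in enumerate(element, start=1):
            walk(child, child_sequence + [position])

    walk(root, [1])  # the document element is always step /1
    return pointers
```

Pointers built this way depend on element positions, so they change whenever an item is inserted above an existing one. Whether a given browser resolves them at all is a separate question.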

The initial stumbling block that kept me from adding an RSS feed to this site was that my news items don't have titles, but then it occurred to me that I could use the first sentence of each item as the title. Of course this broke some RSS software written by developers who hadn't actually paid much attention to the specs, because sometimes my sentences are on the long side and tend to drone on and on and on, but you get the idea and anyway this should be handled by clients because of course developers don't write arbitrary limitations on string size into their code because sooner or later those assumptions are going to be violated, as they were for some web browsers' layout algorithms a couple of days ago when I posted a pre fragment that couldn't fit within a browser window, although in that case it was really important for semantic reasons to reproduce the exact line breaks, and anyway how's that for a run-on sentence? Once in high school I wrote an entire 500-word theme as a single run-on sentence.
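The first-sentence-as-title trick might look roughly like this. It is a sketch in Python, not the XSLT that actually generates the feed, and the function name `title_from_item` is made up for illustration. The sentence boundary rule is deliberately naive: it splits at the first `.`, `!`, or `?` followed by whitespace, so an initial or abbreviation early in the item would cut the title short.

```python
import re

# A sentence ends at ., ! or ? followed by whitespace or the end of the text.
_SENTENCE_END = re.compile(r"(?<=[.!?])(?:\s+|$)")


def title_from_item(text):
    """Return the first sentence of a news item, with whitespace collapsed."""
    normalized = " ".join(text.split())
    return _SENTENCE_END.split(normalized, maxsplit=1)[0]
```

Note that nothing here truncates the result. A long first sentence yields a long title, which is exactly the case a client has to cope with.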

Anyway, back to the point. I'm not going to rearrange my site to fit the needs of broken news readers. RSS got a lot of things wrong, and one of those things may be the lack of any unique identifier for articles separate from titles and URLs. However, that doesn't mean a client is justified in assuming that other things in the feed are in fact unique identifiers. If an RSS 0.92 client really needs to figure out whether two items are the same or different, it needs to use a combination of heuristics rather than relying on some assumed uniqueness that isn't actually present. Most simply, it could retrieve both items and see if the old one is still there or not. It could also compare the descriptions, URLs, and titles and do a fuzzy match, without assuming that a change in a single byte reflects a completely new item. And if it can't do that, then it needs to be designed to operate correctly without any information about which items are new and which aren't. None of this is rocket science. It simply requires implementing the spec as it is, rather than as we might wish it to be.
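One way a client might do that fuzzy match is sketched below. This is only an illustration under stated assumptions: items are plain dictionaries with optional `title`, `link`, and `description` keys, the similarity measure is Python's standard `difflib.SequenceMatcher`, and the 0.8 threshold is an arbitrary starting point, not a recommended value.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a score between 0.0 (nothing shared) and 1.0 (identical)."""
    return SequenceMatcher(None, a, b).ratio()


def same_item(old, new, threshold=0.8):
    """Guess whether two feed items are the same story.

    Averages the similarity of whichever fields both items have,
    so a small edit to one field does not make the item look new.
    """
    scores = []
    for field in ("title", "link", "description"):
        old_value, new_value = old.get(field), new.get(field)
        if old_value and new_value:
            scores.append(similarity(old_value, new_value))
    if not scores:
        return False  # nothing to compare, so make no claim of sameness
    return sum(scores) / len(scores) >= threshold
```

On a site where every item shares one page URL, the link field says little about identity, so a real client would probably want to weight the fields rather than average them equally.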


Ian E. Gorman has released GXParse 1.5, a free (LGPL) Java library that sits on top of a SAX parser and provides semi-random access to the XML document. The documentation isn't very clear, but as near as I can tell, it buffers various constructs like elements until their end is seen, rather than dumping pieces on you immediately like SAX does. This release completes namespace support, eases exception handling, and adds a few operators to CurrentElement.
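The general idea of buffering on top of SAX can be sketched as follows. This is not GXParse's API, which is Java and which I have not reproduced here. It is a small Python illustration using the standard `xml.sax` module, and the names `BufferingHandler` and `complete_elements` are invented for the example. Each element is held on a stack until its end tag arrives, and only then handed to the caller as a whole.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class BufferingHandler(ContentHandler):
    """Collect each element's attributes and text; deliver it at its end tag."""

    def __init__(self, on_element):
        super().__init__()
        self._on_element = on_element
        self._open = []  # elements whose end tag has not been seen yet

    def startElement(self, name, attrs):
        self._open.append({"name": name, "attrs": dict(attrs.items()), "text": []})

    def characters(self, content):
        # SAX may deliver text in several chunks; keep them all.
        if self._open:
            self._open[-1]["text"].append(content)

    def endElement(self, name):
        element = self._open.pop()
        element["text"] = "".join(element["text"])
        self._on_element(element)


def complete_elements(xml_text):
    """Return every element as a finished record, in end-tag order."""
    found = []
    xml.sax.parseString(xml_text.encode("utf-8"), BufferingHandler(found.append))
    return found
```

Because delivery happens at the end tag, children are reported before their parents, and each record's text covers only the character data directly inside that element.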