I remember the early days of the web -- and the last days of CD ROM -- when there was this mainstream consensus that the web and PCs were too durned geeky and difficult and unpredictable for "my mom" (it's amazing how many tech people have an incredibly low opinion of their mothers). If I had a share of AOL for every time someone told me that the web would die because AOL was so easy and the web was full of garbage, I'd have a lot of AOL shares.
And they wouldn't be worth much.
--Cory Doctorow
Read the rest in Why I won't buy an iPad (and think you shouldn't, either)
I've released XOM 1.2.5, my free-as-in-speech (LGPL) dual streaming/tree-based API for processing XML with Java. 1.2.5 is a very minor release. The only visible change is that Builder.build((Reader) null) now throws a NullPointerException instead of a confusing MalformedURLException. I've also added support for Maven 2, and hope to get the packages uploaded to the central repository in a week or two.
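Once the upload happens, Maven users should be able to pull it in with something like the following; the xom/xom coordinates are my assumption, so check what actually lands in the repository:

<dependency>
  <groupId>xom</groupId>
  <artifactId>xom</artifactId>
  <version>1.2.5</version>
</dependency>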
In other news, I have had very little time to work on this site lately. In order to have any time to work on other projects including XOM and Jaxen, I've had to let this site slide. I expect to have more news about that soon.
Also, speaking of Jaxen: I noticed that the website has been a little out of date for a while now because I neglected to update the releases page when 1.1.2 was released in 2008. Consequently, a lot of folks have been missing out on the latest bug fixes and optimizations. If you're still using Jaxen 1.1.1 or earlier, please upgrade when you get a minute. Note too that the official site is http://jaxen.codehaus.org/; jaxen.org belongs to a domain name spammer. I'm not sure who let that one slide, but we'll have to see about grabbing it back one of these days.
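For Maven users the upgrade should just be a version bump; I believe the coordinates on the central repository are jaxen/jaxen, but verify before copying:

<dependency>
  <groupId>jaxen</groupId>
  <artifactId>jaxen</artifactId>
  <version>1.1.2</version>
</dependency>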
Yesterday I figured out how to process form input. Today I figured out how to parse strings into nodes in eXist. This is very eXist specific, but briefly:
let $doc := "<html xmlns='http://www.w3.org/1999/xhtml'> <div> foo </div> </html>"
let $list := util:catch('*', (util:parse($doc)), ($util:exception-message))
return $list
I'll need this for posts and comments. There's also a parse-html function, but it's based on the flaky NekoHTML instead of the more reliable TagSoup.
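For comparison, the alternative would look something like this; I haven't tested it, and the exact signature is my assumption from the function name:

(: parse-html recovers a tree even from malformed markup :)
let $broken := "<p>an <b>unclosed paragraph"
return util:parse-html($broken)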
I'm slowly continuing to work on the new backend. I've finally gotten indexing to work. It turns out that eXist's namespace handling for index configuration files is broken in 1.4.0, but that should be fixed in the next release. I've also managed to get the source built and most of the tests to run so I can contribute patches back. Next up I'm looking into the support for the Atom Publishing Protocol.
I spent a morning debugging a problem that I have now boiled down to this test case. The following query prints 3097:
<html> { let $num := count(collection("/db/quotes")/quote) return $num } </html>

and this query prints 0:
<html xmlns="http://www.w3.org/1999/xhtml"> { let $num := count(collection("/db/quotes")/quote) return $num } </html>
The only difference is the default namespace declaration. In the documents being queried the quote elements are indeed in no namespace. Much to my surprise, XQuery has broken the semantics of XPath 1.0 by applying default namespaces to unqualified names in path expressions. Who thought it would be a good idea to break with XSLT, every single XPath implementation on the planet, and years of experience and documentation?

There's an argument to be made for default namespaces applying in path expressions, but the time to make that argument was 1998. Once the choice was made, the cost of switching was far higher than any incremental improvement. Stare decisis isn't just for the Supreme Court.
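One workaround for cases like the test above is to bind the value before entering the constructor, so the xmlns declaration never rescopes the path expression. A minimal sketch:

(: Inside the constructor, /quote would resolve to the XHTML namespace;
   outside it, unprefixed names stay in no namespace. :)
let $num := count(collection("/db/quotes")/quote)
return <html xmlns="http://www.w3.org/1999/xhtml">{ $num }</html>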
This XQuery has been executing for about an hour now. An O(N^2) algorithm, perhaps? Maybe I should learn about indexes? Or is eXist just hung?
declare namespace xmldb="http://exist-db.org/xquery/xmldb";
declare namespace html="http://www.w3.org/1999/xhtml";
declare namespace xs="http://www.w3.org/2001/XMLSchema";
declare namespace atom="http://www.w3.org/2005/Atom";
for $date in distinct-values(
for $updated in collection("/db/news")/atom:entry/atom:updated
order by $updated descending
return xs:date(xs:dateTime($updated)))
let $entries := collection("/db/news")/atom:entry[xs:date(xs:dateTime(atom:updated)) = $date]
return <div>
  {
    for $entry in $entries
    return $entry/atom:title
  }
  <hr />
</div>
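If indexes turn out to be the answer, my understanding (unverified, so treat this as a sketch) is that eXist 1.4 reads a collection.xconf stored under /db/system/config mirroring the collection path, here /db/system/config/db/news/collection.xconf:

<collection xmlns="http://exist-db.org/collection-config/1.0">
  <index xmlns:atom="http://www.w3.org/2005/Atom">
    <!-- Range index on the entry timestamps, so sorting and comparing
         atom:updated need not scan every entry. The xs:date() casts
         in the query above might still defeat it. -->
    <create qname="atom:updated" type="xs:dateTime"/>
  </index>
</collection>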
I've got a lot of the old data loaded into eXist (news and quotes; readings and other pages I still have to think about). I'm now focusing on how to get it back out again and put it in web pages. Once that's done, the remaining piece is setting up some system for putting new data in. It will probably be a fairly simple HTML form, but some sort of markdown support might be nice. Perhaps I can hack something together that will insert paragraphs if there are no existing paragraphs, and otherwise leave the markup alone. I'm also divided on the subject of whether to store the raw text, the XHTML converted text, or both. This will be even more critical when I add comment support.
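The markup-insertion hack might be as small as this sketch; local:autoparagraph is a name I just made up, and the blank-line-separates-paragraphs convention is an assumption:

declare function local:autoparagraph($content as element()) as node()* {
  if ($content/*)
  then (: Markup already present; leave it alone. :)
       $content/node()
  else (: Plain text: split on blank lines and wrap each chunk in a p. :)
       for $chunk in tokenize(string($content), "\n\s*\n")
       where normalize-space($chunk) != ""
       return <p>{normalize-space($chunk)}</p>
};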
I've more or less completed the script that converts the old news into Atom entry documents:
xquery version "1.0";
declare namespace xmldb="http://exist-db.org/xquery/xmldb";
declare namespace html="http://www.w3.org/1999/xhtml";
declare namespace xs="http://www.w3.org/2001/XMLSchema";
declare namespace atom="http://www.w3.org/2005/Atom";
declare namespace text="http://exist-db.org/xquery/text";
declare function local:leading-zero($n as xs:decimal) as xs:string {
let $result := if ($n >= 10)
then string($n)
else concat("0", string($n))
return $result
};
declare function local:parse-date($date as xs:string) as xs:string {
let $day := normalize-space(substring-before($date, ","))
let $string-date := normalize-space(substring-after($date, ","))
let $y1 := normalize-space(substring-after($string-date, ","))
(: strip permalink :)
let $year := if (contains($y1, "("))
then normalize-space(substring-before($y1, "("))
else $y1
let $month-day := normalize-space(substring-before($string-date, ","))
let $months := ("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")
let $month := substring-before($month-day, " ")
let $day-of-month := local:leading-zero(xs:decimal(substring-after($month-day, " ")))
let $monthnum := local:leading-zero(index-of($months,$month))
(: I don't necessarily know the time so I'll pick something vaguely plausible. :)
return concat($year, "-", $monthnum, "-", $day-of-month, "T07:00:31-05:00")
};
declare function local:first-sentence($text as xs:string) as xs:string {
let $r0 := normalize-space($text)
let $r1 := substring-before($r0, '. ')
let $penultimate := substring($r1, string-length($r1)-1, 1)
let $sentence := if ($penultimate != " " or not(contains($r1, ' ')))
then concat($r1, ".")
else concat($r1, ". ", local:first-sentence($r1))
return $sentence
};
declare function local:make-id($date as xs:string, $position as xs:integer) as xs:string {
let $day := normalize-space(substring-before($date, ","))
let $string-date := normalize-space(substring-after($date, ","))
let $y1 := normalize-space(substring-after($string-date, ","))
(: strip permalink :)
let $year := if (contains($y1, "("))
then normalize-space(substring-before($y1, "("))
else $y1
let $month-day := normalize-space(substring-before($string-date, ","))
let $months := ("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")
let $month := substring-before($month-day, " ")
let $day-of-month := local:leading-zero(xs:decimal(substring-after($month-day, " ")))
let $monthnum := local:leading-zero(index-of($months,$month))
return concat($month, "_", $day-of-month, "_", $year, "_", $position)
};
declare function local:permalink-date($date as xs:string) as xs:string {
let $day := normalize-space(substring-before($date, ","))
let $string-date := normalize-space(substring-after($date, ","))
let $y1 := normalize-space(substring-after($string-date, ","))
(: strip permalink :)
let $year := if (contains($y1, "("))
then normalize-space(substring-before($y1, "("))
else $y1
let $month-day := normalize-space(substring-before($string-date, ","))
let $month := substring-before($month-day, " ")
let $day-of-month := xs:decimal(substring-after($month-day, " "))
return concat($year, $month, $day-of-month)
};
for $newsyear in (1998 to 2009)
return
for $dt in doc(concat("file:///Users/elharo/cafe%20con%20Leche/news", $newsyear ,".html"))/html:html/html:body/html:dl/html:dt
let $dd := $dt/following-sibling::html:dd[1]
let $date := string($dt)
let $itemstoday := count($dd/html:div)
return
for $item at $count in $dd/html:div
let $sequence := $itemstoday - $count + 1
let $id := if ($item/@id)
then string($item/@id)
else local:make-id($date, $sequence)
let $published := if ($item/@class)
then string($item/@class)
else local:parse-date($date)
let $link := concat("http://www.cafeconleche.org/#", $id)
let $permalink := if ($item/@id)
then concat("http://www.cafeconleche.org/oldnews/news", local:permalink-date($date), ".html#", $item/@id)
else concat("http://www.cafeconleche.org/oldnews/news", local:permalink-date($date), ".html")
return
<atom:entry xml:id="{$id}">
<atom:author>
<atom:name>Elliotte Rusty Harold</atom:name>
<atom:uri>http://www.elharo.com/</atom:uri>
</atom:author>
<atom:id>{$link}</atom:id>
<atom:title>{local:first-sentence(string($item))}</atom:title>
<atom:updated>{$published}</atom:updated>
<atom:content type="xhtml" xml:lang="en"
xml:base="http://www.cafeconleche.org/"
xmlns="http://www.w3.org/1999/xhtml">{$item/node()}</atom:content>
<link rel="alternate" href="{$link}"/>
<link rel="permalink" href="{$permalink}"/>
</atom:entry>
I should probably figure out how to remove some of the duplicate date parsing code, but it's basically a one-off migration script so I may not bother.
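If I ever do bother, the shared boilerplate could collapse into one helper that all three functions call. A rough sketch (local:date-parts is a made-up name):

declare function local:date-parts($date as xs:string) as xs:string* {
  (: Returns (year, month name, day of month) from strings like
     "Wednesday, January 20, 2010 (permalink)". :)
  let $string-date := normalize-space(substring-after($date, ","))
  let $y1 := normalize-space(substring-after($string-date, ","))
  let $year := if (contains($y1, "("))
               then normalize-space(substring-before($y1, "("))
               else $y1
  let $month-day := normalize-space(substring-before($string-date, ","))
  return ($year,
          substring-before($month-day, " "),
          normalize-space(substring-after($month-day, " ")))
};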
I think I have enough in place now that I can start setting up the templates for the main index.html page and the quote and news archives. Then I can start exploring the authoring half of the equation.
I'm beginning to seriously hate the runtime error handling (or lack thereof) in XQuery. It's just too damn hard to figure out what's going wrong, and where, compared to Java. You can't see where the bad data is coming from, and there's no try-catch facility to help you out. Now that I think about it, I had very similar problems with Haskell last year. I wonder if this is a common issue with functional languages?
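eXist's util:catch extension (the one I used above for parsing) is the closest thing to a safety net I've found. Wrapping a suspect expression at least tells you which one blew up; a sketch:

(: Catch any exception and report it instead of dying silently. :)
let $suspect := "not a date"
return util:catch('*',
    string(xs:date($suspect)),
    concat("cast failed: ", $util:exception-message))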
I've just about finished importing all the old quotes into eXist. (There was quite a bit of cleanup work going back 12 years. The format changed slowly over time.) Next up is the news.
I am wondering if maybe this is backwards. Perhaps first I should build the forms and backend for posting new content, and then import the old data? After all, it's the new content people are interested in. There's not that much call for breaking XML news from 1998. :-)
Parsing a date in the form "Wednesday, January 20, 2010" in XQuery:
xquery version "1.0"; declare function local:leading-zero($n as xs:decimal) as xs:string { let $result := if ($n >= 10) then string($n) else concat("0", string($n)) return $result }; declare function local:parse-date($date as xs:string) as element() { let $day := normalize-space(substring-before($date, ",")) let $string-date := normalize-space(substring-after($date, ",")) let $year := normalize-space(substring-after($string-date, ",")) let $month-day := normalize-space(substring-before($string-date, ",")) let $months := ("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December") let $month := substring-before($month-day, " ") let $day-of-month := number(substring-after($month-day, " ")) return <postdate> <day>{$day}</day> <date>{$year}-{local:leading-zero(index-of($months,$month))}-{local:leading-zero($day-of-month)}</date> </postdate> }; local:parse-date("Monday, April 27, 2009")
Today I went from merely splitting the quotes files apart into individual quotes to actually storing them back into the database:
xquery version "1.0"; declare namespace xmldb="http://exist-db.org/xquery/xmldb"; declare namespace html="http://www.w3.org/1999/xhtml"; for $dt in doc("/db/quoteshtml/quotes2009.html")/html:html/html:body/html:dl/html:dt let $id := string($dt/@id) let $date := string($dt) let $dd := $dt/following-sibling::html:dd[1] let $quote := $dd/html:blockquote let $cite := string($quote/@cite) let $source := $quote/following-sibling::* let $sourcetext := normalize-space(substring-after($source, "--")) let $author := if (contains($sourcetext, "Read the")) then substring-before($sourcetext, "Read") else substring-before($sourcetext, "on the") let $location := if ($source/html:a) then $source/html:a else substring-after($sourcetext, "on the") let $quotedate := if (contains($sourcetext, "list,")) then normalize-space(substring-after($sourcetext, "list,")) else "" let $justlocation := if (contains($location, "list,")) then normalize-space(substring-after(substring-before($sourcetext, ","), "on the")) else $location let $singlequote := <quote> <id>{$id}</id> <postdate>{$date}</postdate> <content>{$quote}</content> <cite>{$cite}</cite> <author>{$author}</author> <location>{$justlocation}</location> { if ($quotedate) then <quotedate>{$quotedate}</quotedate> else "" } </quote> let $name := concat("quote_", $id) let $store-return := xmldb:store("quotes", $name, $singlequote) return <store-result> <store>{$store-return}</store> <documentname>{$name}</documentname> </store-result>
I suspect the next thing I should do is work on improving the dates somewhat, since I'll likely want to sort and query by them. Right now they're human readable but not so easy to process. E.g.
<postdate>Monday, April 27, 2009</postdate>
I should try to turn this into
<postdate>
<day>Monday</day>
<date>2009-04-27</date>
</postdate>
Time to read up on the XQuery date and time functions. Hmm, looks like it's going to be regular expressions after all.
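A regex-flavored first cut, assuming every date string really does look like "Weekday, Month DD, YYYY":

let $months := ("January", "February", "March", "April", "May", "June",
                "July", "August", "September", "October", "November", "December")
(: tokenize splits "Monday, April 27, 2009" into ("Monday", "April", "27", "2009") :)
let $parts := tokenize("Monday, April 27, 2009", ",?\s+")
let $month := index-of($months, $parts[2])
return <postdate>
  <day>{$parts[1]}</day>
  <date>{concat($parts[4], "-",
                (if ($month < 10) then "0" else ""), $month, "-",
                (if (number($parts[3]) < 10) then "0" else ""), $parts[3])}</date>
</postdate>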
I've converted all the old quotes archives to well-formed (though not necessarily valid) XHTML and uploaded them into eXist. Now I have to come up with an XQuery that breaks them up into individual quotes. This is proving trickier than expected (and I expected it to be pretty tricky), especially since a lot of the old quotes aren't in perfectly consistent formats.
Maybe it's time to try out Oxygen's XQuery debugger since they sent me a freebie? If only the interface weren't such a horror show. They say they have a debugger but I can't find it, and the buttons they're using in the screencast don't seem to be present in the latest version. In the meantime, can anyone see the syntax error in this code?
xquery version "1.0"; declare namespace xmldb="http://exist-db.org/xquery/xmldb"; declare namespace html="http://www.w3.org/1999/xhtml"; for $dt in doc("/db/quoteshtml/quotes2010.html")/html:html/html:body/html:dl/html:dt let $id := string($dt/@id) let $date := string($dt) let $dd := $dt/following-sibling::html:dd let $quote := $dd/html:blockquote let $cite := string($quote/@cite) let $source := $quote/following-sibling::html:p let $author := normalize-space(substring-after($source/*[1], "--")) return <quote> <id>{$id}</id> <date>{$date}</date> <quote>{$quote}</quote> <cite>{$cite}</cite> <source>{$quote}</source> <author>{$author}</author> </quote>
The error message from eXist is "The actual cardinality for parameter 1 does not match the cardinality declared in the function's signature: string($arg as item()?) xs:string. Expected cardinality: zero or one, got 4."
Found the bug: the debugger wasn't very helpful (once I found it--apparently Author and Oxygen are not the same thing), but Saxon had much better error messages than eXist.
I needed to change

let $dd := $dt/following-sibling::html:dd

to

let $dd := $dt/following-sibling::html:dd[1]

eXist didn't tell me which line had the problem so I was looking in the wrong place. Saxon pointed me straight to it. Score 1 for Saxon.
Here's the finished script. It works for at least the last couple of years. I still have to test it out on some of the older files:
xquery version "1.0"; declare namespace xmldb="http://exist-db.org/xquery/xmldb"; declare namespace html="http://www.w3.org/1999/xhtml"; for $dt in doc("/db/quoteshtml/quotes2009.html")/html:html/html:body/html:dl/html:dt let $id := string($dt/@id) let $date := string($dt) let $dd := $dt/following-sibling::html:dd[1] let $quote := $dd/html:blockquote let $cite := string($quote/@cite) let $source := $quote/following-sibling::* let $sourcetext := normalize-space(substring-after($source, "--")) let $author := if (contains($sourcetext, "Read the")) then substring-before($sourcetext, "Read") else substring-before($sourcetext, "on the") let $location := if ($source/html:a) then $source/html:a else substring-after($sourcetext, "on the") let $quotedate := if (contains($sourcetext, "list,")) then normalize-space(substring-after($sourcetext, "list,")) else "" let $justlocation := if (contains($location, "list,")) then normalize-space(substring-after(substring-before($sourcetext, ","), "on the")) else $location return <quote> <id>{$id}</id> <postdate>{$date}</postdate> <quote>{$quote}</quote> <cite>{$cite}</cite> <author>{$author}</author> <location>{$justlocation}</location> { if ($quotedate) then <quotedate>{$quotedate}</quotedate> else "" } </quote>
The XQuery work continues to roll along. I think I've roughly figured out how to configure the server. I found and reported a few more bugs in eXist, none too critical. I now have eXist serving this entire web site on my local box, though I haven't changed the server here on IBiblio yet. That's still Apache and PHP. The next step is to convert all the static files from the last 12 years--quotes, news, books, conferences, etc.--into smaller documents in the database. For instance, each quote will be its own document. Then I have to rewrite the pages as XQuery "templates" that query the database. From that point I can add support for new posts, submissions, and comments via a web browser and forms.
I didn't really like the format of yesterday's Twitter dump so today I opened another can of XQuery ass-kicking to improve it. First, let's group by date:
xquery version "1.0"; declare namespace atom="http://www.w3.org/2005/Atom"; let $tweets := for $entry in reverse(document("/db/twitter/elharo")/atom:feed/atom:entry) return <div><date>{substring-before($entry/atom:updated/text(), "T")}</date> <p> <span>{substring-before(substring-after($entry/atom:updated/text(), "T"), "+")} UTC</span> {substring-after($entry/atom:title/text(), "elharo:")}</p></div> return for $date in distinct-values($tweets/date) return <div><h3>{$date}</h3> { for $tweet in $tweets where $tweet/date = $date return $tweet/p }</div>
Now let's hyperlink the URLs:
xquery version "1.0"; declare namespace atom="http://www.w3.org/2005/Atom"; let $tweets := for $entry in reverse(document("/db/twitter/elharo")/atom:feed/atom:entry) return <div><date>{substring-before($entry/atom:updated/text(), "T")}</date> <p> <span>{substring-before(substring-after($entry/atom:updated/text(), "T"), "+")} </span> {replace(substring-after($entry/atom:title/text(), "elharo:"), "(http://[^\s]+)", "<a href='http://$1'>http://$1</a>")}</p></div> return for $date in distinct-values($tweets/date) return <div><h3>{$date}</h3> { for $tweet in $tweets where $tweet/date = $date return $tweet/p }</div>
Let's do the same for @names:
xquery version "1.0"; declare namespace atom="http://www.w3.org/2005/Atom"; let $tweets := for $entry in reverse(document("/db/twitter/elharo")/atom:feed/atom:entry) return <div><date>{substring-before($entry/atom:updated/text(), "T")}</date> <p> <span>{substring-before(substring-after($entry/atom:updated/text(), "T"), "+")} </span> { replace ( replace(substring-after($entry/atom:title/text(), "elharo:"), "(http://[^\s]+)", "<a href='$1'>$1</a>"), " @([a-zA-Z]+)", " <a href='http://twitter.com/$1'>@$1</a>" ) }</p></div> return for $date in distinct-values($tweets/date) return <div><h3>{$date}</h3> { for $tweet in $tweets where $tweet/date = $date return $tweet/p }</div>
And one more time for hash tags:
xquery version "1.0"; declare namespace atom="http://www.w3.org/2005/Atom"; let $tweets := for $entry in reverse(document("/db/twitter/elharo")/atom:feed/atom:entry) return <div><date>{substring-before($entry/atom:updated/text(), "T")}</date> <p> <span>{substring-before(substring-after($entry/atom:updated/text(), "T"), "+")} </span> { replace ( replace ( replace(substring-after($entry/atom:title/text(), "elharo:"), "(http://[^\s]+)", "<a href='$1'>$1</a>"), " @([a-zA-Z]+)", " <a href='http://twitter.com/$1'>@$1</a>" ), " #([a-zA-Z]+)", " <a href='http://twitter.com/search?q=#$1'>#$1</a>" ) }</p></div> return for $date in distinct-values($tweets/date) return <div><h3>{$date}</h3> { for $tweet in $tweets where $tweet/date = $date return $tweet/p }</div>
And here's the finished result.
This morning a simple practice exercise to get my toes wet. First load my Tweets from their Atom feed into eXist:
xquery version "1.0"; declare namespace xmldb="http://exist-db.org/xquery/xmldb"; let $collection := xmldb:create-collection("/db", "twitter") let $filename := "" let $URI := xs:anyURI("file:///Users/elharo/backups/elharo_statuses.xml") let $retcode := xmldb:store($collection, "elharo", $URI) return $retcode
Then generate HTML of each tweet:
xquery version "1.0"; declare namespace atom="http://www.w3.org/2005/Atom"; for $entry in document("/db/twitter/elharo")/atom:feed/atom:entry return <p>{$entry/atom:updated/text()} {substring-after($entry/atom:title/text(), "elharo:")}</p>
Can I reverse them so they go forward in time? Yes, easily:
for $entry in reverse(document("/db/twitter/elharo")/atom:feed/atom:entry)
Now how do I dump that to a file? Maybe something like this?
xquery version "1.0"; declare namespace atom="http://www.w3.org/2005/Atom"; let $tweets := <html> {for $entry in document("/db/twitter/elharo")/atom:feed/atom:entry return <p>{$entry/atom:updated/text()} {substring-after($entry/atom:title/text(), "elharo:")}</p> } </html> return xmldb:store("/db/twitter", "/Users/elharo/tmp/tweets.html", $tweets)
Oh damn. Almost, but that puts it back into the database instead of the filesystem. Still, I can now run a query that grabs just that document and copy and paste the result, since there's only one. The first query gave almost 1000 results, and the query sandbox only shows one at a time.
Tomorrow: how do I serve that query as a web page?
What I've learned about eXist so far:
What I still don't know:
Partial answer:
xquery version "1.0"; declare namespace xmldb="http://exist-db.org/xquery/xmldb"; for $foo in collection("/db/collectionname") return $foo
First bug filed against eXist during this project: excessive confirmation, a common UI anti-pattern, especially on Windows, though in this case it's cross-platform.
Second bug filed. This one comes with potential for data loss.
Third bug and I haven't even left the installer yet. Time to check out the source code. (I hope I don't have to fix IzPack too.)
At the tune of the new year and a new decade, I've decided to explore some changes here. Several points are behind this:
I don't have a lot of spare time these days; and what I do have is mostly occupied with photography and chasing birds, but I've decided that there's not a lot of point to continuing with this site as it is.
Don't worry though. It's not going away. I'm just going to focus on building a new infrastructure rather than on posting more news. I'm going to dogfood my work right here on Cafe con Leche. (I will keep Cafe au Lait on the old system until I'm happy with the new one.) I've decided to begin by experimenting with bringing the site up on top of eXist-db. It may go down in flames. It may not work at all. I may have to revert to the old version. It will probably sometimes be unavailable. There will have to be several iterations. But certainly along the way I'll learn a few things about XQuery databases, and just maybe I'll produce something that's more widely useful than a few bits of AppleScript and XSLT. See you on the other side!