Cafe con Leche News Tuesday, December 4, 2007

The morning kicked off with a Microsoft-sponsored breakfast on interoperability. This is a little like listening to Cardinal Egan preach about the evils of having sex with altar boys. Also like Cardinal Egan, it's pretty obvious that Microsoft doesn't really understand the issues at hand or why people are annoyed with them. Their definition of interoperability still involves Microsoft defining the formats, languages, and APIs exclusively, and Microsoft refusing to accept any compromises that would in any way hinder their own goals and values. To them interoperability means that they define the formats and allow other developers to use them, and if they're feeling especially magnanimous they might even let us do that without a click-through license and an NDA. However they certainly don't intend to allow any of us peasants to have a voice in what the rules should be. They claim they've reformed, but what they really mean is that now they're a benevolent dictator instead of a malevolent dictator. They don't understand that what we want is a democracy, not a dictator at all.

Gregg Pollack talks about RESTful XML Web Services with Ruby on Rails. He plays Ruby in this video:

He's here to talk about the "big problem in web services."

RAILS has three parts:

Model: database
View: HTML and XML display logic
Controller: code, User action controller. These need to be as small as possible. There should only be one controller per model (database table).

I'm still not sure I like convention over configuration. The database is rarely configured like I'd want, and I'm usually not working with a green field database over which I have full control. However, the RAILS he's showing looks a lot nicer than the version I last looked at. They've fixed a lot of their early REST mistakes. I wonder what this would look like if the backend were a native XML database instead of a relational database? Maybe just the eXist REST API.

I have to figure out how he does that cool cursor spotlight trick.

At 9:45, John Davies from IONA Technologies gave a fascinating talk about working with XML in financial services and banking industries: 100 billion dollar hedge funds that need 3 millisecond response times and the like. (They aren't doing very well: despite that capitalization they're only making about $19,000,000 a day. They'd do better in index funds.)

I didn't have the energy to take live notes, but there was a lot of good stuff. He thinks that the industry is moving away from web services and to REST. On the other hand, he also said that HTTP didn't work very well for them because it was stateless, so they needed to use JMS and MQSeries. I can't quite reconcile those two. Maybe I misheard him on the bit about HTTP, or he switched from one part of financial services to another.

He's also doing a lot of work with legacy comma-separated value formats by defining filters that present it as XML without actually rewriting it on disk. Then he can use XPath, XSLT, and so forth. A lot of his data does not fit well into relational databases. (He says the SQL queries to reconstruct some of this are half a page long.) He didn't talk much about native XML databases, but at least to me it sounded like he was hinting that that was what he needed.
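The idea of presenting CSV as XML without touching the file can be sketched in a few lines. This is only my own illustration in Python, not Davies's actual filter: the function name `csv_as_xml`, the element names `rows` and `row`, and the sample trade data are all made up, and it assumes the CSV header fields happen to be legal XML element names.

```python
import csv
import io
import xml.etree.ElementTree as ET

def csv_as_xml(lines, root_name="rows", row_name="row"):
    """Build an in-memory XML view of CSV text; the CSV source is never rewritten."""
    reader = csv.DictReader(lines)
    root = ET.Element(root_name)
    for record in reader:
        row = ET.SubElement(root, row_name)
        for field, value in record.items():
            # Assumes each header field is a valid XML element name.
            ET.SubElement(row, field).text = value
    return root

# Hypothetical data standing in for a legacy comma-separated feed.
SAMPLE = "symbol,side,quantity\nIBM,buy,100\nMSFT,sell,250\nIBM,sell,40\n"

view = csv_as_xml(io.StringIO(SAMPLE))

# ElementTree supports a small XPath subset, which is enough for a simple query.
ibm_quantities = [q.text for q in view.findall("./row[symbol='IBM']/quantity")]
print(ibm_quantities)  # ['100', '40']
```

A real filter would presumably stream rows rather than build the whole tree, but the principle is the same: the XML exists only as a view over the original data.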

After the coffee break (I'm finally awake after chugging about 16 ounces) I'm listening to Arofan Gregory from the Open Data Foundation talk about "Towards a Global Infrastructure for Data and Metadata: The Open Data Foundation." They're mostly looking at the raw data collected by governments and researchers, not the information generated by processing this data.

The organization is virtual and lives on Skype. He's wasting too much time telling us about the organization. He has yet to tell us what they're actually doing.

Response rates are falling in surveys because a lot of people (including myself) flat out refuse to participate in any surveys.

Disses the semantic web. He wants a federated web of data registries run by professionals. He wants to have standard ontologies to enable semantic interoperability. SDMX (ISO 17369) is important as are several other ISO standards including METS.

For the next session there are three talks I want to hear. Which to pick? BBC iPlayer? Bringing Collaborative Editing of Open Document Format (ODF) Documents to the Web? The missing architecture of the AEA (AJAX Enterprise Applications)? The ODF folks didn't show up, so I think I'll pick the iPlayer.

Robin Doran and Matthew Browning from the BBC:

First attempts at implementing iPlayer took for granted that a relational database would be used to store programme data. Project requirements, however, manifested themselves as amendments to the underlying data model. Each of these amendments would, in turn, require an update to the database schema and corresponding modification to the assumptions of the client code. This became cumbersome and confusing. Additionally, serialisation to and from the database store introduced latency to the publication pipeline. Huge quantities of data that quickly became of only historical interest were being stored, requiring tuning of the database server just to make it perform acceptably, using software developer resource when it could better be spent elsewhere.

Rationalisation and Streamlining

It was recognised that the evolving data model could be expressed in terms of a RelaxNG schema. A great deal of work went into getting this right and outputting documents adhering to it in a single transformation on input data. This gave us a readable point of reference and a handy way to determine the feasibility of new requirements: if they can be expressed in terms of a transformation on our so-called ‘Content Package’ they are possible.

Business Rules

Database removal and input rationalisation allowed us to impose order upon domain-specific business rules. Requirements were no longer implemented in the software but both specified and implemented in terms of a transform on our input data. Separation also enabled more effective testing and tracking of ownership and history of rules.

Content Publication

Output content is destined for both human and machine consumption. All publishing is divided into two stages: firstly, the production of an initial XML representation of an artefact and, secondly, publication of the output itself. Two-pass publication allowed us to validate XML representations against their corresponding schemas for quality assurance. Config-driven implementation means that adding a new output format is just a matter of dropping in a schema that describes it.

Did It Work?

Solution initially conceived as a mechanism to publish a subset of web content has expanded without effort to make two other internal projects redundant and produce all non-dynamic web content as well as inter-component messaging.

Doesn't work in the U.S. because it's UK government funded. That's utter crap. It's time to tear down artificial boundaries. If the BBC doesn't want to send their content overseas, then we'll just get it from BitTorrent. (And our files will likely be higher quality too.)

TVA is an emerging XML standard for television schedules and TV content metadata.

Relational database performance was "not great". Why was it so slow? Schemas were too inflexible for RAD. What database were they using? MySQL. They don't need to store historical data. They can throw it away. What content store are they using now? A file system or a non-relational DB? They're using the file system.

Directed acyclic graphs are the basic nature of their data. Sounds like a forest to me (and any forest can trivially become a tree by adding a special root element that holds all the trees in the forest).
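The forest-to-tree trick really is a one-liner. Here is a small sketch of it; the wrapper name `forest` and the sample `brand` elements are my own inventions, not the BBC's markup, and it assumes no node is shared between trees (a DAG with shared nodes would need references or duplication first).

```python
import xml.etree.ElementTree as ET

def forest_to_tree(trees, root_name="forest"):
    """Wrap a list of independent trees under one synthetic root element."""
    root = ET.Element(root_name)
    root.extend(trees)  # each former root becomes a child of the new root
    return root

# Two hypothetical, unrelated trees.
forest = [
    ET.fromstring("<brand><series/></brand>"),
    ET.fromstring("<brand><episode/></brand>"),
]

tree = forest_to_tree(forest)
print(tree.tag, len(tree))  # forest 2
```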

In the first afternoon session, I'm listening to Intel's Ken Graf talk about "Building a XSLT Processor for large documents and high-performance." Large to him means 0.3-2 GB, and the largest document they can handle is 32GB. That qualifies in my mind. I haven't seen many documents bigger than that. He does warn that these techniques may not work for smaller documents. They apparently just announced this product this week.

He asks how many people have XML performance problems (about a third of the audience) and how many have abandoned a project due to performance problems. (No one admits to this. One person tentatively puts their hand half way up.) He should have asked how many of those with problems were using DOM or XSLT vs. SAX. He's basically right that DOM takes 3-5 times the size of the actual document, but he severely underrates SAX's abilities, and vastly overestimates its memory usage. After all these years SAX still gets no respect, even though it's the obvious choice for documents like these.
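To make the point about SAX concrete, here is a minimal streaming sketch using Python's standard `xml.sax` module. The handler keeps only a dictionary of counts, so memory use depends on the number of distinct element names rather than the size of the document. The class and function names and the tiny sample document are mine, chosen just for illustration.

```python
import io
import xml.sax

class ElementCounter(xml.sax.ContentHandler):
    """Counts elements by name; nothing from the document itself is retained."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1

def count_elements(stream):
    handler = ElementCounter()
    xml.sax.parse(stream, handler)  # the parser reads the stream incrementally
    return handler.counts

# A tiny stand-in; the same code works on a multi-gigabyte file object.
doc = io.BytesIO(b"<log><entry/><entry/><entry><note/></entry></log>")
counts = count_elements(doc)
print(counts)  # {'log': 1, 'entry': 3, 'note': 1}
```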

The core data structure is a table that manages symbols. They store event records for each parser type. The records contain an offset into the actual XML document on disk. The table can be built in streaming mode, and you can work with the start of the data before you get to the end. This reminds me of VTD-XML.
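I don't know the details of Intel's table, but the general idea of recording events as offsets into the document, rather than copying content into objects, can be sketched with Python's `xml.parsers.expat`. Everything here (the function name, the shape of the records, the sample document) is my own guess at the flavor of the technique, not their design. It relies on expat's `CurrentByteIndex`, which reports the byte position of the event being handled.

```python
import xml.parsers.expat

def build_offset_table(data):
    """Return (element name, byte offset of its start tag) records for a document."""
    table = []
    parser = xml.parsers.expat.ParserCreate()

    def start(name, attrs):
        # Record where the start tag begins instead of copying any content.
        table.append((name, parser.CurrentByteIndex))

    parser.StartElementHandler = start
    parser.Parse(data, True)  # True marks this as the final chunk of input
    return table

DOC = b"<a><b>hi</b><c/></a>"
table = build_offset_table(DOC)

# Each record points back into the original bytes.
for name, offset in table:
    print(name, offset, DOC[offset:offset + 1 + len(name)])
```

The table stays small relative to the document, and anything that needs the actual text can seek to the offset and read it on demand.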

By Intel's measurements, new threads are only justified for operations that take one million or more assembly level instructions.

I'm not sure but I think they're breaking the document into pieces and running it across several threads/processors/cores at once. This is called "Simultaneous XPath expressions". This doesn't seem to help as much on transforms, as opposed to pure queries.

The multithreaded approach is interesting. I wonder if it's possible to design a multithreaded parser that would give us another order of magnitude improvement in parsing speed. Parsing seems like a fundamentally serial operation but a lot of apparently serial operations can be parallelized when you think about the problem a little. Nonetheless if it's true that new threads are only justified for operations that take one million or more assembly level instructions, then this may well not help for a lot of documents. I suspect the real gain is in running many smaller documents through multiple threads simultaneously.
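The "many small documents across many threads" arrangement is easy to sketch. This shows only the shape of it: the documents and the worker function are made up, and in CPython the global interpreter lock means threads won't actually speed up CPU-bound parsing, so treat it as an illustration of the structure rather than a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET

def count_elements(xml_text):
    """Parse one whole document and count its elements."""
    root = ET.fromstring(xml_text)
    return sum(1 for _ in root.iter())

# Hypothetical small documents; each is parsed independently of the others.
documents = ["<a><b/><b/></a>", "<x/>", "<p><q><r/></q></p>"]

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() hands one document to each worker and preserves input order.
    counts = list(pool.map(count_elements, documents))

print(counts)  # [3, 1, 3]
```

Because no document depends on another, there is no coordination cost beyond handing out the work, which is why this is the easy case compared to splitting a single parse.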

Now Tony Lavinio from Data Direct XQuery talks about "Using XQuery and XSLT on Non-XML Data". They do this by plugging a converter into a URIResolver. They can also represent the input as a SAXSource or DOMSource ("Nothing good to say about DOM.") or StreamSource or STAXSource. Transforms happen on the fly.

I want to see what the output XML from a CSV input (for example) or a relational query looks like. (Update: just constant td, tr, and table elements.) To resolve a CSV file:

java -cp saxon9.jar -r com.ddtek.xml2007.CSVResolver net.sf.saxon.Transform x-csv:file///c:/XML_2007/books.txt -u table.xsl

OutputURIResolver goes the other way. The XQuery resolver converts some stuff to SQL and other parts to Java code.
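The resolver idea translates readily outside of Java. Below is a rough Python analogue, not DataDirect's code: a `resolve` function looks at the URI scheme and hands the rest to a converter that produces the constant table, tr, and td elements. The `x-csv` scheme name comes from the command line above; the `CONVERTERS` registry, the function names, and the sample file contents are all hypothetical.

```python
import csv
import os
import tempfile
import xml.etree.ElementTree as ET

def csv_to_table(path):
    """Convert a CSV file into table/tr/td elements on the fly."""
    table = ET.Element("table")
    with open(path, newline="") as f:
        for record in csv.reader(f):
            tr = ET.SubElement(table, "tr")
            for cell in record:
                ET.SubElement(tr, "td").text = cell
    return table

# Hypothetical registry mapping URI schemes to converters.
CONVERTERS = {"x-csv": csv_to_table}

def resolve(uri):
    """Pick a converter by URI scheme and return the source as XML."""
    scheme, _, rest = uri.partition(":")
    converter = CONVERTERS.get(scheme)
    if converter is None:
        raise ValueError("no converter registered for scheme " + repr(scheme))
    return converter(rest)

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "books.txt")
    with open(path, "w", newline="") as f:
        f.write("title,author\nTitle A,Author A\n")  # made-up sample rows
    table = resolve("x-csv:" + path)

cells = [td.text for td in table.iter("td")]
print(cells)  # ['title', 'author', 'Title A', 'Author A']
```

The consumer never sees the CSV; it just asks for a URI and gets elements back, which is what lets an unmodified XSLT or XQuery engine work on non-XML input.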

Micah Dubinko from Yahoo talks about "WebPath: Querying the web as XML". He calls this the "Platonic Web". He says we need better web tools.

WebPath started as a Hack Day project. 5 Main components:

Lexer: PLY (Python Lex-Yacc)
Recognizer (because of div div div and * * *, middle tokens are different than outer ones in XPath)
Parser (top down operator precedence)
Interpreter

(I missed one.)

Liberal name tests that don't require prefixes in XPath expressions for XHTML.

Adds a get(url) extension function that retrieves a page from the web. Sort of like the document function, but can use this as a location step; e.g. get(a/@href). Or could use ---> or a traverse() function. In fact, this turns out to be the XPath 2.0 doc() function.
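As far as I could follow it, using get() as a location step means "fetch each linked page and keep navigating inside it". Here is a toy sketch of that behavior, not WebPath itself: a dictionary stands in for the web so nothing touches the network, and the URLs, page contents, and the `linked_pages` helper are all invented for the example.

```python
import xml.etree.ElementTree as ET

# A stand-in for the web: hypothetical URLs mapped to well-formed pages.
FAKE_WEB = {
    "http://example.test/index":
        "<html><body>"
        "<a href='http://example.test/one'>one</a>"
        "<a href='http://example.test/two'>two</a>"
        "</body></html>",
    "http://example.test/one":
        "<html><head><title>Page One</title></head><body/></html>",
    "http://example.test/two":
        "<html><head><title>Page Two</title></head><body/></html>",
}

def get(url):
    """Retrieve a page and return it as a parsed tree."""
    return ET.fromstring(FAKE_WEB[url])

def linked_pages(page):
    """Rough analogue of get(a/@href): follow every link found in the page."""
    return [get(a.get("href")) for a in page.iter("a")]

index = get("http://example.test/index")
# Keep navigating inside each retrieved page, as a further location step would.
titles = [page.findtext("head/title") for page in linked_pages(index)]
print(titles)  # ['Page One', 'Page Two']
```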

Perhaps I was just in my usual afternoon daze, but I confess I didn't see what exactly the point was here.

For the final afternoon session, Mark Birbeck talks about "XForms, REST, XQuery...and skimming" (like a stone bouncing across a lake):

'Skimming' is about being able to install various pieces of server-side software and then not have to touch them again. No configuration…no writing of server-side scripts…just store data and retrieve it. It may sound a little odd, but a good example of a component that can do this is a WebDAV server; here you simply install the software and then start saving documents, editing and updating them, searching, and so on.

There is no reason why you couldn't build an entire client-side application that manipulates documents and stores and retrieves them, without having to do any more to the server than the initial installation of the WebDAV software.

The XML database eXist can be much the same as WebDAV in that you can install it and then immediately start punching XML documents; unlike relational databases you don't need to know in advance what you want to store so there's no need to create tables first, define schemas, etc.

But the skimming architecture goes further; by using a standard interface to our data (in this case XQuery) we don't even need to write server-side scripts, applications or servlets to manage the data. Instead we just use queries from our 'rich client'. The resulting application is very loosely-coupled, and can run on just about any server-side architecture; client-side forms can be deployed by any HTTP server because there is no scripting involved in their creation, and the data can be delivered by any XML database that supports XQuery.
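To make "just use queries from the client" concrete, here is a sketch of how a client might build a query URL for eXist's REST interface with no server-side code at all. The `_query` and `_howmany` parameters are, to the best of my knowledge, part of eXist's REST API; the host, port, collection path, helper name, and the query itself are assumptions for illustration. The sketch only constructs the URL and does not send a request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def exist_query_url(base, collection, xquery, howmany=10):
    """Build a GET URL that asks an eXist REST endpoint to run an XQuery."""
    params = urlencode({"_query": xquery, "_howmany": howmany})
    return base + "/rest" + collection + "?" + params

# Hypothetical server location, collection, and query.
url = exist_query_url(
    "http://localhost:8080/exist",
    "/db/notes",
    "//note[@author='mark']",
)
print(url)

# Round trip: the server would decode the same query the client wrote.
decoded = parse_qs(urlsplit(url).query)
```

The client owns the query, and the server is just a generic store that evaluates it, which is the loose coupling the abstract is describing.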

Application development and deployment can therefore become very fast.

He believes the client is too thin, and insufficient to build web applications, so most of the work is done on the server. When the server program generates the user interface from the data (as in Rails or AJAX based XForms toolkits) it's difficult to decouple them, and use the same UI with multiple data sources or the same data source with multiple UIs. URLs invariably point to the application that acts on the data, rather than the data itself. This also helps split tasks among writers and HTML jockeys and away from server-side developers.

XML News from Tuesday, December 4, 2007