XML News from Friday, August 6, 2004

The final day of Extreme kicks off with Simon St. Laurent talking about "General Parsed Entities: Unfinished Business". Simon says his is the only presentation looking at the layers below XML rather than building on top of XML. General parsed entities mostly worked in the early days when he was writing XML: A Primer. "Then I discovered that general parsed entities don't actually work" because parsers are allowed to ignore external general entities. "If James Clark wrote it, it must be right", he says, referring to expat and its lack of support for external general entities. He encountered problems in the real world when working with DocBook manuscripts at O'Reilly. "Then there's the SGML envy problem, which afflicts anyone who uses XML for an extended period of time." SDATA isn't necessary, but CDATA and subdocs are cool. XML provides enough rope to hang oneself, but not as comfortably as you could hang yourself with SGML.

XInclude parse="text" replaces SGML's external text (CDATA) entities. XInclude content negotiation is a good thing. He thinks there's a real problem with character entities, though the W3C disagrees. The last five years have seen a steady march away from DTDs. "No one wants to open up the XML spec and start over. I'll propose doing so later, but I don't expect to be taken seriously." "Is there room for a general solution?" The solution for character entities is "dead simple". In general, he's presenting a system for resolving entities based on an XML instance document format that defines the entities, plus an extra processing layer.
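
To make the parse="text" point concrete, here's a minimal sketch (mine, not Simon's proposed system) of pulling a raw text file into a document the way an SGML external text entity would, using XInclude support from Python's standard library. The file names are made up.

```python
# A minimal sketch of XInclude parse="text" standing in for an SGML
# external text (CDATA) entity. File names are invented for illustration.
from pathlib import Path
from xml.etree import ElementTree, ElementInclude

Path("listing.txt").write_text("if (x < 1 && y > 2) { ... }")   # raw text, no escaping needed
Path("doc.xml").write_text("""\
<chapter xmlns:xi="http://www.w3.org/2001/XInclude">
  <title>Sample code</title>
  <programlisting><xi:include href="listing.txt" parse="text"/></programlisting>
</chapter>""")

tree = ElementTree.parse("doc.xml")
ElementInclude.include(tree.getroot())   # splices listing.txt in as character data,
                                         # escaping < and & on serialization
print(ElementTree.tostring(tree.getroot(), encoding="unicode"))
```

The included file needs no escaping of < or &, which is exactly the job external CDATA entities used to do.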


For the second session of the morning, I switched rooms again to hear Walter Perry talk about "Dealing with the Instance: Markup and Processing". According to the abstract,

The two practices which define our work are marking up instance texts and then processing those marked-up instances. A particular text might, or might not, be marked up within the confines of a particular vocabulary, schema or DTD, just as an instance text might or might not be processed within the constraints of a particular schematic, which might or might not be the schematic anticipated when the instance was marked up. Thus it is the instance which is crucial both to markup and to processing, and the schematic, if any, is not the primary subject of either. The implications of this premise will determine the future of our field and the applicability of our practices to discrete areas of expertise. In this presentation I intend to derive from this notion of instance-first a clear picture of what our practices must look like in order to carry out this premise and to fulfill its promise.

Yesterday during a coffee break, Walter told me he's been working on marking up Sanskrit grammars for the last six months. Sometimes he might as well be writing in Sanskrit for all that people understand him. His ideas are so radically nonconformant that nobody ever understands him the first time they hear him, or believes him when they finally do understand him. But he's basically right.

Talk begins. Markup and processing are both performed on the instance, not on the schematic. The instance is key. "The processor expects specific markup." It has to. If it can't find the markup it expects, it can't do its job. "Markup expects particular processing." "Processing expects particular markup." But expectations are not requirements. "It is possible (and useful) to break the expected correlation of markup and processing." This allows us to achieve specific expertise in the process. This is useful for fraud detection, audits, and generation of different forms of output.

Validation of output is more important than validation of input. It is more important to produce what you need than to receive what you need. Data structures belong to processes, not markup. Data structures are instantiated from the instance document and its markup. Expectations are generated by business processes, not by the documents and their markup. Bills of lading turn into bills of shipping by the action of a process. Each separate document should be appropriate to what it is (and what the expertise of the process that created it is), not what it expects to be turned into. "This instance centricity is the complement of REST."
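
A rough sketch of what "validate the output" might look like in practice, using his bill-of-lading example. The DTD, element names, and lxml usage are my own illustration, not anything Walter showed.

```python
# A sketch of "validate what you produce": a process turns a bill of lading
# into a shipping document and checks its own output against the document
# type it promises before handing it on. Assumes lxml is installed; the DTD
# and element names are invented.
from io import StringIO
from lxml import etree

SHIPPING_DTD = etree.DTD(StringIO("""\
<!ELEMENT shipping (carrier, weight)>
<!ELEMENT carrier (#PCDATA)>
<!ELEMENT weight (#PCDATA)>
"""))

def lading_to_shipping(lading):
    """Produce a shipping document from whatever markup the lading uses."""
    shipping = etree.Element("shipping")
    etree.SubElement(shipping, "carrier").text = lading.findtext("carrier", default="unknown")
    etree.SubElement(shipping, "weight").text = lading.findtext(".//gross-weight", default="0")
    return shipping

lading = etree.XML("<lading><carrier>ACME</carrier><gross-weight>1200</gross-weight></lading>")
out = lading_to_shipping(lading)
if not SHIPPING_DTD.validate(out):          # validate the output, not the input
    raise ValueError(SHIPPING_DTD.error_log.filter_from_errors())
print(etree.tostring(out, encoding="unicode"))
```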

Processes expect certain input. You can test that instances comply with the document types expected as input. (Walter says this is not quite the same as validation, though the difference is unclear to me.)

Someone in the audience is resisting the idea that one cannot reject invalid documents that nonetheless provide the actual information needed. This always happens at Walter's talks.

Markup delineates the author's reading of the document, but it loses intent. Different processes have different intentions.

He's telling a very interesting story about how standardized schemas eliminate the unique value and expertise of different organizations. I may have to transcribe this story later. This reminds me a lot of Joel Spolsky's calls to never start over from scratch. He says you should consolidate "like processes" (and only like processes).

Processes are easily schematized and rationalized by consolidation.

What is net.syntags?


I'm editing this page manually, almost in real time. I mentioned in the coffee break that I was having trouble keeping it well-formed. I don't want the DTD to force me to fill in end-tags and such before I'm ready, but it would be nice if BBEdit (or another editor) gave me a little icon somewhere that indicated the current status of the document: perhaps a green check mark for well-formed, a red X for malformed.
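
Something like this little sketch would be enough to drive that indicator: parse the buffer with expat and report whether it's currently well-formed. The file name here is made up.

```python
# A minimal well-formedness check using expat from Python's standard library.
# "draft.xml" is a hypothetical file standing in for the editor's buffer.
import xml.parsers.expat

def well_formed(path):
    parser = xml.parsers.expat.ParserCreate()
    try:
        with open(path, "rb") as f:
            parser.ParseFile(f)
        return True, "well-formed"                       # green check mark
    except xml.parsers.expat.ExpatError as e:
        return False, f"line {e.lineno}, column {e.offset}: not well-formed"   # red X

print(well_formed("draft.xml"))
```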


Bruce Rosenblum (with co-author Irina Golfman) is giving the final regular session on "Automated quality assurance for heuristic-based XML creation systems". Schemas aren't enough. Validity isn't enough. By a show of hands, the vast majority of the authors at this conference used hand editing in emacs or the equivalent to write their papers. He's looking at heuristic-based, pattern-based, and manual conversion from existing documents. In these cases we need quality assurance beyond simply validating against a DTD or a schema.

Some techniques, like color coding, help manual proofing a lot. But they want more automated checking: they want to run a test suite across the output of an automatic conversion. So far the techniques seem pretty obvious. This is nothing anybody accustomed to running automated test suites on software doesn't already know. They're just testing that their software for converting existing data to XML runs properly. Maybe this is news to some XML developers (though I doubt it), but anybody doing extreme programming already knows this. Their test suite takes in the ballpark of 8-10 hours to run, so they do overnight testing rather than continuous testing.
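
For the record, the checks they're describing look roughly like this: rules a DTD can't express, run over the converted files. The specific rules, element names, and directory are my invention, not Rosenblum's actual suite.

```python
# A sketch of post-conversion QA beyond validity: the converted XML already
# validates against the DTD, so the suite asserts things a DTD can't express.
# Rules, element names, and the "converted/" directory are hypothetical.
import glob
import xml.etree.ElementTree as ET

def check(doc_path):
    problems = []
    root = ET.parse(doc_path).getroot()
    # Rule 1: every cross-reference points at an existing id.
    ids = {e.get("id") for e in root.iter() if e.get("id")}
    for xref in root.iter("xref"):
        if xref.get("linkend") not in ids:
            problems.append(f"{doc_path}: dangling xref {xref.get('linkend')!r}")
    # Rule 2: no citation lost its text during conversion.
    for cite in root.iter("citation"):
        if not (cite.text or "").strip():
            problems.append(f"{doc_path}: empty citation")
    return problems

failures = [p for path in glob.glob("converted/*.xml") for p in check(path)]
print("\n".join(failures) or "all checks passed")
```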


As is tradition, Michael Sperberg-McQueen delivers the closing keynote. The nominal title is "Runways, product differentiation, snap-together joints, airplane glue, and switches that really switch." The question is "Does XML have a model? A supermodel? Does it matter?"

He begins by talking about glottal stops and learning Danish to read secondary literature about Norse sagas. As an English speaker, he couldn't hear glottal stops. He doesn't understand the effort to drop models into XML any more than he could hear glottal stops. He feels like an alien. He doesn't know what people mean by models. He describes various meanings of the word model and decides they don't apply. Model trains, fashion models. One use of the word model is for things that are a simplification or theory of reality, possibly discredited (phlogiston model of fire, Rutherford model of the atom). Something has to be different between a model and the real thing. Useful models are simpler, or more familiar, or easier to calculate with.

He's suspicious of having a model. He wants different models at different times because not all models capture the same things. To understand the model of SGML you have to understand the difference between a document type declaration and a document type definition. He's bringing up Wilkins's and Leibniz's efforts to design a perfect language that does not allow untruthful statements. (Shades of Quicksilver, The Confusion, and Daniel Waterhouse.) But we do not believe there is a perfect universal vocabulary. SQL is too constrained by a single model. Thus, "users of SQL miss XML, but users of XML don't miss SQL." The lack of a single data model is a strength of XML. The models are owned by the users, not by ISO or the W3C. "Go forth and become models."


The conference will take place next year in Montreal, probably in August, probably in the same hotel. Some of the papers seem to be online on the Mulberry Tech website.