XML News from Monday, October 11, 2004

The W3C XML Binary Characterization Working Group has posted the first public working draft of XML Binary Characterization Properties. This describes the goals/hopes/dreams the group has for a binary format to replace XML. These include:

Accelerated Sequential Access
Byte Preserving
Compact
Data Model Versatility
Efficient Update
Embedding of arbitrary files
Encryptable
Extensible at the format level
Fragmentable
Hinting (I have no idea what this is, and neither does the draft)
Human Readable/Editable/Deducible
Integratable into the Web
Integratable into XML Family
No Arbitrary Limits
Fast
Random Access
Robust
Round Trippable
Schema Instance Change Resilience
Self Contained
Signable
Specialized codecs
Streamable
Support for Error Correction
Transcodable to XML
Transport Independence
Support for Open Content Models
Verifiable Integrity
Version Identification
Draconian error handling
Forward Compatible
Free
Small Footprint
Ubiquitous Implementation
Net decrease in entropy

OK. I snuck that last point in myself. It seems only slightly less likely than satisfying all the rest of these goals in a single format. I am glad the working group is so ambitious. Hopefully they'll either fail, and let the rest of us go back to doing real work with plain vanilla XML, or they'll succeed and produce something quite useful. However, I do hope they won't accept half measures. Some of these goals have not been important to vendors in this space before, human readability perhaps foremost among them. The group does use a very unorthodox definition of human readable though. Human deducible is more accurate; i.e. the format needs to be able to be reverse engineered without access to the specification or documentation. The requirement to be self contained rules out a lot of schema based compression systems.

There's one important non-goal that's notable by its absence. There's no requirement here that the format be language neutral. Actually that's two requirements: one that it not prefer one programming language over another and one that it not prefer one human language over another. A lot of the proposals I've seen have been designed so that they ran very fast in one particular environment but slowed down noticeably in environments with different byte orders, primitive data type widths, and other memory layout characteristics. Binary formats by their nature tend to be very tied to one particular architecture to the detriment of others. Java byte code, for instance, happens to look a lot like what a Sparc engineer would expect to see, and that meant Java was less than optimal on X86 systems even though it was nominally platform independent. XSLT 1.0 is very hard to implement outside of java (and these days, even inside Java) because it normatively references the Java 1.1 specification. XSLT 1.1 died due to infighting between the Python and Java communities.

Even more seriously the specification should not favor some natural languages over others. For instance, it would be unacceptable to design a format where English ASCII data was highly compressed but Chinese data wasn't. UTF-8 has this issue, but Unicode and XML don't because they allow individuals documents to choose their own encodings. Each document can be optimized for tis own needs. Typical data-neutral compression schemes like basic Huffman coding have this property naturally. However, some of the schemes for XML compression I've seen make a lot of assumptions about what the data looks like, and optimize for particular scenarios at the expense of others. At least when it comes to text, we need to make sure that English and Chinese are both supported well (as should be the 6,000 or so other languages on the planet too, of course).