XML News from Tuesday, July 8, 2008

Google has released protobufs. Think of protobufs as doing for ASN.1 what XML did for SGML. That is, it's a simpler format for exchanging binary data that mere mortals may be able to use. Libraries are available for C++, Java, and Python; and the format is well documented for anyone else who wants to work in some other language. According to Google,

Protocol buffers have many advantages over XML for serializing structured data. Protocol buffers:

For example, let's say you want to model a person with a name and an email. In XML, you need to do:

  <person>
    <name>John Doe</name>
    <email>jdoe@example.com</email>
  </person>

while the corresponding protocol buffer message definition (in protocol buffer text format) is:

  person {
    name = "John Doe"
    email = "jdoe@example.com"
  }

In binary format, this message would probably be 28 bytes long and take around 100-200 nanoseconds to parse. The XML version is at least 69 bytes (if you remove whitespace) and would take around 5,000-10,000 nanoseconds to parse.
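The two size figures can be reproduced by hand. The following is a minimal Python sketch that encodes the message according to the documented wire format (a tag byte, a length, then the UTF-8 bytes of each string) and compares it with the whitespace-free XML. The field numbers 1 and 2 are an assumption, since no .proto file is shown here, and the helper names are made up for illustration. The timing figures are not something a sketch like this can check.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint, low group first."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode one string field: tag, byte length, then the UTF-8 bytes."""
    payload = text.encode("utf-8")
    tag = (field_number << 3) | 2  # wire type 2 = length-delimited
    return encode_varint(tag) + encode_varint(len(payload)) + payload


# Assumed field numbers: name = 1, email = 2.
message = encode_string_field(1, "John Doe") + encode_string_field(2, "jdoe@example.com")
xml = "<person><name>John Doe</name><email>jdoe@example.com</email></person>"

print(len(message))              # 28
print(len(xml.encode("utf-8")))  # 69
```

Each field costs two bytes of overhead (tag plus length), which is where the saving over the paired start and end tags comes from.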

Also, manipulating a protocol buffer is much easier:

  cout << "Name: " << person.name() << endl;
  cout << "E-mail: " << person.email() << endl;

Whereas with XML you would have to do something like:

  cout << "Name: "
       << person.getElementsByTagName("name")->item(0)->innerText()
       << endl;
  cout << "E-mail: "
       << person.getElementsByTagName("email")->item(0)->innerText()
       << endl;

However, protocol buffers are not always a better solution than XML – for instance, protocol buffers would not be a good way to model a text-based document with markup (e.g. HTML), since you cannot easily interleave structure with text. In addition, XML is human-readable and human-editable; protocol buffers, at least in their native format, are not. XML is also – to some extent – self-describing. A protocol buffer is only meaningful if you have the message definition (the .proto file).

I think Google is overstating the downsides of XML here. They make the common mistake of conflating a horrible API (DOM) with XML itself. In a sane API, you'd just do something like this:

<xsl:template match='person'>
  <xsl:value-of select="name"/>
  <xsl:text>
</xsl:text>
  <xsl:value-of select="email"/>
</xsl:template>

This would look even simpler in XQuery or E4X, but I don't have enough practice with those languages to type them with reasonable confidence before my morning coffee.
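The same point holds outside XSLT. As a hedged illustration, here is the name-and-email lookup in Python using the standard library's xml.etree.ElementTree. The choice of library and the describe helper are assumptions made for this sketch, not part of Google's comparison.

```python
import xml.etree.ElementTree as ET

PERSON_XML = """
<person>
  <name>John Doe</name>
  <email>jdoe@example.com</email>
</person>
"""


def describe(xml_text: str) -> list:
    """Return the two output lines from the C++ example for a <person> document."""
    person = ET.fromstring(xml_text.strip())
    return [
        "Name: " + person.findtext("name", default=""),
        "E-mail: " + person.findtext("email", default=""),
    ]


for line in describe(PERSON_XML):
    print(line)
```

One findtext call per field is about as much ceremony as person.name(), which suggests the verbosity in Google's example belongs to DOM rather than to XML.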

Still, maybe this binary format can give the people who really need (or who think they need) a binary format for efficiency or other reasons their own sandbox, so they can stop peeing in ours.

Protobufs do show one lesson learned from experience: they mirror XML's must-ignore semantics. It is possible to put extra fields in a protobuf and not break every downstream consumer that doesn't know about those fields. That's a rare quality in a binary format.
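What makes this possible is that every field on the wire carries its own number and wire type, so a reader can tell how many bytes to step over even when it has never heard of the field. A minimal Python sketch of such a reader follows. It assumes field numbers 1 and 2 for name and email, uses made-up helper names, and is not how Google's libraries are implemented.

```python
KNOWN_FIELDS = {1: "name", 2: "email"}  # assumed field numbers


def read_varint(data: bytes, pos: int):
    """Decode a base-128 varint at pos; return (value, position after it)."""
    result, shift = 0, 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def decode_person(data: bytes) -> dict:
    """Decode the known string fields and silently skip everything else."""
    person = {}
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:    # varint value
            _, end = read_varint(data, pos)
        elif wire_type == 1:  # fixed 64-bit value
            end = pos + 8
        elif wire_type == 2:  # length-delimited value
            length, pos = read_varint(data, pos)
            end = pos + length
        elif wire_type == 5:  # fixed 32-bit value
            end = pos + 4
        else:
            raise ValueError("wire type not handled in this sketch")
        if end > len(data):
            raise ValueError("field runs past end of input")
        name = KNOWN_FIELDS.get(field_number)
        if name is not None and wire_type == 2:
            person[name] = data[pos:end].decode("utf-8")
        # Any other field is unknown to this reader: step over it and carry on.
        pos = end
    return person


OLD_MESSAGE = b"\x0a\x08John Doe\x12\x10jdoe@example.com"
# A newer producer appends field 3 (a varint) and field 4 (a string).
NEW_MESSAGE = OLD_MESSAGE + b"\x18\x2a" + b"\x22\x08555-1234"

print(decode_person(OLD_MESSAGE))
print(decode_person(NEW_MESSAGE))
```

Both calls yield the same name and email, so the old consumer keeps working when a newer producer adds fields.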

One question I have is what does it mean for a protobuf to be malformed? How easy is it to detect a corrupt byte stream? What will happen if someone deliberately attempts to feed bad data to a protobuf consumer? Protobufs are clearly designed with the idea in mind of taking bytes off the wire or from disk and shoving them into memory. This technique has been incredibly dangerous in the past, and led to incredibly brittle software. Whether the protobuf libraries are actually doing that or not, I'm not sure. However, although I do see wire format documentation on Google's site, I don't see an actual BNF grammar anywhere, and that makes me nervous. A good rule of thumb for any wire format or file format (and protobufs are really both) is that consumers must be prepared for absolutely any byte stream as input, whether it's what they expect or not. Any byte stream that does not satisfy the grammar must be detected and rejected. Any byte stream that does satisfy the grammar must be acceptable. Never trust external input to a program without verification. Anything less is insecure and dangerous. I do note that their C++ examples return error codes rather than throwing exceptions on parse failure, which smells bad to this Java programmer, but maybe that's just C++.
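To make the rule of thumb concrete, here is a minimal Python sketch of a structural check over the documented wire format. It either accepts a byte stream or raises one well-defined exception, and it never reads past the end of the input. The names MalformedMessage and check_message are hypothetical, groups (wire types 3 and 4) are rejected to keep the sketch short, and none of this describes what Google's libraries actually do.

```python
import random


class MalformedMessage(ValueError):
    """Raised for any byte stream that is not structurally valid."""


FIXED_SIZES = {1: 8, 5: 4}  # wire type -> payload size in bytes


def read_varint(data: bytes, pos: int):
    """Decode a varint at pos, refusing truncated or over-long encodings."""
    result = 0
    for shift in range(0, 70, 7):  # a varint is at most 10 bytes
        if pos >= len(data):
            raise MalformedMessage("truncated varint")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
    raise MalformedMessage("varint longer than 10 bytes")


def check_message(data: bytes) -> int:
    """Return the number of fields, or raise MalformedMessage."""
    pos = 0
    count = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if field_number == 0:
            raise MalformedMessage("field number 0 is not allowed")
        if wire_type == 0:
            _, pos = read_varint(data, pos)
        elif wire_type == 2:
            length, pos = read_varint(data, pos)
            pos += length
        elif wire_type in FIXED_SIZES:
            pos += FIXED_SIZES[wire_type]
        else:
            raise MalformedMessage("unsupported wire type %d" % wire_type)
        if pos > len(data):
            raise MalformedMessage("field runs past end of input")
        count += 1
    return count


good = b"\x0a\x08John Doe\x12\x10jdoe@example.com"
print(check_message(good))  # 2

for bad in (good[:-1], b"\x0a\xff", b"\x00\x00", b"\x0f"):
    try:
        check_message(bad)
    except MalformedMessage as error:
        print("rejected:", error)

# Crude fuzzing: arbitrary bytes must be accepted or cleanly rejected.
rng = random.Random(0)
for _ in range(1000):
    blob = bytes(rng.randrange(256) for _ in range(rng.randrange(32)))
    try:
        check_message(blob)
    except MalformedMessage:
        pass
```

This checks structure only. Whether the fields match a particular .proto file, and whether the strings are valid UTF-8, would need a further schema-aware pass.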

The real question in my mind is whether protobufs have any hope of working over the public Internet. Schema-dependent, opaque binary formats work a lot better behind the firewall, where one group writes the software to both produce and consume the data, than over the heterogeneous world of the Internet, where you have little idea who's reading your data or why. In that world, self-describing text makes all the difference, efficiency be damned.