At its core, SAX, the Simple API for XML, is based on just two interfaces, the XMLReader interface that represents the parser and the ContentHandler interface that receives data from the parser. These two interfaces alone suffice for 90% of what you need to do with SAX. This chapter shows the basic operation of XMLReader and discusses ContentHandler in detail. The next chapter explores a variety of ways to customize the parsing process through the more advanced features of the XMLReader interface.
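The division of labor between the two interfaces can be sketched in a few lines. This is only a minimal illustration, not the chapter's own example: the class name ElementCounter and its countElements() method are hypothetical, the handler extends the standard DefaultHandler adapter class rather than implementing ContentHandler directly, and the XMLReader is obtained through JAXP's SAXParserFactory, which is an assumption about how you choose to locate a parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: counts the elements in a document.
// DefaultHandler supplies do-nothing implementations of every
// ContentHandler method, so only the callback of interest is overridden.
class ElementCounter extends DefaultHandler {

    private int count = 0;

    // The parser invokes this once for each start-tag or empty-element tag.
    @Override
    public void startElement(String namespaceURI, String localName,
                             String qualifiedName, Attributes atts) {
        count++;
    }

    public int getCount() {
        return count;
    }

    // Parses the XML text and returns the number of elements it contains.
    public static int countElements(String xml) throws Exception {
        // Assumption: JAXP is used here to locate an XMLReader implementation.
        XMLReader parser =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        ElementCounter handler = new ElementCounter();
        parser.setContentHandler(handler); // the ContentHandler receives the data
        parser.parse(new InputSource(new StringReader(xml))); // the XMLReader reads the document
        return handler.getCount();
    }

    public static void main(String[] args) throws Exception {
        // One order element plus two item elements: prints 3
        System.out.println(countElements("<order><item/><item/></order>"));
    }
}
```

The client code never touches the parser's internals. It hands the XMLReader a handler and a document, and everything it learns about the document arrives through the handler's callback methods.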
The Simple API for XML, SAX, was invented in late 1997/early 1998 when Peter Murray-Rust and several authors of XML parsers written in Java decided there wasn’t much point to maintaining multiple similar yet incompatible APIs to do exactly the same thing. Murray-Rust was the first to suggest what he called “YAXPAPI”. The reason Murray-Rust wanted Yet Another XML Parser API was that he was thoroughly sick of supporting multiple, incompatible XML parsers for his parser-client application JUMBO. Instead, he wanted a standard API everyone could agree on. Parser authors Tim Bray and David Megginson quickly signed on to the project, and work began in public on the xml-dev mailing list where many people participated. Megginson wrote the initial draft of SAX. After a short beta period, SAX 1.0 was released on May 11, 1998.
SAX was designed around abstract interfaces rather than concrete classes so it could be layered on top of parsers’ existing native APIs. SAX is not the most sophisticated XML API imaginable, but that’s part of its beauty. The ease with which SAX could be implemented by many parser vendors with very different architectures contributed to its success and rapid standardization.
SAX in other languages
SAX has been unofficially ported to several other object-oriented languages including C++, Visual Basic, Python, and Perl. The general patterns and names of most functions remain the same. However, the details of implementation change quite a bit. For instance, C++ doesn’t have interfaces, but does have multiple inheritance, so ContentHandler, XMLReader and the like become classes containing nothing but pure virtual functions. The C++ string classes can’t handle Unicode so parsers must use pointers to arrays of custom types such as XMLCh instead. Unfortunately, there’s no standard C++ binding for SAX so the custom classes vary from one parser to the next, and you can’t easily port C++ SAX programs between different compilers and platforms in either binary or source form.
Although supporting the “Desperate Perl Hacker” was a goal of the original XML working group, Perl has always lagged other languages quite a bit when it comes to XML. The initial problem was the lack of support for Unicode, a sine qua non for XML. Modern Perls now have decent Unicode support. To really handle XML you need at least version 5.005_52 of Perl, preferably 5.6.1 or later, and ideally 5.8. There are several XML parsers available for Perl, though far and away the most popular is Larry Wall and Clark Cooper’s XML::Parser. This is a wrapper around James Clark’s expat XML parser written in C. However, this parser isn’t really SAX compatible, though it’s used in a lot of legacy code. New projects should use XML::SAX instead. Even with this module, in my opinion Perl is still not as ideal a language for processing XML as you might expect. Perl’s strength is its ability to work with the implicit structure in text documents such as tab-delimited text files and comma-separated values files. However, XML documents tend to have very explicit structure that is easily addressed by a language like Java. Perl’s strengths don’t come into play, but you still suffer the numerous well-known disadvantages of working with Perl. The inevitable obfuscation of Perl code seems to me too high a price to pay.
Python probably has the best support for SAX and XML of any of the non-Java languages. XML parsing, including a SAX port, has been a standard part of Python since version 2.0. Furthermore, Python has a standard Unicode string type. This is not quite the same as Python’s regular string type, but Python’s dynamic typing means this isn’t nearly as big an inconvenience as it is in C++. However, the fact remains that SAX is designed in and for Java, and Java is certainly the most convenient language with which to write SAX programs.
Although SAX is very much a de facto standard, it has not gone through any formal standardization process. Its development was open to anyone interested. All you had to do was join the xml-dev mailing list and participate in the discussions. The end result was explicitly placed in the public domain. It is free to be implemented or extended by anyone for any purpose without permission from anybody. It is not copyrighted or trademarked. As far as is known, no parts of it are patented by anyone either.
In late 1999, work began on SAX2. This was a radical reformulation of SAX that, while maintaining the same basic event-oriented architecture, replaced almost every class in SAX1. The main impetus for this radical shift was the need to make SAX namespace aware. However, many other new capabilities were added in SAX2, including filters and optional support for lexical events and DTDs. SAX2 was finished in May 2000, and has proven even more successful than SAX1. Indeed, SAX2 is the most complete XML API available anywhere. As of 2002, all major parsers that support SAX at all support SAX2. There is no reason to learn or concern yourself with the older classes and interfaces from SAX1, and henceforth I will discuss SAX2 exclusively.
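To make the namespace awareness concrete, here is a rough sketch of what a SAX2 handler sees for each element: the namespace URI and local name arrive as separate arguments alongside the prefixed qualified name. The class name NamespaceLister and its list() method are hypothetical, and the parser is again obtained through JAXP's SAXParserFactory, which is an assumption. A JAXP factory is not namespace aware until asked, hence the call to setNamespaceAware(true).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: records each element name as {namespace URI}localName.
class NamespaceLister extends DefaultHandler {

    private final List<String> names = new ArrayList<>();

    // SAX2 passes the namespace URI and local name separately from the
    // qualified (prefixed) name. Elements in no namespace get an empty URI.
    @Override
    public void startElement(String namespaceURI, String localName,
                             String qualifiedName, Attributes atts) {
        names.add("{" + namespaceURI + "}" + localName);
    }

    // Parses the XML text and returns the names in document order.
    public static List<String> list(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // off by default in JAXP
        XMLReader parser = factory.newSAXParser().getXMLReader();
        NamespaceLister handler = new NamespaceLister();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<a:doc xmlns:a=\"http://example.com/a\"><a:p/><q/></a:doc>";
        for (String name : list(xml)) {
            System.out.println(name);
        }
    }
}
```

Because the handler works with the URI rather than the prefix, it treats a:p and any other prefix bound to the same URI identically.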
For the first few years of its life, the official SAX distribution and documentation were maintained by David Megginson. However, he recently passed the torch to David Brownell, who has begun work on SAX 2.1. At the time of this writing, SAX 2.1 seems unlikely to be as radical a shift relative to SAX2 as SAX2 was relative to SAX1. Version 2.1 will add a few bits of information from the XML document that are not exposed by SAX2, such as the encoding declaration. However, no SAX2 classes, interfaces, or methods will be deprecated in SAX 2.1, and only programmers with very special needs will need to concern themselves with the new functionality in SAX 2.1.
Copyright 2001, 2002 Elliotte Rusty Harold | elharo@metalab.unc.edu | Last Modified May 26, 2002