nu.xom
Class Serializer

java.lang.Object
  extended by nu.xom.Serializer

public class Serializer
extends java.lang.Object

A serializer outputs a Document object in a specific encoding using various options for controlling white space, indenting, line breaking, and base URIs. However, in general these options do affect the document's infoset. In particular, if you set either the maximum line length or the indent size to a positive value, then the serializer will not respect input white space. It may trim leading and trailing space, condense runs of white space to a single space, convert carriage returns and line feeds to spaces, add extra space where none was present before, and otherwise muck with the document's white space. The defaults, however, preserve all significant white space, including ignorable white space, to the maximum extent possible.

Version:
1.0d23
Author:
Elliotte Rusty Harold

Constructor Summary
Serializer(java.io.OutputStream out)
           Create a new serializer that uses the UTF-8 encoding.
Serializer(java.io.OutputStream out, java.lang.String encoding)
           Create a new serializer that uses a specified encoding.
 
Method Summary
protected  void breakLine()
           Writes the current line break string onto the underlying OutputStream and indents as specified by the current level and the indent property.
 void flush()
           Flush the data onto the output stream.
protected  int getColumnNumber()
           This method returns the current column number of the output stream. It's useful for subclasses that wish to implement their own pretty printing strategies by inserting white space and line breaks at appropriate points.
 java.lang.String getEncoding()
           Returns the name of the character encoding used by this Serializer.
 int getIndent()
           Returns the number of spaces this serializer indents.
 java.lang.String getLineSeparator()
           Returns the String used as a line separator.
 int getMaxLength()
           Returns the preferred maximum line length.
 boolean getPreserveBaseURI()
           Returns true if this serializer preserves the original base URIs by inserting extra xml:base attributes.
 boolean getUnicodeNormalizationFormC()
           If true, this property indicates serialization will perform Unicode normalization on all data using normalization form C (NFC).
 void setIndent(int indent)
           Sets the number of additional spaces to add to each successive level in the hierarchy.
 void setLineSeparator(java.lang.String lineSeparator)
           Sets the lineSeparator.
 void setMaxLength(int maxLength)
           Sets the suggested maximum line length for this serializer.
 void setOutputStream(java.io.OutputStream out)
           Flushes the previous OutputStream and redirects further output to the new OutputStream.
 void setPreserveBaseURI(boolean preserve)
           Determines whether this Serializer inserts extra xml:base attributes to attempt to preserve base URI information from the document.
 void setUnicodeNormalizationFormC(boolean normalize)
           If true, this property indicates serialization will perform Unicode normalization on all data using normalization form C (NFC).
protected  void write(Attribute attribute)
           This method writes an attribute in the form name="value".
protected  void write(Comment comment)
           Serializes a Comment object onto the output stream using the current options.
protected  void write(DocType doctype)
           Serializes a DocType object onto the output stream using the current options.
 void write(Document doc)
           Serializes a document onto the output stream using the current options.
protected  void write(Element element)
           Serializes an element onto the output stream using the current options.
protected  void write(ProcessingInstruction instruction)
           Serializes a ProcessingInstruction object onto the output stream using the current options.
protected  void write(Text text)
           Serializes a Text object onto the output stream using the current options.
protected  void writeAttributes(Element element)
           This method writes all the attributes of the specified element onto the output stream, one at a time, separated by white space.
protected  void writeAttributeValue(java.lang.String value)
           Writes a string onto the underlying OutputStream.
protected  void writeChild(Node node)
           Serializes a child node onto the output stream using the current options.
protected  void writeEmptyElementTag(Element element)
           This method writes an empty-element tag for the element including all its namespace declarations and attributes.
protected  void writeEndTag(Element element)
           This method writes the end-tag for an element in the form </name>.
protected  void writeEscaped(java.lang.String text)
           Writes a string onto the underlying OutputStream.
protected  void writeNamespaceDeclaration(java.lang.String prefix, java.lang.String uri)
           This writes a namespace declaration in the form xmlns:prefix="uri" or xmlns="uri".
protected  void writeNamespaceDeclarations(Element element)
           This method writes all the namespace declaration attributes of the specified element onto the output stream, one at a time, separated by white space.
protected  void writeRaw(java.lang.String text)
           Writes a string onto the underlying OutputStream.
protected  void writeStartTag(Element element)
           This method writes the start-tag for the element including all its namespace declarations and attributes.
protected  void writeXMLDeclaration()
           This method writes the XML declaration onto the output stream, followed by a line break.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Serializer

public Serializer(java.io.OutputStream out)

Create a new serializer that uses the UTF-8 encoding.

Parameters:
out - the output stream to write the document on
Throws:
java.lang.NullPointerException - if out is null

Serializer

public Serializer(java.io.OutputStream out,
                  java.lang.String encoding)
           throws java.io.UnsupportedEncodingException

Create a new serializer that uses a specified encoding. The encoding must be recognized by the Java virtual machine. Currently the following encodings are recognized by XOM:

More will be added in the future. You can use encodings not in this list as long as the local virtual machine supports them. However, characters may unnecessarily be output as character references. Conversely, not all versions of Java support all of these encodings. If you attempt to use an encoding that the local Java virtual machine does not support, the constructor will throw an UnsupportedEncodingException.

Parameters:
out - the output stream to write the document on
encoding - the character encoding for the serialization
Throws:
java.lang.NullPointerException - if out or encoding is null
java.io.UnsupportedEncodingException - if the VM does not support the requested encoding
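
The check below is a minimal sketch of how a caller might test whether the local virtual machine supports an encoding before passing its name to this constructor. It uses only java.nio.charset.Charset; the class EncodingCheck and its methods are hypothetical names, and it is an assumption that the JDK's notion of a supported charset coincides with the set for which the constructor succeeds.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

final class EncodingCheck {

    // Returns true if the local virtual machine knows the named charset.
    static boolean isSupportedByVm(String encoding) {
        try {
            return Charset.isSupported(encoding);
        } catch (IllegalCharsetNameException ex) {
            // A syntactically illegal name cannot be a supported encoding.
            return false;
        }
    }

    // Falls back to UTF-8 (always available in Java) when the preferred
    // encoding is not supported by this virtual machine.
    static String chooseEncoding(String preferred) {
        return isSupportedByVm(preferred) ? preferred : "UTF-8";
    }
}
```

The name returned by chooseEncoding could then be passed as the encoding argument; catching UnsupportedEncodingException remains necessary in any case.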

Method Detail

setOutputStream

public void setOutputStream(java.io.OutputStream out)
                     throws java.io.IOException

Flushes the previous OutputStream and redirects further output to the new OutputStream.

Parameters:
out - the output stream to write the document on
Throws:
java.lang.NullPointerException - if out is null
java.io.IOException - if the previous OutputStream encounters an I/O error when flushed

write

public void write(Document doc)
           throws java.io.IOException

Serializes a document onto the output stream using the current options.

Parameters:
doc - the Document to serialize
Throws:
java.io.IOException - if the underlying OutputStream encounters an I/O error
java.lang.NullPointerException - if doc is null

writeXMLDeclaration

protected void writeXMLDeclaration()
                            throws java.io.IOException

This method writes the XML declaration onto the output stream, followed by a line break.

Throws:
java.io.IOException - if the underlying OutputStream encounters an I/O error

write

protected void write(Element element)
              throws java.io.IOException

Serializes an element onto the output stream using the current options. The result is guaranteed to be well-formed. If element does not have a parent element, it will also be namespace well-formed.

If the element is empty, this method invokes writeEmptyElementTag. If the element is not empty, then:

  1. It calls writeStartTag.
  2. It passes each of the element's children to write in order.
  3. It calls writeEndTag.

It may break lines or add white space if the serializer has been configured to indent or use a maximum line length.

Parameters:
element - the Element to serialize
Throws:
java.io.IOException - if the underlying OutputStream encounters an I/O error
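
The control flow described above can be sketched without XOM as follows. MiniElement and ElementWriteSketch are hypothetical stand-ins invented for this illustration, not XOM classes; attributes, namespaces, escaping, and indenting are omitted, and the exact text of the empty-element tag is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class ElementWriteSketch {

    // Hypothetical stand-in for an element: a name plus children, where each
    // child is either a String of (already escaped) text or a MiniElement.
    static final class MiniElement {
        final String name;
        final List<Object> children = new ArrayList<>();

        MiniElement(String name) {
            this.name = name;
        }

        MiniElement add(Object child) {
            children.add(child);
            return this;
        }
    }

    private final StringBuilder out = new StringBuilder();

    // Empty elements get an empty-element tag; all others get a start-tag,
    // their children in document order, and then an end-tag.
    void write(MiniElement element) {
        if (element.children.isEmpty()) {
            writeEmptyElementTag(element);
            return;
        }
        writeStartTag(element);
        for (Object child : element.children) {
            if (child instanceof MiniElement) {
                write((MiniElement) child);
            } else {
                out.append(child);
            }
        }
        writeEndTag(element);
    }

    void writeStartTag(MiniElement e) {
        out.append('<').append(e.name).append('>');
    }

    void writeEndTag(MiniElement e) {
        out.append("</").append(e.name).append('>');
    }

    void writeEmptyElementTag(MiniElement e) {
        out.append('<').append(e.name).append("/>");
    }

    String result() {
        return out.toString();
    }
}
```

Because each tag is written by its own method, a subclass of this sketch could change one piece (for instance the empty-element tag) without touching the traversal.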

writeEndTag

protected void writeEndTag(Element element)
                    throws java.io.IOException

This method writes the end-tag for an element in the form </name>.

Parameters:
element - the element whose end-tag is written
Throws:
java.io.IOException - if the underlying OutputStream encounters an I/O error

writeStartTag

protected void writeStartTag(Element element)
                      throws java.io.IOException

This method writes the start-tag for the element including all its namespace declarations and attributes.

The writeAttributes method is called to write all the non-namespace-declaration attributes. The writeNamespaceDeclarations method is called to write all the namespace declaration attributes.

Parameters:
element - the element whose start-tag is written
Throws:
java.io.IOException - if the underlying OutputStream encounters an I/O error

writeEmptyElementTag

protected void writeEmptyElementTag(Element element)
                             throws java.io.IOException

This method writes an empty-element tag for the element including all its namespace declarations and attributes.

The writeAttributes method is called to write all the non-namespace-declaration attributes. The writeNamespaceDeclarations method is called to write all the namespace declaration attributes.

If subclasses don't wish empty-element tags to be used, they can override this method to simply invoke writeStartTag followed by writeEndTag.

Parameters:
element - the element whose empty-element tag is written
Throws:
java.io.IOException - if the underlying OutputStream encounters an I/O error

writeAttributes

protected void writeAttributes(Element element)
                        throws java.io.IOException

This method writes all the attributes of the specified element onto the output stream, one at a time, separated by white space. If preserveBaseURI is true, and it is necessary to add an xml:base attribute to the element in order to preserve the base URI, then that attribute is also written here. Each individual attribute is written by invoking write(Attribute).

Parameters:
element - the Element whose attributes are written
Throws:
java.io.IOException - if the underlying OutputStream encounters an I/O error

writeNamespaceDeclarations

protected void writeNamespaceDeclarations(Element element)
                                   throws java.io.IOException

This method writes all the namespace declaration attributes of the specified element onto the output stream, one at a time, separated by white space. Each individual declaration is written by invoking writeNamespaceDeclaration.

Parameters:
element - the Element whose attributes are written
Throws:
java.io.IOException - if the underlying OutputStream encounters an I/O error

writeNamespaceDeclaration

protected void writeNamespaceDeclaration(java.lang.String prefix,
                                         java.lang.String uri)
                                  throws java.io.IOException

This writes a namespace declaration in the form xmlns:prefix="uri" or xmlns="uri". It does not write the spaces on either side of the namespace declaration. These are written by writeStartTag.

Parameters:
prefix - the namespace prefix; the empty string for the default namespace
uri - the namespace URI
Throws:
java.io.IOException - if the underlying OutputStream encounters an I/O error
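
As an illustration of the two output forms, here is a small sketch that builds the declaration text as a string. NamespaceDeclarationSketch is a hypothetical helper, not part of XOM, and it assumes the URI contains no characters that need escaping in an attribute value.

```java
final class NamespaceDeclarationSketch {

    // The empty prefix stands for the default namespace, as in the
    // parameter description above. No surrounding spaces are produced.
    static String declaration(String prefix, String uri) {
        String name = prefix.isEmpty() ? "xmlns" : "xmlns:" + prefix;
        return name + "=\"" + uri + "\"";
    }
}
```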

write

protected void write(Attribute attribute)
              throws java.io.IOException

This method writes an attribute in the form name="value". Characters in the attribute value are escaped as necessary.

Parameters:
attribute - the Attribute to write
Throws:
java.io.IOException - if the underlying OutputStream encounters an I/O error

write

protected void write(Comment comment)
              throws java.io.IOException

Serializes a Comment object onto the output stream using the current options.

Since character and entity references are not resolved in comments, comments can only be serialized when all characters they contain are available in the current encoding.

Parameters:
comment - the Comment to serialize
Throws:
java.io.IOException - if the underlying OutputStream encounters an I/O error

write

protected void write(ProcessingInstruction instruction)
              throws java.io.IOException

Serializes a ProcessingInstruction object onto the output stream using the current options.

Since character and entity references are not resolved in processing instructions, processing instructions can only be serialized when all characters they contain are available in the current encoding.

Parameters:
instruction - the ProcessingInstruction to serialize
Throws:
java.io.IOException - if the underlying OutputStream encounters an I/O error

write

protected void write(Text text)
              throws java.io.IOException

Serializes a Text object onto the output stream using the current options. Reserved characters such as <, > and " are escaped using the standard entity references such as &lt;, &gt;, and &quot;.

Characters which cannot be encoded in the current character set (for example, Ω in ISO-8859-1) are encoded using character references.

Unsupported character sets encode all non-ASCII characters. Supported character sets currently include:

Non-ASCII characters from other character sets will probably be hexadecimally escaped, even when they don't need to be. More standard character sets will be added in the future. This will not require any changes to the public API.

Parameters:
text - the Text to serialize
Throws:
java.io.IOException - if the underlying OutputStream encounters an I/O error

write

protected void write(DocType doctype)
              throws java.io.IOException

Serializes a DocType object onto the output stream using the current options.

Parameters:
doctype - the document type declaration to serialize
Throws:
java.io.IOException - if the underlying OutputStream encounters an I/O error

writeChild

protected void writeChild(Node node)
                   throws java.io.IOException

Serializes a child node onto the output stream using the current options. It is invoked when walking the tree to serialize the entire document. It is not called, and indeed should not be called, for either the Document node or for attributes.

Parameters:
node - the Node to serialize
Throws:
java.io.IOException - if the underlying OutputStream encounters an I/O error
XMLException - if an Attribute or a Document is passed to this method

writeEscaped

protected final void writeEscaped(java.lang.String text)
                           throws java.io.IOException

Writes a string onto the underlying OutputStream. Non-ASCII characters that are not available in the current character set are hexadecimally escaped. The three reserved characters <, >, and & are escaped using the standard entity references &lt;, &gt;, and &amp;. Double and single quotes are not escaped.

Parameters:
text - the String to serialize
Throws:
java.io.IOException - if the underlying OutputStream encounters an I/O error
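
The escaping rules described above can be approximated with the JDK alone. The sketch below is not XOM's implementation: TextEscapeSketch is a hypothetical class, it asks a java.nio.charset.CharsetEncoder whether a character is available in the target encoding, and the uppercase form of the hexadecimal character reference is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class TextEscapeSketch {

    // Escapes <, > and &; leaves quotes alone; replaces characters the
    // encoding cannot represent with hexadecimal character references.
    static String escapeText(String text, String encoding) {
        CharsetEncoder encoder = Charset.forName(encoding).newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);   // whole code point, not a Java char
            i += Character.charCount(cp);
            if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '>') {
                out.append("&gt;");
            } else if (cp == '&') {
                out.append("&amp;");
            } else if (cp < 0x80 || encoder.canEncode(new String(Character.toChars(cp)))) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#x").append(Integer.toHexString(cp).toUpperCase()).append(';');
            }
        }
        return out.toString();
    }
}
```

For example, with ISO-8859-1 as the encoding, a Greek capital omega (U+03A9) comes out as a character reference, while with UTF-8 it is written unchanged.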

writeAttributeValue

protected final void writeAttributeValue(java.lang.String value)
                                  throws java.io.IOException

Writes a string onto the underlying OutputStream. Non-ASCII characters that are not available in the current character set are escaped using hexadecimal numeric character references. Carriage returns, line feeds, and tabs are also escaped using hexadecimal numeric character references in order to ensure their preservation on a round trip. The four reserved characters <, >, &, and " are escaped using the standard entity references &lt;, &gt;, &amp;, and &quot;. The single quote is not escaped.

Parameters:
value - the String to serialize
Throws:
java.io.IOException - if the underlying OutputStream encounters an I/O error
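
A JDK-only sketch of the attribute value rules follows. AttributeValueEscapeSketch is a hypothetical class rather than XOM code, and the zero-padded, uppercase form of the references for tab, line feed, and carriage return is an assumption; any hexadecimal reference to the same code point would round-trip equally well.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class AttributeValueEscapeSketch {

    // Escapes the four reserved characters, the three white space characters
    // that attribute value normalization would otherwise turn into spaces,
    // and any character the target encoding cannot represent.
    static String escapeAttributeValue(String value, String encoding) {
        CharsetEncoder encoder = Charset.forName(encoding).newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\t': out.append("&#x09;"); break;
                case '\n': out.append("&#x0A;"); break;
                case '\r': out.append("&#x0D;"); break;
                default:
                    if (cp < 0x80 || encoder.canEncode(new String(Character.toChars(cp)))) {
                        out.appendCodePoint(cp);
                    } else {
                        out.append("&#x").append(Integer.toHexString(cp).toUpperCase()).append(';');
                    }
            }
        }
        return out.toString();
    }
}
```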

writeRaw

protected final void writeRaw(java.lang.String text)
                       throws java.io.IOException

Writes a string onto the underlying OutputStream without escaping any characters. Non-ASCII characters that are not available in the current character set cause an IOException.

Parameters:
text - the String to serialize
Throws:
java.io.IOException - if the underlying OutputStream encounters an I/O error or text contains characters not available in the current character set

breakLine

protected final void breakLine()
                        throws java.io.IOException

Writes the current line break string onto the underlying OutputStream and indents as specified by the current level and the indent property.

Throws:
java.io.IOException - if the underlying OutputStream encounters an I/O error

flush

public void flush()
           throws java.io.IOException

Flush the data onto the output stream. It is not enough to flush the output stream. You must flush the serializer object itself because it uses some internal buffering. The serializer will flush the underlying output stream.

Throws:
java.io.IOException - if the underlying OutputStream encounters an I/O error
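
The reason flushing only the output stream is not enough can be shown by analogy with the JDK's own buffering wrappers. The sketch below does not use XOM; FlushSketch is a hypothetical class, and a BufferedWriter stands in for the serializer's internal buffer, whose actual implementation is not described here.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class FlushSketch {

    // Returns how many bytes had reached the underlying stream before and
    // after the buffering wrapper itself was flushed.
    static int[] bytesBeforeAndAfterFlush(String data) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        Writer buffered = new BufferedWriter(
                new OutputStreamWriter(sink, StandardCharsets.UTF_8));
        try {
            buffered.write(data);
            sink.flush();              // flushing the sink alone moves nothing
            int before = sink.size();
            buffered.flush();          // flushing the wrapper pushes data through
            int after = sink.size();
            return new int[] { before, after };
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

With a short input such as "<root/>", nothing reaches the sink until the wrapper is flushed; the same ordering applies to a serializer and the stream it writes to.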

getIndent

public int getIndent()

Returns the number of spaces this serializer indents.

Returns:
the number of spaces this serializer indents each successive level beyond the previous one

setIndent

public void setIndent(int indent)

Sets the number of additional spaces to add to each successive level in the hierarchy. Use 0 for no extra indenting. The maximum indentation is limited to approximately half the maximum line length. The serializer will not indent further than that no matter how many levels deep the hierarchy is.

When this variable is set to a value greater than 0, the serializer does not preserve white space. Spaces, tabs, carriage returns, and line feeds can all be interchanged at the serializer's discretion, and additional white space may be added before and after tags. Carriage returns, line feeds, and tabs will not be escaped with numeric character references.

Inside elements with an xml:space="preserve" attribute, white space is preserved and no indenting takes place, regardless of the setting of the indent property, unless, of course, an xml:space="default" attribute overrides the xml:space="preserve" attribute.

The default value for indent is 0; that is, the default is not to add or subtract any white space from the source document.

Parameters:
indent - the number of spaces to indent each successive level of the hierarchy
Throws:
java.lang.IllegalArgumentException - if indent is less than zero

getLineSeparator

public java.lang.String getLineSeparator()

Returns the String used as a line separator. This is always "\n", "\r", or "\r\n".

Returns:
the line separator

setLineSeparator

public void setLineSeparator(java.lang.String lineSeparator)

Sets the lineSeparator. This can only be one of the three strings "\n", "\r", or "\r\n". All other values are forbidden. If this method is invoked, then line separators in the character data will be changed to this string. Line separators in attribute values will be changed to the hexadecimal numeric character references corresponding to this string.

The default line separator is "\r\n". However, line separators in character data and attribute values are not changed to this string, unless you explicitly call this method.

Parameters:
lineSeparator - The lineSeparator to set
Throws:
java.lang.IllegalArgumentException - if you attempt to use any line separator other than "\n", "\r", or "\r\n".
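
The validation rule and the effect on character data can be sketched with plain string operations. LineSeparatorSketch is a hypothetical class, not XOM code, and it covers only character data; the replacement of line separators in attribute values by character references is not shown.

```java
import java.util.regex.Matcher;

final class LineSeparatorSketch {

    // Accepts only the three permitted separators.
    static String checkLineSeparator(String separator) {
        if ("\n".equals(separator) || "\r".equals(separator) || "\r\n".equals(separator)) {
            return separator;
        }
        throw new IllegalArgumentException("Illegal line separator");
    }

    // Rewrites every line break in the text to the chosen separator.
    // "\r\n" is listed first so that a CRLF pair counts as one break.
    static String normalizeLineBreaks(String text, String separator) {
        checkLineSeparator(separator);
        return text.replaceAll("\r\n|\r|\n", Matcher.quoteReplacement(separator));
    }
}
```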

getMaxLength

public int getMaxLength()

Returns the preferred maximum line length.

Returns:
the maximum line length.

setMaxLength

public void setMaxLength(int maxLength)

Sets the suggested maximum line length for this serializer. Setting this to 0 indicates that no automatic wrapping is to be performed. When a line approaches this length, the serializer begins looking for opportunities to break the line. Generally it will break on any ASCII white space character (tab, carriage return, linefeed, and space). In some circumstances the serializer may not be able to break the line before the maximum length is reached. For instance, if an element name is longer than the maximum line length the only way to correctly serialize it is to exceed the maximum line length. In this case, the serializer will exceed the maximum line length.

The default value for max line length is 0, which is interpreted as no maximum line length. Setting this to a negative value just sets it to 0.

When this variable is set to a value greater than 0, the serializer does not preserve white space. Spaces, tabs, carriage returns, and line feeds can all be interchanged at the serializer's discretion. Carriage returns, line feeds, and tabs will not be escaped with numeric character references.

Inside elements with an xml:space="preserve" attribute, the maximum line length is not enforced, regardless of the setting of this property, unless, of course, an xml:space="default" attribute overrides the xml:space="preserve" attribute.

Parameters:
maxLength - the suggested maximum line length
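
The general idea of a suggested maximum, including the case where a single long token forces a line over the limit, can be illustrated with a simple greedy wrapper. This is a sketch of the concept only: WrapSketch is a hypothetical class, and the serializer's actual strategy for choosing break points is not specified here and may well differ.

```java
final class WrapSketch {

    // Breaks text at white space so that lines stay within maxLength where
    // possible. A value of 0 or less means no wrapping: the text is returned
    // untouched. A token longer than maxLength gets a line of its own and
    // exceeds the limit, since it cannot be split.
    static String wrap(String text, int maxLength, String lineSeparator) {
        if (maxLength <= 0) {
            return text;
        }
        StringBuilder out = new StringBuilder();
        int column = 0;
        for (String word : text.trim().split("\\s+")) {
            if (column > 0 && column + 1 + word.length() > maxLength) {
                out.append(lineSeparator);
                column = 0;
            } else if (column > 0) {
                out.append(' ');
                column++;
            }
            out.append(word);
            column += word.length();
        }
        return out.toString();
    }
}
```

Note that wrapping necessarily rewrites white space: runs of spaces collapse and some spaces become line breaks, which is why a positive maximum line length gives up white space preservation.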

getPreserveBaseURI

public boolean getPreserveBaseURI()

Returns true if this serializer preserves the original base URIs by inserting extra xml:base attributes.

Returns:
true if this Serializer inserts extra xml:base attributes to attempt to preserve base URI information from the document.

setPreserveBaseURI

public void setPreserveBaseURI(boolean preserve)

Determines whether this Serializer inserts extra xml:base attributes to attempt to preserve base URI information from the document. The default is false; do not preserve base URI information. xml:base attributes that are part of the document's infoset are always output. This property only determines whether or not extra xml:base attributes are added.

Parameters:
preserve - true if xml:base attributes should be added as necessary to preserve base URI information

getEncoding

public java.lang.String getEncoding()

Returns the name of the character encoding used by this Serializer.

Returns:
the encoding used for the output document

setUnicodeNormalizationFormC

public void setUnicodeNormalizationFormC(boolean normalize)

If true, this property indicates serialization will perform Unicode normalization on all data using normalization form C (NFC). Performing Unicode normalization may change the document's infoset. The default is false; do not normalize.

The implementation used is IBM's International Components for Unicode for Java (ICU4J) 2.6. This version is based on Unicode 4.0.

This feature has not yet been benchmarked or optimized. It may result in substantially slower code.

If all your data is in the first 256 code points of Unicode (i.e. the ISO-8859-1, Latin-1 character set) then it's already in normalization form C and renormalizing won't change anything.

Parameters:
normalize - true if normalization is performed; false if it isn't
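
What normalization form C does to text can be seen with the JDK's java.text.Normalizer. This is a stand-in for illustration: the serializer is documented above as using ICU4J, not this class, and NfcSketch is a hypothetical helper.

```java
import java.text.Normalizer;

final class NfcSketch {

    // Composes base characters with following combining marks where a
    // precomposed character exists.
    static String toNfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // True if the string is already in normalization form C.
    static boolean isNfc(String s) {
        return Normalizer.isNormalized(s, Normalizer.Form.NFC);
    }
}
```

For instance, the two-character sequence e followed by U+0301 (combining acute accent) becomes the single character U+00E9, so the serialized text is not character-for-character identical to the input. A string made only of Latin-1 characters, such as "caf\u00E9", is already in form C and passes through unchanged.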

getUnicodeNormalizationFormC

public boolean getUnicodeNormalizationFormC()

If true, this property indicates serialization will perform Unicode normalization on all data using normalization form C (NFC). The default is false; do not normalize.

Returns:
true if this serialization performs Unicode normalization; false if it doesn't

getColumnNumber

protected final int getColumnNumber()

This method returns the current column number of the output stream. It's useful for subclasses that wish to implement their own pretty printing strategies by inserting white space and line breaks at appropriate points.

Columns are counted based on Unicode characters, not Java chars. A surrogate pair counts as one character in this context, not two. However, a character followed by a combining character (e.g. e followed by combining accent acute) counts as two characters. This latter choice (treating combining characters like regular characters) is under review, and may change in the future if it's not too big a performance hit.

Returns:
the current column number
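
The counting rule described above matches what String.codePointCount does in the JDK. The sketch below uses that method to show the difference between Unicode characters and Java chars; ColumnCountSketch is a hypothetical helper, not the serializer's own bookkeeping.

```java
final class ColumnCountSketch {

    // Counts Unicode code points: a surrogate pair is one column, while a
    // base character plus a combining mark is two.
    static int columns(String line) {
        return line.codePointCount(0, line.length());
    }
}
```

A character outside the Basic Multilingual Plane such as U+1D11E (written "\uD834\uDD1E" in Java source) therefore counts as one column even though its String length is 2, whereas "e\u0301" counts as two.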


Copyright 2002-2004 Elliotte Rusty Harold
elharo@metalab.unc.edu