The document is encoded in UTF-8, and the text inside the root element consists of two non-ASCII characters (U+10000 and U+10FFFD), each of which expands to a UTF-16 surrogate pair.
<!DOCTYPE doc [ <!ELEMENT doc (#PCDATA)> ]> <doc>𐀀</doc>
Expected result | Actual result for org.apache.crimson.parser.XMLReaderImpl |
---|---|
<?xml version="1.0" encoding="UTF-8"?> <ConformanceResults> <startDocument/> <startElement> <namespaceURI/> <localName>doc</localName> <qualifiedName>doc</qualifiedName> <attributes/> </startElement> <char>\uD800</char> <char>\uDC00</char> <char>\uDBFF</char> <char>\uDFFD</char> <endElement> <namespaceURI/> <localName>doc</localName> <qualifiedName>doc</qualifiedName> </endElement> <endDocument/> </ConformanceResults> | <?xml version="1.0" encoding="UTF-8"?> <ConformanceResults> <startDocument/> <startElement> <namespaceURI/> <localName>doc</localName> <qualifiedName>doc</qualifiedName> <attributes/> </startElement> <char>\uD800</char> <char>\uDC00</char> <char>\uDBFF</char> <char>\uDFFD</char> <endElement> <namespaceURI/> <localName>doc</localName> <qualifiedName>doc</qualifiedName> </endElement> <endDocument/> </ConformanceResults> |
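
The `\uD800`, `\uDC00`, `\uDBFF`, and `\uDFFD` entries in the results above are the UTF-16 code units that a Java `char[]` would hold for the two supplementary characters. As a minimal sketch of that expansion (the class `SurrogateDemo` and its `escape` helper are hypothetical names used only for illustration, not part of the test suite or of Crimson), the standard `Character.toChars` method can be used to reproduce the values:

```java
final class SurrogateDemo {
    // Hypothetical helper: renders every UTF-16 code unit of a code point
    // as a backslash-u escape with four uppercase hex digits.
    public static String escape(int codePoint) {
        StringBuilder sb = new StringBuilder();
        // Character.toChars returns one char for BMP code points and a
        // high/low surrogate pair for supplementary code points.
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two code points implied by the expected result table.
        System.out.println(escape(0x10000));
        System.out.println(escape(0x10FFFD));
    }
}
```

Under these assumptions, the first call yields the pair D800/DC00 and the second yields DBFF/DFFD, matching the four `<char>` elements reported in both columns.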