RELAX: Schemas Don't Have to be Hard
Elliotte Rusty Harold
Friday, March 18, 2005
elharo@metalab.unc.edu
http://www.cafeconleche.org/
Generically, a document that describes what a correct document may contain
Specifically, a very complex and baroque W3C Recommendation for an XML-document syntax that describes the permissible contents of XML documents
But there are other options!
Created by James Clark and Murata Makoto
Based on TREX and RELAX
Formal theory is hedge automata
ISO standard: ISO/IEC 19757-2:2002(E), Document Schema Definition Languages (DSDL), Part 2: Regular-grammar-based validation, RELAX NG, I
Unusual, non-XML like syntax
No data typing, especially for element content
Limited extensibility
Only marginally compatible with namespaces
Cannot use mixed content and enforce order and number of child elements
Cannot enforce number of child elements without also enforcing order (i.e. no & operator from SGML).
Confuses infoset augmentation with validation
Complex, hard to understand
Not sufficiently extensible:
Checksums
SKUs match database
sum of item prices equals total price
Poorly implemented
Confuses infoset annotation with validation
$ jing greeting.rng greeting.xml
$ jing greeting.rng greeting3.xml
/Users/elharo/Documents/speaking/sd2005west/relaxng/examples/greeting3.xml: 3:6: error: unknown element "P"
Notice how completely decoupled the validation is from the instance document. We can validate any document against any schema. We do not need to specify the schema in the instance document as you must do with DTDs and often need to do with W3C schemas.
Jing (Java): http://www.thaiopensource.com/relaxng/jing.html
libxml2 (C): http://xmlsoft.org/
Sun's Multischema Validator (Java): http://wwws.sun.com/software/xml/developers/multischema/
Tenuto (.NET): http://sourceforge.net/projects/relaxng
zeroOrMore
oneOrMore
optional
Boundary white space does not need to be declared
All of these structures can nest straightforwardly
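In the compact syntax these three elements are written as the suffixes *, + and ?. A minimal sketch of how they nest, where ROLE is a hypothetical element added only for illustration:

```rnc
element SONG {
  element TITLE { text },
  element COMPOSER { text }+,    # oneOrMore
  element PRODUCER { text }*,    # zeroOrMore
  element LENGTH { text }?,      # optional
  # a zeroOrMore group that itself contains an optional element
  (element ARTIST { text }, element ROLE { text }?)*
}
```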
PRODUCER and COMPOSER are really the same type.
Much more powerful than either DTDs or schemas
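A named pattern lets both elements share one definition. A minimal sketch in the compact syntax, assuming a simplified personContent that holds only a NAME element:

```rnc
start = element SONG {
  element TITLE { text },
  element COMPOSER { personContent }+,
  element PRODUCER { personContent }*
}

# one definition, referenced from both COMPOSER and PRODUCER
personContent = element NAME { text }
```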
RELAX NG can enforce order and appearance of elements in mixed content.
Interleave with the text element to validate mixed content
...
<define name="personContent">
  <element name="NAME">
    <interleave>
      <text/>
      <element name="GIVEN">
        <text/>
      </element>
      <element name="FAMILY">
        <text/>
      </element>
    </interleave>
  </element>
</define>
...
personContent =
  element NAME {
    text
    & element GIVEN { text }
    & element FAMILY { text }
  }
choice requires exactly one of a group of specified items to appear
Can be enclosed in optional, oneOrMore, or zeroOrMore
A song must have at least one of ARTIST, COMPOSER, or PRODUCER:
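A minimal sketch of that constraint in the compact syntax, where | is choice and the trailing + is oneOrMore. The TITLE child and the plain text content are assumptions made for illustration:

```rnc
element SONG {
  element TITLE { text },
  # one or more of ARTIST, COMPOSER, or PRODUCER, in any mix
  ( element ARTIST { text }
  | element COMPOSER { text }
  | element PRODUCER { text } )+
}
```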
group defines a model that can be used as a particle in other models
Like a sequence in W3C schemas or parentheses in DTDs
Allow a NAME element to contain either plain text or a GIVEN and a FAMILY but not both:
<element name="NAME">
  <choice>
    <text/>
    <group>
      <interleave>
        <element name="GIVEN">
          <text/>
        </element>
        <element name="FAMILY">
          <text/>
        </element>
      </interleave>
    </group>
  </choice>
</element>
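The same pattern can be written in the compact syntax, where parentheses play the role of group. A sketch of the direct translation:

```rnc
element NAME {
  text
  | (element GIVEN { text } & element FAMILY { text })
}
```

The parentheses are required here because the compact syntax does not allow | and & to be mixed at the same level without explicit grouping.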
attribute element
name attribute specifies the name of the attribute
optional element can make an attribute optional
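A minimal sketch in the compact syntax, where a trailing ? corresponds to wrapping the attribute in optional. The attribute names id and lang are hypothetical:

```rnc
element SONG {
  attribute id { text },      # required
  attribute lang { text }?,   # optional
  element TITLE { text }
}
```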
List of all allowed values from which one must be chosen
A choice containing value elements
The publisher must be one of the oligopoly that controls 90% of U.S. music (Warner-Elektra-Atlantic, Universal Music Group, Sony Music Entertainment, Inc., Capitol Records, Inc., BMG Music)
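In the compact syntax a quoted literal stands for a value element, so the enumeration becomes a choice of literals. A sketch, assuming the publisher name is the content of a PUBLISHER element:

```rnc
element PUBLISHER {
  "Warner-Elektra-Atlantic"
  | "Universal Music Group"
  | "Sony Music Entertainment, Inc."
  | "Capitol Records, Inc."
  | "BMG Music"
}
```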
ns attribute specifies the namespace of the element it matches
Only applies in schema
Applies to that element and its descendants, until overridden
Validates solely based on URI; irrelevant whether instance document uses prefixes or not
Can use prefixed names instead if you prefer
Usual namespace prefix binding attributes
Convenient in rare cases when namespaces shift from one element to the next (two widely intermixed vocabularies like XSLT)
attribute element can have an ns attribute to specify the namespace
attribute element does not inherit ns attribute
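In the compact syntax the ns attribute corresponds to a default namespace declaration, and prefixes are bound with namespace declarations. A minimal sketch, assuming a hypothetical http://www.example.com/song namespace for the song vocabulary:

```rnc
default namespace = "http://www.example.com/song"
namespace xlink = "http://www.w3.org/1999/xlink"

element SONG {                       # in the default namespace
  attribute xlink:href { text }?,    # in the XLink namespace
  attribute id { text }?,            # in no namespace; not inherited
  element TITLE { text }             # in the default namespace
}
```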
Consider this document:
<foo>
<value>45.67</value>
</foo>
What is the type of value?
A longitude or latitude
A decimal monetary type, as in COBOL
A fixed point number
An infinitely precise floating point number such as that represented by the java.math.BigDecimal class
An IEEE 754 double
A Java double
An IEEE 754 float
A VAX Fortran REAL
An imprecisely known decimal number with 4 significant digits that's plus or minus 1 in the last place.
An imprecisely known decimal number with 4 significant digits that's plus or minus 5 in the last place.
Build 67 of version 45 of Microsoft Word
A regular expression matching all strings that begin with the two characters '4' and '5', followed by a single character, followed by the two characters '6' and '7'.
A string of characters a monkey typed on a keyboard
Other interpretations are doubtless possible, and even make sense in particular contexts.
There's no guarantee that the string 45.67 in fact represents any particular type.
RELAX NG defines no data types beyond text
But implementations are free to implement other type libraries
W3C XML Schema Data Types are commonly available
User defined type libraries
data element replaces text element in content models
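A minimal sketch for the foo document shown earlier, choosing decimal as one of the many possible interpretations of 45.67. In the compact syntax the xsd prefix is predeclared for the W3C XML Schema datatype library:

```rnc
element foo {
  # xsd:decimal here takes the place of a plain text pattern
  element value { xsd:decimal }
}
```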
Boolean
String
URIs
Numeric types
Time types
XML types
XML Schema Built-In Numeric Simple Types

Name | Type | Examples |
---|---|---|
float | IEEE 754 32-bit floating point number | -INF, -1E4, -0, 0, 12.78E-2, 12, INF, NaN |
double | IEEE 754 64-bit floating point number | -INF, 1.401E-90, -1E4, -0, 0, 12.78E-2, 12, INF, NaN, 3.4E42 |
decimal | arbitrary precision, decimal numbers (no exponent notation) | -3.1415292, 0, 7.8, 90200.76 |
integer | an arbitrarily large or small integer | -500000000000000000000000, -9223372036854775809, -126789, -1, 0, 1, 5, 23, 42, 126789, 9223372036854775808, 456734987324983264987362495809587095720978 |
nonPositiveInteger | an integer less than or equal to zero | 0, -1, -2, -3, -4, -5, ... |
negativeInteger | an integer strictly less than zero | -1, -2, -3, -4, -5, ... |
long | an eight-byte two's complement integer such as Java's long type | -9223372036854775808, -12678967543233, -1, 9223372036854775807 |
int | an integer that can be represented as a four-byte, two's complement number such as Java's int type | -2147483648, -1, 0, 1, 5, 23, 42, 2147483647 |
short | an integer that can be represented as a two-byte, two's complement number such as Java's short type | -32768, -1, 0, 1, 5, 23, 42, 32767 |
byte | an integer that can be represented as a one-byte, two's complement number such as Java's byte type | -128, -1, 0, 1, 5, 23, 42, 127 |
nonNegativeInteger | an integer greater than or equal to zero | 0, 1, 2, 3, 4, 5, ... |
unsignedLong | an eight-byte unsigned integer | 0, 1, 2, 3, 4, 5, ...18446744073709551614, 18446744073709551615 |
unsignedInt | a four-byte unsigned integer | 0, 1, 2, 3, 4, 5, ...4294967294, 4294967295 |
unsignedShort | a two-byte unsigned integer | 0, 1, 2, 3, 4, 5, ...65534, 65535 |
unsignedByte | a one-byte unsigned integer | 0, 1, 2, 3, 4, 5, ...254, 255 |
positiveInteger | an integer strictly greater than zero | 1, 2, 3, 4, 5, 6, ... |
XML Schema Built-In Time Simple Types

Name | Type | Examples |
---|---|---|
dateTime | a particular moment in Coordinated Universal Time; up to an arbitrarily small fraction of a second | 1999-05-31T13:20:00.000-05:00 |
gYearMonth | a given month in a given year | 2000-10 |
gYear | a given year | 2000 |
gMonthDay | a date in no particular year, or rather in every year | --10-31 |
gDay | a day in no particular month, or rather in every month | ---31 |
duration | a length of time, without fixed endpoints, to an arbitrary fraction of a second | P2000Y10M31DT09H32M7.4312S |
date | a specific day in history | 2000-10-31 |
time | a specific time of day, that recurs every day | 14:30:00.000, 09:30:00.000-05:00 |
XML Schema Built-In XML Simple Types

Name | Type | Examples |
---|---|---|
ID | XML 1.0 ID attribute type | any XML name that's unique among ID type attributes |
IDREF | XML 1.0 IDREF attribute type | any XML name that's used as an ID type attribute elsewhere in the document |
ENTITY | XML 1.0 ENTITY attribute type | any XML name that's declared as an unparsed entity in the DTD |
NOTATION | XML 1.0 NOTATION attribute type | the name of a notation declared in the DTD |
language | Permissible values for xml:lang as defined in XML 1.0 | en-GB, en-US, fr |
IDREFS | XML 1.0 IDREFS attribute type | a white space separated list of IDREF names |
ENTITIES | XML 1.0 ENTITIES attribute type | a white space separated list of ENTITY names |
NMTOKEN | XML 1.0 NMTOKEN attribute type | 12, are, you, ready |
NMTOKENS | XML 1.0 NMTOKENS attribute type | a white space separated list of name tokens |
Name | An XML 1.0 Name | set, title, rdf, math, math123, href |
QName | an optionally prefixed, namespace qualified name | song:title |
NCName | a local name without any colons | title |
XML Schema Built-In Simple Types

Name | Type | Examples |
---|---|---|
string | Parsed Character Data; #PCDATA | Hot Cop |
normalizedString | A string whose normalized value does not contain any tabs, carriage returns, or linefeeds | PIC1, PIC2, PIC3, cow_movie, MonaLisa, Hello World , Warhol, red green |
token | A string whose normalized value has no leading or trailing white space, no tabs, no linefeeds, and not more than one consecutive space | p1 p2, ss123 45 6789, _92, red, green, NT Decl, seventeen, p1, p2, 123 45 6789, ^*&^*&_92, red green blue, NT-Decl, seventeen; Mary had a little lamb, The love of money is the root of all Evil. |
boolean | C++'s bool type | true, false, 1, 0 |
anyURI | relative or absolute URI | http://www.w3.org/TR/2000/WD-xmlschema-2-20000407/#duration, /javafaq/reports/JCE1.2.1.html |
hexBinary | Arbitrary binary data encoded in hexadecimal form | A4E345EC54CC8D52198000FFEA6C |
base64Binary | Arbitrary binary data encoded in Base64 | 6jKpNnmkkWeArsn5Oeeg2njcz+nXdk0f9kZI892ddlR8Lg1aMhPeFTYuoq3I6neFlb BjWzuktNZKiXYBfKsSTB8U09dTiJo2ir3HJuY7eW/p89osKMfixPQsp9vQMgzph6Qa lY7j4MB7y5ROJYsTr1/fFwmj/yhkHwpbpzed1LE= |
param children of data specify the constraints on the type to create a subset of the normally accepted values
For example, this data element restricts a date to be any year from 1877 (the year Edison invented the phonograph) on:
<element name="YEAR">
  <data type="gYear">
    <param name="minInclusive">1877</param>
  </data>
</element>
In the compact syntax:
element YEAR {xsd:gYear {minInclusive = "1877"}}
Facets include:
length
minLength
maxLength
pattern
maxInclusive
maxExclusive
minInclusive
minExclusive
totalDigits
fractionDigits
Not all facets apply to all types.
Facets not allowed/applicable in RELAX NG:
enumeration
whiteSpace
The number of units allowed in a value
For strings (string, normalizedString, token, QName, NCName, ID, IDREF, language, anyURI, ENTITY, NOTATION, and NMTOKEN) the units are characters
For lists (IDREFS, ENTITIES, and NMTOKENS) the units are tokens
For binary types (hexBinary, base64Binary) the units are bytes after decoding
Must be a non-negative integer
For example, to say that all names and titles must contain between 1 and 255 characters:
In compact syntax:
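A minimal sketch of such a constraint, assuming a hypothetical named pattern nameOrTitle that TITLE and ARTIST share. Facets are written as name = "value" pairs separated by white space:

```rnc
# 1 to 255 characters, counted after any white space handling for the type
nameOrTitle = xsd:string { minLength = "1" maxLength = "255" }

start = element SONG {
  element TITLE { nameOrTitle },
  element ARTIST { nameOrTitle }+
}
```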
Determines the minimum and maximum allowed values
Applies to ordered simple types including byte, unsignedByte, integer, positiveInteger, negativeInteger, nonNegativeInteger, nonPositiveInteger, int, unsignedInt, long, unsignedLong, short, unsignedShort, decimal, float, double, time, dateTime, duration, date, gMonth, gYear, gDay, and gMonthDay.
For example, to say that the year must be between 1877 and 2100:
In the compact syntax:
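A minimal sketch that extends the earlier minInclusive example with an upper bound:

```rnc
element YEAR {
  xsd:gYear { minInclusive = "1877" maxInclusive = "2100" }
}
```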
totalDigits facet specifies the maximum number of decimal digits in a number as a positive integer
fractionDigits facet specifies the maximum number of decimal digits to the right of the decimal point as a non-negative integer
Applies to all types derived from decimal including byte, unsignedByte, integer, positiveInteger, negativeInteger, nonNegativeInteger, nonPositiveInteger, int, unsignedInt, long, unsignedLong, short, and unsignedShort.
Does not apply to float and double
You can specify at most two fractional digits or at most seven decimal digits, but not at least two fractional digits or exactly seven decimal digits
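A minimal sketch combining both facets, assuming a PRICE element holding a plain decimal number with no currency symbol:

```rnc
element PRICE {
  # at most seven digits in total, at most two after the decimal point
  xsd:decimal { totalDigits = "7" fractionDigits = "2" }
}
```

Under these facets 1.35, 11000 and 12345.67 should be accepted, while 1.355 (three fractional digits) and 123456789 (nine digits) should be rejected.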
Suppose you want a money type to specify that the PRICE element content must look like $1.35 or ¥11000
Use the pattern facet to specify a regular expression instances must match
More or less Perl-like including the Unicode extensions introduced in Perl 5.6
The money regular expression:
\p{Sc}\p{Nd}+(\.\p{Nd}\p{Nd})?
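\p{Sc} matches any Unicode currency symbol
\p{Nd} matches any Unicode decimal digit
\p{Nd}+ matches one or more decimal digits
\. matches a literal period
(\.\p{Nd}\p{Nd}) matches a period followed by exactly two decimal digits
(\.\p{Nd}\p{Nd})? makes that fractional part optional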
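A minimal sketch of the money regular expression applied as a pattern facet in the compact syntax. The choice of xsd:string as the base type is an assumption:

```rnc
element PRICE {
  # the whole content must match; the pattern is implicitly anchored
  xsd:string { pattern = "\p{Sc}\p{Nd}+(\.\p{Nd}\p{Nd})?" }
}
```

With this pattern $1.35 and ¥11000 should match, while 1.35 (no currency symbol) and $1.3 (one digit after the period) should not.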
Matches a list of words separated by white space
Each word has a type
Can have multiple types
Allow multiple years in the YEAR element:
<element name="YEAR">
  <list>
    <oneOrMore>
      <data type="gYear"/>
    </oneOrMore>
  </list>
</element>
element SONG {
  element TITLE { text },
  element COMPOSER { xsd:string }+,
  element PRODUCER { xsd:string }*,
  element PUBLISHER { xsd:string }?,
  element LENGTH { xsd:string }?,
  element YEAR {
    list { xsd:gYear+ }
  }?,
  element ARTIST { xsd:string }+
}
Available in Java (1.2 or later) and .NET
Implement the org.relaxng.datatype.DatatypeLibraryFactory interface
Put the datatype library in a JAR archive
Include a META-INF/services/org.relaxng.datatype.DatatypeLibraryFactory file that contains the name of the class that implements the org.relaxng.datatype.DatatypeLibraryFactory interface
Add the .jar file to the CLASSPATH
Add the MSV/Jing .jar file to the CLASSPATH (Runnable archives don't work)
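On the schema side, a custom library is selected by its URI. A minimal sketch in the compact syntax, assuming a hypothetical library URI http://www.example.com/datatypes/inventory that defines a hypothetical sku type:

```rnc
# binds the prefix inv to the (hypothetical) custom datatype library
datatypes inv = "http://www.example.com/datatypes/inventory"

element ITEM {
  element SKU { inv:sku },         # checked by the custom library
  element PRICE { xsd:decimal }    # checked by the W3C XML Schema library
}
```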
Breaks schema into multiple parts.
Matches any pattern defined in an external schema
href attribute points to the external schema
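In the compact syntax an external reference is written with the external keyword followed by the URI. A minimal sketch, assuming a hypothetical file person.rnc that holds the content model for a person:

```rnc
element SONG {
  element TITLE { text },
  # each use matches whatever pattern person.rnc defines
  element COMPOSER { external "person.rnc" }+,
  element PRODUCER { external "person.rnc" }*
}
```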
All elements/attributes from namespaces other than http://relaxng.org/ns/structure/1.0 are ignored
Annotate with XHTML, Schematron or any other namespaced vocabulary
Use RELAX NG div element to group multiple elements in a grammar together with a single annotation
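A minimal sketch of a div carrying one annotation for two definitions, written in the compact syntax, where annotations go in square brackets before the component they apply to. The XHTML paragraph text is invented for illustration:

```rnc
namespace xhtml = "http://www.w3.org/1999/xhtml"

start = element SONG { title, composer+, producer* }
title = element TITLE { text }

# one annotation covering both definitions inside the div
[ xhtml:p [ "People credited on the song" ] ]
div {
  composer = element COMPOSER { text }
  producer = element PRODUCER { text }
}
```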
Not bundled; must install third party library
Wild cards with anyName and nsName
Exclusions with except
Redefining external content models when importing
Attribute or child element
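A minimal sketch of a wild card with an exclusion and of a choice between an attribute and a child element, in the compact syntax, where * is anyName and - is except. The NOTES element and the anyExceptSong pattern name are hypothetical:

```rnc
start = element SONG {
  # the title may be given either as an attribute or as a child element
  (attribute TITLE { text } | element TITLE { text }),
  element NOTES { anyExceptSong* }?
}

# any element other than SONG, with any attributes and any content
anyExceptSong =
  element * - (SONG) {
    (attribute * { text } | text | anyExceptSong)*
  }
```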
Cannot declare entities
Parent models
Extra-document validation
ID-IDREF/key-keyref*
PSVI
* Can be added
and many more...
Converts between:
RELAX NG XML syntax
RELAX NG compact syntax
DTDs
W3C XML Schema Language
$ trang http://www.w3.org/Graphics/SVG/1.2/rng/Full-1.2/Full-1.2.rng full.xsd
http://www.w3.org/Graphics/SVG/1.2/rng/Tiny-1.2/tiny-structure.rng:401:13: warning: choice between attributes and children cannot be represented; approximating
http://www.w3.org/Graphics/SVG/1.2/rng/Tiny-1.2/tiny-structure.rng:420:13: warning: choice between attributes and children cannot be represented; approximating
http://www.w3.org/Graphics/SVG/1.2/rng/Tiny-1.2/tiny-structure.rng:438:13: warning: choice between attributes and children cannot be represented; approximating
http://www.w3.org/Graphics/SVG/1.2/rng/Full-1.2/structure.rng:24:13: warning: choice between attributes and children cannot be represented; approximating
http://www.w3.org/Graphics/SVG/1.2/rng/Tiny-1.2/script.rng:38:13: warning: choice between attributes and children cannot be represented; approximating
http://www.w3.org/Graphics/SVG/1.2/rng/Tiny-1.2/tiny-flow.rng:22:15: warning: cannot represent an optional group of attributes; approximating
http://www.w3.org/Graphics/SVG/1.2/rng/Tiny-1.2/handler.rng:44:13: warning: choice between attributes and children cannot be represented; approximating
http://www.w3.org/Graphics/SVG/1.2/rng/Full-1.2/style.rng:81:13: warning: choice between attributes and children cannot be represented; approximating
RELAX NG
Eric van der Vlist
O'Reilly & Associates, 2004
ISBN 0-596-00421-4
This presentation: http://www.cafeconleche.org/slides/sd2005west/relaxng/
RELAX NG Tutorial: http://www.oasis-open.org/committees/relax-ng/tutorial.html