XML Schema types and equivalence classes: reconstructing DTD best practice
Eve L. Maler and Jeanne El Andaloussi in their book "Developing SGML
DTDs" describe a flexible and powerful methodology for DTD design and development
which is widely used in a range of application environments, and is generally
recognised as constituting 'best practice' in this area. It makes heavy use
of parameter entities to define and exploit a class hierarchy of element types.
XML Schema is a W3C-sponsored effort to define an alternative to DTDs
for defining the structure of XML documents, using XML instance syntax. Not
surprisingly therefore, it defines element types for declaring elements and
attributes.
Despite an official requirement to at least reproduce the functionality
of DTDs, XML Schema nonetheless contains no text macro facility, which might
be expected to reproduce the functionality of parameter entities. How then
is 'best practice' to be carried forward from DTDs to XML Schemas?
The answer lies in two powerful mechanisms which XML Schema introduces:
user-defined types (distinct from element types, but crucially involved in
their declaration) and element equivalence classes. This paper describes in
detail XML Schema's concepts of complex type, type definition by derivation
and element equivalence class, shows how they relate to one another, and illustrates
their use to define type hierarchies and element class hierarchies without
recourse to parameter entities.
Introduction
Eve L. Maler and Jeanne El Andaloussi in their book "Developing SGML
DTDs"
[MA 1996] describe a flexible and powerful methodology
for DTD design and development which is widely used in a range of application
environments, and is generally recognised as constituting 'best practice'
in this area. It makes heavy use of parameter entities to define and exploit
a class hierarchy of element types. Our understanding of formal language design
has progressed since SGML was born, and text substitution macros, which is
what parameter entities are, have come to be recognised as a less-than-ideal
mechanism for enabling re-use and sharing of structure in formal definitions.
Accordingly, in the design of the XML Schema document type definition language
[TH 2000], no text-substitution macro mechanism is supplied, but
rather explicit provision is made for a hierarchically-structured approach
to the definition of document types and their component parts. This paper
explores the utility of these mechanisms in reconstructing best practice in
structured DTD design without recourse to text substitution.
Document structure definition in XML Schema
The XML Schema language distinguishes between element declarations and
type definitions. A type definition is a collection of constraints on the
names and forms of attributes and children which an element may have, for
example:
<xs:complexType name="meeting">
<xs:all>
<xs:element ref="venue"/>
<xs:element ref="organiser"/>
<xs:element ref="participants"/>
</xs:all>
<xs:attribute name="when" type="xs:date" use="required"/>
</xs:complexType>
Note:
The above example uses the xs prefix for names taken
from XML Schema, as do all the subsequent XML Schema examples, without providing
a namespace declaration. In practice, of course, any prefix could be used
(or none), given the correct namespace declaration, which as of this writing
should use http://www.w3.org/1999/XMLSchema as the namespace
URI.
An element declaration associates a tag with a type definition, thereby
requiring elements in instances with that tag to conform to the definition,
for example:
<xs:element name="appointment" type="meeting"/>
Taken together the above pair of definition and declaration is similar
in effect to the following SGML declarations:
<!ELEMENT appointment (venue & organiser & participants)>
<!ATTLIST appointment when CDATA #REQUIRED>
XML Schema differs from SGML or XML DTDs in separating type definitions
from element declarations, thereby providing a mechanism for re-use which
in DTDs would have necessarily involved parameter entities.
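The separation of type definitions from element declarations can be modelled outside any schema processor. The following is a minimal Python sketch, not an XML Schema implementation: the TYPES and ELEMENTS tables, the is_valid function, the second declaration "briefing" and the sample instance are all hypothetical. The check covers only what the meeting example states, namely an all group whose children each occur exactly once in any order and a required when attribute, which is tested here only as a plain YYYY-MM-DD date.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Type definitions: constraints on children and attributes, independent of any tag.
TYPES = {
    "meeting": {
        "all": {"venue", "organiser", "participants"},  # each exactly once, any order
        "required_attributes": {"when"},
    },
}

# Element declarations: a tag associated with a named type definition.
# "briefing" is an invented second declaration re-using the same type.
ELEMENTS = {"appointment": "meeting", "briefing": "meeting"}


def is_valid(xml_text: str) -> bool:
    """Check one element against the type definition its declaration names."""
    elem = ET.fromstring(xml_text)
    if elem.tag not in ELEMENTS:
        return False
    type_def = TYPES[ELEMENTS[elem.tag]]
    children = [child.tag for child in elem]
    # An 'all' group: every listed child exactly once, order irrelevant.
    if sorted(children) != sorted(type_def["all"]):
        return False
    if not type_def["required_attributes"] <= set(elem.attrib):
        return False
    try:
        date.fromisoformat(elem.attrib["when"])  # simplification of xs:date
    except ValueError:
        return False
    return True


SAMPLE = (
    '<appointment when="2000-06-12">'
    "<organiser/><venue/><participants/>"
    "</appointment>"
)
print(is_valid(SAMPLE))
print(is_valid(SAMPLE.replace("appointment", "briefing")))
print(is_valid("<appointment><venue/><organiser/><participants/></appointment>"))
```

The first two instances are accepted, because both tags are declared with the same type definition. The third is rejected, because the required when attribute is missing.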
Structuring the type space: type definition derivation in XML Schema
We observe that many uses of parameter entities in well-structured DTDs
imply some sort of family relationship between element types. Whenever a parameter
entity is used, for instance, to add one or more attribute declarations to
an ATTLIST, every instance of an element type with such an ATTLIST
is a member of a larger set, that is, the set of all elements which have,
or may have, the relevant attributes.
XML Schema makes explicit provision for defining types which specifically
contain exactly what such families have in common, and then allowing other
type definitions to be derived from the shared core. Consider the following
fragment from the XHTML strict
[XHTML 2000] DTD:
<!ENTITY % cellhalign
"align (left|center|right|justify|char) #IMPLIED
char %Character; #IMPLIED
charoff %Length; #IMPLIED"
>
This defines the cellhalign parameter entity with text
for three attribute declarations. Virtually all the table-related element
types then use this parameter entity, for example:
<!ATTLIST thead
%attrs;
%cellhalign;
%cellvalign;
>
In fact all the element types in XHTML which allow the cellhalign
attributes allow the others referenced here as well, so in an XML Schema schema
for XHTML, we would probably want to provide an abstract type definition with
all these attributes, for example:
<xs:complexType name="tabular" content="empty">
<xs:attribute name="align" type="halignPos"/>
<xs:attribute name="char" type="Character"/>
. . .
</xs:complexType>
Then all the relevant elements would have type definitions derived from
this one, for example:
<xs:complexType name="tableBlock" content="elementOnly"
base="tabular" derivedBy="extension">
<xs:element ref="tr" minOccurs="1" maxOccurs="unbounded"/>
</xs:complexType>
<xs:element name="thead" type="tableBlock"/>
<xs:element name="tbody" type="tableBlock"/>
<xs:element name="tfoot" type="tableBlock"/>
Several new bits of XML Schema syntax are introduced above. The type
definition for tableBlock identifies itself as derived from that
for tabular using the base attribute, and furthermore
specifies that its relation to that definition is one of extension.
What this means is that the attributes allowed and the content model enforced
by the derived definition are the union and concatenation respectively of those
specified explicitly and those 'inherited' from the base definition.
In the example above, this has the desired effect, in that thead, tbody
and tfoot are all declared with reference to a type definition
which by the definition of derivation by extension allows all the tabular
attributes as well as having an appropriate list-of-rows content model.
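The effect of derivation by extension can be sketched as follows. This is an illustrative Python model only: the ComplexType class and the extend function are hypothetical, the "tr+" string is just a DTD-style shorthand for one or more tr elements, and only the two attributes spelled out in the text are listed.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ComplexType:
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)  # attribute name -> type name
    content: List[str] = field(default_factory=list)          # content model particles, in order


def extend(base: ComplexType, name: str,
           attributes: Dict[str, str], content: List[str]) -> ComplexType:
    """Derive by extension: union the attributes, append to the content model."""
    return ComplexType(
        name=name,
        attributes={**base.attributes, **attributes},
        content=base.content + content,
    )


# Only the attributes shown in the text are modelled; the others are elided there too.
tabular = ComplexType("tabular", {"align": "halignPos", "char": "Character"})
table_block = extend(tabular, "tableBlock", {}, ["tr+"])

# thead, tbody and tfoot all share the one derived definition.
elements = {tag: table_block for tag in ("thead", "tbody", "tfoot")}
print(sorted(elements["thead"].attributes), elements["thead"].content)
```

Because the three element declarations point at the same object, a change to tableBlock or to tabular reaches all of them at once.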
Just as important as the syntactic economy and transparency of this approach to re-use is
a parallel gain in semantic transparency:
It is now manifest that the similarity of the content models and attribute
inventory of the three elements is not accidental. The most straightforward
approach to changing one of them would change them all, which is probably
what is wanted. Applications which find it appropriate can treat them all
similarly, by dealing with them at the level of type definition. To facilitate
this, XML Schema-compliant processors must record the identity of the type
definition used in schema-validating every element and attribute in instance
documents.
Type definition derivation details: extension
A complex type definition (one constraining an element's content and
attribute inventory) may be derived by extension from another type definition,
called the base type definition. If the base is a simple
type definition (constraining text content), then the only allowed extension
is to add attribute declarations. If the base is itself a complex type definition,
then not only may attribute declarations be added, but also the base's content
model may be extended, if it allows this.
It follows that an important relationship holds between the members
of a type defined by extension (that is, the element instances which satisfy
its definition): every member of a type defined by extension contains within
it a member of its base, where in version 1 of XML Schema we understand 'within'
to mean 'subset' for attributes and 'prefix' for content model.
Consider the following two type definitions:
<xs:complexType name='name'>
<xs:element name='title'
minOccurs='0'/>
<xs:element name='forename'
minOccurs='0'
maxOccurs='unbounded'/>
<xs:element name='surname'/>
</xs:complexType>
<xs:complexType name='fullName'
base='name'
derivedBy='extension'>
<xs:element name='suffix'
minOccurs='0'/>
</xs:complexType>
Now consider members of the two types defined above:
<...>
<forename>George</forename>
<forename>W</forename>
<surname>Bush</surname>
</...>
<...>
<forename>Albert</forename>
<surname>Gore</surname>
<suffix>Jr.</suffix>
</...>
The second, a member of the derived type, contains as a prefix a member
of the base type.
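The prefix relationship can be checked mechanically. The sketch below is a hypothetical Python model of the two content models, with each particle written as (name, minOccurs, maxOccurs). It matches greedily, which is adequate here only because neighbouring particles have different names; a general content-model matcher would need more than this.

```python
from typing import List, Optional, Tuple

# (element name, minOccurs, maxOccurs); None stands for "unbounded".
Particle = Tuple[str, int, Optional[int]]

NAME: List[Particle] = [("title", 0, 1), ("forename", 0, None), ("surname", 1, 1)]
# Derivation by extension appends the new particle to the base content model.
FULL_NAME: List[Particle] = NAME + [("suffix", 0, 1)]


def consume(particles: List[Particle], children: List[str]) -> Optional[int]:
    """Match a sequence content model greedily from the left.

    Returns how many children were consumed, or None if a required
    particle could not be matched.
    """
    pos = 0
    for name, lo, hi in particles:
        count = 0
        while (pos < len(children) and children[pos] == name
               and (hi is None or count < hi)):
            pos += 1
            count += 1
        if count < lo:
            return None
    return pos


def is_member(particles: List[Particle], children: List[str]) -> bool:
    """A member is a child sequence the content model consumes completely."""
    return consume(particles, children) == len(children)


first = ["forename", "forename", "surname"]
second = ["forename", "surname", "suffix"]

print(is_member(NAME, first), is_member(FULL_NAME, second))
prefix_length = consume(NAME, second)
print(prefix_length, is_member(NAME, second[:prefix_length]))
```

The second instance is not itself a member of the base type, but the children consumed by the base content model (the forename and surname) form a member of it, with the suffix left over as the extension.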
Type definition derivation details: restriction
A complex type definition (one constraining an element's content and
attribute inventory) may also be derived by
restriction
from its base type definition, which must be a complex type definition. Restriction
amounts to closing down flexibility allowed in the base definition:
- Eliminating optional attributes;
- Removing members of choice groups;
- Reducing allowed occurrence ranges on content model particles (perhaps
all the way to elimination, if minOccurs is 0 in
the base);
- Restricting the type definitions of attributes or content.
A different important relationship holds between the members of a type
defined by restriction: every member of a type defined by restriction is
necessarily also a member of its base.
Simple types may also be defined by restricting other simple type definitions,
for instance by reducing the membership of an enumerated type or narrowing
a value range.
Consider the following three type definitions, a simplified version
of definitions from the schema for schemas:
<xs:complexType name="group">
<xs:element ref="particle" minOccurs="0" maxOccurs="unbounded"/>
<xs:attribute name="name" use="optional" type="xs:NCName"/>
<xs:attribute name="ref" use="optional" type="xs:QName"/>
</xs:complexType>
<xs:complexType name="topLevelGroup" base="group" derivedBy="restriction">
<xs:element ref="particle" minOccurs="1" maxOccurs="unbounded"/>
<xs:attribute name="name" use="required" type="xs:NCName"/>
<xs:attribute name="ref" use="prohibited" type="xs:QName"/>
</xs:complexType>
<xs:complexType name="refToGroup" base="group" derivedBy="restriction">
<xs:element ref="particle" minOccurs="0" maxOccurs="0"/>
<xs:attribute name="name" use="prohibited" type="xs:NCName"/>
<xs:attribute name="ref" use="required" type="xs:QName"/>
</xs:complexType>
The first definition defines a group as having optional name
and ref attributes and any number of <particle> elements
as content. The second restricts this for use at the top level, to define
a group, in which case the name and at least one <particle>
are required, while the ref is prohibited. For use within content
models, the third restricts in the other direction, requiring a ref
(to a top-level defined group, by name and namespace), and forbidding either name
or content. It should be clear that any member of either of the two derived
types is a member of the more general base.
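The subset relationship can be illustrated with a small model of the three definitions. This Python sketch is hypothetical: it reduces an instance to the set of attributes present and a count of <particle> children, and the GroupType class and is_member function are invented for the purpose.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class GroupType:
    attribute_use: Dict[str, str]   # "optional", "required" or "prohibited"
    min_particles: int
    max_particles: Optional[int]    # None stands for "unbounded"


GROUP = GroupType({"name": "optional", "ref": "optional"}, 0, None)
TOP_LEVEL_GROUP = GroupType({"name": "required", "ref": "prohibited"}, 1, None)
REF_TO_GROUP = GroupType({"name": "prohibited", "ref": "required"}, 0, 0)


def is_member(type_def: GroupType, attrs: Set[str], particles: int) -> bool:
    """Check an instance, reduced to its attribute names and particle count."""
    if any(a not in type_def.attribute_use for a in attrs):
        return False  # undeclared attribute
    for name, use in type_def.attribute_use.items():
        if use == "required" and name not in attrs:
            return False
        if use == "prohibited" and name in attrs:
            return False
    if particles < type_def.min_particles:
        return False
    if type_def.max_particles is not None and particles > type_def.max_particles:
        return False
    return True


definition = ({"name"}, 2)   # a named top-level group with two particles
reference = ({"ref"}, 0)     # an empty reference to a group

for label, (attrs, count) in (("definition", definition), ("reference", reference)):
    print(label,
          is_member(GROUP, attrs, count),
          is_member(TOP_LEVEL_GROUP, attrs, count),
          is_member(REF_TO_GROUP, attrs, count))
```

Each sample instance satisfies exactly one of the two restricted types, and both satisfy the base, which is the relationship that restriction guarantees.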
Element equivalence classes
The mechanism of type definition derivation described above allows XML
Schema authors to reconstruct usages of parameter entities which reflect commonality
of structure:
- Elements with the same structure can be declared using the same
type definition;
- Elements with similar structure can be declared either with one using
a type definition derived from that of the other, or with both
using definitions derived from a common base.
But elements may also be related because they appear in the same
context. The following (slightly simplified) extracts from the XML specification
DTD
[MA 1999] illustrate how this is annotated and exploited
in the Maler and El Andaloussi style:
<!ENTITY % local.list.class "">
<!ENTITY % list.class "ulist|olist|slist|glist
%local.list.class;">
. . .
<!ELEMENT div1 (head, (. . .|%list.class;|. . .)*, div2*)>
References to %list.class; appear elsewhere in other content
models in the DTD as well, and no member of the class appears anywhere else
on its own. XML Schema provides for reflecting this kind of element commonality
using the notion of (asymmetric) equivalence class: any top-level element
declaration can nominate another top-level declaration as one it is equivalent
to. The set of all declarations which identify (perhaps via several steps)
another declaration as their equivalence class (using the equivClass
attribute) forms that declaration's equivalence class. Wherever the nominated
declaration is referenced in content models, instances may contain not only it, but also any
member of its equivalence class. One possible XML Schema reconstruction of
the above example would look like this:
<xs:element name="div1">
<xs:complexType>
<xs:element ref="head"/>
<xs:choice minOccurs="0" maxOccurs="unbounded">
. . .
<xs:element ref="list"/>
. . .
</xs:choice>
<xs:element ref="div2" minOccurs="0" maxOccurs="unbounded"/>
</xs:complexType>
</xs:element>
<xs:element name="list" abstract="true" type="listType"/>
<xs:element name="slist" equivClass="list" type="simpleListType"/>
<xs:element name="flaggedList" abstract="true" equivClass="list"/>
<xs:element name="ulist" equivClass="flaggedList" type="bulletedListType"/>
<xs:element name="olist" equivClass="flaggedList" type="enumeratedListType"/>
<xs:element name="glist" equivClass="list" type="glossaryListType"/>
Via one or two steps, all four of glist, olist, slist
and ulist are declared as part of list's equivalence
class, so all of them may occur within <div1> in the indicated
place. list itself and flaggedList are declared
as abstract, meaning they can't themselves actually appear in
documents -- they are included in the schema simply to provide a potentially
useful layer of structuring.
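The "one or two steps" can be made explicit by following equivClass links transitively. The Python sketch below is a hypothetical model of the six declarations above, recording only the equivClass target and the abstract flag; it assumes the links contain no cycles.

```python
from typing import Dict, Optional, Set, Tuple

# element name -> (equivClass target, abstract?)
DECLARATIONS: Dict[str, Tuple[Optional[str], bool]] = {
    "list": (None, True),
    "slist": ("list", False),
    "flaggedList": ("list", True),
    "ulist": ("flaggedList", False),
    "olist": ("flaggedList", False),
    "glist": ("list", False),
}


def equivalence_class(head: str) -> Set[str]:
    """All declarations that reach `head` by following equivClass links."""
    members = set()
    for name, (target, _) in DECLARATIONS.items():
        while target is not None:
            if target == head:
                members.add(name)
                break
            target = DECLARATIONS[target][0]  # take one more step up the chain
    return members


def allowed_in_place_of(head: str) -> Set[str]:
    """Element names that may actually appear where `head` is referenced."""
    candidates = equivalence_class(head) | {head}
    return {name for name in candidates if not DECLARATIONS[name][1]}


print(sorted(equivalence_class("list")))
print(sorted(allowed_in_place_of("list")))
print(sorted(allowed_in_place_of("flaggedList")))
```

The class of list contains all five other declarations, but only the four concrete ones may appear in instances; a reference to flaggedList would admit only olist and ulist.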
Two further aspects of this design deserve mention. First, no special provision
for subsequent extension of the membership of the classes involved is required
(this is what the local.list.class entity is for in the original
DTD). Another XML Schema document which includes one with the above definitions
by reference (by one of several mechanisms provided for modular and/or multi-namespace
schema specification, see [TH 2000]) can add its own elements
to one or the other class simply by referring to them via the equivClass
attribute in its own declarations. Second, in order to enforce a degree of coherency,
XML Schema requires that the type definition of any element declared as equivalent
to another be derived from the type definition of that other element. In the above example,
this means that, for instance, enumeratedListType would have
to be derived from listType (the type definition of flaggedList,
by default).
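Both points can be illustrated with a small consistency check. This Python sketch is hypothetical throughout: the text states only that enumeratedListType must derive from listType, so the other bases in TYPE_BASE are assumptions, and the elements "checklist" and "para" and the type "paraType" are invented to stand for declarations added by another schema document. A type is treated here as trivially derived from itself.

```python
from typing import Dict, Optional, Tuple

# type name -> base type name (assumed, except for enumeratedListType)
TYPE_BASE: Dict[str, Optional[str]] = {
    "listType": None,
    "simpleListType": "listType",
    "bulletedListType": "listType",
    "enumeratedListType": "listType",
    "glossaryListType": "listType",
    "paraType": None,
}

# element name -> (equivClass target, declared type or None)
ELEMENTS: Dict[str, Tuple[Optional[str], Optional[str]]] = {
    "list": (None, "listType"),
    "slist": ("list", "simpleListType"),
    "flaggedList": ("list", None),  # no type given: takes listType from list
    "ulist": ("flaggedList", "bulletedListType"),
    "olist": ("flaggedList", "enumeratedListType"),
    "glist": ("list", "glossaryListType"),
}


def effective_type(name: str) -> str:
    """Declared type, or the type of the equivClass target when none is given."""
    head, declared = ELEMENTS[name]
    if declared is not None:
        return declared
    if head is None:
        raise ValueError(f"{name} has neither a type nor an equivClass")
    return effective_type(head)


def derives_from(type_name: str, ancestor: str) -> bool:
    current: Optional[str] = type_name
    while current is not None:
        if current == ancestor:
            return True
        current = TYPE_BASE[current]
    return False


def is_coherent(name: str) -> bool:
    """An element's type must derive from the type of its equivClass target."""
    head, _ = ELEMENTS[name]
    return head is None or derives_from(effective_type(name), effective_type(head))


# A later schema document joins a class just by naming it; no hook is needed.
ELEMENTS["checklist"] = ("flaggedList", "bulletedListType")
ELEMENTS["para"] = ("list", "paraType")  # unrelated type: should be rejected

for element in ELEMENTS:
    print(element, is_coherent(element))
```

Every declaration passes except para, whose type has no derivation path to listType.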
Conclusion
The above examples have introduced two distinct but related mechanisms
which the XML Schema language provides for reconstructing some common uses
of parameter entities in structured DTD design. By bringing these constructs
inside the language, rather than relegating them to the status of conventions
for the use of text substitution, XML Schema has endorsed and facilitated
an approach to structured document type definition of recognised power and
generality.
Bibliography
[MA 1996] Maler, Eve L. and Jeanne El Andaloussi, 1996. Developing SGML DTDs, Prentice Hall PTR, New Jersey, USA. ISBN 0-13-309881-8.
[MA 1999] Maler, Eve L., 1999. XML specification DTD, W3C, Cambridge, MA, USA. Available online as http://www.w3.org/XML/1998/06/xmlspec-19990429.dtd.
[XHTML 2000] Pemberton, Steven et al., 2000. XHTML™ 1.0: The Extensible HyperText Markup Language, W3C, Cambridge, MA, USA. Strict DTD also available as http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd.
[TH 2000] Thompson, Henry S., David Beech, Murray Maloney and Noah Mendelsohn, eds, 2000. XML Schema part 1: Structures, W3C, Cambridge, MA, USA. Also available as http://www.w3.org/TR/xmlschema-1.