Derivation, tolerance and validity
a formal model of type definition in XML Schemas
Allen Brown Jr.
Find


Abstract
We investigate the general mechanism of type derivations in XML schemas1, of which the restrictions and extensions of the proposed schema standard (in [XSD1] and [XSD2]) are but special cases. The principal idea here is that the designer of a modified or derived type be able to declare the nature of the derivations made. Moreover, we would like to be able to substitute the derived type in places where the ancestral type was expected. But what if an application cannot tolerate such a substitution? In order to exploit fully such derived types, an application, in turn, needs to be able to declare what sorts of modifications it is willing to tolerate. Furthermore, the validator, upon hearing such advice from the application, needs to validate an instance according to the tolerances declared. In effect, we propose to make applications responsible for telling the validator what classes of type transformations lead to substitutable derivations from the application's perspective. The validator, then, is responsible for guaranteeing that the effective derivations fall within the application's tolerances.

Keywords

Contents
  1. Introduction
  2. Validation: an informal account
  3. Derivation by transformation
  4. The formalities
    1. Schemas
    2. Instances
    3. Basic validity
    4. Validity with derivation
    5. Validity with derivation and tolerance
    6. Modeling particular derivation transformations
      1. Adding a new tag
      2. Creating a new type by adding an attribute
      3. Creating a new type by restricting choices
      4. Creating a new type by cyclic permutation
  5. Conclusions
  6. Bibliography

Introduction
The principal idea here is that the designer of a modified or derived type be able to declare the nature of the modification made. In order to exploit fully such a declaration, an application needs to be able to declare what sorts of modifications it is willing to tolerate. Moreover, the validator, upon hearing such advice from the application, needs to validate an instance according to the tolerances declared. In effect we are proposing that we make applications responsible for telling the validator what classes of transformations lead to substitutable derivations from the application's perspective. The validator, then, is responsible for guaranteeing that the effective derivations fall within the application's tolerances.
We open by informally sketching a model of the declaration, the notion of tolerance, and the kinds of derivations with respect to which validations might be successfully conducted. Later, we will give a rigorous account of the concepts in question, from which one can design a concrete specification syntax, and implement a validator.
First, let's embellish the element declaration with a tolerance declaration:
<element type="mumble" tolerates="(attr-extension | elt-extension)*"/>
and then define a derived type:
<complexType name="grumble" base="mumble" derivedBy="attr-extension">
... </complexType>
The former declares an element that permits2 the substitution of a type derived by any sequence of element or attribute extensions. The latter derivation defines a subtype by adding attributes to the base type. (See the section below on transformations.)
The example above illustrates how the original type designer might indicate the permissibility of the further elaboration of the type. But an application knowing only about the original type might not be as permissive as the designer. In general, an application should be able to declare to the validator that certain kinds of derivation are tolerated. One possibility is to make a global declaration of such tolerance. Another possibility is to make local declarations of the form "this application tolerates derivations of the mumble type by any sequence of transformations of the following kinds..."
Now there are two questions:
In this informal account, we only consider derivations vis à vis content models, and "element only" content models at that. (It should be reasonably clear how to extend the same ideas to cover derivations with respect to attributes.) For the time being, types will be conflated with regular expressions, where a derived type is accessed through a new tag.3 Finally, we assume that all content models are representable by regular expressions (in [Regex]).
Previous Previous Table of Contents
Validation: an informal account
First, let's consider validation without derivation. Validity is defined recursively roughly as follows: A node of an XML tree tagged with , which has the type definition ( being a regular expression in tags) is valid just in case the children of the node tagged with (when taken in sequence) are in the regular set denoted by R, and each of the children is itself valid. An XML tree, by extension, is valid just in case all of its nodes are valid.
Without saying what can count as a derivation (other than to note that every type is a derivation of itself), we extend the definition of validity to accommodate derivation. A node of an XML tree tagged with , is valid just in case has some derivation defined by , such that each of the children of the node tagged with (when taken in sequence) is in the regular set denoted by , and each of the children is valid. An XML tree, by extension, is valid just in case all of its nodes are valid.
Previous Previous Table of Contents
Derivation by transformation
Suppose ,…,are type definitions, we will define a set of immediate transformations below yielding a type definition such that is an immediate derivation from ,…, by one such transformation. If ,…, is a set of transformations, there is the obvious inductive definition of a derivation from by a sequence of immediate transformations taken from that set. So what might be interesting immediate transformations?
Consider the immediate derivation of from ,…, by the "prepend" immediate transformation (or inheritance transformation) illustrated as follows:
yields
where is a new regular expression in tags. Similarly, one might define the "postpend" immediate transformation by
One can define a kind of "interleaving" immediate transformation by beginning with the type definitions
one "derives" them to an inherited type definition
where the 's are all new regular expressions. The transformations described thus far are essentially additive. But there are lots of others.
Refining from by the "insertion"' immediate transformation:
becomes
Refining from by the "deletion" immediate transformation:
becomes
Deriving from by a "restriction on choices" immediate transformation::
can be restricted to any of:
Refining from by the "cyclic permutation" immediate transformation:.
becomes
No doubt you have noticed that we have not told you how the positions at which the transformations occur are to be identified. Also, the only transformations we have shown are on the immediate subexpressions of regular expressions. There are endless grammar-based processors that allow the identification of subexpressions and modification thereof, so we'll defer providing a solution.
By composing the transformations described, a wide variety of insertions, deletions and rearrangements can be accommodated. (Indeed, if we allow the transformations above to occur at any level, we get a universal editor.) While each immediate transformation is eminently understandable, their composition can rapidly become opaque, so one may want to declare the number of immediate transformations that are tolerable in refining from . An application's robustness against transformations is more likely, however, to be determined by the nature of the transformations tolerated rather than their number.
Imagine that an application can declare that it will tolerate derivations from by any sequence of transformations in a set of immediate transformations . Thus, in the validation definition above, we amend it to look for not just derivations, but "tolerated" derivations. The validator, of course, should leave a record of what derivations were used at each node during the validation process. Moreover, the application can have access to the transformational relationship between the type associated with a node's actual tag and the type that led to the validation of the node's children.
Previous Previous Table of Contents
The formalities
Now we are ready to provide mathematical models of schemas, instances and derivation wherein:
Schemas
A schema4 S is a structure where
A type is basic if there does not exist a distinct type such that .7 Also, we will sometimes call the signature of .
Instances
An instance I of a schema S is a structure where
A labeling, being an ordered 4-tuple, has a first component called its tag10, a second component, called its attribution, a third component called its type; and a fourth component called its value. When the type is , it is said to be implicit, and otherwise explicit. The only nodes of an instance with a value component of are the non-minimal nodes of the instance. The partial order induces a tree on the nodes of I rooted at .
Basic validity
An instance I is S=-valid (read "S-basic valid") with respect to a schema S iff all of its nodes have implicit types, 's tag is and each node is S=-valid with respect to a schema S. A node is S=-valid with respect to a schema S iff either, or and
Validity with derivation
The intent of derivation is to control the circumstances under which a tag might occur at a node where another kind of tag is expected.12 Our task here is to define the circumstances under which such a substitution might be valid. Regular expressions offer us an obvious mechanism by which we might achieve this--namely, alternation. If , , then is a -alternative of . Let be all of the -alternatives of , then is the -closure of . By extension, for the -closure of (denoted ) is obtained from by replacing each element in the latter with its -closure. So any occurrence of the above would be replaced with .13 Finally, if , , then is a -alternative of . Let be all of the -alternatives , then the -effective content model of is
An instance I is -valid (read "-derivation valid") with respect to a schema S iff 's tag is either or a -alternative of , and each node is -valid with respect to a schema S. A node is -valid with respect to a schema S iff either, or and
Validity with derivation and tolerance
Recalling that is a subset of , consider a segmentation of into a finite collection of segments such that . Since we are free to construct and partition it any way we like, we could think of each of the segments as corresponding to a particular atomic derivation transformation on types. (In the next section we'll examine some particular kinds of atomic transformations.) Furthermore, we can abuse the notation of juxtaposition and consider composing the atomic transformations as relations, e.g.. Indeed, we can consider regular expressions over atomic transformations, and note that in the usual notation for regular expressions .
Let S and I be as above. Suppose is segmented into a finite collection of segments such that . Let be a regular expression over the segments of . Putting in the place of , we can define the -alternatives, the -closure, and the -effective signature by analogy with the -alternatives, the -closure, and the -effective signature above. An instance I is -valid (read "S -derivation valid tolerating transformations in the regular set ") with respect to a schema S iff 's tag is either or an -alternative of , and each node is -valid with respect to a schema S. A node is -valid with respect to a schema S iff either, or and
Modeling particular derivation transformations
We now explore how some of the forms of modification described above fit into the formal model. Generally speaking, we proceed by taking a schema S and defining a new schema S' where the components of the latter structure are defined by making incremental changes to the components of the former, but leaving the root element the same in both structures. We will subscript the components of S and S' with their respective schema names, e.g. and .
Adding a new tag
From schema S with type we create a new schema S' with a new element :
and may now order types with content models that are regular expressions over .
Creating a new type by adding an attribute
From schema S with type we create a new schema S' having a new type with an additional attribute assigned from the subspace :
Creating a new type by restricting choices
From schema S with type we create a new schema S' having a new type such that is derived from by restriction of choices (see the example in the derivation by transformation section):
Creating a new type by cyclic permutation
From schema S with type we create a new schema S’ having a new type such that is derived from by cyclic permutation (see the example in the derivation by transformation section):
Previous Previous Table of Contents
Conclusions
In the foregoing we have abstracted
In doing so we assume only that attributes have simple values, and that elements are either simple values or structures characterized by regular expressions. Indeed the model of types (in [Types]) that we employ is essentially the extensional one traditionally studied by logicians. On the surface it would appear that this is not quite the same as the typical computer science model of types where two named types having the same type signature represent distinct types. It happens that we can encode the name of a type both in a regular expression and in an attribution, producing, in effect, distinct copies of the same structure. Using this device together with imposing various restrictions on the derivation partial order, we can reproduce the validation semantics of the XML Schema candidate recommendation. More importantly, however, we have a much richer notion of type derivation and validation.
While implementation is outside the scope of this paper, we should observe that validation, as presented, depends only on a set of regular expressions that can be statically constructed from the definition of a schema. Adding derivation and/or tolerance simply increases the size and/or number of the regular expressions. It is easy to imagine a compiler that undertakes such a construction. Validation at a node reduces to checking if a sequence of tokens associated with each of the child nodes is in the regular set (in [Regex]) defined by a particular regular expression.
Previous Previous Table of Contents
Bibliography
[Regex] Alfred V. Aho. Algorithms for finding patterns in strings. In Jan van Leeuwen, editor, Handbook of Theoretical Computer Science.MIT Press Elsevier, Cambridge, Massachusetts, 1 990.
[Types] Solomon Feferman. Theories of finite type related to mathematical practice. In Jon Barwise, editor, Handbook of Mathematical Logic.Elsevier North Holland, Amsterdam, 1 977.
[XSD1] Paul V. Biron and Ashok Malhotra, editors. XML Schema Part 2: Datatypes. World Wide Web Consortium, Cambridge, Massachusetts, 2000.
[XSD2]Paul V. Biron and Ashok Malhotra, editors. XML Schema Part 2: Datatypes. World Wide Web Consortium, Cambridge, Massachusetts, 2000.
Previous Previous Table of Contents