|
Derivation, tolerance and validity
a formal model of type
definition in XML Schemas
|
 |
We investigate the general mechanism of type derivations in XML schemas
1, of which the restrictions and extensions of the proposed schema
standard (in
[XSD1] and
[XSD2])
are but special cases. The principal idea here is that the designer of a modified
or derived type be able to declare the nature of the derivations made. Moreover,
we would like to be able to substitute the derived type in places where the
ancestral type was expected. But what if an application cannot tolerate such
a substitution? In order to exploit fully such derived types, an application,
in turn, needs to be able to declare what sorts of modifications it is willing
to tolerate. Furthermore, the validator, upon hearing such advice from the
application, needs to validate an instance according to the tolerances declared.
In effect, we propose to make applications responsible for telling the validator
what classes of type transformations lead to substitutable derivations from
the application's perspective. The validator, then, is responsible for guaranteeing
that the effective derivations fall within the application's tolerances.
Introduction
The principal idea here is that the designer of a modified or derived
type be able to declare the nature of the modification made. In order to exploit
fully such a declaration, an application needs to be able to declare what
sorts of modifications it is willing to tolerate. Moreover, the validator,
upon hearing such advice from the application, needs to validate an instance
according to the tolerances declared. In effect we are proposing that we make
applications responsible for telling the validator what classes of transformations
lead to substitutable derivations from
the application's perspective. The validator,
then, is responsible for guaranteeing that the effective derivations fall
within the application's tolerances.
We open by informally sketching a model of the declaration, the notion
of tolerance, and the kinds of derivations with respect to which validations
might be successfully conducted. Later, we will give a rigorous account of
the concepts in question, from which one can design a concrete specification
syntax, and implement a validator.
First, let's embellish the element declaration with a tolerance declaration:
<element type="mumble" tolerates="(attr-extension | elt-extension)*"/>
and then define a derived type:
<complexType name="grumble" base="mumble" derivedBy="attr-extension">
... </complexType>
The former declares an element that permits
2 the substitution of a type derived by any sequence of element or
attribute extensions. The latter derivation defines a subtype by adding attributes
to the base type. (See the section below on transformations.)
The example above illustrates how the original type designer might indicate
the permissibility of the further elaboration of the type. But an application
knowing only about the original type might not be as permissive as the designer.
In general, an application should be able to declare to the validator that
certain kinds of derivation are tolerated. One possibility is to make a global
declaration of such tolerance. Another possibility is to make local declarations
of the form "this application tolerates derivations of the mumble
type by any sequence of transformations of the following kinds..."
Now there are two questions:
- What kinds of derivation are there?
- How might the validation of such derivations work?
In this informal account, we only consider derivations
vis
à vis content models, and "element only" content models
at that. (It should be reasonably clear how to extend the same ideas to cover
derivations with respect to attributes.) For the time being, types will be
conflated with regular expressions, where a derived type is accessed through
a new tag.
3 Finally, we assume that all content models are representable by
regular expressions (in
[Regex]).
Validation: an informal account
First, let's consider validation without derivation. Validity is defined
recursively roughly as follows: A node of an XML tree tagged with

, which has the type definition

(

being
a regular expression in tags) is valid just in case the children of the node
tagged with

(when taken in sequence)
are in the regular set denoted by R, and each of the children is itself valid.
An XML tree, by extension, is valid just in case all of its nodes are valid.
Without saying what can count as a derivation (other than to note that
every type is a derivation of itself), we extend the definition of validity
to accommodate derivation. A node of an XML tree tagged with

, is valid just in case

has some derivation

defined by

, such that each of the children of the node tagged
with

(when taken in sequence) is
in the regular set denoted by

,
and each of the children is valid. An XML tree, by extension, is valid just
in case all of its nodes are valid.
Derivation by transformation
Suppose

,…,

are type definitions, we will define a set of immediate
transformations below yielding a type definition

such
that

is an immediate derivation
from

,…,

by one such transformation. If

,…,

is a set of transformations, there is the obvious inductive definition of
a derivation

from

by a sequence of immediate transformations taken
from that set. So what might be interesting immediate transformations?
Consider the immediate derivation of

from

,…,

by the "prepend" immediate transformation (or
inheritance transformation) illustrated as follows:
yields
where

is a new regular expression
in tags. Similarly, one might define the "postpend" immediate transformation
by
One can define a kind of "interleaving" immediate transformation by
beginning with the type definitions
one "derives" them to an inherited type definition
where the

's are all new
regular expressions. The transformations described thus far are essentially
additive. But there are lots of others.
Refining

from

by the "insertion"' immediate transformation:
becomes
Refining

from

by the "deletion" immediate transformation:
becomes
Deriving

from

by a "restriction on choices" immediate transformation::
can be restricted to any of:
Refining

from

by the "cyclic permutation" immediate transformation:.
becomes
No doubt you have noticed that we have not told you how the positions
at which the transformations occur are to be identified. Also, the only transformations
we have shown are on the immediate subexpressions of regular expressions.
There are endless grammar-based processors that allow the identification of
subexpressions and modification thereof, so we'll defer providing a solution.
By composing the transformations described, a wide variety of insertions,
deletions and rearrangements can be accommodated. (Indeed, if we allow the
transformations above to occur at any level, we get a universal editor.) While
each immediate transformation is eminently understandable, their composition
can rapidly become opaque, so one may want to declare the number of immediate
transformations that are tolerable in refining

from

. An application's robustness
against transformations is more likely, however, to be determined by the nature
of the transformations tolerated rather than their number.
Imagine that an application can declare that it will tolerate derivations

from

by any sequence of transformations in a set of immediate transformations

. Thus, in the validation definition above, we
amend it to look for not just derivations, but "tolerated" derivations. The
validator, of course, should leave a record of what derivations were used
at each node during the validation process. Moreover, the application can
have access to the transformational relationship between the type associated
with a node's actual tag and the type that led to the validation of the node's
children.
The formalities
Now we are ready to provide mathematical models of schemas, instances
and derivation wherein:
- Schemas embed context-free grammars the bodies of whose productions
are regular expressions.
- The heads of those productions are the names of types.
- The types are associated with attribute definitions.
- There is a partial order on types that maximizes refinable content
models.
- Instances are ordered, labeled, attributed trees.
- Basic validity is defined in terms of types that are maximal in
the partial order.
- Validity with derivation is defined relative to lesser types in
the partial order.
- The partial order on types is segmented in such a way as to mimic
various classes of derivation.
- Validity is further sharpened to allow for tolerance of instances
according to whether they fall into certain classes of derivation or not.
Schemas
A
schema4 S is a structure

where
-
is a finite set of element names
-
is a finite set of type names
-
is a finite set of attribute names
-
is a finite set of values
-
is the distinguished root element
-
is the set of regular
expressions over
-
is the content
modeling function5
-
is the attribution
function6
-
is the type
declaration function
- immediate derivation
,
is a partial order where
, and
whenever 
-
-
- if
then
- derivation is the transitive closure of immediate derivation
A type

is
basic
if there does not exist a distinct type

such that

.
7 Also, we will sometimes call

the
signature of

.
Instances
An
instance I of a schema S is a
structure

where
-
is a finite set of nodes
-
is the distinguished root node
-
and
are discrete partial orders8 over
such that if
and
then
and
are children of
(and siblings
of one another), and either
(
is the elder sibling of
),
or
(
is the elder sibling of
)
- for every node
- for every pair of siblings
there is no
such that
and
- for nodes
,
only if there exists a node
such that
and
- a labeling is a function
where9
A labeling, being an ordered 4-tuple, has a first component called its
tag10, a second component, called its
attribution,
a third component called its
type; and
a fourth component called its
value. When
the type is

, it is said to be
implicit, and otherwise
explicit.
The only nodes of an instance with a value component of

are the non-minimal nodes of the instance. The partial order

induces a tree on the nodes of I rooted at

.
Basic validity
An instance I is S
=-
valid
(read "S-basic valid") with respect to a schema S iff all of its nodes have
implicit types,

's tag is

and each node

is S
=-valid with respect to a schema S. A node

is S
=-valid with respect to a schema
S iff either

, or

and
- 1. for every
in the domain
of
's attribution
, either
,
or
with
;
and
- 2. either
- 1.
;
- 2. the value component of
is
in
11; and
- 3.
has no children
- 3. or
- 1.
;
- 2. the value component of
is
;
- 3. the sequence
, where
are all the
children of
taken in
order whose tags are in
,
is in the regular set defined by
;
and
- 4. each of the
is S=-valid
with respect to S.
Validity with derivation
The intent of derivation is to control the circumstances under which
a tag might occur at a node where another kind of tag is expected.
12 Our task here is to define the circumstances under which such a
substitution might be valid. Regular expressions
offer us an obvious mechanism by which we might achieve this--namely,
alternation. If

,

, then

is a

-
alternative
of

. Let

be
all of the

-alternatives
of

, then

is
the

-
closure
of

. By extension, for

the

-closure
of

(denoted

)
is obtained from

by replacing
each element in the latter with its

-closure.
So any occurrence of the

above
would be replaced with

.
13 Finally, if

,

, then

is a

-alternative of

. Let

be
all of the

-alternatives

, then the

-
effective content model of

is
An instance I is

-
valid (read "

-derivation
valid") with respect to a schema S iff

's
tag is either

or a

-alternative of

,
and each node

is

-valid with respect to a schema S. A node

is

-valid
with respect to a schema S iff either

,
or

and
- 1. if the type component of
is
, then
and the base type for validation is
- 2. if the type component of
is
, then the base type for validation is
- 3. for every
in the domain
of
's attribution
, either
,
or if
and
then
; and
- 4. either
- 1.
;
- 2. the value component of
is
in the
-effective content model
of
; and
- 3.
has no children
- 5. or
- 1.
;
- 2. the value component of
is
;
- 3. the sequence
, where
are all the
children of
taken in
order whose tags are in
,
is in the regular set defined by the
-effective
content model of
; and
- 4. each of the
is S=-valid
with respect to S.
Validity with derivation and tolerance
Recalling that

is a subset
of

, consider a segmentation of

into a finite collection of segments

such that

.
Since we are free to construct

and partition it any way we like, we could think of each of the segments as
corresponding to a particular atomic derivation transformation on types. (In
the next section we'll examine some particular kinds of atomic transformations.)
Furthermore, we can abuse the notation of juxtaposition and consider composing
the atomic transformations as relations,
e.g.
. Indeed, we can consider regular expressions
over atomic transformations, and note that in the usual notation for regular
expressions

.
Let S and I be as above. Suppose

is segmented into a finite collection of segments

such that

. Let

be a regular expression over the segments of

. Putting

in the place of

, we can define
the

-alternatives, the

-closure, and the

-effective
signature by analogy with the

-alternatives,
the

-closure, and the

-effective signature above. An instance I is

-
valid (read
"S -derivation valid tolerating transformations in the regular set

") with respect to a schema S iff

's tag is either

or an

-alternative of

, and each node

is

-valid with respect to a schema
S. A node

is

-valid with respect to a schema S iff either

, or

and
- 1. if the type component of
is
, then
and the base type for validation is
- 2. if the type component of
is
, then the base type for validation is
- 3. for every
in the domain
of
's attribution
, either
,
or if
and
then
; and
- 4. either
- 1.
;
- 2. the value component of
is
in the
-effective content model
of
; and
- 3.
has no children
- 5. or
- 1.
;
- 2. the value component of
is
;
- 3. the sequence
, where
are all the
children of
taken in
order whose tags are in
,
is in the regular set defined by the
-effective
content model of
; and
- 4. each of the
is
-valid with respect to S.
Modeling particular derivation transformations
We now explore how some of the forms of modification described above
fit into the formal model. Generally speaking, we proceed by taking a schema
S and defining a new schema S' where the components of the latter structure
are defined by making incremental changes to the components of the former,
but leaving the root element the same in both structures. We will subscript
the components of S and S' with their respective schema names,
e.g.
and

.
Adding a new tag
From schema S with type

we create a new schema S' with a new element

:
and

may now order types
with content models that are regular expressions over

.
Creating a new type by adding an attribute
From schema S with type

we create a new schema S' having a new type

with an additional attribute

assigned from the subspace

:
Creating a new type by restricting choices
From schema S with type

we create a new schema S' having a new type

such that

is derived from

by restriction of choices (see the example in
the
derivation by transformation section):
Creating a new type by cyclic permutation
From schema S with type

we create a new schema S’ having a new type

such that

is derived from

by cyclic permutation (see the example in the
derivation
by transformation section):
Conclusions
In the foregoing we have abstracted
- types of elements
- types of attributes
- openness
- type derivation
- element equivalence
- tolerance
- schema validation both with and without tolerance
In doing so we assume only that attributes have simple values, and that
elements are either simple values or structures characterized by regular expressions.
Indeed the model of types (in
[Types]) that we employ
is essentially the extensional one traditionally studied by logicians. On
the surface it would appear that this is not quite the same as the typical
computer science model of types where two named types having the same type
signature represent distinct types. It happens that we can encode the name
of a type both in a regular expression and in an attribution, producing, in
effect, distinct copies of the same structure. Using this device together
with imposing various restrictions on the derivation partial order, we can
reproduce the validation semantics of the XML Schema candidate recommendation.
More importantly, however, we have a much richer notion of type derivation
and validation.
While implementation is outside the scope of this paper, we should observe
that validation, as presented, depends only on a set of regular expressions
that can be statically constructed from the definition of a schema. Adding
derivation and/or tolerance simply increases the size and/or number of the
regular expressions. It is easy to imagine a compiler that undertakes such
a construction. Validation at a node reduces to checking if a sequence of
tokens associated with each of the child nodes is in the regular set (in
[Regex]) defined by a particular regular expression.
Bibliography
| [Regex] | Alfred V. Aho. Algorithms for finding patterns in strings.
In Jan van Leeuwen, editor, Handbook of Theoretical Computer Science.MIT Press
Elsevier, Cambridge, Massachusetts, 1 990. |
| [Types] | Solomon Feferman. Theories of finite type related to
mathematical practice. In Jon Barwise, editor, Handbook of Mathematical Logic.Elsevier
North Holland, Amsterdam, 1 977. |
| [XSD1] | Paul V. Biron and Ashok Malhotra, editors. XML Schema
Part 2: Datatypes. World Wide Web Consortium, Cambridge, Massachusetts, 2000. |
| [XSD2] | Paul V. Biron and Ashok Malhotra, editors. XML Schema
Part 2: Datatypes. World Wide Web Consortium, Cambridge, Massachusetts, 2000. |