From markup to object model
the XML abstraction problem
and XML property objects
Mechanisms for building abstractions over XML documents tend to be more
complex and less flexible than techniques available in domains such as relational
databases and object models. This paper reviews several existing strategies
and suggests a new one. XML Property Objects allow a flexible, user-defined
mapping from complex XML attributed element tree structures to directed labeled
graph structures.
Overview
Software engineering is dominated by two tasks. The first is the design
of algorithms (and necessary data structures) required to automate the solutions
to particular problems. The second is the design of abstractions. Abstractions
allow us to reuse software code and thus make software solutions that can
grow and be maintained over time.
In a world where only implementation and algorithms mattered, everything
could be programmed in assembly language and every project would be approached
as if from scratch. There would be no operating systems, no programming languages,
no code libraries and no relational databases. A programmer's job would be
analogous to that of a carpenter. The fact that a carpenter has hammered a
nail a thousand times before does not remove the requirement to do it again.
Reuse is at the level of ideas and skills, not implementation.
To some extent, creators of tiny "embedded systems" live in this world.
Thankfully, the rest of us can use the ever-expanding RAM in our computers
to build abstractions on top of abstractions on top of abstractions: programs
on top of programming languages on top of interpreters on top of other programming
languages on top of operating systems. Each level can itself be decomposed
into many abstractions. The popular UML diagramming standard exists precisely
to help manage these levels of abstraction.
In a world where only abstraction mattered, everything would be programmed
in highly declarative mathematical notations built on top of other mathematical
formalisms. Extensibility and reusability would be foremost considerations.
Code performance and time to market would be minor considerations. Programmers
would be essentially mathematicians and data modelers. Before creating anything
new, a programmer would build a theory, notation and model for the new concept,
document it rigorously and build a "black box" that implemented the theory.
Clearly, we live between these extremes. In the markup world, this tension
is described as a dichotomy between "syntax" and "semantics." A particular
markup language or vocabulary is a concrete interchange format for an abstract
information model. Unfortunately there is a large gap between the XML tree
model and the models traditionally used in software engineering, logic and
information modelling. XML's model is more natural for serialization but less
natural for software manipulation and querying. In other words, it is perfect
for interchanging information but inefficient for working with the information
once it has been parsed.
Markup-based models revolve around order and hierarchy whereas the more
popular graph-based data models revolve around linking relationships and roles.
A linked data structure (directed graph) can more directly express sophisticated
information than can a simple tree. Unfortunately directed graphs are typically
second-class citizens in markup-based standards and systems. Most schema languages
and XML editors do not support link type-checking. It is time for a civil
rights movement for hyperlinked information!
Markup-based systems are usually also weak in their ability to build
abstractions. There is no concept of "subclassing" in XML DTDs and arguably
the concepts of architectural forms and XML Schema archetypes are not sufficient
for many common tasks. I believe that this problem stems directly from the
SGML/XML tree model. It is much easier and more productive to build abstractions
based on a relationship model than on an ordered, hierarchical tree structure.
This paper describes a mechanism for asserting the hidden relationships in
XML documents so that they can be exploited for abstraction building.
Imagine an Internet service titled "whereisthathomepage.com". The service
might map people's email addresses to their web sites. It would collect this
information from various web pages and databases on the Web. The problem is
that the representation of this information is not likely to be standardized.
There might be six or seven dominant ways that people describe this relationship
on the Web. HTML has meta tags that can point from web pages to email addresses.
A Dublin Core extension might have another syntax. A copyright protection
standard might have yet another, and so on.
It is not feasible to expect them all to use the same namespace, vocabulary,
schema or syntax. They might be designed without knowledge of each other,
in varying human languages or might have different syntactic constraints and
conventions to address. Independent invention of similar semantics is absolutely
unavoidable and must be handled by any system designed to scale.
A subtle but important point is that even with complete knowledge of
another vocabulary with similar semantics, there are completely valid reasons
to choose not to follow an existing standard slavishly. The devil is in the
details and the details of an XML vocabulary are important to the people who
must work with it. The details of a standard schema are often not exactly
appropriate to specific situations.
Vocabulary and terminology will always be chosen based on the needs
of particular user communities and will never be globally standardized. It
is no easier to invent a static, universally acceptable vocabulary framework
than to develop a static universal ontology. Differences between human languages
and jargon-sharing communities cannot be resolved in our lifetime, if at all.
Terminology is not the only problem. The structure of the XML expressing
the information may also vary. Consider our email/web page mapping problem.
In some schemas the actual relationship may be expressed in terms of an explicit
link from a web page URL to an email URL, in some from an email address to
a web page and in some from a "person" element to both other addresses. In
yet other cases the relationship might be expressed entirely in terms of containment.
In that case the fact that a person element contained both an email
and a web page address might implicitly assert that the two go
together.
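To make the structural variation concrete, here is a minimal Python sketch. The three vocabularies, their element and attribute names, and the example.org addresses are all invented for illustration; none of them comes from a real standard. Each snippet states the same email/home page relationship in a different shape, and a hand-written extractor per vocabulary recovers the same pair.

```python
import xml.etree.ElementTree as ET

# Three hypothetical vocabularies for the same email/home page relationship.
LINK_FROM_PAGE = ("<page url='http://example.org/ann'>"
                  "<contact email='ann@example.org'/></page>")
LINK_FROM_EMAIL = ("<mailbox address='ann@example.org'>"
                   "<homepage href='http://example.org/ann'/></mailbox>")
CONTAINMENT = ("<person><email>ann@example.org</email>"
               "<web>http://example.org/ann</web></person>")

def from_page(root):
    # Explicit link from the web page to the email address.
    return root.find("contact").get("email"), root.get("url")

def from_email(root):
    # Explicit link from the email address to the web page.
    return root.get("address"), root.find("homepage").get("href")

def from_containment(root):
    # Implicit relationship: both addresses share a <person> parent.
    return root.findtext("email"), root.findtext("web")

EXTRACTORS = {"page": from_page, "mailbox": from_email, "person": from_containment}

def homepage_pair(xml_text):
    """Return (email, home page) whatever the source vocabulary."""
    root = ET.fromstring(xml_text)
    return EXTRACTORS[root.tag](root)

pairs = {homepage_pair(doc) for doc in (LINK_FROM_PAGE, LINK_FROM_EMAIL, CONTAINMENT)}
print(pairs)
```

All three documents collapse to one pair. The cost is one hand-coded function per vocabulary, which is the kind of work a declarative mapping notation is meant to replace.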
XML vocabularies invariably serve as user interfaces to the underlying
data models. Even fancy, expensive tools do not usually hide the vocabulary.
Most sophisticated XML editing tools work primarily in terms of the terminology
and structure in the XML DTD and schema. Some level of customization is possible
but if the distance between the "default" environment and the one you want
is more than just element renaming, the customizations start to get expensive.
Changing attributes to elements and vice versa, or reordering or moving elements
is very difficult to implement in an editing tool. Most existing element "subtyping"
mechanisms have no provision for this level of syntactic variance, even outside
of XML editors.
The details of schemas are also important in systems that use XML without
any ongoing human interaction. It is still necessary for programmers to understand
the syntactic structures defined by the DTD or schema. It is arguably the
case that XML is eating into the markets of CORBA and EDI precisely because
programmers feel comfortable that they understand it. A particular schema
could become complex because it embeds separately designed fragments from
dozens of other languages and uses terminology that is foreign to particular
programmer groups. That schema will be replaced in those groups by something
simpler and more familiar, even if it is "less standards conformant."
Rather than a war between the standards-bearers and the minimalists,
we want to map the structures in the domain-specific specification back into
multiple, separately-designed public specifications. We want to build standard
abstractions over proprietary schemas. And we want to do it without neglecting
the needs of hyperlinked information.
XML Property Objects
XML Property Objects are a proposed solution to these problems. Property
objects are a simple notation for identity-preserving mappings from XML elements,
attributes and other nodes to a more traditional property/value/link model.
This model is highly similar to that employed by Java, Python, CORBA and all
major object databases. Just as in those models, objects consist of property/value
pairs. Here is how the Dublin Core organization describes the directed labelled
graph model (as expressed in RDF):
- "things have properties", where each property has a name and a value.
The "atom" of RDF, therefore, has three components: the thing (resource, in
web terminology), a property name, and a property value.
- each resource may have several of these properties.
- the value may itself be a resource.
This model is universally used and understood in software development.
It is also highly amenable to abstraction.
Mapping XML trees to graphs
XPO is like XSLT in that it is template-based. Each xpo:template element has
a match attribute that matches some set of nodes. The attribute contains an
XPath [XPath Specification] that functions as a match pattern, just as in XSLT.
Unlike XSLT, all matching rules apply, not just one. Within the template there
are a series of property value assignment (xpo:property) elements. These
properties and values are associated with nodes that match the selection criteria.
Each property definition has a name/value pair. The name is a property name
(expressed as a relative or absolute URI). The value of the property is defined
with a select attribute. This attribute uses XPath to select some other node
in the document or in some other document.
<xpo:template match="html">
<xpo:property name="doctitle" select="head/title"/>
<xpo:property name="content" select="body/*"/>
</xpo:template>
This example generates two extra properties on each "html" element node.
They are called "doctitle" and "content" locally but their entire URI-name
is prefixed by the URI for the schema. In this way, the tree is transformed
into a property-value graph that is roughly similar but not identical in structure
to the original tree. The first property will have a value of a single node
(because valid HTML documents have only one title) and the second property
will be a nodelist of many different kinds of elements (headings, paragraphs,
tables, etc.).
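The effect of the html template can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the dictionary form of the template is invented here, and the standard library's ElementTree module stands in for a full XPath engine (it supports only a subset of XPath, which happens to cover these two paths).

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory form of the xpo:template for "html" shown above.
TEMPLATE = {"match": "html",
            "properties": {"doctitle": "head/title", "content": "body/*"}}

DOC = """<html>
  <head><title>Hello</title></head>
  <body><h1>Heading</h1><p>First</p><p>Second</p></body>
</html>"""

def apply_template(root, template):
    """Return {element: {property name: [selected nodes]}} for matching elements."""
    graph = {}
    for node in root.iter(template["match"]):
        graph[node] = {name: node.findall(path)
                       for name, path in template["properties"].items()}
    return graph

root = ET.fromstring(DOC)
props = apply_template(root, TEMPLATE)[root]
titles = [n.text for n in props["doctitle"]]
content_tags = [n.tag for n in props["content"]]
print(titles, content_tags)
```

The property values are the original element objects, not copies, so the identity of each node is preserved in the resulting graph.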
<xpo:template match="table">
<xpo:property name="caption" select="caption"/>
<xpo:property name="width" select="@width"/>
<xpo:property name="height" select="@height"/>
<xpo:property name="rows" select="tr"/>
<xpo:property name="cells" select="tr/td"/>
</xpo:template>
The graph that would result from this rule has a node with an "element
type name" with properties named "caption", "width", "height", "rows" and
"cells". Each of these would refer to another node or set of nodes. The other
nodes might have their own rules and properties.
<xpo:template match="tr">
<xpo:property name="cells" select="td"/>
<xpo:property name="table_owner" select=".."/>
</xpo:template>
Note that property-based models frequently have redundant properties
for navigational convenience. In this case, the cells property on the table
is (strictly speaking) redundant because it can be accessed through the rows
property. One of the benefits of mapping from the XML structure to the property
objects is that it provides an opportunity to introduce redundancy. This is
important because these property-based models are designed to be used as the
basis for programming APIs and query languages.
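The redundancy described above can be checked with a small Python sketch. It assumes ElementTree as the tree model; because ElementTree elements do not record their parent, the ".." step of the table_owner property is emulated with a lookup table, which is an artifact of this sketch and not part of XPO.

```python
import xml.etree.ElementTree as ET

DOC = """<table width="80%">
  <caption>Prices</caption>
  <tr><td>a</td><td>b</td></tr>
  <tr><td>c</td></tr>
</table>"""

table = ET.fromstring(DOC)
# ElementTree has no parent pointers, so build a child -> parent lookup for "..".
parent_of = {child: parent for parent in table.iter() for child in parent}

# Properties from the "table" template (rows, cells) and the "tr" template.
table_props = {"rows": table.findall("tr"), "cells": table.findall("tr/td")}
row_props = {tr: {"cells": tr.findall("td"), "table_owner": parent_of[tr]}
             for tr in table_props["rows"]}

# The redundant "cells" property equals the cells reached through "rows".
via_rows = [td for tr in table_props["rows"] for td in row_props[tr]["cells"]]
print(via_rows == table_props["cells"])
```

An application can take whichever route is more convenient: straight to the cells, or through the rows and back up to the owning table.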
The select attribute gives full access to XPath syntax.
All operators and steps are available. In addition to string operations and
basic math, it is also possible for properties to traverse around and between
documents across hyperlinks:
<xpo:template match="/doc">
<xpo:property name="french_title"
select="document(translations/french/@href)/title"/>
</xpo:template>
One virtue of the property/object view of the world is that an application
examining a property need not have any knowledge of whether the property value
was syntactically in the same document as the referent or not. Once the XPath
has been resolved, the relative location of the original elements is irrelevant.
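A rough Python sketch of the french_title property follows. The document store, the file names and the sample content are assumptions made for illustration; a real processor would fetch the target by URL. In XPath, document(...)/title selects the document element of the target when that element is a title, and the sketch mirrors this.

```python
import xml.etree.ElementTree as ET

# Hypothetical document store standing in for retrieval by URL.
DOCUMENTS = {
    "main.xml": "<doc><title>Hello</title>"
                "<translations><french href='fr.xml'/></translations></doc>",
    "fr.xml": "<title>Bonjour</title>",
}

def load(href):
    """Parse a stored document and return its document element."""
    return ET.fromstring(DOCUMENTS[href])

def french_title(doc):
    """Approximate document(translations/french/@href)/title."""
    link = doc.find("translations/french")
    if link is None:
        return []
    target = load(link.get("href"))
    # Keep the target's document element only if it is a <title>.
    return [target] if target.tag == "title" else []

main = load("main.xml")
properties = {"title": main.findall("title"), "french_title": french_title(main)}
texts = {name: [node.text for node in nodes] for name, nodes in properties.items()}
print(texts)
```

Both properties are plain lists of nodes. Nothing in the result reveals that one value came from the same document and the other from a different one.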
Inter-node property relationships (links) are the central concept in
the property object view, just as in groves or the RDF model. The difference
between property objects and the XML property set (groves) or the RDF XML
syntax is that the schema-writer completely controls the mapping from syntactic,
XML tree representation to semantic, directed labelled graph.
Other details:
- Because multiple xpo:template elements may apply to the same element,
there are no template conflicts and no priority rules are needed to resolve them.
- No provision is made for state-changing methods because "XML philosophy"
typically separates data from its behavior. This separation has been lauded
and applauded in many other papers. I will not defend it further here.
- The mapping maintains identity. It is possible to follow a series
of links and then ask where the target node resided in a physical XML document.
- I do not believe that the mapping is reversible across modifications.
If you change a property value in an application, it is not necessarily clear
how to go back and modify the XML document. XPath is a little too powerful
to allow easy reversibility. In computer science it is typically the case
that power has a price.
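The first point, that every matching template contributes, can be sketched in Python. The list-of-dictionaries form of the templates and the first_cell property are invented for this illustration. The text does not say what happens when two templates assign the same property name; this sketch simply lets the later template win.

```python
import xml.etree.ElementTree as ET

# Two hypothetical templates that match the same element type.
TEMPLATES = [
    {"match": "tr", "properties": {"cells": "td"}},
    {"match": "tr", "properties": {"first_cell": "td[1]"}},
]

def build_graph(root, templates):
    """Apply every template; properties from all matches accumulate per node."""
    graph = {}
    for template in templates:
        for node in root.iter(template["match"]):
            props = graph.setdefault(node, {})
            for name, path in template["properties"].items():
                props[name] = node.findall(path)
    return graph

root = ET.fromstring("<table><tr><td>a</td><td>b</td></tr></table>")
graph = build_graph(root, TEMPLATES)
row = root.find("tr")
names = sorted(graph[row])
print(names)
```

The row ends up with properties from both templates, where an XSLT processor would have selected a single rule.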
Extensibility is provided through XPath. It is possible to add new functions
to XPath and those are automatically available to XPO. Consider an extreme
example: an XPath extension could allow the addressing of elements that overlap
particular other elements in an SVG diagram. XPO could implicitly build links
between those elements.
Abstraction with XPO
Building graphs over XML documents allows a single level of abstraction.
With the XPath extensibility, this level may be quite sophisticated, but it
is nevertheless a single level. This is useful but not sufficient. To solve
the difficult problems in electronic commerce and technical documentation,
it is necessary to build multiple levels of abstraction. A single department's
purchase order model must be mapped into an enterprise purchase order which
must be mapped into an industry wide purchase order which may need to be mapped
into a governmental purchase order. The problem is even harder for companies
in multiple industries, in many different governmental jurisdictions.
The mechanism for building abstractions on top of existing properties
is very similar to the basic mechanism. Instead of using traditional XPath
syntax, we use an XPath extension with an arrow (->) notation.
This extension means to traverse a named property. It can be represented in
the XPath data model as a new kind of axis with a minimized syntax.
<xpo:abstract match="html">
<xpo:property name="dublin_core:title"
select=".->doctitle"/>
<xpo:property name="dublin_core:subject"
select=".->keywords"/>
<xpo:property name="dublin_core:creator"
select=".->author->name"/>
</xpo:abstract>
This mechanism is simple both in syntax and in concept. It is highly
analogous to delegation in a simple (imaginary) programming language:
class html_2_dublin_core extends html:
    def get_dublin_core_title():
        return this.get_doctitle()
    def get_dublin_core_subject():
        return this.get_keywords()
    def get_dublin_core_creator():
        return this.get_author().get_name()
The abstraction mechanism can easily be replaced with a mechanism built-in
to another environment such as Java method delegation, or property inheritance
in RDF. It could also be used in concert with them. XPO never prevents the
use of language or application-specific abstraction mechanisms "on top" of
it.
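The arrow notation amounts to following named properties from the context node. The Python sketch below shows one possible reading of it, assuming that nodes are plain dictionaries and that a property value is either a single value or a list. The traverse function and the sample data are hypothetical and are not part of any XPO implementation.

```python
def traverse(node, path):
    """Follow a chain such as '.->author->name' over dict-based nodes."""
    steps = path.split("->")
    if steps[0] != ".":
        raise ValueError("path must start at the context node '.'")
    current = [node]
    for name in steps[1:]:
        reached = []
        for item in current:
            if not isinstance(item, dict):
                continue  # leaf values have no properties to traverse
            value = item.get(name, [])
            reached.extend(value if isinstance(value, list) else [value])
        current = reached
    return current

# Sample node carrying the basic properties the abstraction relies on.
html = {"doctitle": "XPO", "keywords": ["xml", "graphs"],
        "author": {"name": "A. Author"}}

# Dictionary form of the xpo:abstract rule for "html".
ABSTRACT = {"dublin_core:title": ".->doctitle",
            "dublin_core:subject": ".->keywords",
            "dublin_core:creator": ".->author->name"}

view = {name: traverse(html, path) for name, path in ABSTRACT.items()}
print(view)
```

Because the abstract properties are defined only in terms of other properties, the same rule would apply unchanged to any node that exposes doctitle, keywords and author, whatever XML syntax produced them.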
XPO views
When building software on top of ordinary XML, it is important to know
what sub-elements will be available in a particular element. XML uses DTDs
and schemas to make document structures predictable. XPO has an integrated
but separable constraint language that allows static declarations of types
of properties and node types that combine them. This language is separable
because in some environments it is likely to be replaced with property set
definitions (part of HyTime), RDF Schemas (W3C world), Express schemas, ODMG
object definitions or Java interfaces.
Rather than duplicate the sophisticated type checking of these languages,
XPO's constraint language is currently very simple. There are three types
of declarations: property type declarations, view declarations and view conformance
declarations. Property type declarations do exactly what they sound like they
would do:
<xpo:prop-def name="first_name" type="STRING"/>
<xpo:prop-def name="initial" type="STRING?"/>
<xpo:prop-def name="last_name" type="STRING"/>
<xpo:prop-def name="purchases" type="purchase+"/>
<xpo:prop-def name="visits" type="INTEGER"/>
As in XML DTDs, "+" means one or more, "*" means zero or more and "?" means zero or one.
The property names and types are actually URIs. When they are not completely
spelled out, they are considered to be IDs which can be appended to the base
URI of the document.
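The occurrence suffixes can be read mechanically. The following Python sketch splits a declared type into a name and a pair of bounds. It assumes that a type with no suffix means exactly one value, which the analogy with XML suggests but the text does not state outright. The function name is invented for this example.

```python
import re

# Bounds for each occurrence suffix; None means "no upper limit".
BOUNDS = {"": (1, 1), "?": (0, 1), "*": (0, None), "+": (1, None)}

def parse_type(declared):
    """Split a declared type such as 'purchase+' into (name, low, high)."""
    match = re.fullmatch(r"([^?*+]+)([?*+]?)", declared)
    if match is None:
        raise ValueError(f"bad type declaration: {declared!r}")
    name, suffix = match.groups()
    low, high = BOUNDS[suffix]
    return name, low, high

for declared in ("STRING", "STRING?", "purchase+"):
    print(declared, parse_type(declared))
```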
View declarations are similar to the interface declarations in Java
or CORBA. They define a set of properties that are guaranteed to appear
on every node that conforms to the view.
<xpo:view name="customer">
<xpo:prop href="#first_name"/>
<xpo:prop href="#last_name"/>
<xpo:prop href="#initial"/>
</xpo:view>
View conformance declarations state which nodes adhere to which views.
<xpo:conforms
match="purchase_order/buyer"
view="customer"/>
This declaration states that elements of type buyer within purchase_order
will always have the properties declared by the customer view
and they will always have the appropriate types.
An XPO processor must confirm these assertions at runtime. It might
be able to do so efficiently by examining DTDs and schemas. It could also
calculate the types of abstracted views by examining the declared types of
the more basic views. XPO differentiates between missing properties and properties
with null values. A missing property (relative to a declared view) is an error
even if the property value was optional.
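The distinction between a missing property and a null one can be sketched as a runtime check in Python. The view table, the function and the sample buyers are hypothetical, None stands in for a null value, and checking of the value types themselves is left out to keep the sketch short.

```python
# Hypothetical customer view: property name -> (type, minimum, maximum).
VIEW_CUSTOMER = {"first_name": ("STRING", 1, 1),
                 "last_name": ("STRING", 1, 1),
                 "initial": ("STRING", 0, 1)}

def conformance_errors(properties, view):
    """List the ways a node's properties fail to conform to a view."""
    errors = []
    for name, (_type, low, high) in view.items():
        if name not in properties:
            # Missing is an error even when the value itself is optional.
            errors.append(f"missing property: {name}")
            continue
        value = properties[name]
        count = 0 if value is None else len(value) if isinstance(value, list) else 1
        if count < low or (high is not None and count > high):
            errors.append(f"bad cardinality for {name}: {count}")
    return errors

buyer_null = {"first_name": "Ada", "last_name": "Lovelace", "initial": None}
buyer_missing = {"first_name": "Ada", "last_name": "Lovelace"}
print(conformance_errors(buyer_null, VIEW_CUSTOMER))
print(conformance_errors(buyer_missing, VIEW_CUSTOMER))
```

The first buyer conforms because its optional initial is present but null. The second does not, because the property is absent altogether.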
For performance and safety, one can envision a software processor known
as an XPO schema validator which can report whether the combination of an
XML schema and an XPO schema guarantees runtime type safety. Where XML Schema
is lacking (particularly in type checking of links), it may be necessary for
property definitions and assignments to include some type assertions directly.
Related technologies
There are various technologies that solve similar, but not identical
problems. Transformations were not designed for element-level abstraction
and cannot typically be applied to a particular element in an efficient and
well-defined manner. RDF does not have a mechanism that allows arbitrary XML
documents to be mapped into the abstraction mechanism. Architectural forms
and archetypes work only on the tree-model and cannot abstract links. This
section looks at these techniques in detail.
Transformation languages
Historically, when XML users need a level of abstraction, they typically
use transformation languages such as XSLT [XSLT], DSSSL [DSSSL] or Omnimark
[Omnimark]. They
do not think of what they are doing in terms of abstraction but that is actually
what they are getting. XSL formatting objects are abstractions for page definition.
Instead of embedding formatting directly into an XML document, you use the
transformation language to build a layer of page definition on top of the
domain specific XML abstraction. These languages are good at what they do
but they do not eliminate the need for a simpler, more declarative notation.
The first limitation of these languages in this context is that they
are designed to work on an entire document at a time, not on a single element
or attribute. This means that many perfectly valid stylesheets do not have
rules for every element that they transform. In these languages, the decision
to split the processing of two adjacent elements into two separate rules is
typically made on the basis of readability and simplicity, not element-level
application. In some circumstances it might be reasonable to process an entire
XML document in a single template rule.
Another issue is that these languages are not designed to maintain a
concept of node identity across transformations. If you use XSLT to translate
DocBook into HTML, there is no well-defined way to translate an operation
(e.g. a highlight and delete) described in terms of the abstracted version
into an operation in terms of the DocBook source. Perhaps an annotator wants
to highlight an element so that he or she can make an RDF assertion or XLink
hyperlink. The user typically intends to make the link into the XML source,
not a transient HTML rendition of it. It is not possible to run a transformation
backwards so there is no way to get back to the XML source and generate a
proper locator.
RDF
RDF [RDF Model and Syntax] allows XML documents to be represented
as property/value pairs with arcs between the nodes. Therefore it meets the
criterion of supporting first-class linking structures. RDF schemas also allow
RDF classes to subtype each other. Therefore the schemas provide an abstraction
mechanism.
RDF is not appropriate in many situations. The standardized RDF XML
syntax is somewhat intrusive and cannot be directly embedded into a traditionally
structured XML document. This is a serious usability problem which has arguably
hampered RDF's popularity. Intrusive specifications tend not to become popular.
It is not hard to define an XML syntax that is compatible with RDF, but it
requires you to design your entire language around the syntactic rules of
RDF, rather than having RDF "adapt" to your language's needs.
RDF also defines no mechanism for asserting that a single subtree of
an XML document conforms to RDF nor for mapping complex XML syntaxes into
an RDF model.
I believe that to achieve its potential, the RDF data model needs to
be reachable from some alternate syntaxes that are less intrusive. Ideally,
it should be possible to take existing languages like HTML and DocBook and
map them into RDF through an external notation. XML Property Objects could
be viewed as that bootstrapping mechanism. It allows documents encoded according
to arbitrary XML vocabularies and varying syntactic conventions to be viewed
with the RDF data model. From there it would be possible to use the abstraction
mechanisms already in RDF Schemas or those in XPO itself.
Architectural forms and archetypes
HyTime [HyTime] contains an abstract model for
hyperlinks but it also contains mechanisms for defining abstractions over
markup: architectural forms and groves. Architectural forms are abstractions
described in terms of the XML model. The HyTime architectural grove is basically
an in-memory XML document conforming to the HyTime architectural DTD. The
mapping from an arbitrary DTD to the HyTime "meta-"DTD is done through the
architectural mechanism. This mechanism is generalized so that it is possible
to map from arbitrary DTDs to arbitrary meta-DTDs.
The most serious limitation of the architectural mechanism is that it
can only describe abstractions in terms of parent/child relationships. Link-expressed
relationships cannot be abstracted. For instance, there is no way to assert
that the "CEO/CIO" relationship is a subtype of the "employer/employee" relationship
which is a subtype of the "works with" relationship. XML Schema archetypes
have the same limitations. They have a more direct syntax but they are not
much more expressive than architectural forms.
In addition to the concrete weakness around links, I believe that mappings
which work on the element/attribute data model will always be somewhat underpowered
and non-intuitive. That model is just not amenable to concepts such as inheritance
and subtyping.
It is possible to build abstractions at the grove level using auxiliary
groves but the mechanism for doing so is not specified in any standardized
syntax. An auxiliary grove's relationship to its source grove is only specified
in prose and implementing software. There is no formal grove mapping definition
language. From the HyTime point of view, XML Property Objects could be seen
as an extension to the grove model to allow this formal specification: a property
set->property set mapping language.
Data binding
There are various proposals for binding [Data Binding] XML into application
objects. Most of these build Java objects.
This is great if your problem is to get XML into Java but it is not designed
as a generalized XML abstraction mechanism. If you need to manage six different
views of a purchase order, data binding only helps if these views are defined
in terms of Java interfaces. If they are meant for other programming languages,
for direct insertion into relational databases or for querying with a graph-based
query language, data binding does not help. It is not clear whether data binding
mechanisms will be as flexible in allowing mappings based on XPaths or a similarly
sophisticated mechanism. Data binding mechanisms are designed to also allow
Java objects to map back into XML. XML Property Objects do not support this.
For some problems, XML Property Objects may be a sufficient mechanism
for building application objects into programming languages. In highly dynamic
programming languages such as Python, Smalltalk, Javascript, Visual Basic
and Perl, property object nodes can be handed directly to the application.
In statically typed programming languages they can either be passed dynamically
as dictionaries or statically by generating interface definitions from view
declarations.
XLink
XLink [XLink] and XPO both allow the assertion
of links. As with RDF, the primary difference is that XLinks require you to
use XLink syntax inside of your document. XML Property Objects are designed
to allow mapping creators to bring out the hidden links underlying the XML
structure whereas XLink allows the declaration of links inline. In a situation
where well-formed documents are being exchanged without any schema or DTD,
XLink is most appropriate.
Where negotiating a schema in advance is possible or the price of downloading
it is not too high, XPO is probably a better way to assert relationships.
It is less verbose, less intrusive and more "general" in that it allows the
easy creation of links to adjacent, contained or otherwise related elements
automatically.
XML Schemas
Grammar-based schema languages have many uses. They can be used to drive
XML editors and to enforce XML content ordering. The enforced ordering can
make efficient sequential processing possible. Unfortunately, they typically
have no concept of link typing and the abstraction mechanisms are therefore
link-unaware. More experience is necessary to determine whether these abstraction
mechanisms are sufficient for strictly tree-based abstraction creation.
On the other hand, XML Schemas have the virtue that they are extensible.
XPO elements can be embedded in with schema elements for purposes of describing
the abstract, link-based representation for the information.
Conclusions
The XML Property Objects notation is in the early stages of its development
and promotion. It shows great promise as a potential standard notation for
expressing the underlying "object-oriented" structures embedded in XML documents.
There is currently a prototype Data Binding implementation for mapping XML
documents into Python objects and a Java implementation will be created if
there is demand. Further development and standardization depends upon community
interest. Contact the author if the paper piqued yours.
Bibliography