From markup to object model
the XML abstraction problem and XML property objects
Paul Prescod

Abstract
Mechanisms for building abstractions over XML documents tend to be more complex and less flexible than techniques available in domains such as relational databases and object models. This paper reviews several existing strategies and suggests a new one. XML Property Objects allow a flexible, user-defined mapping from complex XML attributed element tree structures to directed labeled graph structures.

Contents
  1. Overview
  2. XML Property Objects
    1. Mapping XML trees to graphs
    2. Abstraction with XPO
    3. XPO views
  3. Related technologies
    1. Transformation languages
    2. RDF
    3. Architectural forms and archetypes
    4. Data binding
    5. XLink
    6. XML Schemas
  4. Conclusions
  5. Bibliography

Overview
Software engineering is dominated by two tasks. The first is the design of algorithms (and necessary data structures) required to automate the solutions to particular problems. The second is the design of abstractions. Abstractions allow us to reuse software code and thus make software solutions that can grow and be maintained over time.
In a world where only implementation and algorithms mattered, everything could be programmed in assembly language and every project would be approached as if from scratch. There would be no operating systems, no programming languages, no code libraries and no relational databases. A programmer's job would be analogous to that of a carpenter. The fact that a carpenter has hammered a nail a thousand times before does not remove the requirement to do it again. Reuse is at the level of ideas and skills, not implementation.
To some extent, creators of tiny "embedded systems" live in this world. Thankfully, the rest of us can use the ever-expanding RAM in our computers to build abstractions on top of abstractions on top of abstractions: programs on top of programming languages on top of interpreters on top of other programming languages on top of operating systems. Each level can itself be decomposed into many abstractions. The popular UML diagramming standard exists precisely to help manage these levels of abstraction.
In a world where only abstraction mattered, everything would be programmed in highly declarative mathematical notations built on top of other mathematical formalisms. Extensibility and reusability would be foremost considerations. Code performance and time to market would be minor considerations. Programmers would be essentially mathematicians and data modelers. Before creating anything new, a programmer would build a theory, notation and model for the new concept, document it rigorously and build a "black box" that implemented the theory.
Clearly, we live between these extremes. In the markup world, this tension is described as a dichotomy between "syntax" and "semantics." A particular markup language or vocabulary is a concrete interchange format for an abstract information model. Unfortunately there is a large gap between the XML tree model and the models traditionally used in software engineering, logic and information modelling. XML's model is more natural for serialization but less natural for software manipulation and querying. In other words, it is perfect for interchanging information but inefficient for working with the information once it has been parsed.
Markup-based models revolve around order and hierarchy whereas the more popular graph-based data models revolve around linking relationships and roles. A linked data structure (directed graph) can more directly express sophisticated information than can a simple tree. Unfortunately directed graphs are typically second class citizens in markup based standards and systems. Most schema languages and XML editors do not support link type-checking. It is time for a civil rights movement for hyperlinked information!
Markup-based systems are usually also weak in their ability to build abstractions. There is no concept of "subclassing" in XML DTDs and arguably the concepts of architectural forms and XML Schema archetypes are not sufficient for many common tasks. I believe that this problem stems directly from the SGML/XML tree model. It is much easier and more productive to build abstractions based on a relationship model than on an ordered, hierarchical tree structure. This paper describes a mechanism for asserting the hidden relationships in XML documents so that they can be exploited for abstraction building.
Imagine an Internet service titled "whereisthathomepage.com". The service might map people's email addresses to their web sites. It would collect this information from various web pages and databases on the Web. The problem is that the representation of this information is not likely to be standardized. There might be six or seven dominant ways that people describe this relationship on the Web. HTML has meta tags that can point from web pages to email addresses. A Dublin Core extension might have another syntax. A copyright protection standard might have yet another...and so forth, and so on, and so it goes, etc.
It is not feasible to expect them all to use the same namespace, vocabulary, schema or syntax. They might be designed without knowledge of each other, in varying human languages or might have different syntactic constraints and conventions to address. Independent invention of similar semantics is absolutely unavoidable and must be handled by any system designed to scale.
A subtle but important point is that even with complete knowledge of another vocabulary with similar semantics, there are completely valid reasons to choose not to follow an existing standard slavishly. The devil is in the details and the details of an XML vocabulary are important to the people who must work with it. The details of a standard schema are often not exactly appropriate to specific situations.
Vocabulary and terminology will always be chosen based on the needs of particular user communities and will never be globally standardized. It is no easier to invent a static, universally acceptable vocabulary framework than to develop a static universal ontology. Differences between human languages and jargon-sharing communities can not be resolved in our lifetime if at all.
Terminology is not the only problem. The structure of the XML expressing the relationship may also vary. Consider our email/web page mapping problem. In some schemas the actual relationship may be expressed in terms of an explicit link from a web page URL to an email URL, in some from an email address to a web page and in some from a "person" element to both other addresses. In yet other cases the relationship might be expressed entirely in terms of containment. In that case the fact that a person element contained both an email and a web page address might implicitly assert that the two go together.
XML vocabularies invariably serve as user interfaces to the underlying data models. Even fancy, expensive tools do not usually hide the vocabulary. Most sophisticated XML editing tools work primarily in terms of the terminology and structure in the XML DTD and schema. Some level of customization is possible but if the distance between the "default" environment and the one you want is more than just element renaming, the customizations start to get expensive. Changing attributes to elements and vice versa, or reordering or moving elements, is very difficult to implement in an editing tool. Most existing element "subtyping" mechanisms have no provision for this level of syntactic variance, even outside of XML editors.
The details of schemas are also important in systems that use XML without any ongoing human interaction. It is still necessary for programmers to understand the syntactic structures defined by the DTD or schema. It is arguably the case that XML is eating into the markets of CORBA and EDI precisely because programmers feel comfortable that they understand it. A particular schema could become complex because it embeds separately designed fragments from dozens of other languages and uses terminology that is foreign to particular programmer groups. That schema will be replaced in those groups by something simpler and more familiar, even if it is "less standards conformant."
Rather than a war between the standards-bearers and the minimalists, we want to map the structures in the domain-specific specification back into multiple, separately-designed public specifications. We want to build standard abstractions over proprietary schemas. And we want to do it without neglecting the needs of hyperlinked information.
XML Property Objects
XML Property Objects are a proposed solution to these problems. Property objects are a simple notation for identity-preserving mappings from XML elements, attributes and other nodes to a more traditional property/value/link model. This model is highly similar to that employed by Java, Python, CORBA and all major object databases. Just as in those models, objects consist of property/value pairs. Here is how the Dublin Core organization describes the directed labelled graph model (as expressed in RDF):
This model is universally used and understood in software development. It is also highly amenable to abstraction.
Mapping XML trees to graphs
XPO is like XSLT in that it is template-based. Each xpo:template element has a match attribute that matches some set of nodes. The attribute contains an XPath [XPath Specification] that functions as a match pattern, just as in XSLT. Unlike XSLT, all matching rules apply, not just one. Within the template there is a series of property value assignment (xpo:property) elements. These properties and values are associated with nodes that match the selection criteria. Each property definition has a name/value pair. The name is a property name (expressed as a relative or absolute URI). The value of the property is defined with a select attribute. This attribute uses XPath to select some other node in the document or in some other document.
<xpo:template match="html">
  <xpo:property name="doctitle" select="head/title"/>
  <xpo:property name="content" select="body/*"/>
</xpo:template>
This example generates two extra properties on each "html" element node. They are called "doctitle" and "content" locally but their full URI name is prefixed by the URI for the schema. In this way, the tree is transformed into a property-value graph that is roughly similar but not identical in structure to the original tree. The first property will have a value of a single node (because valid HTML documents have only one title) and the second property will be a nodelist of many different kinds of elements (headings, paragraphs, tables, etc.).
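The mapping can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the XPO prototype: the `TEMPLATES` structure and the `apply_templates` function are hypothetical names, templates match by element name only, and select expressions are limited to the XPath subset that Python's standard `xml.etree.ElementTree` supports.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory form of the xpo:template above:
# (element name to match, {property name: select path}).
TEMPLATES = [
    ("html", {"doctitle": "head/title", "content": "body/*"}),
]

def apply_templates(root, templates):
    """Return {element: {property name: list of selected elements}}."""
    graph = {}
    for elem in root.iter():
        # Every matching template contributes properties, not just the first.
        for match, props in templates:
            if elem.tag == match:  # simplification: match on element name only
                node_props = graph.setdefault(elem, {})
                for name, select in props.items():
                    node_props[name] = elem.findall(select)
    return graph

doc = ET.fromstring(
    "<html><head><title>Hello</title></head>"
    "<body><h1>Hi</h1><p>Text</p></body></html>"
)
graph = apply_templates(doc, TEMPLATES)
html_props = graph[doc]
```

The property values are the original element objects rather than copies, which is one way to read "identity-preserving" in this setting.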
<xpo:template match="table">
  <xpo:property name="caption" select="caption"/>
  <xpo:property name="width" select="@width"/>
  <xpo:property name="height" select="@height"/>
  <xpo:property name="rows" select="tr"/>
  <xpo:property name="cells" select="tr/td"/>
</xpo:template>
The graph that would result from this rule has a node with an "element type name" with properties named "caption", "width", "height", "rows" and "cells". Each of these would refer to another node or set of nodes. The other nodes might have their own rules and properties.
<xpo:template match="tr">
  <xpo:property name="cells" select="td"/>
  <xpo:property name="table_owner" select=".."/>
</xpo:template>
Note that property-based models frequently have redundant properties for navigational convenience. In this case, the cells property on the table is (strictly speaking) redundant because it can be accessed through the rows property. One of the benefits of mapping from the XML structure to the property objects is that it provides an opportunity to introduce redundancy. This is important because these property-based models are designed to be used as the basis for programming APIs and query languages.
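The redundancy can be made concrete with a small Python sketch. It assumes a toy table document and uses only the standard library. Because `xml.etree.ElementTree` keeps no parent pointers, the `table_owner` property is computed from an explicitly built parent map, which is an implementation detail of this sketch rather than anything XPO prescribes.

```python
import xml.etree.ElementTree as ET

table = ET.fromstring(
    "<table width='100'>"
    "<tr><td>a</td><td>b</td></tr>"
    "<tr><td>c</td></tr>"
    "</table>"
)

# ElementTree has no parent links, so build a child -> parent map once.
parent_of = {child: parent for parent in table.iter() for child in parent}

rows = table.findall("tr")                # the "rows" property of the table
cells_direct = table.findall("tr/td")     # the redundant "cells" property
cells_via_rows = [td for tr in rows for td in tr.findall("td")]
owners = [parent_of[tr] for tr in rows]   # the "table_owner" property of each row
```

Both routes reach the same cell nodes in the same order. The direct property simply saves the caller one traversal step.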
The select attribute gives full access to XPath syntax. All operators and steps are available. In addition to string operations and basic math, it is also possible for properties to traverse around and between documents across hyperlinks:
<xpo:template match="/doc">
  <xpo:property name="french_title"
                select="document(translations/french/@href)/title"/>
</xpo:template>
One virtue of the property/object view of the world is that an application examining a property need not have any knowledge of whether the property value was syntactically in the same document as the referent or not. Once the XPath has been resolved, the relative location of the original elements is irrelevant.
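A rough Python sketch of such a cross-document property follows. The `DOCUMENTS` dictionary is a hypothetical stand-in for real URI resolution, and the helper named `document` only imitates the XPath function of the same name. The sketch looks for `title` under the root element of the translated document, which is looser than a strict XPath evaluation of the expression above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document store standing in for URI resolution.
DOCUMENTS = {
    "doc-fr.xml": "<doc><title>Bonjour</title></doc>",
}

def document(href):
    """Imitation of XPath's document(): parse the addressed document."""
    return ET.fromstring(DOCUMENTS[href])

def french_title(doc_elem):
    """Follow the translation link and return the title node, if any."""
    link = doc_elem.find("translations/french")
    if link is None:
        return None
    return document(link.get("href")).find("title")

doc = ET.fromstring(
    "<doc><title>Hello</title>"
    "<translations><french href='doc-fr.xml'/></translations></doc>"
)
title_node = french_title(doc)
```

A caller that receives `title_node` sees an ordinary node and has no need to know that it came from a second document.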
Inter-node property relationships (links) are the central concept in the property object view, just as in groves or the RDF model. The difference between property objects and the XML property set (groves) or the RDF XML syntax is that the schema-writer completely controls the mapping from syntactic, XML tree representation to semantic, directed labelled graph.
Other details:
Extensibility is provided through XPath. It is possible to add new functions to XPath and those are automatically available to XPO. Consider an extreme example: an XPath extension could allow the addressing of elements that overlap particular other elements in an SVG diagram. XPO could implicitly build links between those elements.
Abstraction with XPO
Building graphs over XML documents allows a single level of abstraction. With the XPath extensibility, this level may be quite sophisticated, but it is nevertheless a single level. This is useful but not sufficient. To solve the difficult problems in electronic commerce and technical documentation, it is necessary to build multiple levels of abstraction. A single department's purchase order model must be mapped into an enterprise purchase order which must be mapped into an industry wide purchase order which may need to be mapped into a governmental purchase order. The problem is even harder for companies in multiple industries, in many different governmental jurisdictions.
The mechanism for building abstractions on top of existing properties is very similar to the basic mechanism. Instead of using traditional XPath syntax, we use an XPath extension with an arrow (->) notation. This extension traverses a named property. It can be represented in the XPath data model as a new kind of axis with a minimized syntax.
<xpo:abstract match="html">
  <xpo:property name="dublin_core:title"
                select=".->doctitle"/>
  <xpo:property name="dublin_core:subject"
                select=".->keywords"/>
  <xpo:property name="dublin_core:creator"
                select=".->author->name"/>
</xpo:abstract>
This mechanism is simple both in syntax and in concept. It is highly analogous to delegation in a simple (imaginary) programming language:
class html_2_dublin_core extends html:
    def get_dublin_core_title():
        return this.get_doctitle()
    def get_dublin_core_subject():
        return this.get_keywords()
    def get_dublin_core_creator():
        return this.get_author().get_name()
The abstraction mechanism can easily be replaced with a mechanism built-in to another environment such as Java method delegation, or property inheritance in RDF. It could also be used in concert with them. XPO never prevents the use of language or application-specific abstraction mechanisms "on top" of it.
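The arrow traversal itself is easy to sketch in Python. The graph layout below (node identifiers mapped to property dictionaries), the `traverse` function and the sample node names are all assumptions made for illustration. A real processor would operate on XML nodes and full XPath expressions.

```python
# Hypothetical property graph: node id -> {property name: list of node ids}.
GRAPH = {
    "html1": {"doctitle": ["title1"], "keywords": ["kw1"], "author": ["author1"]},
    "author1": {"name": ["name1"]},
}

# The xpo:abstract rules from the example, as property name -> arrow path.
ABSTRACT = {
    "dublin_core:title": ".->doctitle",
    "dublin_core:subject": ".->keywords",
    "dublin_core:creator": ".->author->name",
}

def traverse(graph, node, path):
    """Follow an arrow path such as '.->author->name' starting at node."""
    steps = [step for step in path.split("->") if step not in (".", "")]
    current = [node]
    for step in steps:
        following = []
        for n in current:
            # A node without the named property contributes nothing.
            following.extend(graph.get(n, {}).get(step, []))
        current = following
    return current

abstracted = {name: traverse(GRAPH, "html1", path) for name, path in ABSTRACT.items()}
```

Because the result is again a mapping from property names to nodes, a second set of rules could be applied on top of `abstracted` in the same way, which is how further levels of abstraction would be stacked.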
XPO views
When building software on top of ordinary XML, it is important to know what sub-elements will be available in a particular element. XML uses DTDs and schemas to make document structures predictable. XPO has an integrated but separable constraint language that allows static declarations of types of properties and node types that combine them. This language is separable because in some environments it is likely to be replaced with property set definitions (part of HyTime), RDF Schemas (W3C world), Express schemas, ODMG object definitions or Java interfaces.
Rather than duplicate the sophisticated type checking of these languages, XPO's constraint language is currently very simple. There are three types of declarations: property type declarations, view declarations and view conformance declarations. Property type declarations do exactly what they sound like they would do:
<xpo:prop-def name="first_name" type="STRING"/>
<xpo:prop-def name="initial" type="STRING?"/>
<xpo:prop-def name="last_name" type="STRING"/>
<xpo:prop-def name="purchases" type="purchase+"/>
<xpo:prop-def name="visits" type="INTEGER"/>
As in XML, "+" means one or more, "*", zero or more and "?" one or zero. The property names and types are actually URIs. When they are not completely spelled out, they are considered to be IDs which can be appended to the base URI of the document.
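The occurrence suffixes can be interpreted with a short Python sketch. The function names `parse_type` and `count_allowed` are hypothetical, and the sketch only covers the cardinality part of a declaration. It does not check that values actually are of type STRING or INTEGER.

```python
import re

# Occurrence suffixes as described in the text: (minimum, maximum).
# A maximum of None means "unbounded".
OCCURRENCE = {"": (1, 1), "?": (0, 1), "+": (1, None), "*": (0, None)}

def parse_type(declared):
    """Split a declared type such as 'purchase+' into (base type, min, max)."""
    match = re.fullmatch(r"([^?+*]+)([?+*]?)", declared)
    if match is None:
        raise ValueError(f"unrecognised type declaration: {declared!r}")
    base, suffix = match.groups()
    low, high = OCCURRENCE[suffix]
    return base, low, high

def count_allowed(declared, count):
    """Report whether 'count' values satisfy the declared occurrence."""
    _, low, high = parse_type(declared)
    return count >= low and (high is None or count <= high)
```

For example, `count_allowed("purchase+", 0)` is false because "+" requires at least one value, while `count_allowed("STRING?", 0)` is true.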
View declarations are similar to the interface declarations in Java or CORBA. They define a set of properties that are guaranteed to appear on every node that conforms to the view.
<xpo:view name="customer">
  <xpo:prop href="#first_name"/>
  <xpo:prop href="#last_name"/>
  <xpo:prop href="#initial"/>
</xpo:view>
View conformance declarations state which nodes adhere to which views.
<xpo:conforms
    match="purchase_order/buyer"
    view="customer"/>
This declaration states that elements of type buyer within purchase_order will always have the properties declared by the customer view and they will always have the appropriate types.
An XPO processor must confirm these assertions at runtime. It might be able to do so efficiently by examining DTDs and schemas. It could also calculate the types of abstracted views by examining the declared types of the more basic views. XPO differentiates between missing properties and properties with null values. A missing property (relative to a declared view) is an error even if the property value was optional.
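The distinction between a missing property and a null value can be sketched as follows. `PROP_DEFS`, `VIEWS` and `check_view` are hypothetical in-memory stand-ins for the declarations above, a null value is modelled as Python's `None`, and the check covers presence and nullability only, not the declared data types.

```python
MISSING = object()  # sentinel: property absent, as opposed to present but null

# Hypothetical in-memory forms of the prop-def and view declarations.
PROP_DEFS = {"first_name": "STRING", "initial": "STRING?", "last_name": "STRING"}
VIEWS = {"customer": ["first_name", "last_name", "initial"]}

def check_view(node_props, view_name):
    """Return a list of error strings; an empty list means the node conforms."""
    errors = []
    for prop in VIEWS[view_name]:
        value = node_props.get(prop, MISSING)
        optional = PROP_DEFS[prop].endswith("?")
        if value is MISSING:
            # Missing is an error even when the declared value is optional.
            errors.append(f"missing property: {prop}")
        elif value is None and not optional:
            errors.append(f"null value for required property: {prop}")
    return errors

ok_buyer = {"first_name": "Ada", "last_name": "Lovelace", "initial": None}
bad_buyer = {"first_name": "Ada", "last_name": "Lovelace"}
```

Here `ok_buyer` conforms because its optional `initial` property is present with a null value, whereas `bad_buyer` fails because the property is absent altogether.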
For performance and safety, one can envision a software processor known as an XPO schema validator which can report whether the combination of an XML schema and an XPO schema guarantees runtime type safety. Where XML Schema is lacking (particularly in type checking of links), it may be necessary for property definitions and assignments to include some type assertions directly.
Related technologies
There are various technologies that solve similar, but not identical problems. Transformations were not designed for element-level abstraction and cannot typically be applied to a particular element in an efficient and well-defined manner. RDF does not have a mechanism that allows arbitrary XML documents to be mapped into the abstraction mechanism. Architectural forms and archetypes work only on the tree-model and cannot abstract links. This section looks at these techniques in detail.
Transformation languages
Historically, when XML users need a level of abstraction, they typically use transformation languages such as XSLT [XSLT], DSSSL [DSSSL] or Omnimark [Omnimark]. They do not think of what they are doing in terms of abstraction but that is actually what they are getting. XSL formatting objects are abstractions for page definition. Instead of embedding formatting directly into an XML document, you use the transformation language to build a layer of page definition on top of the domain specific XML abstraction. These languages are good at what they do but they do not eliminate the need for a simpler, more declarative notation.
The first limitation of these languages in this context is that they are designed to work on an entire document at a time, not on a single element or attribute. This means that many perfectly valid stylesheets do not have rules for every element that they transform. In these languages, the decision to split the processing of two adjacent elements into two separate rules is typically made on the basis of readability and simplicity, not element-level application. In some circumstances it might be reasonable to process an entire XML document in a single template rule.
Another issue is that these languages are not designed to maintain a concept of node identity across transformations. If you use XSLT to translate DocBook into HTML, there is no well-defined way to translate an operation (e.g. a highlight and delete) described in terms of the abstracted version into an operation in terms of the DocBook source. Perhaps an annotator wants to highlight an element so that he or she can make an RDF assertion or XLink hyperlink. The user typically intends to make the link into the XML source, not a transient HTML rendition of it. It is not possible to run a transformation backwards so there is no way to get back to the XML source and generate a proper locator.
RDF
RDF [RDF Model and Syntax] allows XML documents to be represented as property/value pairs with arcs between the nodes. Therefore it meets the criterion of supporting first-class linking structures. RDF schemas also allow RDF classes to subtype each other. Therefore the schemas provide an abstraction mechanism.
RDF is not appropriate in many situations. The standardized RDF XML syntax is somewhat intrusive and cannot be directly embedded into a traditionally structured XML document. This is a serious usability problem which has arguably hampered RDF's popularity. Intrusive specifications tend not to become popular. It is not hard to define an XML syntax that is compatible with RDF, but it requires you to design your entire language around the syntactic rules of RDF, rather than having RDF "adapt" to your language's needs.
RDF also defines no mechanism for asserting that a single subtree of an XML document conforms to RDF nor for mapping complex XML syntaxes into an RDF model.
I believe that to achieve its potential, the RDF data model needs to be reachable from some alternate syntaxes that are less intrusive. Ideally, it should be possible to take existing languages like HTML and DocBook and map them into RDF through an external notation. XML Property Objects could be viewed as that bootstrapping mechanism. It allows documents encoded according to arbitrary XML vocabularies and varying syntactic conventions to be viewed with the RDF data model. From there it would be possible to use the abstraction mechanisms already in RDF Schemas or those in XPO itself.
Architectural forms and archetypes
HyTime [HyTime] contains an abstract model for hyperlinks but it also contains mechanisms for defining abstractions over markup: architectural forms and groves. Architectural forms are abstractions described in terms of the XML model. The HyTime architectural grove is basically an in-memory XML document conforming to the HyTime architectural DTD. The mapping from an arbitrary DTD to the HyTime "meta-"DTD is done through the architectural mechanism. This mechanism is generalized so that it is possible to map from arbitrary DTDs to arbitrary meta-DTDs.
The most serious limitation of the architectural mechanism is that it can only describe abstractions in terms of parent/child relationships. Link-expressed relationships cannot be abstracted. For instance, there is no way to assert that the "CEO/CIO" relationship is a subtype of the "employer/employee relationship" which is a subtype of the "works with" relationship. XML Schema archetypes have the same limitations. They have a more direct syntax but they are not much more expressive than architectural forms.
In addition to the concrete weakness around links, I believe that mappings which work on the element/attribute data model will always be somewhat underpowered and non-intuitive. That model is just not amenable to concepts such as inheritance and subtyping.
It is possible to build abstractions at the grove level using auxiliary groves but the mechanism for doing so is not specified in any standardized syntax. An auxiliary grove's relationship to its source grove is only specified in prose and implementing software. There is no formal grove mapping definition language. From the HyTime point of view, XML Property Objects could be seen as an extension to the grove model to allow this formal specification: a property set->property set mapping language.
Data binding
There are various proposals for binding [Data Binding] XML into application objects. Most of these build Java objects. This is great if your problem is to get XML into Java but it is not designed as a generalized XML abstraction mechanism. If you need to manage six different views of a purchase order, data binding only helps if these views are defined in terms of Java interfaces. If they are meant for other programming languages, for direct insertion into relational databases or for querying with a graph-based query language, data binding does not help. It is not clear whether data binding mechanisms will be as flexible in allowing mappings based on XPaths or a similarly sophisticated mechanism. Data binding mechanisms are designed to also allow Java objects to map back into XML. XML Property Objects do not support this.
For some problems, XML Property Objects may be a sufficient mechanism for building application objects into programming languages. In highly dynamic programming languages such as Python, Smalltalk, Javascript, Visual Basic and Perl, property object nodes can be handed directly to the application. In statically typed programming languages they can either be passed dynamically as dictionaries or statically by generating interface definitions from view declarations.
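In a dynamic language the hand-off can be as thin as a wrapper that exposes properties as attributes. The `PropertyNode` class below is a hypothetical sketch in Python and is not taken from the prototype implementation.

```python
class PropertyNode:
    """Expose a dictionary of XPO-style properties as plain attributes."""

    def __init__(self, properties):
        self._properties = dict(properties)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._properties[name]
        except KeyError:
            raise AttributeError(name) from None

buyer = PropertyNode({"first_name": "Ada", "last_name": "Lovelace", "initial": None})
full_name = f"{buyer.first_name} {buyer.last_name}"
```

Application code can then write `buyer.first_name` without knowing which elements or attributes of the source document supplied the value.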
XLink
XLink [XLink] and XPO both allow the assertion of links. As with RDF, the primary difference is that XLinks require you to use XLink syntax inside of your document. XML Property Objects are designed to allow mapping creators to bring out the hidden links underlying the XML structure whereas XLink allows the declaration of links inline. In a situation where well-formed documents are being exchanged without any schema or DTD, XLink is most appropriate.
Where negotiating a schema in advance is possible or the price of downloading it is not too high, XPO is probably a better way to assert relationships. It is less verbose, less intrusive and more "general" in that it allows the easy creation of links to adjacent, contained or otherwise related elements automatically.
XML Schemas
Grammar-based schema languages have many uses. They can be used to drive XML editors and to enforce XML content ordering. The enforced ordering can make efficient sequential processing possible. Unfortunately, they typically have no concept of link typing and the abstraction mechanisms are therefore link-unaware. More experience is necessary to determine whether these abstraction mechanisms are sufficient for strictly tree-based abstraction creation.
On the other hand, XML Schemas have the virtue that they are extensible. XPO elements can be embedded in with schema elements for purposes of describing the abstract, link-based representation for the information.
Conclusions
The XML Property Objects notation is in the early stages of its development and promotion. It shows great promise as a potential standard notation for expressing the underlying "object-oriented" structures embedded in XML documents. There is currently a prototype Data Binding implementation for mapping XML documents into Python objects and a Java implementation will be created if there is demand. Further development and standardization depends upon community interest. Contact the author if the paper piqued yours.
Bibliography
[Data Binding] http://java.sun.com/aboutJava/communityprocess/jsr/jsr_031_xmld.html
[DSSSL] http://www.jclark.com/dsssl/
[HyTime] http://www.hytime.org
[Omnimark] http://www.omnimark.com
[RDF Model and Syntax] http://www.w3.org/TR/1999/REC-rdf-syntax
[RDF Schema] http://www.w3.org/TR/rdf-schema
[XLink] http://www.w3.org/TR/xlink
[XPath Specification] http://www.w3.org/TR/xpath
[XPO Homepage] http://www.prescod.net/xpo
[XSLT] http://www.w3.org/TR/xslt