XML Europe 2001 logo21-25 May 2001
Internationales Congress Centrum (ICC)
Berlin, Germany

A Logic Approach to XML Document Update Query Specifications

Peiya Liu <pliu@scr.siemens.com>
Liang H. Hsu <lh@scr.siemens.com>
 PDF version    Latest version   

ABSTRACT

Currently most XML query language proposals including W3C XQuery, focus on retrieval aspects of query specifications. While retrieval is important, XML documents that could not be modified would not be interesting in many useful applications. This presentation will show a logic approach, based on Path Predicate Calculus, to XML update query specifications and the advantages over algebraic approaches.

Table of Contents

1. Introduction

Many document query languages such as SDQL, XML-QL, Lorel, YATL, XQL, Quilt, W3C XQuery etc, have been proposed for specifying XML document retrievals. But the specifications for XML document modification are missing. While retrieval is important in XML queries, XML documents that could not be modified would not be interesting in many useful applications. The modification operations, including insert, delete and update, are used to change document structures and content. A useful XML document manipulation language should include both the ability to query and the ability to modify the XML documents.

Furthermore, formalisms for document manipulation languages are currently still underdeveloped. An adequate formalism is critical for development and standardization of XML document manipulation language. In past, two formalisms have often been used for describing query languages in relational models (1)algebraic formalism, called relational algebra, and (2) logic formalism, called relational calculus. In the algebraic formalism, the queries are expressed by applying special algebraic operators on relations. In the logical formalisms, the queries are expressed by describing predicates that relation tuples in the result must satisfy. A formalism can be used to serve as a vehicle for evaluating the expressive power and limitations of the proposed query and manipulation languages. Most of the modern relational query languages embed within themselves one of these formalisms to specifying queries. Historically, calculus-based relational query languages are more prevalent than algebra-based languages due to declarative characteristics of logic formalisms. However, since underlying relational data models are different from tree document models, these formalisms for relational query languages cannot be directly used as the formalisms for document query languages

An algebraic formalism for specifying document query languages has been proposed in W3C Query Working Group. The algebraic approach is based on a specific data model of XML documents. Currently, it only supports retrieval query specifications with limited expressive power for describing document addressing axes used in XPath. Without adequate formalisms, in our view, this would hinder the development of document manipulation language design and standardization. It seems that the logic formalisms for XML document manipulation languages have not yet received enough attention in the community.

In this paper, we extend the logic formalism, called Path Predicate Calculus [Liu1 00] [Liu2 00]to support XML document manipulation including both retrieval and modification aspects. The document modification operations are deletion, insertion, and update of XML structures and content. This logic formalism can be directly based on XML document model and XPaths for specifying XML manipulations. In this path predicate calculus, the atomic logic formulas are element predicates rather than relation predicates in relational calculus. In a path predicate calculus language, queries and document manipulations describe desired document elements by specifying path predicates that the tree document elements must satisfy. The document modification specifications can be specified in a declarative manner by path predicate assertions

This logic approach has several advantages: (1) It can directly use XML document tree model rather than specific data models, which need indirect mappings from the tree document model as in algebraic approaches. Therefore, it can avoid the potential risks in creating inconsistency with XML InfoSet document model [XML Information Set]. (2) It can smoothly integrate XML document modifications with XML document retrievals. Document update queries are logically specified in a higher-level, concise, and declarative manner. (3) It can support the XML Schema: Datatypes framework for querying multimedia XML documents. Spatial, temporal, and visual datatypes and query constructs can be included in an XML manipulation language for multimedia XML documents.

The rest of paper is organized as follows. Section 2 describes Path Predicate Calculus. It is based on a restricted form of first-order logic to specify document addresses to make assertions about document elements. Section 3 describes our XML document manipulation language, MMDOC-QL to support modification specifications including insertion, deletion and update. Examples are given for these three types of modification. Section 4 describes the related work. Section 5 concludes the paper.

2. Path Predicate Calculus

A form of logic, called path predicate calculus, has been defined and embedded within our multimedia document query language, MMDOC-QL. Formulas in path predicate calculus are restricted forms of first-order predicate. For these logic-based queries and manipulations, we have designed two important predicates: element predicates andpath predicates, for asserting logical truth statements about document elements in a document tree. In the following, we will first describe all allowable formulas in this logic by recursively defining well-formed formulas and then show several examples of XML modification manipulations specified in this formalism.

Formulas in path predicate calculus are of the form P(x1, x2, ..., xn, c1,c2,.., cm, t1, t2, ..., tp, a1,...,aq, d1,...,dr) where x1, x2, ..., xn, c1, c2,.., cm, t1, t2, tp, a1,.., aq, d1,..., dr are free logic variables for representing element attributes, element contents, tag names, element addresses, and element datatypes respectively. An occurrence of a variable in a formula is "free" if that variable has not been introduced by a "for all" or "there exists" quantifier. Otherwise, it is a "bound" variable. Queries in this logic formalism are equivalent to finding all proofs to extistential closure of P(x1, x2, ..., xn, c1,c2,.., cm, t1, t2, ..., tp, a1,...,aq, d1,...,dr), i.e., to (EX x1) (EX x2) ( ... )(EX dr)P( x1, x2, ...,xn, c1,c2,..,cm, t1, t2, ...,tp,a1,...,aq, d1,...,dr). The detailed descriptions about relationships between logic computation and specification can be seen [Gallaire 78] [Brown 85]

The atomic formula is in either of the form:

  1. E(x1, x2, ..., xn, c, t, a), where E is an element predicate and each of x1, x2, ..., xn, c, t, a is a constant or variable. The predicate E(x1, x2, ..., xn, c, t, a) stands for a logic assertion that element "t" at address "a" contains "c" with attributes x1, x2, ..., xn in a document tree. An English-like notation for element predicate is(<%t> WITH attribute1=%x1, ..., attributen=%xn AT %a CONTAINING %c). For brevity, we can also use short versions with only needed variables in logic queries such as (<%t> WITH attribute1=%x1, ... attributen=%xn), (<%t> CONTAINING %c), etc., if a full version can be implied clearly in the context.

  2. mm-operator (x1, x2, x3, ..., xn), where x1, ...xn are constants, element address variables or element datatype variables. An mm-operator(x1, x2, x3, ..., xn) asserts logic predicates about spatial, temporal, or visual relationships of document segments based on abstract datatypes in XML Schema framework. [XML Schema Part 1: Structures] [XML Schema Part 2: Datatypes] [XML 00]. The multimedia object descriptions can be specified as XML elements with spatial, temporal or visual datatypes. Examples of spatial datatypes are points, polylines, areas, etc. Examples of temporal datatypes are instants, intervals, periods, etc. Spatio-temporal data types can also be defined by combining both spatial and temporal datatypes into composite ones such as "changing area over a period of time". Examples of visual datatypes are color, texture, etc. Based on abstract data types, many spatial, temporal and visual mm-operators such as area-overlap, inside, nearby, time-before, time-after, color-similarity, etc., can be defined for querying and modifying multimedia XML documents [Liu1 00]. A similar technique for specifying moving objects was proposed by [Erwig 99] in relational databases. Current MPEG-7 [MPEG Web Site]content description scheme is also moving towards this direction to use XML datatypes to specify multimedia content.

  3. x op y, where op is an arithmetic comparison operator and x, y are either constants, element attribute variables, or element datatype variables.

  4. TYPEP(x) where x is either a constant, datatype variable, or address variable for asserting logic truth about an element datatype.

All other allowable logic formulas are recursively defined from atomic ones.

  1. (Boolean formula) If P1 and P2 are well-formed formulas in path predicate calculus, then P1 AND P2, P1 OR P2, and NOT P1 are all well-formed formulas for asserting "P1 and P2 are both true", "P1 or P2 or both are true" and "P1 is not true" respectively.

  2. (Path Predicate) If both P1 and P2 are well-formed formulas havng at least one element predicate then (P1 "axis-op" P2) is also a well-formed formula for asserting logic truths P1 with path constraint P2 about document elements in a document tree. The "axis-op" is one of W3C XPath [XPath]axis operators. Examples are (a) parent/child relationship operators such as: INSIDE, DIRECTLY INSIDE, CONTAINING, DIRECTLY CONTAINING, etc. and (b) the sibling relationship operators such as: BEFORE, IMMEDIATELY BEFORE, AFTER, IMMEDIATELY AFTER, SIBLING, IMMEDIATELY SIBLING, etc. Note that here we illustrate a logic version of axis concepts defined in XPath since path formula in Path Predicate Calculus are logical statements for asserting logical truths. An example of the path predicate is: (<bibref> INSIDE (<gcapaper> CONTAINING (<fname> CONTAINING "Peiya") AND (<surname> CONTAINING "Liu">))))for specifying all bibref elements inside Peiya Liu's gcapaper.

  3. (Quantified formula): If P is a formula, then (EX x)(P) is also a formula. The symbol EX is a quantifier read "there exists". The occurrences of x that is free in P are bound to (EX x)(P). The formula (EX x)(P) asserts that there exists a value of x such that when we substitute this value for all free occurrences of x in P, the formula P becomes true. The only other quantifier is ALL can be defined in a similar way. If P is a formula, then (ALL x)(P) is also a formula. The symbol ALL is a quantifier read " for all". The occurrences of x that are free in P are bound to (ALL x)(P). The formula (ALL x)(P) asserts that all possible values of x such that when we substitute any such a value for all free occurrences of x in P, the formula P becomes true.

Note that domains of variables in P are finite in this path predicate calculus since in a particular document instance for being quired, there are finite numbers of element attributes, element contents, tag names, element datatypes and element addresses. This "safe" property is required to avoid finding all proofs of query formula over infinite domains. In a real query language design, we can further restrict variables by using regular expressions for allowable variable patterns as shown in the next section.

3. MMDOC-QL: XML Document Manipulation Language

MMDOC-QL is our multimedia document manipulation language which embeds within it the notation of path predicate calculus to describe XML retrieval and modification. The XML document we used in the following examples is a conference paper as this one. In this XML paper document, there are many places having bibref and xref tags for cross-references to cited papers in the end, figure tags in the middle of content, etc. by using id/refloc attributes. An example of the XML insertion is in the form of "Insert HREF hypertext links for all bibref tags in my XML paper where the destinations of each link should be one of bibitems in paper references".

 
		INSERT: (<a> WITH href="#"%refloc)
           DIRECTLY CONTAINING
           (<bibref> WITH refloc=%refloc)
 PATTERN: {*[0-9][0-9]/%refloc};
 FROM:     mygcapaper.xml
 CONTEXT: {(<bibref> with refloc=%refloc)
           AND(<bibitem> with id=%refloc)};
		

In MMDOC-QL, there are four clauses: operation clause (GENERATE, INSERT, DELETE, or UPDATE) is used to describe the logic conclusions in the form of allowable element predicates and path predicates. In this paper, we focus on XML modification operation clause (INSERT, DELETE, or UPDATE). The retreival operation clause GENERATE can be seen [Liu2 00]. PATTERN clause is used to describe the domain constraints of free logical variables including tag, attribute, content, address and datatype, by using regular expressions. FROM clause is used to describe source documents for querying. CONTEXT clause is used to describe logic assertions about document elements in allowable logic formulas in path predicate calculus. FROM and CONTEXT clauses are paired together and there could be multiple pairs for describing multiple sources. The logic variables are indicated by "%" such as "%refloc".

In this example, the path formula in CONTEXT clause asserts that element “bibref” with attribute “refloc” value is equal to the attribute “id” value of an element “bibitem” in paper reference section. INSERT clause describes where to insert href link elements <a href=”#xx”> in XML documents. The domain of logic variable %bibref is a set of strings with the last 2 characters restricted to be digits. An instance of the inserted link is shown below

...
<gcapaper>
    <front> ..</>
    <body>
      ...
           <a href="#liu100> <bibref refloc="liu100"></a>
      ....
      ....
    </body>
    <rear><bibliog>
        ...
        <bibitem id="liu100"> ... </bibitem>
        ...
        </bibliog>
    </rear>
</gcapaper>

3.1. Deletion Query Specification

In deletion query specifications, we use DELETE operation clause with the same logic formula as one in INSERT operation clause for removing all href links. In this example, we show how to use element address variable “%a” to bind element addresses of found source anchors in CONTEXT clause. DELETE clause use the found addresses to describe where to remove href hyperlinks.

 
DELETE: (<a> WITH href="#"%refloc)
             DIRECTLY CONTAINING (AT %a))
 PATTERN: { *[0-9][0-9]/%refloc};
 FROM:     mygcapaper.xml
 CONTEXT: {(<bibref> with refloc=%refloc AT %a)
           AND(<bibitem> with id=%refloc)};
			  

3.2. Insertion Query Specification

Another example of XML insertion specification is to add hyperlinks to figure references for all papers. This example show a flavor of our automatic hyperlinking on large-scale documents based on link specifications [Hsu 98]

 
INSERT: (<a> WITH href="#"%refloc")
           DIRECTLY CONTAINING (AT %a))
 PATTERN: {"fig"[0-9][0-9]/%refloc};
 FROM:     mygcapaper.xml, ..., yourgcapaper.xml
 CONTEXT: {(<xref> with refloc=%refloc AT %a)
           AND (&lt;figure> with id=%refloc)};

3.3. Update Query Specification

Update specifications in MMDOC-QL are specific combinations of insertion and deletion specifications. In the following, it specifies how to change the e-mail address in mygcapaper.xml to “peiya.liu@scr.siemens.com”. We use a SET subclause to specify the update assigments in target elements.

 
UPDATE: ( {<email> DIRECTLY CONTAINING %e-mail):
            SET
           (%e-mail = "peiya.liu@scr.siemens.com")) 
 PATTERN: {};
 FROM:     mygcapaper.xml
 CONTEXT: { TRUE ):

4. Related Work

Two categories of related work are described here. One is related to document manipulation languages and the other is related to formalisms within which data manipulation languages are embedded. MMDOC-QL distinguishes itself from other work [XML-QL 99] [Lorel 00] [YATL 98] [XQL 98] [SDQL 96] [Quilt 00 ] [XQuery 01]in embedding a logic formalism for dealing with document manipulation specifications including both retrieval and modification aspects and for dealing with multimedia XML queries based on spatial, temporal, and visual relationships. In the following, we describe related formalisms in data manipulation languages.

Two popular formalisms for query languages in relational models are[Ullman 88] (1) algebraic formalism, called relational algebra, and (2) logic formalism, called relational calculus including tuple relational calculus[Codd 72]and domain relational calculus[Piottee 78]. In algebraic relational query languages, the queries are expressed by applying special algebraic operators on relations. The atomic formula in tuple relational algebra is R(u), where R is a relation predicate and u is a tuple variable. In logical relational query languages, the queries are expressed by describing predicates that relation tuples in the answers must satisfy. The atomic formula in domain relational calculus is R(x1, x2, ..., xn) where R is n-ary relation predicate and x1, ..., xn are variables over table field domains. However, since underlying data models are different from document models, these atomic formulas cannot be used in formalisms for document query languages. There is a need for a new logic formalism which can be used to assert logic truth statements about document elements in a query specification.

Two algebraic formalisms IMO and LA are described in W3C XML Query Working Group notes[IMO 99 ] [LA 99] and the latest proposed formalism is in a W3C working draft[XML Query Algebra]. The algebraic approaches use specific data models of documents. In IMO algebraic approach, an XML document is modeled as a directed graph. This query algebra provides several operations on the data models. These operations are used to select documents. Examples of the operations are: navigation operator to express the next document element to traverse, selection operator to express a selection over a collection of data, join operator to express a specific collection of document elements, etc. In LA algebra, an XML document is modeled as nested relations. The algebra is actually derived from nested relational algebra. In addition to relation operators, the query algebra provides additional operators such as regular expressions, structural recursion, etc. The proposed XML Query Algebra based on a specific data model [XML Query Data Model] is mainly derived from LA Algebra. The formalism does not support XML update and full XML document addressing mechanisms currently. One of the disadvantages in using algebraic formalisms for document query languages is that a data model of documents need to be specifically designed to have proper algebraic operators since there is no algebra system currently available on XML document model.

5. Conclusion Remarks

Current proposed XML document query languages have some limitations in specifying XML modifications. Most of these languages focus on retrieval aspects. In this paper, we have shown a different approach to describe a logic formalism for specifying XML insertion, deletion, and update operations based on document model and addresses. Although some features are not included here such as logic forms for different node types in a document tree, user-defined operators, etc., this paper intends to cover the essential features and to show the flavors of document predicates for describing XML document modification in our path predicate calculus.

The main contributions of this paper are (1) to provide a logic-based formalism, path predicate calculus, for document manipulation languages to specify XML retrieval, insertion, deletion, and update. We feel that this direction of research is important to advance XML document manipulation language design, development and standardization. (2) to formalize document paths and elements in first-order predicate forms for logic-based document modification languages. Element predicates and path formulas are used to assert fundamental logical truth statements about document elements in a document tree for specifying document modification. These predicates and path formulas can be viewed as a generalized version of relational predicates in relational models when applying to XML tree document model. Particularly, it incorporates the crucial notion of XML document address and axes used in XPath for document manipulation languages.

Bibliography

[Brown 85] IF. Brown and P. Liu, A Logic Programming and Verification System for Recursive Quantificational Logic, Proceedings of the Nth International Joint Conference on Artificial Intelligence(IJCAI-85), 1985 Los Angeles, CA
[Liu1 00] P. Liu, L. H. Hsu, Spatial and Temporal Datatypes: An Approach to Specifying and Querying Multimedia Objects and Scheduled Structures in XML Documents, XML Europe 2000, Paris, 2000
[Liu2 00] P. Liu, A. Chakraborty, L. H. Hsu, Path Predicate Calculus: Towards a Logic Formalism for Multimedia XML Query Language, Extreme Markup Languages 2000, Montral, Canada,
[IMO 99 ] David Beech, Ashok Mallotra, Michael Rys, A Formal Data Model and Algebra for XML W3C XML Query Working Group Note, September 19999
[LA 99] M. Fernandes, J. Simeon, D. Suiu, P. Wadler, A Data Model and Algebra for XML Query, W3C XML Query Working Group Note, September 19999
[Codd 72] E. F. Codd "Relational completeness of data base sublanguages", in Data base Systems[R. Rustin, ed) Prentice-Hall, Englewood Cliff, New Jersey. 1972
[Piottee 78] A Piottee."High Level Data Base Query Language", in Logic and Database, (H. Gallaire and J. Minker ed), 1978
[Gallaire 78] H. Gallaire, J. Minker and J. M. Nicolas."An Overview and Introduction to Logic and Database", in Logic and Database, (H. Gallaire and J. Minker ed), 1978
[Ullman 88] J. Ullman."Principles of Database and Knowledge-Base Systems", Volume I, Computer Science Press, 1988
[SDQL 96] ISO 10179:1996 Information Technology -Processing Languages - Document Style Semantics and Specification Language (DSSSL)
[Erwig 99] M. Erwig, R. H. Guting, M. Schneider and M. Vazirgiannis, Spatio-Temporal DataTypes: Approach to Modeling and Querying Moving Objects in Databases, GeoInformatica Vol 3, No 3, 1999 .
[XML 00] Extensible Markup Language (XML) 1.0 (Second Edition), W3C Recommendations 6 October 200,
[XPath] XML Path Language (XPath) 1.0, W3C Recommendations 16 November 1999
[XML Schema Part 1: Structures] XML Schema Part 1: Structures, W3C Candidate Recommendation 24 October 2000
[XML Schema Part 2: Datatypes] XML Schema Part 2: Datatypes: W3C Candidate Recommendation 24 October 2000
[XQuery 01] XQuery: A Query Language for XML: W3C Working Draft 15 February 2001
[XML Query Algebra] XML Query Algebra: W3C Working Draft 15 February 2001
[XML Query Data Model] XML Query Data Model: W3C Working Draft 15 February 2001
[XML Information Set] XML Information Set W3C Working Draft 2 February 2001
[MPEG Web Site] http://www.cselt.it/mpeg/standards.htm
[Hsu 98] L. H Hsu, P. Liu, B. Johnson-Laird, and S. Vdaygiri, Authoring, Managering and Browsing of Large-Scale Hyperlinks in Multimedia Product Documentation, GCA Markup Technologies 98, Chicago, USA,
[XML-QL 99] A Deutsch, M. Fermandez, D. Florescu, A. Levy and D. Suciu: A Query Lanuage For XML, WWW'99
[YATL 98] Your Mediators Need Data Conversion, ACM-SIGMOD 1998
[Lorel 00] S. Abiteboul, P.Buneman, and D. Suciu, Data on the Web, Published by Morgan Kaufsmann, 2000
[Quilt 00 ] J. Robie, D. Chamberlin and D. Florescu, Quilt: an XML query language, XML Europe 2000, Paris, 2000
[XQL 98] J. Robie abd J. Lapp, XML Query Language, QL'98, http://www.w3c.org/TandS/QL/QL98/

Biography

Peiya Liu
Senior Member of Technical Staff
Siemens Corporate Research, Inc.
Princeton
New Jersey
USA
Email: pliu@scr.siemens.com Web: www.scr.siemens.com

Peiya Liu - Peiya Liu is a project manager and senior member of technical staff at Multimedia Documentation Program, Siemens Corporate Research, Inc. He has spoken at the following industry conferences: XML Europe 2000, Markup Technologies 98 and 99. IEEE Multimedia 94, 96. He has severed as a program committee member in many conferences. He has conducted tutorials on SGML/XML-based multimedia documentation in academic and industrial conferences. He is currently the standards section editor of IEEE Multimedia magazine and on the editorial board of Kluwer Journal of Multimedia Tools and Applications.

Liang H. Hsu
Manager and Distinguished Member of Technical Staff
Siemens Corporate Research, Inc.
Princeton
New Jersey
USA
Email: lh@scr.siemens.com

Liang H. Hsu - Liang H. Hsu is a distinguished member of technical staff and head of Multimedia Documentation Program, Siemens Corporate Research, Inc. Prior to joining Siemens, he worked for Digital Equipment Corporation as a field service support. He has many years of industrial experiences in SGML/XML technologies. His current interests include conversion of legacy documents into SGML, SGML-based document composition and hyperlinking for complex products, multimedia document delivery mechanisms, and browsing and navigation support for service-related applications.