|
A canonical query language & its efficient implementation
|
 |
The XML community is eagerly awaiting a W3C recommendation for a XML
query language. However, it has been estimated that it may take more than
a year until that recommendation reaches a final stage. Therefore, it may
be worthwhile to study in the meantime the intrinsic characteristics of query
languages for XML.
A number of query languages have been initially proposed for XML. Also
some simple query mechanisms are part of (proposed) recommendations, like
in XSL and Xpath.
It has been argued (e.g. in Cotton, P., and Malhotra, A., “Candidate
Requirements for XML Query”, Nov. 30, 1998) that the syntax for the
query formalism should be XML itself, that it will support full-text queries,
and that fast retrieval programs could be generated automatically based upon
the query formalism.
These requirements are so general and intuitive acceptable that it may
be conjectured that they will be the core of any future query language for
structured documents. As such, we might give it the name “canonical”
query language (“CQL”).
In this paper we show how the requirements for CQL can be met fully
by an extension of well-known techniques for a. the description of grammars,
b. pattern matching and c. the generation of finite state automata. These
techniques can straightforward be extended towards pattern grammars for natural
language.
Introduction
Two major approaches still have to be blend: one from the database community
and one from the document community. We quote from
[Fernandez99]:
“The database community is concerned with large repositories of
data, integrating data from heterogeneous sources, exporting new views of
legacy data, and transforming data into common data-exchange formats...The
document community is concerned with full-text search, queries of structured
documents, integrating full-text and structured queries.”
“The database community.. has learned about query complexity,
algebra’s, and techniques for implementing these query languages efficiently...What
is known regarding the expressive power of query languages should play a central
role in the design of a query language for XML.”
[Robie99]adds: “Many XML documents are
very text-centric, and many systems that store XML contain full-text engines.
In markup that intermixes text and structure, it can be extremely helpful
to be able to mix full text conditions with structured conditions to express
a sequence of text patterns and structures.”
[Maier98]argues that “selection, extraction,
reduction, restructuring and combination should all be possible in a single
XQuery” because “XQuery should be suitable for server-side processing”
(where XQuery is an abbreviation for a XML query language).
[XmlQuery-req]requires that “both document-oriented
and data-oriented queries may be performed on documents with embedded data,
such as catalogs, patient health records, employment records, or business
analysis documents”. The requirements ensure that the complete functionality
of XML documents is retained. As one starting point for the definition of
a query language XPath is mentioned.
The blending of the approaches of both communities requires that database
people are aware of the complexity of the integration of full-text and structured
queries and that document people are aware of the complexity of database and
query language design and implementation.
The document community has been experimenting for years with complicated
queries on structured documents. The query languages for XML that have been
proposed up till now are not sufficient for these queries.
In our approach we experimented with formalisms for
NLP and extended them with
PM facilities. As a subset of the formalism, a query language
for structured documents originated. The query language can be read as an
Extended Context Free Grammar with orthogonal extensions for Pattern Matching.
It may be expressed in XML or may be written as a superset of XPath. It is
the richest form of the selection part of any of the XML query languages proposed
so far and provides for the expression of Selection, Projection and Join.
Therefore, we call it, for the moment,
CQL.
The main difference with the approaches up till now is that a query
is expressed as a pattern grammar, akin to a DTD enriched with wild cards,
Boolean constructs and variables, to be applied to both markup and #pcdata.
This paper discusses some of the design considerations of
CQL
and the architecture of the related processes. The relation with the requirements
of
[XmlQuery-req]is indicated. The mapping onto
CQL of some proposed XML query languages and of
standard relational database queries is shown, together with the straightforward
extensibility towards pattern grammars for natural language.
The efficiency is illustrated by innovative principles for compilation
into a Finite State Automaton with skip instructions. We indicate how full
text indexes for structured documents could be exploited in order to unlock
the potential of optimization techniques of the database community.
Design considerations of CQL
Relation with NLP and PM
In our approach, a document query may be viewed as a pattern that has
to be located in a document. As such the processing of a query is akin to
syntactic pattern recognition.
This approach considers queries as the description of a language, formed
by all the possible answers. Formulated as pattern grammars they describe
a set of sentences, as does the language generated by that grammar. The processing
of a query may then be regarded as the recognition of the data base which
acts as one input-sentence to the pattern grammar. Non-determinism during
recognition accounts for the possibility of more answers.
Very powerful query languages may be constructed by adding constructs
for patterns, like wildcards and Boolean operators, to linguistic grammar
formalisms. There is a price: runtime performance may be reduced dramatically.
However, staying within the Chomsky hierarchy, more efficient programs for
formal automata can be generated.
CQL
is a pattern grammar formalism that is stripped down in such a way that queries
within CQL may be compiled into a
FSA.
A set of separate document queries, including their context, may be combined
into one pattern grammar and therefore may be compiled into one (deterministic)
FSA. In the formalism, no sequence of processing
is indicated. Partial overlapping patterns (like ‘abc’ and ‘bcd’)
do not cause problems.
Relation with database query languages
A simple link exists between syntactic recognition and querying in document
trees. In the latter case we formulate queries in which the vertical axes
in the tree are important. In procedural query-languages one navigates in
a horizontal and vertical direction through the tree-structure. However, the
expressive power of the queries can be increased considerably when we allow
also for the full power of grammatical notation along the vertical axes. A
simple start in that direction has been made by XPath.
By allowing grammar-notations on the horizontal axes, the possibility
of linguistic descriptions on free text is introduced.
The constructs for patterns allow for universal and existential quantification.
The advantage of a relational model may be the simple use of such operators
as "selection", "projection" and "join", well-known from relational algebra.
The last-mentioned operator introduces some kind of intelligent combination
of stored relations that is absent from document query languages.
CQL provides for variables which may get an assignment from the document.
Operations on the variables make possible the functionality of the “join”
operator.
Architecture of processes
The processes for CQL have the same anatomy as those for a grammatical
system.
- 1. A compiler generates code for a formal machine, to be interpreted
e.g. by a Java or C program. The compiler takes into account additional information
stemming from the schema. The compiler is optimized in several ways. For instance,
if a pattern grammar is also allowed to contain rules from a higher order
grammar (e.g. for the recognition of natural language expressions) a switch
is generated to a recognizer for that type of grammar.
- 2. The formal machine consists of a PM machine and a scanner.
- 1. The PM machine implements a FSA with skip instructions and generates
messages for subsequent result manipulation.. The messages consist of symbols
and their position in the document(tree).
- 2. The scanner has access to XML documents and their indexes, it resolves
namespaces, entities and other references, follows links, performs filtering,
allows for text that spans boundaries of floating elements, checks datatypes
(e.g. represented by the XML Query Data Model), provides for environment information
and adjusts itself to the data representation (e.g. DOM or XML text).
- 3. A report generator receives the messages from the formal machine
and combines them into document fragments.
This architecture facilitates in a modular way most of the requirements
of
[XmlQuery-req].
Notation of CQL
The syntax of CQL may get the syntactic binding of XML or that of the
forthcoming Schema language, but it may also be regarded as an extension to
XPath. The syntax is denoted as an EBNF grammar (as used for the definition
of e.g. XML itself) with extensions for patterns, variables, Boolean constructs,
document trees and notations for output.
Because we deal with pattern grammars, one might say that a document
is matched by a pattern grammar.
A document is not required to contain markup.
Grammar formalism
The grammar contains Nonterminals (starting with a capital) and terminals.
In this paper the terminals are single characters or strings, but they could
be of other datatypes. Characters and strings are delimited by single or double
quotes.
The reserved symbols of XML are not quoted in order to distinguish them
from strings in #PCDATA.
Nonterminals are rewritten in grammar rules according to the EBNF syntax.
Rules may be left and/or right recursive, but infix recursion is not allowed.
On the right-hand side of a rule, the meaning of symbols in regular
expressions is as shown below. The notation deviates slightly from the usual
notation in order to remain consistent with the syntax of DTD’s.
('a..z' | 'A..Z') (instead of [a-zA-Z])
represents any character with a value in the range(s) indicated
^'a..z' (instead of [^a-z])
represents any character with a value outside the range indicated
^('a'|'b'|'c') (instead of [^abc])
represents any character which is not 'a', 'b' or 'c'
A, 'b'
the nonterminal A followed by the character 'b'.
If r, r1 and r2 are regular expressions then the following ones are
also regular expressions (as in XML):
expression meaning
r1 r2 : concatenation of r1 and r2
r1| r2 : r1 or r2
(r)* : 0 or more repetitions of r
(r)+ : 1 or more repetitions of r
(r)? : 0 or 1 r.
Actions
At any place in a right-hand side (rhs) of a rule so called “actions”
may be placed: a sequence of operations written between braces. The most important
ones are operations for variables, Boolean constructs between rules and for
output.
The actions may be placed before a ',', a '|', a ')' and a '.'. Actions
are executed between the matching of two symbols. In
[VanderSteen88]the
details are described of the combination of actions and regular expressions.
Examples of actions are included in the next subsections.
Patterns
The reserved symbols for patterns are:
| ? | don't care
sign for one arbitrary character |
| = |
arb sign; an "arb" matches any sequence of characters,
possibly empty, but excluding the characters which may follow the arb; anyhow
the arb ends at the next end-tag |
| - |
line sign; a “line” matches any sequence
of characters, possibly empty, including the characters which follow the "line",
but only up to the next end-tag |
Table
1
The line sign has its equivalent for the vertical axes in a tree.
| <..> | any
sequence of start-tags |
| <../> |
the sequence of end-tags that corresponds to the start-tags
of the preceding <..> |
| <..title> |
any sequence of start-tags that ends in <title> |
| <../title> |
the sequence of end-tags that ends in </title> |
Table
2
At the end of a query trailing sequences of lines and end-tags may be
left out.
Examples.
- 1. Comment ::= '!' = '!'.
A sequence of characters which starts with an exclamation mark and which
ends with an exclamation mark is matched as Comment. The characters which
are matched by the "arb" will not contain a '!'.
- 2. Sentence ::= = '.'.
Sentence is defined as the maximal character sequence not containing
a dot, followed by a dot. An input such as: 'Dr. West reads his mail.' will
be matched by the nonterminal Sentence up to the dot in 'Dr.'.
- 3. Sentence ::= - '.'.
Sentence is defined as any character sequence followed by a dot. The
above sentence will be matched two times, once for ‘Dr’ and once
for 'Dr. West reads his mail’.
- 4. S ::= - P1 - | - P2 -. P1 ::= ‘a’. P2 ::= ‘b’.
S matches any string (between a start- and end-tag) that contains either
‘a’, ‘b’ or both.
Boolean negation within a rule
The negation of a string s matches all strings with the same length,
but not equal to s.
The negation of a nonterminal N matches all strings with length between
the length of the shortest and the largest string that can be matched by N,
but not equal to any string that can be matched by N.
The negation of a regular expression E is the same as for a nonterminal,
if the nonterminal rewrites to E.
Examples
^’a’ an arbitrary character, but not an 'a'
^ (‘abc’) a sequence of 3 characters, which is not 'abc'
^A where A is a nonterminal. ^A matches any sequence of
the same length as one of the sentences matched by A,
which is not equal to such a sequence
^ ('ab' | 'bcd') a sequence of 2 or 3 characters, which is not 'ab'
and not 'bcd'
^ (- a..c) any sequence of characters which does not end in 'a',
'b' or 'c'
^ (- (A | B) C -) any sequence of characters which does not contain a
string which may be matched by A C or B C
NP::= - (^Adj) N - . a N(oun) P(hrase) with a N(oun), but without an
Adj(ective) preceding that Noun. (Adj and N have to
be rewritten further.)
Boolean intersection of rules
Rules may be intersected with the symbol '&';.
1. S ::= - P1 - & - P2 -.
S matches any string (between a start- en end-tag) that contains both
P1 and P2.
This may be refined with the ‘cooperation symbol’ C, written
within actions and suffixed by a label for a cooperation point.
2. S ::= - P1 - {C:1} P3 - {C:2} | - P2 - {C:1} P4 - {C:2}.
Here we express that S matches strings w = u v such that u is matched
by - P1 - as well as by - P2 - and that v is matched by - P3,- as well as
by - P4 -. We might say that the matching of P3 and P4 has to wait for the
matching of P1 and P2, and that at least at the end of the input both alternatives
of the rhs have to be matched. The cooperation points are ‘1’
and ‘2’.
This resembles the cooperation of parallel processes. The two rhs's
may be waiting for each other at the rendezvous 1, consuming input, which
is possible because of the appearance of lines, which may match an indefinite
number of input symbols.
In general, when all rhs's which contain the same cooperation name are
matched up to the location where the name is denoted as a cooperation symbol
then all these rhs's may continue, otherwise they all fail.
The device of cooperating rules is especially useful in the specification
of complex conditions in trees in a non-procedural way.
The following pattern grammar:
S ::= - Subject - {C: SVO} | - Verb - {C: SVO} | - Object - {C: SVO}.
is equivalent to:
S ::= - Subject - & - Verb - & - Object -.
Tree structure and XML markup
In a CQL query all XML markup is denoted in the same way as it is written
in an XML document.
The end-tag </> closes the preceding start-tag.
The reserved symbols of XML are not quoted. Strings occurring in markup,
like names of tags and attributes, values of attributes, PI’s etcetera
can be treated with patterns like the text in #PCDATA.
For instance, the query:
S ::= <..> - < - ^’ing’ - > - ‘Charles.’</><../> .
returns all tags with tagnames which do not contain the string “ing”
and which #PCDATA ends on ‘Charles.’.
Attributes are denoted in the same style as in XPath: they are written
as if they are tags themselves, but they start with ‘<@’.
For instance, the above query may be extended towards:
S ::= <..> - < - ^’ing’ - ><@’title’><@ - ’number=’ 1..3> - ‘Charles.’</><../> .
where one attribute name has to be ‘title’ and another attribute
name has to end on ‘number’ and has to have a value between 1
and 3. It is up to the scanner to allow attributes to appear in any sequence.
Variables
A variable is declared by its first appearance in a rhs. The scope of
the variable is the grammar rule with that rhs. The name and the value of
a variable are strings of arbitrary length, following the syntax of attributes
in XML. With the coming Schemata, typed variables may be added (as we did
in our implementation).
The possible operations on variables, written in C-style, may be summarized
by :
- assignment to variables from the last read-in symbol of the text.
The reserved symbol '%' stands for the last symbol that is read in.
- assignment of expressions with variables and constants. Within an
expression variables may be concatenated, together with string-constants and
the last read-in character. The concatenation operator is '||'.
- tests on (expressions of) variables. A test concerns the truth-value
of a Boolean expression. If this value is "false" then the current matching
fails.
- calls to external procedures. A procedure-call consists of the name
of an external procedure which is written in some programming language, and
parameters in which (expressions of) variables may be passed. The first parameter
is a Boolean. If it returns the value "false" then the current matching fails.
Nonterminals may be enriched with parameters. The parameters of nonterminals
at a lhs have to be preceded by the declarations "I:", "O:" or "IO:", which
correspond respectively with the features "inherited", "synthesized" and the
combination of both, which are familiar in attribute grammars.
Examples
- 1. assignment to a variable from the text :
S(O:A,O:B) ::= (a..g {A = A || %} | h..z {B = B || %} )*.
After matching of the input all characters in the range ‘a..g’
are appended in variable A and all characters in the range ‘h..z’
are appended in variable B. A and B will be reported.
- 2. Call of an external procedure :
S ::= - NP(singplu1, time1, head) { Sem(O:continue, I:time1, I:head,
IO:expect) },
VP(singplu2, time2, expect) { singplu1 == singplu2; time1 == time2 }.
After the matching of an NP the routine "Sem" is called with as input
parameters "time1" and "head" and as output parameters "continue" and "expect".
If the value of "continue" upon return of Sem is "false" then further matching
of this path is inhibited (there may be more paths active because this is
a pattern grammar). Else matching continues with the nonterminal VP which
gets the variable "expect". After matching of VP the comparisons within the
actions are executed. If one of them returns false then further matching is
inhibited.
Reports
There are two report functions. The "S"-function (for "signal") reports
number which follows the S. The "R"-function (from "report") does the same,
but adds to it the last symbol that is read in. There may be more symbols
in case the notion before the action where the report-function is denoted
is an arb ('=') or a line ('-'). In that case all the symbols which are covered
by that notion are concatenated and reported. Consecutive reports with the
same number are merged into one report with all symbols concatenated.
The numbers will be reported together with their position in the document
after a pattern matches completely. The position may be a filepointer or a
tree pointer. Alternatives and wild cards are responsible for the generation,
in principle, of overlapping sequences of reports. The system may represent
the overlap efficiently in a dag structure.
Examples
1. pattern matching with 4 keywords, signaling a match at the end of
a keyword.
A ::= - 'he' {S:1} - | - 'she' {S:2} - | - 'his' {S:3} - | - 'hers' {S:4} - .
The same result is obtained by writing :
A ::= - 'he' {S:1} 'rs' {S:4} - | - 'she' {S:2} - | - 'his' {S:3} - .
If the input is 'ushers' then the output message will be '(1, pos1)
(2, pos2) (4, pos4)', where posi is a position in the document.
B ::= C {R:1} = {R:2} (d ? {R:3} )+ e .
C ::= c | f .
If the input is 'cabcdededee' then the output message will be '(1c,
pos1) (2abc, pos2) (3eee, pos3) '.
Examples of CQL queries
Estate inventories
The following is an example from estate inventories from the 17th century
in the Dutch city of Delft.
Suppose the document is :
<BOEDEL><KLASSE>D</KLASSE>
<GEZIN>G</GEZIN>
<ORG><RN>HUIS<KAT><VR><VO>HUIS</VO>
<PR>WOON</PR>
</VR>
</KAT>
LAND<KAT><VR><VO>TUIN</VO>
</VR>
</KAT>
EFFECT<KAT><VR><VO>OBLIGATIE</VO>
<BIJZ>FRANKRIJK</BIJZ>
<AA>1 GELD</AA>
<HOEV><GT>2</GT></HOEV>
</VR>
<TAX>2</TAX>
<VR><VO>OBLIGATIE</VO>
<BIJZ>PLANTERS OP DE EILANDEN</BIJZ>
<RENTE>4</RENTE>
</VR>
<TAX>1</TAX>
</KAT>
</RN>
<ORG>
</BOEDEL>
Query: "Give a report of all taxation’s
of stocks (EFFECT) for people who own houses and land"
CQL:
S ::= - <BOEDEL> - <ORG><RN>A & B & C</RN> - </BOEDEL> .
A ::= - ‘HUIS’ <KAT> - .
B ::= - ‘LAND’ <KAT> - .
C ::= - ‘EFFECT’ <KAT> - <TAX> ? {R:1} </TAX> - </KAT> - .
Result: The input will be matched, and both '2'
and '1' will be matched by the '?'.
Parse trees
The second example concerns the matching at arbitrary levels in a parse
tree.
If the tree is :
<S>NP
<VP>V
<NP>ADJ
N
<REL>...
<NP>DET
ADJ
N
</NP>
...
</REL>
</NP>
</VP>
</S>
and the pattern grammar is :
ADJ-NP ::= <S><..> - <NP> - ‘ADJ’ - </NP> - <../></S> .
which may be abbreviated to:
ADJ-NP ::= <S><..NP> - ‘ADJ’.
then the result will be two matches in the tree as shown below :
Enriched corpora
The third example concerns the querying of enriched corpora of natural
language texts.
The following DTD describes a corpus with paragraphs
of different authors. The words in the sentences are coded with their lemma
and with a morphological code (like ‘noun’ or ‘adverb’).
<!ELEMENT corpus (subcorpus*) >
<!ELEMENT subcorpus (identification, paragraph*) >
<!ELEMENT paragraph (heading, body) >
<!ELEMENT heading (date) >
<!ATTLIST heading author CDATA #REQUIRED>
<!ELEMENT body (sentence)* >
<!ELEMENT sentence (word_group)* >
<!ELEMENT word_group (word, lemma, morphological_coding) >
<!ELEMENT identification (#PCDATA) >
<!ELEMENT author (#PCDATA) >
<!ELEMENT word (#PCDATA) >
<!ELEMENT morphological_coding (#PCDATA) >
Query: retrieve all occurrences in the work of
Shakespeare where a sentence ending with a form of the lemma 'work' is followed
by a sentence ending with a noun or an adjective in 'ing'.
CQL:
S ::= - <..heading><@author=’Shakespeare’> - <../heading><body> - CONT1.
CONT1 ::= <sentence> - <word_group> - <lemma>’work’</><../sentence> CONT2.
CONT2 ::= <sentence> - <word_group><word> - ‘ing’</word> -
<morphological_coding> (‘noun’ | ‘adjective’) </><../sentence>.
Comparison with XPath
It is straightforward to show that XPath is a subset of CQL.
The difference in philosophy is that with a location path in XPath a
specific node in the document tree is reached. In CQL a query describes a
complete document (or document tree), where wildcards and Boolean operators
denote which parts are relevant or not.
Implicit in a CQL query are an indefinite number of location paths and
nodes which may be located. No precedence constructs are necessary because
all contexts are implicit in the pattern grammar. The sole mechanism for the
creation of reports are the notations for Signals and Reports which indicate
precisely a location in the document (be it implemented as a filepointer,
a pointer in a Dom or in a document tree). However, the report is only generated
when the whole pattern grammar matches the whole document. That means that
for any point in the document any context may be specified along both horizontal
and vertical axes in the document tree.
Note that the description above does not mean that a whole document
(tree) has to be matched during runtime. Optimization techniques allow for
inspection of only those parts of the document that are relevant for the query.
Mapping of standard database queries
In order to show the mapping of standard database queries we use a simple
example, taken from a standard textbook on database systems (Date, “An
Introduction to Databases”), which demonstrates the standard operations
of union, intersect, minus, select, project and join. In each case we will
show the corresponding CQL notation.
Date uses one example consistently throughout his textbook. It concerns
an education database of a commercial firm in which both teachers and students
are employees. The courses are defined by course#, title and description.
Each course has, possibly, prerequisite courses and is offered on a number
of dates and locations. It has a teacher and students, presented with their
names; in addition, students have grades.
The structure of this database may be described by the following DTD.
<!ELEMENT Data_Base (Course+) >
<!ELEMENT Course (Coursenr, Title, Description, Prereq*, Offering+) >
<!ELEMENT Prereq (Coursenr, Title) >
<!ELEMENT Offering (Date, Location, Format, Teacher, Student+) >
<!ELEMENT Teacher (Empnr, Name) >
<!ELEMENT Student (Empnr, Name, Grade) >
<!ELEMENT Coursenr (#PCDATA) > <!ELEMENT Format (#PCDATA) >
<!ELEMENT Title (#PCDATA) > <!ELEMENT Empnr (#PCDATA) >
<!ELEMENT Description (#PCDATA) > <!ELEMENT Name (#PCDATA) >
<!ELEMENT Date (#PCDATA) > <!ELEMENT Grade (#PCDATA) >
<!ELEMENT Location (#PCDATA) >
The basic retrieving functions in terms of CQL are :
UNION : retrieve "all Course#'s for courses which either have as a prerequisite
Course#=10 or which are located in Stockholm" :
S ::= - <Course><Coursenr> ? {R:1} </Coursenr> - (<Prereq><Coursenr>
‘10’ </Coursenr> - </Prereq> | <Offering> - <Location>’Stockholm’</Location>
- </Offering>) - </Course> -.
Which may be abbreviated to:
S ::= - <Course><Coursenr> ? {R:1} </> - (<Prereq><Coursenr>
‘10’ | <Offering> - <Location>’Stockholm’).
(In the sequel only the abbreviated notation is used.)
INTERSECT : retrieve "all Course#'s for courses which have as a prerequisite
Course#=10 and which are located in Stockholm"(straightforward syntactic variation
of UNION) :
S ::= - <Course><Coursenr> ? {R:1} </> - <Prereq><Coursenr>
‘10’ <../Prereq><Offering> ‘Stockholm’.
MINUS : retrieve "all Course#'s for courses which have as a prerequisite
Course#=10 and which are not located in Stockholm" (straightforward syntactic
variation of INTERSECT) :
S ::= - <Course><Coursenr> ? {R:1} </> - <Prereq><Coursenr>
‘10’ <../Prereq><Offering> ^‘Stockholm’.
SELECT : retrieve "all Course#'s for courses in Amsterdam" :
S ::= - <Course><Coursenr> ? {R:1} - </><Offering>’Amsterdam’.
PROJECT: retrieve "all Emp# of all students" :
S ::= - <Course>- <Offering>- <Student><Empnr> ? {R:1}.
JOIN (with PROJECT) : retrieve "all Course#'s for courses where the
student is the teacher of one of the prerequisite courses" :
S ::= - <Course> S1(c1, t1) & S2(c2, t2) {t1 == t2 & c1
== c2}.
S1(c1, t1) ::= <Coursenr> ? {R:1} </> - <Prereq><Coursenr>
? {c1 = %} </> - <Offering> - <Student> ? {t1 = %}.
S2(c2, t2) ::= <Coursenr> ? {c2 = %} </> - <Offering> - <Teacher>
? {t2 = %}.
Here the variables c1, c2, t1 and t2 are used. They get assigned the
last symbol that has been read in (denoted by the %-sign).
Principles of implementation
Compilation into efficient code
A query in CQL can be compiled into a Finite State Automaton (FSA).
The method is described fully in
[VanderSteen88]. A
few principles are described below.
As a starting point, we use the techniques for LR parser generation
(
[Aho86]). With these techniques, so called itemsets
are created which are associated with states in a FSA. An itemset contains
grammar rules in which a dot marks the position up to where the recognition
has proceeded. By extending the technique for itemset creation, the patterns
of CQL are compiled into a FSA.
As an example we show the itemset creation for a don’t care
Suppose we have the pattern grammar
The starting itemset contains the two items :
Itemset1:
S :: . - a a
S :: . - ? b
Based upon these two items, two types of input symbols have to be considered:
the ‘a’ and the set of all symbols that are not ‘a’,
which set we denote by the “rest symbol” ‘.’.
We construct, as the successor itemset for ‘a’ :
Itemset2:
S :: - a . a
S :: - ? . b
S :: . - a a
S :: . - ? b
As the successor itemset for ‘.’ we construct:
Itemset3:
S :: - ? . b
S :: . - a a
S :: . - ? b
In the same manner we may construct from Itemset3 successor itemsets
for ‘a’ and ‘b’. The rest symbol here denotes all
symbol which are not ‘a’ and ‘b’. Upon the rest symbol
we create an itemset which is identical to Itemset3. Therefore, Itemset3 (or,
to be precise, the associated state in the generated FSA) loops on the rest
symbol.
In order to deal with Boolean negations and their combinations with
regular expressions and nonterminals the itemdot is split into a number of
typed dots. Rules are introduced which govern transitions of the types of
itemdots.
Addition of indexes
States with a loop literally consume the input up to the point where
symbols occur on which they have a transition to another state. These looping
states are candidates for run-time optimizations.
Two optimizations suggest themselves.
- 1. If a state is waiting solely for an end-tag, the scanner may immediately
skip to that tag. The strategy may vary: if the scanner is operating on the
document tree, it can immediately jump to the parent node. If it operates
on a document it could make use of an index which contains the position of
end-tags.
- 2. If a state is waiting for a number of input symbols, an index could
be accessed which contains the positions of symbols. The type of the index
depends upon the types of symbols that the scanner will accept. For instance,
CQL could be configured as such that within a query only complete words may
be specified. In that situation a free text index may be used, eventually
extended with the ancestor context of (start-)tags. If more patterns within
words are allowed, PAT trees could be exploited.
In general it may be worthwhile to continue research on the type and
optimization of indexes when they have to cooperate with pattern matching
automata.
Implementation and experiences
CQL originated as an extension of a generalized system for recognition,
parsing and transduction of natural language texts (“Parspat”:
a parser for pattern
grammars). The extension was motivated by the shortcomings of relational database
systems for the handling of tree-structured documents and their poor facilities
for the handling of free text. A number of research projects in the Faculty
of Arts of the University of Amsterdam has made extensive use of the system.
The expressive power of the formalism of CQL proved itself rich enough for
the successful exploitation of enriched corpora of texts, encoded music, historical
estate inventories and bibliographic records.
Conclusion
With CQL it is possible to express all queries for structured documents
and standard database queries. It meets already most of the requirements of
[XmlQuery-req].
Queries can be compiled into efficient code for a FSA with skip instructions.
The formalism and formal machine of CQL is extendable towards higher grammar
formalisms for e.g. Natural Language Parsing.
Additional work has to be done to profit from traditional database indices.
In that respect, it is worth to develop more refined types of indices to optimize
the operation of the formal automaton.
Bibliography
| [Aho86] | Compilers. Principles, Techniques and Tools. Aho, A.V.,
Sethi, R. and Ullman, J.D., Addison Wesley, 1986 |
| [Cotton98] | Candidate Requirements for XML Query, Paul Cotton
and Ashok Malhotra, 1998. In Query Languages'98 (QL'98). Available
at http://www.w3.org/TandS/QL/QL98/pp/queryreq.html. |
| [Fernandez99] | XML Query Languages: Experiences and Exemplars,
Mary Fernandez, Jérôme Siméon, Philip Wadler, 1999. Available
at http://www.w3.org/1999/09/ql/docs/xquery.html. |
| [Maier98] | Database Desiderata for an XML Query Language, David
Maier, 1998. In Query Languages'98 (QL'98). Available at http://www.w3.org/TandS/QL/QL98/pp/maier.html. |
| [Robie99] | The Tree Structure of XML Queries, Jonathan Robie.
Available at http://www.w3.org/1999/10/xquery-tree.html. |
| [VanderSteen88] | A Program-Generator for Recognition, Parsing
and Transduction with Syntactic Patterns, Van der Steen, G.J., CWI Tracts
volume 55, Centre for Mathematics and Computer Science, Amsterdam, 1988, 284
pp. Available at http://www.palstar.nl. |
| [XmlQuery-req] | XML Query Requirements. W3C Working Draft 31 January
2000. Available at http://www.w3.org/TR/xmlquery-req. |