Text analysis tools for XML documents using regular expressions & XSL:
a Web application
A common type of text-analysis query is determined by several parameters:
the lexical material to search for; the documents to search; the ranges
of text to search within those documents; the analytical operation to perform,
such as a frequency count or a concordance; and, if context is requested, the
unit in which to measure the amount of context.
We say that a text-analysis tool is "markup-aware" if both query conditions
and context can be expressed in terms of markup. Most markup-aware text-analysis
tools are based on SGML-TEI; none of them, to the best of our knowledge, are
Web applications, and none use XSLT and XPath, two new languages codified
by W3C in November 1999. This paper describes a project to build text-analysis
tools with the following design goals:
- They are Web applications, in the sense that their user interface
is a Web browser and they can be accessed over any TCP/IP network.
- They can process queries containing both text patterns (described
by regular expressions) and markup patterns (described by XPath expressions).
- They are DTD-independent, in the sense that the user interface is constructed
programmatically on the basis of the document's DTD.
- They are extensible in the sense that they can be extended by code
written in a general-purpose programming language (Java most easily).
The last goal accommodates queries that would be difficult or impossible
to express as an XSLT template. Fortunately, many XSLT processors, including
James Clark's xt, which we are using, provide a mechanism that makes it possible
to run arbitrary Java code (packaged as a Java bean) from within xt and feed
the resulting node-set back into xt. We have, in effect, a Turing machine,
ready to compute anything computable; the challenge is to identify the required
functionality and the user interface to it. The demo described at the end
of our paper, together with the accompanying tutorial, has been created to
solicit feedback and suggestions from our potential users.
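
To make the idea of a query that combines a markup pattern with a text pattern
concrete, here is a minimal sketch in Java. It is not the implementation
described in this paper: it uses the XPath API bundled with the JDK
(javax.xml.xpath) rather than xt and XSLT, and the class name, method names,
and sample document are hypothetical. It also assumes that context is measured
in characters, which is only one of the possible units. An XPath expression
selects the ranges of text to search, and a regular expression is then applied
inside each range to produce either a frequency count or a simple concordance.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration; not the tool described in the paper.
public class MarkupAwareQuery {

    /** Returns the text content of every node selected by the XPath expression. */
    static List<String> selectRanges(String xml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> ranges = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                ranges.add(nodes.item(i).getTextContent());
            }
            return ranges;
        } catch (Exception e) {
            // Wrap parser and XPath errors so callers need no checked exceptions.
            throw new IllegalArgumentException("Cannot evaluate query", e);
        }
    }

    /** Frequency count: number of regex matches inside the selected ranges. */
    static int frequency(String xml, String xpath, String regex) {
        Pattern pattern = Pattern.compile(regex);
        int count = 0;
        for (String range : selectRanges(xml, xpath)) {
            Matcher m = pattern.matcher(range);
            while (m.find()) {
                count++;
            }
        }
        return count;
    }

    /** Concordance: each match with up to `width` characters of context per side. */
    static List<String> concordance(String xml, String xpath, String regex, int width) {
        Pattern pattern = Pattern.compile(regex);
        List<String> lines = new ArrayList<>();
        for (String range : selectRanges(xml, xpath)) {
            Matcher m = pattern.matcher(range);
            while (m.find()) {
                // Context never crosses the boundary of the selected range.
                int from = Math.max(0, m.start() - width);
                int to = Math.min(range.length(), m.end() + width);
                lines.add(range.substring(from, to));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up sample document for the demonstration.
        String xml = "<play><speech who=\"A\"><l>To be, or not to be</l></speech>"
                   + "<speech who=\"B\"><l>Let be.</l></speech></play>";
        System.out.println(frequency(xml, "//speech[@who='A']/l", "\\bbe\\b"));
        System.out.println(concordance(xml, "//speech/l", "\\bbe\\b", 4));
    }
}
```

In this sketch the markup pattern restricts the search to lines spoken by one
speaker, and the context is clipped at the edges of each selected element. A
markup-aware tool could instead express context in terms of markup, for
example by returning the whole enclosing line or speech.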