Text analysis tools for XML documents using regular expressions & XSL:
a Web application
A common type of text-analysis query is determined by several parameters,
such as: the lexical material to search for; the documents to search; the ranges
of text to search within the documents; the analytical operation to perform,
such as frequency count or concordance; and, if context is requested, the way to
measure the amount of context.
We say that a text-analysis tool is "markup-aware" if both query conditions
and context can be expressed in terms of markup. Most markup-aware text-analysis
tools are based on SGML-TEI; none of them, to the best of our knowledge, are
Web applications, and none use XSLT and XPath, two new languages codified
by W3C in November 1999. This paper describes a project to build text-analysis
tools with the following design goals:
- They are Web applications, in the sense that their user interface
is a Web browser and they can be accessed over any TCP/IP network.
- They can process queries containing both text patterns (described
by regular expressions) and markup patterns (described by XPath expressions).
- They are DTD-independent, in the sense that the user interface is constructed
programmatically on the basis of the document's DTD.
- They are extensible in the sense that they can be extended by code
written in a general-purpose programming language (Java most easily).
The last item is for those queries that would be difficult or impossible
to express as an XSLT template. Fortunately, many XSLT processors, including
James Clark's xt which we are using, provide a mechanism that makes it possible
to run arbitrary Java code (packaged as a Java bean) from within xt and feed
the resulting node-set back into xt. We have, in effect, a Turing machine,
ready to compute anything computable; the challenge is to identify the required
functionality, and the user interface to it. The demo described at the end
of our paper, together with the accompanying tutorial, has been created to
solicit feedback and suggestions from our potential users.
Several important technologies related to XML matured in 1999. In particular,
XSLT, a language for describing XML parse tree transformations, and XPath,
a language for specifying sets of tree nodes (a kind of regular-expression
language for paths in labeled trees), were completed and released in November
1999. At the same time, James Clark released a version of XT (his XSLT
processor) that is in close conformance with the Recommendations
(http://www.jclark.com/xml).
Several other XSLT processors are in the process of rapid development. Some
of them, including XT, have a built-in extension mechanism, so that if some
transformation is too hard for XSLT, it can be delegated to a general-purpose
programming language, typically Java.
These technologies can greatly influence the development of tools for
text analysis used by scholars in the humanities. Much of the functionality
of those tools has to do with searching the document for specific text patterns,
with additional search conditions specified in terms of markup. (For instance:
"find the first speech in the third act of the play that contains a word beginning
with the character sequence 'lov'".) Such searches can be thought of as transformations
of the original document into the result of the search, and they are therefore
easily expressible in XSLT and XPath. XSLT also includes functionality that
allows sorting and frequency counts.
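As an illustration, the example query above can be sketched in Java. XPath 1.0 has no regular-expression functions, so this sketch selects the candidate speeches with XPath and then filters them with `java.util.regex`. It uses the XPath API bundled with current JDKs rather than xt; the class name `SpeechQuery`, the method `firstMatch` and the miniature play are hypothetical, and the code assumes the JDK's XPath implementation returns the node-set in document order.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class SpeechQuery {
    // Hypothetical three-act play; element names follow play.dtd-style markup
    // (SCENE elements are omitted to keep the sample short).
    static final String PLAY =
          "<PLAY>"
        + "<ACT><SPEECH><SPEAKER>A</SPEAKER> <LINE>All for love</LINE></SPEECH></ACT>"
        + "<ACT><SPEECH><SPEAKER>B</SPEAKER> <LINE>No matter</LINE></SPEECH></ACT>"
        + "<ACT>"
        + "<SPEECH><SPEAKER>C</SPEAKER> <LINE>Good morrow</LINE></SPEECH>"
        + "<SPEECH><SPEAKER>D</SPEAKER> <LINE>I love thee well</LINE></SPEECH>"
        + "<SPEECH><SPEAKER>E</SPEAKER> <LINE>A lovely day</LINE></SPEECH>"
        + "</ACT></PLAY>";

    /**
     * Returns the trimmed text of the first node selected by the XPath
     * expression whose text contains a match for the regex, or null if none.
     */
    static String firstMatch(String xml, String xpath, String regex) {
        try {
            XPath xp = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xp.evaluate(
                xpath, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            Pattern pattern = Pattern.compile(regex);
            for (int i = 0; i < nodes.getLength(); i++) {
                String text = nodes.item(i).getTextContent();
                if (pattern.matcher(text).find()) {
                    return text.trim();
                }
            }
            return null;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + xpath, e);
        }
    }

    public static void main(String[] args) {
        // Markup condition: speeches in the third act.
        // Text condition: a word beginning with "lov".
        System.out.println(firstMatch(PLAY, "/PLAY/ACT[3]//SPEECH", "\\blov\\w*"));
        // prints: D I love thee well
    }
}
```

The speech in the first act also contains "love", but the markup condition restricts the search to the third act, so it is never tested against the text pattern.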
At the same time, XML parsers, especially XML Java parsers, have also
made great improvements. There are ongoing Java parser projects at Sun, IBM,
Microsoft and Oracle, among others, with stable versions already available
and a regular schedule of new releases. The parsers have become faster and
more conformant, as described in Brownell's recent study and follow-up exchanges
(http://www.xml.com). The rules for integrating a Java XML parser
into a Java program have also been codified (by Sun) as the Java API for
XML Parsing (JAXP). Since most XSLT processors, including
James Clark's, are also written in Java, the same program can use an XML parser
and an XSLT processor to transform an XML document in any programmable way.
In particular, they can run any programmable query and format its result in
HTML (or XHTML).
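A minimal sketch of such a program follows. It parses a document with a JAXP parser, applies an XSLT stylesheet to the resulting tree, and returns HTML. The sketch relies on the `javax.xml.parsers` and `javax.xml.transform` packages shipped with current JDKs, not on xt or the parsers available at the time of writing; the class name `ParseAndTransform`, the sample document and the stylesheet are assumptions made for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class ParseAndTransform {
    // Hypothetical query: list every speaker, sorted alphabetically, as HTML.
    static final String SPEAKERS_TO_HTML =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='html' indent='no'/>"
        + "<xsl:template match='/'><ul>"
        + "<xsl:for-each select='//SPEAKER'><xsl:sort select='.'/>"
        + "<li><xsl:value-of select='.'/></li>"
        + "</xsl:for-each></ul></xsl:template>"
        + "</xsl:stylesheet>";

    /** Parses the XML into a DOM tree, then runs the stylesheet over that tree. */
    static String run(String xml, String stylesheet) {
        try {
            DocumentBuilderFactory parsers = DocumentBuilderFactory.newInstance();
            parsers.setNamespaceAware(true); // XSLT processors expect namespace-aware trees
            Document doc = parsers.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter html = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(html));
            return html.toString();
        } catch (Exception e) {
            // Parser, stylesheet and I/O failures are all reported the same way here.
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String play = "<PLAY>"
            + "<SPEECH><SPEAKER>OPHELIA</SPEAKER></SPEECH>"
            + "<SPEECH><SPEAKER>HAMLET</SPEAKER></SPEECH>"
            + "</PLAY>";
        System.out.println(run(play, SPEAKERS_TO_HTML));
    }
}
```

Because the stylesheet is passed in as data, the same `run` method can serve any query that can be written as an XSLT transformation.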
The combined effect of these developments is that there is now a solid
foundation for text analysis tools that have these two properties:
- They are Web applications, in the sense that their user interface
is a Web browser and they can be accessed over the Internet;
- They are completely "markup-aware", in the sense that they can run
queries containing both text patterns (described by regular expressions) and
markup patterns (described by XPath expressions).
Since December 1999, a project to develop such tools has been under
way at Colgate University. The primary goal of this project is to produce
an application that retains the sophistication of existing text analysis tools
while using the platform of the World Wide Web. This allows us to provide
a simple user interface that does not require a large degree of technical
ability to execute powerful operations. The project uses the built-in internationalization
features of Java and XML to create a platform that will permit an easy transition
from English to foreign language texts. The project has also been designed
to be DTD-independent in that no details of the document markup are hard-coded
into the program. The initial setup for a new DTD simply requires a text file
containing a list of all the tags in the DTD and can be conducted by the site
administrator. Following this, the production of HTML forms for the user interface
is fully automated. Making the project DTD-independent requires a slight trade-off
in the user interface in the form of increased complexity since all the DTD-specific
information is built into the HTML forms. However, it is our belief that designing
the project in this manner will allow much greater extensibility. For example,
it is conceivable that this program can be used to operate on any kind of
data that has been marked up in XML, even data that does not lend itself to
being easily marked up in popular formats such as TEI and DocBook.
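The setup step described above might look roughly like the following sketch, which reads a tag list and emits an HTML drop-down for a search form. The file format (one element name per line), the class name `FormBuilder` and the form field name are assumptions; the project's actual generator is not shown in this paper.

```java
import java.util.ArrayList;
import java.util.List;

final class FormBuilder {
    /** Reads a tag list, assumed to hold one element name per line; blank lines are skipped. */
    static List<String> parseTagList(String fileContents) {
        List<String> tags = new ArrayList<>();
        for (String line : fileContents.split("\\R")) {
            String tag = line.trim();
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }

    /**
     * Builds an HTML drop-down offering every element name in the DTD.
     * XML names cannot contain '<', '&' or quotes, so no escaping is done here.
     */
    static String selectFor(String fieldName, List<String> tags) {
        StringBuilder html = new StringBuilder();
        html.append("<select name=\"").append(fieldName).append("\">");
        for (String tag : tags) {
            html.append("<option value=\"").append(tag).append("\">")
                .append(tag).append("</option>");
        }
        return html.append("</select>").toString();
    }

    public static void main(String[] args) {
        // Stand-in for the administrator's tag file for a play DTD.
        List<String> tags = parseTagList("PLAY\nACT\nSCENE\nSPEECH\nSPEAKER\nLINE\n");
        System.out.println(selectFor("element", tags));
    }
}
```

Nothing in the code mentions a particular DTD: supporting a new document type means supplying a new tag list, not changing the program.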
The project has also been designed to return output from its queries
in XML with a simple result DTD of its own. A second XSL stylesheet then
converts this result to HTML. The impact on the user is as
follows. The output can be formatted precisely the way the user would like
to see it by rewriting the conversion stylesheet or hiring a programmer to
do it. For example, the conversion stylesheet for frequency counts could return
its output in SVG so that the counts are displayed as a bar graph. It is also
possible to use a dummy stylesheet for conversion. This would return the raw
XML result from the query, which can be saved locally and used for further
processing, possibly with other tools. If the "other tools" require their
input data to use a specific DTD for markup, the conversion stylesheet can
be configured to return output marked up using that DTD. This flexibility
will fully leverage the power of XSLT and the portability and standardization
of XML. Since the conversion stylesheet is not hard-coded, there can be a
choice of conversion stylesheets on the server with the user choosing one
of them from a drop-down list at the time that a query is executed.
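The two-stage design can be sketched as follows: a frequency count is first written as result XML, and a separately chosen conversion stylesheet then turns it into the final output. The result markup (`result`, `item`, `form`, `count`), the class name `FrequencyResult` and both stylesheets are hypothetical; the project's actual result DTD is not given here. The transformation uses the `javax.xml.transform` API of current JDKs.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class FrequencyResult {
    private static final String XSL_OPEN =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>";

    // "Dummy" conversion: an identity transform that returns the raw result XML.
    static final String RAW = XSL_OPEN
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='@*|node()'><xsl:copy>"
        + "<xsl:apply-templates select='@*|node()'/></xsl:copy></xsl:template>"
        + "</xsl:stylesheet>";

    // HTML conversion: one table row per word form, most frequent first.
    static final String HTML_TABLE = XSL_OPEN
        + "<xsl:output method='html' indent='no'/>"
        + "<xsl:template match='/result'><table><xsl:for-each select='item'>"
        + "<xsl:sort select='@count' data-type='number' order='descending'/>"
        + "<tr><td><xsl:value-of select='@form'/></td>"
        + "<td><xsl:value-of select='@count'/></td></tr>"
        + "</xsl:for-each></table></xsl:template></xsl:stylesheet>";

    /** Stage 1: counts regex matches (lower-cased) and writes them as result XML. */
    static String countsAsXml(String text, String regex) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by word form
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            counts.merge(m.group().toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        // No escaping: this sketch assumes matches contain only word characters.
        StringBuilder xml = new StringBuilder("<result>");
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            xml.append("<item form=\"").append(e.getKey())
               .append("\" count=\"").append(e.getValue()).append("\"/>");
        }
        return xml.append("</result>").toString();
    }

    /** Stage 2: applies whichever conversion stylesheet the user selected. */
    static String convert(String resultXml, String stylesheet) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(resultXml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("conversion failed", e);
        }
    }

    public static void main(String[] args) {
        String result = countsAsXml("Love loves love", "(?i)\\blov\\w*");
        System.out.println(convert(result, RAW));
        System.out.println(convert(result, HTML_TABLE));
    }
}
```

Adding an output format, such as the SVG bar graph mentioned above, would then mean adding one more stylesheet constant; the counting code stays untouched.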
As of this writing (March 2000) we have a simple prototype running at
http://csproj.colgate.edu:8000/karthik/TextTools.html. The prototype
shows a specific and very simple DTD (Jon Bosak's play.dtd for Shakespeare
plays). In the final version, the first screen that the user sees will have
a list of texts that are available and DTD selection will be transparent to
the user.

Figure 1. Screenshot 1 - Search Form
The interface (see Figure 1) currently allows arbitrary
Perl5 regular expressions and arbitrary XPath expressions on input. It provides
drop-down selection lists of available XML elements and it also makes the
creation of XPath expressions a simple matter of choosing from drop-down lists
that contain lists of choices for the XPath axis, an XML element and text-boxes
to select on attributes or position. For users who do not know the details
of regular expression syntax, a rudimentary knowledge of how regular expressions
work will suffice. An adjacent drop-down list lets the user select options
such as "any character", "whole word" and other regular expression
constructs, and inserts them into the text-box to build the search pattern.
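To make the drop-down approach concrete, the sketch below assembles one XPath location step from the kinds of form fields just described: an axis, an element name, an optional attribute test and an optional position. The class name `XPathStepBuilder` and the parameter names are hypothetical, and the `type` attribute in the usage example is invented for illustration.

```java
final class XPathStepBuilder {
    /**
     * Builds a step such as descendant::SPEECH[@type="aside"][2].
     * attrName and position may be null when the user leaves those fields empty;
     * attrValue must be non-null whenever attrName is given.
     */
    static String step(String axis, String element,
                       String attrName, String attrValue, Integer position) {
        StringBuilder step = new StringBuilder(axis).append("::").append(element);
        if (attrName != null) {
            step.append("[@").append(attrName).append('=')
                .append(quote(attrValue)).append(']');
        }
        if (position != null) {
            step.append('[').append(position).append(']');
        }
        return step.toString();
    }

    /**
     * XPath 1.0 string literals have no escape mechanism, so the quote
     * character is chosen to be one that does not occur in the value.
     */
    static String quote(String value) {
        if (!value.contains("\"")) {
            return "\"" + value + "\"";
        }
        if (!value.contains("'")) {
            return "'" + value + "'";
        }
        throw new IllegalArgumentException("value contains both quote characters");
    }

    public static void main(String[] args) {
        System.out.println(step("descendant", "SPEECH", "type", "aside", 2));
        // prints: descendant::SPEECH[@type="aside"][2]
        System.out.println(step("child", "ACT", null, null, 3));
        // prints: child::ACT[3]
    }
}
```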

Figure 2. Screenshot 2 - Form configuration page
The number of elements on which searches can be conducted is unlimited
and can be set by the user (see Figure 2). The same is
true of the number of search conditions for each element. The majority of
simple XPath expressions can be created using the drop-down boxes provided.
For the cases where this is not sufficient, the required expression can be
typed in a text-box by setting it as an attribute of an element. The default
behaviour is to perform a Boolean AND of all the search conditions. However,
this is easily changed to a Boolean OR by making the appropriate selection
in a drop-down list on the search page. We are also exploring the possibility
of extending XPath to allow regular expression searches in attributes and
element names. Given the current definition of XPath, this will probably carry
a significant cost in speed. However, it may still be advantageous to enable
this option for complex searches or DTDs.
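The Boolean AND/OR switch can be illustrated with a short sketch that joins several search conditions into a single XPath predicate. It covers only conditions that are themselves XPath expressions; regular-expression conditions would need separate handling. The class name `ConditionCombiner`, the sample document and its `type` attribute are assumptions, and the evaluation uses the JDK's `javax.xml.xpath` API.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class ConditionCombiner {
    // Hypothetical sample: three speeches, two of them marked as asides.
    static final String SAMPLE =
          "<PLAY>"
        + "<SPEECH type='aside'><SPEAKER>HAMLET</SPEAKER></SPEECH>"
        + "<SPEECH><SPEAKER>HAMLET</SPEAKER></SPEECH>"
        + "<SPEECH type='aside'><SPEAKER>OPHELIA</SPEAKER></SPEECH>"
        + "</PLAY>";

    /** Joins conditions with "and" (matchAll) or "or" into one XPath predicate. */
    static String predicate(List<String> conditions, boolean matchAll) {
        if (conditions.isEmpty()) {
            return ""; // no conditions: the element test alone selects the nodes
        }
        StringBuilder p = new StringBuilder("[");
        for (int i = 0; i < conditions.size(); i++) {
            if (i > 0) {
                p.append(matchAll ? " and " : " or ");
            }
            // Parentheses keep each condition intact whatever operators it contains.
            p.append('(').append(conditions.get(i)).append(')');
        }
        return p.append(']').toString();
    }

    /** Counts the nodes that a node-set expression selects in the document. */
    static int count(String xml, String nodeSetXPath) {
        try {
            Double n = (Double) XPathFactory.newInstance().newXPath().evaluate(
                "count(" + nodeSetXPath + ")",
                new InputSource(new StringReader(xml)), XPathConstants.NUMBER);
            return n.intValue();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + nodeSetXPath, e);
        }
    }

    public static void main(String[] args) {
        List<String> conditions = Arrays.asList("SPEAKER='HAMLET'", "@type='aside'");
        System.out.println(count(SAMPLE, "//SPEECH" + predicate(conditions, true)));  // AND
        System.out.println(count(SAMPLE, "//SPEECH" + predicate(conditions, false))); // OR
    }
}
```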
Two difficult problems facing the project are the user interface and
query optimization. For user interface development, we will seek collaboration
with an active humanities project that would use our tools and contribute
ideas for better functionality and user interface. The focus will be on retaining
functionality while increasing the user-friendliness of the interface. One
possible solution is a tiered interface: a simple but less powerful
interface for basic queries, with complexity and power increasing
gradually for more advanced users and queries. For query
optimization, we will investigate methods of storing and indexing pre-processed
queries, trading disk space for processing time. There have been tentative
attempts recently to formulate correspondences between XPath and SQL, the
query language of relational databases. There has also been an ongoing discussion
on the xml-dev list about the need for a standard API for storage, search
and retrieval from repositories of XML documents. These may prove relevant
for the organization of large repositories of XML documents that can be accessed
over the Internet.
The project currently runs on a small NT server, but a Linux version
is also in progress. Since the program is written in Java and does not use
any native methods, porting to a variety of operating systems and environments
should not be difficult.
Bibliography
[1] Bosak, Jon. XML Shakespeare, in http://metalab.unc.edu/bosak/xml/eg/shaks200.zip.
[2] Clark, James. The xt distribution, at http://www.jclark.com/xml.
[3] DeRose, Steven and C.M. Sperberg-McQueen. "A broadcast architecture for distributed text tools". Proceedings of the ALLC/ACH conference, 1999.
[4] W3C XSLT Recommendation, at http://www.w3c.org/tr.
[5] Nakhimovsky and Myers. JavaScript Objects. Wrox, 1998.
[6] Nakhimovsky and Myers. Professional Java XML Programming. Wrox, 1999.