Text analysis tools for XML documents using regular expressions & XSL
a Web application
Karthik Jayaraman
Alexander Nakhimovsky


Abstract
A common type of text-analysis query is determined by several parameters: the lexical material to search for; the documents to search; the ranges of text to search within those documents; the analytical operation to perform, such as a frequency count or a concordance; and, if context is requested, the way to measure the amount of context.
We say that a text-analysis tool is "markup-aware" if both query conditions and context can be expressed in terms of markup. Most markup-aware text-analysis tools are based on SGML-TEI; to the best of our knowledge, none of them is a Web application, and none uses XSLT and XPath, two new languages codified by the W3C in November 1999. This paper describes a project to build text-analysis tools with the following design goals:
  • They are Web applications, in the sense that their user interface is a Web browser and they can be accessed over any TCP/IP network.
  • They can process queries containing both text patterns (described by regular expressions) and markup patterns (described by XPath expressions).
  • They are DTD-independent, in the sense that the user interface is constructed programmatically on the basis of the document's DTD.
  • They are extensible, in the sense that new functionality can be added with code written in a general-purpose programming language (most easily Java).
The last item is for those queries that would be difficult or impossible to express as an XSLT template. Fortunately, many XSLT processors, including James Clark's xt, which we are using, provide a mechanism that makes it possible to run arbitrary Java code (packaged as a Java bean) from within xt and feed the resulting node-set back into xt. We have, in effect, a Turing machine, ready to compute anything computable; the challenge is to identify the required functionality and the user interface to it. The demo described at the end of our paper, together with the accompanying tutorial, has been created to solicit feedback and suggestions from our potential users.
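To make the second design goal concrete, a query that combines a markup pattern with a text pattern can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `MarkupAwareQuery`, the method `find`, and the sample document are hypothetical and are not taken from the project, and the sketch uses the standard `javax.xml.xpath` and `java.util.regex` APIs rather than xt.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of the tools described in the paper.
class MarkupAwareQuery {

    /**
     * Selects nodes with an XPath expression (the markup pattern), then
     * collects every regular-expression match (the text pattern) found in
     * the text content of those nodes.
     */
    static List<String> find(String xml, String xpathExpr, String regex) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                xpathExpr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);

        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Matcher m = pattern.matcher(nodes.item(i).getTextContent());
            while (m.find()) {
                hits.add(m.group());
            }
        }
        return hits;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample document, invented for this sketch.
        String xml = "<play>"
                + "<speech speaker='A'><line>To be, or not to be</line></speech>"
                + "<speech speaker='B'><line>Let it be so</line></speech>"
                + "</play>";

        // Restrict the search range by markup, then search it by text pattern.
        List<String> hits = find(xml, "//speech[@speaker='A']/line", "\\bbe\\b");

        // The size of the list is the frequency count for this query.
        System.out.println(hits.size() + " match(es): " + hits);
    }
}
```

In this sketch the XPath expression plays the role of the "ranges of text to search" parameter, and the size of the returned list gives a frequency count; a concordance would additionally keep some context around each match.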