Managing tokenizers in XML search
Jacek Ambroziak
Find


Abstract
We describe the indexing component of our full text search engine for XML. Our main goals are to be able to index documents of different types and to handle their structural and lexical conventions with precision. Indexing of each document type is controlled by an XSLT indexing stylesheet. One of the roles of the stylesheet is to select tokenizers to segment text components of documents into indexable tokens. Both the indexing stylesheets and tokenizers, which are represented as Java objects, can be downloaded by the indexer over the net.

Contents
  1. Introduction
  2. Tokenizers
  3. The central problem
  4. XSLT driven indexing
  5. Application and examples
  6. Implementation details
  7. Acknowledgements
  8. Bibliography

Introduction
Tokenization forms an integral part of full text indexing. In the process, character representations of source documents are segmented into tokens, which are the atomic units of information typically corresponding to natural language word forms. In addition to word forms tokenization can identify other types of meaning carrying symbols such as numbers, e-mail addresses, URLs, and, depending on the document's domain of discourse: telephone numbers, prices, measures, filenames and so on.
No single set of tokenization rules can handle all languages and/or domains. Even a single XML document may use several lexical conventions in different elements, e.g. in addresses, quoted code, tables, and so on. Instead of providing one very smart tokenizer attempting to cover all needs we designed a flexible architecture of pluggable tokenizers. Represented as Java objects they can be downloaded on demand and under the control of an indexing XSLT stylesheet applied to process text of a document in a context dependent fashion. The stylesheet's templates assign tokenizers to elements based on document structure and/or element attributes.
Previous Previous Table of Contents
Tokenizers
Tokenizers are program modules responsible for segmentation of text into broadly conceived of 'words.' Identified during indexing tokens are then often normalized to lowercase and put into an inverted file that forms the core of the text index. Lowercasing loses some potentially useful information but we won't concern ourselves with this aspect of text processing here. The inverted file index of normalized tokens, where token types are associated with documents and document positions of their occurrence, becomes the fundamental data structure used by a query engine to process user queries. The importance of tokenization stems from the fact that the output from this process determines what will be searchable: how the indexer/query engine will 'see' document collections.
Segmentation of text into tokens depends on application of tokenization rules which dictate where token boundaries should be. The simplest form of the process identifies the so called "black and white tokens"--contiguous regions of whitespace characters separating tokens formed exclusively with non-whitespace characters. This approximation of 'real' tokenization works quite well for limited purposes but it fails to acknowledge the special role of punctuation characters, which in addition to whitespace are often token delimiters and thus do not belong to tokens proper. In the next approximation we treat punctuation as whitespace and continue to apply the black and white mechanism. This variation represents one sweet point of the trade-off between mechanism simplicity and usefulness.
Previous Previous Table of Contents
The central problem
As we happen to live in the age of information flood and thus depend more and more on the quality of search engines to find information we seek with minimal effort, we need further sophistication of tokenizers (program modules responsible for tokenization).
For one thing, the simple process described so far is useless for some Asian languages that don't use whitespace for text segmentation. Token boundaries need to be identified within stretches of 'black' characters. Furthermore, when we want tokens, the atomic carriers of information in text, to represent addresses, measures, dates, prices, filenames and so on, we need genuine tokenization rules to reflect conventions employed in forming these units from ordinary and punctuation characters. For instance "4/9/2000" can be a US date. The same date in Central Europe will be represented as "9.4.2000". In both cases punctuation characters (slashes, dots) act as linking characters and not as delimiters. As we can see tokenization is a form of pattern recognition in text and can be used to classify tokens and perhaps normalize their meanings, for instance recognize dates and compute their standard representation. It will then be possible to find passages in text containing dates belonging to given intervals, sort text fragments by date, and so on.
The fundamental problem that we are addressing in this paper is the realization that no single set of tokenization rules can cover the demands of all varieties of indexable text. Individual natural languages and domains of discourse (chemistry, electronics, mathematics, music) will bring their own sets of conventions. Special document sections (keywords, bibliography, etc.) can likewise impose segmentation rules of their own.
The way we address the problem depends on a modular architecture: tokenizers sharing a common (and simple) interface can be plugged in during indexing and applied to text processing in a context dependent way.
The second main point to note is that tokenizer modules can be downloaded over the net together with the documents they are intended to process. In this way the publisher of the information can have complete control over how the their documents will be indexed. The tokenizers they supply will be reused for all documents of the same type with regard to lexical conventions. When the conventions change the tokenizers can follow the evolution. Different tokenizer versions can coexist easily since tokenizers are assigned to text sections dynamically on a per section basis. One can also imagine network accessible repositories of standard tokenizers to address common needs (news, product catalogs, and so one). Perhaps tokenizers can accompany standard document schemas in schema repositories.
The indexing system we are describing here is written completely in Java. The language makes it natural, easy and safe to download tokenizer objects (toklets?) over the net and run their code within the indexing environment.
Previous Previous Table of Contents
XSLT driven indexing
How are the tokenizers assigned to text sections? The indexing of XML documents in our system is performed under the control of XSLT [XSLT] "indexing stylesheets" [JRA01]. In a typical application stylesheets bind displayable qualities to structures within XML documents. Stylesheet templates assign font properties to runs of text, introduce purely graphical elements, copy and rearrange text and so on. Documents of differing types are transformed for a common graphical output convention.
By analogy, an indexing stylesheet transforms documents for full text indexing. The transformation involves selecting elements and attributes to be indexed, ignoring others, adding text not present in the original document to act like an annotation and to help find information through querying. One can think about the result of an indexing transformation as a document of an 'indexable' document type. Both text oriented and data oriented documents can be transformed to this type. In reality no output representation is built: the transformation engine directly controls the index building.
Of crucial importance to the present paper is that indexing stylesheets are responsible for binding tokenizers to document fragments. Both the indexing stylesheets and tokenizer objects can be downloaded over the net.
Previous Previous Table of Contents
Application and examples
The described methodology is used to index all types of documents from Shakespeare's plays (for demo purposes) to UNIX manual pages and Solaris documentation. One interesting application involves using the javadoc utility (see Java2 documentation) together with an XmlDoclet to generate XML documentation directly from Java code. javadoc extracts information on program structure (packages, classes, constructors, methods, etc.) from Java sources and associates programmer comments with program elements. All this information is available through the doclet API to the XmlDoclet which outputs valid XML documentation of the source programs. The documentation collections are then indexed and can be searched.
The search engine accepts queries expressed as sequences of query terms. It then finds text passages that contain as many as possible of the query terms (and/or their morphological variants) in close proximity, preferably in the same order and using the same morphological form. The passages are then ranked according to how close do they match the original query expression [JRA02].
The search engine additionally accepts context descriptions in the form of XPath [XPATH] expressions. These denote sets of nodes--the range of the query. Specific structural elements can be searched e.g. titles, abstracts, titles of major sections). The search range can also be specified by element attributes e.g. [@level='novice']).
Search engines, one per collection/index, can then become network services thanks to Jini [JINI] technology. Search engines are deployed as Jini services with meta information describing contents/topic of the indexed collection. The search services register with lookup services and place the meta information in lookup attributes. Based on the attributes clients can select and query collections they need. In this Federated Search Service individual services can run an arbitrary nodes of the network. Jini takes care of the decentralized administration of the federated service. Services can be plugged in or out independently. The services are available to multiple users querying them at the same time. Each user can query several collections at once: the query results are then merged.
An auxiliary service is responsible for displaying query hits. A query hit is a Java object (transmitted over RMI from remote search services) containing information on the relevant passage location. The information contains the URL of the original XML document, (possibly forked) path (XPath) location of the passage in the document structure, and token numbers of the matching terms to be highlighted. Whenever a user wants to see the matching passage together with its surrounding context she invokes the remote rendering Jini service to get the relevant document fragment. The service gets and parses (if not already cached) the requested document, locates an appropriate transformation, and runs it over the needed document fragment. In the process, the original tokenizer used to index the elements in scope of the query hit is run again to find the 'words' to highlight. The document fragment with highlighted terms is then shipped back to the user over RMI together with the document's table of contents (if not requested and shipped before). We currently use HTML as the medium of document visualization. It has the benefit that both conventional browsers and existing Java2 Swing components can be used to visualize documents.
Let's take a look at some examples from the indexing stylesheet for the javadoc/XmlDoclet application.
<xsl:template match="/|package|classDesc|interfaceDesc">
<xsl:apply-templates/>
</xsl:template>
This template simply selects document branches to be precessed.
<xsl:template match="class|interface">
<index:attribute index:nodeID="{generate-id(current())}"
index:attributeName="inPackage">
<xsl:apply-templates/>
</index:attribute>
</xml:template>
The above template matches class and interface elements and adds their 'inPackage' attribute to enable searches using this attribute to narrow down context. Please note that the index:attribute element follows xsl:apply-templates thus making the attribute range extend over all elements attributes and text recursively indexed by embedded template applications.
<xsl:template match="lead|detail|parameter|throws|returns|author">
<index:element index:tokenizer="com.sun.javasearch.util.SimpleTokenizer">
<xsl:apply-templates/>
</index:element>
</xsl:template>
Here index:element element conveys the information to the index builder that the matching elements need to be indexed, i.e. their descendant text nodes need to be tokenized and the positions of the tokens stored in the index. An attribute specifies which tokenizer to use. The tokenizer class must be available to the current class loader if not loaded already. The indexer creates and caches an instance of the named tokenizer class and applies the tokenizer object to process descendant text nodes. When another tokenizer is specified for a subelement, the current tokenizer is pushed onto a tokenizer stack and reactivated when indexing pops to the previous scope. The index builder notes which tokenizer is used for which text node and stores the tokenizer object with the index to support subsequent term highlighting during querying.
<xsl:template match="text()">
<index:text index:nodeID="{generate-id(current())}">
<xsl:value-of select="."/>
</index:text>
</xsl:template>
The index:text element causes the actual tokenization to happen using the tokenizer currently on the top of the tokenizer stack. Numbers of lowercased tokens not on the stoplist are stored in the index and linked to the document structure. The tokenizer objects need to implement a very simple interface with public void setText(String text) and public Token nextToken()methods, where Token is a class that comes with the system and has a public Token(String string, int start, int end) constructor.
Previous Previous Table of Contents
Implementation details
The indexing/query engine system described here builds on our previous experiences of developing a 100% Java search engine for JavaHelp [JavaHelp]. The original search engine code has been extended to handle XML documents and store their structure in a compressed format. The indexing stylesheets are interpreted by James Clark's XT. The indexes of XML collections occupy approximately 10% of the collection size as both structural and positional information is compressed.
Previous Previous Table of Contents
Acknowledgements
The author wishes to thank members of Sun's XML Technology Center for support and many useful suggestions.
Previous Previous Table of Contents
Bibliography
[JRA01]Jacek Ambroziak, XSLT in document indexing, XTech 2000, San Jose, CA, March 2000.
[XSLT]XSL Transformations (XSLT), Version 1.0, W3C Recommendation, 11/16/1999
[JavaHelp]http://www.java.sun.com/products/javahelp
[JRA02]Jacek Ambroziak and William A. Woods, Natural Language Technology in Precision Content Retrieval, Sun Labs Report, 1998.
[XPATH]XML Path Language (XPath), Version 1.0, W3C Recommendation, 11/16/1999
[JINI]http://www.sun.com/jini
Previous Previous Table of Contents