Managing tokenizers in XML search
Jacek Ambroziak


Abstract
We describe the indexing component of our full-text search engine for XML. Our main goals are to index documents of different types and to handle their structural and lexical conventions with precision. Indexing of each document type is controlled by an XSLT indexing stylesheet. One role of the stylesheet is to select the tokenizers that segment the text components of a document into indexable tokens. Both the indexing stylesheets and the tokenizers, which are represented as Java objects, can be downloaded by the indexer over the network.
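
To make the idea of per-component tokenizer selection concrete, the following is a minimal sketch in Java. It is not the engine's actual API: the names Tokenizer, WordTokenizer, VerbatimTokenizer, and TokenizerSelector are hypothetical, and a plain map keyed by element name stands in for the selection that the abstract attributes to an XSLT indexing stylesheet.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical contract: segment a text component into indexable tokens. */
interface Tokenizer {
    List<String> tokenize(String text);
}

/** Splits on runs of non-letter, non-digit characters and lowercases each token. */
class WordTokenizer implements Tokenizer {
    public List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }
}

/** Keeps the trimmed text as a single token, e.g. for identifiers. */
class VerbatimTokenizer implements Tokenizer {
    public List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        String trimmed = text.trim();
        if (!trimmed.isEmpty()) {
            tokens.add(trimmed);
        }
        return tokens;
    }
}

/**
 * Assumption: a map from element name to tokenizer replaces the
 * stylesheet-driven selection described in the abstract.
 */
class TokenizerSelector {
    private final Map<String, Tokenizer> byElement = new HashMap<>();
    private final Tokenizer fallback;

    TokenizerSelector(Tokenizer fallback) {
        this.fallback = fallback;
    }

    void register(String elementName, Tokenizer tokenizer) {
        byElement.put(elementName, tokenizer);
    }

    /** Tokenizes text with the tokenizer registered for the element, or the fallback. */
    List<String> tokenize(String elementName, String text) {
        return byElement.getOrDefault(elementName, fallback).tokenize(text);
    }
}
```

In this sketch, a selector built with a WordTokenizer fallback and a VerbatimTokenizer registered for a hypothetical "isbn" element would split a title into lowercase words while keeping an ISBN intact as one token. Because each tokenizer is an ordinary object behind one interface, new tokenizers can be supplied without changing the selector.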