|
Managing tokenizers in XML search
|
 |
We describe the indexing component of our full-text search engine for
XML. Our main goals are to be able to index documents of different types and
to handle their structural and lexical conventions with precision. Indexing
of each document type is controlled by an XSLT indexing stylesheet. One of
the roles of the stylesheet is to select tokenizers to segment text components
of documents into indexable tokens. Both the indexing stylesheets and the tokenizers,
which are represented as Java objects, can be downloaded by the indexer over
the network.
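
As a rough illustration of what selecting a tokenizer by name could look like on the Java side, the following is a minimal sketch and not the engine's actual API. The interface `Tokenizer`, the classes `WhitespaceTokenizer` and `AlphanumericTokenizer`, the `TokenizerRegistry`, and the names "ws" and "alnum" are all hypothetical. The sketch assumes that a stylesheet refers to a tokenizer by a short name and that the indexer resolves that name to a Java object. Downloading tokenizer classes over the network is not shown.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical interface: segments the text of one document component into tokens.
interface Tokenizer {
    List<String> tokenize(String text);
}

// Splits on runs of whitespace and lowercases each piece; punctuation is kept.
class WhitespaceTokenizer implements Tokenizer {
    public List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }
}

// Emits maximal runs of letters or digits, lowercased; everything else separates tokens.
class AlphanumericTokenizer implements Tokenizer {
    public List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}

// Hypothetical registry: maps the name used in a stylesheet to a tokenizer object.
class TokenizerRegistry {
    private final Map<String, Tokenizer> byName = new HashMap<>();

    void register(String name, Tokenizer tokenizer) {
        byName.put(name, tokenizer);
    }

    Tokenizer lookup(String name) {
        Tokenizer tokenizer = byName.get(name);
        if (tokenizer == null) {
            throw new IllegalArgumentException("Unknown tokenizer: " + name);
        }
        return tokenizer;
    }
}
```

With both tokenizers registered, looking up "alnum" and applying it to the text `Full-text search, v2!` yields the tokens `full`, `text`, `search`, `v2`, whereas "ws" keeps the punctuation attached and yields `full-text`, `search,`, `v2!`. The difference shows why the choice of tokenizer may need to vary with the document type and component.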