Indexsheets - the "Extensible Indexing Language" (XIL)
defines indexing based on XSLT/XPath
Bennett Cookson Jr.


Abstract
Given a well marked-up XML document, we can format XML elements with XSL, but how are these elements indexed? This paper demonstrates why indexing should be determined per element type rather than applied identically to all elements. I then present XIL, a system based on XSLT for creating “Indexsheets,” which are like stylesheets for indexing. XIL is a rule-based system built on XPath and part of XSLT. An indexsheet is much like an XSL stylesheet and uses XSLT to identify elements; however, rather than applying formatting with “formatting objects,” it applies indexing rules with “indexing objects.” Since selection of elements is based on XSLT/XPath, the syntax and use are both familiar and standard. XIL also separates structure from indexing attributes, just as structure is separated from presentation.

Keywords

Contents
  1. Introduction
    1. HTML
  2. Why index elements differently
    1. Index by fields
    2. Document table of contents structure
      1. Content network example
  3. XIL
    1. Preserves the original document
    2. Separate structure from indexing attributes
    3. Always use xsl:apply-templates element
    4. Hit transformation
  4. XML, XSL, and XIL
    1. XML sample
    2. XSL simplified sample
    3. Index all elements sample 1
    4. XIL sample 2
      1. Explanation
    5. Indexing element attribute values
    6. Protecting valuable markup from unauthorized reuse
    7. Case-insensitive comparisons
  5. Indexing objects
    1. Applied to element
    2. Applies only to tag
    3. Indexing element attribute values
  6. Conclusion
  7. Bibliography

Introduction
Given a well marked-up XML document, we can format XML elements with XSL, but how are these elements indexed? This paper demonstrates why indexing should be determined per element type rather than the same for all elements. Once this need is established, I will present a system based on XSLT called XIL for creating “Indexsheets.” Combined with rich XML markup, Indexsheets become very powerful tools for information publishing.
Indexsheets, written in the “Extensible Indexing Language” (XIL), are essentially stylesheets for indexing. XIL is a rule-based system built on XPath and part of XSLT, and was originally developed from an early working draft of XSL. An indexsheet is much like an XSL stylesheet and uses XSLT to identify elements; however, rather than applying formatting with “formatting objects,” it applies indexing rules with “indexing objects.” Therefore, just as XSL can potentially format each element type or even a single element instance differently, indexsheets can index each according to specified indexing attributes.
Since selection of elements is based on XSLT/XPath, the syntax and use are both familiar and standard. The XSL Transformations (XSLT) W3C Recommendation says “XSLT is also designed to be used independently of XSL … for the kinds of transformations that are needed when XSLT is used as part of XSL.” [XSLT99] Directly applying indexing properties to elements is an example of a different way XSLT can be used. Although XIL is much like XSLT, it differs in that it applies indexing attributes to the original document rather than producing a new document.
HTML
In today's publishing environment, HTML as a source format must still be supported. Therefore, XIL was designed to handle HTML as well. Actually, XIL works very well for HTML as long as the elements being matched are well formed and the rest of the document is reasonably clean.
Why index elements differently
When I first approached the problem of indexing XML and HTML, my first thought was to index all text and all elements. However, particularly with HTML and its mess of formatting elements, I soon found out that blindly indexing all elements was not the best approach.
We could simply index all the text and elements and search for a term within a specific element, but this is not sufficient, because the contents of all elements cannot simply be indexed as plain text. The contents of element types differ, which is why they are marked up in the first place. For example, is text in a regular paragraph as relevant for searching purposes as text in a heading? Although a user could specifically search for a term within a title element, a full-text search would not naturally weight a term within a title any higher than the same term found within a regular paragraph, or even a footnote. Headings, abstracts, and keywords are examples of elements that should usually be weighted higher than other elements.
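The weighting argument can be made concrete with a minimal Python sketch. This is not the NextPage implementation: the numeric weights, the element names in `WEIGHTS`, and the `score` helper are all assumptions chosen for illustration, since the paper only says that some elements should count for more than others.

```python
import xml.etree.ElementTree as ET

# Hypothetical weights per element type; the actual values are an assumption.
WEIGHTS = {"TITLE": 3.0, "PARA": 1.0, "FOOTNOTE": 0.5}

def score(xml_text, term):
    """Sum a weight for every occurrence of `term` in each element's own text."""
    root = ET.fromstring(xml_text)
    total = 0.0
    for elem in root.iter():
        # Only the text directly inside the element (before any child) is counted.
        words = (elem.text or "").lower().split()
        total += WEIGHTS.get(elem.tag, 1.0) * words.count(term.lower())
    return total

doc_a = "<DOC><TITLE>Electronic commerce</TITLE><PARA>An overview.</PARA></DOC>"
doc_b = "<DOC><TITLE>An overview</TITLE><PARA>Electronic commerce grows.</PARA></DOC>"
```

With these assumed weights, a document that mentions the term in its title ranks above one that mentions it only in a paragraph, which is the behavior a plain full-text index would not give.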
Many elements contain simple text and therefore a simple word index would be appropriate. However, the power of XML is to describe data with specific element types. The type of data contained in each element type is often different. For example, an element may actually contain a date, a price, a part number, etc. Do we want to be able to do comparisons of dates? Or search for a price less than $22? Are there some elements or text which should not be indexed? Are there some elements or text that should be indexed, but then removed for security reasons?
Index by fields
A field is a set of terms grouped together for indexing and searching purposes. A field can be applied to text contained within an element or an attribute value. This provides separation of searchable fields from element names or a single element. Although fields can be named after elements, using fields is more flexible than being limited to indexing by element names. A single field instance can map to the contents of one or more elements or the value of one or more attributes. Fields are often used to group elements from different document types together in the same search set. For example, this is especially useful when data is coming from multiple sources that use different element names, but share the same data type, and should logically be searched together. Fields are nested along with element structure.
A field name can be specified explicitly, or XIL can use an element name or the value of an element attribute to name a field. An attribute value could even be indexed in a field named after the value of another attribute.
Fields can be typed to allow for comparison searching such as a price less than $10. The field types include plain text, integers, real numbers, date, and time.
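As a rough illustration of fields that are decoupled from element names and carry a type, here is a small Python sketch. It is not XIL syntax: the element names `PRICE` and `COST`, the `FIELD_RULES` table, and the `index_fields` helper are hypothetical, and the sketch only shows two differently named elements feeding one numeric field that can then be compared.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: elements from two schemas feed one typed "Price" field.
FIELD_RULES = {"PRICE": ("Price", float), "COST": ("Price", float),
               "TITLE": ("Title", str)}

def index_fields(xml_text):
    """Return {field name: [typed values]} for every element that has a rule."""
    fields = {}
    for elem in ET.fromstring(xml_text).iter():
        rule = FIELD_RULES.get(elem.tag)
        if rule is None:
            continue
        name, convert = rule
        # Drop surrounding whitespace and a leading currency sign before typing.
        raw = (elem.text or "").strip().lstrip("$")
        fields.setdefault(name, []).append(convert(raw))
    return fields

catalog = ("<ITEMS><ITEM><TITLE>Pen</TITLE><PRICE>$2.50</PRICE></ITEM>"
           "<PRODUCT><TITLE>Desk</TITLE><COST>$120</COST></PRODUCT></ITEMS>")
index = index_fields(catalog)
# A typed field supports comparison searches such as "price less than $10".
cheap = [p for p in index["Price"] if p < 10]
```

Because the values are stored as numbers rather than text, the comparison is numeric; as strings, "120" would sort before "2.50".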
Document table of contents structure
XML elements define all document structure, but only specific elements define the document hierarchy that belongs in the table of contents. With XIL, certain elements are matched to define the table of contents structure and others to supply its headings.
A table of contents could easily be generated by applying an XSL transformation to a document, but separating this structure from formatting can be valuable to the publishing system. To build a consistent table of contents across data collections with different schemas there needs to be a consistent way to mark the structure that belongs in the table of contents. One could enforce specific elements or attributes either in the original document or a transformation, but it is much cleaner and easier to specify separately with XIL. Each schema needs an indexsheet, but each is not required to use a particular element or attribute.
Content network example
Content networks presented a particular challenge: building a consistent table of contents across multiple sites linked into one network. The latest NextPage Internet publishing products support content networks in which the hierarchical structure from multiple sites across the Internet is merged into one. This integration includes a single table of contents, so the network appears to the user as one site. The same search fields can be used across a content network even when data schemas differ. The ability of XIL to separate search fields and table of contents structure from specific elements plays an important part in bringing the sites together into one.
XIL
This section will discuss a few XIL issues before presenting an example in the next section.
Preserves the original document
XIL is much like XSL in that it uses XSLT to identify elements, but it applies indexing attributes rather than formatting, and it applies them to the original document rather than producing a new one. With XSLT, each element must be explicitly translated or copied over; the markup of all others is lost. By comparison, XIL preserves the original document unless specifically instructed otherwise, and simply performs indexing operations on existing elements. (The exceptions are the hidden and remove objects and the marking of hits for potential hit highlighting.) In other words, an empty XSL stylesheet would discard all of the document's markup (the built-in XSLT 1.0 rules copy only the text through), whereas an empty XIL indexsheet would preserve the original document and index only its plain text, with no fields or anything else special.
Separate structure from indexing attributes
Why not do a simple transformation? We all believe it is critical to separate structure from presentation. I propose that it is also important to separate structure from indexing attributes. Indexsheets could be implemented as a transformation (simply an instance of XSLT). But even if indexing elements or attributes were added to a document, the indexer would still need to identify and process those elements or attributes. By using XSLT-style XPath match rules, the indexer can instead process the original elements directly. Thus, XIL abstracts the indexing process away from specific elements or attributes, and removes the need to transform all documents to a schema that includes indexing elements.
Always use xsl:apply-templates element
In XIL, <xsl:apply-templates/> (originally called xsl:process-children) is always required as an immediate child of the lp:index element. This element represents all of the children of the actual matched element (including the simplest case of plain text). According to XSLT section 5.4, “the xsl:apply-templates instruction processes all of the children of the current node, including text nodes.” [XSLT99] In XSL, text or elements are removed by defining a rule that doesn’t apply templates to children, that is, by simply not including the xsl:apply-templates element. None of the children of the target element (including text) are then processed at all. The problem is that it is far too easy to leave out the xsl:apply-templates element and not notice the text missing from the index. When formatting a document, it is far more obvious that a block of text is missing than that it is present but not indexed. Therefore, consistent with preserving the original document, the decision was made to always require the xsl:apply-templates element and to provide other means to specifically not index the children of an element.
When using the xsl:apply-templates element, template rules are applied in the default straightforward recursive fashion rather than moving text or elements around in the document via select functions. Transformations needed for presentation should only be done when delivering a document, thus preserving the structure of the indexed document. If transformations are needed for document storage and indexing, such as converting to a common schema, it should be done in a preprocessing step separate from indexing. Indexsheets can however solve some schema compatibility problems by mapping multiple elements into the same search field, or by applying the same indexing properties.
Hit transformation
The original document is indexed and stored unchanged, but when it is requested from the list of documents found by a search (the hit list), a transformation can mark hits for potential hit highlighting. However, the new hit elements are probably not in the original document's DTD, if a DTD is even available or processed. Besides the obvious requirement that the result be well formed, something must describe where in the document hit highlights and perhaps hit anchors are desired or even allowed. For example, some elements are not user-viewable data, so they should not be highlighted. Others may be translated to an HTML anchor (A element); since anchors cannot be nested, a hit anchor inside such an element must be postponed until it is valid.
When hit markup is enabled, the added hit elements must be specifically matched by the formatting XSL in order to highlight hits. If a DTD is referenced when formatting the document, it must allow the new hit highlight elements or the document will not validate.
XML, XSL, and XIL
XIL looks a lot like XSL as shown in the following examples. The xsl:template syntax is based on XSLT with the same match attribute based on XPath.
For each XML schema (or DTD) there exists an XSL stylesheet and an XIL indexsheet. Here are samples taken from the XML 99 conference proceedings published with NextPage’s LivePublish. [XML99GCA]
XML sample
<PAPER ID="young" PDF="YES">
<TITLE>Electronic Information Commerce</TITLE>
<TRACK ID="publishing">Publishing with XML</TRACK>
<SESSION ID="publishing-6">Information Commerce</SESSION>
<PRES ID="young" TYPE="PPT">Electronic Information Commerce</PRES>
<AUTHOR ID="YoungRussel">Russel W. Young</AUTHOR>
<SECT>
<TITLE>Introduction</TITLE>
<SUBSECT1>
<TITLE>Electronic Commerce</TITLE>
<PARA>Electronic commerce means...</PARA>
</SUBSECT1>
</SECT>
</PAPER>
XSL simplified sample
<xsl:template match="PAPER/TITLE">
<div class="title"><table>
<xsl:apply-templates/>
</table></div>
</xsl:template>
<xsl:template match="AUTHOR">
<xsl:element name="a"><xsl:attribute name="class">author</xsl:attribute>
<xsl:apply-templates/>
</xsl:element>
</xsl:template>
<xsl:template match="SECT">
<div class="sect"><xsl:apply-templates/></div>
</xsl:template>
<xsl:template match="SECT/TITLE">
<h2><xsl:apply-templates/></h2>
</xsl:template>
Index all elements sample 1
Although it is recommended to customize an indexsheet for a specific schema (DTD), let’s start with an example of the first approach: indexing all text and all elements. Below is a rule that indexes all elements in separate fields. It simply matches every element and indexes each with a default text field named after the element.
<?xml version='1.0'?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/XSL/Transform/1.0"
xmlns:lp="http://www.NextPage.com/ns/indexsheet/2.0" >
<xsl:template match='*'>
<lp:index field-element-name="yes">
<xsl:apply-templates/>
</lp:index>
</xsl:template>
</xsl:stylesheet>
XIL sample 2
(Explained below)
<?xml version='1.0'?>
<xsl:stylesheet case-sensitive="no"
xmlns:xsl="http://www.w3.org/XSL/Transform/1.0"
xmlns:lp="http://www.NextPage.com/ns/indexsheet/2.0" >
<xsl:template match='PAPER/TITLE'>
<lp:index field="PaperTitle" relevance="highest" >
<xsl:apply-templates/>
</lp:index>
</xsl:template>
<xsl:template match='PAPER/AUTHOR'>
<lp:index field="PaperAuthor" hit-anchor="postpone">
<xsl:apply-templates/>
</lp:index>
</xsl:template>
<xsl:template match='PAPER/SESSION'>
<lp:index field="SessionTitle" relevance="higher" >
<xsl:apply-templates/>
</lp:index>
</xsl:template>
<xsl:template match='SECT'>
<lp:index field="Section" toc-section="yes">
<xsl:apply-templates/>
</lp:index>
</xsl:template>
<xsl:template match='SECT/TITLE'>
<lp:index field="SectionTitle" toc-heading="yes" relevance="high" >
<xsl:apply-templates/>
</lp:index>
</xsl:template>
</xsl:stylesheet>
Explanation
Notice that the XIL uses some of the same matches as the XSL but applies indexing rather than formatting. Just as the formatting objects (HTML in this example) in XSL define markup around matched elements, the lp:index element in XIL defines indexing attributes. In both cases, the xsl:apply-templates element represents recursive processing of all children.
For “PAPER/TITLE” a field is applied for searching purposes (a search form is provided to search on paper titles) and the relevance weight is increased since it is the title of the paper.
For “PAPER/AUTHOR” a field is applied for use in a search form and since it will generate an A (anchor/link) element in HTML we must postpone the hit-anchor, if any.
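For “PAPER/SESSION” a field is applied and the relevance weight is raised, though less than for the paper title.
“SECT” is indexed with the “toc-section” attribute, meaning that this element represents structure that should be included in the sub-document Table of Contents.
“SECT/TITLE” is indexed with the “toc-heading” attribute, meaning that this element contains the heading text shown for its enclosing section in the sub-document Table of Contents. It also gets its own field and a raised relevance weight.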
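To show what the toc-section and toc-heading rules amount to, here is a minimal Python sketch that derives a table of contents from a trimmed copy of the XML sample. It is an illustration under stated assumptions, not how the indexer works internally: `build_toc` and the two constants are hypothetical names, and the sketch handles only a flat list of sections.

```python
import xml.etree.ElementTree as ET

# Rules mirroring sample 2: SECT is a toc-section, SECT/TITLE its toc-heading.
TOC_SECTION = "SECT"
TOC_HEADING = "TITLE"

def build_toc(xml_text):
    """Collect the heading text of each toc-section, in document order."""
    toc = []
    for sect in ET.fromstring(xml_text).iter(TOC_SECTION):
        heading = sect.find(TOC_HEADING)  # direct child only, like SECT/TITLE
        if heading is not None:
            toc.append((heading.text or "").strip())
    return toc

paper = """<PAPER ID="young" PDF="YES">
<TITLE>Electronic Information Commerce</TITLE>
<SECT>
<TITLE>Introduction</TITLE>
<SUBSECT1>
<TITLE>Electronic Commerce</TITLE>
<PARA>Electronic commerce means...</PARA>
</SUBSECT1>
</SECT>
</PAPER>"""
toc = build_toc(paper)
```

Only "Introduction" is collected. The paper title and the SUBSECT1 title are also TITLE elements, but neither is a direct child of a SECT, so the match on the parent is what keeps them out of the table of contents.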
Indexing element attribute values
So far we have looked at indexing text nodes. Attribute values can be indexed with the lp:index-attribute indexing object. This example will index the ID attribute of TRACK elements in a TRACK-ID field.
<xsl:template match='TRACK'>
<lp:index-attribute name="ID" field="TRACK-ID"/>
</xsl:template>
XIL can specify which attribute to index and even name the field by the value of another. This rule indexes the value of each content attribute in any META element with a field named after the META attribute “name.” The following is an example for a source document with META elements.
<xsl:template match='META'>
<lp:index-attribute name="content" field-name-attribute="name"/>
</xsl:template>
XIL can use the value of an attribute to name a field for indexing an element. This rule will index each SPAN element with a field named by the value of the class attribute. The following is an example for a source document with SPAN elements.
<!-- a SPAN with class="SpanClass" is indexed with field "SpanClass" -->
<xsl:template match='SPAN'>
<lp:index field-name-attribute="class">
<xsl:apply-templates/>
</lp:index>
</xsl:template>
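The effect of naming a field from an attribute value, as in the META rule above, can be sketched in Python. The `index_by_attribute` helper and the sample page are assumptions made for illustration; the sketch only mimics the outcome of indexing one attribute's value in a field named by another attribute.

```python
import xml.etree.ElementTree as ET

def index_by_attribute(xml_text, tag, value_attr, field_name_attr):
    """Index `value_attr` of each `tag` element in a field named by `field_name_attr`."""
    fields = {}
    for elem in ET.fromstring(xml_text).iter(tag):
        field = elem.get(field_name_attr)
        value = elem.get(value_attr)
        # Elements missing either attribute contribute nothing to the index.
        if field is not None and value is not None:
            fields.setdefault(field, []).append(value)
    return fields

page = ('<HTML><HEAD>'
        '<META name="author" content="Russel W. Young"/>'
        '<META name="keywords" content="XML, indexing"/>'
        '</HEAD></HTML>')
meta_fields = index_by_attribute(page, "META", "content", "name")
```

A single rule covers every META name that appears in the source, which is the point of the feature: the field names live in the data, so no rule has to be written per field.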
Protecting valuable markup from unauthorized reuse
A client side XSL transformation is insufficient to protect valuable markup from unauthorized reuse (because full XML is sent across the wire and the XSL transformation occurs locally). An additional server side XSL transformation could be used to remove markup not needed for formatting. Alternatively, XIL allows the use of an element for searching purposes only because the element tags are removed from the document rather than just not being displayed. Thus you leverage value-added metadata for searching but protect it from unauthorized reuse.
Which elements or text nodes should be indexed, but not appear in the final document sent to the user? Often, markup is the value added to an otherwise plain text “public domain” document. Several publishing companies gather public domain data and add value through organization, categorization, and other valuable markup. The markup must be there for indexing to improve searching ability, but if it is sent in the document request someone could crawl the site and steal valuable markup. If the XML is converted to HTML on the server (preserving only formatting) the markup is protected, but now the document cannot be as easily reformatted on the client side. Some elements are of course required for presentation, but many metadata elements (such as keywords or categories) may not be required once the document has been found.
Personal Note: This feature of XIL is a controversial issue among my colleagues at NextPage. Many argue that this transformation should be separate from indexing attributes. For the most part I agree, and it may be removed from XIL. But I find it useful for security reasons because it removes valuable markup at the earliest opportunity while still allowing proper indexing. I present it here because it almost always prompts an interesting discussion.
Case-insensitive comparisons
The XSLT specification lists “case-insensitive comparisons” as one of the “Features under Consideration for Future Versions of XSLT” (Appendix G). However, since indexsheets are also used on HTML, one of the most common problems was the case-sensitive nature of XSLT. So even though the specification warns that extensions “must not change the behavior of XSLT elements and functions defined in this document” (XSL Transformations (XSLT) Version 1.0, section 2.2 [XSLT99]), I found case-insensitive comparison to be an absolute requirement. To enable it, add case-sensitive="no" as an attribute of the xsl:stylesheet element, as in sample 2 above.
Indexing objects
The primary set of indexing objects for XIL as of March 2000 is listed below. The most important objects have already been described and are repeated here for completeness. Examples are often given for common HTML elements since these are well known.
Applied to element
Indexing objects that apply to the whole element in the normal XSLT fashion. Because an element includes all children, applied to element means applied to all children and particularly text nodes.
field:
Name of search field to apply.
field-name-attribute:
Use value of specified attribute as name of field to apply. The field name is specified in the source document rather than in the indexsheet. A field attribute may be added to an element in source data, or an existing attribute may be used, such as the class attribute common to SPAN and DIV elements. Without the field-name-attribute feature, a separate rule would need to be written for each unique field applied, which could result in far too many rules to manage well.
toc-heading:
(yes|no|HTML) Mark elements which contain heading text for the Table of Contents (TOC). The default is "no". For the value "yes", the matched element must be a child of an element which was indexed with the attribute toc-section="yes". The value "HTML" is a special case for HTML and generates TOC structure from H1..H6 elements, since HTML only marks headings and not the whole structure.
index:
(yes|no) The default is "yes". Set index="no" to not index the children of an element.
hit-anchor:
(yes|no|postpone) The default "yes" means allow the decoder to drop a hit anchor within this element. This attribute is used for elements such as the links <A HREF=…> in HTML, which do not allow an anchor <A NAME=…> within them. The value "no" ignores the hit anchor completely, and the value "postpone" postpones the hit anchor until one is allowed. For example, the default HTML behavior is to postpone the hit anchor, and therefore the anchor is usually placed just after the end tag.
hit-hilite:
(yes|no) The default "yes" means allow the decoder to highlight hits within this element. This attribute is used for elements such as <HEAD> in HTML, which is not user-viewable data and so should not be highlighted.
hidden:
(yes|no) The default is "no". Text within the element is for indexing only. Hidden text is indexed but removed from the final document for display. Rather than use this attribute, it is usually better to do these transformations with XSL at the display level, unless it must be done at a lower level for security reasons.
relevance:
(normal|high|higher|highest) The default is "normal". Adjusts the relevance weight for indexed terms within the element. For example, text within titles, headings, and keyword lists is usually weighted higher than other text.
Applies only to tag
Indexing objects that affect only the begin and end tags rather than the whole element; they do not apply to the children of the element, including text.
break-word:
(yes|no) The default is "no".
Used to explicitly break words at the begin and end tags, because some element tags break words for indexing purposes and some don't. If the text has whitespace around all words in addition to tags, this option is not needed. However, if the stylesheet or browser "knows" that a specific tag breaks words even with no surrounding whitespace, the text will display correctly, but the indexer has no way of knowing whether a tag breaks words unless the break-word attribute is specified. For example, in:
<BigFont>A</BigFont> is for <BigFont>A</BigFont>pple
the last word is indexed as one term: Apple. However, there are times when it is desirable to have tags break words. For example:
<Letter>A</Letter><Word>Apple</Word>
<Letter>B</Letter><Word>Bat</Word>
<Letter>C</Letter><Word>Cat</Word>
This example is indexed as Aapple, Bbat, Ccat. To have the text indexed as A Apple, B Bat, C Cat, you would set the break-word attribute equal to "yes" for the lp:index element in the rule that matches on the Word element.
remove:
(yes|no) Remove the tag from the document. Used with HTML to remove custom tags, or to protect valuable markup for security reasons.
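The break-word behavior described above can be sketched in a few lines of Python. This is a simplified illustration, not the indexer's tokenizer: the `index_terms` helper and its `break_word_tags` parameter are hypothetical, the tag pattern is deliberately naive, and case folding of terms is left out.

```python
import re

def index_terms(markup, break_word_tags=()):
    """Strip tags and split into terms; tags in `break_word_tags` act as word breaks."""
    def replace_tag(match):
        name = match.group(1)
        # A break-word tag becomes whitespace; any other tag simply disappears.
        return " " if name in break_word_tags else ""
    text = re.sub(r"</?([A-Za-z][\w-]*)[^>]*>", replace_tag, markup)
    return text.split()

sample = "<Letter>A</Letter><Word>Apple</Word>"
joined = index_terms(sample)                              # tags do not break words
split = index_terms(sample, break_word_tags={"Word"})     # Word tags break words
big_font = index_terms("<BigFont>A</BigFont> is for <BigFont>A</BigFont>pple")
```

Without the break-word setting the letter and the word run together into one term, while marking Word as a break-word tag yields the two terms separately. The BigFont line shows why the default has to be "no": there the tag must not split "Apple".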
Indexing element attribute values
lp:index-attribute element attributes

name:
Name of attribute in selected element. The value of the specified attribute will be indexed.
field:
Name of field to apply to the value of the attribute specified in the name attribute above.
field-name-attribute:
Use value of specified attribute as name of field to apply. As above, the field is applied to the value of the attribute specified in the name attribute.
Conclusion
This paper has shown why indexing should be determined per element type rather than the same for all elements. The theory, syntax, and examples of XIL “Indexsheets” have been presented. I have tried to show how XIL Indexsheets can be a very powerful tool for information publishing.
Since I'm sure that NextPage is not the only company that has faced the need for this functionality, I am interested in hearing how others have solved the same problem. Currently there is only one implementation of XIL indexsheets. However, with input and collaboration from other organizations, I think XIL could be developed into a useful standard. I would be interested in hearing from organizations or individuals interested in pursuing research in this area.
Bibliography
[XSLT99] XSL Transformations (XSLT) Version 1.0, W3C Recommendation 16 November 1999, http://www.w3.org/TR/1999/REC-xslt-19991116
[XPath99] XML Path Language (XPath) Version 1.0, W3C Recommendation 16 November 1999, http://www.w3.org/TR/1999/REC-xpath-19991116
[XSL00] Extensible Stylesheet Language (XSL) Version 1.0, W3C Working Draft 27 March 2000, http://www.w3.org/TR/2000/WD-xsl-20000327/
[XML99GCA] XML 99 GCA conference proceedings, online version published with NextPage’s LivePublish, http://xml99.nextpage.com/
[LivePublish2] NextPage LivePublish, http://www.nextpage.com/products/livepublish/