Indexsheets - the "Extensible Indexing Language" (XIL) defines indexing based on XSLT/XPath
Given a well marked-up XML document, we can format XML elements with
XSL, but how are these elements indexed? This paper demonstrates why indexing
should be determined per element type rather than the same for all elements.
Then I will present a system based on XSLT called XIL for creating “Indexsheets,”
which are like stylesheets for indexing. XIL is a rule-based system based on
XPath and part of XSLT. An indexsheet is much like an XSL stylesheet and uses
XSLT to identify elements; however, rather than applying formatting with
“formatting objects,” it applies indexing rules with “indexing objects.”
Since selection of elements is based on XSLT/XPath, the syntax and use are
both familiar and standard. XIL also separates structure from indexing attributes,
just as XSL separates structure from presentation.
Introduction
Given a well marked-up XML document, we can format XML elements with
XSL, but how are these elements indexed? This paper demonstrates why indexing
should be determined per element type rather than the same for all elements.
Once this need is established, I will present a system based on XSLT called
XIL for creating “Indexsheets.” Combined with rich XML markup,
Indexsheets become very powerful tools for information publishing.
Indexsheets, written in the “Extensible Indexing Language” or XIL,
are essentially stylesheets for indexing. XIL is a rule-based system based
on XPath and part of XSLT. It was originally developed based on an early working
draft of XSL. An indexsheet is much like an XSL stylesheet and uses XSLT to
identify elements; however, rather than applying formatting with “formatting
objects,” it applies indexing rules with “indexing objects.” Therefore,
just as XSL can potentially format each element type or even a single element
instance differently, indexsheets can index each according to specified indexing
attributes.
Since selection of elements is based on XSLT/XPath, the syntax and use
are both familiar and standard. The XSL Transformations (XSLT) W3C Recommendation
says “XSLT is also designed to be used independently of XSL …
for the kinds of transformations that are needed when XSLT is used as part
of XSL” [XSLT99]. Directly applying indexing properties
to elements is an example of a different way XSLT can be used. Although XIL
is much like XSLT, it differs in that it applies indexing attributes
to the original document rather than producing a new document.
HTML
In today's publishing environment, HTML as a source format must still
be supported. Therefore, XIL was designed to handle HTML as well. Actually,
XIL works very well for HTML as long as the elements being matched are well
formed and the rest of the document is reasonably clean.
Why index elements differently
When I first approached the problem of indexing XML and HTML, my first
thought was to index all text and all elements. However, particularly with
HTML and its mess of formatting elements, I soon found out that blindly indexing
all elements was not the best approach.
We could simply index all the text and elements and search for a term
within a specific element, but this is not sufficient because the contents
of all elements cannot simply be indexed as plain text. The contents of element
types are different, which is why they are marked up in the first place. For
example, is text in a regular paragraph as relevant for searching purposes
as text in a heading? Although a user could specifically search for a term
within a title element, a full-text search would not naturally weight a term
within a title any higher than the same term found within a regular paragraph,
or even a footnote. Headings, abstracts, and keywords are examples of elements
that should usually be weighted higher than other elements.
Many elements contain simple text and therefore a simple word index
would be appropriate. However, the power of XML is to describe data with specific
element types. The type of data contained in each element type is often different.
For example, an element may actually contain a date, a price, a part number,
etc. Do we want to be able to do comparisons of dates? Or search for a price
less than $22? Are there some elements or text which should not be indexed?
Are there some elements or text that should be indexed, but then removed for
security reasons?
Index by fields
A field is a set of terms grouped together for indexing and searching
purposes. A field can be applied to text contained within an element or an
attribute value. This provides separation of searchable fields from element
names or a single element. Although fields can be named after elements, using
fields is more flexible than being limited to indexing by element names. A
single field instance can map to the contents of one or more elements or the
value of one or more attributes. Fields are often used to group elements from
different document types together in the same search set. For example, this
is especially useful when data is coming from multiple sources that use different
element names, but share the same data type, and should logically be searched
together. Fields are nested along with element structure.
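As a sketch of how one field might span two schemas, the rules below map
differently named elements into a single search field. The ARTICLE/BYLINE
element and the "Author" field name are hypothetical, chosen only for
illustration; the lp:index syntax is the one used in the samples later in
this paper, and the two rules could just as well live in separate
indexsheets, one per schema.

```xml
<!-- Sketch only: ARTICLE/BYLINE and the "Author" field are hypothetical names -->
<xsl:template match='PAPER/AUTHOR'>
  <lp:index field="Author">
    <xsl:apply-templates/>
  </lp:index>
</xsl:template>
<!-- A second schema names the same kind of data differently -->
<xsl:template match='ARTICLE/BYLINE'>
  <lp:index field="Author">
    <xsl:apply-templates/>
  </lp:index>
</xsl:template>
```

A search on the Author field would then cover both document types without
either schema having to change.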
A field name can be given explicitly, or XIL can use an element
name or the value of an element attribute to name a field. An attribute value
could even be indexed in a field named after the value of another attribute.
Fields can be typed to allow for comparison searching such as a price
less than $10. The field types include plain text, integers, real numbers,
date, and time.
Document table of contents structure
XML elements define all document structure, but only specific elements
define the document hierarchy that belongs in the table of contents. With XIL,
certain elements are matched to define the table of contents structure and
other elements are matched to supply its headings.
A table of contents could easily be generated by applying an XSL transformation
to a document, but separating this structure from formatting can be valuable
to the publishing system. To build a consistent table of contents across data
collections with different schemas there needs to be a consistent way to mark
the structure that belongs in the table of contents. One could enforce specific
elements or attributes either in the original document or a transformation,
but it is much cleaner and easier to specify this separately with XIL. Each
schema needs its own indexsheet, but no schema is required to use a particular
element or attribute.
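For illustration, an indexsheet for a second, hypothetical schema could mark
its own elements as table of contents structure with the same toc-section and
toc-heading attributes used in the XIL sample later in this paper. CHAPTER and
HEAD are assumed element names, not part of any schema discussed here.

```xml
<!-- Sketch only: CHAPTER and HEAD are hypothetical element names -->
<xsl:template match='CHAPTER'>
  <!-- This element defines a level of table of contents structure -->
  <lp:index toc-section="yes">
    <xsl:apply-templates/>
  </lp:index>
</xsl:template>
<xsl:template match='CHAPTER/HEAD'>
  <!-- This element supplies the heading text for its parent section -->
  <lp:index toc-heading="yes">
    <xsl:apply-templates/>
  </lp:index>
</xsl:template>
```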
Content network example
Content networks posed a particular challenge: building a consistent
table of contents across multiple sites linked into one network. The
latest NextPage Internet publishing products support content networks in which
the hierarchical structures of multiple sites across the Internet are merged
into one. This integration includes a single table of contents, so the network
appears to the user as one site. The same search fields can be used across a
content network even when data schemas differ. The ability of XIL to separate
search fields and table of contents structure from specific elements plays an
important part in bringing the sites together into one.
XIL
This section will discuss a few XIL issues before presenting an example
in the next section.
Preserves the original document
XIL is much like XSL in that it uses XSLT to identify elements; but rather
than applying formatting and producing a new document, it applies indexing
attributes to the original document. With XSLT, each element must be translated
or copied over; all others are lost. By comparison, XIL preserves the original
document unless specifically instructed otherwise, and simply performs indexing
operations on existing elements. (The exceptions are the hidden and remove
attributes and the marking of hits for potential hit highlighting.) In other
words, an empty XSL stylesheet would discard all of the document's markup (the
built-in XSLT rules copy only the text), whereas an empty XIL indexsheet would
still preserve the original document, indexing only the plain text with no
fields or anything else special.
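As a minimal sketch of that last case, an indexsheet with no rules at all
might look like the following. The namespace declarations are copied from the
samples later in this paper; by the description above, such a sheet would
leave the document intact and index its text without fields.

```xml
<?xml version='1.0'?>
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/XSL/Transform/1.0"
  xmlns:lp="http://www.NextPage.com/ns/indexsheet/2.0">
  <!-- No rules: the document is preserved and only plain text is indexed -->
</xsl:stylesheet>
```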
Separate structure from indexing attributes
Why not do a simple transformation? We all believe it is critical to
separate structure from presentation. I propose that it is also important
to separate structure from indexing attributes. Indexsheets could be implemented
as a transformation (simply an instance of XSLT). But even if indexing elements
or attributes were added to a document, the indexer would still need to identify
and process those elements or attributes. By using XSLT-style XPath match
rules instead, the indexer can process the original elements directly. Thus,
XIL abstracts the indexing process away from specific elements or attributes,
and avoids the need to transform all documents to a schema that includes
indexing elements.
Always use the xsl:apply-templates element
In XIL, <xsl:apply-templates/> (originally called xsl:process-children)
is always required as an immediate child of the lp:index element. This element
represents all of the children of the actual matched element (including the
simplest case of plain text). According to XSLT section 5.4, “the
xsl:apply-templates instruction processes all of the children of the current
node, including text nodes” [XSLT99]. In XSL, text or elements are removed
by defining a rule that doesn’t apply templates to children, simply by not
including the xsl:apply-templates element. None of the children of the target
element (including text) are then processed at all. The problem
is that it is far too easy to leave out the xsl:apply-templates element and
not notice the text missing from the index. When formatting a document, a
missing block of text is far more obvious than text that is present but not
indexed. Therefore, consistent with preserving the original
document, the decision was made to always require the xsl:apply-templates
element and to provide other means to specifically not index children of an
element.
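One such means is the index attribute listed under “Indexing objects” below.
As a hedged sketch, a rule like the following keeps the required
xsl:apply-templates child while asking the indexer to skip the children of
the matched element. The choice of the HTML SCRIPT element is only an
illustration.

```xml
<!-- Sketch only: SCRIPT is an illustrative choice of element -->
<xsl:template match='SCRIPT'>
  <!-- index="no" asks the indexer not to index the children;
       xsl:apply-templates is still present, as XIL requires -->
  <lp:index index="no">
    <xsl:apply-templates/>
  </lp:index>
</xsl:template>
```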
When using the xsl:apply-templates element, template rules are applied
in the default straightforward recursive fashion rather than moving text or
elements around in the document via select functions. Transformations needed
for presentation should only be done when delivering a document, thus preserving
the structure of the indexed document. If transformations are needed for document
storage and indexing, such as converting to a common schema, it should be
done in a preprocessing step separate from indexing. Indexsheets can however
solve some schema compatibility problems by mapping multiple elements into
the same search field, or by applying the same indexing properties.
Hit transformation
The original document is indexed and stored unchanged, but when requested
as part of a list of documents found by a search (hit list) a transformation
can occur to mark hits for potential hit highlighting. However, the new hit
elements are probably not in the original document's DTD, if a DTD is even
available or processed. Besides the obvious requirement that the result be well
formed, something must describe where in the document hit highlights and
perhaps hit anchors are desired or even allowed. For example, some elements are
not user-viewable data, so they should not be highlighted. Other elements may
be translated to an HTML anchor (A element); since anchors cannot be nested, a
hit anchor inside such an element must be postponed until it is valid.
When hit markup is enabled, the added hit elements must be specifically
matched by the formatting XSL to highlight hits. If a DTD is referenced when
formatting the document, it must allow the new hit highlight elements or the
document will not be considered valid.
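A sketch of how an indexsheet for HTML might express these two constraints,
using the hit-hilite and hit-anchor attributes described under “Indexing
objects” below. The rules are illustrative and are not taken from a shipped
indexsheet.

```xml
<!-- HEAD is not user-viewable data, so hits inside it are not highlighted -->
<xsl:template match='HEAD'>
  <lp:index hit-hilite="no">
    <xsl:apply-templates/>
  </lp:index>
</xsl:template>
<!-- Anchors cannot be nested, so a hit anchor inside a link is postponed -->
<xsl:template match='A'>
  <lp:index hit-anchor="postpone">
    <xsl:apply-templates/>
  </lp:index>
</xsl:template>
```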
XML, XSL, and XIL
XIL looks a lot like XSL as shown in the following examples. The xsl:template
syntax is based on XSLT with the same match attribute based on XPath.
For each XML schema (or DTD) there exists an XSL stylesheet and an XIL
indexsheet. Here are samples taken from the XML 99 conference proceedings
published with NextPage’s LivePublish [XML99GCA].
XML sample
<PAPER ID="young" PDF="YES">
  <TITLE>Electronic Information Commerce</TITLE>
  <TRACK ID="publishing">Publishing with XML</TRACK>
  <SESSION ID="publishing-6">Information Commerce</SESSION>
  <PRES ID="young" TYPE="PPT">Electronic Information Commerce</PRES>
  <AUTHOR ID="YoungRussel">Russel W. Young</AUTHOR>
  <SECT>
    <TITLE>Introduction</TITLE>
    <SUBSECT1>
      <TITLE>Electronic Commerce</TITLE>
      <PARA>Electronic commerce means...</PARA>
    </SUBSECT1>
  </SECT>
</PAPER>
XSL simplified sample
<xsl:template match="PAPER/TITLE">
  <div class="title"><table>
    <xsl:apply-templates/>
  </table></div>
</xsl:template>
<xsl:template match="AUTHOR">
  <xsl:element name="a"><xsl:attribute name="class">author</xsl:attribute>
    <xsl:apply-templates/>
  </xsl:element>
</xsl:template>
<xsl:template match="SECT">
  <div class="sect"><xsl:apply-templates/></div>
</xsl:template>
<xsl:template match="SECT/TITLE">
  <h2><xsl:apply-templates/></h2>
</xsl:template>
Index all elements sample 1
Although it is recommended to customize an indexsheet for a specific
schema (DTD), let’s start with an example of the first approach: indexing
all text and all elements. The rule below matches every element and indexes
each in a default text field named after the element.
<?xml version='1.0'?>
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/XSL/Transform/1.0"
  xmlns:lp="http://www.NextPage.com/ns/indexsheet/2.0">
  <xsl:template match='*'>
    <lp:index field-element-name="yes">
      <xsl:apply-templates/>
    </lp:index>
  </xsl:template>
</xsl:stylesheet>
XIL sample 2
(Explained below)
<?xml version='1.0'?>
<xsl:stylesheet case-sensitive="no"
  xmlns:xsl="http://www.w3.org/XSL/Transform/1.0"
  xmlns:lp="http://www.NextPage.com/ns/indexsheet/2.0">
  <xsl:template match='PAPER/TITLE'>
    <lp:index field="PaperTitle" relevance="highest">
      <xsl:apply-templates/>
    </lp:index>
  </xsl:template>
  <xsl:template match='PAPER/AUTHOR'>
    <lp:index field="PaperAuthor" hit-anchor="postpone">
      <xsl:apply-templates/>
    </lp:index>
  </xsl:template>
  <xsl:template match='PAPER/SESSION'>
    <lp:index field="SessionTitle" relevance="higher">
      <xsl:apply-templates/>
    </lp:index>
  </xsl:template>
  <xsl:template match='SECT'>
    <lp:index field="Section" toc-section="yes">
      <xsl:apply-templates/>
    </lp:index>
  </xsl:template>
  <xsl:template match='SECT/TITLE'>
    <lp:index field="SectionTitle" toc-heading="yes" relevance="high">
      <xsl:apply-templates/>
    </lp:index>
  </xsl:template>
</xsl:stylesheet>
Explanation
Notice that the XIL uses some of the same matches as the XSL but applies
indexing rather than formatting. Just as the formatting objects (HTML in this
example) in XSL define markup around matched elements, the lp:index element
in XIL defines indexing attributes. In both cases, the xsl:apply-templates
element represents recursive processing of all children.
For “PAPER/TITLE” a field is applied for searching purposes
(a search form is provided to search on paper titles) and the relevance weight
is increased since it is the title of the paper.
For “PAPER/AUTHOR” a field is applied for use in a search
form and, since the element will generate an A (anchor/link) element in HTML,
any hit anchor must be postponed.
“SECT” is indexed with the “toc-section” attribute,
meaning that this element represents structure that should be included in
the sub-document Table of Contents.
“SECT/TITLE” is indexed with the “toc-heading”
attribute, meaning that this element contains the heading text shown for its
section in the sub-document Table of Contents. Its relevance weight is also
raised because it is a heading.
Indexing element attribute values
So far we have looked at indexing text nodes. Attribute values can be
indexed with the lp:index-attribute indexing object. This example will index
the ID attribute of TRACK elements in a TRACK-ID field.
<xsl:template match='TRACK'>
  <lp:index-attribute name="ID" field="TRACK-ID"/>
</xsl:template>
XIL can specify which attribute to index and even name the field by
the value of another. This rule indexes the value of each content attribute
in any META element with a field named after the META attribute “name.”
The following is an example for a source document with META elements.
<xsl:template match='META'>
  <lp:index-attribute name="content" field-name-attribute="name"/>
</xsl:template>
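To make the effect concrete, consider a hypothetical source fragment such as
the one below. The attribute values are invented for illustration, and the
comments describe what the META rule above would be expected to do with them.

```xml
<!-- Hypothetical source fragment; names and values are illustrative -->
<META name="keywords" content="XML, indexing, XSLT"/>
<META name="author" content="Jane Doe"/>
<!-- Expected result of the META rule:
     "XML, indexing, XSLT" is indexed in a field named "keywords"
     "Jane Doe" is indexed in a field named "author" -->
```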
XIL can use the value of an attribute to name a field for indexing an
element. This rule will index each SPAN element with a field named by the
value of the class attribute. The following is an example for a source document
with SPAN elements.
<!-- e.g. <SPAN class="SpanClass"> is indexed with field "SpanClass" -->
<xsl:template match='SPAN'>
  <lp:index field-name-attribute="class">
    <xsl:apply-templates/>
  </lp:index>
</xsl:template>
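A hypothetical source fragment shows how a single SPAN rule of this kind can
stand in for many field-specific rules. The class values and text are invented
for illustration.

```xml
<!-- Hypothetical source fragment; class values are illustrative -->
<P>Order <SPAN class="PartNumber">A-1138</SPAN> ships on
<SPAN class="ShipDate">2000-03-15</SPAN>.</P>
<!-- Expected result of the SPAN rule:
     "A-1138" is indexed in a field named "PartNumber"
     "2000-03-15" is indexed in a field named "ShipDate" -->
```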
Protecting valuable markup from unauthorized reuse
A client-side XSL transformation is insufficient to protect valuable
markup from unauthorized reuse (because the full XML is sent across the wire and
the XSL transformation occurs locally). An additional server-side XSL
transformation could be used to remove markup not needed for formatting.
Alternatively, XIL allows an element to be used for searching purposes only:
the element tags are removed from the document rather than just not being
displayed. Thus you leverage value-added metadata for searching but protect it
from unauthorized reuse.
Which elements or text nodes should be indexed, but not appear in the
final document sent to the user? Often, markup is the value added to an otherwise
plain text “public domain” document. Several publishing companies
gather public domain data and add value through organization, categorization,
and other valuable markup. The markup must be there for indexing to improve
searching ability, but if it is sent in the document request someone could
crawl the site and steal valuable markup. If the XML is converted to HTML
on the server (preserving only formatting) the markup is protected, but now
the document cannot be as easily reformatted on the client side. Some elements
are of course required for presentation, but many metadata elements (such
as keywords or categories) may not be required once the document has been
found.
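As a sketch of how this might look in an indexsheet, the first rule below
indexes keyword text in a field but hides it from the delivered document, and
the second keeps the text while stripping the tags. KEYWORDS, CATEGORY, and
the field names are hypothetical. The hidden and remove attributes are those
listed under “Indexing objects” below; placing remove on lp:index is an
assumption based on how break-word is described there.

```xml
<!-- Sketch only: element and field names are hypothetical -->
<!-- Text is indexed in the Keywords field but removed from the displayed document -->
<xsl:template match='KEYWORDS'>
  <lp:index field="Keywords" hidden="yes">
    <xsl:apply-templates/>
  </lp:index>
</xsl:template>
<!-- Text stays in the document; only the CATEGORY tags are removed
     (assumes remove is set on lp:index, like break-word) -->
<xsl:template match='CATEGORY'>
  <lp:index field="Category" remove="yes">
    <xsl:apply-templates/>
  </lp:index>
</xsl:template>
```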
Personal Note: This feature of XIL is a controversial issue among my
colleagues at NextPage. Many argue that this transformation should be separate
from indexing attributes. For the most part I agree, and it may be removed
from XIL. But I find it useful for security reasons because it removes valuable
markup at the earliest opportunity but still allows proper indexing. I present
it here because it almost always prompts an interesting discussion.
Case-insensitive comparisons
The XSLT specification lists “case-insensitive comparisons”
as one of the “Features under Consideration for Future Versions of XSLT”
(Appendix G). However, since indexsheets are also used on HTML, one of the most
common problems was the case-sensitive nature of XSLT. So even though the
specification warns that extensions “must not change the behavior of
XSLT elements and functions defined in this document” (XSL Transformations
(XSLT) Version 1.0, section 2.2 [XSLT99]), I found
case-insensitive comparison to be an absolute requirement. The syntax for
case-insensitive comparisons is to add case-sensitive="no" as an attribute of
the xsl:stylesheet element.
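For HTML sources, where the same element may appear as H1 or h1, a sketch of
the intended use follows. The "Heading" field name is hypothetical; the
namespace declarations are copied from the samples above.

```xml
<?xml version='1.0'?>
<xsl:stylesheet case-sensitive="no"
  xmlns:xsl="http://www.w3.org/XSL/Transform/1.0"
  xmlns:lp="http://www.NextPage.com/ns/indexsheet/2.0">
  <!-- With case-sensitive="no", this one rule is meant to match
       both <H1> and <h1>; "Heading" is a hypothetical field name -->
  <xsl:template match='h1'>
    <lp:index field="Heading" relevance="high">
      <xsl:apply-templates/>
    </lp:index>
  </xsl:template>
</xsl:stylesheet>
```

Without the attribute, separate rules (or a union pattern such as 'H1 | h1')
would be needed to cover each spelling.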
Indexing objects
The primary set of indexing objects for XIL as of March 2000 is listed
below. The most important objects have already been described and are repeated
here for completeness. Examples are often given for common HTML elements since
these are well known.
Applied to element
The following indexing objects apply to the whole element in the normal XSLT
fashion. Because an element includes all of its children, “applied to element”
means applied to all children and particularly to text nodes.
field:
Name of search field to apply.
field-name-attribute:
Use the value of the specified attribute as the name of the field to apply.
The field name is thus specified in the source document rather than in the
indexsheet. A field attribute may be added to an element in the source data,
or an existing attribute may be used, such as the class attribute common to
SPAN and DIV elements. Without the field-name-attribute feature a separate
rule would need to be written for each unique field applied, which could
result in far too many rules to manage well.
toc-heading:
(yes|no|HTML) The default is "no".
Marks elements that contain heading text for the Table of Contents (TOC).
For the value "yes", the matched element must be a child of an element
that was indexed with the attribute toc-section="yes". The value "HTML"
is a special case for HTML and generates TOC structure from H1..H6 elements,
since HTML only marks headings and not the whole structure.
index:
(yes|no) The default is "yes".
Set index="no" to not index the children of an element.
hit-anchor:
(yes|no|postpone) The default is "yes".
The value "yes" allows the decoder to drop a hit anchor within this element.
This attribute is used for elements such as links (<A HREF=…>) in HTML, which
do not allow an anchor (<A NAME=…>) within them. The value "no" ignores the
hit anchor completely and the value "postpone" postpones the hit anchor until
one is allowed. For example, the default HTML behavior is to postpone the hit
anchor, and therefore the anchor is usually placed just after the end tag.
hit-hilite:
(yes|no) The default is "yes".
The value "yes" allows the decoder to highlight hits within this element.
This attribute is used for elements such as <HEAD> in HTML, which is not
user-viewable data and so should not be highlighted.
hidden:
(yes|no) The default is "no".
Text within the element is for indexing only. Hidden text is indexed but
removed from the final document for display. Rather than use this attribute,
it is usually better to do these transformations with XSL at the display level
unless it must be done at a lower level for security reasons.
relevance:
(normal|high|higher|highest) The default is "normal".
Adjusts the relevance weight for indexed terms within the element. For
example, text within titles, headings, and keyword lists is usually weighted
higher than other text.
Applies only to tag
The following indexing objects affect only the begin and end tags rather than
the whole element, which means that they do not apply to the children of the
element, including text.
break-word:
(yes|no) The default is "no".
Used to explicitly break words at the begin and end tags because some
element tags break words for indexing purposes and some don't. If text has
whitespace around all words in addition to tags then this option is not needed.
However, if the stylesheet or browser "knows" that a specific tag breaks words
even with no surrounding whitespace, the text will display correctly, but the
indexer has no way of knowing whether a tag breaks words unless the break-word
attribute is specified. For example, in
<BigFont>A</BigFont> is for <BigFont>A</BigFont>pple
the last word is indexed as one term: Apple. However, there are times when it
is desirable to have tags break words. For example:
<Letter>A</Letter><Word>Apple</Word>
<Letter>B</Letter><Word>Bat</Word>
<Letter>C</Letter><Word>Cat</Word>
This example is indexed as Aapple, Bbat, Ccat. To have the text indexed
as A Apple, B Bat, C Cat, you would set the break-word attribute equal to
"yes" for the lp:index element in the rule that matches on the Word element.
remove:
(yes|no) Remove the tag from the document. Used with HTML to remove custom
tags, or to protect valuable markup for security reasons.
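To make the break-word case above concrete, a rule along these lines would
presumably cause the Letter/Word sample to be indexed as A Apple, B Bat,
C Cat. It is a sketch built from the description above rather than a tested
indexsheet.

```xml
<!-- Sketch: break words at the begin and end tags of each Word element -->
<xsl:template match='Word'>
  <lp:index break-word="yes">
    <xsl:apply-templates/>
  </lp:index>
</xsl:template>
```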
Indexing element attribute values
lp:index-attribute element attributes
name:
Name of attribute in selected element. The value of the specified attribute
will be indexed.
field:
Name of field to apply to the value of the attribute specified in the name
attribute above.
field-name-attribute:
Use the value of the specified attribute as the name of the field to apply.
As above, the field is applied to the value of the attribute specified in the
name attribute.
Conclusion
This paper has shown why indexing should be determined per element type
rather than the same for all elements. The theory, syntax, and examples of
XIL “Indexsheets” have been presented. I have tried to show how
XIL Indexsheets can be a very powerful tool for information publishing.
Since I'm sure that NextPage is not the only company that has faced
the need for this functionality, I am interested in hearing how others have
solved the same problem. Currently there is only one implementation of XIL
indexsheets. However, with input and collaboration from other organizations,
I think XIL could be developed into a useful standard. I would be interested
in hearing from organizations or individuals interested in pursuing research
in this area.
Bibliography