XSLTVM - an XSLT Virtual Machine
Anguel Novoselsky
K Karun
Find


Abstract
The emergence and popularity of XML helps facilitate the development and integration of business and application semantics. However, each enterprise defines their own data elements to better communicate the "meaning" of their data. Translation will therefore be key for interoperability. XSL standards for transformation are sufficient to allow exchange between business vocabularies. Currently there are many XSLT engine implementations. In this paper, we present a novel approach for implementing transformations using XSL. We describe a XSLT Virtual Machine (XSLTVM). XSLTVM is the software implementation of a "CPU" designed to run compiled XSLT code. A concept of virtual machine assumes a compiler compiling XSLT stylesheets to sequence of byte codes or machine instructions for the "XSLT CPU". This approach clearly separates compile-time from run-time computations and specifies an uniform way of data exchange between instructions by defining a common interface areas like stack or pipeline for example. The separation line between compile-time and run-time computations depends on the level of machine instructions. Splitting a heterogeneous, high-level instruction into number of atomic, low-level instructions exposes some run-time checks and allows compiler to take care of them. For example, a general CMP instruction checks operand types at run-time. If it is replaced with few type specific CMPs then compiler checks operand types at compile-time and generates appropriate type casting instructions if needed.

Contents
  1. Introduction
  2. Architecture of XSLTVM
    1. Compiling for XSLTVM
      1. Instruction description
      2. Expression and template compilation
    2. Design goals
    3. Future work
  3. Bibliography

Introduction
XSLis a language for expressing stylesheets. It consists of two parts: a language for transforming XML documents, and an XML vocabulary for specifying formatting semantics. In this paper we discuss the part for transforming XML documents. A transformation expressed in XSLT describes rules for transforming a source tree into a result tree. The transformation is achieved by associating patterns with templates. A pattern is matched against elements in the source tree. A template is instantiated to create part of the result tree.
An XSLTVM is an abstract computing machine. Like a real computing machine, it has an instruction set and manipulates various memory areas at run time. It is reasonably common to implement a programming language using a virtual machine; the best-known virtual machine may be the P-Code machine of UCSD Pascal. The described XSLTVM is an attempt to define an stack oriented VM with low-level set of instructions as a virtual model for XSLT processing. However, the virtual machine does not assume any particular implementation technology, host hardware, or host operating system. It could be a software implementation of a "CPU" designed to run compiled XSLT code. A concept of virtual machine assumes a compiler compiling XSLT stylesheets to sequence of byte codes of machine instructions for "XSLT CPU".
This approach clearly separates compile-time from run-time computations and specifies an uniform way of data exchange between instructions by defining a common interface areas like stack or pipeline for example. The separation line between compile-time and run-time computations depends on the level of machine instructions. Splitting a heterogeneous, high-level instruction into number of atomic, low-level instructions exposes some run-time checks and allows compiler to take care of them. For example, a general CMP instruction checks operand types at run-time. If it is replaced with few type specific CMPs then compiler checks operand types at compile-time and generates appropriate type casting instructions if needed.
Although not explicitly mentioning terms like "virtual machines" the first generation of XSLT processors are implementing variants of quasi high-level instruction set in a form of objects with action functions/methods and as it was pointed out this is not an optimal solution. Also their architecture is usually designed as object-oriented enforcing data encapsulation. As a result data exchange between instructions/objects is not unified so the implementations are forced to use expensive heap memory allocations for temporary results. The first section includes information on the XSLTVM architecture and instructions set. Examples illustrating the compilation of XSLT stylesheet fragments into XSLTVM instructions with more detailed explanations are shown in Section 3. The next section emphasizes the design goals of the proposed VM approach over the current implementations. Finally we discuss the related and future work.
Previous Previous Table of Contents
Architecture of XSLTVM
The XSLT Virtual Machine has a stack based architecture. TheVM stack (is analogous to the stack of a conventional language: it holds local variables and partial results, and plays a part in template invocation and return. Also, most of XSLTVM machine instructions take their operands values from the top of the VM stack and replace them with the result. Stack cell contents types are:
Descriptors are node set descriptors and template descriptors. The node set descriptors are created dynamically at run-time. Each one has fields for the current node, for node set size and index which points to the actual nodes location in a node set stack. Template descriptors are created at compile-time and reside in a template descriptor pool. They are loaded into the VM stack during template call. Each template descriptor has the following fields:
The node set stack is analogous to the VM stack but contains data which size is not known at compiler time. It grows and shrinks in parallel with the VM stack, which simplifies the run time maintenance. XSLTVM has a machine instruction pool for template body instructions. Each template has an entry instruction which is executed first when the template is instantiated. The entry instruction index in the instruction pool is treated as a template address.
XSLT templates like programming language procedures have an associated template stack frame for parameters and local variables. Variables, template parameters and loop indexes are compiled to stack offset addresses, which eliminates run-time search. The frame is allocated into the VM stack during template call with size taken from a template descriptor. The process of frame allocation and template call is explained in more details in the next section. Functions like templates take their parameters from the stack, also.
A separate template pattern pool contains the all of the template patterns. This allows speed optimization techniques like parallel or table-driven pattern matching to be applied on the pattern pool contents. The XSLTVM instruction MATCH works with the pattern pool. It finds the wining pattern and pushes the corresponding template descriptor into the stack. A constant pool is used as a storage for names, literals and constants.
Compiling for XSLTVM
The XSLTVM works with the compiled set of instructions generated by a XSLT compiler. In this section, we discuss formats of a typical XSLT instructions.
Instruction description
With only few exceptions XSLTVM instructions have the following formats:
opcode, mode, operand1 [, operand2 ]
Depending on the mode the operands could be one of following:
Machine instructions are divided into two relatively independent subsets: XSLT and XPath. This allows the XPath subset in a form of XPath engine to be used somewhere else. The XSLT instruction subset includes instructions for template call, for variable and parameter load and store, instructions for element, attribute, comment and processing-instruction creation, conditional instructions, loops, etc. The XPath instruction subset includes instructions for step search, boolean instructions, arithmetic instructions, relational instructions, function call instructions, etc. Some of those instructions and their usage are described in more details in the next section. All XSLTVM pools are created at compile-time. Their sizes don't change at run-time so memory for them is allocated only once during loading. The only dynamic XSLTVM parts which need run time maintenance are the stacks - VM stack and node set stack.
Expression and template compilation
First example illustrates the compilation of an XPath expression including steps and filter sub-expressions.
/sales/domestic[position()=1 and @attr="value"]
is translated to:
CHILD #sales // constant pool index
CHILD #domestic // constant pool index
STORE i // store current node set as a
// loop index
JMP LOOP0011
LOOP0001: INCR i
LOOP0011: FOR_EACH i, ENDL0001 // instruction pool index
LOAD_S i // loads current node
CALL_FUNC #position
COMP #1
JNE LOOP0001
LOAD_S i
LOAD_ATTR #attr
COMP #value
JNE LOOP0001
ADD_NODE i // add 'i' to result node set
JMP LOOP0001
ENDL0001:
CHILD is a step search instruction which replaces the current node set which resides on the top of the stack with the result node set. STORE moves the current node set to the template frame cell allocated by compiler for loop index 'i'. INCR increments the value of 'i'. In a case of node set descriptor it increments current node field. LOAD_S loads single value. If it is a node set it loads the current node. FOR_EACH first operand is node set address, second - loop's end address. This instruction checks if the current node number is less then the set size. If not then jumps to end loop address. ADD_NODE instruction adds the current node to the result node set on the top of the stack if both filter conditions are true.
Next example illustrates how the previous XPath expression is used in template body:
<xsl:template match="doc">
<xsl:param name="file">table.html</xsl:param>
<table href="{$file}">
<xsl:apply-templates select="/sales/domestic[position() =1
and @attr = "value"] />
<xsl:with-param name="file">
file.xsd
</xsl:with-param>
</table>
</xsl:template>
is compiled to:
T0022: TEMPLATE
PARAM #file, #table.html
MK_STAG #mytable //creates <table>
MK_ATTR #href //current fragment is
//<table href=" ">
MK_ATTRVAL file
ADD_NODE _current //creates node set for XPath
.. //expression, evaluate XPath
.. //and as a result, the current
.. //node set is on the top
STORE j //store the XPath node set
JMP LOOP0012
LOOP0002: INCR j
LOOP0012: FOR_EACH j,ENDL0002
MATCH j //match with pattern pool
//matched template descriptor
//is pushed
ALLOC_FRAME j //allocates new template frame
PARAM #file,#table.xsd //stores param value at offset
CALL_TEMP _desc //jumps to template index from
//descriptor
JMP LOOP0002
ENDL0002:
MK_ETAG #table //creates </table>
RETURN
Template starts with PARAM instructions (if any). It checks the template descriptor (offset _desc) if the corresponding parameter (first operand) is already loaded. If not then it loads the parameter from the address specified by the second operand and marks it as done. MK_... instructions create elements, attributes, comments, text and processing-instructions. MK_ATTRVAL places the contents of parameter 'file' as a value of the current output attribute. MATCH takes the node from 'j', finds the winning pattern and loads the copy of template descriptor into the stack. ALLOC_FRAME uses this descriptor to allocate the template frame. It also loads the node specified by the first operand at offset _current. Note that the frame includes the template descriptor which is already in the stack. CALL_TEMP loads the next instruction index as a return address and jumps to the template entry instruction. RETURN jumps to the current frame return address and pops the frame from the stack.
Design goals
As it was pointed out the main design goal was to define an virtual machine as a model for XSLT processing. Also, we consider the following XSLTVM features as an important design improvement over the current generation of XSLT processors:
Once compiled XSLT 'programs' like Java classes can be run many times on different XSLTVM. Also, XSLT 'programs' become implementation language independent. In other words, there should be no difference between code generated from Java XSLT compiler and code from C/C++ XSLT compiler, they both should be able to run on the same XSLTVM.
Future work
Several of the current XSLT processors implement variants of quasi high-level instruction set in a form of objects with action functions/methods. The goal of a virtual machine is to identify a optimal low-level instruction set. As part of future work, we would like to refine the current instruction set based on performance optimizations. In addition, providing a standard binary file format for the compiled XSLT stylesheet would facilitate re-use of the compiler output with different implementations XSLTVMs.
We also plan to add support for XMLSchema datatypes and archetypes, and provide extensibility mechanism in XSLT virtual machine to make callout to other programming or scripting languages.
Previous Previous Table of Contents
Bibliography
[XML]World Wide Web Consortium. Extensible Markup Language (XML) 1.0. W3C Recommendation. See http://www.w3.org/TR/1998/REC-xml-19980210.
[XML Names]World Wide Web Consortium. Extensible Markup Language (XML) 1.0. W3C Recommendation. See http://www.w3.org/TR/1998/REC-xml-19980210
[XSLT]World Wide Web Consortium. XSL Transformations (XSLT). W3C Recommendation. See http://www.w3.org/TR/xslt
[XPATH]World Wide Web Consortium. XML Path Language. W3C Recommendation. See http://www.w3.org/TR/xpath
[JVMS]Tim Lindholm, Frank Yellin. The JavaTM Virtual Machine Specification. See http://java.sun.com/docs/books/vmspec/2nd-edition/html/VMSpecTOC.doc.html
Previous Previous Table of Contents