|
XSLTVM - an XSLT Virtual Machine
|
 |
The emergence and popularity of XML helps facilitate the development
and integration of business and application semantics. However, each enterprise
defines their own data elements to better communicate the "meaning" of their
data. Translation will therefore be key for interoperability. XSL standards
for transformation are sufficient to allow exchange between business vocabularies.
Currently there are many XSLT engine implementations. In this paper, we present
a novel approach for implementing transformations using XSL. We describe a
XSLT Virtual Machine (XSLTVM). XSLTVM is the software implementation of a
"CPU" designed to run compiled XSLT code. A concept of virtual machine assumes
a compiler compiling XSLT stylesheets to sequence of byte codes or machine
instructions for the "XSLT CPU". This approach clearly separates compile-time
from run-time computations and specifies an uniform way of data exchange between
instructions by defining a common interface areas like stack or pipeline for
example. The separation line between compile-time and run-time computations
depends on the level of machine instructions. Splitting a heterogeneous, high-level
instruction into number of atomic, low-level instructions exposes some run-time
checks and allows compiler to take care of them. For example, a general CMP
instruction checks operand types at run-time. If it is replaced with few type
specific CMPs then compiler checks operand types at compile-time and generates
appropriate type casting instructions if needed.
Introduction
XSLis
a language for expressing stylesheets. It consists of two parts: a language
for transforming XML documents, and an XML vocabulary for specifying formatting
semantics. In this paper we discuss the part for transforming XML documents.
A transformation expressed in XSLT describes rules for transforming a source
tree into a result tree. The transformation is achieved by associating patterns
with templates. A pattern is matched against elements in the source tree.
A template is instantiated to create part of the result tree.
An
XSLTVM is an
abstract computing machine. Like a real computing machine, it has an instruction
set and manipulates various memory areas at run time. It is reasonably common
to implement a programming language using a virtual machine; the best-known
virtual machine may be the P-Code machine of UCSD Pascal. The described XSLTVM
is an attempt to define an stack oriented VM with low-level set of instructions
as a virtual model for XSLT processing. However, the virtual machine does
not assume any particular implementation technology, host hardware, or host
operating system. It could be a software implementation of a "CPU" designed
to run compiled XSLT code. A concept of virtual machine assumes a compiler
compiling XSLT stylesheets to sequence of byte codes of machine instructions
for "XSLT CPU".
This approach clearly separates compile-time from run-time computations
and specifies an uniform way of data exchange between instructions by defining
a common interface areas like stack or pipeline for example. The separation
line between compile-time and run-time computations depends on the level of
machine instructions. Splitting a heterogeneous, high-level instruction into
number of atomic, low-level instructions exposes some
run-time checks and allows compiler to take care of them. For example, a general
CMP instruction checks operand types at run-time. If it is replaced with few
type specific CMPs then compiler checks operand types at compile-time and
generates appropriate type casting instructions if needed.
Although not explicitly mentioning terms like "virtual machines" the
first generation of XSLT processors are implementing variants of quasi high-level
instruction set in a form of objects with action functions/methods and as
it was pointed out this is not an optimal solution. Also their architecture
is usually designed as object-oriented enforcing data encapsulation. As a
result data exchange between instructions/objects is not unified so the implementations
are forced to use expensive heap memory allocations for temporary results.
The first section includes information on the XSLTVM architecture and instructions
set. Examples illustrating the compilation of XSLT stylesheet fragments into
XSLTVM instructions with more detailed explanations are shown in Section 3.
The next section emphasizes the design goals of the proposed VM approach over
the current implementations. Finally we discuss the related and future work.
Architecture of XSLTVM
The XSLT Virtual Machine has a stack based architecture. The
VM stack
(is analogous to the stack of a conventional language: it holds local variables
and partial results, and plays a part in template invocation and return. Also,
most of XSLTVM machine instructions take their operands values from the top
of the VM stack and replace them with the result. Stack cell contents types
are:
- node
- descriptor
- boolean
- number
- string
Descriptors are node set descriptors and template descriptors. The node
set descriptors are created dynamically at run-time. Each one
has fields for the current node, for node set size and index which points
to the actual nodes location in a node set stack. Template descriptors are
created at compile-time and reside in a template descriptor pool.
They are loaded into the VM stack during template call. Each template descriptor
has the following fields:
- size: template frame size equal to number of local parameters, plus
number of local variables, plus number of temporaries like loop indexes, plus
three. These three extra stack cells are allocated for the template current
node (referenced as _current in the examples bellow), for template
descriptor index (referenced as _desc) and for the return address.
- address: template instruction pool entry index.
- list of parameters and their offsets in the frame. This list is
used by PARAM instructions for parameter binding.
The node set stack is analogous to the VM stack
but contains data which size is not known at compiler time. It grows and shrinks
in parallel with the VM stack, which simplifies the run time maintenance.
XSLTVM has a machine instruction pool for template
body instructions. Each template has an entry instruction which is executed
first when the template is instantiated. The entry instruction index in the
instruction pool is treated as a template address.
XSLT templates like programming language procedures have an associated template
stack frame for parameters and local variables. Variables, template
parameters and loop indexes are compiled to stack offset addresses, which
eliminates run-time search. The frame is allocated into the VM stack during
template call with size taken from a template descriptor.
The process of frame allocation and template call is explained in more details
in the next section. Functions like templates take their parameters from the
stack, also.
A separate template pattern pool contains the
all of the template patterns. This allows speed optimization techniques like
parallel or table-driven pattern matching to be applied on the pattern pool
contents. The XSLTVM instruction MATCH works with the pattern pool. It finds
the wining pattern and pushes the corresponding template descriptor into the
stack. A constant pool is used as a storage for names,
literals and constants.
Compiling for XSLTVM
The XSLTVM works with the compiled set of instructions generated by
a XSLT compiler. In this section, we discuss formats of a typical XSLT instructions.
Instruction description
With only few exceptions XSLTVM instructions have the following formats:
opcode, mode, operand1 [, operand2 ]
Depending on the mode the operands could be one of following:
- address - stack index
- address - constant pool index
- address - top of the stack.
- address - instruction pool index.
- immediate value
Machine instructions are divided into two relatively independent subsets:
XSLT and XPath. This allows the XPath subset in a form of XPath engine to
be used somewhere else. The XSLT instruction subset includes instructions
for template call, for variable and parameter load and store, instructions
for element, attribute, comment and processing-instruction creation, conditional
instructions, loops, etc. The XPath instruction subset includes instructions
for step search, boolean instructions, arithmetic instructions, relational
instructions, function call instructions, etc. Some of those instructions
and their usage are described in more details in the next section. All XSLTVM
pools are created at compile-time. Their sizes don't change at run-time so
memory for them is allocated only once during loading. The only dynamic XSLTVM
parts which need run time maintenance are the stacks - VM stack and node set
stack.
Expression and template compilation
First example illustrates the compilation of an XPath expression including
steps and filter sub-expressions.
/sales/domestic[position()=1 and @attr="value"]
is translated to:
CHILD #sales // constant pool index
CHILD #domestic // constant pool index
STORE i // store current node set as a
// loop index
JMP LOOP0011
LOOP0001: INCR i
LOOP0011: FOR_EACH i, ENDL0001 // instruction pool index
LOAD_S i // loads current node
CALL_FUNC #position
COMP #1
JNE LOOP0001
LOAD_S i
LOAD_ATTR #attr
COMP #value
JNE LOOP0001
ADD_NODE i // add 'i' to result node set
JMP LOOP0001
ENDL0001:
CHILD is a step search instruction which replaces the current node set
which resides on the top of the stack with the result node set. STORE moves
the current node set to the template frame cell allocated by compiler for
loop index 'i'. INCR increments the value of 'i'. In a case of node set descriptor
it increments current node field. LOAD_S loads single value. If it is a node
set it loads the current node. FOR_EACH first operand is node set address,
second - loop's end address. This instruction checks if the current node number
is less then the set size. If not then jumps to end loop address. ADD_NODE
instruction adds the current node to the result node set on the top of the
stack if both filter conditions are true.
Next example illustrates how the previous XPath expression is used in
template body:
<xsl:template match="doc">
<xsl:param name="file">table.html</xsl:param>
<table href="{$file}">
<xsl:apply-templates select="/sales/domestic[position() =1
and @attr = "value"] />
<xsl:with-param name="file">
file.xsd
</xsl:with-param>
</table>
</xsl:template>
is compiled to:
T0022: TEMPLATE
PARAM #file, #table.html
MK_STAG #mytable //creates <table>
MK_ATTR #href //current fragment is
//<table href=" ">
MK_ATTRVAL file
ADD_NODE _current //creates node set for XPath
.. //expression, evaluate XPath
.. //and as a result, the current
.. //node set is on the top
STORE j //store the XPath node set
JMP LOOP0012
LOOP0002: INCR j
LOOP0012: FOR_EACH j,ENDL0002
MATCH j //match with pattern pool
//matched template descriptor
//is pushed
ALLOC_FRAME j //allocates new template frame
PARAM #file,#table.xsd //stores param value at offset
CALL_TEMP _desc //jumps to template index from
//descriptor
JMP LOOP0002
ENDL0002:
MK_ETAG #table //creates </table>
RETURN
Template starts with PARAM instructions (if any). It checks the template
descriptor (offset _desc) if the corresponding parameter (first operand) is
already loaded. If not then it loads the parameter from the address specified
by the second operand and marks it as done. MK_... instructions create elements,
attributes, comments, text and processing-instructions. MK_ATTRVAL places
the contents of parameter 'file' as a value of the current output attribute.
MATCH takes the node from 'j', finds the winning pattern and loads the copy
of template descriptor into the stack. ALLOC_FRAME uses this descriptor to
allocate the template frame. It also loads the node specified by the first
operand at offset _current. Note that the frame includes the template descriptor
which is already in the stack. CALL_TEMP loads the next instruction index
as a return address and jumps to the template entry instruction. RETURN jumps
to the current frame return address and pops the frame from the stack.
Design goals
As it was pointed out the main design goal was to define an virtual
machine as a model for XSLT processing. Also, we consider the following XSLTVM
features as an important design improvement over the current generation of
XSLT processors:
- XSLTVM doesn't allocate heap memory at run time. The only data memory
allocated is for the virtual machine stacks.
- Variables and parameters are compiled to stack offsets which eliminates
run time search.
- Template patterns are separated from templates into a pattern pool.
Effective searching techniques (Finite Automaton for example) could be applied
to find the winning pattern.
- Current XSLT implementations perform number of run-time checks which
can be done by compiler at compile-time but this is possible only in a case
of low-level target machine like XSLTVM.
Once compiled XSLT 'programs' like Java classes can be run many times
on different XSLTVM. Also, XSLT 'programs' become implementation language
independent. In other words, there should be no difference between code generated
from Java XSLT compiler and code from C/C++ XSLT compiler, they both should
be able to run on the same XSLTVM.
Future work
Several of the current XSLT processors implement variants of quasi high-level
instruction set in a form of objects with action functions/methods. The goal
of a virtual machine is to identify a optimal low-level instruction set. As
part of future work, we would like to refine the current instruction set based
on performance optimizations. In addition, providing a standard binary file
format for the compiled XSLT stylesheet would facilitate re-use of the compiler
output with different implementations XSLTVMs.
We also plan to add support for XMLSchema datatypes and archetypes,
and provide extensibility mechanism in XSLT virtual machine to make callout
to other programming or scripting languages.
Bibliography