XML Europe 2001, 21-25 May 2001
Internationales Congress Centrum (ICC)
Berlin, Germany

Word And XML: Making The 'Twain Meet

Helen Watchorn
Paul Daly

ABSTRACT

MS-Word is the most widely used format for common business documents today, and XML is the most suitable information storage format for the future. How can the two be reconciled? Many solutions, both commercial and free, already address the problem of converting information in MS-Word format into XML. This paper describes the major strategies taken by existing solutions and presents YAWC (Yet Another Word Converter), our implementation of a solution to this problem.

Two major strategies are adopted for most Word to XML conversion applications. The simpler strategy is to treat the conversion process separately from the authoring process, and the more sophisticated strategy is to treat the authoring process as part of the conversion process.


1. Standalone Word To XML Converters

Tools that take the simpler strategy provide a standalone conversion application: one or more Word documents are fed in, and one or more XML documents emerge. Very often these tools actually convert Rich Text Format (RTF) versions of a document rather than the native MS-Word '.doc' format. This is useful if documents have been created by a number of different word-processing tools, but it is an irritating extra conversion step if not.

Many of the standalone converters define a fixed XML Document Type Definition (DTD) into which all input documents are converted. This usually means that a further processing step must be carried out to make the raw XML output comply with the DTD that a particular organisation requires. However, because the input for this step is well-formed XML, eXtensible Stylesheet Language Transformation (XSLT) can be used as a very efficient method for carrying out this processing.

Standalone converters are useful in a number of situations. If there is a large repository of legacy information to be converted, a batch-oriented approach is efficient. If you wish to protect authors from the details of XML, they can send their documents to a central administrator who can take responsibility for the conversion stage. Such converters are also usually either inexpensive or free.

Examples of standalone converters are Majix, from Tetrasix (http://www.tetrasix.com) and UpCast, from Infinity Loop (http://www.infinity-loop.de). The original of the species was the Rainbow converter for RTF to SGML conversion.

Two of the major drawbacks of standalone converters are that the conversion process is not integrated with the authoring environment, and that it can take a number of manual steps (conversion from MS-Word to RTF, RTF to raw XML, raw XML to desired XML). The conversion process is also one-way only, so the information is maintained in Word format only. If changes are made to the XML version, these must be re-keyed in the Word version if the document is being maintained over a long period. Finally, considerable programming skills are often required to configure an XSLT script to convert the raw XML from the converter into the final desired result.

2. Integrated Word To XML Editing And Conversion

The second major strategy for turning MS-Word into XML is to integrate the conversion process into the Word environment. With this approach, MS-Word is converted into a structured, syntax-directed authoring environment, similar to a structured XML editor like SoftQuad XMetaL. Usually this is achieved by creating a mapping between named styles in Word and elements in the XML target DTD. Authors are constrained in two ways. Firstly, they can use only the styles named in the document, and may not add their own new styles. Secondly, they may only be able to use particular styles in particular locations in the document; for example, a 'Country' style is allowed only after a 'City' style in an address. Usually, tools taking this approach not only convert MS-Word into XML, but also read XML files into MS-Word for editing.

This approach is useful when full two-way conversion between MS-Word and XML is required. Edits can be made either in an XML editor or in Word, so if the information is subject to ongoing change, there is no requirement to maintain separate Word and XML versions. When a DTD is very complex, the ability to provide direction to authors as to what mark-up is allowed at a particular point is also useful.

Examples of integrated Word XML editing environments are S4/Text from i4i (http://www.i4i.com) and Worx SE from HyperVision (http://www.hvltd.com). Microsoft released a product called SGML Author for Word, which also used this approach, but this is no longer supported.

The major drawbacks of integrated Word XML editors are that the software is expensive to buy and complex to configure for particular DTDs. This type of tool also makes major modifications to the default Word authoring environment, which can be disorienting for authors and may have a significant impact on performance. Usually the underlying rationale behind using MS-Word as the XML editing tool is to offer authors a familiar environment. Changing that environment to a significant degree raises the question of why one should bother with MS-Word at all. Why not simply use a full What You See Is What You Get (WYSIWYG) XML editor such as XMetaL?

3. Other Word To XML Converters

Another category of converter worth mentioning is the plethora of custom scripts written in WordBasic, Visual Basic for Applications (VBA), Omnimark or Perl, which have been developed over the years by many organizations to convert their own particular documents into XML, HTML or SGML. By and large, these converters never see the light of day, as they are designed for specific niche needs and are not generalisable.

3.1. Word 2000 As An XML Converter

Finally, Word 2000 offers a number of features that go part of the way towards offering the ability to convert Word documents into XML. Documents can be saved as HTML, and nearly all the information encoded in the document is saved in 'islands' of XML inside the HTML output. Unlike Word 97, Word 2000 also converts heading styles into HTML headings (H1, H2, etc.).

Potentially, an XSLT stylesheet could be used to convert the HTML into XML. However, in practice, the HTML is not well-formed, and neither are the XML islands, so it would be quite difficult to do so.

The next version of MS-Office, due in late 2001, will support input and output of well-formed XML for both Excel and Access, but no mention is made of Word, so perhaps this issue is not going to be addressed in MS-Word 10 either.

4. YAWC: Yet Another Word Converter

Both the approaches described above have their advantages and drawbacks. In 2000, XML Workshop decided to develop their own contribution to the art of Word to XML conversion, partly because of increasing customer demand for solutions to this general problem, partly through dissatisfaction with the available solutions, and partly because we thought we could do better ourselves.

The basic assumption underlying Yet Another Word Converter (YAWC) is that authors can be easily trained (and trusted) to use MS-Word named styles to mark up their content in such a way that it is easy to convert to XML. Our experience is that, given a small amount of training, a reasonable amount of built-in assistance, and a clear understanding of its value and importance, authors are quite willing to expend the extra effort in applying markup to MS-Word documents.

4.1. Design Goals

The design goals we set for the project included the best features of the various approaches we perceived in existing solutions, and some extra goals based on our own preferences and experiences.

In addition to the design goals listed above, a number of features were explicitly excluded from our requirements.

4.2. Architecture

YAWC is written as a VBA macro. The macro reads in an initialization file that contains a mapping between named styles defined in a Word template and XML elements. With this mapping, the macro then goes through the Word document, wrapping each named style with the required start and end tags of the element. Characters are converted into American Standard Code for Information Interchange (ASCII) equivalents, using numeric character entities where necessary. When all conversions are complete, the macro saves the resulting text as a well-formed XML document, replacing the '.doc' extension with '.xml'. Because the YAWC macro generates well-formed XML, no pre-defined XML DTD is required.

After the VBA macro has completed, an optional second stage of processing is carried out, using an XSLT stylesheet. The XSLT transformation may be very simple, such as an identity transformation, which makes an exact copy, or quite complicated, such as an up-translation from the relatively simple DTD generated by the initial conversion to a more structured DTD.
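The wrapping step described above can be illustrated with a minimal Python sketch. It is a stand-in for illustration only, not YAWC's VBA code: the function name `wrap_paragraphs`, the `doc` root element and the in-memory list of (style, text) pairs are all assumptions.

```python
from xml.sax.saxutils import escape

# Hypothetical style-to-element mapping, as it might be read from Yawc.ini.
STYLE_MAP = {
    "wdStyleHeading1": "title",
    "wdStyleNormal": "para",
}

def wrap_paragraphs(paragraphs, style_map, root="doc"):
    """Wrap each (style, text) pair in the start and end tags of its mapped element."""
    lines = [f"<{root}>"]
    for style, text in paragraphs:
        element = style_map[style]
        # Escape &, < and > so that the output stays well-formed.
        lines.append(f"<{element}>{escape(text)}</{element}>")
    lines.append(f"</{root}>")
    return "\n".join(lines)

sample = [
    ("wdStyleHeading1", "Fish & Chips"),
    ("wdStyleNormal", "Opening paragraph"),
]
print(wrap_paragraphs(sample, STYLE_MAP))
```

The `root` wrapper is only there so that the sketch yields a single document element; in YAWC the surrounding markup comes from the initialization file.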

4.3. Yawc.ini Initialisation File

The initialization (or mapping) file is plain text, and can be edited in Notepad. The format to specify mappings is easy, consisting simply of one style name and one XML element name per line, separated by a tab.

[ParagraphStyles]
# Style name <tab> element name
# User-defined styles
Address	address
Person	person
# Pre-defined styles
wdStyleHeading1	title
wdStyleHeading2	heading
wdStyleNormal	para
[/ParagraphStyles]
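As an illustration of how little is needed to read this format, the following Python sketch parses one section of such a file into a dictionary. It assumes that lines beginning with '#' are comments and that a section runs from `[Name]` to `[/Name]`, as the listing suggests; the function name `parse_section` is hypothetical and the real macro is written in VBA.

```python
def parse_section(text, section):
    """Return {style name: element spec} for one [Section] ... [/Section] block."""
    mapping = {}
    inside = False
    for raw in text.splitlines():
        line = raw.strip()
        if line == f"[{section}]":
            inside = True
        elif line == f"[/{section}]":
            inside = False
        elif inside and line and not line.startswith("#"):
            # Style name and element name are separated by a single tab.
            style, _, element = line.partition("\t")
            mapping[style.strip()] = element.strip()
    return mapping

INI_TEXT = (
    "[ParagraphStyles]\n"
    "# User-defined styles\n"
    "Address\taddress\n"
    "Person\tperson\n"
    "# Pre-defined styles\n"
    "wdStyleHeading1\ttitle\n"
    "wdStyleNormal\tpara\n"
    "[/ParagraphStyles]\n"
)

print(parse_section(INI_TEXT, "ParagraphStyles"))
```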

MS-Word has a large selection of pre-defined styles defined, for headings, lists, footnotes, hyperlinks, etc. Rather than use the names of pre-defined styles as they appear in the Word style menu, we chose to use the internal enumerated names for each style instead, e.g. wdStyleHeading1. This was done with the aim of making both the macro and the initialization file work better with many different language editions of MS-Word.

One of the main design goals of YAWC was simplicity. The Yawc.ini file went through many iterations to ensure that this criterion was met. In doing so, initial design specifications were changed to keep the initialization file simple. Two examples highlight this point: transforming tables and character encodings. The reasoning behind these changes is outlined in more detail later in this paper.

The initialization file is the primary means of mapping Word styles to XML elements. In addition, it defines the hierarchical structure of the XML file based on Word styles. It does so through pseudo-styles required by YAWC, which are easily recognized by the prefix 'yawc', e.g. yawcLevel1ElementName. These pseudo-styles are the means by which YAWC can put Word styles into a hierarchical XML structure.

During the YAWC development cycle, extensibility was added to YAWC's features. The initialization file can also call an XSLT stylesheet that converts YAWC's temporary XML file into a more customized XML file. Also, the end-user is not restricted to outputting '.xml' files but can also, for instance, output '.html' files.

4.4. Block Level Headings And Lists

Word is a linear, or sequential, document format. HTML is also a linear document format. Therefore in Word there is no concept of a 'section' that would typically contain a heading, some paragraphs, and some 'sub-sections'. XML documents are invariably hierarchical in nature, so we considered it important that the VBA macro support the addition of hierarchical elements based on the heading levels used in the Word document. Given a document containing the following markup:

YAWC can generate the following hierarchical structure:

 
<chapter>
 <title>Chapter Title</title>
 <para>Opening paragraph</para>
 <section>
  <heading>Section Title</heading> ...
 </section> 
</chapter>

The additional hierarchical elements added by the VBA macro are specified using the mappings shown below. All pseudo-styles required by YAWC are prefixed with the letters 'yawc'. Several section levels can also map to the same element name with different attribute values, e.g. div class='section1'.

 
[Sections]
yawcLevel1ElementName	chapter
yawcLevel2ElementName	section
[/Sections]

In principle this could be done by a post-conversion XSLT transformation, but we felt that this is such a basic XML requirement that the function should be implemented in the VBA macro. Achieving the same result using XSLT would require users to have reasonably good programming skills.
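One way to picture the nesting step is as a stack of open heading levels: each heading closes any wrapper at the same or a deeper level and then opens its own. The Python sketch below illustrates that idea under assumed data structures (a flat list of (style, text) pairs and three hypothetical lookup tables); it is not YAWC's VBA implementation.

```python
from xml.sax.saxutils import escape

# Hypothetical configuration mirroring the [Sections] and [ParagraphStyles] mappings.
LEVEL_ELEMENTS = {1: "chapter", 2: "section"}
HEADING_LEVELS = {"wdStyleHeading1": 1, "wdStyleHeading2": 2}
STYLE_MAP = {
    "wdStyleHeading1": "title",
    "wdStyleHeading2": "heading",
    "wdStyleNormal": "para",
}

def nest_sections(paragraphs):
    """Turn a flat list of (style, text) pairs into nested section markup."""
    out = []
    open_levels = []  # stack of heading levels whose wrapper is still open
    for style, text in paragraphs:
        level = HEADING_LEVELS.get(style)
        if level is not None:
            # Close every open wrapper at the same or a deeper level.
            while open_levels and open_levels[-1] >= level:
                out.append(f"</{LEVEL_ELEMENTS[open_levels.pop()]}>")
            out.append(f"<{LEVEL_ELEMENTS[level]}>")
            open_levels.append(level)
        element = STYLE_MAP[style]
        out.append(f"<{element}>{escape(text)}</{element}>")
    # Close whatever is still open at the end of the document.
    while open_levels:
        out.append(f"</{LEVEL_ELEMENTS[open_levels.pop()]}>")
    return "".join(out)

flat = [
    ("wdStyleHeading1", "Chapter Title"),
    ("wdStyleNormal", "Opening paragraph"),
    ("wdStyleHeading2", "Section Title"),
    ("wdStyleNormal", "Section text"),
]
print(nest_sections(flat))
```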

Lists are treated in the same way, and list items are automatically wrapped in a list element by the VBA macro.
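The list wrapping can be pictured as grouping runs of consecutive list items. The following Python sketch shows the idea; the element names `item` and `list` and the function `wrap_lists` are assumptions chosen for illustration, since the actual names come from the style mapping.

```python
from itertools import groupby

def wrap_lists(elements, item="item", wrapper="list"):
    """Wrap each run of consecutive list items in a single wrapper element."""
    out = []
    for is_item, run in groupby(elements, key=lambda e: e[0] == item):
        tagged = [f"<{name}>{text}</{name}>" for name, text in run]
        if is_item:
            out.append(f"<{wrapper}>" + "".join(tagged) + f"</{wrapper}>")
        else:
            out.extend(tagged)
    return "".join(out)

content = [("para", "Before"), ("item", "One"), ("item", "Two"), ("para", "After")]
print(wrap_lists(content))
```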

In the course of YAWC's development it became clear to us that the XML that YAWC generates might not contain all the structure required by a DTD; for instance, list items could be required to contain a 'para' element. This made us realize that a minimal level of XSLT transformation would be required in a lot of cases. This changed the focus of our design somewhat, because XSLT, a powerful programming language in its own right, immediately implies that end-users require some programming skills. It also meant that we still had to design YAWC so that the initialization file and the macro itself took the brunt of the XML conversion work, rather than leaving the more difficult conversions to XSLT. Maintaining this balance was, in our minds, critical to ensuring that YAWC remains simple to use and configure.

4.5. Character Level Markup

Character level styles are treated in a similar way to block level styles. YAWC simply wraps any text formatted using a particular style with the start and end tags defined in the initialization file.

 
[CharacterStyles]
wdStyleEmphasis	emph class="italic"
wdStyleStrong	emph class="bold"
[/CharacterStyles]

It is possible to include attributes in the mapping if two styles - whether character or paragraph - map to the same element name with different attribute values. The macro treats the first sequence of non-space characters after the tab as the element name, and copies the remaining text on the line into the start tag only.
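The rule for splitting the mapped value into element name and attributes is simple enough to show in a few lines. This Python sketch is an illustration of the rule as stated, with a hypothetical function name `tags_for`; it is not taken from the macro.

```python
def tags_for(spec):
    """Split a mapped value such as 'emph class="italic"' into start and end tags."""
    parts = spec.split(None, 1)  # first run of non-space characters is the name
    name = parts[0]
    attributes = parts[1].strip() if len(parts) > 1 else ""
    # The remaining text is copied into the start tag only.
    start = f"<{name} {attributes}>" if attributes else f"<{name}>"
    return start, f"</{name}>"

start, end = tags_for('emph class="italic"')
print(start + "important" + end)
```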

In some cases, for example hyperlinks and footnotes, the macro generates and inserts an extra attribute into the start-tag, in order to include a cross-reference id or URL. Because YAWC uses Word's in-built hyperlink and footnote styles, there is no need to create user-defined styles to handle these constructs.

For hyperlink styles, YAWC automatically creates an 'href' attribute with the value of the URL. This attribute is added to the mapped hyperlink XML element. In addition, the user can define additional attributes for this XML element in the initialization file, e.g. xref type='external', where href='URL' is automatically inserted by YAWC. The one thing to note is that YAWC inserts a lowercase 'href' attribute name. The DTD in question may require it to be uppercase, or indeed to have a different name, in which case XSLT is required to transform it.

Similarly, YAWC automatically creates an 'href' attribute for Word's footnote reference style and an 'id' attribute for the footnote text style. YAWC assigns a value for each of these attributes based on what Word generates, e.g. href='fn1' and id='s28-1fn1' respectively.
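A short Python sketch of the hyperlink case follows. The function name `hyperlink_start_tag` is hypothetical, and placing the generated 'href' after the user-defined attributes is an assumption, since the order YAWC uses is not stated here.

```python
from xml.sax.saxutils import quoteattr

def hyperlink_start_tag(spec, url):
    """Build the start tag for a hyperlink, adding a generated lowercase href."""
    parts = spec.split(None, 1)
    name = parts[0]
    # User-defined attributes from the initialization file, if any.
    extra = f" {parts[1].strip()}" if len(parts) > 1 else ""
    # quoteattr escapes the URL and wraps it in quotes for use as an attribute value.
    return f"<{name}{extra} href={quoteattr(url)}>"

print(hyperlink_start_tag("xref type='external'", "http://www.xmlw.ie/yawc/"))
```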

4.6. Images

Images in Word documents are converted into a web-compatible format such as Joint Photographic Experts Group (JPEG), Portable Network Graphics (PNG) or Graphics Interchange Format (GIF). This seems to be decided by Word itself when using the 'Save as HTML ...' option. Whatever format is saved, a link is created from the XML document to the image file name. The structure of the link element is configurable, so XLink notation can be used if required.

Word defines file names for embedded images, using the format imageNNN.FFF, where image is constant, and NNN is a sequential number starting with 001. If Word 97 is being used, there is a danger that a number of conversions of files in the same directory will cause images to be overwritten. In Word 2000, a special subdirectory is created for images, and this is not an issue. However, each image is saved in a separate subdirectory, which can lead to many directories being created.

YAWC automatically maps an image that is linked or embedded to an 'img' element name with a 'src' attribute that contains the absolute path of the imported image. Images can be quite complex in XML documents in the sense that they can be part of a group of elements such as a title, image and caption. YAWC doesn't attempt to handle this complexity and therefore wrapping images into a group of elements is done in XSLT. It also means that the 'img' element and additional attributes can be changed or added according to the DTD.

4.7. XSLT Support

At the outset of YAWC's development, one of the design goals was to output well-formed, hierarchical XML. As stated earlier in this paper, as developers we realized that YAWC's hierarchical structure might not be enough to generate XML files that conform to a particular XML vocabulary. We then decided to add XSLT support, whereby YAWC uses the Microsoft XML parser (MSXML) 3.0 to automatically transform the XML it generates into an XML file that conforms to the DTD in question.

However, we have found some MSXML parser issues, in that the only way to create a valid XML file is to include the absolute path name to the DTD. This means that the portability of these XML files is now an issue.

4.8. Character Encoding

All documents are converted into ASCII text in XML. Accented and special characters are converted into the appropriate numeric character entity, e.g. ÿ becomes &#255;. Originally, a number of alternative conversion options were available: Unicode, named character entities and numeric entities. However, with the inclusion of a post-conversion XSLT transformation step, it seemed better to keep the conversion as simple as possible, since Unicode, UCS Transformation Format (UTF)-8 and other encodings supported by MSXML can be generated by a simple XSLT script which sets the character encoding using the xsl:output instruction.
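The effect of this conversion can be reproduced in Python, whose standard 'xmlcharrefreplace' error handler replaces every character outside the target encoding with a decimal numeric reference. This is an illustration of the output format only; YAWC itself uses a character table inside the VBA macro, and the function name `to_ascii_xml` is hypothetical.

```python
def to_ascii_xml(text):
    """Replace every non-ASCII character with a decimal numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(to_ascii_xml("na\u00efve caf\u00e9"))
```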

We therefore moved the character mapping from the initialization file into the conversion macro itself. There does not seem to be any easy way to generate named character entities using XSLT, but if they are really essential, it is possible to modify the VBA macro to generate them instead, as the character mapping is simply a table inside the macro which can be easily replaced.

4.9. Tables

Tables in Word are converted into the HTML table model, using Word's own support for conversion into HTML. Originally, we couldn't decide on the best approach, so we allowed the specification of a required table model in the initialization file, as well as element names for rows and cells. This provided support for the Computer-Aided Acquisition and Logistics Support (CALS), HTML or a custom table model. This seemed like a good idea, but made the initialization file more complex, and more difficult to document. We then discovered that André Blavier (http://perso.wanadoo.fr/ablavier/TidyCOM/) had packaged Dave Raggett's Tidy program (http://www.w3.org/People/Raggett/tidy) as a Component Object Model (COM) component.

Building the TidyCOM component into the macro allowed us to clean up HTML tables created using the MS-Word 'Save as HTML...' option. Each table in a document is copied into its own temporary file, saved as HTML using Word, cleaned up using TidyCOM, and finally copied back into the output XML document. The major benefit of this approach is that spanned cells (either over rows or columns) in a table are properly converted into HTML. This is impossible to program directly in VBA.

If a different table model than HTML is required, then an XSLT transformation can be written to carry out this step after the conversion. One common transformation, required in a number of cases, is to convert the HTML table element names from lowercase to uppercase.
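In YAWC this renaming is done with an XSLT stylesheet. Purely to show how small the transformation is, here is an equivalent sketch in Python using the standard ElementTree module; the function name `uppercase_element_names` is hypothetical, and attribute names are deliberately left unchanged.

```python
import xml.etree.ElementTree as ET

def uppercase_element_names(xml_text):
    """Return the same markup with every element name converted to uppercase."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = element.tag.upper()
    return ET.tostring(root, encoding="unicode")

table = '<table><tr><td colspan="2">Spanned cell</td></tr></table>'
print(uppercase_element_names(table))
```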

4.10. Meta-Information And Form Fields

MS-Word allows information about a document to be stored in two ways: either as custom properties, accessible to the user from the File->Properties menu item, or as internal variables, accessible only from VBA macros. These are often used to store information recorded by the author through custom dialog boxes. They can optionally be extracted by the VBA macro and saved as well-formed XML, through the use of a processing instruction in the prolog section of the XML output. It is worth noting that custom properties can be sub-divided into in-built custom properties and user-defined custom properties. YAWC supports user-defined custom properties as well as internal document variables.

The initialization file contains prolog and epilog sections, which contain fixed XML markup for the beginning and end of a document. Since the start of a document is often where the most complex markup occurs, it seems like a good idea to avoid this area entirely, and simply force the user to provide the required markup directly in XML in the prolog section of the initialization file. It would be impossible to cope with all the possible permutations of how meta-information might be structured at the start of the document, so we do not attempt to do so. Instead we simply provide the meta-information using the HTML 'meta' tag conventions, and allow users to transform this into the structure they require using XSLT.

 
[Prolog] 
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE gcapaper PUBLIC "-//GCA//DTD GCAPAP-X DTD 20010211 Vers 5.0 CD//EN" 
   "gcapaper-cd.dtd" 
[]>
<gcapaper>
<?yawc custom-properties?>
[/Prolog]
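The substitution of the processing instruction can be sketched as a simple text replacement. In the Python sketch below, the exact shape of the generated markup (one empty `meta` element with `name` and `content` attributes per property) is an assumption based on the HTML 'meta' convention mentioned above, and the function name `expand_custom_properties` is hypothetical.

```python
from xml.sax.saxutils import quoteattr

PLACEHOLDER = "<?yawc custom-properties?>"

def expand_custom_properties(prolog, properties):
    """Replace the yawc processing instruction with one meta element per property."""
    metas = "\n".join(
        f"<meta name={quoteattr(name)} content={quoteattr(value)}/>"
        for name, value in properties.items()
    )
    return prolog.replace(PLACEHOLDER, metas)

prolog = "<gcapaper>\n<?yawc custom-properties?>"
print(expand_custom_properties(prolog, {"Editor": "H. Watchorn", "Status": "Draft"}))
```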

5. Conclusion And Future Developments

Overall, we believe that YAWC meets its design goals. It is integrated with Word, easy to configure, and does not require any programming knowledge to get started. It is also free.

There are a number of improvements that would help to make YAWC even better. A more structured user interface for defining the mapping between style names and elements would help reduce the time required to configure a new DTD, and make the solution more user-friendly, hopefully leading to more widespread adoption.

Simplifying the installation process, so that it can install YAWC for Word 97 and Word 2000 and also install the required third-party COM components, would enhance the usability of YAWC.

Extending YAWC to handle multiple Yawc.ini files would help organizations that have multiple Word templates to convert to XML.

A macro to do a conversion in reverse, i.e. XML into MS-Word, should be relatively straightforward to write, and would allow information to be maintained in either Word or XML format. There is no significant demand for this function as yet, but we expect that as more information is stored in XML format, this will become a desired feature.

For further information and updates about YAWC see our YAWC website, http://www.xmlw.ie/yawc/

Glossary

ASCII

American Standard Code for Information Interchange

CALS

Computer-Aided Acquisition and Logistics Support

COM

Component Object Model

DSSSL

Document Style Semantics and Specification Language

DTD

Document Type Definition

GIF

Graphics Interchange Format

HTML

HyperText Markup Language

JPEG

Joint Photographic Experts Group

MSXML

Microsoft XML parser

Perl

Practical Extraction and Report Language

PNG

Portable Network Graphics

RTF

Rich Text Format

SGML

Standard Generalized Markup Language

UTF

UCS Transformation Format

VBA

Visual Basic for Applications

WYSIWYG

What You See Is What You Get

XML

eXtensible Markup Language

XSLT

eXtensible Stylesheet Language Transformation

YAWC

Yet Another Word Converter

Biography

Helen Watchorn
XML Workshop Ltd.
Ireland

Helen Watchorn has primary and post-graduate qualifications in Communications and Information Technology, and has spent her career working in the documentation departments of companies in the IT industry, specialising in the production of localised manuals. Before joining XML Workshop Ltd. she was the documentation manager for Informix Software Ireland.

Paul Daly
XML Workshop Ltd.
Ireland

Paul Daly has 22 years of commercial experience in the printing and publishing industry. Since 1998 he has worked with Standard Generalized Markup Language (SGML), eXtensible Markup Language (XML) and HyperText Markup Language (HTML), designing, implementing and maintaining websites, and with Practical Extraction and Report Language (Perl), Document Style Semantics and Specification Language (DSSSL), Java, and Visual Basic, developing XML-based solutions for database-backed electronic publishing environments.