Using Parser-Generators to Convert Legacy Data Formats to XML
ABSTRACT
A parser-generator is a program which takes a formal description of a grammar (e.g. in BNF) and outputs source code for a parser which will recognise valid strings obeying that grammar and perform actions associated with grammar rules. We show how to use parser generators for transforming legacy data into XML in the context of real-life examples, somewhat simplified for readability. We also show how to improve the efficiency of this approach by manually coding the lexical-analyzer component of the system. Two open-source Java parser-generators are presented: ANTLR to represent the top-down LL(k) approach and CUP for bottom-up LALR.
Table of Contents
1. The Legacy Data Problem
As the world of EDI lurches and stumbles towards XML, we are increasingly faced with the transitional problem of feeding legacy data into XML-based applications. In the general case, we believe this has to be done manually, by writing a program that traverses the legacy data and sends out SAX events as it does so. However, the most common cases are simpler and can be automated or partially automated using the technology called parser generators. For example, text formats representing data tables lend themselves well to this approach. Our examples include the fixed-width format, the tab-delimited format, the Comma-Separated Values (CSV) format, and a custom format in which the values of some fields indicate the hierarchical structure of the data. To illustrate the first two formats, we look at a bookstore inventory, with four fields (Author,Title,ISBN,InStock). In CSV, it looks like this (recall that in this format columns are comma-delimited, columns which contain commas or quotes are quoted, and quotes within quoted columns are doubled):
Author,Title,ISBN,InStock "Kay, Michael",XSLT Programmer's Reference,1-861003-12-9,3 "Oppenheim, Joanne F.","""Not Now!"" Said the Cow",553346911,5
We want this to be transformed into something like:
<table> <row><d>Author</d><d>Title</d><d>ISBN</d><d>InStock</d></row> <row><d>Kay, Michael</d><d>XSLT Programmer's Reference</d>.... <row><d>Oppenheim, Joanne F.</d><d>"Not Now!" Said the Cow</d>.... ... </table>
For a hierarchical structure example, we look at a file representing a directory of scanner data. Scanner output is placed into folders, folders contain documents, and documents consist of pages. The sample below shows one folder of two documents, the first having three pages and the second having one page, as follows:
2/4/01 8:26:06 PM #c:\t8\0001.Tif PageCode0 490 2/4/01 8:26:08 PM #c:\t8\0002.Tif PageCode1 392 1 2/4/01 8:26:10 PM #c:\t8\0003.Tif PageCode2 2/4/01 8:26:14 PM #c:\t8\0004.Tif PageCode2 2/4/01 8:26:16 PM #c:\t8\0005.Tif PageCode2 2/4/01 8:26:20 PM #c:\t8\0006.Tif PageCode1 392 2 2/4/01 8:26:22 PM #c:\t8\0007.Tif PageCode2
In this sample, PageCode2 marks an actual page of data, PageCode1 marks the beginning of a document, and PageCode0 marks the beginning of a group of documents in a folder. There are extra integers (from bar codes) at the end of each PageCode0 or PageCode1 line. We want this to be transformed into something like:
<doc-group date="2/4/01 8:26:06 PM" path="c:\t8\0001.Tif" > <p>490</p> <doc date="2/4/01 8:26:08 PM" path="c:\t8\0002.Tif" > <p>392</p> <p>1</p> <page date="2/4/01 8:26:10 PM" path="c:\t8\0003.Tif"/> <page date="2/4/01 8:26:14 PM" path="c:\t8\0004.Tif"/> ...</doc> ...</doc-group>
2. Grammars, a brief review
If you are comfortable with DTDs, it will not take you long to become reasonably fluent with formal grammars because that's what DTDs are: formal grammars written in a special notation, with additional machinery for attributes and entities. Consider this DTD declaration:
<!ELEMENT elt (a,(b|c),e?,(f|g)+)*>
In a formal grammar notation, this would be written as follows (with minor variations in different systems):
elt: (a (b|c) e? (f|g)+)*
Formal grammars are used for describing programming languages, and they serve as the basis for writing compilers. The best-known example of a computer program that automatically generates a compiler from a grammar is Yet Another Compiler Compiler (YACC) that is built into most Unix systems. In this paper, we present two such programs: ANother Tool for Language Recognition (ANTLR), found at http://www.antlr.org/ and Constructor of Useful Parsers (CUP), found at http://www.cs.princeton.edu/~appel/modern/java/CUP/. We start with an ANTLR-style grammar for tab-delimited lines of data:
document : ( row (EOL)+ )* ; row : ( COLUMN (DELIM COLUMN)* ) ; COLUMN : ( ~( '\t' | '\r' | '\n' ) )+ ; EOL : ( '\n' | '\r' ) ; DELIM : '\t' ;
This says (starting from the bottom) that a DELIM is a tab character, whereas an EOL may be either a carriage return or a newline, and a COLUMN is a sequence of one or more characters which are neither tab, carriage return, nor newline. These are lexical-analyzer tokens, usually named in upper-case. We can think of the input stream of characters as being broken into non-overlapping tokens. A "row" contains one or more COLUMNs separated by DELIMiters; a "document" contains zero or more rows each ending with one or more EOLs.
This is a small example, but it happens to be one of our applications, and it includes almost all of the basic grammar constructs: concatenation, alternation, repetition zero or more times, repetition one or more times, and negation. It is a precise specification: notice that there is no such thing as an empty row here, because every row contains a column and every column contains at least one character. (Is this good? If not, we change the grammar.)
How do we transform this into XML? In the ANTLR system, we can do it this way:
class DelimDOMParser extends Parser; options { k = 1; buildAST=true; }
document : ( row (EOL)+ )* ;
row : ( COLUMN (DELIM COLUMN)* ) ;
class DelimDOMLexer extends Lexer;
options { k = 1; charVocabulary = '\1'..'\377'; }
COLUMN : ( ~( '\t' | '\r' | '\n' ) )+ ;
EOL : ( '\n' | '\r' ) ;
DELIM : '\t' ;
That's all. The "antlr.Tool" generates a Java (or C++) parser from this simple file, and we can compile that code in the ordinary way. That "buildAST=true;" means that the generated parser, which will have methods "document()" and "row()" to recognize the corresponding syntax rules, will build an "Abstract Syntax Tree" containing the parsed input, and the tree can be used to generate XML directly. After parsing (i.e., after calling on parser.document() ) we can say
antlr.BaseAST tree=(antlr.BaseAST) parser.getAST(); tree.xmlSerialize(new OutputStreamWriter(System.out));
The output is XML, containing all the data in the input, and we can pass it straight to an XSLT stylesheet to transform as desired; ANTLR itself also has some excellent tools to define AST templates whose XML output will be close to what we want.
Is our problem solved? Not quite. Legacy-format data typically involves many large files. When we do it this way, we are building trees as in-memory structures, spending much more time and space than we need to spend for the problem at hand; quite often spending more than we actually have available. This problem is exactly what forced the development of SAX parsing: the Abstract Syntax Tree option gives us DOM; we want SAX, and within the parser-generator approach, we get that by specifying actions as we match the input.
In the remaining sections, we develop three examples illustrating various combinations of possibilities. The first example shows another ANTLR application, rewritten in SAX style, with a more complex grammar describing the CSV format. The second example shows an ANTLR grammar for scanner data. In conclusion, we show a CUP grammar, illustrating bottom-up LALR approach on the same scanner data, with a manually optimized lexer.
3. The ANTLR LL(k) Approach
Here is a delimited-data grammar, modified for CSV input:
document : ( row (EOL)+ )* ;
row : ( col (DELIM col)* ) ; col : COLUMN ;
COLUMN : ( ~( ',' | '\r' | '\n' | '"' ) ( ~( ',' | '\r' | '\n' ) )* ) |
('"' (~'"')* '"')+ ; EOL : ( '\n' | '\r' ) ; DELIM : ',' ;
The new element here is that there are two kinds of COLUMN: quoted columns (begin with a quote, continue with non-quote, end with a quote) and non-quoted columns. In non-quoted columns, comma is not allowed because it is the delimiter.
In addition, we now annotate the grammar with programming constructs to make an ANTLR file out of it. The class declarations go in as before, extending Parser and Lexer, but we also give this DelimParser class a collection of methods, "startDoc()", "endDoc()", "startRow()", "endRow()" and "doColumn(String txt)". These are simply inserted into the grammar where we want them to be executed, and they pass on the same kind of SAX events that SAX parsers do.
Two issues arise in writing the code. One is that the parser generator is generating code which throws and handles TokenStreamException errors, so but the org.xml.sax classes throw SAXExceptions. We don't want to lose the exception information, so each little method translates its SAXExceptions into TokenStreamExceptions. The second issue is that "doColumn" takes a COLUMN token's text as its argument, in order to pass that on to the document handler, so we have to call ANTLR's method for passing token text into included code. Otherwise, the code is completely straightforward:
class DelimParser extends Parser;
options {k = 1;}
{ private org.xml.sax.DocumentHandler handler=null;
protected org.xml.sax.AttributeList
attrs=new org.xml.sax.helpers.AttributeListImpl();
public void setHandler(org.xml.sax.DocumentHandler h){handler=h;} protected void startDoc()
throws TokenStreamException{ try{handler.startDocument();
handler.startElement("doc",attrs);} catch(Exception ex)
{throw new TokenStreamException(ex.toString());} } protected void endDoc()
throws TokenStreamException{ try{handler.endElement("doc");
handler.endDocument();}
catch(Exception ex){throw new TokenStreamException(ex.toString());} }
protected void startRow()
throws TokenStreamException{ try{handler.startElement("row",attrs);}
catch(Exception ex){throw new TokenStreamException(ex.toString());} }
protected void endRow()
throws TokenStreamException{ try{handler.endElement("row");}
catch(Exception ex){throw new TokenStreamException(ex.toString());} }
protected void doColumn(String txt)
throws TokenStreamException { try{ handler.startElement("column",attrs);
char[]chars=txt.toCharArray(); int i,j,lim=chars.length;
if(chars[0]=='\"'){ // a quoted column;
skip every other "-mark. for(i=0,j=1;j<lim;i++,j++){ chars[i]=chars[j];
if(chars[j]=='\"')j++; } lim=i-1; } handler.characters(chars,0,lim);
handler.endElement("column"); }
catch(Exception ex){throw new TokenStreamException(ex.toString());} } }
document : {startDoc();} ( row (EOL)+ )* {endDoc();} ;
row : ({startRow();} col (DELIM col)* {endRow();} ) ;
col : c:COLUMN {doColumn(c.getText());} ; class DelimLexer extends Lexer;
options { k = 1; charVocabulary = '\1'..'\377'; }
COLUMN : ( ~( ',' | '\r' | '\n' | '"' ) ( ~( ',' | '\r' | '\n' ) )* ) | ('"' (~'"')* '"')+ ;
EOL : ( '\n' | '\r' ) ;
DELIM : ',' ;
The generated parser has methods document(), row(), and column(); the generated lexer has methods mCOLUMN(), mEOL(), and mDELIM(). The code follows the grammar quite closely, though tediously; for example the quoted-column rule is
if ((LA(1)=='"')) { { int _cnt20=0; _loop20: do { if ((LA(1)=='"'))
{ match('"'); { _loop19: do { if ((_tokenSet_2.member(LA(1)))) { matchNot('"'); }
else { break _loop19; } } while (true); } match('"'); }
else { if ( _cnt20>=1 ) { break _loop20; }
else {throw new NoViableAltForCharException((char)LA(1),
getFilename(), getLine());} } _cnt20++; } while (true); } }
It takes a little practice to read this code fluently but it may be worth the effort, because the generated code can be used as a starting point for an improved lexical analyzer. Hand-crafting a lexical analyzer is fairly common practice because it can yield major improvements in performance. (Lexer speed is the main performance issue of a generated parser, for obvious reasons: the lexer sees characters one at a time, whereas the parser sees aggregated tokens.) We show a combination of a generated parser and manually-produced lexer in our last example, which also shows a different kind of grammar and parser generator. First we do a more complext example of an ANTLR grammar for scanner input data. The last example will process the same data.
4. Scanner input, LL(10)
Here is a grammar for the scanner input. As previously discussed, it puts lines of data into groups using "page codes".
document : ( docgroup ) * ;
docgroup : date WS time WS zone WS LB path WS "PageCode0" (WS num)* EOL ( doc )+ ;
doc : date WS time WS zone WS LB path WS "PageCode1" (WS num)* EOL ( page )+ ;
page : date WS time WS zone WS LB path WS "PageCode2" ( WS )? EOL ;
date : TEXT ;
time : TEXT ;
zone : TEXT ;
path : TEXT ;
num : TEXT ;
TEXT : ( ~(' ' | '\t' | '\r' | '\n' | '#' ) )+ ;
EOL! : ( '\n' | '\r' )+ ;
WS! : (' ' | '\t' )+ ; LB! : '#' ;
In order for the parser to know whether a given line is to match docgroup, doc, or page, it has to look at the page code, which is the tenth token in from the beginning of the line. That's a lookahead of 10, or LL(10). For presentation purposes, we slightly simplify the rest of the grammar: the lexical analyzer is just being asked to split its input into (1) end-of-line, (2) non-eol white-space, (3) the special marker "#" and (4) everything else, which is TEXT. The explicit strings "PageCode0", "PageCode1", and "PageCode2" are also reported by the lexical analyzer as TEXT; it is the parser's job to compare these strings against the actual text of the token.
Adding the Java code associated with grammar rules is not particularly difficult. We show the code below, with grammar rules simplified slightly.
class ScanParser extends Parser;
options { k = 10; }{ private org.xml.sax.DocumentHandler handler=null;
protected org.xml.sax.helpers.AttributeListImpl
attrs=new org.xml.sax.helpers.AttributeListImpl();
public void setHandler(org.xml.sax.DocumentHandler h){handler=h;}
protected void startDocument()
throws TokenStreamException{ try{handler.startDocument();
handler.startElement("top",attrs);}
catch(Exception ex){throw new TokenStreamException(ex.toString());} }
protected void endDocument()
throws TokenStreamException{ try{handler.endElement("top");
handler.endDocument();}
catch(Exception ex){throw new TokenStreamException(ex.toString());} }
protected void startElt(String name,String d,String t,String z,String p)
throws TokenStreamException{ try{ attrs.clear();
attrs.addAttribute("date","CDATA",d+" "+t+" "+z);
attrs.addAttribute("path","CDATA",p);
handler.startElement(name,attrs);}
catch(Exception ex){throw new TokenStreamException(ex.toString());} }
protected void endElt(String name)
throws TokenStreamException{ try{handler.endElement(name);}
catch(Exception ex){throw new TokenStreamException(ex.toString());} }
protected void doNum(String num)
throws TokenStreamException{ try{ attrs.clear();
handler.startElement("p",attrs);
handler.characters(num.toCharArray(),0,num.length());
handler.endElement("p");}
catch(Exception ex){throw new TokenStreamException(ex.toString());} } }
document : {startDocument();}( docgroup ) * {endDocument();} ;
docgroup : d:TEXT WS t:TEXT WS z:TEXT WS LB p:TEXT WS "PageCode0"
{startElt("docgroup",d.getText(),t.getText(),z.getText(),p.getText());}
(WS n:TEXT {doNum(n.getText());})* EOL ( doc )+ {endElt("docgroup");} ;
doc : d:TEXT WS t:TEXT WS z:TEXT WS LB p:TEXT WS "PageCode1"
{startElt("doc",d.getText(),t.getText(),z.getText(),p.getText());}
(WS n:TEXT {doNum(n.getText());})* EOL ( page )+ {endElt("doc");} ;
page : d:TEXT WS t:TEXT WS z:TEXT WS LB p:TEXT WS "PageCode2"
{startElt("page",d.getText(),t.getText(),z.getText(),p.getText());endElt("page");}
( WS )? EOL ; class ScanLexer extends Lexer;
options { // divide charsets into end-of-lines, inline white-space, LB='#',
and TEXT k = 1; charVocabulary = '\1'..'\377'; }
TEXT : ( ~(' ' | '\t' | '\r' | '\n' | '#' ) )+ ;
EOL! : ( '\n' | '\r' )+ ;
WS! : (' ' | '\t' )+ ; LB! : '#' ;
Given our sample data, this generates:
<?xml version='1.0' ?> <top><docgroup date='2/4/01 8:26:06 PM' path='c:\t8\0001.Tif'><p>490</p> <doc date='2/4/01 8:26:08 PM' path='c:\t8\0002.Tif'><p>392</p> <p>1</p> <page date='2/4/01 8:26:10 PM' path='c:\t8\0003.Tif'></page> <page date='2/4/01 8:26:14 PM' path='c:\t8\0004.Tif'></page> <page date='2/4/01 8:26:16 PM' path='c:\t8\0005.Tif'></page> </doc> <doc date='2/4/01 8:26:20 PM' path='c:\t8\0006.Tif'><p>392</p> <p>2</p> <page date='2/4/01 8:26:22 PM' path='c:\t8\0007.Tif'></page> </doc> </docgroup> </top>
This is exactly what we want. We now proceed to a different kind of grammar and a manually coded lexer.
5. CUP parsing and manual lexer generation
The CUP parser project is a bottom-up parser in the tradition of Unix's YACC and GNU's Bison. Coming from an old and rich tradition, it can and does assume that a lot of information about its general framework is available from other sources. This may be one reason that its documentation and example set are far less than ANTLR's.
CUP grammars are written in a different style, with no regular expression constructs. Here is a CUP grammar for the scanner data.
document ::= docgrouplist ; docgrouplist ::= docgrouplist docgroup | ; docgroup ::= DOCGROUPLINE doclist; doclist ::= doclist doc | ; doc ::= DOCLINE pagelist ; pagelist ::= pagelist page | ; page ::= PAGELINE ;
The main change from the previous section's solution is that the lexer is now being asked to do a lot more; we're asking it to present each of the three kinds of lines as a different token. So we write a lexer manually: it takes each line as a character array and notes where each part begins and ends. The lexer can send these parts to a DocumentHandler (or SAX2 ContentHandler) without ever forming them into String objects, using the SAX characters() method, which takes a character array. This minimizes memory use; done carefully, such a solution may never call on garbage collection at all.
package java_cup.scanparser;
import java_cup.runtime.*;
import org.xml.sax.*;
parser code {: public scanner lex=null;
public void setScanner(scanner s){lex=s;} :}; /*
Terminals (tokens returned by the scanner). */
terminal DOCGROUPLINE,DOCLINE,PAGELINE; /*
Non terminals */ non terminal document,docgrouplist,docgroup,doclist,doc,pagelist,page; /*
The grammar */ document ::= {: parser.lex.startElt("document"); :}
docgrouplist {: parser.lex.endElt("document"); :};
docgrouplist ::= docgrouplist {: parser.lex.startLine("docgroup"); :}
docgroup {: parser.lex.endElt("docgroup"); :} | ;
docgroup ::= DOCGROUPLINE doclist;
doclist ::= doclist {: parser.lex.startLine("doc"); :}
doc {: parser.lex.endElt("doc"); :} | ;
doc ::= DOCLINE pagelist ;
pagelist ::= pagelist {: parser.lex.startLine("page"); :}
page {: parser.lex.endElt("page"); :} | ; page ::= PAGELINE ;
The resulting lexer has a fairly complex state: there is a character buffer, individual integers for the start and length of the date and path within that buffer, and arrays of integers to keep track of the extra end-of-line parameters. It works well, in time/space terms, but it is far from straightforward. One subtle error may result from the way parsers and lexers talk to each other: it is easy to forget the the lexer is always ahead by one or more tokens. Off-by-one-token errors are therefore easy to create.
We suggest that, any time you are using a parser generator, you should first write and test a "recognition grammar" with no code at all, just as we have been doing; then develop a solution which has the lexer return string-tokens (or other pure value tokens) to the parser, as we did in our ANTLR examples. If performance is an issue, you may then write a lexer manually, and start worrying about minimizing object-creation. A systematic approach to this minimization would take more space than we have here, but the samples we have given provide some clues. We intend to give this subject a fuller treatment within the next few months.
6. Conclusions and Comparisons
It is always possible to write code that will generate SAX callbacks from a legacy data format. Using parser-generators offers two advantages. First, they automate some of the drudge work. Second, a grammar for a legacy format can be part of specifications the way program code cannot be. (The code is "procedural", the grammar is "declarative".) In our own experience of developing software for clients who are doing data processing for their own clients, it has often been impossible to get a full specification in advance of the data; use of a grammar makes it possible to identify and document requirements creep. Maintenance is also easier: dealing with small changes in the grammar is much less painful than checking through manual input routines.
As noted, the main performance issues have to do with the lexer, and it is common to replace generated lexers with handwritten ones. If you want to follow this route AND use a generated lexer as a starting point, an LL tool like ANTLR is preferable to a table-driven LALR tool like CUP, whose generated code is pretty much unreadable. On the other hand, there is a much greater literature of parsers like CUP (mostly written in C, and mostly generating code in C) which you might want some of your work to tie into. Either way, a parser-generator is a valuable tool in system integration.


