XMLTrans: a Java-based XML Transformation Language for 
Structured Data 
Derek Walker and Dominique Petitpierre and Susan Armstrong 
{Derek. Walker, Dominique. Pet it:pierre, Susan. Armsl;rong}@± ssco. unige, ch 
ISSCO, University of Genew 
40 blvd. du Pont d'Arve 
CH-1211 Genev~ 4 
Switzerland 
Abstract 
The recently completed MLIS DieoPro project 
addressed the need tbr a uniform, platform- 
independent interface for: accessing multiple dic- 
tionaries and other lexical resources via the In- 
ternet/intranets. Lexical data supplied by dic- 
tionary publishers for the project was in a vari- 
ety of SGML forn\]ats. In order to transforrn this 
data to a convenient standard format (IJTML), 
a high level transformation  was devel- 
oped. This  is simple to use, yet power- 
ful enough to perlbrm complex transformations 
not possible with similar transformation tools. 
XMLTrans provides rooted/recursive transduc- 
tions, simila.r to tr,~nsducers used for na.tura.l 
 translation. The tool is written in 
standard .lava and is available to the general 
public. 
l Introduction 
The MMS l)icoPro project 1, which ran from 
April 11998 to Sept 1999, addressed the need for 
a uniIbrm, plattbrm-indel)endent interface for 
accessing multiple dictionaries and other lexi- 
cal resources via the lnternet/intranets. One 
project deliverable was a client-server tool en- 
abling trm~slators and other  profes- 
sionals connected to an intranet to consult dic- 
tionaries and related lexica.1 data from multiple 
sources. 
Dictionary data was supplied by participat- 
ing dictionary publishers in a variety of propri- 
etary formats 2. One important DicoPro mod- 
ule wa.s a transformation  capable of 
1DicoPro was a project funded within the MullAlin- 
gum hfformation Society programme (MLIS), an EU ini- 
tiative launched by the European Commission's DG XIlI 
and the Swiss Federal OIrtce of Education and Science. 
2Project participants were: IlarperCollins, Hachette 
Livre, Oxford Unlversit~y Press. 
standardizing tile variety of lexical data. Tile 
 needed to be straightforward enough 
tbr ~ non-programnmr to master, yet powerful 
enough to perform all tile transfbrmations nec- 
essary to achieve tile desired output. The re- 
sult of our efforts, XMLTrans, takes as input 
a well-lbrmed XML file and a file containing a 
set of transformation rules and gives as output 
the.application of the rules to the input file. 
The transducer was designed tbr the processing 
of large XML files, keeping only the minimum 
necessary part of the document in memory at 
all times. This tool should be of use for: anyone 
wishing to tr~msform large amounts of (particu- 
larly lexical) data from one XML representation 
to another. 
At; the time XM1;l?rans was being developed 
(mid 11998), XML was only an emerging stan- 
dard. As a. consequence, we first looked to more 
esta.blished SGMI~ resources to find a. suitable 
trans\[brmation tool. Initial experimentation be- 
gan with I)SSSL (Binghaln, :1996) as a possible 
solution. Some time was invested in develop- 
ing a user-friendly "front-end" to the I)SSSL 
engine .jade developed by James Clark (Clark, 
1998). This turned out to be extremely cumber- 
some to implement, and was ~ba.ndoned. There 
were a number of commercial products such 
as Omnimark Light (Ominimark Corp; :1998), 
TXL (Legasys Corp; 1.998) and PatMI, (IBM 
Corp; 1998) which looked promising but could 
not be used since we wanted our transducer to 
be ill tile 1)ublic domain. 
We subsequently began to examine avail- 
able XML transduction resources. XSL (Clark, 
Deach, 11998) was still not mature enough to rely 
on as a core tbr tile . In addition, XSL 
dkl not (at the time) provide for rooted, recur- 
sive transductions needed to convert the com- 
plex data structures found in l)icoPro's lexica.1 
1136 
d a.ta. 
F, din1)llrgh's La.ngua.ge 'lhchnology Group 
ha,d l)roduced a. nun~l)er of usefi,1 SGM\]ffXMI, 
ma.nipulaCion tools (I;.I'G, 11999). Un\['ortunately 
none of these ma.tched our specific needs. \]~br 
instance, ~.qmltrans does not permit matching 
of com l)lex expressions invoh, ing elements, text, 
and a£tributes. A nether I/FG tool, ~.qu)g is more 
powerful, 1)ut its control files have (in our opin- 
ion) a. non-intuitive and COml)lex syntax 3. 
Since a, large number of standardized XML 
APIs had been developed tbr the Java. program- 
ruing  this appeared to be a. prondsing 
direction. Ill addition, Java's portal)fifty was a. 
strong dra.wing point. The API model which 
best suited our needs was 1;he "Document Oh: 
ject Model" (DOM) with an underlying "Simple 
A Pl for XMI2' (SA X) I>arser. 
The event-based SAX parser reads into lneln- 
ory only the elements in the input document 
releva.nt to the tra.nsfornl alien. In efti.'(;t, X MI,- 
Tra.ns is intended 1;o 1)recess lexicaJ entries 
which a.re indel)en(lent of ca.cA other and tha.t 
ha.ve a. few basic formats. Since only one entry 
is ever in memory at a.ny given point in time, 
extremely la.rge files can be I)rocessed wil;h low 
nmmory overhea.d. 
The \])OM AI)I is used in the tra.nsforma.tion 
l)rocess to access the the element which is cur- 
rently in menlory. The element is tra.nsformed 
a.ccording to rules sl)ecilied in a. rule tile. These 
rules a.re interpreted by XMl/l'rans as opera- 
lions to l>erfbrnl on the data through I;llo I)OM 
A.PI. 
We begin with a s\]ml)le examl>le to illus- 
tra.te the kinds of transformations l>erlbrmed by 
XMLTrans. Then we introduce the  
concepts a.nd structure of XMLTrans rules and 
rule files. A comparison of XMLT,:a.ns with 
XSLT will help situate our work with respecl; 
to the state-of-the-art in XML data processing. 
2 An example transformation 
A typical dictiona, ry entry might ha.ve a. surpris- 
ingly complex structure. The various compo- 
nents of the entry: headword, pa.rt-ofst)eech , 
pronunciation, definitions, translations, nla.y 
themselves contain complex substructures. For 
\])icoPro, these structures were interl)reted in of 
aThe UI'G have since developed another interesting 
t, ransformation tool called XMIA)erl. 
der 1;o construct IITML output for tyl)ographi- 
cal rendition and also to extract indexing inibr- 
marion. 
A fictitious source entry might be of tile form: 
<entry> 
<hw>my word</hw> 
<defs> 
<def num="l">first def.</def> 
<def num="2">second def.</def> 
</defs> 
</entry> 
\'Ve would like to convert this entry to HTML, 
extra.cling tile headword fbr indexing pnrl)oses. 
Apl)lying the rules which are shown in section 
d, XML\]'rans generates the following outl)uC: 
<HTNL> 
<!-- INDEX="my word .... > 
<HEAD> 
<TITLE>my word</TITLE> 
</HEAD> 
<BODY> 
<Hi>my word</Hl> 
<OL> 
<LI VhLUE="l">first def.</Ll> 
<LI VhLUE="2">second def.</LI> 
</OL> 
</BODY> 
</HTNL> 
If" this were an actual dictionary, the XMI/l'rans 
1,ransducer would itera.te over all the entries in 
the dictiona.ry, converting ea.(:h in turn to the 
OUtl)Ut format above. 
3 Aspects of the XMLTrans 
 
Each XMLTrans rule file contains a number of 
rule sets as described in tile next sections. 'l.'he 
transducer attempts to match each rule in tile 
set sequentially until either a rule m~tches or 
there are no more rules. 
The document I)TD is not used to check the 
validity of the input document. Consequenl;ly, 
input documents need not be valid XMI,, but 
must still be well-formed to be accel)ted by the 
parser. 
The rule syntax borrows heavily from tha.t of 
regular expressions and in so doing it allows for 
very concise and compact rule specifica.tion. As 
will be seen shortly, many simple rules can be 
expressed in a single short line. 
1137 
3.1 Rule Sets 
At tile top of an XMLTrans rule file at least 
one "trigger" is required to associate an XML 
element(e.g, an element containing a dictionary 
entry) with a collection of rules, called a "rule 
set ~" 
The syntax for a "trigger" is as follows: 
element_name : ~ rule_set_name 
Multiple triggers can be used to allow different 
kinds of rules to process different kinds of ele- 
ments. For example: 
ENTRY : 0 normalEntryRules 
COMPOUNDENTRY : @ compoundEntryRules 
The rule set itself is declared with the following 
syntax: 
© \[rule set name\] 
For examl)le4: 
normalEntryRules 
; the rules for this set follow 
; the declaration... 
The rule set: is terminated either by the end of 
the file oi: with the declaration of another rule 
set. 
3.2 Variables 
In XMLTrans rule syntax, variables (prefaced 
with "$") m:e implicitly declared with their first 
use. There are two types of variables: 
• Element varial)les: created by an assign- 
ment of a pattern of elements to a. vari- 
M)le...For example: $a = LI, where <LI> 
is an element. Element variables can con- 
tain one or more elements. If a given vari- 
able $a contains a list of elements { A, B, 
C, ...}, transforming $a will apply the 
transformation in sequence to <A>, <13>, 
<C> and so on. 
• Attribute variables: created by an assign- 
ment of a pattern of attributes to a vari- 
able. For Example: LI \[ $a=TYPE \], where 
TYPE is a standard XML attribute. 
While variables are not strongly typed (i.e. a 
list of elements is not distinguished from an in- 
dividual element), attribute variables cannot be 
used in the place of element variables and vice 
versa. 
4XML~l}'ans comments are preceded by a semicolon. 
3.3 Rules 
The basic control structure of XMLTrans is the 
rule, consisting of a left-hand side (LHS) and 
a right-hand side (RHS) separated by an arrow 
("- >"). The LHS is a pattern of XML ele- 
ment(s) to match while the RHS is a specitica- 
tion for a transfbrmation on those elements. 
a.a.1 The Left-hand Side 
The basic building block of the M tS is the ele- 
ment pattern involving a single element, its at- 
tributes and children. 
XMLTrans allows for complex regular expres- 
sions of elements on the tits to match over the 
children of the element being examined. The 
following rule will match an element <Z> which 
has exactly two children, <X> and <Y> (in the 
examples that \[Bllow "..." indicates any comple- 
tion of the rule): 
z{ x Y } -> ...; 
XMH?rans supports the notion of a logical NOT 
over an element expression. This is represented 
by the standard "\[" symbol. Support for gen- 
eral regular expressions is built into the lan- 
guage grammar: "Y*" will match 0 or more 
occurences of the element <Y>, "Y+" one or 
more occurences, and "g?" 0 o1" l occurences. 
In order to create rules of greater generality, 
elements and attributes in the LHS of a. rule 
can be assigned to variables. Per instance, we 
might want to transform a given element <X> 
in a certain way without specifying its children. 
The following rule would be used in such a case: 
; Match X with zero or more unspecified 
; children. 
X{$a*} -> ...; 
In tile rule above, the variable $a will be ei- 
ther empty (if <X> has no children), a single 
element (if <X> has one child), or a list of el- 
ements (if <X> has a series of children. Sinl- 
ilarly, the pattern X{$a} matches an dement 
<X> with exactly one child. 
If an expression contains complex patterns, 
it is often useful to assign specific parts to dif- 
ferent variables. This allows child nodes to be 
processed in groul)s on the billS, perhaps being 
re-used several times or reordered. Consider the 
following rule: 
Z{ $a = (X Y)* $b = Q} -> ... ; 
1138 
in this case $a contains a (possibly e,npty) list 
o\[' {<X>, <Y>} element l)airs. The variable Sb 
will contain exactly one <Q>. If' this pal;tern 
cannot be matched the rule will fail. 
Attribul;es may a,lso 1)e assigned to variables. 
'l"he following three rules demonstrate some l>OS- 
sibilities: 
; Match any X which has an attribute ATT 
X\[ Satt = ATT \] -> ...; 
; Match any X which has an attribute 
; ATT with the value "VALUE". 
X\[ Satt = ATT == "VALUE"\] -> ...; 
; Match any X with an attribute 
; which is NOT equal to "VALUE" 
X\[ Satt = ATT != "VALUE"\] -> ...; 
The last tyl>e of exl)ressions used <)u the IAIS 
a.re string expressions. Strings are considered 
to l)e elements in their own right, but; they ~l,re 
enclosed in (luotes and cannot have atl;ribute 
patterns like regular e,h'ments (:an. A special 
syntax,/.*/, is used to mean a, ny element which 
is a string. The following are some sample string 
matching rules: 
; Match any string 
/.,/ -> ... ; 
; Match text "suppress" & newline. 
"suppress\n"-> ...; 
3.3.2 The Right-hand Side 
The R, II,q SUl)l)lies a COllStruction pa.ttern R)r tile 
tra, nsformed 1;tee node. 
A simple rule might be used to tel)lace a,n 
demenI, and its contents wit\]l some text: 
X -> "Hello world" 
l"or the input <X>Text</X>, this rule, yiekls 
the oul;l)ut string Hello world. A more useful 
rule might strip off the enclosing element using 
a variable refhrence on the \]JIS : 
$X{$a*} -> $a 
For the input <X>Text</X>, this rule gener- 
ates glle oul;l)lll; Text. Elements lnay also be re- 
nnmed while dmir contents remain unmodified. 
The tbllowing rule demonstrates this facility: 
$X{$a*} -> Y{$a} 
\]ibr the input <X>Text</X>, the rule yields 
the outl)ut <Y>Text</Y>. Note that any chil- 
dren o\[' <X> will be reproduced, regardless of 
whether ghey are text elements or not. 
Attribute varialJes may also be ,sed in XML- 
Trans rules. The rule below shows how this is 
aecomplished: 
X \[$a=ATT\] {$b*} -> Y \[OLDATT=$a\] {$b} 
Given the input <X ATT="VAL">Text</X>, 
the r.le yields the output <Y 
OLDATT="VAL" >Text </Y >. 
l{ecursion is a fundamenta,\[ concept used 
ill writing XMLTrans rules. The exl>ression 
@set_name(variablemame) tells the XML- 
Trans transformer to continue processing on the 
elements contained ill tile indica.l;ed variable. 
l'br instance, @setl($a) indicates that the el-- 
ements contained in the va.l'ial)le $a shoukl be 
processed by the rules in the set setl. A spe: 
cial notation ¢(variable~ame) is used to tell 
t;he trausi'ormer to contin,e processing with the 
current rule set. Thus, if dm current rule set 
is set2, the expression @($a) indicates that 
\[)recessing sho,l<l coudnue on tile elelnent,s in 
Sa using the rule set set2. the following rule 
(lemonstra,tes how 1;r~llSOFlllalJOllS ca,n \])e ap- 
plied recusively to an element: 
X{$a*} -> Y{e($a)} 
"Text" -> "txeT" 
For the input element <\>Text</\>, the rule 
generai;es the output <Y>txeT</Y>. \])ifl'erent 
rule sets Call 1)e accessed as ill the following rule 
file segment: 
X : setl 
@ setl 
X{$a*} -> Y{¢set2($a)} 
"Text"-> "txeT" 
@ set2 
"Text"-> "Nothing" 
Initially, set1 is invoked to process the el<;= 
merit <X>, but then the rule set set2 is in- 
yoked to 1)recess its children. Consequently, 
for the input <\>Text</\>, the outing; is 
<Y>Nothing</Y>. 
1139 
4 Rules for the example 
transforlnation 
The transformation of the example ill section 
2 can be achieved with a few XMLTrans rules. 
The main rule treats the <entry> element, cre- 
ating a HTML document fl'om it, and copying 
the headword to several places. The subsequent 
rules generate the HTML output from section 2: 
entry : © entrySet 
@ entrySet 
entry{$a=hw Sb=defs*} 
-> HTML£"<!-- INDEX=" Sa .... >" 
HEAD{TITLE{$a} BODY{HI{$a} 
©($b)}} 
defs£$a=def*} -> 0L{@($a)} 
def \[$att=NUM\] £$a*} 
->LI \[VALUE=$att\] {$a} 
5 Colnparison with XSLT 
The advent of stable versions of XSLT (Clark, 
2000) has dramatically changed the landscape 
of XML transformations, so it is interesting to 
compare XMLTrans with recent developments 
with XSLT. 
lit is evident that the set of transformations 
described by the XMLTrans transformation lan- 
guage is a subset of those described by XSLT. In 
addition, XSLT is integrated with XSL allowing 
the style sheet author to access to the rendering 
aspects of XSL such as \[brmatting objects. 
Untbrtunately, it takes some time to learn 
the syntax of XSL and the various aspects of 
XSLT, such as XPath specifications. This task 
may be particularly difficult for those with no 
prior experience with SGML/XML documents. 
In contrast, one needs only have a knowledge of 
regular expressions to begin writing rules with 
XMLTrans. 
6 Conclusion 
The XMLTrans transducer was used to success- 
fully convert all the lexical data for the l)icoPro 
project. There were 3 bilingual dictionairies and 
one monoligual dictionary totalling 140 Mb in 
total( average size of 20 MB), each requiring its 
own rule file (and sometimes a rule file for each 
 pair direction). Original SGML files 
were preprocessed to provide XMLTrans with 
pure, well-formed XML input. Inputs were in 
a variety of XML formats, and the output was 
HTMI, Rule files had an average of 178 rules, 
and processing time per dictionary was aI)proxi- 
lnately I hour (including pre- and postprocesss- 
ing steps). 
This paper has presented the XMI,Trans 
tra.nsduction . The code is portable 
and should be executable on any platform for 
which a .\]aw~ runtime environment exists. A 
free version of XMLTrans can be downloaded 
fromS: http ://£ssco-www. unige, ch/proj ects 
/d i copro_publ i c/XMLTrans / 

References 

Bingham, 11.:1996 q)SSSL Syntax ,.qunlnlal'y In- 
dex', a.t http://www.tiac.net/uscrs/bingham/ 
dssslsyn/indcx, htm 

Clark, J.:1998 'Jade - James' \])SSSL Engine', 
at http://www.jclark.com/iadc/ 

Clark, J. Ed.:2000 'XSL Transformations 
(XSLT) Version 1.0: W3C Recommendation 
16 November 1999,' at 
http://www, w3. ow/TR /1999/l~l'~'C-a:slt- 
19991116 

Clark, J. and Deach, S. eds.:1998 'Extensible 
Stylesheet Language (XSL) Version 1.0 W3C 
Working Draft 16-December-1998' at h~p://w,~w, wa.o,~j/Tl~/i OgS/WD-.,.sl- 
19981210 

Glazman, D.:\]998 'Siml)le Tree 'l'ransformation 
Sheets 3', at htlp://www, w3. org/77~,/NO I'1~'- 
,~7"/S'3 

IBM Corp.:1999 qBM/Alphawork's l)atML ', at 
h ttp : ///www. alph, a Works. ibm. com /tech /patml 

Language Technology Group:1999 q,T XML 
version 1.1' at 
http://www. Itg. cd. ac. uk/softwarc/xml/indcx, htmI 

Legasys Corp.:1998 'The TXL Source Transfor- 
mat;ion System', at 
http://www.qucis.quccnsu.ca/ Icgasys/ 
TXL_lnJ'o findcx, h t ml 

Omnimark Corp.:1998 'Omnima.rk Corporation 
Home Page', at 
http : //www. omnimark, corn~ 

5Users will also need Sun's SAX and DOM Java 
libraries (Java Project X) available from: 
http ://j ava. sun. com/product s/j avapro j ectx/index, html: 
