DTD-Driven Bilingual Document Generation 
Arantza Casillas
Departamento de Automática, Universidad de Alcalá e-mail: arantza@aut.alcala.es
Joseba Abaitua
Facultad de Filosofía y Letras, Universidad de Deusto, Bilbao e-mail: abaitua@fil.deusto.es
Raquel Martinez
Departamento de Sis. Informáticos y Programación, Facultad de Matemáticas
Universidad Complutense de Madrid e-mail: raquel@eucmos.sim.ucm.es
Abstract 
Extensively annotated bilingual parallel corpora 
can be exploited to feed editing tools that in- 
tegrate the processes of document composition 
and translation. Here we discuss the archi- 
tecture of an interactive editing tool that, on 
top of techniques common to most Translation 
Memory-based systems, applies the potential of 
SGML's DTDs to guide the process of bilingual 
document generation. Rather than employing 
just simple task-oriented mark-up, we selected a 
set of TEI's highly complex and versatile collec- 
tion of tags to help disclose the underlying log- 
ical structure of documents in the test-corpus. 
DTDs were automatically induced and later in- 
tegrated in the editing tool to provide the basic 
scheme for new documents. 
1 Introduction 
This paper discusses an approach to the archi- 
tecture of an experimental interactive editing 
tool that integrates the processes of source
document composition and translation into the
target language. The tool has been conceived as an
optimal solution for a particular case of bilin- 
gual production of legal documentation, but it 
also illustrates in a more general way how to ex- 
ploit the possibilities of SGML (ISO8879, 1986) 
used extensively to annotate a whole range of 
linguistic and extralinguistic information in spe- 
cialized bilingual corpora. 
SGML is well established as the coding 
scheme underlying most Translation Memory 
based systems (TMBS), and has been proposed
as the coding scheme for the interchange
of existing Translation Memory databases
(Translation Memories eXchange, TMX; Melby,
1998). The advantages of SGML have also been
perceived by a large community of corpus
linguistics researchers, and considerable efforts have been
made in the development of suitable markup 
options to encode a variety of textual types and 
functions, as clearly demonstrated by the Text
Encoding Initiative, TEI (Burnard & Sperberg-McQueen,
1995). While the tag-sets employed
by TMBS are simple and task-oriented, TEI has 
offered a highly complex and versatile collection 
of tags. The guiding hypothesis in our experi- 
ment has been the idea that it is possible to 
explore TEI/SGML markup in order to develop 
a system that carries the concept of Translation 
Memory one step further. One important
feature of SGML is the DTD. DTDs determine the
logical structure of documents and how to tag 
them accordingly. We have concentrated on the 
accurate description of documents by means of 
TEI conformant SGML markup. The markup 
will help disclose the underlying logical struc- 
ture of documents. From annotated documen- 
tation, DTDs can be induced and these DTDs 
provide the basic scheme to produce new doc- 
uments. We have collected a corpus of official 
publications from three main institutions in the 
Basque Autonomous Region in Spain: the
Boletín Oficial de Bizkaia (BOB, 1990-1995), the
Boletín Oficial de Álava (BOA, 1990-1994) and the
Boletín Oficial del País Vasco (BOPV, 1995).
Documents in the corpus were composed by
Administration clerks and translated by translators.
Both clerks and translators have been using
a wide variety of word-processors, although
since 1994 MSWord has been generalized as the 
standard editing tool. Administrative
documentation shows a regular structure, and is rich
in recurrent textual patterns. For each document
type, different document tokens share a
common global distribution of elements.
Official document composers learn these global
structures and apply them consistently. It is 
also the case that composers tend to reuse old 
document files when producing new documents
of the same type. Despite the fact that no
SGML software was used at the editing phase,
texts in the corpus show regular logical
structures and consistent distribution of text
segments. Our main goal in tagging the corpus was
to make them all explicit (Martinez, 1997). The
most common type of document in the corpus,
the Orden Foral, was chosen (see Table 1). We
analysed some 100 tokens and hand-marked the
most salient elements. The heuristics to identify
these elements were later expressed in a
collection of recognition routines in Perl and tested
against a set of 400 tokens, including the initial
100. As a result of this process of automatic
tagging of structural elements we produced a
TEI/SGML tagged corpus, as yet with no
corresponding overt DTD. In Section 2 we will
explain how DTDs were later induced from the
tagged corpus.
Document Type    %
Orden Foral      53%
Decreto Foral    22%
Resolución       13%
Extracto         5.4%
Acuerdo          3.4%
Norma Foral      1.9%
Anuncio          0.4%
Table 1: Types of documents in the corpus
Once the corpus was segmented, the next step
was to align it. This was conducted at different
levels: general document elements (DIV, SEG,
P), as well as sentential and intra-sentential
elements, such as S, RS, NUM, DATE, etc.
(Martinez, 1998b). Aligned in this way, the corpus
becomes an important resource for translation.
Four complementary language databases may
be obtained at any time from the annotated
corpus: three translation memory databases
(TM1, TM2, and TM3) as well as a terminology
database (termbase). The three TMs differ in
the nature of the translation units they contain.
TM1 consists of aligned sentences that can feed
commercial TM software. TM2 contains
elements which are translation segments ranging
from whole sections of a document or
multi-sentence paragraphs to smaller units, such as
short phrases or proper names. TM3 simply
hosts the whole collection of aligned bilingual
documents, where the whole document may
be considered the translation unit. TM3 can
be construed as a bilingual document database.
Much redundancy originates from this TM
collection, although it should be noticed that they
are all by-products derived from the same
annotated bitext which subsumes them all. Good
software packages for TM1 and TM3 already
exist in the market, and hence their exploitation is
beyond our interest (Trados Translator's
Workbench, Star's Transit, SDLX, Déjà Vu, IBM's
browsing tool for TM3). The originality of our
editing tool lies in a design which benefits from
joining the potentiality of DTDs and the
elements in TM2, as will be shown in sections 4
and 5.
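To make the relationship between the three memories concrete, the following minimal Python sketch derives TM1, TM2 and TM3 views from one aligned bitext. The `AlignedUnit` record, the `split_into_memories` function and the tag-based selection rule are illustrative assumptions and are not taken from the tool itself; the termbase is left out, and the sentence and document strings are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedUnit:
    """One aligned source/target pair from the annotated bitext (hypothetical record)."""
    tag: str      # element name, e.g. "s", "seg9", "title", "rs"
    source: str   # source-language text
    target: str   # target-language text


def split_into_memories(units, document_pair):
    """Return (tm1, tm2, tm3) views of the same aligned material."""
    # TM1: sentence-level pairs, the kind commercial TM software consumes.
    tm1 = [u for u in units if u.tag == "s"]
    # TM2: recurrent segments, titles and referring expressions.
    tm2 = [u for u in units if u.tag.startswith("seg") or u.tag in ("title", "rs")]
    # TM3: the whole document pair is the translation unit.
    tm3 = [document_pair]
    return tm1, tm2, tm3


units = [
    AlignedUnit("seg3", "dispongo", "xedatu dut"),
    AlignedUnit("rs", "Orden Foral", "Foru agindu"),
    AlignedUnit("s", "(a source sentence)", "(its aligned translation)"),
]
tm1, tm2, tm3 = split_into_memories(
    units, ("(whole source document)", "(whole target document)")
)
```

All three lists are filtered from the same input, which mirrors the point that the memories are by-products of a single annotated bitext.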
2 DTD abstraction 
SGML mark-up determines the logical structure 
of a document and its syntax in the form of a 
context-free grammar. This is called the Doc- 
ument Type Definition (DTD) and it contains 
specifications for: 
• Names and content for all elements that are
permitted to appear in a document.
• Order in which these elements must appear.
• Tag attributes with default values for those
elements.
DTDs have been abstracted away from the an- 
notations that were automatically introduced 
in the corpus. Similar experiments have been 
reported before in the literature. (Ahonen, 
1995) uses a method to build document
instances from tagged texts that consists of a
deterministic finite automaton for each content
model. Subsequently, these automata are
generalized and converted into regular expressions
which are easily transcribed into SGML content 
models. (Shafer, 1995) combines document
instances with simplification rules. Our method
is similar to Shafer's, but with a modification
in the way rules reduce document instances. A 
tool to obtain a DTD for all document instances 
has been developed (Casillas, 1999). Given that 
source and target documents show some syn- 
tactic and structural mismatches, two different 
DTDs are induced, one for each language, and
Spanish Text:
<div0>
<div1> ... </div1>
<seg9 id=9ES2 corresp=9EU2> Contra dicha
<rs type=law id=LES12 corresp=LEU10>
Orden Foral </rs>, que agota la vía
administrativa podrá interponerse recurso
contencioso-administrativo ante la <rs
type=organization id=OES9 corresp=OEU11>
Sala de lo Contencioso-Administrativo del
Tribunal Superior de Justicia del País Vasco </rs>,
en el plazo de dos meses, contado desde el día
siguiente a esta notificación, sin perjuicio de la
utilización de otros medios de defensa que estime
oportunos. </seg9>
<seg10 id=10ES1 corresp=10EU1> Durante
el referido plazo el expediente BHI-<num
num=10094> 100/94 </num>-P05-A quedará de
manifiesto para su examen en las dependencias
de <rs type=place id=PES3 corresp=PEU2>
Bilbao calle Alameda Rekalde </rs>, <num
num=30> 30 </num>, <num num=5> 5.a </num>
y <num num=6> 6.a </num> plantas. </seg10>
</div0>
<closer id=pES13 corresp=pEU13> <name>
El Diputado Foral de Urbanismo Pedro Hernández
González. </name> </closer>
Basque Text:
<div0>
<div1> ... </div1>
<seg9 id=9EU2 corresp=9ES2> <rs type=law
id=LEU10 corresp=LES12> Foru agindu </rs>
horrek amaiera eman dio administrazio bideari;
eta beraren aurka <rs type=organization
id=OEU10> Administrazioarekiko </rs>
auzibide-errekurtsoa jarri ahal izango zaio <rs
type=organization id=OEU11 corresp=OES9>
Euskal Herriko Justizi Auzitegi Nagusiko
Administrazioarekiko Auzibideetarako Salari </rs>,
bi hilabeteko epean; jakinarazpen hau egiten
den egunaren biharamunetik zenbatuko da epe
hori; hala eta guztiz ere, egokiesten diren beste
defentsabideak ere erabil litezke. </seg9>
<seg10 id=10EU1 corresp=10ES1> Epe hori
amaitu arte BHI-<num num=10094> 100/94
</num>-P05-A espedientea agerian egongo da,
nahi duenak azter dezan, <rs type=place
id=PEU2 corresp=PES3> Bilboko Errekalde
zumarkaleko </rs> <num num=30> 30.eko </num>
bulegoetan, <num num=5> 5 </num> eta <num
num=6> 6.</num> solairuetan. </seg10>
</div0>
<closer id=pEU13 corresp=pES13> <name>
Hirigintzako foru diputatua. Pedro Hernández
González. </name> </closer>
Figure 1: A sample of the annotated bitext
are paired through a correspondence table.
Correspondences in this table can be updated or
deleted. At present, we have six DTDs, one for
each document type in each language (there are
three document types; Figure 2 shows a part of
one of these DTDs). By means of these paired
DTDs, document elements in each language are
appropriately placed. In the process of
generating the bilingual document, a document type
must first be selected. Each document type has
an associated DTD. This DTD specifies which
elements are obligatory and which are optional.
With the aid of the DTD, the source document
is generated. The target document will be
generated with the aid of the corresponding target
DTD.
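As a rough illustration of how a content model might be abstracted from tagged instances, the sketch below merges the observed child sequences of one element and marks as optional any child that is missing from some instance. It is a deliberately simplified stand-in, not the method of Ahonen or Shafer and not the tool of (Casillas, 1999); the function name `induce_content_model` and the two sample OPENER instances are assumptions made up for the example.

```python
def induce_content_model(instances):
    """Merge observed child sequences of one element into a flat content model.

    Assumes each child occurs at most once per instance and that children keep
    the same relative order across instances.
    """
    order = []
    for sequence in instances:
        previous = None
        for name in sequence:
            if name not in order:
                # Place a newly seen child right after the child that preceded it.
                position = order.index(previous) + 1 if previous is not None else 0
                order.insert(position, name)
            previous = name
    parts = []
    for name in order:
        # A child missing from at least one instance is marked optional.
        optional = any(name not in sequence for sequence in instances)
        parts.append(name + "?" if optional else name)
    return "(" + ", ".join(parts) + ")"


# Two invented OPENER instances: one without NAME, one with it.
opener_instances = [
    ["TITLE", "NUM", "DATE", "SEG1"],
    ["TITLE", "NUM", "DATE", "NAME", "SEG1"],
]
opener_model = induce_content_model(opener_instances)
```

With these two instances the result is `(TITLE, NUM, DATE, NAME?, SEG1)`, the same shape as the OPENER declaration in Figure 2. Repetition, alternation and mixed content, which the real DTDs use, are outside the scope of this sketch.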
3 Joining TM2 and DTD 
TM2 specifically stores a type of translation
segment class, which we have tagged <seg1>,
<seg2>... <segn>, <title> and <rs>, and
which is relevant to the DTD. Segments tagged
<segn> are variable recurrent language patterns
very frequent in the specialized domain
of the corpus and whose occurrence in the text
is well established. These <segn> tags include
two attributes, id and corresp, which
locate the aligned segment both in the corpus
and in the database (Figure 1). Segments
tagged <rs> are referring expressions
which have been recognized, tagged and aligned
and which correspond largely to proper names
(Martinez, 1998a), (Martinez, 1998b). TM2 is
managed in the form of a relational database
where segments are stored as records. Each
record in the database consists of four fields:
the segment string, a counter for the occurrences
of that string in the corpus, the tag
and the attributes (type, id and corresp).
<!ELEMENT LEGE - - (TEXT)>
<!ELEMENT TEXT - - (BODY)>
<!ELEMENT BODY - - (OPENER, DIV0, CLOSER)>
<!ELEMENT OPENER - - (TITLE, NUM, DATE, NAME?, SEG1)>
<!ELEMENT SEG1 - - (SEG11, (#PCDATA|RS|DATE|NUM)+)>
<!ELEMENT (SEG11, NUM, DATE, RS, NAME, TITLE) - - (#PCDATA)>
<!ELEMENT (DIV0) - - ((#PCDATA|RS|NUM|DATE|SEG4|SEG5|SEG6|SEG7|SEG8|SEG12|SEG14|SEG15)+, SEG9?, SEG10?)>
<!ELEMENT (SEG4, SEG5, SEG6) - - (#PCDATA)>
<!ELEMENT (SEG9, SEG10, SEG7, SEG8, SEG12, SEG14, SEG15) - - (#PCDATA|RS|DATE|NUM)+>
<!ELEMENT (CLOSER) - - (PLACENAME?, DATE?, NAME?)>
<!ELEMENT (PLACENAME) - - (RS)>
<!ATTLIST RS TYPE (ORGANIZATION|LAW|PLACE|UNCAT) UNCAT>
Figure 2: Part of the DTD for the document type Orden Foral
Table 2 shows how the text fragment inside
the </div1>...</div0> tags of Figure 1 renders
three records in the database. Note how
the content of the string field in the database 
maintains only the initial <segn> and <rs> 
tags. Furthermore, <rs> tagged segments in- 
side <segn> records are simplified so that their 
content is dismissed and only the initial tag is 
kept (Lange et al., 1997). The reason is that 
they are considered variable elements within 
the segment (dates and numbers are also elements
of this type). The strings Orden Foral of
record 2, marked as <rs type=law>, and Sala
de lo Contencioso-Administrativo del Tribunal
Superior de Justicia del País Vasco of record
3, marked as <rs type=organization>, are thus not
included in record 1 <seg9>, since they may
differ in other instantiations of the segment. These
internal elements are largely proper names that
vary from one instantiation of the segment to
another. The <rs> tag can be considered
to be the name of the varying element. The
value of the type attribute (e.g. <rs type=law>)
constrains the kind of referential expression
that may be inserted at that point of the
translation segment. Table 2 shows that source
and target records may not have straight
one-to-one correspondences. Although this is by
no means the general case (only about 5.61%;
Martinez, 1998a), such one-to-N correspondences
provide good ground to explain how
TM2 is designed. The asymmetry can be
easily explained. The Spanish term recurso
contencioso-administrativo has been translated
into Basque by means of a category changing
operation, where the Spanish adjective
administrativo has been translated as a Basque noun
complement Administrazioarekiko, which literally
means "Administration-the-with-of", triggering
its identification as a proper noun.
Table 3 shows the way in which source language
units are related to their corresponding
target units, which, as can be observed, can
be one-to-one or one-to-N. This means that one
source element can have more than one
translation.
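The simplification of a tagged segment into TM2 records, as illustrated in Table 2, can be sketched with a few regular expressions. This is only an assumed reading of the procedure: the function name `simplify_segment` and the patterns are hypothetical, the input is an abridged version of the Spanish <seg9> of Figure 1, and dates and numbers (which the text also treats as variable elements) are not handled.

```python
import re

# A whole <rs ...> ... </rs> element; group 1 is the type, group 2 the content.
RS_ELEMENT = re.compile(r"<rs\s+type=(\w+)[^>]*>\s*(.*?)\s*</rs>", re.S)
# An opening <segN ...> tag with its attributes.
SEG_OPEN = re.compile(r"<(seg\d+)\b[^>]*>")
SEG_CLOSE = re.compile(r"</seg\d+>")


def simplify_segment(tagged):
    """Return (segment string, rs records) for one tagged <segn> element."""
    # Each referring expression becomes a record of its own ...
    rs_records = [
        (m.group(1), " ".join(m.group(2).split()))
        for m in RS_ELEMENT.finditer(tagged)
    ]
    # ... and is reduced to its initial tag inside the segment string.
    text = RS_ELEMENT.sub(lambda m: f"<rs type={m.group(1)}>", tagged)
    text = SEG_CLOSE.sub("", text)
    text = SEG_OPEN.sub(r"<\1>", text)  # drop id/corresp, keep the tag name
    return " ".join(text.split()), rs_records


sample = (
    "<seg9 id=9ES2 corresp=9EU2> Contra dicha "
    "<rs type=law id=LES12 corresp=LEU10> Orden Foral </rs>, "
    "que agota la vía administrativa </seg9>"
)
segment_string, rs_records = simplify_segment(sample)
```

For the abridged sample, `segment_string` is `<seg9> Contra dicha <rs type=law>, que agota la vía administrativa` and `rs_records` is `[("law", "Orden Foral")]`, which corresponds to records 1 and 2 of Table 2.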
TM2 is created in three steps: 
• First, non-pertinent tags are filtered out
from the annotated corpus. Tags marking
sentence <s> and paragraph <p> alignment
are removed because they are of no
interest for TM2 (recall that they are
registered in TM1).
• Second, translation segments <segn>,
<title> phrases and referential expressions
<rs> are detected in the source document
and looked up in the database.
• Third, if they are not already present in
the database, they are stored, each in its
own database, and the values of the id and corresp
attributes are used to set the correspondence
between source and target databases.
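The storage step can be pictured with a small relational sketch using Python's standard `sqlite3` module. The schema is an assumption: the text only states that a record holds the string, an occurrence counter, the tag and the attributes, and that source and target are linked through id and corresp. Here both languages share one database with a `lang` column, links are keyed on the strings rather than on the ids, and the occurrence counts in the usage lines are invented for illustration.

```python
import sqlite3


def open_tm2():
    """Create an in-memory TM2 with a record table and a correspondence table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE record ("
        " lang TEXT, string TEXT, occurrences INTEGER, tag TEXT, attrs TEXT,"
        " PRIMARY KEY (lang, string))"
    )
    db.execute(
        "CREATE TABLE link (source TEXT, target TEXT, PRIMARY KEY (source, target))"
    )
    return db


def store(db, lang, string, tag, attrs=""):
    """Increment the counter of a known string, or insert it with count 1."""
    cur = db.execute(
        "UPDATE record SET occurrences = occurrences + 1"
        " WHERE lang = ? AND string = ?",
        (lang, string),
    )
    if cur.rowcount == 0:
        db.execute(
            "INSERT INTO record (lang, string, occurrences, tag, attrs)"
            " VALUES (?, ?, 1, ?, ?)",
            (lang, string, tag, attrs),
        )


def link(db, source, target):
    """Record that a source string corresponds to a target string."""
    db.execute(
        "INSERT OR IGNORE INTO link (source, target) VALUES (?, ?)",
        (source, target),
    )


def translations(db, source):
    """Target strings for a source string, most frequent first."""
    rows = db.execute(
        "SELECT l.target FROM link AS l"
        " JOIN record AS r ON r.lang = 'eu' AND r.string = l.target"
        " WHERE l.source = ?"
        " ORDER BY r.occurrences DESC, l.target",
        (source,),
    )
    return [target for (target,) in rows]


db = open_tm2()
store(db, "es", "<seg3> dispongo", "seg3")
for basque in ("<seg3> xedatu dut", "<seg3> xedatu dut", "<seg3> xedatzen duen"):
    store(db, "eu", basque, "seg3")
    link(db, "<seg3> dispongo", basque)
candidates = translations(db, "<seg3> dispongo")
```

The one-to-N case of Table 3 falls out of the `link` table: `candidates` holds both Basque renderings of `<seg3> dispongo`, ordered by how often each was stored.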
Record 1
Spanish Unit: <seg9> Contra dicha <rs type=law>,
que agota la vía administrativa podrá interponerse recurso
contencioso-administrativo ante la <rs type=organization>,
en el plazo de dos meses, contado desde el día siguiente
a esta notificación, sin perjuicio de la utilización
de otros medios de defensa que estime oportunos.
Basque Unit: <seg9> <rs type=law>
horrek amaiera eman dio administrazio bideari; eta beraren aurka
<rs type=organization> auzibide-errekurtsoa jarri
ahal izango zaio <rs type=organization>,
bi hilabeteko epean; jakinarazpen hau egiten den egunaren
biharamunetik zenbatuko da epe hori;
hala eta guztiz ere, egokiesten diren beste
defentsabideak ere erabil litezke.
Record 2
Spanish Unit: <rs type=law> Orden Foral
Basque Unit: <rs type=law> Foru agindu
Spanish Unit: (none)
Basque Unit: <rs type=organization> Administrazioarekiko
Record 3
Spanish Unit: <rs type=organization> Sala de lo
Contencioso-Administrativo del Tribunal Superior
de Justicia del País Vasco
Basque Unit: <rs type=organization> Euskal Herriko Justizi
Auzitegi Nagusiko Administrazioarekiko
Auzibideetarako Salari
Table 2: Source and target language record samples in TM2
Spanish Unit                           Basque Unit
<rs type=organization id= corresp=>    <rs type=organization id= corresp=>
Boletín Oficial de Bizkaia             Bizkaiko Aldizkari Ofizialea
                                       <rs type=organization id= corresp=>
                                       Bizkaiko Engunkari Ofizialea
                                       <rs type=organization id= corresp=>
                                       Bizkaiko Boletin Ofizialea
<seg3> dispongo                        <seg3> xedatu dut
                                       <seg3> xedatzen duen
Table 3: Source language units related with their corresponding target language units
4 Composition Strategy
Every phase in the process is guided by the
markup contained in TM2 and the paired DTDs
which control the application of this markup.
The composition process follows two main steps
which correspond to the traditional source
document generation and translation into the
target document. The markup and the paired
DTDs guide the process in the following
manner:
1. Before the user starts writing the source
document, s/he must select a document type,
i.e., a DTD. This has two consequences. On
the one hand, the selected DTD produces
a source document template that contains
the logical structure of the document and
some of its contents. On the other hand,
the selected source DTD triggers a paired
target DTD, which will be used later to
translate the document. There are three
different types of elements in the source
document template:
• Some elements are mandatory and are
provided to the user, who need only
choose their content among some alternative
usages (s/he will get a list of
alternatives ordered by frequency, for
example for <title>). Other obligatory
elements, such as dates and numbers,
will also be automatically generated.
• Some other elements in the template
are optional (e.g., <seg9>). Again,
a list of alternatives will be offered to
the user. These optional elements are
sensitive to the context (document or
division type), and markup is also responsible
for constraining the valid options
given to the user. Obligatory and
optional elements are retrieved from
TM2, and make up a considerable part of
the source document.
• All documents have an important part
of their content which is not determined
by the DTD (<div1>). It is the
most variable part, and the system lets
the writer input text freely. It is when
TM2 has nothing to offer that TM1
and TM3 may provide useful material.
Given the recurrent style of legal
documentation, it is quite likely that the
user will be using many of the bilingual
text choices already aligned and
available in TM1 and TM3.
2. Once the source document has been completed,
the system derives its particular
logical structure, which, with the aid of the
target DTD, is projected into the resulting
target logical structure.
Words/doc.      Num. doc.   TM2
0-500           378         34.91
500-1,000       25          (illegible)
More 1,000      16          3.01
Weighted mean               31.8
Table 4: % generated by TM2
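The first composition step, in which a selected DTD yields a template of mandatory and optional slots filled from TM2, might look like the following sketch. It handles only flat sequence models such as the OPENER declaration of Figure 2; the function names `template_slots` and `ranked_alternatives`, the `usage_counts` structure and the sample title strings with their counts are all hypothetical.

```python
def template_slots(content_model):
    """Split a flat model such as "(TITLE, NAME?, SEG1)" into (name, mandatory) pairs."""
    slots = []
    for token in content_model.strip().strip("()").split(","):
        token = token.strip()
        # A trailing "?" marks an optional element in the content model.
        slots.append((token.rstrip("?"), not token.endswith("?")))
    return slots


def ranked_alternatives(element, usage_counts):
    """List the stored strings for one element, most frequent first."""
    counts = usage_counts.get(element, {})
    return sorted(counts, key=lambda text: (-counts[text], text))


# Invented TM2 usage counts for the TITLE element.
usage_counts = {"TITLE": {"(title wording A)": 7, "(title wording B)": 19}}

slots = template_slots("(TITLE, NUM, DATE, NAME?, SEG1)")
title_choices = ranked_alternatives("TITLE", usage_counts)
```

Here `slots` marks NAME as optional and the other four elements as mandatory, and `title_choices` lists wording B before wording A because it is the more frequent of the two.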
5 Evaluation 
Table 4 shows the percentage of words in the source
documents that belong to segments stored in TM2.
There is a line for each document
size considered. We can see that, on average,
31.8% of the text is contained in TM2, on a
range from 34.91% down to only 3.01%. The amount
of text dealt with in this way largely depends
on the size of the document. Short documents
(90.21% of the total) have about 35% of their text
composed in this way. This figure goes down to
3% in documents larger than 1,000 words. This
is understandable, in the sense that the larger
the document, the smaller the proportion of fixed
sections it will contain.
Table 5 shows the percentage of words that are
proposed for the target document. These translations
are obtained from what is stored in TM2,
complemented by algorithms designed to translate
dates and numbers. We can see that, on
average, 34% of a document is translated. Short
documents have 36% of their text translated,
falling to about 11% in the case of large
documents.
Words/doc.    Num. doc.   TM2    Alg.   Total
0-500         378         28.3   7.7    36
500-1,000     25          12.3   9.6    21.3
More 1,000    16          4.7    6.41   10.7
W.M.                      26.5   7.6    34.2
Table 5: % translated by TM2 and algorithms
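The weighted mean in the Total column can be checked with a short calculation, assuming that each size band is weighted by its number of documents (the text does not state the weighting explicitly). The values are those read from the Total column of Table 5; the function name `weighted_mean` is ours.

```python
# (number of documents, % of text translated) per size band, from Table 5.
total_by_band = [(378, 36.0), (25, 21.3), (16, 10.7)]


def weighted_mean(rows):
    """Mean of the percentages, weighted by the number of documents per band."""
    documents = sum(count for count, _ in rows)
    return sum(count * pct for count, pct in rows) / documents


mean_total = weighted_mean(total_by_band)
# Share of the corpus made up of short (0-500 word) documents.
short_share = 100 * 378 / sum(count for count, _ in total_by_band)
```

Under this assumption `mean_total` rounds to 34.2, in agreement with the W.M. row, and `short_share` rounds to 90.21, the proportion of short documents among the 419 evaluated.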
6 Conclusions 
We have shown how DTDs derived from descriptive
markup can be employed to ease the
process of generating bilingual dedicated
documentation. On average, one third of the
contents of the documents can be automatically
accounted for. It must also be pointed out that
the part being dealt with represents the core 
structure, lay-out and logical components of the 
text. The remaining two-thirds of untreated 
document can still be managed with the aid 
of sentence-oriented TMBS, filling in the gaps 
in the overall skeleton provided by the target
template. Composers may also browse TM3 to 
retrieve whole blocks for those parts which are 
not determined by the DTD. One of the clear 
targets for the future is to extend the cover- 
age of the corpus and to test structural taggers 
against other document types. A big challenge 
we face is to develop tools that automatically 
perform the recognition of documents from less 
restricted and more open text types. However, 
we are not sure of the extent of the practicality 
of such an approach. An alternative direction 
we are presently considering is to establish a 
collection of pre-defined document types, which 
would be validated by the institutional writers 
themselves. It is a process currently being
implemented in the Basque administration to
define document models for writers and translators
to follow. What we have demonstrated is
that paired DTDs, complemented with rich lan- 
guage resources of the kind defined in this pa- 
per, allow for the design of optimal editing envi- 
ronments which would combine both document 
composition and translation as one single
process. All the resources needed (DTDs and TMs)
can be induced from an aligned corpus.
7 Acknowledgements 
This research is being partially supported by the 
Spanish Research Agency, project ITEM, TIC- 
96-1243-C03-01. 

References 
H. Ahonen. Automatic Generation of SGML
Content Models. Electronic Publishing, 8(2-3):195-206,
1995.
L. Burnard, C. Sperberg-McQueen. TEI Lite:
An Introduction to Text Encoding
for Interchange. URL: http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei,
1995.
Casillas A., Abaitua J., Martinez R. Extracción
y aprovechamiento de DTDs emparejadas en
corpus paralelos. Procesamiento del Lenguaje
Natural, 25:33-41, 1999.
ISO 8879, Information Processing-Text and Of- 
fice Systems-Standard Generalized Markup 
Language (SGML). International Organiza- 
tion For Standards, 1986, Geneva. 
J. Langé, E. Gaussier, B. Daille. Bricks and
Skeletons: Some Ideas for the Near Future of
MAHT. Machine Translation, 12:39-51, 1997.
Martinez R., Abaitua J., Casillas A. Bilingual
parallel text segmentation and tagging for
specialized documentation. Proceedings of the
International Conference Recent Advances in
Natural Language Processing (RANLP'97),
369-372, 1997.
Martinez R., Abaitua J., Casillas A. Bitext
Correspondences through Rich Mark-up.
36th Annual Meeting of the Association
for Computational Linguistics and 17th International
Conference on Computational Linguistics
(COLING-ACL'98), 812-818, 1998a.
Martinez R., Abaitua J., Casillas A. Aligning
tagged bitexts. Sixth Workshop on Very Large
Corpora, 102-109, 1998b.
A. Melby. Data Exchange from OSCAR and
MARTIF Projects. First International Conference
on Language Resources & Evaluation,
3-7, 1998.
