Aligning tagged bitexts 
Raquel Martínez
Departamento de Informática y Programación, Facultad de Matemáticas
Universidad Complutense de Madrid, 28040 Madrid, Spain
raquel@eucmos.sim.ucm.es
Joseba Abaitua
Facultad de Filosofía y Letras, Universidad de Deusto
48080 Bilbao, Spain
abaitua@fil.deusto.es
Arantza Casillas
Departamento de Automática, Universidad de Alcalá de Henares
28871 Alcalá de Henares, Spain
arantza@aut.alcala.es
Abstract 
This paper describes how complementary tech- 
niques can be employed to align multiword ex- 
pressions in a parallel corpus or bitext. The 
bitext used for experimentation has two main 
features: (i) it contains bilingual documents 
from a dedicated domain of legal and admin- 
istrative publications rich in specialized jar- 
gon; (ii) it involves two languages, Spanish and 
Basque, which are typologically very distinct 
(both lexically and morpho-syntactically). The 
former feature provides a good basis for testing 
techniques of collocation detection. The latter
presents quite a challenge to a number of reported
algorithms, in particular to the alignment of
sentence-internal segments.
1 Tagged bitexts as large language 
resources 
Much literature has been produced in the
area of sentence alignment of parallel bilingual
corpora or bitexts. Fewer references concern
the alignment of intra-sentential segments
such as word or multiword collocations (Eijk
93), (Kupiec 93), (Dagan & Church 94), and
(Smadja et al. 96). The difficulty of aligning
bitexts depends on a number of factors, such as the
quality of the bitext (whether it is truly parallel
or not), the proximity between the languages
(structural, morpho-syntactic or alphabetic),
and the additional coded information
that bitexts may have (richer or poorer mark-up).
While studying bitext alignment techniques, we
decided that the best approach was to
tag the corpus. Descriptive annotations can
account for linguistic information at all levels,
from discourse structure to phonetic features, as
well as semantics, syntax and morphology.
Annotating the corpus in this manner is very
labour intensive, even when largely automated,
but it produces rewarding results. Thoroughly
tagged bitexts become rich and productive language
resources (Abaitua et al. 98). SGML-based,
TEI-conformant mark-up (Ide & Veronis 95) is the
mark-up option adopted, as discussed in
(Martínez et al. 97).
Continuing the work of (Martínez et
al. 98), where sentence alignment based on
rich mark-up was described, we present here
two further achievements. Section 2 shows how
proper names have been aligned, and Section 3
presents the techniques employed to align
multiword collocations. Results
are evaluated in Section 4, and Section 5 offers
some discussion.
2 Proper name alignment 
2.1 Proper name tagging 
The module for the recognition of proper names
relies on patterns of typography (capitalisation
and punctuation) and on contextual information.
It also makes use of lists of the most common
person, organisation, law, publication and
place names. The tagger annotates a multiword
chain as a proper name <rs> when each
word in the chain has an uppercase initial. A closed
list of function words (prepositions, conjunctions,
determiners, etc.) is allowed to appear
inside the proper name chain (see the examples in
Table 1). A collection of heuristics discards
uppercase-initial words in sentence-initial position
or in other exceptional cases.
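The chain-building rule just described can be sketched as follows. This is a minimal illustration, not the tagger itself: the function-word list is a small hypothetical sample, the sentence-initial heuristic is reduced to "a token at position 0 never starts a chain", and the name lists and contextual information mentioned above are left out.

```python
# Hypothetical sample of the closed list of function words (Spanish only).
FUNCTION_WORDS = {"de", "del", "la", "las", "el", "los", "lo", "y", "sobre"}


def _flush(current, chains):
    """Store the chain built so far, dropping trailing function words."""
    while current and not current[-1][:1].isupper():
        current.pop()
    if current:
        chains.append(" ".join(current))


def find_name_chains(tokens):
    """Return maximal chains of uppercase-initial tokens.

    Function words may appear inside a chain but not at its edges.
    Crude stand-in for the heuristics: the sentence-initial token is ignored.
    """
    chains, current = [], []
    for i, tok in enumerate(tokens):
        if tok[:1].isupper() and i > 0:
            current.append(tok)
        elif current and tok.lower() in FUNCTION_WORDS:
            current.append(tok)  # kept only if a capitalised token follows
        else:
            _flush(current, chains)
            current = []
    _flush(current, chains)
    return chains


tokens = ("Se publica en el Boletín Oficial de Bizkaia y se notifica "
          "al Ayuntamiento de Barrika .").split()
print(find_name_chains(tokens))
```

On this invented sentence the sketch returns the two chains Boletín Oficial de Bizkaia and Ayuntamiento de Barrika; assigning a category to each chain would be a separate step.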
Just as (Smadja et al. 96) distinguished between
two types of collocations, we too distinguish
between:
• Fixed names: Compound proper names
labeled 'fixed', such as Boletín Oficial de
Bizkaia, are rigid compounds. Spanish
proper names all correspond to this type.
• Flexible names: Compound proper
names labeled 'flexible' are compounds
that can be separated by intervening
text elements, as in Administrazio
Publikoetarako Ministeritzaren <date>..
</date> Agindua, where a date splits the
tokens within the compound. There is a
small but significant number of these in
Basque, as previously noted by
(Aduriz et al. 96b).
After proper names have been successfully 
identified (Table 2), the next step is their align- 
ment. Two types of alignment can take place: 
• 1 to 1 alignment: one to one correspon- 
dence between fixed names in the source 
and target documents. 
• 1 to N alignment: one to none or more 
than one correspondences between fixed 
names in the source language and flexible 
names in the target language. 
Alignment has been achieved by resorting to: 
1. Proper name categorization, as shown in
Table 1.
2. Reduction of the alignment space to previously
aligned sentences.
3. Identification of cognate nouns, aided by
a set of phonological rules that apply when
Basque loan terms are directly derived from
Spanish terms.
4. The application of the TasC algorithm
(Martínez et al. 98) adapted to proper
name alignment.
2.2 Identification of cognates 
Points one and two above may suffice to work out
the alignment of fixed proper names belonging
to a category that shows up only once
in the alignment space (i.e. in the sentence).
Nevertheless, there can be sentences with flexible
proper names or with more than one fixed proper
name belonging to the same category. It may
therefore be necessary to determine the correct
alignment among several candidates. As an
additional criterion in these cases, we reinforce
the identification of lexical cognates with a set
of phonological correlation rules.
These are two examples of phonological
correlation rules:
(i) The Spanish prefix 'rel-' always correlates
with the prefix 'erl-' in Basque loans (e.g.
reloj / erloju; relación / erlazio)
(ii) The Spanish suffix '-ción' often correlates
with the suffix '-zio' in Basque loans (e.g.
noción / nozio; administración / administrazio)
We use a set of up to 33 rules of this type. For
some loan terms in Basque (e.g. universidad /
unibertsitate), several of these rules may apply:
-v- → -b-; -rs- → -rts-; -dad → -tate.
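As a rough illustration of how such rules can be applied, the sketch below encodes only the five rules quoted above as regular-expression substitutions and applies them in sequence. The encoding, the rule order and the function name are assumptions made for the example; the remaining rules of the full set are not reproduced here.

```python
import re

# (pattern, replacement) pairs for the rules quoted in the text only.
RULES = [
    (r"^rel", "erl"),    # rel-   -> erl-
    (r"ción$", "zio"),   # -ción  -> -zio
    (r"v", "b"),         # -v-    -> -b-
    (r"rs", "rts"),      # -rs-   -> -rts-
    (r"dad$", "tate"),   # -dad   -> -tate
]


def to_basque_like(spanish_token):
    """Rewrite a Spanish token into the form a regular Basque loan would take."""
    form = spanish_token.lower()
    for pattern, replacement in RULES:
        form = re.sub(pattern, replacement, form)
    return form


for word in ("universidad", "relación", "administración", "noción"):
    print(word, "->", to_basque_like(word))
```

A rewritten form is only a candidate for comparison with the Basque token: for a pair such as sociedad / elkarte the rules produce a form with no resemblance to the actual Basque word, which is why a similarity measure is still needed afterwards.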
Although the application of these phonolog- 
ical rules for identifying Basque loan words is 
quite regular, not every new term in Basque 
is derived in this way. In many other cases a 
Spanish term has a genuine Basque counterpart, 
(e.g. sociedad / elkarte). In any case, this set 
of phonological rules provides a very efficient 
aid for the identification of a high proportion of 
Spanish/Basque cognates (86.45 % on average, 
as shown in Table 3). 
Therefore, when aligning proper names, cog- 
nate identification will help not only in obvious 
cases such as personal or place names, but also 
in categories of proper names such as organiza- 
tion, law or title. 
2.3 Calculating the similarity between 
Basque and Spanish proper names 
In order to determine whether two proper
names belonging to the same category are
translation equivalents of each other, Dice's
coefficient (Dice 45) is applied in two phases: first at
token level and then at proper name level.
Category       Spanish                                       Basque
Person         Javier Otazua Barrena                         Javier Otazua Barrena
Place          Amorebieta-Etxano                             Amorebieta-Etxano
               Corredor del Cadagua                          Kadaguako pasabidea
               Bilbao c/ Alameda Rekalde                     Bilboko Errekalde Zumarkaleko
Organization   Ayuntamiento de Areatza                       Areatzako udalak
               Registro de la Propiedad                      Jabegoaren erroldaritzan
               Sala de lo Contencioso-Administrativo del     Euskal Herriko Justizia Auzitegi Nagusiko
               Tribunal Superior de Justicia del País Vasco  Administraziozko Liskarrauzietarako Salari
Law            Impuesto sobre la Renta                       Errentaren gaineko Zergari
               Plan Especial de Reforma Interior             Barne-Eraberritzearen Plan Beretzia
               Normativa de Rehabilitación                   Birgaikuntzari buruzko Arauko
Title          Jefe del Servicio de Administración de        Zuzeneko Zergen Administrazio
               Tributos Directos                             Zerbitzuko buruaren
               Diputado Foral de Hacienda y Finanzas         Ogasun eta Finantzen foru diputatua
Publication    Boletín de Bizkaia                            Bizkaiko Aldizkari
               Boletín Oficial del País Vasco                Euskal Herriko Aldizkari Ofizialean
Uncategorized  Estudio de Detalle                            Azterlan Zehatzarako
               Acción Comunitaria                            Erkidego Ekintzapidearen
               Documento Nacional de Identidad               Nortasun Agiri Nazionalaren
Table 1: Examples of proper names
Proper Name     Spanish                            Basque
Classes         Precision  Recall  % Spanish PN    Precision  Recall  % Basque PN
Person          100%       100%    4.48%           100%       100%    4.76%
Place           100%       100%    6.38%           100%       100%    6.95%
Organisation    99.2%      97.8%   23.96%          100%       100%    24.17%
Law             99.2%      99.2%   47.93%          100%       100%    46.15%
Title           100%       100%    6.55%           97.2%      97.2%   6.59%
Publication     100%       100%    2.58%           100%       100%    2.74%
Uncategorised   100%       100%    8.10%           100%       100%    8.60%
Total           99.4%      99.1%   100%            99.8%      99.8%   100%
Table 2: Results of proper name identification
1. At the first level, each token in the source
proper name is compared with all the tokens
in the target proper name. To determine
whether two tokens are cognates, their bigrams
are compared and, where they differ, the rules of
phonological derivation are tried. Only when the
resulting coefficient is greater than a threshold
are the tokens considered cognates.
The threshold has been set at 0.5
as a result of different experimental tests.
2. At the second level, given a source and a
target proper name, their similarity is determined
according to the number of cognate tokens
that exist between them. Figure 2 illustrates
an instance of the application of Dice's
coefficient at both levels.
Spanish proper name -- Basque proper name
Boletín Oficial de Bizkaia -- Bizkaiko Aldizkari Ofizialean
First level of similarity:
Boletín -- none
Oficial -- Ofizialean (-c- → -z-): $DC = \frac{2 \times 6}{6 + 9} = 0.8 > 0.5$
Bizkaia -- Bizkaiko (no rule): $DC = \frac{2 \times 5}{6 + 7} = 0.76 > 0.5$
Second level of similarity:
Number of cognate tokens is 2, then:
$DC = \frac{2 \times 2}{3 + 3} = 0.66$
Figure 2: Example of similarity calculation between two proper names
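The two-level calculation of Figure 2 can be reproduced with the short sketch below. It rests on several assumptions that the text does not spell out: bigrams are counted as a multiset of adjacent character pairs, each target token is matched at most once, the Spanish tokens have already been rewritten by the phonological rules (Oficial becomes Ofizial), and function words such as de are dropped before the second level.

```python
from collections import Counter


def bigrams(token):
    """Multiset of character bigrams of a lower-cased token."""
    t = token.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))


def dice(a, b):
    """Dice's coefficient: 2 * shared bigrams / (bigrams in a + bigrams in b)."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    shared = sum((x & y).values())
    return 2 * shared / total if total else 0.0


def name_similarity(src_tokens, tgt_tokens, threshold=0.5):
    """Second level: Dice's coefficient over the number of cognate tokens."""
    used, cognates = set(), 0
    for s in src_tokens:
        scored = [(dice(s, t), j) for j, t in enumerate(tgt_tokens) if j not in used]
        if not scored:
            break
        score, j = max(scored)
        if score > threshold:  # first-level test: tokens are cognates
            used.add(j)
            cognates += 1
    total = len(src_tokens) + len(tgt_tokens)
    return 2 * cognates / total if total else 0.0


print(dice("Ofizial", "Ofizialean"))
print(dice("Bizkaia", "Bizkaiko"))
print(name_similarity(["Boletin", "Ofizial", "Bizkaia"],
                      ["Bizkaiko", "Aldizkari", "Ofizialean"]))
```

Under these assumptions the three values are 0.8, about 0.77 and about 0.67, which correspond to the 0.8, 0.76 and 0.66 reported in Figure 2.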
2.4 Algorithm for proper name 
alignment 
Spanish sentence:
<s id=sESdoc12-2> Segundo: Notificar la presente <rs type=law> Orden Foral </rs>
al <rs type=organisation> Ayuntamiento de Barrika </rs>, publicarla en el
<rs type=publication> Boletín Oficial de Bizkaia </rs> y proceder a la
autentificación del <rs id= type=law> Plan Parcial </rs> tal como ha sido
presentado. </s>
Basque sentence:
<s id=sEUdoc12-3> Bigarrena: Honako erabakia <rs type=organisation> Barrikako
Udalari </rs> jakinerazi eta <rs type=publication> Bizkaiko Aldizkari
Ofizialean </rs> argitaratzea eta <rs type=law> Plan Partziala </rs>
aurkeztua izan den eran kautotzea. </s>
Figure 1: Example of non-literal translation
After similarity metrics have been set between
candidate proper names, the alignment algorithm
can be applied. The alignment algorithm
has been borrowed from (Martínez et al. 98)
with minor differences.
• The first difference is the criterion by which
the similarity among alignable
candidates is determined. While sentences
are aligned on the basis of the similarity of
the annotations they contain, proper names
are aligned on the basis of their belonging
to the same category as well as by matching
cognate tokens.
• The second difference is the relevance of the
order of alignable elements in the bitext.
While in sentence alignment there are constraints
on ordering and grouping that
reduce the number of cases to be evaluated,
in the alignment of proper names such constraints
cannot be applied because ordering is not
predictable.
Due to non-literal translations, 12% of the
identified proper names have no exact counterpart
in the other language (see Figure 1). In
that example, the Basque sentence does not contain
the proper name of the law, Orden Foral, but the
anaphoric nominal honako erabakia, 'this resolution'.
In the corpus, there are 6% more proper
names on the Spanish side of the bitext than on
the Basque side.
Table 3 shows the accuracy of this alignment 
strategy. Proper names with no counterpart 
have not been considered. Figure 3 illustrates 
an instance of how aligned proper names are 
tagged in the bitext. 
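The selection of candidates inside one pair of aligned sentences might be sketched as follows. This is not the TasC algorithm, which is not reproduced here; it is a simple greedy stand-in that restricts candidates to the same category, ignores ordering, and leaves a name unlinked when nothing is similar enough. The identifiers, the `min_similarity` value and the toy `token_dice` measure are all hypothetical, and cross-category links of the kind shown in Figure 3 are outside its scope.

```python
def token_dice(a, b):
    """Toy similarity: Dice's coefficient over whole lower-cased tokens."""
    x, y = set(a.lower().split()), set(b.lower().split())
    return 2 * len(x & y) / (len(x) + len(y)) if x and y else 0.0


def align_names(src_names, tgt_names, similarity=token_dice, min_similarity=0.3):
    """Link proper names found in ONE pair of aligned sentences.

    Each name is a tuple (id, category, text). The result maps every source
    id to a list of target ids; an empty list means no counterpart was found.
    """
    links = {}
    for s_id, s_cat, s_text in src_names:
        candidates = [(t_id, t_text) for t_id, t_cat, t_text in tgt_names
                      if t_cat == s_cat]
        same_cat_in_source = sum(1 for _, cat, _ in src_names if cat == s_cat)
        if len(candidates) == 1 and same_cat_in_source == 1:
            # Category plus sentence alignment is enough to decide.
            links[s_id] = [candidates[0][0]]
        else:
            # Several candidates (or competitors): fall back on similarity.
            links[s_id] = [t_id for t_id, t_text in candidates
                           if similarity(s_text, t_text) >= min_similarity]
    return links


spanish = [("S1", "law", "Orden Foral"),
           ("S2", "organisation", "Ayuntamiento de Barrika"),
           ("S3", "publication", "Boletín Oficial de Bizkaia"),
           ("S4", "law", "Plan Parcial")]
basque = [("T1", "organisation", "Barrikako Udalari"),
          ("T2", "publication", "Bizkaiko Aldizkari Ofizialean"),
          ("T3", "law", "Plan Partziala")]
print(align_names(spanish, basque))
```

With the names of Figure 1 as input, Orden Foral ends up with no counterpart while the other three names are linked, which mirrors the non-literal translation discussed above. In practice the similarity argument would be the cognate-based measure of Section 2.3 rather than the toy function used here.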
3 Alignment of collocations 
(Smadja et al. 96), (Dagan & Church 94),
(Kupiec 93) and (Eijk 93) approach the alignment
of multiword collocations by resorting to a
number of complementary techniques:
(i) Noun phrase collocations: All but
Smadja narrow the scope of collocations to
noun phrases. Smadja is the only one who
attempts to treat other phrases (such as
verb phrases, as well as what he labels 'flexible
phrases').
(ii) Delimited search space: All but Church
delimit the search space to already aligned
sentences. Church, in turn, starts from a
corpus of aligned words.
(iii) POS tagging: All but Smadja employ
Part of Speech (POS) taggers.
We also employ techniques (ii) and (iii), but
we introduce three additional resources: a
bilingual glossary, a bilingual contrastive grammar,
and the structural markup which
already exists in the bitext. In addition, we also
consider verb phrases.
The approach described below illustrates
work in progress on how we try to optimize
the alignment process by combining those techniques.
Collocations are aligned in six steps.
The first three steps are meant to detect candidate
collocations in both languages. The last
three are directly involved in the alignment.
Categories      % Alignable PN   Precision   Recall
Person          100%             100%        100%
Place           89.28%           100%        92%
Organisation    79.38%           96.7%       76.6%
Law             95.68%           100%        88.2%
Title           ?6.2%            100%        72.3%
Publication     100%             100%        100%
Uncategorised   54.54%           93.4%       85.7%
Total           85.45%           98.5%       87.82%
Table 3: Results of proper name alignment
Spanish sentence:
<s id=sES734 corresp=sEU740> <num num=1> 1. </num> Suspender la aprobación
definitiva de la <rs type=law id=LES367 corresp=UEU141,LEU342> Modificación
Puntual de las Normas Subsidiarias de Planeamiento Municipal de Gernika-Lumo
en el Barrio de Arana </rs>, en base a las deficiencias que a continuación se
expresan y que deberán subsanarse <colon> : </colon> </s>
Basque sentence:
<s id=sEU740 corresp=sES734> <num num=1> 1. </num> <rs type=place id=UEU141
corresp=LES367> Araneko Auzoan </rs>, <rs type=law id=LEU342 corresp=LES367>
Gernika-Lumoko Udal Egitamuketazko Ordezko Arauen Puntuzko Aldaketaren </rs>
behin betirako onarpena etetzea, jarraian adierazten diren eta zuzendu egin
beharko diren akatsetan oinarrituta <colon> : </colon> </s>
Figure 3: Sample of 1 to N alignment of proper names
1. Word cooccurrence frequency: Due to
the specialized nature of the bitext, any
word cooccurrence that exceeds a given
threshold is considered to be a collocation
candidate. This threshold depends on the
size of the corpus, but even a figure as low
as 2 can be considered significant enough.
A tool for word cooccurrence detection has
been implemented. This tool is sensitive
to SGML tags and uses a window of
at most ten words. From a subcorpus of
150,000 words, with a threshold of 3 and a
window size of 7, 2,095 candidate collocations
in Spanish and 1,483 in Basque have
been detected.
2. POS tagging: A tagged version of the
Spanish text was supplied by the Natural
Language Research Group at the Universitat
Politècnica de Catalunya (Màrquez &
Padró 97). The Basque text was tagged
by the IXA group from the Euskal Herriko
Unibertsitatea (Aduriz et al. 96a) (see Figure 5).
3. NP, VP grammars: Simple noun phrase
and verb phrase patterns have been used
to detect candidate collocations and to filter
out inappropriate word cooccurrences.
By means of this technique, 80% of the detected
word cooccurrences are discarded.
Basque and Spanish phrases show great divergences,
and for the alignment procedure
to succeed, it has been necessary to implement
an additional resource: a correspondence
table with grammatical patterns for
Spanish and Basque phrases (see Table 4).
4. Bilingual glossary lookup: This is a
very useful resource containing over 15,000
aligned entries. The glossary was developed
by the same translators that were in
charge of the corpus we are working with.
Yet, although the glossary is available
on-line, the translators have not applied it
systematically and frequent divergences arise
(compare Figure 4 with Figure 6).
5. Search within aligned sentences:
Aligned sentences delimit the search space,
thereby reducing the complexity of the
alignment.
6. Human validation: The final step involves
human intervention, so that detected
collocations can be validated and thus incorporated
into the glossary. The possibility
of enriching the glossary with contextual
information has not yet been implemented,
but holds great potential (<doctype>,
<div>, <p> and <s> tags could be used
to locate collocations in context and index
them through their corresponding id tag attributes).
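Step 1 of the procedure above can be illustrated with a small counting sketch. It is not the tool described in the text: tags are simply stripped with a regular expression, tokenisation is whitespace splitting, pairs are ordered (first word, later word), and a pair is kept when its count reaches the threshold, which is one possible reading of the criterion.

```python
import re
from collections import Counter

TAG = re.compile(r"<[^>]+>")  # crude SGML tag matcher (assumption)


def cooccurrences(sentences, window=7, threshold=3):
    """Count ordered word pairs that occur within `window` tokens of each other."""
    counts = Counter()
    for sentence in sentences:
        words = TAG.sub(" ", sentence).lower().split()
        for i, first in enumerate(words):
            # Up to window - 1 following words share a window with `first`.
            for later in words[i + 1:i + window]:
                counts[(first, later)] += 1
    # Assumption: a count equal to the threshold already qualifies.
    return {pair: n for pair, n in counts.items() if n >= threshold}


sample = [
    "<s> interponer <rs type=law> recurso </rs> administrativo </s>",
    "interponer recurso",
    "se puede interponer un recurso",
]
print(cooccurrences(sample, window=3, threshold=3))
```

On the three invented sentences only the pair (interponer, recurso) survives, with a count of 3. The surviving pairs would then be passed to the POS-based filtering of step 3.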
4 Evaluation 
Scores of proper name alignment are shown in
Table 3 and are very satisfactory. With regard
to collocations, we expected that those candidate
collocations found in the bilingual glossary
would show high alignment scores, which has
been the case. We still do not have definite
estimates of the performance for collocations not
present in the glossary. As we discuss below,
we are still sceptical about the results of the
correspondence table with the current version of the
Basque lemmatizer.
Spanish     Basque
N+          N+
N A+        N A+
N A'+       N N+
N P N       N+'
(aux) V+    V+ (aux)
etc.
Table 4: Correspondence table
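One way to put a correspondence table of this kind to work is to encode each row as a pair of regular expressions over POS tag sequences and to accept a Spanish/Basque candidate pair only when some row matches both sides. The sketch below is an illustration under that assumption: the coarse tag names (N, A, V, AUX) are hypothetical, and only the three rows of Table 4 whose patterns are unambiguous are encoded.

```python
import re

# (Spanish pattern, Basque pattern) over space-separated coarse POS tags.
CORRESPONDENCES = [
    (r"N( N)*", r"N( N)*"),                 # N+        ~ N+
    (r"N( A)+", r"N( A)+"),                 # N A+      ~ N A+
    (r"(AUX )?V( V)*", r"V( V)*( AUX)?"),   # (aux) V+  ~ V+ (aux)
]


def compatible(es_tags, eu_tags):
    """True if some table row matches both the Spanish and the Basque tag sequence."""
    es, eu = " ".join(es_tags), " ".join(eu_tags)
    return any(re.fullmatch(p_es, es) and re.fullmatch(p_eu, eu)
               for p_es, p_eu in CORRESPONDENCES)


print(compatible(["AUX", "V"], ["V", "AUX"]))  # verb phrase, auxiliary order reversed
print(compatible(["N", "A"], ["N", "A"]))      # noun plus adjective on both sides
print(compatible(["N", "A"], ["V"]))           # no row covers this pair
```

Because the check works on tag sequences alone, its usefulness depends directly on how much morphological information the two taggers expose.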
5 Discussion 
We have not yet calculated how many detected
collocations are included in the glossary, although
it has become clear that a high proportion
of these detected collocations were not
considered by the translators who created
the glossary. The translators tended to include only
collocations which have a clear terminological
appearance. It is hard to discriminate between
general language collocations and domain-specific
terminology, and this discussion is beyond
the scope of this paper.
The correspondence table with Spanish and
Basque grammatical patterns is at present problematic.
This is due to the lack of morphological
information in the output of the Basque lemmatizer.
Basque is an agglutinative language which
has postpositions and other functional elements
added as suffixes. The information such suffixes
provide is not shown by the lemmatizer,
and this inevitably hinders the efficiency of the
correspondence table. However, we are confident
that future versions of the Basque
and Spanish lemmatizers will become closer, because
they are currently being developed within the
same project team. When their output becomes
more homogeneous, the efficiency of the
correspondence table will be greatly increased.
6 Acknowledgements 
This research is being partially supported by
the Spanish Research Agency, project ITEM,
TIC-96-1243-C03-01. We greatly appreciate
the help given to us by Felisa Verdejo, director
of the project. We are particularly indebted
to the developers of the Spanish (Màrquez & Padró
97) and the Basque (Aduriz et al. 96a) lemmatizers.
We thank CRL for allowing us the use
of their premises and Begoña Farwell for
reviewing the text.
"agotar";"agortu"
"agotar el plazo";"epea agortu"
"agotar la vía administrativa";"administrazio bidea agortu"
"agotarse las reservas";"erreserbak agortu"
...
"defender";"defendatu"
"defender los derechos";"eskubideak defendatu"
"defensa";"defentsa"
"defensa civil";"defentsa zibil"
...
"interponer";"jarri"
"interponer un recurso";"errekurtsoa jarri"
"interponer una reclamación";"erreklamazioa jarri"
"interpretación";"interpretazio"
"medidor";"neurgailu"
"medio";"1) bide; 2) eskuarte"
"medio audiovisual";"ikusentzunezko heibide"
"medio de comunicación";"komunikabide"
"recurso administrativo";"administrazio errekurtso"
"recurso contencioso administrativo";"Administrazioarekiko auzibide-errekurtso"
"recurso de abuso";"abusu errekurtso"
"veto";"geben"
"vía administrativa";"administrazio bide"
"vía administrativa, por";"administrazio bidetik"
Figure 6: Glossary sample

References 
(Abaitua et al. 98) J. Abaitua, A. Casillas, R. Martínez.
Value Added Tagging for Multilingual Resource Management.
Proceedings of the First International Conference
on Language Resources & Evaluation, ELRA,
1003-1007, 1998.
(Aduriz et al. 96a) I. Aduriz, I. Aldezabal, I. Alegría, R.
Urizar. EUSLEM: A lemmatiser/tagger for Basque.
EURALEX'96, Göteborg, Sweden, 1996.
(Aduriz et al. 96b) I. Aduriz, I. Aldezabal, X. Artola,
N. Ezeiza, and R. Urizar. MultiWord Lexical Units in
EUSLEM, a lemmatiser-tagger for Basque. Papers in
Computational Lexicography COMPLEX'96, 1-8.
Budapest, 1996.
(Dagan & Church 94) I. Dagan, K. W. Church. Termight:
Identifying and Translating Technical Terminology.
Proceedings of the Fourth Conference on Applied
Natural Language Processing, ANLP-94, 34-40,
Stuttgart, Germany, 1994.
(Dice 45) L. R. Dice. Measures of the Amount of Ecologic
Association Between Species. Ecology, 26, 297-302, 1945.
(Eijk 93) P. van der Eijk. Automating the Acquisition
of Bilingual Terminology. Proceedings Sixth Conference
of the European Chapter of the Association for
Computational Linguistics, Utrecht, The Netherlands,
113-119, 1993.
(Ide & Veronis 95) N. Ide, J. Veronis. The Text Encoding
Initiative: Background and Contexts. Dordrecht:
Kluwer Academic Publishers, 1995.
(Kupiec 93) J. Kupiec. An algorithm for finding noun
phrase correspondences in bilingual corpora.
Proceedings of the 31st Annual Meeting of the ACL,
Columbus, Ohio, 17-22. Association for Computational
Linguistics, 1993.
(Màrquez & Padró 97) L. Màrquez, L. Padró. A Flexible
POS Tagger Using an Automatically Acquired Language
Model. Proceedings of the joint EACL/ACL97,
Madrid, Spain, 1997.
(Martínez et al. 97) R. Martínez, A. Casillas and J.
Abaitua. Bilingual parallel text segmentation and tagging
for specialized documentation. Proceedings of the
International Conference Recent Advances in Natural
Language Processing, RANLP'97, 369-372, 1997.
(Martínez et al. 98) R. Martínez, A. Casillas and J.
Abaitua. Bitext Correspondences through Rich Mark-up.
Proceedings of the 17th International Conference
on Computational Linguistics (COLING'98) and 36th
Annual Meeting of the Association for Computational
Linguistics (ACL'98), Montreal, Canada, 1998.
(Smadja et al. 96) F. Smadja, K. McKeown, V. Hatzivassiloglou.
Translating Collocations for Bilingual Lexicons:
A Statistical Approach. Computational Linguistics,
Volume 22, No. 1, 1996.
