Proceedings of the ACL Workshop on Building and Using Parallel Texts, pages 65–74,
Ann Arbor, June 2005. © Association for Computational Linguistics, 2005
Word Alignment for Languages with Scarce Resources
Joel Martin
National Research Council
Ottawa, ON, K1A 0R6
Joel.Martin@cnrc-nrc.gc.ca
Rada Mihalcea
University of North Texas
Denton, TX 76203
rada@cs.unt.edu
Ted Pedersen
University of Minnesota
Duluth, MN 55812
tpederse@umn.edu
Abstract
This paper presents the task definition,
resources, participating systems, and
comparative results for the shared task
on word alignment, which was organized
as part of the ACL 2005 Workshop on
Building and Using Parallel Texts. The
shared task included English–Inuktitut,
Romanian–English, and English–Hindi
sub-tasks, and drew the participation of ten
teams from around the world with a total of
50 systems.
1 Defining a Word Alignment Shared Task
The task of word alignment consists of finding cor-
respondences between words and phrases in parallel
texts. Assuming a sentence aligned bilingual corpus
in languages L1 and L2, the task of a word alignment
system is to indicate which word token in the corpus
of language L1 corresponds to which word token in
the corpus of language L2.
This year’s shared task follows on the success of
the previous word alignment evaluation that was or-
ganized during the HLT/NAACL 2003 workshop on
"Building and Using Parallel Texts: Data Driven Machine
Translation and Beyond" (Mihalcea and Pedersen, 2003).
However, the current edition is distinct in that it
focuses on languages with scarce
resources. Participating teams were provided with
training and test data for three language pairs, ac-
counting for different levels of data scarceness: (1)
English–Inuktitut (2 million words training data),
(2) Romanian–English (1 million words), and (3)
English–Hindi (60,000 words).
As in the previous word alignment evaluation and the
Machine Translation evaluation exercises organized by
NIST, two different subtasks were defined: (1) Limited
resources, where systems were allowed to use only the
resources provided; and (2) Unlimited resources, where
systems were allowed to use any resources in addition to
those provided. Such additional resources had to be
explicitly mentioned in the system description.
Test data were released one week prior to the dead-
line for result submissions. Participating teams were
asked to produce word alignments, following a com-
mon format as specified below, and submit their out-
put by a certain deadline. Results were returned to
each team within three days of submission.
1.1 Word Alignment Output Format
The word alignment result files had to include one line
for each word-to-word alignment. Additionally, they
had to follow the format specified in Figure 1. Note
that the S|P and confidence fields overlap in their
meaning. The intent of having both fields available
was to enable participating teams to draw their own
line on what they considered to be a Sure or Probable
alignment. Both these fields were optional, with some
standard values assigned by default.
1.1.1 A Running Word Alignment Example
Consider the following two aligned sentences:
[English] <s snum=18> They had gone . </s>
[French] <s snum=18> Ils étaient allés . </s>
A correct word alignment for this sentence is:
18 1 1
18 2 2
18 3 3
18 4 4
stating that all the word alignments pertain to sentence 18:
the English token 1 They aligns with the French token 1 Ils,
the English token 2 had aligns with the French token 2
étaient, and so on. Note that punctuation is also aligned
(English token 4 aligned with French token 4), and counts
toward the final evaluation figures.
Alternatively, systems could also provide an S|P marker
and/or a confidence score, as shown in the following example:
18 1 1 1
18 2 2 P 0.7
18 3 3 S
18 4 4 S 1
with missing S|P fields considered by default S, and
missing confidence scores considered by default 1.

sentence_no position_L1 position_L2 [S|P] [confidence]
where:
• sentence_no represents the id of the sentence within the
  test file. Sentences in the test data already have an id
  assigned (see the running example).
• position_L1 represents the position of the token that is
  aligned from the text in language L1; the first token in each
  sentence is token 1 (not 0).
• position_L2 represents the position of the token that is
  aligned from the text in language L2; again, the first token
  is token 1.
• S|P can be either S or P, representing a Sure or Probable
  alignment. All alignments that are tagged as S are also
  considered to be part of the P alignments set (that is, all
  alignments that are considered "Sure" alignments are also part
  of the "Probable" alignments set). If the S|P field is missing,
  a value of S will be assumed by default.
• confidence is a real number, in the range (0-1] (1 meaning
  highly confident, 0 meaning not confident); this field is
  optional, and by default a confidence number of 1 was assumed.
Figure 1: Word Alignment file format
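The file format of Figure 1 and the defaults described above can be illustrated with a small reader. This is only a sketch, not the evaluation software used in the shared task: the names `Alignment` and `parse_alignment_line` are hypothetical, and the handling of a line such as `18 1 1 1` (a fourth field that is neither S nor P is read as the confidence score) is an assumption about how the optional fields combine.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """One word-to-word alignment line (hypothetical container)."""
    sentence_no: int
    position_l1: int
    position_l2: int
    label: str = "S"          # default when the S|P field is missing
    confidence: float = 1.0   # default when the confidence field is missing


def parse_alignment_line(line: str) -> Alignment:
    """Parse 'sentence_no position_L1 position_L2 [S|P] [confidence]'."""
    fields = line.split()
    if not 3 <= len(fields) <= 5:
        raise ValueError(f"expected 3 to 5 fields, got {len(fields)}: {line!r}")
    sentence_no, pos_l1, pos_l2 = (int(f) for f in fields[:3])
    label, confidence = "S", 1.0
    rest = fields[3:]
    if rest and rest[0] in ("S", "P"):
        label = rest.pop(0)
    if rest:
        # Assumption: any remaining field is the confidence score.
        confidence = float(rest.pop(0))
    if rest:
        raise ValueError(f"unexpected trailing fields in line: {line!r}")
    return Alignment(sentence_no, pos_l1, pos_l2, label, confidence)


example = ["18 1 1 1", "18 2 2 P 0.7", "18 3 3 S", "18 4 4 S 1"]
alignments = [parse_alignment_line(line) for line in example]
```

Under these assumptions, only the second line of the example yields a Probable alignment; the other three fall back to S, and every line without an explicit score receives a confidence of 1.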
1.2 Annotation Guide for Word Alignments
The word alignment annotation guidelines are similar
to those used in the 2003 evaluation.
1. All items separated by a white space are consid-
ered to be a word (or token), and therefore have
to be aligned (punctuation included).
2. Omissions in translation use the NULL token,
i.e. token with id 0.
3. Phrasal correspondences produce multiple word-
to-word alignments.
2 Resources
The shared task included three different language
pairs, accounting for different language and data
characteristics. Specifically, the three subtasks ad-
dressed the alignment of words in English–Inuktitut,
Romanian–English, and English–Hindi parallel texts.
For each language pair, training data were provided to
participants. Systems relying only on these resources
were considered part of the Limited Resources sub-
task. Systems making use of any additional resources
(e.g. bilingual dictionaries, additional parallel cor-
pora, and others) were classified under the Unlimited
Resources category.
2.1 Training Data
Three sets of training data were made available. All
data sets were sentence-aligned, and pre-processed
(i.e. tokenized and lower-cased), with identical pre-
processing procedures used for training, trial, and test
data.
English–Inuktitut. A collection of sentence-
aligned English–Inuktitut parallel texts from the
Legislative Assembly of Nunavut (Martin et al.,
2003). This collection consists of approximately
2 million Inuktitut tokens (1.6 million words) and
4 million English tokens (3.4 million words). The
Inuktitut data was originally encoded in Unicode
representing a syllabics orthography (qaniujaaqpait),
but was transliterated to an ASCII encoding of the
standardized roman orthography (qaliujaaqpait) for
this evaluation.
Romanian–English. A set of Romanian–English
parallel texts, consisting of about 1 million Romanian
words, and about the same number of English words.
This is the same training data set as used in the 2003
word alignment evaluation (Mihalcea and Pedersen,
2003). The data consists of:
• Parallel texts collected from the Web using a
  semi-supervised approach. The URL formats of pages
  containing potential parallel translations were manually
  identified (mainly from the archives of Romanian
  newspapers). Next, texts were automatically downloaded
  and sentence aligned. A manual verification of the
  alignment was also performed. This data collection
  process resulted in a corpus of about 850,000 Romanian
  words, and about 900,000 English words.
• Orwell's 1984, aligned within the MULTEXT-EAST
  project (Erjavec et al., 1997), with about 130,000
  Romanian words, and a similar number of English words.
• The Romanian Constitution, for about 13,000
  Romanian words and 13,000 English words.
English–Hindi. A collection of sentence aligned
English–Hindi parallel texts, from the Emille project
(Baker et al., 2004), consisting of approximately
60,000 English words and about 70,000 Hindi words.
The Hindi data was encoded in Unicode Devanagari
script, and used the UTF-8 encoding. The English–Hindi
data were provided by Niraj Aswani and Robert
Gaizauskas from the University of Sheffield (Aswani and
Gaizauskas, 2005b).
2.2 Trial Data
Three sets of trial data were made available at the
same time training data became available. Trial sets
consisted of sentence aligned texts, provided together
with manually determined word alignments. The
main purpose of these data was to enable participants
to better understand the format required for the word
alignment result files. For some systems, the trial data
also served as a validation data set used
for system parameter tuning. Trial sets consisted of
25 English–Inuktitut and English–Hindi aligned sen-
tences, and a larger set of 248 Romanian–English
aligned sentences (the same as the test data used in
the 2003 word alignment evaluation).
2.3 Test Data
A total of 75 English–Inuktitut, 90 English–Hindi,
and 200 Romanian–English aligned sentences were
released one week prior to the deadline. Participants
were required to run their word alignment systems on
one or more of these data sets, and submit word align-
ments. Teams were allowed to submit an unlimited
number of results sets for each language pair.
2.3.1 Gold Standard Word Aligned Data
The gold standards for the three language pairs
were produced using slightly different alignment
procedures.
For English–Inuktitut, annotators were instructed to
align Inuktitut words or phrases with English phrases.
Their goal was to identify the smallest phrases that
permit one-to-one alignments between English and
Inuktitut. These phrase alignments were converted
into word-to-word alignments in the following man-
ner. If the aligned English and Inuktitut phrases
each consisted of a single word, that word pair was
assigned a Sure alignment. Otherwise, all possi-
ble word-pairs for the aligned English and Inuktitut
phrases were assigned a Probable alignment. Dis-
agreements between the two annotators were decided
by discussion.
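The conversion from phrase alignments to word-to-word alignments described above can be sketched as follows. The function name `phrase_to_word_alignments` and the representation of a phrase as a list of 1-based token positions are assumptions made for illustration; the annotation tooling actually used is not described in this paper.

```python
from itertools import product


def phrase_to_word_alignments(english_positions, inuktitut_positions):
    """Expand one aligned phrase pair into labelled word pairs.

    A pair of single-word phrases yields one Sure (S) alignment;
    otherwise every possible word pair is marked Probable (P).
    """
    single_words = len(english_positions) == 1 and len(inuktitut_positions) == 1
    label = "S" if single_words else "P"
    return [
        (e, i, label)
        for e, i in product(english_positions, inuktitut_positions)
    ]


# One English word aligned with one Inuktitut word: a Sure alignment.
sure_pairs = phrase_to_word_alignments([3], [2])
# A two-word English phrase aligned with one Inuktitut word:
# all word pairs become Probable alignments.
probable_pairs = phrase_to_word_alignments([1, 2], [1])
```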
For Romanian–English and English–Hindi, anno-
tators were instructed to assign an alignment to all
words, with specific instructions as to when to as-
sign a NULL alignment. Annotators were not asked
to assign a Sure or Probable label. Instead, we had an
arbitration phase, where a third annotator judged the
cases where the first two annotators disagreed. Since
an inter-annotator agreement was reached for all word
alignments, the final resulting alignments were con-
sidered to be Sure alignments.
3 Evaluation Measures
Evaluations were performed with respect to four dif-
ferent measures. Three of them – precision, recall,
and F-measure – represent traditional measures in In-
formation Retrieval, and were also frequently used
in previous word alignment literature. The fourth
measure, alignment error rate (AER), was originally
introduced by Och and Ney (2000), and reflects the
notion of quality of word alignment.
Given an alignment $A$ and a gold standard alignment $G$, each
such alignment set possibly consisting of two sets, $A_S$, $A_P$
and $G_S$, $G_P$, corresponding to Sure and Probable alignments,
the following measures are defined (where $T$ is the alignment
type, and can be set to either S or P):

\[ P_T = \frac{|A_T \cap G_T|}{|A_T|} \tag{1} \]
\[ R_T = \frac{|A_T \cap G_T|}{|G_T|} \tag{2} \]
\[ F_T = \frac{2 P_T R_T}{P_T + R_T} \tag{3} \]
\[ AER = 1 - \frac{|A_P \cap G_S| + |A_P \cap G_P|}{|A_P| + |G_S|} \tag{4} \]
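The four measures defined above can be computed directly with set operations. The sketch below is illustrative and is not the evaluation software distributed with the shared task; the function name `evaluate`, the representation of an alignment as a set of (sentence, position L1, position L2) tuples, and the convention of returning 0 for a ratio with an empty denominator are all assumptions.

```python
def _ratio(numerator, denominator):
    # Assumption: a ratio with an empty denominator is reported as 0.
    return numerator / denominator if denominator else 0.0


def evaluate(a_sure, a_probable, g_sure, g_probable):
    """Return precision, recall, F-measure for T in {S, P}, plus AER."""
    a_s, g_s = set(a_sure), set(g_sure)
    # Every Sure alignment is also part of the Probable set.
    a_p = set(a_probable) | a_s
    g_p = set(g_probable) | g_s

    scores = {}
    for t, a, g in (("S", a_s, g_s), ("P", a_p, g_p)):
        precision = _ratio(len(a & g), len(a))
        recall = _ratio(len(a & g), len(g))
        scores[f"precision_{t}"] = precision
        scores[f"recall_{t}"] = recall
        scores[f"f_measure_{t}"] = _ratio(2 * precision * recall,
                                          precision + recall)
    scores["AER"] = 1.0 - _ratio(len(a_p & g_s) + len(a_p & g_p),
                                 len(a_p) + len(g_s))
    return scores


# Toy data for one sentence (id 18); the tuples are made up.
a_sure = {(18, 1, 1), (18, 2, 2)}
a_probable = {(18, 3, 3)}
g_sure = {(18, 1, 1), (18, 2, 2), (18, 3, 4)}
g_probable = {(18, 3, 3)}
scores = evaluate(a_sure, a_probable, g_sure, g_probable)
```

For this toy input, equation (4) gives AER = 1 - (2 + 3) / (3 + 3) = 1/6, since the submitted Probable set contains two of the gold Sure links and three of the gold Probable links.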
Each word alignment submission was evaluated in
terms of the above measures. Given the numerous
(constructive) debates held during the previous word
alignment evaluation, which questioned the
informativeness of the NULL alignment evaluations, we
decided to evaluate only no-NULL alignments, and thus the
NULL alignments were removed from both submissions and
gold standard data. We therefore conducted 7 evaluations
for each submission file: AER, Sure/Probable Precision,
Sure/Probable Recall, and Sure/Probable F-measure, all
of them measured on no-NULL alignments.

Team                                                    System name            Description
Carnegie Mellon University                              SPA                    (Brown et al., 2005)
Information Sciences Institute / USC                    ISI                    (Fraser and Marcu, 2005)
Johns Hopkins University                                JHU                    (Schafer and Drabek, 2005)
Microsoft Research                                      MSR                    (Moore, 2005)
Romanian Academy Institute of Artificial Intelligence   TREQ-AL, MEBA, COWAL   (Tufis et al., 2005)
University of Maryland / UMIACS                         UMIACS                 (Lopez and Resnik, 2005)
University of Sheffield                                 Sheffield              (Aswani and Gaizauskas, 2005a)
University of Montreal                                  JAPA, NUKTI            (Langlais et al., 2005)
University of Sao Paulo, University of Alicante         LIHLA                  (Caseli et al., 2005)
University Jaume I                                      MAR                    (Vilar, 2005)
Table 1: Teams participating in the word alignment shared task
4 Participating Systems
Ten teams from around the world participated in the
word alignment shared task. Table 1 lists the names
of the participating systems, the corresponding insti-
tutions, and references to papers in this volume that
provide detailed descriptions of the systems and addi-
tional analysis of their results.
Seven teams participated in the Romanian–English
subtask, four teams participated in the English–
Inuktitut subtask, and two teams participated in the
English–Hindi subtask. There were no restrictions
placed on the number of submissions each team could
make. This resulted in a total of 50 submissions
from the ten teams, where 37 sets of results were
submitted for the Romanian–English subtask, 10 for
the English–Inuktitut subtask, and 3 for the English–
Hindi subtask. Of the 50 total submissions, there were
45 in the Limited resources subtask, and 5 in the Un-
limited resources subtask. Tables 2, 4 and 6 show all
of the submissions for each team in the three subtasks,
and provide a brief description of their approaches.
Results for all participating systems, including pre-
cision, recall, F-measure, and alignment error rate are
listed in Tables 3, 5 and 7. Ranked results for all sys-
tems are plotted in Figures 2, 3 and 4. In the graphs,
systems are ordered based on their AER scores. Sys-
tem names are preceded by a marker to indicate the
system type: L stands for Limited Resources, and U
stands for Unlimited Resources.
While each participating system was unique, there
were a few unifying themes. Several teams had ap-
proaches that relied (to varying degrees) on an IBM
model of statistical machine translation (Brown et al.,
1993), with different improvements brought by dif-
ferent teams, consisting of new submodels, improve-
ments in the HMM model, model combination for
optimal alignment, etc. Several teams used symmetrization
metrics, as introduced by Och and Ney (2003)
(union, intersection, refined), most often
applied to the alignments produced for the two
directions source–target and target–source, but also as
a way to combine different word alignment systems.
Significant improvements with respect to baseline
word alignment systems were observed when the vo-
cabulary was reduced using simple stemming tech-
niques, which seems to be a particularly effective
technique given the data sparseness problems associ-
ated with the relatively small amounts of training data.
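The union and intersection symmetrization mentioned above can be sketched as simple set operations over the two directional alignments. This is a minimal illustration rather than any participating system's implementation: the function name `symmetrize` and the pair orientation of the inputs are assumptions, and the refined method of Och and Ney (2003) is deliberately left out.

```python
def symmetrize(source_to_target, target_to_source, method="intersection"):
    """Combine two directional alignments for one sentence pair.

    source_to_target: iterable of (source_pos, target_pos) pairs.
    target_to_source: iterable of (target_pos, source_pos) pairs
                      (assumed orientation; flipped before combining).
    """
    forward = set(source_to_target)
    backward = {(s, t) for (t, s) in target_to_source}
    if method == "intersection":
        # Keep only links proposed in both directions.
        return forward & backward
    if method == "union":
        # Keep links proposed in either direction.
        return forward | backward
    raise ValueError(f"unknown symmetrization method: {method!r}")


forward_links = {(1, 1), (2, 2), (3, 2)}
backward_links = {(1, 1), (2, 2), (3, 3)}
both = symmetrize(forward_links, backward_links, "intersection")
either = symmetrize(forward_links, backward_links, "union")
```

The same function can combine the outputs of two different alignment systems, provided both are first expressed in the same (source, target) orientation.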
In the unlimited resources subtask, systems made
use of bilingual dictionaries, human-contributed word
alignments, or syntactic constraints derived from a de-
pendency parse tree applied on the English side of the
corpus.
When only small amounts of parallel corpora were
available (i.e. the English–Hindi subtask), the use
of additional resources resulted in absolute improve-
ments of up to 20% as compared to the case when
the word alignment systems were based exclusively
on the parallel texts. Interestingly, this was not the
case for the language pairs that had larger training
corpora (i.e. Romanian–English, English–Inuktitut),
where the limited resources systems seemed to lead
to comparable or sometimes even better results than
those that relied on unlimited resources. This suggests
that the use of additional resources does not seem to
contribute to improvements in word alignment quality
when enough parallel corpora are available, but it
can make a big difference when only small amounts
of parallel texts are available.
Finally, in a comparison across language pairs, the
best results are obtained in the English–Inuktitut task,
followed by Romanian–English, and by English–
Hindi, which corresponds to the ordering of the sizes
of the training data sets. This is not surprising since,
like many other NLP tasks, word alignment seems to
benefit greatly from large amounts of training data, and
thus better results are obtained when larger training
data sets are available.
5 Conclusion
A shared task on word alignment was organized as
part of the ACL 2005 Workshop on Building and
Using Parallel Texts. The focus of the task was
on languages with scarce resources, with evalua-
tions of alignments for three different language pairs:
English–Inuktitut, English–Hindi, and Romanian–
English. The task drew the participation of ten teams
from around the world, with a total of 50 systems.
In this paper, we presented the task definition and the
resources involved, and briefly described the
participating systems. Comparative evaluations of results
led to insights regarding the development of word
alignment algorithms for languages with scarce re-
sources, with performance evaluations of (1) various
algorithms, (2) different amounts of training data, and
(3) different additional resources. Data and evalua-
tion software used in this exercise are available online
at http://www.cs.unt.edu/~rada/wpt05.
Acknowledgments
There are many people who contributed greatly to
making this word alignment evaluation task possible.
We are grateful to all the participants in the shared
task, for their hard work and involvement in this eval-
uation exercise. Without them, all these comparative
analyses of word alignment techniques would not be
possible. In particular, we would like to thank Dan
Tufiş and Bob Moore for their helpful comments concerning
the Romanian–English data. We would also
like to thank Benoit Farley for his valuable assistance
with the English–Inuktitut data.
We are very thankful to Niraj Aswani and Rob
Gaizauskas from the University of Sheffield for making
possible the English–Hindi word alignment evalua-
tion. They provided sentence aligned training data
from the Emille project, as well as word aligned trial
and test data sets.
We are also grateful to all the Program Committee
members for their comments and suggestions, which
helped us improve the definition of this shared task.

References
N. Aswani and R. Gaizauskas. 2005a. Aligning words in
English-Hindi parallel corpora. In (this volume).
N. Aswani and R. Gaizauskas. 2005b. A hybrid approach to align
sentences and words in English-Hindi parallel corpora. In
Proceedings of the ACL Workshop on "Building and Exploiting
Parallel Texts", Ann Arbor, MI.
P. Baker, K. Bontcheva, H. Cunningham, R. Gaizauskas,
O. Hamza, A. Hardie, B. Jayaram, M. Leisher, A. McEnery,
D. Maynard, V. Tablan, C. Ursu, and Z. Xiao. 2004. Corpus
linguistics and South Asian languages: Corpus creation and tool
development. Literary and Linguistic Computing, 19(4).
P. Brown, S. della Pietra, V. della Pietra, and R. Mercer. 1993.
The mathematics of statistical machine translation: parameter
estimation. Computational Linguistics, 19(2).
R. D. Brown, J.D. Kim, P. J. Jansen, and J. G. Carbonell. 2005.
Symmetric probabilistic alignment. In (this volume).
H. Caseli, M. G. V. Nunes, and M. L. Forcada. 2005. LIHLA:
Shared task system description. In (this volume).
T. Erjavec, N. Ide, and D. Tufis. 1997. Encoding and parallel
alignment of linguistic corpora in six central and Eastern Eu-
ropean languages. In Proceedings of the Joint ACH/ALL Con-
ference, Queen’s University, Kingston, Ontario, June.
A. Fraser and D. Marcu. 2005. ISI's participation in the
Romanian-English alignment task. In (this volume).
P. Langlais, F. Gotti, and G. Cao. 2005. NUKTI: English-Inuktitut
word alignment system description. In (this volume).
A. Lopez and P. Resnik. 2005. Improved HMM alignment models
for languages with scarce resources. In (this volume).
J. Martin, H. Johnson, B. Farley, and A. Maclachlan. 2003.
Aligning and using an English-Inuktitut parallel corpus. In
Proceedings of the HLT-NAACL Workshop on Building and
Using Parallel Texts: Data Driven Machine Translation and
Beyond, Edmonton, Canada.
R. Mihalcea and T. Pedersen. 2003. An evaluation exercise for
word alignment. In HLT-NAACL 2003 Workshop: Building
and Using Parallel Texts: Data Driven Machine Translation
and Beyond, Edmonton, Canada, May.
R. Moore. 2005. Association-based bilingual word alignment. In
(this volume).
F. Och and H. Ney. 2000. A comparison of alignment models
for statistical machine translation. In Proceedings of the 18th
International Conference on Computational Linguistics (COL-
ING 2000), Saarbrücken, Germany, August.
F.J. Och and H. Ney. 2003. A systematic comparison of vari-
ous statistical alignment models. Computational Linguistics,
29(1).
C. Schafer and E. Drabek. 2005. Models for Inuktitut-English
word alignment. In (this volume).
D. Tufis, R. Ion, A. Ceausu, and D. Stefanescu. 2005. Combined
word alignments. In (this volume).
J.M. Vilar. 2005. Experiments using MAR for aligning corpora. In
(this volume).
