An Evaluation Exercise for Word Alignment
Rada Mihalcea
Department of Computer Science
University of North Texas
Denton, TX 76203
rada@cs.unt.edu
Ted Pedersen
Department of Computer Science
University of Minnesota
Duluth, MN 55812
tpederse@umn.edu
Abstract
This paper presents the task definition, re-
sources, participating systems, and compara-
tive results for the shared task on word align-
ment, which was organized as part of the
HLT/NAACL 2003 Workshop on Building and
Using Parallel Texts. The shared task in-
cluded Romanian-English and English-French
sub-tasks, and drew the participation of seven
teams from around the world.
1 Defining a Word Alignment Shared Task
The task of word alignment consists of finding correspon-
dences between words and phrases in parallel texts. As-
suming a sentence aligned bilingual corpus in languages
L1 and L2, the task of a word alignment system is to indi-
cate which word token in the corpus of language L1 cor-
responds to which word token in the corpus of language
L2.
As part of the HLT/NAACL 2003 workshop on ”Build-
ing and Using Parallel Texts: Data Driven Machine
Translation and Beyond”, we organized a shared task on
word alignment, where participating teams were provided
with training and test data, consisting of sentence aligned
parallel texts, and were asked to provide automatically
derived word alignments for all the words in the test set.
Data for two language pairs were provided: (1) English-
French, representing languages with rich resources (20
million word parallel texts), and (2) Romanian-English,
representing languages with scarce resources (1 million
word parallel texts). Similar with the Machine Transla-
tion evaluation exercise organized by NIST
1
, two sub-
tasks were defined, with teams being encouraged to par-
ticipate in both subtasks.
1
http://www.nist.gov/speech/tests/mt/
1. Limited resources, where systems are allowed to use
only the resources provided.
2. Unlimited resources, where systems are allowed to
use any resources in addition to those provided.
Such resources had to be explicitly mentioned in the
system description.
Test data were released one week prior to the deadline
for result submissions. Participating teams were asked
to produce word alignments, following a common format
as specified below, and submit their output by a certain
deadline. Results were returned to each team within three
days of submission.
1.1 Word Alignment Output Format
The word alignment result files had to include one line
for each word-to-word alignment. Additionally, lines in
the result files had to follow the format specified in Fig.1.
While the CBCYC8 and confidence fields overlap in their
meaning, the intent of having both fields available is to
enable participating teams to draw their own line on what
they consider to be a Sure or Probable alignment. Both
these fields were optional, with some standard values as-
signed by default.
1.1.1 A Running Word Alignment Example
Consider the following two aligned sentences:
[English] BOs snum=18BQ They had gone . BO/sBQ
[French] BOs snum=18BQ Ils etaient alles . BO/sBQ
A correct word alignment for this sentence is
18 1 1
18 2 2
18 3 3
18 4 4
stating that: all the word alignments pertain to sentence
18, the English token 1 They aligns with the French to-
ken 1 Ils, the English token 2 had, aligns with the French
token 2 etaient, and so on. Note that punctuation is also
sentence no position L1 position L2 [CBCYC8] [confidence]
where:
Æ sentence no represents the id of the sentence within the
test file. Sentences in the test data already have an id as-
signed. (see the examples below)
Æ position L1 represents the position of the token that is
aligned from the text in language L1; the first token in each
sentence is token 1. (not 0)
Æ position L2 represents the position of the token that is
aligned from the text in language L2; again, the first token
is token 1.
Æ CBCYC8 can be either S or P, representing a Sure or Probable
alignment. All alignments that are tagged as S are also con-
sidered to be part of the P alignments set (that is, all align-
ments that are considered ”Sure” alignments are also part of
the ”Probable” alignments set). If the CBCYC8 field is missing, a
value of S will be assumed by default.
Æ confidence is a real number, in the range (0-1] (1 meaning
highly confident, 0 meaning not confident); this field is op-
tional, and by default confidence number of 1 was assumed.
Figure 1: Word Alignment file format
aligned (English token 4 aligned with French token 4),
and counts towards the final evaluation figures.
Alternatively, systems could also provide an CBCYC8
marker and/or a confidence score, as shown in the fol-
lowing example:
18 1 1 1
18 2 2 P 0.7
18 3 3 S
1844S1
with missing CBCYC8 fields considered by default to be S,
and missing confidence scores considered by default 1.
1.2 Annotation Guide for Word Alignments
The annotation guide and illustrative word alignment ex-
amples were mostly drawn from the Blinker Annotation
Project. Please refer to (Melamed, 1999, pp. 169–182)
for additional details.
1. All items separated by a white space are considered
to be a word (or token), and therefore have to be
aligned. (punctuation included)
2. Omissions in translation use the NULL token, i.e.
token with id 0. For instance, in the examples below:
[English]: BOs snum=18BQ And he said , appoint me
thy wages , and I will give it . BO/sBQ
[French]: BOs snum=18BQ fixe moi ton salaire , et je
te le donnerai . BO/sBQ
and he said from the English sentence has no cor-
responding translation in French, and therefore all
these words are aligned with the token id 0.
...
18 1 0
18 2 0
18 3 0
18 4 0
...
3. Phrasal correspondences produce multiple word-to-
word alignments. For instance, in the examples be-
low:
English: BOs snum=18BQ cultiver la terre BO/sBQ
French: BOs snum=18BQ to be a husbandman BO/sBQ
since the words do not correspond one to one, and
yet the two phrases mean the same thing in the given
context, the phrases should be linked as wholes, by
linking each word in one to each word in another.
For the example above, this translates into 12 word-
to-word alignments:
18 1 1 18 1 2
18 1 3 18 1 4
18 2 1 18 2 2
18 2 3 18 2 4
18 3 1 18 3 2
18 3 3 18 3 4
2 Resources
The shared task included two different language pairs:
the alignment of words in English-French parallel texts,
and in Romanian-English parallel texts. For each lan-
guage pair, training data were provided to participants.
Systems relying only on these resources were considered
part of the Limited Resources subtask. Systems making
use of any additional resources (e.g. bilingual dictionar-
ies, additional parallel corpora, and others) were classi-
fied under the Unlimited Resources category.
2.1 Training Data
Two sets of training data were made available.
1. A set of Romanian-English parallel texts, consist-
ing of about 1 million Romanian words, and about
the same number of English words. These data con-
sisted of:
AF Parallel texts collected from the Web using a
semi-supervised approach. The URLs format
for pages containing potential parallel transla-
tions were manually identified (mainly from
the archives of Romanian newspapers). Next,
texts were automatically downloaded and sen-
tence aligned. A manual verification of the
alignment was also performed. These data col-
lection process resulted in a corpus of about
850,000 Romanian words, and about 900,000
English words.
AF Orwell’s 1984, aligned within the MULTEXT-
EAST project (Erjavec et al., 1997), with about
130,000 Romanian words, and a similar num-
ber of English words.
AF The Romanian Constitution, for about 13,000
Romanian words and 13,000 English words.
2. A set of English-French parallel texts, consisting of
about 20 million English words, and about the same
number of French words. This is a subset of the
Canadian Hansards, processed and sentence aligned
by Ulrich Germann at ISI (Germann, 2001).
All data were pre-tokenized. For English and French,
we used a version of the tokenizers provided within the
EGYPT Toolkit
2
. For Romanian, we used our own tok-
enizer. Identical tokenization procedures were used for
training, trial, and test data.
2.2 Trial Data
Two sets of trial data were made available at the same
time training data became available. Trial sets consisted
of sentence aligned texts, provided together with man-
ually determined word alignments. The main purpose
of these data was to enable participants to better under-
stand the format required for the word alignment result
files. Trial sets consisted of 37 English-French, and 17
Romanian-English aligned sentences.
2.3 Test Data
A total of 447 English-French aligned sentences (Och
and Ney, 2000), and 248 Romanian-English aligned sen-
tences were released one week prior to the deadline. Par-
ticipants were required to run their word alignment sys-
tems on these two sets, and submit word alignments.
Teams were allowed to submit an unlimited number of
results sets for each language pair.
2.3.1 Gold Standard Word Aligned Data
The gold standard for the two language pair alignments
were produced using slightly different alignment proce-
dures, which allowed us to study different schemes for
producing gold standards for word aligned data.
For English-French, annotators where instructed to as-
sign a Sure or Probable tag to each word alignment they
produced. The intersection of the Sure alignments pro-
duced by the two annotators led to the final Sure aligned
set, while the reunion of the Probable alignments led to
the final Probable aligned set. The Sure alignment set is
2
http://www.clsp.jhu.edu/ws99/projects/mt/toolkit/
guaranteed to be a subset of the Probable alignment set.
The annotators did not produce any NULL alignments.
Instead, we assigned NULL alignments as a default back-
up mechanism, which forced each word to belong to at
least one alignment. The English-French aligned data
were produced by Franz Och and Hermann Ney (Och and
Ney, 2000).
For Romanian-English, annotators were instructed to
assign an alignment to all words, with specific instruc-
tions as to when to assign a NULL alignment. Annota-
tors were not asked to assign a Sure or Probable label.
Instead, we had an arbitration phase, where a third anno-
tator judged the cases where the first two annotators dis-
agreed. Since an inter-annotator agreement was reached
for all word alignments, the final resulting alignments
were considered to be Sure alignments.
3 Evaluation Measures
Evaluations were performed with respect to four differ-
ent measures. Three of them – precision, recall, and F-
measure – represent traditional measures in Information
Retrieval, and were also frequently used in previous word
alignment literature. The fourth measure was originally
introduced by (Och and Ney, 2000), and proposes the no-
tion of quality of word alignment.
Given an alignment BT, and a gold standard alignment
BZ, each such alignment set eventually consisting of two
sets BT
CB
, BT
C8
, and BZ
CB
, BZ
C8
corresponding to Sure and
Probable alignments, the following measures are defined
(where CC is the alignment type, and can be set to either S
or P).
C8
CC
BP
CYBT
CC
CKBZ
CC
CY
CYBT
CC
CY
(1)
CA
CC
BP
CYBT
CC
CKBZ
CC
CY
CYBZ
CC
CY
(2)
BY
CC
BP
BEC8
CC
CA
CC
C8
CC
B7CA
CC
(3)
BTBXCA BP BDA0
CYBT
C8
CKBZ
CB
CYB7CYBT
C8
CKBZ
C8
CY
CYBT
C8
CYB7CYBZ
CB
CY
(4)
Each word alignment submission was evaluated in
terms of the above measures. Moreover, we conducted
two sets of evaluations for each submission:
AF NULL-Align, where each word was enforced to be-
long to at least one alignment; if a word did not be-
long to any alignment, a NULL Probable alignment
was assigned by default. This set of evaluations per-
tains to full coverage word alignments.
AF NO-NULL-Align, where all NULL alignments were
removed from both submission file and gold stan-
dard data.
Team System name Description
Language Technologies Institute, CMU BiBr (Zhao and Vogel, 2003)
MITRE Corporation Fourday (Henderson, 2003)
RALI - Universit´e the Montr´eal Ralign (Simard and Langlais, 2003)
Romanian Academy Institute of Artificial Intelligence RACAI (Tufis¸ et al., 2003)
University of Alberta ProAlign (Lin and Cherry, 2003)
University of Minnesota, Duluth UMD (Thomson McInnes and Pedersen, 2003)
Xerox Research Centre Europe XRCE (Dejean et al., 2003)
Table 1: Teams participating in the word alignment shared task
We conducted therefore 14 evaluations for each
submission file: AER, Sure/Probable Precision,
Sure/Probable Recall, and Sure/Probable F-measure,
with a different figure determined for NULL-Align and
NO-NULL-Align alignments.
4 Participating Systems
Seven teams from around the world participated in the
word alignment shared task. Table 1 lists the names of
the participating systems, the corresponding institutions,
and references to papers in this volume that provide de-
tailed descriptions of the systems and additional analysis
of their results.
All seven teams participated in the Romanian-English
subtask, and five teams participated in the English-French
subtask.
3
There were no restrictions placed on the num-
ber of submissions each team could make. This resulted
in a total of 27 submissions from the seven teams, where
14 sets of results were submitted for the English-French
subtask, and 13 for the Romanian-English subtask. Of
the 27 total submissions, there were 17 in the Limited re-
sources subtask, and 10 in the Unlimited resources sub-
task. Tables 2 and 3 show all of the submissions for each
team in the two subtasks, and provide a brief description
of their approaches.
While each participating system was unique, there
were a few unifying themes.
Four teams had approaches that relied (to varying de-
grees) on an IBM model of statistical machine translation
(Brown et al., 1993). UMD was a straightforward imple-
mentation of IBM Model 2, BiBr employed a boosting
procedure in deriving an IBM Model 1 lexicon, Ralign
used IBM Model 2 as a foundation for their recursive
splitting procedure, and XRCE used IBM Model 4 as a
base for alignment with lemmatized text and bilingual
lexicons.
Two teams made use of syntactic structure in the text
to be aligned. ProAlign satisfies constraints derived from
a dependency tree parse of the English sentence being
3
The two teams that did not participate in English-French
were Fourday and RACAI.
aligned. BiBr also employs syntactic constraints that
must be satisfied. However, these come from parallel text
that has been shallowly parsed via a method known as
bilingual bracketing.
Three teams approached the shared task with baseline
or prototype systems. Fourday combines several intuitive
baselines via a nearest neighbor classifier, RACAI car-
ries out a greedy alignment based on an automatically
extracted dictionary of translations, and UMD’s imple-
mentation of IBM Model 2 provides an experimental plat-
form for their future work incorporating prior knowledge
about cognates. All three of these systems were devel-
oped within a short period of time before and during the
shared task.
5 Results and Discussion
Tables 4 and 5 list the results obtained by participating
systems in the Romanian-English task. Similarly, results
obtained during the English-French task are listed in Ta-
bles 6 and 7.
For Romanian-English, limited resources, XRCE sys-
tems (XRCE.Nolem-56k.RE.2 and XRCE.Trilex.RE.3)
seem to lead to the best results. These are systems that
are based on GIZA++, with or without additional re-
sources (lemmatizers and lexicons). For unlimited re-
sources, ProAlign.RE.1 has the best performance.
For English-French, Ralign.EF.1 has the best perfor-
mance for limited resources, while ProAlign.EF.1 has
again the largest number of top ranked figures for unlim-
ited resources.
To make a cross-language comparison, we paid partic-
ular attention to the evaluation of the Sure alignments,
since these were collected in a similar fashion (an agree-
ment had to be achieved between two different anno-
tators). The results obtained for the English-French
Sure alignments are significantly higher (80.54% best F-
measure) than those for Romanian-English Sure align-
ments (71.14% best F-measure). Similarly, AER for
English-French (5.71% highest error reduction) is clearly
better than the AER for Romanian-English (28.86% high-
est error reduction).
This difference in performance between the two data
sets is not a surprise. As expected, word alignment, like
many other NLP tasks (Banko and Brill, 2001), highly
benefits from large amounts of training data. Increased
performance is therefore expected when larger training
data sets are available.
The only evaluation set where Romanian-English data
leads to better performance is the Probable alignments
set. We believe however that these figures are not di-
rectly comparable, since the English-French Probable
alignments were obtained as a reunion of the align-
ments assigned by two different annotators, while for
the Romanian-English Probable set two annotators had
to reach an agreement (that is, an intersection of their in-
dividual alignment assignments).
Interestingly, in an overall evaluation, the limited re-
sources systems seem to lead to better results than those
with unlimited resources. Out of 28 different evaluation
figures, 20 top ranked figures are provided by systems
with limited resources. This suggests that perhaps using
a large number of additional resources does not seem to
improve a lot over the case when only parallel texts are
employed.
Ranked results for all systems are plotted in Figures 2
and 3. In the graphs, systems are ordered based on their
AER scores. System names are preceded by a marker to
indicate the system type: L stands for Limited Resources,
and U stands for Unlimited Resources.
6 Conclusion
A shared task on word alignment was organized as part
of the HLT/NAACL 2003 Workshop on Building and
Using Parallel Texts. In this paper, we presented the
task definition, and resources involved, and shortly de-
scribed the participating systems. The shared task in-
cluded Romanian-English and English-French sub-tasks,
and drew the participation of seven teams from around the
world. Comparative evaluations of results led to interest-
ing insights regarding the impact on performance of (1)
various alignment algorithms, (2) large or small amounts
of training data, and (3) type of resources available. Data
and evaluation software used in this exercise are available
online at http://www.cs.unt.edu/˜rada/wpt.
Acknowledgments
There are many people who contributed greatly to mak-
ing this word alignment evaluation task possible. We are
grateful to all the participants in the shared task, for their
hard work and involvement in this evaluation exercise.
Without them, all these comparative analyses of word
alignment techniques would not be possible.
We are very thankful to Franz Och from ISI and Her-
mann Ney from RWTH Aachen for kindly making their
English-French word aligned data available to the work-
shop participants; the Hansards made available by Ul-
rich Germann from ISI constituted invaluable data for
the English-French shared task. We would also like to
thank the student volunteers from the Department of En-
glish, Babes-Bolyai University, Cluj-Napoca, Romania
who helped creating the Romanian-English word aligned
data.
We are also grateful to all the Program Committee
members of the current workshop, for their comments
and suggestions, which helped us improve the definition
of this shared task. In particular, we would like to thank
Dan Melamed for suggesting the two different subtasks
(limited and unlimited resources), and Michael Carl and
Phil Resnik for initiating interesting discussions regard-
ing phrase-based evaluations.

References
M. Banko and E. Brill. 2001. Scaling to very very large
corpora for natural language disambiguation. In Pro-
ceedings of the 39th Annual Meeting of the Association
for Computational Lingusitics (ACL-2001), Toulouse,
France, July.
P. Brown, S. Della Pietra, V. Della Pietra, and R. Mercer.
1993. The mathematics of statistical machine trans-
lation: Parameter estimation. Computational Linguis-
tics, 19(2):263–311.
Herve Dejean, Eric Gaussier, Cyril Goutte, and Kenji
Yamada. 2003. Reducing parameter space for word
alignment. In HLT-NAACL 2003 Workshop: Building
and Using Parallel Texts: Data Driven Machine Trans-
lation and Beyond, pages 23–26, Edmonton, Alberta,
Canada, May 31. Association for Computational Lin-
guistics.
T. Erjavec, N. Ide, and D. Tufis¸. 1997. Encoding and
parallel alignment of linguistic corpora in six central
and Eastern European languages. In Proceedings of
the Joint ACH/ALL Conference, Queen’s University,
Kingston, Ontario, June.
U. Germann. 2001. Aligned hansards of the 36th
parliament of canada. http://www.isi.edu/natural-
language/download/hansard/.
John C. Henderson. 2003. Word alignment baselines. In
HLT-NAACL 2003 Workshop: Building and Using Par-
allel Texts: Data Driven Machine Translation and Be-
yond, pages 27–30, Edmonton, Alberta, Canada, May
31. Association for Computational Linguistics.
Dekang Lin and Colin Cherry. 2003. Proalign: Shared
task system description. In HLT-NAACL 2003 Work-
shop: Building and Using Parallel Texts: Data Driven
Machine Translation and Beyond, pages 11–14, Ed-
monton, Alberta, Canada, May 31. Association for
Computational Linguistics.
D.I. Melamed. 1999. Empirical Methods for Exploiting
Parallel Texts. MIT Press.
F. Och and H. Ney. 2000. A comparison of alignment
models for statistical machine translation. In Proceed-
ings of the 18th International Conference on Computa-
tional Linguistics (COLING-ACL 2000), Saarbrucken,
Germany, August.
Michel Simard and Philippe Langlais. 2003. Statisti-
cal translation alignment with compositionality con-
straints. In HLT-NAACL 2003 Workshop: Building
and Using Parallel Texts: Data Driven Machine Trans-
lation and Beyond, pages 19–22, Edmonton, Alberta,
Canada, May 31. Association for Computational Lin-
guistics.
Bridget Thomson McInnes and Ted Pedersen. 2003. The
duluth word alignment system. In HLT-NAACL 2003
Workshop: Building and Using Parallel Texts: Data
Driven Machine Translation and Beyond, pages 40–
43, Edmonton, Alberta, Canada, May 31. Association
for Computational Linguistics.
Dan Tufis¸, Ana-Maria Barbu, and Radu Ion. 2003. Treq-
al: A word alignment system with limited language
resources. In HLT-NAACL 2003 Workshop: Building
and Using Parallel Texts: Data Driven Machine Trans-
lation and Beyond, pages 36–39, Edmonton, Alberta,
Canada, May 31. Association for Computational Lin-
guistics.
D Tufis¸. 2002. A cheap and fast way to build useful
translation lexicons. In Proceedings of the 19th In-
ternational Conference on Computational Linguistics,
pages 1030–1036, Taipei, August.
Bing Zhao and Stephan Vogel. 2003. Word alignment
based on bilingual bracketing. In HLT-NAACL 2003
Workshop: Building and Using Parallel Texts: Data
Driven Machine Translation and Beyond, pages 15–
18, Edmonton, Alberta, Canada, May 31. Association
for Computational Linguistics.
