Proceedings of the ACL Workshop on Building and Using Parallel Texts, pages 119–124,
Ann Arbor, June 2005. c©Association for Computational Linguistics, 2005
Shared Task: Statistical Machine Translation between European Languages
Philipp Koehn
School of Informatics
University of Edinburgh
Edinburgh, EH8 9LW, UK
pkoehn@inf.ed.ac.uk
Christof Monz
UMIACS
University of Maryland
College Park, MD 20742, USA
christof@umiacs.umd.edu
Abstract
The ACL-2005 Workshop on Paral-
lel Texts hosted a shared task on
building statistical machine translation
systems for four European language
pairs: French–English, German–English,
Spanish–English, and Finnish–English.
Eleven groups participated in the event.
This paper describes the goals, the task
definition and resources, as well as results
and some analysis.
Statistical machine translation is currently the
dominant paradigm in machine translation research.
Annual competitions are held for Chinese–English
and Arabic–English by NIST (sponsored by the US
military funding agency DARPA), which creates a
forum to present and compare novel ideas and leads
to steady progress in the field.
One of the advantages of statistical machine trans-
lation is that the currently applied methods are fairly
language-independent. Building a new machine
translation system for a new language pair is not
much more than a matter of running a training pro-
cess on a training corpus of parallel text (a text in
one language paired with a translation in another).
It is therefore possible to hold a competition
where research groups have only a few weeks to
build machine translation systems for language pairs
that they have not previously worked on. We effec-
tively demonstrated this with our shared task. For in-
stance, seven teams built Finnish–English machine
translation systems, a language pair that was cer-
tainly not of their immediate concern before.
In contrast to the bigger NIST competition, we
wanted to keep the barrier of entry as low as possi-
ble. We provided not only training data from the Eu-
roparl corpus (Koehn, 2005), but also additional re-
sources: sentence and word alignments, the decoder
Pharaoh1 (Koehn, 2004b), and a language model,
so that participation was feasible even as a graduate
level class project.
Using about 15 million words of translated text,
participants were asked to build a phrase-based sta-
tistical machine translation system. The focus of
the task was to build a probabilistic phrase transla-
tion table, since most of the other resources were
provided — for more on phrase-based statistical
machine translation, refer to Koehn et al. (2003).
The participants’ systems were compared by how
well they translated 2000 previously unseen test sen-
tences from the same domain.
The shared task operated within an extremely
short timeframe. The workshop and hence the
shared task was accepted on February 22, 2005 and
announced on March 3. The official test data was
made available on April 3, results were due one
week later. Despite this tight schedule, eleven re-
search groups participated and built a total of 32 ma-
chine translation systems for the four language pairs.
1 Goals
When setting up this competition, we were moti-
vated by a number of goals. We set out to:
Create a platform to demonstrate the effective-
ness of novel ideas: The research community is
easily balkanized, where different groups work on
1http://www.isi.edu/licensed-sw/pharaoh/
119
different data sets and under different conditions,
so that it becomes often hard to assess, how effec-
tive a novel method is. By creating an environment
with common test and training sets, language model,
preprocessing, and even decoder, the effect of other
model choices can be more easily demonstrated.
Work on new language pairs, new problems:
Different language pairs pose different challenges.
We picked Finnish–English and German–English
for the special problems of rich morphology, word
order, which are a challenge to current phrase-based
SMT methods.
Enable more researchers to get engaged in
SMT research: One of our main goals with provid-
ing as many resources as possible was to keep the
barrier of entry low. Participants could use the word
alignment and other resources and focus on phrase
extraction. We hoped to attract researchers that are
relatively new to the field. We were satisfied to learn
that many entries are by graduate students working
on their own.
Promote and create free resources: Academic
research thrives on freely available resources. The
field of statistical machine translation has been
blessed with a long tradition of freely available soft-
ware tools — such as GIZA++ (Och and Ney, 2003)
— and parallel corpora — such as the Canadian
Hansards2. Following this lead, we made word
alignments and a language model available for this
competition in addition to our previously published
resources (Europarl and Pharaoh). The competition
created resources as well. Most teams agreed to
share system output and their model files. You can
download them from the competition web site3.
Promote work on European language pairs:
Finally, we wanted to promote work on European
languages. The increasing economic and political
ties within the European Union create a huge need
for translation services. We would like to see re-
searchers rise to the challenge of creating high qual-
ity machine translation systems to fill these needs.
We are very grateful for the strong participation,
especially by researchers who are relatively new to
the field.
2http://www.isi.edu/natural-language/download/hansard/
3http://www.statmt.org/wpt05/mt-shared-task/
2 Rules of Engagement
We set up a machine translation competition for
four language pairs. We chose Spanish–English and
French–English, because many researchers would
be familiar with these languages. We chose
German–English for its special problems with word
order (such as nested constructions and split verb
groups) and morphology. Finally, we picked
Finnish–English for the rich agglutinative morphol-
ogy of Finnish.
Statistical machine translation systems are typi-
cally trained on sentence-aligned parallel corpora.
We selected Europarl4, a freely available parallel
corpus in eleven languages. In addition, we also
made a word alignment available, which was de-
rived using a variant of the current default method
for word alignment – Och and Ney (2003)’s refined
method.
Figure 1 details some properties of the parallel
corpora. The training corpus is most of the Europarl
corpus, only the text of sessions from last quarter of
the year 2000 was reserved for testing. The corpus
has the size of roughly 15 million English words in
700,000 sentences – these numbers differ for each of
the four parallel corpora due to the different number
of discarded sentences during sentence alignment
and after enforcing a 40 word length limit for sen-
tences.
The number of foreign words differs even more
dramatically. The effect of Finnish morphology
manifests itself in a low number of words (just over
11 million), but a high number of distinct words
(more than 5 times as many as in the English half).
The test corpus consists of 2000 sentences aligned
across all five languages. Note that the output of
each system is compared against the same English
references for all source languages. The number of
total words, distinct words, and words not seen in the
training data reflects again the morphology effect.
For researchers willing to create their own word
alignment, we suggested the use of GIZA++5, an
implementation of the IBM word-based machine
translation models, which also assisted the creation
of the provided word alignments.
We trained a language model on the English part
4http://www.statmt.org/europarl/
5http://www.fjoch.com/GIZA++.html
120
Spanish–English French–English Finnish–English German–English
Training corpus
Sentences 730,740 688,031 716,960 751,088
Source words 15,676,710 15,323,737 11,318,287 15,256,793
English words 15,222,105 13,808,104 15,492,903 16,052,269
Distinct source words 102,886 80,349 358,345 195,291
Distinct English words 64,123 61,627 64,662 65,889
Test corpus
Sentences 2,000
Source words 60,276 65,029 41,431 54,247
English words 57,945
Distinct source words 7,782 7,285 11,996 8,666
Distinct English words 6,054
Unseen source words 209 143 737 377
Figure 1: Properties of the Europarl training and test corpora used in the shared task
of the Europarl corpus using the SRI language mod-
eling toolkit (Stolke, 2002). Finally, we suggested
the use of Pharaoh (Koehn, 2004b), a phrase-based
machine translation decoder.
How well does this setup match the state of the
art? The MIT system using the Pharaoh decoder
(Koehn, 2004a) proved to be very competitive in
last year’s NIST evaluation. However, the field is
moving fast, and a number of steps help to improve
upon the provided baseline setup, e.g., larger lan-
guage models trained on general text (up to a bil-
lion words have been used), better reodering mod-
els (e.g., suggested by Tillman (2004) and Och
et al. (2004)), better language-specific preprocessing
(Koehn and Knight, 2003) and restructuring (Collins
et al., 2005), additional feature functions such as
word class language models, and minimum error
rate training (Och, 2003) to optimize parameters.
Some of these steps (e.g., improved reorder-
ing models) go beyond the current capabilities of
Pharaoh. However, we are hopeful that freely avail-
able software continues to match or at least follow
closely the state of the art.
We announced the shared task on March 3, and
provided all the resources mentioned above (also a
development test corpus to track the quality of sys-
tems being developed). The test schedule called for
the translation of 2000 sentence for each of the four
language pairs in the week between April 3–10. We
allowed late submissions up to April 17.
3 Results
Eleven teams from eight institutions in Europe and
North America participated, see Figure 2 for a com-
plete list. The figure also indicates, if a team used
the Pharaoh decoder (eight teams), the provided lan-
guage model (seven teams) and the provided word
alignment (four did, three of those with additional
preprocessing or additional data).
Translation performance was measured using the
BLEU score (Papineni et al., 2002), which measures
n-gram overlap with a reference translation. In our
case, we only used a single reference translation,
since the test set was taken from a held-out portion
of the Europarl corpus. On the other hand we used a
relatively large number of test sentences to guaran-
tee that the BLEU results are stable despite the fact
that we used only one reference translation for each
sentence.
Shared tasks like this one, of course, bring out the
competitive spirit of participants and can draw criti-
cisms about being a horse race. From an outside per-
spective, however, it is far more interesting to learn
which methods and ideas proved to be successful,
than who won the competition.
Taking stock of the results — see Figure 3 — one
observes a very packed field at the top. While the
participants from the University of Washington pro-
duced the best translations for every single language
pair, the distance to many other participant scores
121
ID Team Pharaoh LM Word Al.
cmu-b Carnegie Mellon University, USA - Bing Zhao yes yes no
cmu-j Carnegie Mellon University, USA - Ying (Joy) Zhang yes yes no
glasgow University of Glasgow, UK yes yes yes+
nrc National Research Council, Canada no no no
rali University of Montreal / RALI, Canada yes yes no
saar Saarland University, Germany yes yes yes
uji University Jaume I, Spain yes yes yes+
upc-j Polytechnic University of Catalonia, Spain - Jesus Gimenez yes yes no
upc-m Polytechnic University of Catalonia, Spain - Marta Ruiz no no no
upc-r Polytechnic University of Catalonia, Spain - Rafael Banchs no no no
uw University of Washington, USA yes no yes+
Figure 2: The eleven participating teams: the table also lists, if the Pharaoh decoder, the provided language
model, and the provided word alignment was used (yes+ indicates additional preprocessing)
is within a BLEU percentage point or two. As one
might have expected, the scores are best for Spanish
and French, and worst for Finnish. Figure 4 shows
some typical output of the submitted systems.
The proceedings to the workshop include detailed
system descriptions of all participants. Novel phrase
extraction approaches were proposed, along with
better preprocessing, language modeling, rescoring,
and other ideas. We are certain that better perfor-
mance can be achieved by combining some of the
methods used by different participants.
And hence, we would like to pose the challenge to
the research community to build and test better sys-
tems using the provided resources. We will gladly
list additional results on the competition web site.
4 Survey
Following the end of the competition, we sent out a
questionnaire to the participants. One of the ques-
tions what they would like to see different in a po-
tential future competition.
We listed four potential changes: 70% of the re-
spondends checked translation from English, 50%
checked out of domain test data, 40% checked more
language pairs, 0% checked fewer language pairs.
Additional suggestions were: alternatives to the
BLEU scoring method (maybe human judgment by
participants themselves), transitive translation using
pivot languages, translation of resource-poor lan-
guages, and more time to prepare for the task.
5 Outlook
Given the short timeframe, one should view the sys-
tem performances (albeit very competitive with the
state of the art) as a baseline effort on the task of
open domain text translation between European lan-
guages.
We hope that future researchers will use the pro-
vided environment as a test bed for their machine
translation systems. We will continue to publish any
scores reported to us.
Since we placed much of the systems’ output on-
line, the interested reader may be inspired to more
closely explore the quality and shortcomings. Even
some of the model files have been made available,
so it is even possible to download and install some
of the systems.

References
Collins, M., Koehn, P., and Kucerova, I. (2005).
Clause restructuring for statistical machine trans-
lation. In Proceedings of the 43rd Meeting
of the Association for Computational Linguistics
(ACL’05), Main Volume, pages 531–540, Ann Ar-
bor, Michigan.
Koehn, P. (2004a). The foundation for statistical ma-
chine translation at MIT. In Proceedings of Ma-
chine Translation Evaluation Workshop 2004.
Koehn, P. (2004b). Pharaoh: a beam search decoder
for statistical machine translation. In 6th Confer-
ence of the Association for Machine Translation
in the Americas, AMTA, Lecture Notes in Com-
puter Science. Springer.
Koehn, P. (2005). Europarl: A parallel corpus for
statistical machine translation. In MT Summit X
(submitted).
Koehn, P. and Knight, K. (2003). Empirical methods
for compound splitting. In Proceedings of Meet-
ing of the European Chapter of the Association of
Computational Linguistics (EACL).
Koehn, P., Och, F. J., and Marcu, D. (2003). Statisti-
cal phrase based translation. In Proceedings of the
Joint Conference on Human Language Technolo-
gies and the Annual Meeting of the North Ameri-
can Chapter of the Association of Computational
Linguistics (HLT-NAACL).
Och, F. J. (2003). Minimum error rate training for
statistical machine translation. In Proceedings
of the 41st Annual Meeting of the Association of
Computational Linguistics (ACL).
Och, F. J., Gildea, D., Khudanpur, S., Sarkar, A.,
Yamada, K., Fraser, A., Kumar, S., Shen, L.,
Smith, D., Eng, K., Jain, V., Jin, Z., and Radev,
D. (2004). A smorgasbord of features for statis-
tical machine translation. In Proceedings of the
Joint Conference on Human Language Technolo-
gies and the Annual Meeting of the North Ameri-
can Chapter of the Association of Computational
Linguistics (HLT-NAACL).
Och, F. J. and Ney, H. (2003). A systematic compar-
ison of various statistical alignment models. Com-
putational Linguistics, 29(1):19–52.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J.
(2002). BLEU: a method for automatic evalua-
tion of machine translation. In Proceedings of the
40th Annual Meeting of the Association of Com-
putational Linguistics (ACL).
Stolke, A. (2002). SRILM - an extensible language
modeling toolkit. In Proceedings of the Interna-
tional Conference on Spoken Language Process-
ing.
Tillman, C. (2004). A unigram orientation model
for statistical machine translation. In Proceed-
ings of the Joint Conference on Human Language
Technologies and the Annual Meeting of the North
American Chapter of the Association of Computa-
tional Linguistics (HLT-NAACL).
