Proceedings of the Workshop on Statistical Machine Translation, pages 102–121,
New York City, June 2006. ©2006 Association for Computational Linguistics
Manual and Automatic Evaluation of Machine Translation
between European Languages
Philipp Koehn
School of Informatics
University of Edinburgh
pkoehn@inf.ed.ac.uk
Christof Monz
Department of Computer Science
Queen Mary, University of London
christof@dcs.qmul.ac.uk
Abstract
We evaluated machine translation perfor-
mance for six European language pairs
that participated in a shared task: translat-
ing French, German, Spanish texts to En-
glish and back. Evaluation was done auto-
matically using the BLEU score and man-
ually on fluency and adequacy.
For the 2006 NAACL/HLT Workshop on Ma-
chine Translation, we organized a shared task to
evaluate machine translation performance. 14 teams
from 11 institutions participated, ranging from
commercial companies and industrial research labs
to individual graduate students.
The motivation for such a competition is to estab-
lish baseline performance numbers for defined train-
ing scenarios and test sets. We assembled various
forms of data and resources: a baseline MT system,
language models, prepared training and test sets,
resulting in actual machine translation output from
several state-of-the-art systems and manual evalua-
tions. All this is available at the workshop website1.
The shared task is a follow-up to the one we orga-
nized in the previous year, at a similar venue (Koehn
and Monz, 2005). As then, we concentrated on the
translation of European languages and the use of the
Europarl corpus for training. Again, most systems
that participated could be categorized as statistical
phrase-based systems. While there are now a number
of competitions — DARPA/NIST (Li, 2005),
IWSLT (Eck and Hori, 2005), TC-Star — this one
focuses on text translation between various Euro-
pean languages.
This year’s shared task changed in some aspects
from last year’s:
• We carried out a manual evaluation in addition
to the automatic scoring. Manual evaluation
was done by the participants. This revealed
interesting clues about the properties of automatic
and manual scoring.
1http://www.statmt.org/wmt06/
• We evaluated translation from English, in ad-
dition to into English. English was again
paired with German, French, and Spanish.
We dropped, however, one of the languages,
Finnish, partly to keep the number of tracks
manageable, partly because we assumed that it
would be hard to find enough Finnish speakers
for the manual evaluation.
• We included an out-of-domain test set. This al-
lows us to compare machine translation perfor-
mance in-domain and out-of-domain.
1 Evaluation Framework
The evaluation framework for the shared task is sim-
ilar to the one used in last year’s shared task. Train-
ing and testing is based on the Europarl corpus. Fig-
ure 1 provides some statistics about this corpus.
1.1 Baseline system
To lower the barrier of entry to the competition,
we provided a complete baseline MT system, along
with data resources. To summarize, we provided:
• sentence-aligned, tokenized training corpus
• a development and development test set
• trained language models for each language
• the phrase-based MT decoder Pharaoh
• a training script to build models for Pharaoh
The performance of the baseline system is simi-
lar to the best submissions in last year’s shared task.
We are currently working on a complete open source
implementation of a training and decoding system,
which should become available over the summer.
Training corpus
                          Spanish ↔ English  French ↔ English  German ↔ English
Sentences                          730,740           688,031           751,088
Foreign words                   15,676,710        15,323,737        15,256,793
English words                   15,222,105        13,808,104        16,052,269
Distinct foreign words             102,886            80,349           195,291
Distinct English words              64,123            61,627            65,889

Language model data
                             English     Spanish      French      German
Sentences                  1,003,349   1,070,305   1,066,974   1,078,141
Words                     27,493,499  29,129,720  31,604,879  26,562,167

In-domain test set
                             English     Spanish      French      German
Sentences                            2,000 (same for all languages)
Words                         59,307      61,824      66,783      55,533
Unseen words                     141         206         164         387
Ratio of unseen words          0.23%       0.40%       0.24%       0.70%
Distinct words                 6,031       7,719       7,230       8,812
Distinct unseen words            139         203         163         385

Out-of-domain test set
                             English     Spanish      French      German
Sentences                            1,064 (same for all languages)
Words                         25,919      29,826      31,937      26,818
Unseen words                     464         368         839         913
Ratio of unseen words          1.79%       1.23%       2.62%       3.40%
Distinct words                 5,166       5,689       5,728       6,594
Distinct unseen words            340         267         375         637
Figure 1: Properties of the training and test sets used in the shared task. The training data is the Europarl
corpus, from which the in-domain test set is also taken. There is twice as much language modelling data,
since the training data for the machine translation system is filtered to exclude sentences longer than 40
words. The out-of-domain test data is from the Project Syndicate web site, a compendium of political
commentary.
ID Participant
cmu Carnegie Mellon University, USA (Zollmann and Venugopal, 2006)
lcc Language Computer Corporation, USA (Olteanu et al., 2006b)
ms Microsoft, USA (Menezes et al., 2006)
nrc National Research Council, Canada (Johnson et al., 2006)
ntt Nippon Telegraph and Telephone, Japan (Watanabe et al., 2006)
rali RALI, University of Montreal, Canada (Patry et al., 2006)
systran Systran, France
uedin-birch University of Edinburgh, UK — Alexandra Birch (Birch et al., 2006)
uedin-phi University of Edinburgh, UK — Philipp Koehn (Birch et al., 2006)
upc-jg University of Catalonia, Spain — Jesús Giménez (Giménez and Màrquez, 2006)
upc-jmc University of Catalonia, Spain — Josep Maria Crego (Crego et al., 2006)
upc-mr University of Catalonia, Spain — Marta Ruiz Costa-jussà (Costa-jussà et al., 2006)
upv University of Valencia, Spain (Sánchez and Benedí, 2006)
utd University of Texas at Dallas, USA (Olteanu et al., 2006a)
Figure 2: Participants in the shared task. Not all groups participated in all translation directions.
1.2 Test Data
The test data was again drawn from a segment of
the Europarl corpus from the fourth quarter of 2000,
which is excluded from the training data. Partici-
pants were also provided with two sets of 2,000 sen-
tences of parallel text to be used for system develop-
ment and tuning.
In addition to the Europarl test set, we also col-
lected 29 editorials from the Project Syndicate web-
site2, which are published in all four languages
of the shared task. We aligned the texts at a sen-
tence level across all four languages, resulting in
1,064 sentences per language. For statistics on this
test set, refer to Figure 1.
The out-of-domain test set differs from the Europarl
data in various ways. The text type is editorials
instead of speech transcripts, and the domain is
general politics, economics, and science. However,
it is still mostly political content (even if not focused
on the internal workings of the European Union) and
opinion.
1.3 Participants
We received submissions from 14 groups from 11
institutions, as listed in Figure 2. Most of these
groups follow a phrase-based statistical approach to
machine translation. Microsoft’s approach uses
dependency trees; others use hierarchical phrase
models. Systran submitted their commercial rule-based
system, which was not tuned to the Europarl corpus.
2http://www.project-syndicate.com/
About half of the participants of last year’s shared
task participated again. The other half was replaced
by other participants, so we ended up with roughly
the same number. Compared to last year’s shared
task, the participants represent more long-term re-
search efforts. This may be the sign of a maturing
research environment.
While building a machine translation system is
a serious undertaking, in the future we hope to attract
more newcomers to the field by keeping the barrier
of entry as low as possible.
For more on the participating systems, please re-
fer to the respective system description in the pro-
ceedings of the workshop.
2 Automatic Evaluation
For the automatic evaluation, we used BLEU, since it
is the most established metric in the field. The BLEU
metric, like all currently proposed automatic metrics,
is occasionally suspected to be biased towards
statistical systems, especially the phrase-based systems
currently in use. It rewards matches of n-gram
sequences, but measures overall grammatical coherence
only indirectly at best.
The BLEU score has been shown to correlate
well with human judgement, when statistical ma-
chine translation systems are compared (Dodding-
ton, 2002; Przybocki, 2004; Li, 2005). However, a
recent study (Callison-Burch et al., 2006) pointed
out that this correlation may not always be strong.
They demonstrated this with the comparison of sta-
tistical systems against (a) manually post-edited MT
output, and (b) a rule-based commercial system.
The development of automatic scoring methods is
an open field of research. It was our hope that this
competition, which included the manual and automatic
evaluation of statistical systems and one rule-based
commercial system, would give further insight
into the relation between automatic and manual evaluation.
At the very least, we are creating a data resource
(the manual annotations) that may form the basis
of future research in evaluation metrics.
2.1 Computing BLEU Scores
We computed BLEU scores for each submission with
a single reference translation. For each sentence,
we counted how many n-grams in the system output
also occurred in the reference translation. By taking
the ratio of matching n-grams to the total number of
n-grams in the system output, we obtain the preci-
sion pn for each n-gram order n. These values for
n-gram precision are combined into a BLEU score:
BLEU = BP \cdot \exp\left( \sum_{n=1}^{4} \log p_n \right)    (1)

BP = \min(1, e^{1 - r/c})    (2)
The formula for the BLEU metric also includes a
brevity penalty for too short output, which is based
on the total number of words in the system output c
and in the reference r.
BLEU is sensitive to tokenization. Because of
this, we retokenized and lowercased submitted out-
put with our own tokenizer, which was also used to
prepare the training and test data.
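To make the computation concrete, here is a minimal single-reference BLEU sketch in Python. The function name and the list-of-tokens interface are illustrative; note that, unlike the simplified Equation (1), it uses the conventional uniform weights of 1/4 (a geometric mean of the n-gram precisions), as in the standard BLEU definition.

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Single-reference BLEU: geometric mean of the 1..4-gram
    precisions, multiplied by the brevity penalty of Equation (2)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = max(1, sum(cand.values()))
        # Clip at a tiny value so that log() is defined when nothing matches.
        precisions.append(max(matched, 1e-9) / total)
    c, r = len(candidate), len(reference)
    bp = min(1.0, math.exp(1 - r / c))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A perfect match yields a score of 1. In practice, the corpus-level score is computed from n-gram counts summed over all sentences, rather than by averaging per-sentence scores.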
2.2 Statistical Significance
Confidence Interval: Since BLEU scores are not
computed on the sentence level, traditional methods
to compute statistical significance and confidence
intervals do not apply. Hence, we use the bootstrap
resampling method described by Koehn (2004).
Following this method, we repeatedly — say,
1000 times — sample sets of sentences from the out-
put of each system, measure their BLEU score, and
use these 1000 BLEU scores as basis for estimating
a confidence interval. When dropping the top and
bottom 2.5% the remaining BLEU scores define the
range of the confidence interval.
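The procedure above can be sketched as follows, assuming a corpus-level score_fn (e.g. BLEU) that takes lists of system outputs and references; the names are illustrative:

```python
import random

def bootstrap_interval(outputs, references, score_fn,
                       samples=1000, alpha=0.05):
    """Bootstrap resampling (Koehn, 2004): resample the test set with
    replacement `samples` times, score each sample, then drop the top
    and bottom alpha/2 of the scores to obtain a confidence interval."""
    n = len(outputs)
    scores = []
    for _ in range(samples):
        idx = [random.randrange(n) for _ in range(n)]
        scores.append(score_fn([outputs[i] for i in idx],
                               [references[i] for i in idx]))
    scores.sort()
    cut = int(samples * alpha / 2)  # e.g. 25 of 1000 at each end
    return scores[cut], scores[-cut - 1]
```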
Pairwise comparison: We can use the same method
to assess the statistical significance of one system
outperforming another. If two systems’ scores are
close, this may simply be a random effect in the test
data. To check for this, we do pairwise bootstrap re-
sampling: Again, we repeatedly sample sets of sen-
tences, this time from both systems, and compare
their BLEU scores on these sets. If one system is bet-
ter in 95% of the sample sets, we conclude that its
higher BLEU score is statistically significantly bet-
ter.
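The pairwise variant can be sketched the same way (again with a hypothetical corpus-level score_fn); the returned fraction is compared against 0.95:

```python
import random

def pairwise_bootstrap(outputs_a, outputs_b, references, score_fn,
                       samples=1000):
    """Sample the same sentence indices for both systems and count how
    often system A scores higher; A is judged significantly better if
    it wins on at least 95% of the samples."""
    n = len(references)
    wins_a = 0
    for _ in range(samples):
        idx = [random.randrange(n) for _ in range(n)]
        sampled_refs = [references[i] for i in idx]
        score_a = score_fn([outputs_a[i] for i in idx], sampled_refs)
        score_b = score_fn([outputs_b[i] for i in idx], sampled_refs)
        if score_a > score_b:
            wins_a += 1
    return wins_a / samples
```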
The bootstrap method has been criticized by Riezler
and Maxwell (2005) and Collins et al. (2005) as
being too optimistic in declaring statistically
significant differences between systems. We therefore
also apply a different method, which was used in
the 2005 DARPA/NIST evaluation.
We divide each test set into blocks of 20 sentences
(100 blocks for the in-domain test set, 53
blocks for the out-of-domain test set), check for
each block whether one system has a higher BLEU
score than the other, and then use the sign test.
The sign test checks how likely a sample of better
and worse BLEU scores would have been generated
by two systems of equal performance.
Say we find one system doing better on 20
of the blocks and worse on 80 of the blocks; is it
significantly worse? We check how likely up
to k = 20 better scores out of n = 100 would have
been generated by two equal systems, using the
binomial distribution:
p(0..k; n, p) = \sum_{i=0}^{k} \binom{n}{i} p^i p^{n-i}
              = 0.5^n \sum_{i=0}^{k} \binom{n}{i}    (3)
If p(0..k; n, p) < 0.05 or p(0..k; n, p) > 0.95,
then we have a statistically significant difference
between the systems.
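Equation (3) is straightforward to evaluate directly; a minimal sketch (function name illustrative):

```python
import math

def sign_test_p(k, n):
    """Cumulative binomial probability with p = 0.5 (Equation 3): how
    likely two equally good systems are to produce at most k wins for
    one of them out of n blocks."""
    return 0.5 ** n * sum(math.comb(n, i) for i in range(k + 1))

# The example from the text: better on only 20 of 100 blocks.
p = sign_test_p(20, 100)
significant = p < 0.05 or p > 0.95  # here p is far below 0.05
```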
Figure 3: Annotation tool for manual judgement of adequacy and fluency of the system output. Translations
from 5 randomly selected systems for a randomly selected sentence are presented. No additional information
beyond the instructions on this page is given to the judges. The tool tracks and reports annotation speed.
3 Manual Evaluation
While automatic measures are an invaluable tool
for the day-to-day development of machine trans-
lation systems, they are only an imperfect substitute
for human assessment of translation quality, or as
the acronym BLEU puts it, a bilingual evaluation
understudy.
Many human evaluation metrics have been pro-
posed. Also, the argument has been made that ma-
chine translation performance should be evaluated
via task-based evaluation metrics, i.e. how much it
assists performing a useful task, such as supporting
human translators or aiding the analysis of texts.
The main disadvantage of manual evaluation is
that it is time-consuming and thus too expensive to
do frequently. In this shared task, we were also con-
fronted with this problem, and since we had no funding
to pay human judges, we asked participants
in the evaluation to share the burden. Par-
ticipants and other volunteers contributed about 180
hours of labor in the manual evaluation.
3.1 Collecting Human Judgements
We asked participants to each judge 200–300 sen-
tences in terms of fluency and adequacy, the most
commonly used manual evaluation metrics. We set-
tled on contrastive evaluations of 5 system outputs
for a single test sentence. See Figure 3 for a screen-
shot of the evaluation tool.
Presenting the output of several systems allows
the human judge to make more informed judgements,
contrasting the quality of the different systems.
The judgements tend to be done more in the form
of a ranking of the different systems. We assumed
that such a contrastive assessment would be beneficial
for an evaluation that essentially pits different
systems against each other.
While we had up to 11 submissions for a translation
direction, we decided against presenting
all 11 system outputs to the human judge. Our initial
experimentation with the evaluation tool showed
that this is often overwhelming.
Making the ten judgements (2 types for 5 sys-
tems) takes on average 2 minutes. Typically, judges
initially spent about 3 minutes per sentence, but then
accelerated with experience. Judges were excluded
from assessing the quality of MT systems that were
submitted by their institution. Sentences and systems
were randomly selected and randomly shuffled
for presentation.
We collected around 300–400 judgements per
judgement type (adequacy or fluency), per system,
per language pair. This is less than the 694 judgements
in the 2004 DARPA/NIST evaluation, or the 532
judgements in the 2005 DARPA/NIST evaluation.
This decreases the statistical significance of our re-
sults compared to those studies. The number of
judgements is additionally fragmented by our break-
up of sentences into in-domain and out-of-domain.
3.2 Normalizing the judgements
The human judges were presented with the follow-
ing definition of adequacy and fluency, but no addi-
tional instructions:
      Adequacy          Fluency
5     All Meaning       Flawless English
4     Most Meaning      Good English
3     Much Meaning      Non-native English
2     Little Meaning    Disfluent English
1     None              Incomprehensible
Judges varied in the average score they handed
out. The average fluency judgement per judge
ranged from 2.33 to 3.67, the average adequacy
judgement ranged from 2.56 to 4.13. Since different
judges judged different systems (recall that judges
were excluded from judging system output from their
own institution), we normalized the scores.
The normalized judgement per judge is the raw
judgement plus (3 minus average raw judgement for
this judge). In words, the judgements are normal-
ized, so that the average normalized judgement per
judge is 3.
Another way to view the judgements is that they
are not so much quality judgements of machine
translation systems per se as rankings of machine
translation systems. In fact, it is very difficult to
maintain consistent standards on what (say) an adequacy
judgement of 3 means, even for a specific language pair.
Given the way judgements are collected, human judges
tend to use the scores to rank systems against each
other. If one system is perfect, another has slight
flaws and the third more flaws, a judge is inclined
to hand out judgements of 5, 4, and 3. On the other
hand, when all systems produce muddled output, but
one is better, and one is worse, but not completely
wrong, a judge is inclined to hand out judgements of
4, 3, and 2. The judgement of 4 in the first case will
go to a vastly better system output than in the second
case.
We therefore also normalized judgements on a
per-sentence basis. The normalized judgement per
sentence is the raw judgement plus (0 minus average
raw judgement for this judge on this sentence).
Systems that generally do better than others will
receive a positive average normalized judgement per
sentence. Systems that generally do worse than oth-
ers will receive a negative one.
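Both normalizations are simple shifts of the raw scores; a sketch with illustrative data structures (scores grouped per judge, and per sentence shown to a judge):

```python
from statistics import mean

def normalize_per_judge(scores_by_judge):
    """Raw score + (3 - judge's average): every judge's average
    normalized score becomes 3."""
    return {judge: [s + (3 - mean(scores)) for s in scores]
            for judge, scores in scores_by_judge.items()}

def normalize_per_sentence(scores_by_sentence):
    """Raw score + (0 - the judge's average on this sentence):
    better-than-average systems get positive normalized scores."""
    return {sent: [s - mean(scores) for s in scores]
            for sent, scores in scores_by_sentence.items()}
```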
One may argue with these efforts on normaliza-
tion, and ultimately their value should be assessed
by assessing their impact on inter-annotator agree-
ment. Given the limited number of judgements we
received, we did not try to evaluate this.
3.3 Statistical Significance
Confidence Interval: To estimate confidence inter-
vals for the average mean scores for the systems, we
use standard significance testing.
Given a set of n sentences, we can compute the
sample mean \bar{x} and sample variance s^2 of the
individual sentence judgements x_i:

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i    (4)

s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2    (5)

The extent of the confidence interval [\bar{x} - d, \bar{x} + d]
can be computed by

d = 1.96 \cdot \frac{s}{\sqrt{n}}    (6)
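Equations (4)–(6) translate directly into code; a minimal sketch:

```python
from math import sqrt

def mean_confidence_interval(xs):
    """Sample mean (Eq. 4), sample variance (Eq. 5), and the 95%
    interval half-width d = 1.96 * s / sqrt(n) (Eq. 6)."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    return xbar, 1.96 * sqrt(s2) / sqrt(n)
```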
Pairwise Comparison: As for the automatic evalu-
ation metric, we want to be able to rank different sys-
tems against each other, for which we need assess-
ments of statistical significance on the differences
between a pair of systems.
Unfortunately, we have much less data to work
with than with the automatic scores. Given the way we
Basis Diff. Ratio
Sign test on BLEU 331 75%
Bootstrap on BLEU 348 78%
Sign test on Fluency 224 50%
Sign test on Adequacy 225 51%
Figure 4: Number and ratio of statistically significant
distinctions between system performance. Automatic
scores are computed on a larger test set than
manual scores (3,064 sentences vs. 300–400 sentences).
collected manual judgements, we do not necessarily
have the same sentence judged for both systems
(judges evaluate 5 systems out of the 8–10 participating
systems).
Still, for a good number of sentences, we do
have this direct comparison, which allows us to apply
the sign test, as described in Section 2.2.
4 Results and Analysis
The results of the manual and automatic evaluation
of the participating systems’ translations are detailed
in the figures at the end of this paper. The scores and
confidence intervals are presented first in Figures
7–10 in table form (including ranks), and then in
graphical form in Figures 11–16. In the graphs, sys-
tem scores are indicated by a point, the confidence
intervals by shaded areas around the point.
In all figures, we present the per-sentence normalized
judgements. The normalization on a per-judge
basis gave very similar rankings, only slightly less
consistent with the rankings from the pairwise
comparisons.
The confidence intervals are computed by boot-
strap resampling for BLEU, and by standard signif-
icance testing for the manual scores, as described
earlier in the paper.
Pairwise comparison is done using the sign test.
Often, two systems cannot be distinguished with
a confidence of over 95%, so they are ranked the
same. This actually happens quite frequently (more
below), so the rankings are broad estimates. For
instance: if 10 systems participate, and one system
does better than 3 others, worse than 2, and is not
significantly different from the remaining 4, its rank
is in the interval 3–7.
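The rank interval for each system follows directly from the pairwise results. A sketch, where better[a][b] is a hypothetical boolean matrix meaning "system a is significantly better than system b":

```python
def rank_intervals(n_systems, better):
    """A system significantly worse than w others can rank no better
    than w+1; one significantly better than l others can rank no
    worse than n_systems - l."""
    intervals = []
    for a in range(n_systems):
        wins = sum(better[a][b] for b in range(n_systems) if b != a)
        losses = sum(better[b][a] for b in range(n_systems) if b != a)
        intervals.append((losses + 1, n_systems - wins))
    return intervals
```

For the example in the text (10 systems; better than 3, worse than 2), this yields the interval (3, 7).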
Domain BLEU Fluency Adequacy
in-domain 26.63 3.17 3.58
out-of-domain 20.37 2.74 3.08
Figure 5: Evaluation scores for in-domain and out-
of-domain test sets, averaged over all systems
4.1 Close results
At first glance, we quickly recognize that many systems
are scored very similarly, both in terms of manual
judgement and BLEU. There may occasionally be
a system clearly at the top or at the bottom, but
most systems are so close that it is hard to distinguish
them.
In Figure 4, we display the number of system
comparisons for which we concluded statistical
significance. For the automatic scoring method BLEU,
we can distinguish three quarters of the systems.
While the Bootstrap method is slightly more sensi-
tive, it is very much in line with the sign test on text
blocks.
For the manual scoring, we can distinguish only
half of the systems, both in terms of fluency and ad-
equacy. More judgements would have enabled us
to make better distinctions, but it is not clear what
the upper limit is. We can check what the consequences
of less manual annotation would have
been: with half the number of manual judgements,
we can distinguish about 40% of the systems,
10% less.
4.2 In-domain vs. out-of-domain
The test set included 2,000 sentences from the
Europarl corpus, but also 1,064 sentences of
out-of-domain test data. Since the inclusion of
out-of-domain test data was a very late decision, the
participants were not informed of this. So, this was
a surprise element due to practical reasons, not
malice.
All systems (except for Systran, which was not
tuned to Europarl) did considerably worse on
out-of-domain test data. This is demonstrated by
average scores over all systems, in terms of BLEU,
fluency and adequacy, as displayed in Figure 5.
The manual scores are averages over the raw un-
normalized scores.
Language Pair BLEU Fluency Adequacy
French-English 26.09 3.25 3.61
Spanish-English 28.18 3.19 3.71
German-English 21.17 2.87 3.10
English-French 28.33 2.86 3.16
English-Spanish 27.49 2.86 3.34
English-German 14.01 3.15 3.65
Figure 6: Average scores for different language
pairs. Manual scoring is done by different judges,
resulting in a not very meaningful comparison.
4.3 Language pairs
It is well known that language pairs such as English-
German pose more challenges to machine transla-
tion systems than language pairs such as French-
English. Different sentence structure and rich target
language morphology are two reasons for this.
Again, we can compute average scores for all sys-
tems for the different language pairs (Figure 6). The
differences in difficulty are better reflected in the
BLEU scores than in the raw un-normalized man-
ual judgements. The easiest language pair according
to BLEU (English-French: 28.33) received worse
manual scores than the hardest (English-German:
14.01). This is because different judges focused on
different language pairs. Hence, the different av-
erages of manual scores for the different language
pairs reflect the behaviour of the judges, not the
quality of the systems on different language pairs.
4.4 Manual judgement vs. BLEU
Given the closeness of most systems and the wide
overlapping confidence intervals, it is hard to make
strong statements about the correlation between hu-
man judgements and automatic scoring methods
such as BLEU.
We confirm the finding by Callison-Burch et al.
(2006) that the rule-based system of Systran is not
adequately appreciated by BLEU. In-domain Sys-
tran scores on this metric are lower than all statistical
systems, even the ones that have much worse human
scores. Surprisingly, this effect is much less obvious
for out-of-domain test data. For instance, for out-of-
domain English-French, Systran has the best BLEU
and manual scores.
Our suspicion is that BLEU is very sensitive to
jargon, to selecting exactly the right words, and
not synonyms that human judges may appreciate
as equally good. This cannot be the only explanation,
since the discrepancy still holds, for in-
stance, for out-of-domain French-English, where
Systran receives among the best adequacy and flu-
ency scores, but a worse BLEU score than all but
one statistical system.
This data set of manual judgements should pro-
vide a fruitful resource for research on better auto-
matic scoring methods.
4.5 Best systems
So, who won the competition? The best answer
to this is: many research labs have very competi-
tive systems whose performance is hard to tell apart.
This is not completely surprising, since all systems
use very similar technology.
For some language pairs (such as German-
English) system performance is more divergent than
for others (such as English-French), at least as mea-
sured by BLEU.
The statistical systems seem to still lag behind
the commercial rule-based competition when
translating into morphologically rich languages, as
demonstrated by the results for English-German and
English-French.
The predominant focus on building systems that
translate into English has so far ignored the difficult
issues of generating rich morphology, which may not
be determined solely by local context.
4.6 Comments on Manual Evaluation
This is the first time that we organized a large-scale
manual evaluation. While we used the standard metrics
of the community, the way we presented translations
and prompted for assessment differed from
other evaluation campaigns. For instance, in the
recent IWSLT evaluation, first fluency annotations
were solicited (while withholding the source sen-
tence), and then adequacy annotations.
Almost all annotators reported difficulties in
maintaining a consistent standard for fluency and ad-
equacy judgements, but nevertheless most did not
explicitly move towards a ranking-based evaluation.
Almost all annotators expressed their preference to
move to a ranking-based evaluation in the future. A
few pointed out that adequacy should be broken up
into two criteria: (a) are all source words covered?
(b) does the translation have the same meaning, in-
cluding connotations?
Annotators suggested that long sentences are almost
impossible to judge. Since all long sentence
translations are somewhat muddled, even a contrastive
evaluation between systems was difficult. A
few annotators suggested breaking up long sentences
into clauses and evaluating these separately.
Not every annotator was fluent in both the source
and the target language. While it is essential to be
fluent in the target language, it is not strictly nec-
essary to know the source language if a reference
translation is given. However, since we extracted
the test corpus automatically from web sources, the
reference translation was not always accurate — due
to sentence alignment errors, or because translators
did not adhere to a strict sentence-by-sentence trans-
lation (say, using pronouns when referring to enti-
ties mentioned in the previous sentence). Lack of
correct reference translations was pointed out as a
shortcoming of our evaluation. One annotator sug-
gested that this was the case for as much as 10% of
our test sentences. Annotators argued for the impor-
tance of having correct and even multiple references.
It was also proposed to allow annotators to skip
sentences that they are unable to judge.
5 Conclusions
We carried out an extensive manual and automatic
evaluation of machine translation performance on
European language pairs. While many systems had
similar performance, the results offer interesting in-
sights, especially about the relative performance of
statistical and rule-based systems.
Due to many similarly performing systems, we
are not able to draw strong conclusions on the ques-
tion of correlation of manual and automatic evalua-
tion metrics. The bias of automatic methods in favor
of statistical systems seems to be less pronounced on
out-of-domain test data.
The manual evaluation of scoring translations on
a graded scale from 1 to 5 seems to be very hard to
perform. Replacing this with a ranked evaluation
seems to be more suitable. Human judges also
pointed out difficulties with the evaluation of long
sentences.
Acknowledgements
The manual evaluation would not have been possible
without the contributions of the manual annotators:
Jesus Andres Ferrer, Abhishek Arun, Amittai Axel-
rod, Alexandra Birch, Chris Callison-Burch, Jorge
Civera, Marta Ruiz Costa-jussà, Josep Maria Crego,
Elsa Cubel, Chris Irwin Davis, Loic Dugast, Chris
Dyer, Andreas Eisele, Cameron Fordyce, Jesús
Giménez, Fabrizio Gotti, Hieu Hoang, Eric Joanis,
Howard Johnson, Philipp Koehn, Beata Kouchnir,
Roland Kuhn, Elliott Macklovitch, Arul Menezes,
Marian Olteanu, Chris Quirk, Reinhard Rapp, Fatiha
Sadat, Joan Andreu Sánchez, Germán Sanchis,
Michel Simard, Ashish Venugopal, and Taro Watanabe.
This work was supported in part under the GALE
program of the Defense Advanced Research Projects
Agency, Contract No. HR0011-06-C-0022.

References
Birch, A., Callison-Burch, C., Osborne, M., and
Koehn, P. (2006). Constraining the phrase-based,
joint probability statistical translation model. In
Proceedings on the Workshop on Statistical Ma-
chine Translation, pages 154–157, New York
City. Association for Computational Linguistics.
Callison-Burch, C., Osborne, M., and Koehn, P.
(2006). Re-evaluating the role of BLEU in ma-
chine translation research. In Proceedings of
EACL.
Collins, M., Koehn, P., and Kucerova, I. (2005).
Clause restructuring for statistical machine trans-
lation. In Proceedings of ACL.
Costa-jussà, M. R., Crego, J. M., de Gispert, A.,
Lambert, P., Khalilov, M., Mariño, J. B., Fonollosa,
J. A. R., and Banchs, R. (2006). TALP phrase-based
statistical translation system for European
language pairs. In Proceedings on the Workshop
on Statistical Machine Translation, pages 142–
145, New York City. Association for Computa-
tional Linguistics.
Crego, J. M., de Gispert, A., Lambert, P., Costa-jussà,
M. R., Khalilov, M., Banchs, R., Mariño,
J. B., and Fonollosa, J. A. R. (2006). N-gram-based
SMT system enhanced with reordering pat-
terns. In Proceedings on the Workshop on Statis-
tical Machine Translation, pages 162–165, New
York City. Association for Computational Lin-
guistics.
Doddington, G. (2002). The NIST automated mea-
sure and its relation to IBM’s BLEU. In Proceed-
ings of the LREC-2002 Workshop on Machine Trans-
lation Evaluation: Human Evaluators Meet Auto-
mated Metrics, Gran Canaria, Spain.
Eck, M. and Hori, C. (2005). Overview of the IWSLT
2005 evaluation campaign. In Proc. of the Inter-
national Workshop on Spoken Language Transla-
tion.
Giménez, J. and Màrquez, L. (2006). The LDV-COMBO
system for SMT. In Proceedings on
the Workshop on Statistical Machine Translation,
pages 166–169, New York City. Association for
Computational Linguistics.
Johnson, H., Sadat, F., Foster, G., Kuhn, R., Simard,
M., Joanis, E., and Larkin, S. (2006). PORTAGE:
with smoothed phrase tables and segment choice
models. In Proceedings on the Workshop on
Statistical Machine Translation, pages 134–137,
New York City. Association for Computational
Linguistics.
Koehn, P. (2004). Statistical significance tests for
machine translation evaluation. In Lin, D. and
Wu, D., editors, Proceedings of EMNLP 2004,
pages 388–395, Barcelona, Spain. Association for
Computational Linguistics.
Koehn, P. and Monz, C. (2005). Shared task: Statis-
tical machine translation between European lan-
guages. In Proceedings of the ACL Workshop
on Building and Using Parallel Texts, pages 119–
124, Ann Arbor, Michigan. Association for Com-
putational Linguistics.
Li, A. (2005). Results of the 2005 NIST machine
translation evaluation. In Machine Translation
Workshop.
Menezes, A., Toutanova, K., and Quirk, C. (2006).
Microsoft Research treelet translation system:
NAACL 2006 Europarl evaluation. In Proceedings
on the Workshop on Statistical Machine Transla-
tion, pages 158–161, New York City. Association
for Computational Linguistics.
Olteanu, M., Davis, C., Volosen, I., and Moldovan,
D. (2006a). Phramer - an open source statisti-
cal phrase-based translator. In Proceedings on
the Workshop on Statistical Machine Translation,
pages 146–149, New York City. Association for
Computational Linguistics.
Olteanu, M., Suriyentrakorn, P., and Moldovan, D.
(2006b). Language models and reranking for ma-
chine translation. In Proceedings on the Workshop
on Statistical Machine Translation, pages 150–
153, New York City. Association for Computa-
tional Linguistics.
Patry, A., Gotti, F., and Langlais, P. (2006). Mood at
work: Ramses versus Pharaoh. In Proceedings on
the Workshop on Statistical Machine Translation,
pages 126–129, New York City. Association for
Computational Linguistics.
Przybocki, M. (2004). NIST machine translation
2004 evaluation – summary of results. In Machine
Translation Evaluation Workshop.
Riezler, S. and Maxwell, J. T. (2005). On some pit-
falls in automatic evaluation and significance test-
ing for MT. In Proceedings of the ACL Workshop
on Intrinsic and Extrinsic Evaluation Measures
for Machine Translation and/or Summarization,
pages 57–64, Ann Arbor, Michigan. Association
for Computational Linguistics.
Sánchez, J. A. and Benedí, J. M. (2006). Stochas-
tic inversion transduction grammars for obtaining
word phrases for phrase-based statistical machine
translation. In Proceedings on the Workshop on
Statistical Machine Translation, pages 130–133,
New York City. Association for Computational
Linguistics.
Watanabe, T., Tsukada, H., and Isozaki, H. (2006).
NTT system description for the WMT2006 shared
task. In Proceedings on the Workshop on Statis-
tical Machine Translation, pages 122–125, New
York City. Association for Computational Lin-
guistics.
Zollmann, A. and Venugopal, A. (2006). Syntax
augmented machine translation via chart parsing.
In Proceedings on the Workshop on Statistical
Machine Translation, pages 138–141, New York
City. Association for Computational Linguistics.
