Minimum Bayes-Risk Decoding for Statistical Machine Translation
Shankar Kumar and William Byrne
Center for Language and Speech Processing, Johns Hopkins University,
3400 North Charles Street, Baltimore, MD, 21218, USA
{skumar,byrne}@jhu.edu
Abstract
We present Minimum Bayes-Risk (MBR) de-
coding for statistical machine translation. This
statistical approach aims to minimize expected
loss of translation errors under loss functions
that measure translation performance. We de-
scribe a hierarchy of loss functions that incor-
porate different levels of linguistic information
from word strings, word-to-word alignments
from an MT system, and syntactic structure
from parse-trees of source and target language
sentences. We report the performance of the
MBR decoders on a Chinese-to-English trans-
lation task. Our results show that MBR decod-
ing can be used to tune statistical MT perfor-
mance for specific loss functions.
1 Introduction
Statistical Machine Translation systems have achieved
considerable progress in recent years as seen from their
performance on international competitions in standard
evaluation tasks (NIST, 2003). This rapid progress has
been greatly facilitated by the development of automatic
translation evaluation metrics such as BLEU score (Pa-
pineni et al., 2001), NIST score (Doddington, 2002)
and Position Independent Word Error Rate (PER) (Och,
2002). However, given the many factors that influence
translation quality, it is unlikely that we will find a single
translation metric that will be able to judge all these fac-
tors. For example, the BLEU, NIST and the PER metrics,
This work was supported by the National Science Foun-
dation under Grant No. 0121285 and an ONR MURI Grant
N00014-01-1-0685. Any opinions, findings, and conclusions or
recommendations expressed in this material are those of the au-
thors and do not necessarily reflect the views of the National
Science Foundation or the Office of Naval Research.
though effective, do not take into account explicit syntac-
tic information when measuring translation quality.
Given that different Machine Translation (MT) eval-
uation metrics are useful for capturing different aspects
of translation quality, it becomes desirable to create MT
systems tuned with respect to each individual criterion. In
contrast, the maximum likelihood techniques that under-
lie the decision processes of most current MT systems do
not take into account these application specific goals. We
apply the Minimum Bayes-Risk (MBR) techniques devel-
oped for automatic speech recognition (Goel and Byrne,
2000) and bitext word alignment for statistical MT (Ku-
mar and Byrne, 2002), to the problem of building au-
tomatic MT systems tuned for specific metrics. This is
a framework that can be used with statistical models of
speech and language to develop decision processes opti-
mized for specific loss functions.
We will show that MBR decoding can be applied to
machine translation in two scenarios. Given an automatic
MT metric, we design a loss function based on the met-
ric and use MBR decoding to tune MT performance un-
der the metric. We also show how MBR decoding can
be used to incorporate syntactic structure into a statistical
MT system by building specialized loss functions. These
loss functions can use information from word strings,
word-to-word alignments and parse-trees of the source
sentence and its translation. In particular we describe
the design of a Bilingual Tree Loss Function that can ex-
plicitly use syntactic structure for measuring translation
quality. MBR decoding under this loss function allows
us to integrate syntactic knowledge into a statistical MT
system without building detailed models of linguistic fea-
tures, and retraining the system from scratch.
We first present a hierarchy of loss functions for trans-
lation based on different levels of lexical and syntactic
information from source and target language sentences.
This hierarchy includes the loss functions useful in both
situations where we intend to apply MBR decoding. We
then present the MBR framework for statistical machine
translation under the various translation loss functions.
We finally report the performance of MBR decoders op-
timized for each loss function.
2 Translation Loss Functions
We now introduce translation loss functions to measure
the quality of automatically generated translations. Sup-
pose we have a sentence $F$ in a source language for which we have generated an automatic translation $E'$ with word-to-word alignment $A'$ relative to $F$. The word-to-word alignment $A'$ specifies the words in the source sentence $F$ that are aligned to each word in the translation $E'$. We wish to compare this automatic translation with a reference translation $E$ with word-to-word alignment $A$ relative to $F$.
We will now present a three-tier hierarchy of translation loss functions of the form $L((E', A'), (E, A); F)$ that measure $(E', A')$ against $(E, A)$. These loss functions will make use of different levels of information from word strings, MT alignments and syntactic structure from parse-trees of both the source and target strings, as illustrated in the following table.

Loss Function               Functional Form
Lexical                     $L(E, E')$
Target Language Parse-Tree  $L(T_E, T_{E'})$
Bilingual Parse-Tree        $L((T_E, A), (T_{E'}, A'); T_F)$
We start with an example of two competing English
translations for a Chinese sentence (in Pinyin without
tones), with their word-to-word alignments in Figure 1.
The reference translation for the Chinese sentence with
its word-to-word alignment is shown in Figure 2. In this
section, we will show the computation of different loss
functions for this example.
2.1 Lexical Loss Functions
The first class of loss functions uses no informa-
tion about word alignments or parse-trees, so that
$L((E', A'), (E, A); F)$ can be reduced to $L(E, E')$. We
consider three loss functions in this category: The BLEU
score (Papineni et al., 2001), word-error rate, and the
position-independent word-error rate (Och, 2002). An-
other example of a loss function in this class is the MT-
eval metric introduced in Melamed et al. (2003). A loss
function of this type depends only on information from
word strings.
The BLEU score (Papineni et al., 2001) computes the geometric mean of the precision of $n$-grams of various lengths ($n \in \{1, \ldots, N\}$) between a hypothesis and a reference translation, and includes a brevity penalty ($B(E, E') \leq 1$) if the hypothesis is shorter than the reference. We use $N = 4$.
\[ \mathrm{BLEU}(E, E') = \exp\Big(\frac{1}{N}\sum_{n=1}^{N} \log p_n(E, E')\Big) \cdot B(E, E'), \]
where $p_n(E, E')$ is the precision of $n$-grams in the hypothesis $E'$. The BLEU score is zero if any of the $n$-gram precisions $p_n(E, E')$ is zero for that sentence pair. We note that $0 \leq \mathrm{BLEU}(E, E') \leq 1$. We derive a loss function from the BLEU score as $L_{\mathrm{BLEU}}(E, E') = 1 - \mathrm{BLEU}(E, E')$.
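To make the computation concrete, the sketch below (in Python) computes this sentence-level BLEU loss from clipped $n$-gram precisions; the function names and the exact form of the brevity penalty are our own choices for illustration and are not taken from the paper.

```python
import math
from collections import Counter

def ngram_counts(words, n):
    """Counts of n-grams (as tuples) in a token sequence."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def sentence_bleu_loss(reference, hypothesis, max_n=4):
    """Return 1 - BLEU(E, E') for a single (reference, hypothesis) pair.

    BLEU is the geometric mean of clipped n-gram precisions (n = 1..max_n)
    times a brevity penalty <= 1 when the hypothesis is shorter than the
    reference; the loss is 1.0 whenever any n-gram precision is zero.
    """
    log_precisions = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = ngram_counts(hypothesis, n)
        ref_ngrams = ngram_counts(reference, n)
        # Clipped matches: a hypothesis n-gram counts at most as often
        # as it appears in the reference.
        matches = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        if matches == 0:
            return 1.0
        log_precisions += math.log(matches / sum(hyp_ngrams.values()))
    brevity = min(1.0, math.exp(1.0 - len(reference) / max(len(hypothesis), 1)))
    return 1.0 - math.exp(log_precisions / max_n) * brevity
```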
Word Error Rate (WER) is the ratio of the string-edit distance between the reference and the hypothesis word strings to the number of words in the reference. String-edit distance is measured as the minimum number of edit operations needed to transform one word string into the other.
Position-independent Word Error Rate (PER) measures the minimum number of edit operations needed to transform a word string into any permutation of the other word string. The PER score (Och, 2002) is then computed as the ratio of this distance to the number of words in the reference word string.
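Both error rates admit a short implementation over tokenized sentences; the following sketch (helper names are ours) computes WER by normalizing the Levenshtein distance by the reference length, and PER from word multisets, which is equivalent to the minimum edit distance over permutations of the reference.

```python
from collections import Counter

def edit_distance(ref, hyp):
    """Levenshtein (string-edit) distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # delete r
                            curr[j - 1] + 1,          # insert h
                            prev[j - 1] + (r != h)))  # substitute or match
        prev = curr
    return prev[-1]

def wer(ref, hyp):
    """Word Error Rate: edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)

def per(ref, hyp):
    """Position-independent WER: with word order ignored, the minimum
    number of edits reduces to a comparison of word multisets."""
    ref_counts, hyp_counts = Counter(ref), Counter(hyp)
    matched = sum(min(c, hyp_counts[w]) for w, c in ref_counts.items())
    return (max(len(ref), len(hyp)) - matched) / len(ref)
```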
2.2 Target Language Parse-Tree Loss Functions
The second class of translation loss functions uses infor-
mation only from the parse-trees of the two translations,
so that $L((E, A), (E', A'); F) = L(T_E, T_{E'})$. This loss
function has no access to any information from the source
sentence or the word alignments.
Examples of such loss functions are tree-edit distances
between parse-trees, string-edit distances between event
representation of parse-trees (Tang et al., 2002), and tree-
kernels (Collins and Duffy, 2002). The computation of
tree-edit distance involves an unconstrained alignment of
the two English parse-trees. We can simplify this prob-
lem once we have a third parse tree (for the Chinese sen-
tence) with node-to-node alignment relative to the two
English trees. We will introduce such a loss function in
the next section. We did not perform experiments involv-
ing this class of loss functions, but mention them for com-
pleteness in the hierarchy of loss functions.
2.3 Bilingual Parse-Tree Loss Functions
The third class of loss functions uses information from
word strings, alignments and parse-trees in both lan-
guages, and can be described by
$L((E, A), (E', A'); F) = L((T_E, A), (T_{E'}, A'); T_F)$.
We will now describe one such loss function using the
example in Figures 1 and 2. Figure 3 shows a tree-
to-tree mapping between the source (Chinese) parse-tree
and parse-trees of its reference translation and two com-
peting hypothesis (English) translations.
F:  jin-nian qian liangyue guangdong gao xinjishu chanpin chukou sanqidianliuyi meiyuan
E1: the first two months of this year guangdong exported high-tech products 3.76 billion US dollars
E2: the first two months of this year guangdong 's high-tech products 3.76 billion US dollars
Figure 1: Two competing English translations E1 and E2 for a Chinese sentence F, with their word-to-word alignments (alignment links not reproduced here).
F: jin-nian qian liangyue guangdong gao xinjishu chanpin chukou sanqidianliuyi meiyuan
E: export of high-tech products in guangdong in first two months this year reached 3.76 billion US dollars
Figure 2: The reference translation E for the Chinese sentence from Figure 1 with its word-to-word alignments (links not reproduced here). Words in the Chinese (English) sentence shown as unaligned are aligned to the NULL word in the English (Chinese) sentence.
We first assume that a node $n$ in the source tree $T_F$ can be mapped to a node $m$ in $T_E$ (and a node $m'$ in $T_{E'}$) using word alignment $A$ (and $A'$ respectively). We denote the subtree of $T_E$ rooted at node $m$ by $T_m$ and the subtree of $T_{E'}$ rooted at node $m'$ by $T'_{m'}$. We will now describe a simple procedure that makes use of the word alignment $A$ to construct a node-to-node alignment between nodes in the source tree $T_F$ and the target tree $T_E$.
2.3.1 Alignment of Parse-Trees
For each node $n$ in the source tree $T_F$ we consider the subtree $T_n$ rooted at $n$. We first read off the source word sequence corresponding to the leaves of $T_n$. We next consider the subset of words in the target sentence that are aligned to any word in this source word sequence, and select the leftmost and rightmost words from this subset. We locate the leaf nodes corresponding to these two words in the target parse-tree $T_E$, and obtain their closest common ancestor node $m \in T_E$. This procedure gives us a mapping from a node $n \in T_F$ to a node $m \in T_E$, and this mapping associates one subtree $T_n$ of $T_F$ with one subtree $T_m$ of $T_E$.
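This procedure can be sketched as follows, under an assumed tree representation (a small Node class whose leaves carry a word and its position) and an alignment given as a map from source word positions to sets of target word positions; none of these data structures are prescribed by the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

@dataclass(eq=False)
class Node:
    label: str                          # non-terminal or POS tag
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None          # set only on leaves
    index: Optional[int] = None         # word position, set only on leaves

def iter_nodes(node):
    """Yield a node and all of its descendants."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)

def leaf_positions(node):
    """Word positions covered by the subtree rooted at `node`."""
    return [n.index for n in iter_nodes(node) if n.index is not None]

def find_leaf(root, position):
    return next(n for n in iter_nodes(root) if n.index == position)

def closest_common_ancestor(root, a, b):
    """Deepest node of `root` whose span covers both leaves `a` and `b`."""
    node = root
    while True:
        nxt = next((c for c in node.children
                    if a.index in leaf_positions(c)
                    and b.index in leaf_positions(c)), None)
        if nxt is None:
            return node
        node = nxt

def align_nodes(source_root: Node, target_root: Node,
                alignment: Dict[int, Set[int]]):
    """Map each source-tree node to a target-tree node (or None).

    For every source node, collect the target positions aligned to any word
    under it, take the leftmost and rightmost such positions, and return the
    closest common ancestor of the two corresponding target leaves.
    """
    mapping = {}
    for node in iter_nodes(source_root):
        targets = sorted(p for s in leaf_positions(node)
                         for p in alignment.get(s, ()))
        if not targets:
            mapping[node] = None        # no aligned target word
            continue
        left = find_leaf(target_root, targets[0])
        right = find_leaf(target_root, targets[-1])
        mapping[node] = closest_common_ancestor(target_root, left, right)
    return mapping
```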
2.3.2 Loss Computation between Aligned
Parse-Trees
Given the subtree alignment between $T_F$ and $T_E$, and between $T_F$ and $T_{E'}$, we first identify the subset of nodes in $T_F$ for which we can identify a corresponding node in both $T_E$ and $T_{E'}$:
\[ N_F = \{\, n \in T_F : m \neq \mathrm{NULL},\ m' \neq \mathrm{NULL} \,\}. \]
The Bilingual Parse-Tree (BiTree) Loss Function can then be computed as
\[ \mathrm{BiTreeLoss}((T_E, A), (T_{E'}, A'); T_F) = \sum_{n \in N_F} d(T_m, T'_{m'}), \qquad (1) \]
where $d(T, T')$ is a distance measure between sub-trees $T$ and $T'$. Specific BiTree loss functions are determined through particular choices of $d$. In our experiments, we used a 0/1 loss function between sub-trees $T$ and $T'$:
\[ d(T, T') = \begin{cases} 1 & T \neq T' \\ 0 & \text{otherwise.} \end{cases} \qquad (2) \]
We note that other tree-to-tree distance measures can also be used to compute $d$, e.g. the distance function could compare if the sub-trees $T$ and $T'$ have the same headword/non-terminal tag.
The BiTree loss function measures the distance between two trees in terms of distances between their corresponding subtrees. In this way, we replace the string-to-string (Levenshtein) alignments (for WER) or $n$-gram matches (for BLEU/PER) with subtree-to-subtree alignments.
The BiTree Error Rate (in %) is computed as a ratio of the BiTree loss function to the number of nodes in the set $N_F$.
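With two such node-to-node alignments in hand, Equations 1 and 2 amount to counting the nodes of $N_F$ whose corresponding reference and hypothesis subtrees differ. The sketch below builds on the Node class and align_nodes() illustration above; the particular subtree comparison (same labels, structure and leaf words) is only one possible instantiation of $d$, chosen here for illustration.

```python
def subtrees_equal(a: Node, b: Node) -> bool:
    """One possible 0/1 comparison (Equation 2): subtrees are equal iff they
    have the same labels, the same structure and the same leaf words."""
    if (a.label != b.label or a.word != b.word
            or len(a.children) != len(b.children)):
        return False
    return all(subtrees_equal(x, y) for x, y in zip(a.children, b.children))

def bitree_loss(source_root, ref_root, hyp_root, ref_align, hyp_align):
    """BiTree loss (Equation 1) under the 0/1 subtree distance (Equation 2).

    Returns (loss, |N_F|); the BiTree Error Rate is loss / |N_F|.
    """
    ref_map = align_nodes(source_root, ref_root, ref_align)   # n -> m
    hyp_map = align_nodes(source_root, hyp_root, hyp_align)   # n -> m'
    loss, n_f = 0, 0
    for node in iter_nodes(source_root):
        m, m_prime = ref_map[node], hyp_map[node]
        if m is None or m_prime is None:
            continue                     # node is not in N_F
        n_f += 1
        loss += 0 if subtrees_equal(m, m_prime) else 1
    return loss, n_f
```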
The complete node-to-node alignment between the
parse-tree of the source (Chinese) sentence and the parse
trees of its reference translation and the two hypothesis
translations (English) is given in Table 1. Each row in
this table shows the alignment between a node in the Chi-
nese parse-tree and nodes in the reference and the two hy-
pothesis parse-trees. The computation of the Bitree Loss
function and the Bitree Error Rate is presented in the last
two rows of the table.
[Parse-trees not reproduced. The figure shows $T_E$: Reference Translation (English); $T_F$: Source Sentence (Chinese); $T_{E_1}$: Hypothesis Translation 1 (English); $T_{E_2}$: Hypothesis Translation 2 (English).]
Figure 3: An example showing a parse-tree for a Chinese sentence and parse-trees for its reference translation and two competing hypothesis translations. We show a sample alignment for one of the nodes in the Chinese tree with its corresponding nodes in the three English trees. The complete node-to-node alignment between the parse-trees of the Chinese sentence and the three English sentences is given in Table 1.
Node n ∈ T_F | Node m ∈ T_E | Node m_1 ∈ T_{E_1} | d(T_m, T_{m_1}) | Node m_2 ∈ T_{E_2} | d(T_m, T_{m_2})
VP1 | S | S | 1 | NP1 | 1
LCP | NP6 | NP1 | 1 | NP1 | 1
NP1 | NP8 | NP3 | 1 | NP3 | 1
NT1 | NP8 | NP3 | 1 | NP3 | 1
jin-nian | NP8 | NP3 | 1 | NP3 | 1
LC1 | first | NP1 | 1 | NP2 | 1
qian | first | NP1 | 1 | NP2 | 1
VP2 | S | S | 1 | NP1 | 1
VV | NP7 | NP2 | 1 | NP2 | 1
liangyue | NP7 | NP2 | 1 | NP2 | 1
NP2 | S | S | 1 | NP1 | 1
NP3 | Guangdong | Guangdong | 0 | NP4 | 1
NR2 | Guangdong | Guangdong | 0 | NP4 | 1
guangdong | Guangdong | Guangdong | 0 | NP4 | 1
ADJP1 | reached | high-tech | 1 | high-tech | 1
JJ1 | reached | high-tech | 1 | high-tech | 1
gao | reached | high-tech | 1 | high-tech | 1
NP3 | NP1 | VP1 | 1 | NP3 | 1
NN2 | products | products | 0 | products | 0
chanpin | products | products | 0 | products | 0
NN3 | export | exported | 1 | products | 1
chukou | export | exported | 1 | products | 1
QP1 | NP9 | NP5 | 0 | NP1 | 1
CLP1 | NP9 | NP5 | 0 | NP1 | 1
M1 | NP9 | NP5 | 0 | NP1 | 1
meiyuan | NP9 | NP5 | 0 | NP1 | 1
BiTree Loss | | Loss(T_E, T_{E_1}) = 17 | | Loss(T_E, T_{E_2}) = 24 |
BiTree Error Rate (%) | | 17/26 = 65.4 | | 24/26 = 92.3 |
Table 1: BiTree loss computation for the parse-trees shown in Figure 3. Each row shows a mapping between a node in the parse-tree of the Chinese sentence and the nodes in the parse-trees of its reference translation, hypothesis translation 1 and hypothesis translation 2.
2.4 Comparison of Loss Functions
In Table 2 we compare various translation loss functions
for the example from Figure 1. The two hypothesis trans-
lations are very similar at the word level and therefore the
BLEU score, PER and the WER are identical. However
we observe that the sentences differ substantially in their
syntactic structure (as seen from Parse-Trees in Figure 3),
and to a lesser extent in their word-to-word alignments
(Figure 1) to the source sentence. The first hypothesis translation is parsed as a sentence while the second translation is parsed as a noun phrase. The BiTree loss function, which depends on both the parse-trees and the word-to-word alignments, is therefore very different for the two translations (Table 2). While string based
metrics such as BLEU, WER and PER are insensitive to
the syntactic structure of the translations, BiTree Loss is
able to measure this aspect of translation quality, and as-
signs different scores to the two translations.
We provide this example to show how a loss function
which makes use of syntactic structure from source and
target parse trees, can capture properties of translations
that string based loss functions are unable to measure.
Loss Function           L(E, E_1)   L(E, E_2)
BLEU (%)                26.4        26.4
WER (%)                 70.6        70.6
PER (%)                 23.5        23.5
BiTree Error Rate (%)   65.4        92.3
Table 2: Comparison of the different loss functions for hypothesis and reference translations from Figures 1 and 2.
3 Minimum Bayes-Risk Decoding
Statistical Machine Translation (Brown et al., 1990) can be formulated as a mapping of a word sequence $F$ in a source language to a word sequence $E'$ in the target language that has a word-to-word alignment $A'$ relative to $F$. Given the source sentence $F$, the MT decoder $\delta(F)$ produces a target word string $E'$ with word-to-word alignment $A'$. Relative to a reference translation $E$ with word alignment $A$, the decoder performance is measured as $L((E, A), \delta(F))$. Our goal is to find the decoder that has the best performance over all translations. This is measured through the Bayes-Risk:
\[ R(\delta(F)) = E_{P(E, A, F)}\big[\, L((E, A), \delta(F)) \,\big]. \]
The expectation is taken under the true distribution $P(E, A, F)$ that describes translations of human quality.
Given a loss function and a distribution, it is well
known that the decision rule that minimizes the Bayes-
Risk is given by (Bickel and Doksum, 1977; Goel and
Byrne, 2000):
\[ \delta(F) = \mathop{\mathrm{argmin}}_{(E', A')} \sum_{(E, A)} L((E, A), (E', A'); F)\, P(E, A \mid F). \qquad (3) \]
We shall refer to the decoder given by this equation
as the Minimum Bayes-Risk (MBR) decoder. The MBR
decoder can be thought of as selecting a consensus trans-
lation: For each sentence $F$, Equation 3 selects the translation that is closest on average to all the likely trans-
lations and alignments. The closeness is measured under
the loss function of interest.
This optimal decoder has the difficulties of search
(minimization) and computing the expectation under the
true distribution. In practice, we will consider the space
of translations to be an $N$-best list of translation alternatives generated under a baseline translation model. Of course, we do not have access to the true distribution over translations. We therefore use statistical translation models (Och, 2002) to approximate the distribution $P(E, A \mid F)$.
Decoder Implementation: The MBR decoder (Equation 3) on the $N$-best list is implemented as
\[ \hat{i} = \mathop{\mathrm{argmin}}_{i \in \{1, 2, \ldots, N\}} \sum_{j=1}^{N} L((E_j, A_j), (E_i, A_i))\, P(E_j, A_j \mid F), \]
and $\delta(F) = (E_{\hat{i}}, A_{\hat{i}})$. This is a rescoring procedure that searches for consensus under a given loss function. The posterior probability of each hypothesis in the $N$-best list is derived from the joint probability assigned by the baseline translation model:
\[ P(E_i, A_i \mid F) = \frac{P(E_i, A_i, F)}{\sum_{j=1}^{N} P(E_j, A_j, F)}. \qquad (4) \]
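Concretely, this rescoring is a double loop over the $N$-best list. The sketch below is a generic illustration (names are ours): it takes hypotheses paired with their log joint scores under the baseline model and any pairwise loss function from Section 2, normalizes the scores into posteriors as in Equation 4, and returns the hypothesis with the smallest expected loss.

```python
import math

def mbr_rescore(nbest, loss):
    """Select the consensus hypothesis from an N-best list (Equation 3).

    `nbest` is a list of (hypothesis, log_joint) pairs, where log_joint is
    log P(E_i, A_i, F) under the baseline model; `loss` is any pairwise loss
    L((E_j, A_j), (E_i, A_i)).
    """
    # Posterior of each hypothesis (Equation 4), via log-sum-exp.
    m = max(score for _, score in nbest)
    log_z = m + math.log(sum(math.exp(score - m) for _, score in nbest))
    posteriors = [math.exp(score - log_z) for _, score in nbest]

    best, best_risk = None, float("inf")
    for cand, _ in nbest:
        # Expected loss (conditional Bayes risk) of choosing this candidate.
        risk = sum(p * loss(other, cand)
                   for (other, _), p in zip(nbest, posteriors))
        if risk < best_risk:
            best, best_risk = cand, risk
    return best
```

Passing the 0/1 loss of Equation 5 recovers the MAP decoder, while a loss such as $1 - \mathrm{BLEU}$ yields the decoder tuned for BLEU.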
The conventional Maximum A Posteriori (MAP) decoder can be derived as a special case of the MBR decoder by considering a loss function that assigns an equal cost (say 1) to all misclassifications. Under the 0/1 loss function
\[ L((E, A), (E', A')) = \begin{cases} 0 & \text{if } E = E' \text{ and } A = A' \\ 1 & \text{otherwise,} \end{cases} \qquad (5) \]
the decoder of Equation 3 reduces to the MAP decoder
\[ \delta_{\mathrm{MAP}}(F) = \mathop{\mathrm{argmax}}_{(E', A')} P(E', A' \mid F). \qquad (6) \]
This illustrates why we are interested in MBR decoders
based on other loss functions: the MAP decoder is opti-
mal with respect to a loss function that is very harsh. It
does not distinguish between different types of translation
errors and good translations receive the same penalty as
poor translations.
4 Performance of MBR Decoders
We performed our experiments on the Large-Data Track
of the NIST Chinese-to-English MT task (NIST, 2003).
The goal of this task is the translation of news stories
from Chinese to English. The test set has a total of 1791
sentences, consisting of 993 sentences from the NIST
2001 MT-eval set and 878 sentences from the NIST 2002
MT-eval set. Each Chinese sentence in this set has four
reference translations.
4.1 Evaluation Metrics
The performance of the baseline and the MBR decoders
under the different loss functions was measured with re-
spect to the four reference translations provided for the
test set. Four evaluation metrics were used. These
were multi-reference Word Error Rate (mWER) (Och,
2002), multi-reference Position-independent word Error
Rate (mPER) (Och, 2002), BLEU and multi-reference
BiTree Error Rate.
Among these evaluation metrics, the BLEU score
directly takes into account multiple reference transla-
tions (Papineni et al., 2001). In case of the other metrics,
we consider multiple references in the following way. For
each sentence, we compute the error rate of the hypothe-
sis translation with respect to the most similar reference
translation under the corresponding loss function.
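A small sketch of this selection (the helper name is ours); how the per-sentence values are then aggregated over the test set differs between metrics and is left out here.

```python
def multi_reference_error(hypothesis, references, loss):
    """Per-sentence error: score the hypothesis against the most similar
    (lowest-loss) of the available reference translations."""
    return min(loss(reference, hypothesis) for reference in references)
```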
4.2 Decoder Performance
In our experiments, a baseline translation model (JHU,
2003), trained on a Chinese-English parallel corpus (NIST, 2003), was used to generate 1000-best translation
hypotheses for each Chinese sentence in the test set. The
1000-best lists were then rescored using the different
translation loss functions described in Section 2.
The English sentences in the $N$-best lists were parsed
using the Collins parser (Collins, 1999), and the Chinese
sentences were parsed using a Chinese parser provided to
us by D. Bikel (Bikel and Chiang, 2000). The English
parser was trained on the Penn Treebank and the Chinese
parser on the Penn Chinese treebank.
Under each loss function, the MBR decoding was per-
formed using Equation 3. We say we have a matched
condition when the same loss function is used in both the
error rate and the decoder design. The performance of
the MBR decoders on the NIST 2001+2002 test set is re-
ported in Table 3. For all performance metrics, we show
the 70% confidence interval with respect to the MAP
baseline computed using bootstrap resampling (Press et
al., 2002; Och, 2003). We note that this significance level
does meet the customary criteria for minimum signifi-
cance intervals of 68.3% (Press et al., 2002).
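Such intervals can be obtained by resampling test-set sentences with replacement and recomputing the statistic of interest on each resample. The sketch below is our own generic illustration over per-sentence score differences; it is not the exact procedure of Press et al. (2002) or Och (2003), and corpus-level metrics such as BLEU would be recomputed on each resampled corpus rather than averaged per sentence.

```python
import random

def bootstrap_interval(per_sentence_deltas, level=0.70, resamples=1000, seed=0):
    """Bootstrap confidence interval for the mean per-sentence difference
    between two systems (e.g. MBR minus MAP) under some metric."""
    rng = random.Random(seed)
    n = len(per_sentence_deltas)
    means = []
    for _ in range(resamples):
        sample = [per_sentence_deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((1.0 - level) / 2.0 * resamples)]
    upper = means[min(resamples - 1, int((1.0 + level) / 2.0 * resamples))]
    return lower, upper
```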
We observe in most cases that the MBR decoder under
a loss function performs the best under the correspond-
ing error metric, i.e., matched conditions perform the best.
The gains from MBR decoding under matched conditions
are statistically significant in most cases. We note that the
MAP decoder is not optimal in any of the cases. In partic-
ular, the translation performance under the BLEU metric
can be improved by using MBR relative to MAP decod-
ing. This shows the value of finding a decoding procedure matched to the performance criterion of interest.
We also notice some affinity among the loss functions. MBR decoding under the BiTree loss function performs better under WER relative to the MAP decoder, but performs poorly under the BLEU metric. The MBR decoders under WER and PER perform better than the MAP decoder under all error metrics. The MBR decoder under the BLEU loss function obtains similar (or worse) performance relative to the MAP decoder on all metrics other than BLEU.
5 Discussion
We have described the formulation of Minimum Bayes-
Risk decoders for machine translation. This is a general
framework that allows us to build special purpose de-
coders from general purpose models. The procedure aims
at direct minimization of the expected risk of translation
errors under a given loss function. In this paper we have
focused on two situations where this framework could be
applied.
Given an MT evaluation metric of interest such as
BLEU, PER or WER, we can use this metric as a loss
function within the MBR framework to design decoders
optimized for the evaluation criterion. In particular, the
MBR decoding under the BLEU loss function can yield
further improvements on top of MAP decoding.
Suppose we are interested in improving syntactic struc-
ture of automatic translations and would like to use an
existing statistical MT system that is trained without any
linguistic features. We have shown in such a situation
how MBR decoding can be applied to the MT system.
This can be done by the design of translation loss func-
tions from varied linguistic analyses. We have shown the
construction of a Bitree loss function to compare parse-
trees of any two translations using alignments with re-
spect to a parse-tree for the source sentence. The loss
function therefore avoids the problem of unconstrained
tree-to-tree alignment. Using an example, we have shown
that this loss function can measure qualities of transla-
tion that string (and $n$-gram) based metrics cannot cap-
ture. The MBR decoder under this loss function gives
improvements under an evaluation metric based on the
loss function.
We present results under the Bitree loss function as
an example of incorporating linguistic information into
a loss function; we have not yet measured its correla-
tion with human assessments of translation quality. This
loss function allows us to integrate syntactic structure into
the statistical MT framework without building detailed
models of syntactic features and retraining models from
scratch. However, we emphasize that the MBR tech-
niques do not preclude the construction of complex mod-
els of syntactic structure. Translation models that have
been trained with linguistic features could still benefit by
the application of MBR decoding procedures.
That machine translation evaluation continues to be
an active area of research is evident from recent work-
shops (AMTA, 2003). We expect new automatic MT
evaluation metrics to emerge frequently in the future.
Given any translation metric, the MBR decoding frame-
work will allow us to optimize existing MT systems for
the new criterion. This is intended to compensate for any
mismatch between decoding strategy of MT systems and
their evaluation criteria. While we have focused on de-
veloping MBR procedures for loss functions that mea-
sure various aspects of translation quality, this frame-
work can also be used with loss functions which measure
application-specific error criteria.
We now describe related training and search proce-
dures for NLP that explicitly take into consideration task-
specific performance metrics. Och (2003) developed a
training procedure that incorporates various MT evalua-
tion criteria in the training procedure of log-linear MT
models. Foster et al. (2002) developed a text-prediction
system for translators that maximizes expected benefit to
the translator under a statistical user model. In parsing,
Goodman (1996) developed parsing algorithms that are
appropriate for specific parsing metrics. There has also
been recent work that combines 1-best hypotheses from
multiple translation systems (Bangalore et al., 2002); this
approach uses string-edit distance to align the hypotheses
and rescores the resulting lattice with a language model.
In future work we plan to extend the search space of
MBR decoders to translation lattices produced by the
baseline system. Translation lattices (Ueffing et al., 2002;
Kumar and Byrne, 2003) are a compact representation of
a large set of most likely translations generated by an MT
system. While an $N$-best list contains only a limited re-
ordering of hypotheses, a translation lattice will contain
hypotheses with a vastly greater number of re-orderings.
We are developing efficient lattice search procedures for
MBR decoders. By extending the search space of the de-
coder to a much larger space than the $N$-best list, we ex-
pect further performance improvements.
Performance Metrics
Decoder                    BLEU (%)   mWER (%)   mPER (%)   mBiTree Error Rate (%)
70% Confidence Intervals   +/-0.3     +/-0.9     +/-0.6     +/-1.0
MAP (baseline)             31.2       64.9       41.3       69.0
MBR: BLEU                  31.5*      65.1       41.1       68.9
MBR: WER                   31.3       64.3*      40.8       68.5
MBR: PER                   31.3       64.6       40.4*      68.6
MBR: BiTree Loss           30.7       64.1       41.1       68.0*
Table 3: Translation performance of the MBR decoder under various loss functions on the NIST 2001+2002 Test set. For each metric, the performance under a matched condition is marked with an asterisk. Note that better results correspond to higher BLEU scores and to lower error rates.

MBR is a promising modeling framework for statistical machine translation. It is a simple model rescoring framework that improves well-trained statistical models by tuning them for particular criteria. These criteria could come from evaluation metrics or from other desiderata (such as syntactic well-formedness) that we wish to see in automatic translations.
Acknowledgments
This work was performed as part of the 2003 Johns Hop-
kins Summer Workshop research group on Syntax for
Statistical Machine Translation. We would like to thank
all the group members for providing various resources
and tools and contributing to useful discussions during
the course of the workshop.
References
AMTA. 2003. Workshop on Machine
Translation Evaluation, MT Summit IX.
www.issco.unige.ch/projects/isle/MTE-at-MTS9.html.
S. Bangalore, V. Murdock, and G. Riccardi. 2002. Boot-
strapping bilingual data using consensus translation for
a multilingual instant messaging system. In Proceed-
ings of COLING, Taipei, Taiwan.
P. J. Bickel and K. A. Doksum. 1977. Mathematical
Statistics: Basic Ideas and Selected topics. Holden-
Day Inc., Oakland, CA, USA.
D. Bikel and D. Chiang. 2000. Two statistical pars-
ing models applied to the Chinese Treebank. In Pro-
ceedings of the Second Chinese Language Processing
Workshop, pages 1–6, Hong Kong.
P. F. Brown, J. Cocke, S. A. Della Pietra, V. J. Della
Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and
P. S. Roossin. 1990. A statistical approach to machine
translation. Computational Linguistics, 16(2):79–85.
M. Collins and N. Duffy. 2002. New ranking algorithms
for parsing and tagging: Kernels over discrete struc-
tures, and the weighted perceptron. In Proceedings of
EMNLP, Philadelphia, PA, USA.
M. J. Collins. 1999. Head-driven Statistical Models for
Natural Language Parsing. Ph.D. thesis, University of
Pennsylvania, Philadelphia.
G. Doddington. 2002. Automatic evaluation of machine
translation quality using n-gram co-occurrence statis-
tics. In Proc. of HLT 2002, San Diego, CA. USA.
G. Foster, P. Langlais, and G. Lapalme. 2002. User-
friendly text prediction for translators. In Proc. of
EMNLP, Philadelphia, PA, USA.
V. Goel and W. Byrne. 2000. Minimum Bayes-risk auto-
matic speech recognition. Computer Speech and Lan-
guage, 14(2):115–135.
J. Goodman. 1996. Parsing algorithms and metrics. In
Proc. of ACL-1996, pages 177–183, Santa Cruz, CA,
USA.
JHU. 2003. Syntax for statistical machine
translation, Final report, JHU summer workshop.
http://www.clsp.jhu.edu/ws2003/groups/translate/.
S. Kumar and W. Byrne. 2002. Minimum Bayes-Risk
alignment of bilingual texts. In Proc. of EMNLP,
Philadelphia, PA, USA.
S. Kumar and W. Byrne. 2003. A weighted finite state
transducer implementation of the alignment template
model for statistical machine translation. In Proceed-
ings of HLT-NAACL, Edmonton, Canada.
I. D. Melamed, R. Green, and J. P. Turian. 2003. Preci-
sion and recall of machine translation. In Proceedings
of the HLT-NAACL, Edmonton, Canada.
NIST. 2003. The NIST Machine Translation Evalua-
tions. http://www.nist.gov/speech/tests/mt/.
F. Och. 2002. Statistical Machine Translation: From
Single Word Models to Alignment Templates. Ph.D.
thesis, RWTH Aachen, Germany.
F. Och. 2003. Minimum error rate training in statistical
machine translation. In Proc. of ACL, Sapporo, Japan.
K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2001.
Bleu: a method for automatic evaluation of machine
translation. Technical Report RC22176 (W0109-022),
IBM Research Division.
W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P.
Flannery. 2002. Numerical Recipes in C++. Cam-
bridge University Press, Cambridge, UK.
M. Tang, X. Luo, and S. Roukos. 2002. Active learning
for statistical natural language parsing. In Proceedings
of ACL 2002, Philadelphia, PA, USA.
N. Ueffing, F. Och, and H. Ney. 2002. Generation of
word graphs in statistical machine translation. In Proc.
of EMNLP, pages 156–163, Philadelphia, PA, USA.
