Minimum Bayes-Risk Word Alignments of Bilingual Texts
Shankar Kumar and William Byrne
Center for Language and Speech Processing, Johns Hopkins University,
3400 North Charles Street, Baltimore, MD, 21218, USA
{skumar,byrne}@jhu.edu
Abstract
We present Minimum Bayes-Risk word
alignment for machine translation. This
statistical, model-based approach attempts
to minimize the expected risk of align-
ment errors under loss functions that mea-
sure alignment quality. We describe var-
ious loss functions, including some that
incorporate linguistic analysis as can be
obtained from parse trees, and show that
these approaches can improve alignments
of the English-French Hansards.
1 Introduction
The automatic determination of word alignments in
bilingual corpora would be useful for Natural Lan-
guage Processing tasks such as statistical machine
translation, automatic dictionary construction, and
multilingual document retrieval. The development
of techniques in all these areas would be facili-
tated by automatic performance metrics, and align-
ment and translation quality metrics have been pro-
posed (Och and Ney, 2000b; Papineni et al., 2002).
However, given the difficulty of judging translation
quality, it is unlikely that a single, global metric will
be found for any of these tasks. It is more likely
that specialized metrics will be developed to mea-
sure specific aspects of system performance. This is
even desirable, as these specialized metrics could be
used in tuning systems for particular applications.
We have applied Minimum Bayes-Risk (MBR)
procedures developed for automatic speech recog-
nition (Goel and Byrne, 2000) to word alignment of
bitexts. This is a modeling approach that can be used
with statistical models of speech and language to de-
velop algorithms that are optimized for specific loss
functions. We will discuss loss functions that can
be used for word alignment and show how the over-
all alignment process can be improved by the use
of loss functions that incorporate linguistic features,
such as parses and part-of-speech tags.
2 Word-to-Word Bitext Alignment
We will study the problem of aligning an English
sentence to a French sentence and we will use the
word alignment of the IBM statistical translation
models (Brown et al., 1993).
Let $e = e_0^l$ and $f = f_1^m$ denote a pair of translated English and French sentences. An English word is defined as an ordered pair $e_i = (i, w_i)$, where the index $i \in \{0, 1, \ldots, l\}$ refers to the position of the word $w_i \in V_E$ in the English sentence; $V_E$ is the vocabulary of English; and the word at position $0$ is the NULL word to which “spurious” French words may be aligned. Similarly, a French word is written as $f_j = (j, v_j)$.
An alignment between $e$ and $f$ is defined to be a sequence $a = a_1^m$ where $a_j \in \{0, 1, \ldots, l\}$. Under the alignment $a$, the French word $f_j$ is connected to the English word $e_{a_j}$. For every alignment $a$, we define a link set $B = \{b_1, b_2, \ldots, b_m\}$ whose elements are given by the alignment links $b_j = (a_j, j)$.
3 Alignment Loss Functions
In this section we introduce loss functions to measure the quality of automatically produced alignments. Suppose we wish to compare an automatically produced alignment $a'$ to a reference alignment $a$, which we assume was produced by a competent translator. We will define various loss functions $L(B, B')$ that measure the quality of $a'$ relative to $a$ through their link sets $B'$ and $B$.
The desirable qualities in translation are fluency and adequacy. We assume here that both word sequences are fluent and adequate translations but that the word and phrase correspondences are unknown. It is these correspondences that we wish to determine and evaluate automatically.
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, July 2002, pp. 140-147. Association for Computational Linguistics.
We now present two general classes of loss func-
tions that measure alignment quality. In subsequent
sections, we will give specific examples of these and
show how to construct decoders that are optimized
for each loss function.
3.1 Alignment Error
The Alignment Error Rate (AER) introduced by Och and Ney (2000b) measures the fraction of links by which the automatic alignment differs from the reference alignment. Links to the NULL word are ignored. This is done by defining modified link sets for the reference alignment, $\bar{B} = B - \{(0, j)\}_{j=1}^{m}$, and the automatic alignment, $\bar{B}' = B' - \{(0, j)\}_{j=1}^{m}$.
The reference annotation procedure allowed the human transcribers to identify which links in $\bar{B}$ they judged to be unambiguous. In addition to the reference alignment, this gives a set of sure links ($S$) which is a subset of $\bar{B}$.
AER is defined as (Och and Ney, 2000b)
$$\mathrm{AER}(S, B, B') = 1 - \frac{|\bar{B}' \cap S| + |\bar{B}' \cap \bar{B}|}{|\bar{B}'| + |S|} \qquad (1)$$
Since our modeling techniques require loss functions rather than error rates, we introduce the Alignment Error loss function
$$L_{AE}(B, B') = |\bar{B}| + |\bar{B}'| - 2\,|\bar{B} \cap \bar{B}'| \qquad (2)$$
We consider error rates to be “normalized” loss functions. We also note that, unlike AER, $L_{AE}$ does not distinguish between ambiguous and unambiguous links. However, if a decoder generates an alignment $B'$ for which $L_{AE}(B, B')$ is zero, then $\bar{B}' = \bar{B}$ and, because $S \subseteq \bar{B}$, the AER is also zero. Therefore if AER is the metric of interest, we will design alignment procedures to minimize $L_{AE}$.
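As a concrete illustration, both quantities can be computed directly on link sets. The following is a minimal Python sketch, assuming links are stored as (English position, French position) tuples with English position 0 standing for NULL; the function names and the toy alignments are illustrative and are not taken from the AER scoring software used in the experiments.

```python
def strip_null(links):
    """Drop links to the NULL word (English position 0)."""
    return {(i, j) for (i, j) in links if i != 0}


def alignment_error(ref, hyp):
    """Alignment Error loss: |R| + |H| - 2|R & H| on the non-NULL link sets."""
    r, h = strip_null(ref), strip_null(hyp)
    return len(r) + len(h) - 2 * len(r & h)


def aer(sure, ref, hyp):
    """Alignment Error Rate: 1 - (|H & S| + |H & R|) / (|H| + |S|).

    `sure` is assumed to be a subset of the non-NULL reference links.
    """
    r, h = strip_null(ref), strip_null(hyp)
    return 1.0 - (len(h & sure) + len(h & r)) / (len(h) + len(sure))


# Toy data (invented): the reference sends French word 3 to NULL,
# while the hypothesis links it to English word 3.
ref = {(1, 1), (2, 2), (0, 3), (3, 4)}
sure = {(1, 1), (2, 2)}
hyp = {(1, 1), (2, 2), (3, 3), (3, 4)}

print(alignment_error(ref, hyp))
print(aer(sure, ref, hyp))
```

On this toy pair the hypothesis contains one non-NULL link that is absent from the reference, so the loss is 1 and the AER is 1/6. Scoring the reference against itself gives zero under both measures.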
3.2 Generalized Alignment Error
We are interested in extending the Alignment Error loss function to incorporate various linguistic features into the measurement of alignment quality. The Generalized Alignment Error loss is defined as
$$L_{GAE}(B, B') = 2 \sum_{b \in B} \sum_{b' \in B'} \delta_j(j')\, d_{ii'} \qquad (3)$$
where $b = (i, j)$, $b' = (i', j')$ and
$$\delta_j(j') = \begin{cases} 1 & \text{if } j = j' \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$
Here we have introduced the word-to-word distance measure $d_{ii'} = D(e_i, e_{i'}; f_j)$ which compares the links $(i, j)$ and $(i', j)$ as a function of the words in the translation. $L_{GAE}$ refers to all loss functions that have the form of Equation 3. Specific loss functions are determined through the choice of $d_{ii'}$. To see the value in this, suppose $f_j$ is a verb in the French sentence and that it is aligned in the reference alignment to $e_i$, the verb in the English sentence. If our goal is to ensure verb alignment, then $d_{ii'}$ can be constructed to penalize any link $(i', j)$ in the automatic alignment in which $e_{i'}$ is not a verb. We will later give examples of distances in which $d_{ii'}$ is based on Part-of-Speech (POS) tags, parse tree distances, and automatically determined word clusters. We note that $L_{GAE}$ can almost be reduced to $L_{AE}$ (take $d_{ii'} = 0$ if $i = i'$ and $1$ otherwise), except for the treatment of NULL in the English sentence.
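The role of the distance measure can be made concrete with a short Python sketch. It assumes the loss has the form "twice the sum, over pairs of reference and hypothesis links that share a French position, of a distance between their English positions"; the helper names and the toy alignments are hypothetical.

```python
def gae_loss(ref, hyp, dist):
    """Generalized Alignment Error: 2 * sum of dist(i, i') over link pairs
    (i, j) in ref and (i', j') in hyp with the same French position j = j'."""
    total = 0.0
    for (i, j) in ref:
        for (i2, j2) in hyp:
            if j == j2:  # plays the role of the indicator in Equation 4
                total += dist(i, i2)
    return 2.0 * total


def zero_one(i, i2):
    """Simplest distance: 0 for identical English positions, 1 otherwise."""
    return 0.0 if i == i2 else 1.0


# Toy data (invented): links are (English position, French position),
# English position 0 is NULL.
ref = {(1, 1), (2, 2), (0, 3), (3, 4)}
hyp = {(1, 1), (2, 2), (3, 3), (3, 4)}

print(gae_loss(ref, hyp, zero_one))
```

With the 0/1 distance the single disagreement (French word 3 linked to NULL in the reference but to English word 3 in the hypothesis) costs 2, because NULL links are counted like any other link here. This is the difference in the treatment of NULL mentioned above.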
4 Minimum Bayes-Risk Decoding For
Automatic Word Alignment
We present the Minimum Bayes-Risk alignment for-
mulation and derive MBR alignment procedures un-
der the loss functions of Section 3.
Given a translated pair of English-French sentences $(e, f)$, the decoder $\delta(e, f)$ produces an alignment $B' = \delta(e, f)$. Relative to a reference alignment $B$, the decoder performance is measured as $L(B, \delta(e, f))$. Our goal is to find the decoder that has the best performance over all translated sentences. This is measured through the Bayes Risk $R(\delta) = E_{P(B, e, f)}[L(B, \delta(e, f))]$. The expectation is taken with respect to the true distribution $P(B, e, f)$ that describes “human quality” alignments of translations as they are found in bitext.
Given a loss function and a probability distribution, it is well known that the decision rule which minimizes the Bayes Risk is given by the following expression (Bickel and Doksum, 1977; Goel and Byrne, 2000).
$$\hat{B} = \arg\min_{B' \in \mathcal{B}} \sum_{B \in \mathcal{B}} L(B, B')\, P(B \mid e, f) \qquad (5)$$
Several modeling assumptions have been made to obtain this form of the decoder. We do not have access to the true distribution over translations. We therefore use statistical MT models to approximate $P(B \mid e, f)$. We furthermore assume that the space of alignment alternatives can be restricted to an alignment lattice $\mathcal{B}$, which is a compact representation of the most likely word alignments of the sentence pair under the baseline models.
It is clear from Equation 5 that the MBR decoder is determined by the loss function. The Sentence Alignment Error refers to the loss function that gives a penalty of 1 for any errorful alignment: $L_{SAE}(B, B') = 1_{\{B \neq B'\}}$, where $1_C$ is the indicator function of the set $C$. The MBR decoder under this loss can easily be seen to be the Maximum Likelihood (ML) alignment under the MT models: $\hat{B} = \arg\max_{B' \in \mathcal{B}} P(B' \mid e, f)$. This illustrates why we
are interested in MBR decoders based on other loss
functions: the ML decoder is optimal with respect to
a loss function that is overly harsh. It does not dis-
tinguish between different types of alignment errors
and good alignments receive the same penalty as
poor alignments. Moreover, such a harsh penalty is
particularly inappropriate when unambiguous word-
to-word alignments cannot be provided in all cases
even by human translators who produce the refer-
ence alignments. The AER makes an explicit dis-
tinction between ambiguous and unambiguous word
alignments. Ideally, the decoder should be able to do
so as well. Motivated by this, the MBR hypothesis
can be thought of as the consensus hypothesis un-
der a particular loss function: Equation 5 selects the
hypothesis that is, in an average sense, close to the
other likely hypotheses. In this way, ambiguity can
be reduced by selecting the hypothesis that is “most
similar” to the collection of most likely competing
hypotheses.
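The consensus behaviour can be seen on a toy example. The sketch below applies the decision rule of Equation 5 literally to an explicit list of three alignments, assuming their posterior probabilities are given; the probabilities and alignments are invented for illustration, and enumerating hypotheses like this is only feasible for very small lattices.

```python
def mbr_decode(hyps, posteriors, loss):
    """Pick the hypothesis with the smallest expected loss under `posteriors`."""
    def risk(cand):
        return sum(p * loss(ref, cand) for ref, p in zip(hyps, posteriors))
    return min(hyps, key=risk)


def ae_loss(ref, hyp):
    """Alignment Error loss on non-NULL links (English position 0 is NULL)."""
    r = {b for b in ref if b[0] != 0}
    h = {b for b in hyp if b[0] != 0}
    return len(r) + len(h) - 2 * len(r & h)


def sae_loss(ref, hyp):
    """Sentence Alignment Error: 1 unless the two alignments are identical."""
    return 0.0 if ref == hyp else 1.0


# Three competing alignments of a two-word French sentence (invented numbers).
a1 = frozenset({(1, 1), (2, 2)})
a2 = frozenset({(2, 1), (1, 2)})
a3 = frozenset({(2, 1), (2, 2)})
hyps = [a1, a2, a3]
posteriors = [0.4, 0.3, 0.3]

print(sorted(mbr_decode(hyps, posteriors, sae_loss)))  # the ML alignment a1
print(sorted(mbr_decode(hyps, posteriors, ae_loss)))   # the consensus a3
```

Here `a1` is the single most probable alignment, but `a3` shares a link with each competitor and therefore has the lowest expected Alignment Error (1.4, against 1.8 for `a1` and 2.2 for `a2`).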
We now describe the alignment lattice (Sec-
tion 4.1) and introduce the lattice based probabilities
required for the MBR alignment (Section 4.2). The
derivation of the MBR alignment under the AE and
GAE loss functions is presented in Sections 4.3 and
4.4.
4.1 Alignment Lattice
The lattice $\mathcal{B}$ is represented as a Weighted Finite State Transducer (WFST) (Mohri et al., 2000) $\mathcal{B} = (Q, \Lambda, q_s, F, T)$ with a finite set of states $Q$, a set of transition labels $\Lambda$, an initial state $q_s$, the set of final states $F$, and a finite set of transitions $T$. A transition in this WFST is given by $t = (p, q, b, s)$ where $p$ is the starting state, $q$ is the ending state, $b$ is the alignment link and $s$ is the weight. For an English sentence of length $l$ and a French sentence of length $m$, we define $\Lambda$ as $\{(i, j) : 0 \le i \le l,\ 1 \le j \le m\}$.
A complete path through the WFST is a sequence of transitions given by $\{t_k = (p_k, q_k, b_k, s_k)\}_{k=1}^{m}$ such that $p_1 = q_s$ and $q_m \in F$. Each complete path defines an alignment link set $B = \{b_k\}_{k=1}^{m}$. When we write $B \in \mathcal{B}$, we mean that $B$ is derived from a complete path through $\mathcal{B}$. This allows us to use alignment models in which the negative log probability of an alignment can be written as a sum over alignment link weights, i.e. $-\log P(B, f \mid e) = \sum_{k=1}^{m} s_k$.
4.2 Alignment Link Posterior Probability
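We first introduce the lattice transition posterior probability $P(t \mid \mathcal{B})$ of each transition $t \in T$ in the lattice
$$P(t \mid \mathcal{B}) = \sum_{B' \in \mathcal{B}} \delta_t(B')\, P(B' \mid e, f) \qquad (6)$$
where $\delta_t(B')$ is $1$ if $t \in B'$ (the complete path that defines $B'$ passes through $t$) and $0$ otherwise. The lattice transition posterior probability is the sum of the posterior probabilities of all lattice paths passing through the transition $t$. This can be computed very efficiently with a forward-backward algorithm on the alignment lattice $\mathcal{B}$ (Wessel et al., 1998). $P(B' \mid e, f)$ is the posterior probability of an alignment link set $B'$ which can be written as
$$P(B' \mid e, f) = \frac{P(B', f \mid e)}{\sum_{B'' \in \mathcal{B}} P(B'', f \mid e)} \qquad (7)$$
We now define the alignment link posterior probability $P(b \mid \mathcal{B})$ for a link $b = (i, j)$
$$P(b \mid \mathcal{B}) = \sum_{t \in T} \delta_b(t)\, P(t \mid \mathcal{B}) \qquad (8)$$
where $\delta_b(t)$ is $1$ if the transition $t$ carries the link $b$ and $0$ otherwise. This is the probability that any two words $(e_i, f_j)$ are aligned given all the alignments in the lattice $\mathcal{B}$.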
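The forward-backward computation can be sketched in a few lines of Python. This is not the implementation used in the experiments: it assumes an acyclic lattice whose states are numbered in topological order, transition weights that are negative log probabilities, and a hypothetical tuple layout `(p, q, link, cost)`.

```python
import math
from collections import defaultdict


def link_posteriors(transitions, start, finals):
    """Return {link: posterior} for an acyclic lattice.

    transitions: list of (p, q, link, cost) with cost = -log weight and p < q.
    """
    # Forward pass: alpha[q] is the total weight of partial paths start -> q.
    alpha = defaultdict(float)
    alpha[start] = 1.0
    for p, q, _, cost in sorted(transitions, key=lambda t: t[0]):
        alpha[q] += alpha[p] * math.exp(-cost)

    # Backward pass: beta[p] is the total weight of partial paths p -> final.
    beta = defaultdict(float)
    for f in finals:
        beta[f] = 1.0
    for p, q, _, cost in sorted(transitions, key=lambda t: -t[1]):
        beta[p] += math.exp(-cost) * beta[q]

    z = beta[start]  # total weight of all complete paths
    post = defaultdict(float)
    for p, q, link, cost in transitions:
        # Transition posterior, accumulated per link.
        post[link] += alpha[p] * math.exp(-cost) * beta[q] / z
    return dict(post)


# Toy lattice (invented): three complete paths with weights 0.3, 0.15 and 0.2.
lattice = [
    (0, 1, (1, 1), -math.log(0.6)),
    (0, 2, (2, 1), -math.log(0.4)),
    (1, 3, (2, 2), -math.log(0.5)),
    (1, 3, (0, 2), -math.log(0.25)),
    (2, 3, (2, 2), -math.log(0.5)),
]
post = link_posteriors(lattice, start=0, finals={3})
for link in sorted(post):
    print(link, round(post[link], 4))
```

The link `(2, 2)` appears on two transitions, so its posterior is the sum of both transition posteriors, (0.3 + 0.2) / 0.65. The posteriors of all links for a given French position sum to one.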
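4.3 MBR Alignment Under $L_{AE}$
In this section we derive MBR alignment under the Alignment Error loss function (Equation 2). The optimal decoder has the form (Equation 5)
$$\hat{B} = \arg\min_{B' \in \mathcal{B}} \sum_{B \in \mathcal{B}} L_{AE}(B, B')\, P(B \mid e, f) \qquad (9)$$
The summation is equal to
$$\sum_{B \in \mathcal{B}} |\bar{B}|\, P(B \mid e, f) + \Big[\, |\bar{B}'| - 2 \sum_{B \in \mathcal{B}} |\bar{B} \cap \bar{B}'|\, P(B \mid e, f) \Big]$$
The first term does not depend on $B'$ and can be ignored in the minimization. If $\bar{T}$ is the subset of transitions ($\bar{T} \subset T$) that do not contain links with the NULL word, we can simplify the bracketed term as
$$\sum_{t' \in B' \cap \bar{T}} \Big[\, 1 - 2 \sum_{B \in \mathcal{B}} \delta_{b'}(B)\, P(B \mid e, f) \Big]$$
where $t'$ ranges over the transitions on the complete path that defines $B'$, $b'$ is the link on $t'$, and $\delta_{b'}(B)$ is $1$ if $b' \in B$ and $0$ otherwise.
For an alignment link $b'$ we note that $\sum_{B \in \mathcal{B}} \delta_{b'}(B)\, P(B \mid e, f) = P(b' \mid \mathcal{B})$. Therefore, the MBR alignment (Equation 9) can be found in terms of the modified link weight for each alignment link
$$\hat{B} = \arg\min_{B' \in \mathcal{B}} \sum_{t' \in B' \cap \bar{T}} \big[\, 1 - 2 P(b' \mid \mathcal{B}) \big] \qquad (10)$$
We can rewrite the above equation as
$$\hat{B} = \arg\min_{B' \in \mathcal{B}} \sum_{t' \in B'} s'(t'), \qquad s'(t') = \begin{cases} 0 & \text{if } t' \notin \bar{T} \\ 1 - 2 P(b' \mid \mathcal{B}) & \text{otherwise} \end{cases} \qquad (11)$$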
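4.4 MBR Alignment Under $L_{GAE}$
We now derive MBR alignment under the Generalized Alignment Error loss function (Equation 3). The optimal decoder has the form (Equation 5)
$$\hat{B} = \arg\min_{B' \in \mathcal{B}} \sum_{B \in \mathcal{B}} L_{GAE}(B, B')\, P(B \mid e, f) \qquad (12)$$
The summation can be rewritten as
$$2 \sum_{b' \in B'} \Big[ \sum_{B \in \mathcal{B}} \sum_{b \in B} \delta_j(j')\, d_{ii'}\, P(B \mid e, f) \Big]$$
where $b = (i, j)$ and $b' = (i', j')$.
We can simplify the bracketed term as
$$\sum_{t \in T} \delta_j(j')\, d_{ii'}\, P(t \mid \mathcal{B})$$
where $t = (p, q, b, s)$ and $b = (i, j)$.
The MBR alignment (Equation 12) can be found in terms of the modified link weight for each alignment link
$$\hat{B} = \arg\min_{B' \in \mathcal{B}} \sum_{t' \in B'} s'(t'), \qquad s'(t') = 2 \sum_{t \in T} \delta_j(j')\, d_{ii'}\, P(t \mid \mathcal{B}) \qquad (13)$$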
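The modified weight for a hypothesis link can be computed from link posteriors alone, because summing transition posteriors over all transitions that carry the same link gives the link posterior. The sketch below assumes those posteriors are already available as a dictionary; the function name, the posterior values and the 0/1 distance are illustrative choices.

```python
def gae_link_weight(link, link_post, dist):
    """Expected distance (times 2) between hypothesis link (i', j') and every
    lattice link (i, j) that has the same French position j = j'."""
    i2, j2 = link
    return 2.0 * sum(p * dist(i, i2)
                     for (i, j), p in link_post.items() if j == j2)


def zero_one(i, i2):
    """0 for identical English positions, 1 otherwise."""
    return 0.0 if i == i2 else 1.0


# Hypothetical link posteriors for a two-word French sentence.
link_post = {(1, 1): 0.4, (2, 1): 0.6, (2, 2): 0.7, (1, 2): 0.3}

w = gae_link_weight((2, 2), link_post, zero_one)
print(round(w, 6))
```

With the 0/1 distance the weight of a link equals twice the posterior mass of its competitors at the same French position, here 2 × 0.3. A linguistically informed distance replaces this flat penalty with one that depends on which competitor carries that mass.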
4.5 MBR Alignment Using WFST Techniques
The MBR alignment procedures under the $L_{AE}$ and $L_{GAE}$ loss functions begin with a WFST $\mathcal{B}$ that contains the alignment probabilities $P(B, f \mid e)$ as described in Section 4.1. To build the MBR decoder for each loss function the weights on the transitions ($t = (p, q, b, s)$) of the WFST are modified according to either Equation 11 ($L_{AE}$) or Equation 13 ($L_{GAE}$). Once the weights are modified, the search procedure for the MBR alignment is the same in each case. The search is carried out using a shortest-path algorithm (Mohri et al., 2000).
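The reweight-then-search procedure can be sketched without the FSM toolkit. The Python below is an illustration under stated assumptions, not the toolkit's algorithm: the lattice is acyclic with states numbered in topological order, link posteriors are given, and the Alignment Error weight is taken to be 0 for NULL links and 1 - 2P(link) otherwise. Because such weights can be negative, the search is a dynamic program over the acyclic lattice rather than Dijkstra's algorithm.

```python
def ae_weight(link, link_post):
    """Modified weight for the Alignment Error loss: NULL links cost nothing,
    other links cost 1 - 2 * posterior."""
    i, _ = link
    return 0.0 if i == 0 else 1.0 - 2.0 * link_post[link]


def best_path(transitions, start, finals, weight):
    """Minimum-weight complete path; transitions are (p, q, link) with p < q."""
    best = {start: (0.0, [])}
    for p, q, link in sorted(transitions, key=lambda t: t[0]):
        if p not in best:
            continue  # state not reachable from the start state
        cost = best[p][0] + weight(link)
        if q not in best or cost < best[q][0]:
            best[q] = (cost, best[p][1] + [link])
    return min((best[f] for f in finals if f in best), key=lambda x: x[0])


# Toy lattice (invented) with three complete paths and their link posteriors.
lattice = [
    (0, 1, (1, 1)),
    (0, 2, (2, 1)),
    (1, 3, (2, 2)),
    (2, 3, (1, 2)),
    (2, 3, (2, 2)),
]
link_post = {(1, 1): 0.4, (2, 1): 0.6, (2, 2): 0.7, (1, 2): 0.3}

cost, links = best_path(lattice, 0, {3}, lambda b: ae_weight(b, link_post))
print(links, round(cost, 6))
```

In this example the path through `(1, 1)` is the single most probable one (posterior 0.4), yet the reweighted search returns `[(2, 1), (2, 2)]`, whose links are individually the most strongly supported by the lattice.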
5 Word Alignment Experiments
We present here examples of Generalized Align-
ment Error loss functions based on three types of
linguistic features and show how they can be incor-
porated into a statistical MT system to obtain auto-
matic alignments.
5.1 Syntactic Distances From Parse-Trees
Suppose a parser is available that generates a parse-tree for the English sentence. Our goal is to construct an alignment loss function that incorporates features from the parse. One way to do this is to define a graph distance
$$d_{ii'} = g(w_i, w_{i'}) \qquad (14)$$
Here $w_i$ and $w_{i'}$ are the parse-tree leaf nodes corresponding to the English words $e_i$ and $e_{i'}$. This quantity is computed as the sum of the distances from each node to their closest common ancestor. It gives a syntactic distance between any pair of English words based on the parse-tree. This distance has been used to measure word association for information retrieval (Mittendorfer and Winiwarter, 2001). It reflects how strongly the words $e_i$ and $e_{i'}$ are bound together by the syntactic structure of the English sentence as determined by the parser. Figure 1 shows the parse tree for an English sentence in the test data with the pairwise syntactic distances between the English words corresponding to the leaf nodes.
(TOP
  (S
    (NP (PRP i))
    (VP (VBP think)
      (SBAR
        (S
          (NP (DT that))
          (VP (VBZ is)
            (ADJP (JJ good) (. .))))))))
Pairwise Distances
g("i", "think") = 4
g("i", "that") = 7
g("i", "is") = 7
g("i", "good") = 8
g("i", ".") = 8
Figure 1: Parse tree for an English sentence with the pairwise syntactic distances between words.
To obtain these distances, Ratnaparkhi’s part-of-speech (POS) tagger (Ratnaparkhi, 1996) and Collins’ parser (Collins, 1999) were used to obtain parse trees for the English side of the test corpus. With $d_{ii'}$ defined as in Equation 14, the Generalized Alignment Error loss function (Equation 3) is called the Parse-Tree Syntactic Distance ($L_{PTSD}$).
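The distances listed in Figure 1 can be reproduced with a short script. The sketch below rests on two assumptions inferred from the figure rather than stated in the text: distances are counted from the POS-tag node directly above each word, and the sentence-final period is attached under ADJP. The nested-tuple tree encoding and the function names are hypothetical, and words are used as dictionary keys, which only works when each word occurs once in the sentence.

```python
def leaf_paths(tree, prefix=()):
    """Map each word to the child-index path from the root to its POS node."""
    label, children = tree
    if isinstance(children, str):  # preterminal node: (tag, word)
        return {children: prefix}
    paths = {}
    for k, child in enumerate(children):
        paths.update(leaf_paths(child, prefix + (k,)))
    return paths


def tree_distance(paths, w1, w2):
    """Edges from each POS node up to the closest common ancestor, summed."""
    a, b = paths[w1], paths[w2]
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


# The Figure 1 parse, with the period placed under ADJP (an assumption).
tree = ("TOP", [("S", [
    ("NP", [("PRP", "i")]),
    ("VP", [("VBP", "think"),
            ("SBAR", [("S", [
                ("NP", [("DT", "that")]),
                ("VP", [("VBZ", "is"),
                        ("ADJP", [("JJ", "good"), (".", ".")])]),
            ])])]),
])])

paths = leaf_paths(tree)
for word in ["think", "that", "is", "good", "."]:
    print("i", word, tree_distance(paths, "i", word))
```

Under these assumptions the script yields 4, 7, 7, 8 and 8, matching the values in Figure 1.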
5.2 Distances Derived From Part-of-Speech
Labels
Suppose a Part-of-Speech (POS) tagger is available to tag each word in the English sentence. If $\mathrm{POS}(e_i)$ denotes the POS of the English word $e_i$, we can define the word-to-word distance measure $d_{ii'}$ (Section 3.2) as
$$d_{ii'} = \begin{cases} 0 & \text{if } \mathrm{POS}(e_i) = \mathrm{POS}(e_{i'}) \\ 1 & \text{otherwise} \end{cases} \qquad (15)$$
Ratnaparkhi’s POS tagger (Ratnaparkhi, 1996) was used to obtain POS tags for each word in the English sentence. With $d_{ii'}$ specified by Equation 15, the Generalized Alignment Error loss function (Equation 3) is called the Part-Of-Speech Distance ($L_{POSD}$).
5.3 Automatic Word Cluster Distances
Suppose we are working in a language for which parsers and POS taggers are not available. In this situation we might wish to construct the loss functions based on word classes determined by automatic clustering procedures. If $C(e_i)$ specifies the word cluster for the English word $e_i$, then we define the distance
$$d_{ii'} = \begin{cases} 0 & \text{if } C(e_i) = C(e_{i'}) \\ 1 & \text{otherwise} \end{cases} \qquad (16)$$
In our experiments we obtained word clusters for English words using a statistical learning procedure (Kneser and Ney, 1991) where the total number of word classes is restricted to be 100. With $d_{ii'}$ as defined in Equation 16, the Generalized Alignment Error loss function (Equation 3) is called the Automatic Word Class Distance ($L_{AWCD}$).
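The POS-based and cluster-based distances share one form: zero when two English words carry the same label and one otherwise. A minimal Python sketch of that shared form follows. The POS tags are those shown in Figure 1, while the label given to NULL and the cluster ids are made up for illustration.

```python
def class_distance(label_of):
    """Build d(i, i'): 0 if English positions i and i' share a label, else 1."""
    def dist(i, i2):
        return 0.0 if label_of[i] == label_of[i2] else 1.0
    return dist


# English side "i think that is good ." with position 0 reserved for NULL.
# The NULL label is an assumption of this sketch.
pos_tags = {0: "NULL", 1: "PRP", 2: "VBP", 3: "DT", 4: "VBZ", 5: "JJ", 6: "."}
# Hypothetical cluster ids that happen to put both verbs in one cluster.
clusters = {0: 0, 1: 17, 2: 42, 3: 8, 4: 42, 5: 63, 6: 99}

d_pos = class_distance(pos_tags)
d_cls = class_distance(clusters)

# "think" (position 2) versus "is" (position 4)
print(d_pos(2, 4), d_cls(2, 4))
```

The two labelings can disagree: the tagger separates VBP from VBZ, so the POS distance between "think" and "is" is 1, while the hypothetical clustering groups them and gives 0. A decoder tuned to one labeling is therefore not guaranteed to do well under a metric built on the other.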
5.4 IBM-3 Word Alignment Models
Since the true distribution over alignments is not known, we used the IBM-3 statistical translation model (Brown et al., 1993) to approximate $P(B \mid e, f)$. This model is specified through four components: Fertility probabilities for words; Fertility probabilities for NULL; Word Translation probabilities; and Distortion probabilities. We used a modified version of the IBM-3 distortion model (Knight and Al-Onaizan, 1998) in which each of the $m!$ possible permutations of the French sentence is equally likely. The IBM-3 models
were trained on a subset of the Canadian Hansards
French-English data which consisted of 50,000 par-
allel sentences (Och and Ney, 2000b). The vocab-
ulary size was 18,499 for English and 24,198 for
French. The GIZA++ toolkit (Och and Ney, 2000a)
was used for training the IBM-3 models (as in (Och
and Ney, 2000b)).
5.5 Word Alignment Lattice Generation
We obtained word alignments under the
modified IBM-3 models using the finite
state translation framework introduced by
Knight and Al-Onaizan (1998). The finite state
operations were carried out using the AT&T Finite
State Machine Toolkit (Mohri et al., 2001; Mohri et
al., 2000).
The WFST framework involves building a transducer for each constituent of the IBM-3 Alignment Models: the word fertility model $M$; the NULL fertility model $N$; and the word translation model $T$ (Section 5.4). For each sentence pair we also built a finite state acceptor $E$ that accepts the English sentence and another acceptor $F$ which accepts all legal permutations of the French sentence. The alignment lattice $\mathcal{B}$ for the sentence pair was then obtained by the following weighted finite state composition $\mathcal{B} = E \circ M \circ N \circ T \circ F$. In practice, the WFST obtained by the composition was pruned to a maximum of 10,000 states using a likelihood based pruning operation. In terms of AT&T Finite State Toolkit shell commands, these operations are given as:
fsmcompose E M | fsmcompose - N |
fsmcompose - T | fsmcompose - F |
fsmprune -n 10000
The finite state composition and pruning were per-
formed using lazy implementations of algorithms
provided in AT&T Finite State libraries (Mohri et
al., 2000). This made the computation efficient be-
cause even though five WFSTs are composed into
a potentially huge transducer, only a small portion
of it is actually searched during the pruning used to
generate the final lattice.
A heavily pruned alignment lattice $\mathcal{B}$ for a sentence-pair from the test data is shown in Figure 2. For clarity of presentation, each alignment link $b = (i, j)$ in the lattice is shown as an ordered pair $w_i\_i : v_j\_j$ where $w_i$ and $v_j$ are the English and French words on the link. For each sentence, we also computed the lattice path with the highest probability $P(B', f \mid e)$. This gives the ML alignment under the statistical MT models that will give our baseline performance under the various loss functions.
5.6 Performance Under The Alignment Error
Rates
Our unseen test data consisted of 207 French-
English sentence pairs from the Hansards cor-
pus (Och and Ney, 2000b). These sentence pairs had
at most 16 words in the French sentence; this restric-
tion on the sentence length was necessary to control
the memory requirements of the composition.
5.6.1 MBR Consensus Alignments
In the previous sections we introduced a total of four loss functions: $L_{AE}$, $L_{PTSD}$, $L_{POSD}$ and $L_{AWCD}$. Using either Equation 11 or 13, an MBR decoder can be constructed for each. These decoders are called MBR-AE, MBR-PTSD, MBR-POSD, and MBR-AWCD, respectively.
5.6.2 Evaluation Metrics
The performance of the four decoders was mea-
sured with respect to the alignments provided by hu-
man experts (Och and Ney, 2000b). The first eval-
uation metric used was the Alignment Error Rate
(Equation 1). We also evaluated each decoder un-
der the Generalized Alignment Error Rates (GAER).
These are defined as:
(17)
There are six variants of GAER. These arise when $L_{GAE}$ is specified by $L_{PTSD}$, $L_{POSD}$ or $L_{AWCD}$. There are two versions of each of these: one version is sensitive only to sure (S) links; the other version considers all (A) links in the reference alignment. We therefore have the following six Generalized Alignment Error Rates: PTSD-S, POSD-S, AWCD-S, and PTSD-A, POSD-A, AWCD-A. We say we have a matched condition when the same loss function is used in both the error rate and the decoder design.


 
[Figure 2 graphic: a lattice with states 0 to 12 whose transitions carry link labels and weights, e.g. NULL_0:a_4/5.348, it_1:ce_1/2.344, is_2:est_2/1.349, quite_3:tout_3/4.132, quite_3:fait_5/4.405, understandable_4:comprehensible_6/2.161, ._5:._7/0.432.]
Figure 2: A heavily pruned alignment lattice for the English-French sentence pair
e=“it is quite understandable .” f=“ce est tout a fait comprehensible .”.
5.6.3 Decoder Performance
The performance of the decoders under various
loss functions is given in Table 1. We observe that
in none of these experiments was the ML decoder
found to be optimal. In all instances, the MBR
decoder tuned for each loss function was the best
performing decoder under the corresponding error
rate. In particular, we note that alignment perfor-
mance as measured under the AER metric can be
improved by using MBR instead of ML alignment.
This demonstrates the value of finding decoding pro-
cedures matched to the performance criterion of in-
terest.
We observe some affinity among the loss functions. In particular, the ML decoder performs better under the AER than any of the MBR-GAE decoders. This is because the $L_{SAE}$ loss, for which the ML decoder is optimal, is closer to the $L_{AE}$ loss than any of the $L_{GAE}$ loss functions. The NULL symbol is treated quite differently under $L_{AE}$ and $L_{GAE}$, and this leads to a large mismatch between the MBR-GAE decoders and the AER metric. Similarly, the performance of the MBR-POSD decoder degrades significantly under the AWCD-S and AWCD-A metrics. Since there are more word clusters (100) than POS tags (55), the MBR-POSD decoder is incapable of producing hypotheses that can match the word clusters used in the AWCD metrics.
6 Discussion And Conclusions
We have presented a Minimum Bayes-Risk decod-
ing strategy for obtaining word alignments of bilin-
gual texts. MBR decoding is a general formulation
that allows the construction of specialized decoders
from general purpose models. The strategy aims at
direct minimization of the expected risk of align-
ment errors under a given alignment loss function.
We have introduced several alignment loss func-
tions to measure the alignment quality. These in-
corporate information from varied analyses, such
as parse trees, POS tags, and automatically derived
word clusters. We have derived and implemented
lattice based MBR consensus decoders under these
loss functions. These decoders rescore the lattices
produced by maximum likelihood decoding to pro-
duce the optimal MBR alignments.
We have chosen to present MBR decoding using
the IBM-3 statistical MT models implemented via
WFSTs. However MBR decoding is not restricted
to this framework. It can be applied more broadly
using other MT model architectures that might be
selected for reasons of modeling fidelity or compu-
tational efficiency.
We have presented these alignment loss functions
to explore how linguistic knowledge might be in-
corporated into machine translation systems without
building detailed statistical models of these linguis-
tic features. However we stress that the MBR decod-
ing procedures described here do not preclude the
construction of complex MT models that incorporate
linguistic features. The application of such mod-
els, which could be trained using conventional max-
imum likelihood estimation techniques, should still
benefit by the application of MBR decoding tech-
niques.
In future work we will investigate loss functions
that incorporate French and English parse-tree infor-
mation into the alignment decoding process. Our ul-
timate goal, towards which this work is the first step,
is to construct loss functions that take advantage of
linguistic structures such as syntactic dependencies
found through monolingual analysis of the sentences
to be aligned. Recent work (Hwa et al., 2002) suggests
that translational correspondence of linguistic
structures can indeed be useful in projecting parses
across languages. Our ideal would be to construct
MBR decoders based on loss functions that are sen-
sitive both to word alignment as well as to agreement
in higher level structures such as parse trees. In this
way ambiguity present in word-to-word alignments
will be resolved by the alignment of linguistic struc-
tures.
                     Generalized Alignment Error Rates
Decoder     AER      PTSD-S   POSD-S   AWCD-S   PTSD-A   POSD-A   AWCD-A
ML          18.13    3.13     4.35     4.69     29.39    51.36    54.58
MBR-AE      14.87*   1.34     1.89     1.94     19.81    36.42    38.58
MBR-PTSD    23.26    0.62*    0.69     0.82     14.45*   26.76    28.42
MBR-POSD    28.60    2.43     0.69*    3.23     15.70    26.28*   29.48
MBR-AWCD    24.71    1.00     0.95     0.86*    14.92    26.83    28.39*
Table 1: Performance (%) of the MBR decoders under the Alignment Error and Generalized Alignment
Error Rates. For each metric the error rate of the matched decoder is marked with an asterisk.
MBR alignment is a promising modeling frame-
work for the detailed linguistic annotation of bilin-
gual texts. It is a simple model rescoring formalism
that improves well trained statistical models by tun-
ing them for particular performance criteria. Ideally,
it will be used to produce decoders optimized for
the loss functions that actually measure the qualities
that we wish to see in newly developed automatic
systems.
Acknowledgments
We would like to thank F. J. Och of RWTH, Aachen
for providing us the GIZA++ SMT toolkit, the mk-
cls toolkit to train word classes, the Hansards 50K
training and test data, and the reference word align-
ments and AER metric software. We would also like
to thank P. Resnik, R. Hwa and O. Kolak of the Univ.
of Maryland for useful discussions and help with the
GIZA++ setup. We thank AT&T Labs - Research for
use of the FSM Toolkit. This work was supported by
an ONR MURI grant N00014-01-1-0685.

References
P. J. Bickel and K. A. Doksum. 1977. Mathematical
Statistics: Basic Ideas and Selected Topics. Holden-
Day Inc., Oakland, CA, USA.
P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, and
R. L. Mercer. 1993. The mathematics of statistical
machine translation: Parameter estimation. Computa-
tional Linguistics, 19(2):263–311.
M. Collins. 1999. Head-Driven Statistical Models for
Natural Language Parsing. Ph.D. thesis, University
of Pennsylvania, Philadelphia, PA, USA.
V. Goel and W. Byrne. 2000. Minimum Bayes-risk auto-
matic speech recognition. Computer Speech and Lan-
guage, 14(2):115–135.
R. Hwa, P. Resnik, A. Weinberg, and O. Kolak. 2002.
Evaluating translational correspondence using annota-
tion projection. In Proceedings of ACL-2002. To ap-
pear.
R. Kneser and H. Ney. 1991. Forming word classes by
statistical clustering for statistical language modelling.
In The 1st Quantitative Linguistics Conference, Trier,
Germany.
K. Knight and Y. Al-Onaizan. 1998. Translation with
finite-state devices. In Proceedings of the AMTA Con-
ference, pages 421–437, Langhorne, PA, USA.
M. Mittendorfer and W. Winiwarter. 2001. Experiments
with the use of syntactic analysis in information re-
trieval. In Proceedings of the 6th International Work-
shop on Applications of Natural Language and Infor-
mation Systems, Bonn, Germany.
M. Mohri, F. C. N. Pereira, and M. Riley. 2000. The
design principles of a weighted finite-state transducer
library. Theoretical Computer Science, 231(1):17–32.
M. Mohri, F. Pereira, and M. Riley. 2001. AT&T
General-purpose finite-state machine software tools.
http://www.research.att.com/sw/tools/fsm/.
F. Och and H. Ney. 2000a. A comparison of alignment
models for statistical machine translation. In Proceedings
of the 18th Conference on Computational Linguistics,
pages 1086–1090, Saarbrücken, Germany.
F. Och and H. Ney. 2000b. Improved statistical align-
ment models. In Proceedings of ACL-2000, pages
440–447, Hong Kong, China.
K. Papineni, S. Roukos, T. Ward, J. Henderson, and
F. Reeder. 2002. Corpus-based comprehensive and diagnostic
MT evaluation: Initial Arabic, Chinese, French,
and Spanish results. In Proceedings of HLT 2002.
A. Ratnaparkhi. 1996. A maximum entropy model for
part-of-speech tagging. In Proceedings of the Confer-
ence on Empirical Methods in Natural Language Pro-
cessing, pages 133–142, Philadelphia, PA, USA.
F. Wessel, K. Macherey, and R. Schlueter. 1998. Us-
ing word probabilities as confidence measures. In Pro-
ceedings of ICASSP-98, pages 225–228, Seattle, WA,
USA.
