Extensions to HMM-based Statistical Word Alignment Models
Kristina Toutanova, H. Tolga Ilhan and Christopher D. Manning
Department of Computer Science
Stanford University
Stanford, CA 94305-9040 USA
kristina@cs.stanford.edu
ilhan@stanford.edu
manning@cs.stanford.edu
Abstract
This paper describes improved HMM-based word
level alignment models for statistical machine
translation. We present a method for using part of
speech tag information to improve alignment accu-
racy, and an approach to modeling fertility and cor-
respondence to the empty word in an HMM align-
ment model. We present accuracy results from eval-
uating Viterbi alignments against human-judged
alignments on the Canadian Hansards corpus, as
compared to a bigram HMM and to IBM Model 4.
The results show up to 16% alignment error reduc-
tion.
1 Introduction
The main task in statistical machine translation is
to model the string translation probability P(e_1^l | f_1^m),
where the string f_1^m in one language is translated
into another language as string e_1^l. We refer to e_1^l
as the source language string and f_1^m as the target
language string in accordance with the noisy chan-
nel terminology used in the IBM models of (Brown
et al., 1993). Word-level translation models assume
a pairwise mapping between the words of the source
and target strings. This mapping is generated by
alignment models. In this paper we present exten-
sions to the HMM alignment model of (Vogel et al.,
1996; Och and Ney, 2000b). Some of our extensions
are applicable to other alignment models as well and
are of general utility.¹

¹This paper was supported in part by the National Science Foundation under Grants IIS-0085896 and IIS-9982226. The authors would also like to thank the various reviewers for their helpful comments on earlier versions.

For most language pairs huge amounts of parallel corpora are not readily available, whereas monolingual resources such as taggers are more often available. Little research has gone into exploring the potential of part of speech information to better model
translation probabilities and permutation probabili-
ties. Melamed (2000) uses a very broad classifica-
tion of words (content, function and several punctu-
ation classes) to estimate class-specific parameters
for translation models. Fung and Wu (1995) adapt
English tags for Chinese language modeling using
Coerced Markov Models. They use English POS
classes as states of the Markov Model to generate
Chinese language words. In this paper we use POS
tag information to incorporate prior knowledge of
word translation and to model local word order vari-
ation. We show that using this information can help
in the translation modeling task.
Many alignment models assume a one to many
mapping from source language words to target lan-
guage words, such as the IBM models 1-5 of Brown
et al. (1993) and the HMM alignment model of (Vo-
gel et al., 1996). In addition, the IBM Models 3,
4 and 5 include a fertility model n(φ | e) where φ is
the number of words aligned to a source word e. In
HMM-based alignment word fertilities are not mod-
eled. The alignment positions of target words are the
states in an HMM. The alignment probabilities for
word f_j depend only on the alignment of the pre-
vious word f_{j-1} if using a first order HMM. There-
fore, source words are not awarded/penalized for be-
ing aligned to more than one target word. We present
an extension to HMM alignment that approximately
models word fertility.
Another assumption of existing alignment mod-
els is that there is a special Null word in the source
sentence from which all target words that do not
have other correspondences in the source language
are generated. Use of such a Null word has proven
problematic in many models. We also assume the existence of a special Null word in the source language that generates words in the target language. However, we define a different model that better constrains and conditions generation from Null. We assume that the generation probability of words by Null depends on other words in the target sentence. Next we present the general equations for decomposition of the translation probability using part of speech tags, and later we go into more detail on our extensions.

[Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, July 2002, pp. 87-94. Association for Computational Linguistics.]
2 Part of Speech Tags in a Translation
Model
Augmenting the model P(e_1^l | f_1^m) with part of
speech tag information leads to the following equa-
tions. We use e_1^l, f_1^m or vector notation e, f to de-
note English and French strings. (m and l represent
the lengths of the French and English strings respec-
tively.) Let us define eT and fT as possible POS tag
sequences of the sentences e and f. We can rewrite
the string translation probability P(e_1^l | f_1^m) as fol-
lows (using Bayes rule to give the last line):

P(e | f) = Σ_fT P(e, fT | f)
         = Σ_fT P(fT | f) P(e | f, fT)
         = Σ_fT P(fT | f) Σ_eT P(e, eT | f, fT)
         = Σ_fT P(fT | f) Σ_eT P(e) P(eT | e) P(f, fT | e, eT) / P(f, fT)
If we also assume that the taggers in both languages
generate a single tag sequence for each sentence then
the equation for machine translation by the noisy
channel model simplifies to
argmax_e P(e | f) = argmax_e P(e) P(f, fT | e, eT)
This is the decomposition of the string translation
probability into a language model and translation
model. In this paper we only address the transla-
tion model and assume that there exists a one-to-one
alignment from target to source words. Therefore,
P(f, fT | e, eT) = Σ_a P(f, fT, a | e, eT)

One possible way to rewrite P(f_1^m, fT_1^m, a_1^m | e_1^l, eT_1^l) without loss of generality is:

P(f_1^m, fT_1^m, a_1^m | e, eT) = P(m | e, eT) ×
    Π_{j=1}^{m}  P(a_j | a_1^{j-1}, f_1^{j-1}, fT_1^{j-1}, m, e, eT)
               × P(fT_j | a_1^j, f_1^{j-1}, fT_1^{j-1}, m, e, eT)
               × P(f_j | a_1^j, f_1^{j-1}, fT_1^j, m, e, eT)          (1)
Here each a_j gives the index of the word e_{a_j} to
which f_j is aligned. The models we present in
this paper will differ in the decompositions of align-
ment probabilities, tag translation and word trans-
lation probabilities in Eqn. 1. Section 3 describes
the baseline model in more detail. Section 4 illus-
trates examples where the baseline model performs
poorly. Section 5 presents our extensions and Sec-
tion 6 presents experimental results.
3 Baseline Model
Translation of French and English sentences shows a
strong localization effect. Words close to each other
in the source language remain close in the transla-
tion. Furthermore, most of the time the alignment
shows monotonicity. This means that pairwise align-
ments stay close to the diagonal line of the (j, i)
plane. It has been shown (Vogel et al., 1996; Och
et al., 1999; Och and Ney, 2000a) that HMM based
alignment models are effective at capturing such lo-
calization.
We use as a baseline the model presented by
(Och and Ney, 2000a). A basic bigram HMM-based
model gives us
P(f | e) = Σ_{a_1^m} Π_{j=1}^{m} [ P(a_j | a_{j-1}, l) P(f_j | e_{a_j}) ]     (2)
In this HMM model,² alignment probabilities are
independent of word position and depend only on
the jump width (a_j − a_{j-1}).³ The Och and Ney (2000a)
model includes refinements such as special treat-
ment of jumps to Null and smoothing with a uni-
form prior, which we also included in our initial
model. As in their model, we set the probability of a
jump from any state to Null to a fixed value (p_0 = 0.4),
which we estimated from held-out data.
²Each HMM state is [a_j, e_{a_j}], emitting f_j as output.
3In order for the model not to be deficient, we normalize the
jump probabilities at each EM step so that jumping outside of
the borders of the sentence is not possible.
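To make Eqn. 2 concrete, the likelihood P(f | e) can be computed with the standard HMM forward recursion, renormalizing jump widths so that jumps outside the sentence receive no mass (footnote 3). This is our illustrative sketch, not the authors' code: the uniform initial distribution, the 1e-6 floor for unseen translation pairs, and all table values are assumptions.

```python
def forward_likelihood(f, e, t_prob, jump_prob):
    """P(f | e) per Eqn. 2: sum over alignments a_1^m of
    prod_j P(a_j | a_{j-1}, l) * P(f_j | e_{a_j}), where the jump
    probability depends only on the width a_j - a_{j-1}."""
    l = len(e)

    def trans(i_prev, i):
        # renormalize jump widths over positions inside the sentence
        z = sum(jump_prob.get(k - i_prev, 0.0) for k in range(l))
        return jump_prob.get(i - i_prev, 0.0) / z if z > 0 else 0.0

    # alpha[i] = P(f_1..f_j, a_j = i | e); the uniform start is an assumption
    alpha = [(1.0 / l) * t_prob.get((f[0], e[i]), 1e-6) for i in range(l)]
    for j in range(1, len(f)):
        alpha = [sum(alpha[ip] * trans(ip, i) for ip in range(l))
                 * t_prob.get((f[j], e[i]), 1e-6)
                 for i in range(l)]
    return sum(alpha)
```

With toy tables favoring the monotone alignment, the likelihood mass concentrates on the diagonal path, mirroring the localization effect the baseline model exploits.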
[Figure 1 shows an English sentence ("in addition, it could become a serious threat to confederation and national unity.") aligned with its French translation ("en outre, elle pourrait constituer une sérieuse menace pour la confédération et l'unité nationale."), together with POS tags and two competing alignments, Aln1 and Aln2, of the last three French words.]

       e_{a_j}  e_{a_{j-1}}  e_{a_{j-2}}  P(a_{j-1} | a_{j-2}, l)  P(f_{j-1} | e_{a_{j-1}})  P(a_j | a_{j-1}, l)  P(f_j | e_{a_j})  total
Aln1   .        national     unity        0.0292                   0.5741                    0.05357              0.9766            8.77 x 10^-4
Aln2   .        unity        unity        0.0862                   0.2789                    0.31                 0.9766            7.2 x 10^-3

Figure 1: The baseline model makes a simple alignment error.
4 Alignment Irregularities
Although the baseline Hidden Markov alignment
model successfully generates smooth alignments,
there are a fair number of alignment examples where
pairwise match shows local irregularities. One in-
stance of this is the transition of the NP a139 JJ NN
rule to NP a139 NN JJ from English to French. We can
list two main reasons why word translation proba-
bilities may not catch such irregularities to mono-
tonicity. First, it may be the case that both the
English adjective and noun are words that are un-
known. In this case the translation probabilities will
be close to each other after smoothing. Second, the
adjective-noun pair may consist of words that are
frequently seen together in English. National reserve
and Canadian parliament are examples of
such pairs. As a result there will be an indirect asso-
ciation between the English noun and the translation
of the English adjective. In both cases, word transla-
tion probabilities will not be differentiating enough
and alignment probabilities become the dominating
factor in determining where f_j aligns.
Figure 1 illustrates how our baseline HMM model
makes an alignment mistake of this sort. The ta-
ble in the figure displays alignment and translation
probabilities of two competing alignments (namely
Aln1 and Aln2) for the last three words. In both
alignments, the shown f_j and e_{a_j} are periods at the
end of the French and English sentences. The first
alignment maps nationale to national and unité to
unity (i.e., e_{a_{j-1}} = national and e_{a_{j-2}} = unity). The
second alignment maps both nationale and unité to
unity (i.e., e_{a_{j-1}} = unity and e_{a_{j-2}} = unity). Start-
ing from the unity-unité alignment, the jump width
sequences <(a_{j-1} − a_{j-2}), (a_j − a_{j-1})> for Aln1
and Aln2 are <−1, 2> and <0, 1> respectively. The
table shows that the gain from use of monotonic
alignment probabilities dominates over the lowered
word translation probability. Although national and
nationale are strongly correlated according to the
translation probabilities, jump widths of −1 and 2
are less probable than jump widths of 0 and 1.
5 Extensions
In this section we describe our improvements on the
HMM model. We present evaluation results in Sec-
tion 6 after describing the technical details of our
models here.
5.1 POS Tags for Translation Probabilities
Our model with part of speech tags for translation
probabilities uses the following simplification of the
translation probability shown in Eqn. 1.⁴

P(f, fT | e, eT) = Σ_{a_1^m} Π_{j=1}^{m} [ P(a_j | a_{j-1}, l)
                   P(fT_j | eT_{a_j}) P(f_j | e_{a_j}) ]          (3)
In this model we introduce tag translation probabil-
ities as an extra factor to Eqn. 2. Intuitively the role
of this factor is to boost the translation probabilities
for words of parts of speech that can often be trans-
lations of each other. Thus this probability distribu-
tion provides prior knowledge of the possible trans-
lations of a word based only on its part of speech.
However, P(fT_j | eT_{a_j}) should not be too sharp or
⁴Since we are only concerned with alignment here and not
generation of candidate translations, the factor P(m | e, eT) can
be ignored, and we omit it from the equations for the rest of the
paper.
it will dominate the alignment probabilities and the
probabilities P(f | e). We use the following linear
interpolation to smooth tag translation probabilities:

P(fT_j | eT_{a_j}) = λ P̂(fT_j | eT_{a_j}) + (1 − λ) · 1/T      (4)

Here P̂ is the unsmoothed estimate from EM, T is the
size of the French tag set, and λ is set to be
0.1 in our experiments. The tag translation model is
so heavily smoothed with a uniform distribution be-
cause in EM the tag translation probabilities quickly
become very sharp and can easily overrule the align-
ment and word translation probabilities. The Results
section shows that the addition of this factor reduces
the alignment error rate, with the improvement being
especially large when the training data size is small.
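The interpolation in Eqn. 4 is a one-line computation; the sketch below is ours, with an illustrative tag-set size rather than the paper's actual inventory.

```python
def smooth_tag_prob(p_hat, tagset_size, lam=0.1):
    """Eqn. 4: interpolate the estimated tag translation probability
    P^(fT_j | eT_{a_j}) with a uniform distribution over the T French tags."""
    return lam * p_hat + (1.0 - lam) / tagset_size
```

With, say, T = 45 French tags, even a fully sharp estimate of 1.0 is flattened to 0.1 + 0.9/45 = 0.12, so the tag factor can nudge but never overrule the word translation and alignment probabilities.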
5.2 Tag Sequences for Jump Probabilities
This section describes an extension to the bigram
HMM model that uses source and target language
tag sequences as conditioning information when
predicting the alignment of target language words.
In the decomposition of the joint proba-
bility P(f, fT, a | e, eT) shown in Eqn. 1,
the factor for alignment probabilities is
P(a_j | a_1^{j-1}, f_1^{j-1}, fT_1^{j-1}, m, e, eT).
A bigram HMM model assumes independence of
a_j from anything but the previous alignment posi-
tion a_{j-1} and the length of the English sentence.
Brown et al. (1993) and Och et al. (1999) variably
condition this probability on the English word in
position a_{j-1} and/or the French word in position
j. As conditioning directly on words would yield
a large number of parameters and would be imprac-
tical, they cluster the words automatically into bilin-
gual word classes.
The question arises then whether we would have
larger gains by conditioning on the part of speech
tags of those words or even more words around the
alignment position. For example, if we use the fol-
lowing conditioning information:
P(a_j | a_1^{j-1}, f_1^{j-1}, fT_1^{j-1}, m, e, eT) =
    P(a_j | a_{j-1}, eT_{a_{j-1}-1}, eT_{a_{j-1}}, eT_{a_{j-1}+1})
we could model probabilities of transpositions and
insertion of function words in the target language
that have no corresponding words in the source lan-
guage (e_{a_j} is Null), similarly to the channel oper-
ations of the (Yamada and Knight, 2001) syntax-
based statistical translation model. Since the syntac-
tic knowledge provided by POS tags is quite limited,
this is a crude model of transpositions and Null in-
sertions at the preterminal level. However, we could
still expect that it would help in modeling local
word order variations. For example, in the sentence
J'aime la chute 'I love the fall', the probability of
aligning f_j = la (fT_j = DT) to the will be boosted
by knowing eT_{a_{j-1}} = VBP and eT_{a_{j-1}+1} = DT.
Similarly, in the sentence J'aime des chiens 'I love
dogs', the probability of aligning f_j = des to Null
will be increased by knowing eT_{a_{j-1}} = VBP and
eT_{a_{j-1}+1} = NNS. VBP followed by NNS crudely
conveys the information that the verb is followed by
a noun phrase which does not include a determiner.
We conducted a series of experiments where
the alignment probabilities are conditioned on
different subsets of the part of speech tags
eT_{a_{j-1}-1}, eT_{a_{j-1}}, eT_{a_{j-1}+1}, fT_{j-1}, fT_j, fT_{j+1}.
In order to be able to condition on fT_j and fT_{j+1}
when generating an alignment position for f_j,
we have to change the generative model for the
sentence f and its tag sequence fT to generate the
part of speech tags for the French words before
choosing alignment positions for them. The French
POS tags could be generated, for example, from
a prior distribution P(fT_j) or from the previous
French tags as in an HMM for part-of-speech tag-
ging. The generative model becomes:

P(f, fT | e) = P(fT) Σ_{a_1^m} Π_{j=1}^{m} [ P(a_j | a_{j-1}, fT_j, fT_{j-1}, l) P(f_j | e_{a_j}) ]
This model makes the assumption that target words
are independent of their tags given the correspond-
ing source word and models only the dependence of
alignment positions on part of speech tags.
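In implementation terms, this replaces the single shared jump table with one table per tag context. The sketch below is our illustration of how expected counts might be accumulated per context such as (fT_j, fT_{j-1}) during EM; the class and its interface are not from the paper.

```python
from collections import defaultdict

class TagConditionedJumps:
    """Jump probabilities P(a_j - a_{j-1} | context), where context is a
    tuple of POS tags such as (fT_j, fT_{j-1}) from Section 5.2."""

    def __init__(self):
        # context -> jump width -> expected count
        self.counts = defaultdict(lambda: defaultdict(float))

    def observe(self, context, width, gamma):
        # accumulate posterior mass for this jump from the E-step
        self.counts[context][width] += gamma

    def prob(self, context, width):
        # M-step estimate: relative frequency within the context
        table = self.counts[context]
        total = sum(table.values())
        return table[width] / total if total > 0 else 0.0
```

In practice such context-specific tables would themselves need smoothing toward the shared table, for the same sparsity reasons discussed above.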
5.3 Modeling Fertility
A major advantage of the IBM models 3–5 over the
HMM alignment model is the presence of a model
of source word fertility. Thus knowledge that some
words translate as phrases in the target language is
incorporated in the model.
The HMM model has no memory, apart from the
previous alignment, about how many words it has
aligned to a source word. Yet even this memory is
not used to decide whether to generate more words
from a given English word. The decision to gener-
ate again (to make a jump of size 0) is independent
of the word and is estimated over all words in the
corpus.
We extended the HMM model to decide whether
to generate more words from the previous English
word e_{a_{j-1}} or to move on to a different word, de-
pending on the identity of the English word e_{a_{j-1}}.
We introduced a factor P(stay | e_{a_{j-1}}), where the
boolean random variable stay depends on the En-
glish word e_{a_{j-1}} aligned to. Since in most cases
words with fertility greater than one generate words
that are consecutive in the target language, this
extension approximates fertility modeling. More
specifically, the baseline model (i.e., Eqn. 2) is
changed as follows:
P(f | e) = Σ_{a_1^m} Π_{j=1}^{m} [ P'(a_j | a_{j-1}, e_{a_{j-1}}, l) P(f_j | e_{a_j}) ]

where

P'(a_j | a_{j-1}, e_{a_{j-1}}, l) = δ(a_j, a_{j-1}) P(stay | e_{a_{j-1}})
    + (1 − δ(a_j, a_{j-1})) (1 − P(stay | e_{a_{j-1}})) P(a_j | a_{j-1}, l)      (5)
δ(a_j, a_{j-1}) in Eqn. 5 is the Kronecker delta func-
tion. Basically, the new alignment probabilities
P'(a_j | a_{j-1}, e_{a_{j-1}}, l) state that a jump width of zero de-
pends on the English word. If we define the fertility
of a word as the number of consecutive words from
the target language it generates, then the probabil-
ity distribution for the fertility of an English word e
according to this model is geometric with a proba-
bility of success 1 − P(stay | e). The expectation is
1/(1 − P(stay | e)).⁵ Even though the fit of this distribution
to the real fertility distribution may not be very good,
this approximation improves alignment accuracy in
practice.
Sparsity is a problem in estimating stay probabil-
ities P(stay | e_{a_{j-1}}). We use the probability of a jump
of size zero from the baseline model as our prior to
do smoothing as follows:

P̃(stay | e_{a_{j-1}}) = λ P_0 + (1 − λ) P(stay | e_{a_{j-1}})      (6)
⁵E[X] = p + 2p(1 − p) + 3p(1 − p)² + ... = 1/p,
where X is the number of Bernoulli trials until the first success
and p is the success probability.
P_0 in this equation is the alignment probability
from the baseline model with zero jump distance:
P_0 = P(a_j = i | a_{j-1} = i, l).
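Eqns. 5 and 6 and the geometric fertility expectation reduce to a few lines. The sketch below is our reading of the model; the interpolation weight and the assumption that the baseline jump distribution is renormalized over non-zero widths are ours, not stated in the paper.

```python
def smoothed_stay(p_stay_word, p0, lam=0.5):
    """Eqn. 6: smooth the word-specific stay estimate toward P0, the
    baseline zero-jump probability (lam here is illustrative)."""
    return lam * p0 + (1.0 - lam) * p_stay_word

def modified_jump(width, p_jump, p_stay):
    """Eqn. 5: a zero-width jump takes the word-specific stay mass;
    other jumps share the remaining mass in proportion to the baseline
    jump probability p_jump (assumed renormalized over widths != 0)."""
    if width == 0:
        return p_stay
    return (1.0 - p_stay) * p_jump(width)

def expected_fertility(p_stay):
    """Geometric fertility: E[consecutive target words] = 1 / (1 - P(stay | e))."""
    return 1.0 / (1.0 - p_stay)
```

A word with P(stay | e) = 0.5 thus has expected fertility 2, while a word that never repeats gets expected fertility 1, recovering the one-to-one case.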
5.4 Translation Model for Null
As originally proposed by Brown et al. (1993),
words in the target sentence for which there are no
corresponding English words are assumed to be gen-
erated by the special English word Null. Null ap-
pears in every English sentence and often serves to
generate syntactic elements in the target language
that are missing in the source. A probability distri-
bution P(f | Null) for generation probabilities of
Null is re-estimated from a training corpus.
Modeling a Null word has proven problematic. It
has required many special fixes to keep models from
aligning everything to Null or to keep them from
aligning nothing to Null (Och and Ney, 2000b). This
might stem from the problem that the Null is respon-
sible for generating syntactic elements of the target
language as well as generating words that make the
target language sentence more idiomatic and stylis-
tic. The intuition for our model of translation proba-
bilities for target words that do not have correspond-
ing source words is that these words are generated
from the special English Null and also from the next
word in the target language by a mixture model. The
pair la confédération in Figure 1 is an example of
such a case, where confédération contributes extra in-
formation in the generation of la. The formula for the
probability of a target word given that it does not
have a corresponding aligning word in the source is:

P(f_j | a_j = 0) = λ P(f_j | f_{j+1}, e_{a_j} = Null)
                 + (1 − λ) P(f_j | e_{a_j} = Null)      (7)
We re-estimate the probabilities
P(f_j | f_{j+1}, e_{a_j} = Null) from the training cor-
pus using EM. The dependence of a French
word on the next French word requires a change
in the generative model to first propose align-
ments for all words in the French sentence and
to then generate the French words given their
alignments, starting from the end of the sentence
and going towards the beginning. For the new
model there is an efficient dynamic programming
algorithm for computations in EM similar to the
forward-backward algorithm. The probability
P(f_1 ... f_j, a_j = i, f_{j+1}, ..., f_m | e) again decom-
poses into forward and backward probabilities.
The forward probability is α(j, i) = P(a_j = i) ·
P(f_1 ... f_j | a_j = i, f_{j+1}, e) and the backward
probability is β(j, i) = P(f_{j+1} ... f_m | a_j = i, e).
These can be computed recursively and used for
efficient computation of posteriors in EM.
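The mixture in Eqn. 7 itself can be sketched as follows; the weight and the toy tables are ours, not estimated values from the paper.

```python
def null_gen_prob(f_j, f_next, p_context, p_null, lam=0.4):
    """Eqn. 7: P(f_j | a_j = 0) as a mixture of a distribution conditioned
    on the following French word (p_context) and the plain,
    context-independent Null translation table (p_null)."""
    return (lam * p_context.get((f_j, f_next), 0.0)
            + (1.0 - lam) * p_null.get(f_j, 0.0))
```

For the la confédération example above, mass from the context component P(la | confédération, Null) is added to the context-independent P(la | Null), letting the following noun license the determiner.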
6 Results
We present results on word level alignment accu-
racy using the Hansards corpus. Our test data con-
sists of 500 manually aligned sentences which are
the same data set used by (Och and Ney, 2000b).⁶
In the annotated sentences every alignment between
two words is labeled as either a sure (S) or possible
(P) alignment (S ⊆ P). We used the following quan-
tity (called alignment error rate or AER) to evaluate
the alignment quality of our models, which is also
the evaluation metric used by (Och and Ney, 2000b).
For a set A of predicted alignments:

recall = |A ∩ S| / |S|,    precision = |A ∩ P| / |A|

AER = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|)
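With alignments represented as sets of (i, j) links, the three quantities above can be computed as in this minimal sketch (function and variable names are ours):

```python
def alignment_scores(hypothesis, sure, possible):
    """Recall, precision, and alignment error rate (AER) in the style of
    Och and Ney (2000b). `sure` must be a subset of `possible`."""
    a, s, p = set(hypothesis), set(sure), set(possible)
    recall = len(a & s) / len(s)
    precision = len(a & p) / len(a)
    aer = 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
    return recall, precision, aer
```

Note that AER rewards recovering all sure links while only lightly penalizing extra links that fall inside the possible set.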
We divided this annotated data into a validation
set of 100 sentences and a final test set of 400 sen-
tences. The validation set was used to select tuning
parameters such as λ in Eqns. 4, 6 and 7. We report
AER results on the final test set of 400 sentences
which contain a total of [...] English and [...]
French words. We experimented with training cor-
pora of different sizes ranging from 5K to 50K sen-
tences. We concentrated on small to medium data
sets to assess the ability of our models to deal with
sparse data.
Table 1 shows the percentage of words in the cor-
pus that were seen less than the specified number of
times. For example, in our 10K training corpus 47%
of all word types were seen only once. As seen from
the table the sparsity is great even for large corpora.
The models we implemented and compare in this
section are the following:
- Baseline is the baseline HMM model described in section 3
- Tags is an HMM model that includes tags for translation probabilities (section 5.1)
- SG is an HMM model that includes stay probabilities (section 5.3)
- Null is an HMM model that includes the new generation model for words by Null (section 5.4)
- Tags+Null, Tags+SG, and Tags+Null+SG are combinations of the above models

⁶We want to thank Franz Och for sharing the annotated data with us.
Table 2 shows AER results for our improved
models on training corpora of increasing size. The
model Null outperforms the baseline at every data
set size, with the error reduction being larger for big-
ger training sets (up to 9.2% error reduction). The
SG model reduces the baseline error rate by up to
10%. The model Tags reduces the error rate for the
smallest dataset by 7.6%. The combination of Tags
and the SG or Null models outperforms the individ-
ual models in the combination since they address
different problems and make orthogonal mistakes.
The combination of SG and Tags reduces the base-
line error rate by up to 16% and the combination of
Null and Tags reduces the error rate by up to 12.3%.
All of these error reductions are statistically signifi-
cant at the p = 0.05 level according to the
paired t-test. The combination Tags+Null+SG fur-
ther reduces the error rate. For small datasets, there
seems to be a stronger overlap between the strengths
of the Null and SG models because some fertility
related phenomena can be accounted for by both
models. When an English word is wrongly align-
ing to several consecutive French words because of
indirect association, while the correct alignment of
some of them is to the empty word, both the Null and
SG models can combat the problem: one by better
modeling correspondence to Null, and the other by
discouraging large fertilities.
Figure 2 displays learning curves for three mod-
els: Och, Tags, and Tags+Null. Och is the HMM
alignment model of (Och and Ney, 2000b). To ob-
tain results from the Och model we ran GIZA++.7
Both the Tags and Och models use word classes.
However the word classes used in the latter are
learned automatically from parallel bilingual cor-
pora while the classes used in the former are hu-
man defined part of speech tags. Figure 2 shows
that the Tags model outperforms the Och model
when the training data size is small. As the train-
⁷GIZA++ can be downloaded from http://www-i6.
informatik.rwth-aachen.de/~och/software/GIZA++.html
Table 1: Percentage of words in the corpus by frequency

              = 1               <= 3              <= 5              <= 10
Corpus   English French    English French    English French    English French
10K        47%    50%        61%    66%        74%    77%        84%    87%
25K        43%    44%        57%    59%        69%    72%        80%    83%
50K        42%    44%        55%    57%        67%    69%        78%    81%
Table 2: Alignment Error Rate by Model and Corpus Size
Corpus Baseline Null SG Tags Tags+SG Tags+Null Tags+Null+SG
5K 17.53 16.86 16.72 16.20 15.31 15.36 15.14
15K 15.03 14.29 13.52 13.90 12.63 13.22 12.52
25K 13.85 13.05 12.79 13.10 11.91 12.30 11.79
35K 13.19 11.98 12.03 12.60 11.45 11.56 11.07
50K 12.63 11.76 11.78 12.10 11.19 11.11 10.69
ing size increases the Och model catches up with
the Tags model and even surpasses it slightly. This
suggests that when large amounts of parallel text are
not available monolingual part of speech classes can
improve alignment quality more than automatically
induced classes. When more data is available au-
tomatically induced bilingual word classes seem to
provide more improvement but it still remains to be
explored whether the combination of part-of-speech
knowledge with induction of bilingual classes will
perform even better. The third curve in the figure for
Tags+Null illustrates the relative improvement of
the Null model over the Tags model as the training
set size increases. We see that the performance gap
between the two models becomes wider for larger
training data sizes. This reflects the improved esti-
mation of the generation probabilities for Null which
require target word specific parameters. We used
5K 15K 25K 35K 45K
Training Set Size
10.5
12.5
14.5
16.5
18.5
AER
Och
Tags
Null+Tags
Figure 2: Och vs. Tags and Tags+Null.
both paired t-test and Wilcoxon signed rank tests to
show the improvements are statistically significant.
The signed rank test uses the normalized test statis-
tic (W⁺ − E[W⁺]) / sqrt(Var(W⁺)). W⁺ is the sum of the ranks that have
positive signs. Ties are assigned the average rank of
the tied group. Since there are 400 test sentences, we
have 400 paired samples where the elements of each
pair are the AERs of the models being compared.
The difference between Och and Tags at 5K, 10K,
and 15K is significant at the 0.05 level accord-
ing to both tests. The difference between Och and
Tags+Null is significant for all training set sizes at
the 0.05 level.
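A minimal pure-Python sketch of the normalized signed-rank statistic described above (the tie correction to the variance is omitted for brevity):

```python
from math import sqrt

def wilcoxon_signed_rank_z(diffs):
    """Normalized Wilcoxon signed-rank statistic for paired differences
    (e.g., per-sentence AER differences between two models).

    Zero differences are dropped, and tied |differences| receive the
    average rank of their group. Returns z = (W+ - E[W+]) / sqrt(Var(W+)),
    omitting the tie correction to the variance for simplicity.
    """
    d = [x for x in diffs if x != 0.0]
    n = len(d)
    # Rank |d| in increasing order, averaging ranks within tied groups.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)  # sum of positive ranks
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24
    return (w_plus - mean) / sqrt(var)
```

With 400 paired AER samples, |z| > 1.96 corresponds to significance at the 0.05 level under the normal approximation.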
We also assessed the gains from using part-of-
speech tags in the alignment probabilities according
to the model described in section 5.2. Table 3 shows
the error rate of the basic HMM alignment model
as compared to an HMM model that conditions on
the tags of source and target words in the
neighborhood of the French word f_j and the English
word e_{a_{j-1}}, for a training set size of 10K. The results
showed an improvement of our model over one
that does not condition on tags. The improvement
in accuracy is largest when using the parts of speech
of the current and previous French words, and does
not increase when adding more conditioning
information. The improvement from part-of-speech
tags in the alignment probabilities was not as large
as we had expected, however, which leads us to
believe that more sophisticated syntax is needed to
model local word order variation.

Table 3: POS Conditioning of Jump Probabilities
Model                                        AER
Baseline                                     16.37
tag(f_j)                                     15.97
tag(f_j), tag(f_{j-1})                       15.74
tag(f_j), tag(e_{a_{j-1}})                   15.86
tag(f_j), tag(e_{a_{j-1}}), tag(e_{a_{j-1}+1})  15.88
tag(f_j), tag(f_{j-1}), tag(e_{a_{j-1}})     15.94
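As an illustration of tag-conditioned jump probabilities (a schematic with hypothetical data and names, not the authors' exact parameterization), the probabilities p(a_j - a_{j-1} | tag(f_j)) can be estimated by relative frequency over observed jumps:

```python
from collections import defaultdict

def estimate_jump_probs(events):
    """Relative-frequency estimate of p(jump | french_tag).

    `events` is a list of (jump, tag) pairs, where jump = a_j - a_{j-1}
    is the displacement between consecutive alignment positions and tag
    is the POS tag of the current French word (hypothetical data).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for jump, tag in events:
        counts[tag][jump] += 1
    probs = {}
    for tag, jumps in counts.items():
        total = sum(jumps.values())
        # Normalize counts into a conditional distribution per tag.
        probs[tag] = {jump: c / total for jump, c in jumps.items()}
    return probs
```

In the full model such counts would be expected (fractional) counts accumulated during EM rather than hard counts, and the conditioning context can be extended to tag(f_{j-1}) or tag(e_{a_{j-1}}) as in Table 3.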
[Figure 3: IBM-4 vs. SG+Tags. AER (10 to 20) against training set size (5K to 45K).]
In Figure 3 we compare the IBM-4 model to our
SG+Tags model. Such a comparison makes sense
because IBM-4 uses a fertility model for English
words and SG approximates fertility modeling, and
because IBM-4 uses word classes as does our Tags
model. For smaller training set sizes our model per-
forms much better than IBM-4, but when more data
is available IBM-4 becomes slightly better. This
confirms the observation from Figure 2 that auto-
matically induced bilingual classes perform better
when trained on large amounts of data. Also, since
our fertility model estimates one parameter for each
English word while IBM-4 estimates as many param-
eters as the maximum fertility allowed, at small train-
ing set sizes our model parameters can be estimated
more reliably.
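All comparisons in this section are in terms of AER, which (following the standard sure/possible-link definition of Och and Ney) scores a predicted alignment A against sure links S and possible links P, with S a subset of P. A minimal sketch:

```python
def aer(predicted, sure, possible):
    """Alignment Error Rate: 1 - (|A & S| + |A & P|) / (|A| + |S|).

    `predicted`, `sure`, and `possible` are collections of (i, j)
    alignment links; the sure links are assumed to be a subset of
    the possible links.
    """
    a, s, p = set(predicted), set(sure), set(possible)
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
```

A perfect alignment that recovers every sure link and proposes only possible links achieves an AER of 0.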
7 Conclusions
In this paper we presented three extensions to
HMM-based alignment models. We showed that
incorporating part of speech tag information of the
source and target languages in the translation model
improves word alignment accuracy. We also pre-
sented a method for approximately modeling fertil-
ity in an HMM-based model and a new generative
model for target language words that do not have
correspondences in the source language. The pro-
posed models do not significantly increase the com-
plexity of the learning algorithms, while providing a
better account of some phenomena in natural lan-
guage translation.

References
P. Brown, S. Della Pietra, V. Della Pietra, and R. Mer-
cer. 1993. The mathematics of statistical machine
translation: Parameter estimation. In Computational
Linguistics, volume 19(2), pages 263–311.
Pascale Fung and Dekai Wu. 1995. Coerced markov
models for cross-lingual tag relations. In Sixth Inter-
national Conference on Theoretical and Methodolog-
ical Issues in Machine Translation, volume 1, pages
240–255.
Dan I. Melamed. 2000. Models of translational equiv-
alence among words. In Computational Linguistics,
volume 26(2), pages 221–249.
F. Och and H. Ney. 2000a. A comparison of alignment
models for statistical machine translation. In Proc.
COLING ’00: The 18th Int. Conf. on Computational
Linguistics, pages 1086–1090.
F. Josef Och and H. Ney. 2000b. Improved statistical
alignment models. In Proc. of the 39th Annual Meet-
ing of the ACL.
F. Och, C. Tillmann, and H. Ney. 1999. Improved align-
ment models for statistical machine translation. In
Proc. of the Joint Conf. of Empirical Methods in Nat-
ural Language Processing and Very Large Corpora,
pages 20–28.
S. Vogel, H. Ney, and C. Tillmann. 1996. HMM-based
word alignment in statistical translation. In Proc.
COLING '96: The 16th Int. Conf. on Computational
Linguistics, pages 836–841.
K. Yamada and K. Knight. 2001. A syntax-based sta-
tistical translation model. In Proc. of the 39th Annual
Meeting of the ACL, pages 523–530.
