Proceedings of the ACL Workshop on Building and Using Parallel Texts, pages 83–86,
Ann Arbor, June 2005. © Association for Computational Linguistics, 2005
Improved HMM Alignment Models for Languages with Scarce Resources
Adam Lopez
Institute for Advanced Computer Studies
Department of Computer Science
University of Maryland
College Park, MD 20742
alopez@cs.umd.edu
Philip Resnik
Institute for Advanced Computer Studies
Department of Linguistics
University of Maryland
College Park, MD 20742
resnik@umiacs.umd.edu
Abstract
We introduce improvements to statistical word
alignment based on the Hidden Markov
Model. One improvement incorporates syntactic
knowledge. Results on the workshop data
show that alignment performance exceeds that
of a state-of-the-art system based on more complex
models, resulting in an absolute reduction in
error of over 5.5% on Romanian-English.
1 Introduction
The most widely used alignment model is IBM Model 4
(Brown et al., 1993). In empirical evaluations it has out-
performed the other IBM Models and a Hidden Markov
Model (HMM) (Och and Ney, 2003). It was the basis
for a system that performed very well in a comparison
of several alignment systems (Dejean et al., 2003; Mihal-
cea and Pedersen, 2003). Implementations are also freely
available (Al-Onaizan et al., 1999; Och and Ney, 2003).
The IBM Model 4 search space cannot be efficiently
enumerated; therefore it cannot be trained directly using
Expectation Maximization (EM). In practice, a sequence
of simpler models such as IBM Model 1 and an HMM
Model is used to generate initial parameter estimates
and to enumerate a partial search space, which can be
expanded using hill-climbing heuristics. IBM Model 4
parameters are then estimated over this partial search space
as an approximation to EM (Brown et al., 1993; Och and
Ney, 2003). This approach yields good results, but it has
been observed that IBM Model 4 performance is only
slightly better than that of the underlying HMM Model
used in this bootstrapping process (Och and Ney, 2003).
This is illustrated in Figure 1.
Based on this observation, we hypothesize that imple-
mentations of IBM Model 4 derive most of their per-
formance benefits from the underlying HMM Model.
Furthermore, owing to the simplicity of HMM Models,
we believe that they are more conducive to study and
improvement than more complex models such as IBM
Model 4. We illustrate this point by introducing modifi-
cations to the HMM model which improve performance.
[Figure 1 plot: AER (vertical axis, .3 to .7) against training iterations (horizontal axis) for the Model 1, HMM, and Model 4 stages.]
Figure 1: The improvement in Alignment Error Rate
(AER) is shown for both $P(\mathbf{f} \mid \mathbf{e})$ and $P(\mathbf{e} \mid \mathbf{f})$ alignments on
the Romanian-English development set over several
iterations of the IBM Model 1 → HMM → IBM Model 4
training sequence.
2 HMMs and Word Alignment
The objective of word alignment is to discover the word-to-word
translational correspondences in a bilingual corpus
of $S$ sentence pairs, which we denote
$\{(\mathbf{f}^{(s)}, \mathbf{e}^{(s)}) : s \in [1,S]\}$.
Each sentence pair $(\mathbf{f}, \mathbf{e}) = (f_1^M, e_1^N)$ consists of
a sentence $\mathbf{f}$ in one language and its translation $\mathbf{e}$ in the
other, with lengths $M$ and $N$, respectively. By convention
we refer to $\mathbf{e}$ as the English sentence and $\mathbf{f}$ as the French
sentence. Correspondences in a sentence are represented
by a set of links between words. A link $(f_j, e_i)$ denotes a
correspondence between the $i$th word $e_i$ of $\mathbf{e}$ and the $j$th
word $f_j$ of $\mathbf{f}$.
Many alignment models arise from the conditional distribution
$P(\mathbf{f} \mid \mathbf{e})$. We can decompose this by introducing
the hidden alignment variable $\mathbf{a} = a_1^M$. Each element of
$\mathbf{a}$ takes on a value in the range $[1,N]$. The value of $a_i$
determines a link between the $i$th French word $f_i$ and
the $a_i$th English word $e_{a_i}$. This representation introduces
an asymmetry into the model because it constrains each
French word to correspond to exactly one English word,
while each English word is permitted to correspond to an
arbitrary number of French words. Although the result-
ing set of links may still be relatively accurate, we can
symmetrize by combining it with the set produced by applying
the complementary model $P(\mathbf{e} \mid \mathbf{f})$ to the same data
(Och and Ney, 2000b). Making a few independence assumptions,
we arrive at the decomposition in Equation 1.$^1$
$$P(\mathbf{f}, \mathbf{a} \mid \mathbf{e}) = \prod_{i=1}^{M} d(a_i \mid a_{i-1})\, t(f_i \mid e_{a_i}) \qquad (1)$$
We refer to $d(a_i \mid a_{i-1})$ as the distortion model and $t(f_i \mid e_{a_i})$
as the translation model. Conveniently, Equation 1 is in
the form of an HMM, so we can apply standard algorithms
for HMM parameter estimation and maximization.
This approach was proposed in Vogel et al. (1996) and
subsequently improved (Och and Ney, 2000a; Toutanova
et al., 2002).
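As a rough illustration of the maximization step, the following Python sketch finds the most probable alignment under Equation 1 with the Viterbi algorithm. It is a sketch under stated assumptions rather than a description of our system: the function name `viterbi_align`, the dictionary layout of `t` and `d`, the uniform start distribution, and the probability floor are all hypothetical choices made for the example.

```python
import math

def viterbi_align(f_words, e_words, t, d, floor=1e-12):
    """Return the best alignment a_1..a_M as 1-based English positions.

    t[(f, e)] is a translation probability t(f | e).
    d[(cur, prev)] is a distortion probability d(a_i = cur | a_{i-1} = prev).
    Missing or zero entries are replaced by `floor` (an assumption).
    """
    if not f_words or not e_words:
        return []
    M, N = len(f_words), len(e_words)

    def safe_log(p):
        return math.log(max(p, floor))

    def log_t(pos, cur):
        return safe_log(t.get((f_words[pos], e_words[cur - 1]), 0.0))

    # score[pos][cur]: best log-probability of f_1..f_pos ending in state cur.
    score = [[-math.inf] * (N + 1) for _ in range(M)]
    back = [[0] * (N + 1) for _ in range(M)]
    for cur in range(1, N + 1):
        # Uniform start distribution (assumed for this sketch).
        score[0][cur] = -math.log(N) + log_t(0, cur)
    for pos in range(1, M):
        for cur in range(1, N + 1):
            for prev in range(1, N + 1):
                cand = score[pos - 1][prev] + safe_log(d.get((cur, prev), 0.0))
                if cand > score[pos][cur]:
                    score[pos][cur] = cand
                    back[pos][cur] = prev
            score[pos][cur] += log_t(pos, cur)

    # Trace back from the best final state.
    a = [0] * M
    a[M - 1] = max(range(1, N + 1), key=lambda cur: score[M - 1][cur])
    for pos in range(M - 1, 0, -1):
        a[pos - 1] = back[pos][a[pos]]
    return a
```

The triple loop makes the cost quadratic in the English sentence length for each French word, which is the usual price of a first-order HMM over alignment positions.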
2.1 The Tree Distortion Model
Equation 1 is adequate in practice, but we can improve
it. Numerous parameterizations have been proposed for
the distortion model. In our surface distortion model, it
depends only on the distance $a_i - a_{i-1}$ and an automatically
determined word class $C(e_{a_{i-1}})$, as shown in Equation 2.
It is similar to the model of Och and Ney (2000a). The word
class $C(e_{a_{i-1}})$ is assigned using an unsupervised approach
(Och, 1999).
$$d(a_i \mid a_{i-1}) = p(a_i \mid a_i - a_{i-1}, C(e_{a_{i-1}})) \qquad (2)$$
The surface distortion model can capture local move-
ment but it cannot capture movement of structures or the
behavior of long-distance dependencies across transla-
tions. The intuitive appeal of capturing richer informa-
tion has inspired numerous alignment models (Wu, 1995;
Yamada and Knight, 2001; Cherry and Lin, 2003). How-
ever, we would like to retain the simplicity and good per-
formance of the HMM Model.
We introduce a distortion model which depends on the
tree distance $\tau(e_i, e_k) = (w, x, y)$ between each pair of
English words $e_i$ and $e_k$. Given a dependency parse of $e_1^N$,
$w$ and $x$ represent the respective number of dependency
links separating $e_i$ and $e_k$ from their closest common
ancestor node in the parse tree.$^2$ The final element
$y = \{1 \text{ if } i > k;\ 0 \text{ otherwise}\}$ is simply a binary indicator of the
linear relationship of the words within the surface string.
Tree distance is illustrated in Figure 2.

I$_1$ very$_2$ much$_3$ doubt$_4$ that$_5$
$\tau(\text{I}_1, \text{very}_2) = (1,2,0)$
$\tau(\text{very}_2, \text{I}_1) = (2,1,1)$
$\tau(\text{I}_1, \text{doubt}_4) = (1,0,0)$
$\tau(\text{that}_5, \text{I}_1) = (1,1,1)$
Figure 2: Example of tree distances in a sentence from
the Romanian-English development set.

$^1$ We ignore the sentence length probability $p(M \mid N)$, which
is not relevant to word alignment. We also omit discussion
of HMM start and stop probabilities, and normalization of
$t(f_i \mid e_{a_i})$, although we find in practice that attention to these
details can be beneficial.
$^2$ The tree distance could easily be adapted to work with
phrase-structure parses or tree-adjoining parses instead of
dependency parses.
In our tree distortion model, we condition on the tree
distance and the part of speech $T(e_{a_{i-1}})$, giving us
Equation 3.
$$d(a_i \mid a_{i-1}) = p(a_i \mid \tau(e_{a_i}, e_{a_{i-1}}), T(e_{a_{i-1}})) \qquad (3)$$
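The tree distance $\tau$ that Equation 3 conditions on can be computed by walking from each word up to the root of the dependency tree. The Python sketch below is illustrative only: the function name `tree_distance` and the head-map representation are assumptions, and the head assignments for the example sentence are inferred from the distances listed in Figure 2 rather than taken from actual parser output.

```python
def tree_distance(heads, i, k):
    """Return (w, x, y) for 1-based word positions i and k.

    heads[p] is the position of the head of word p; 0 marks the root.
    """
    def path_to_root(p):
        path = [p]
        while heads[p] != 0:
            p = heads[p]
            path.append(p)
        return path

    path_i, path_k = path_to_root(i), path_to_root(k)
    steps_from_k = {node: steps for steps, node in enumerate(path_k)}
    for w, node in enumerate(path_i):
        if node in steps_from_k:
            # The first shared node on the way up is the closest common ancestor.
            return (w, steps_from_k[node], 1 if i > k else 0)
    raise ValueError("words do not share a tree")

# Assumed heads for "I very much doubt that", with "doubt" as the root.
heads = {1: 4, 2: 3, 3: 4, 4: 0, 5: 4}
print(tree_distance(heads, 1, 2))  # distance from I_1 to very_2
```

Because each word has a single head, the two upward paths meet at exactly one closest ancestor, so the first shared node found is sufficient.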
Since both the surface distortion and tree distortion
models represent $p(a_i \mid a_{i-1})$, we can combine them using
linear interpolation as in Equation 4.
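$$\begin{aligned}
d(a_i \mid a_{i-1}) = {} & \lambda_{C(e_{a_{i-1}}),T(e_{a_{i-1}})}\, p(a_i \mid \tau(e_{a_i}, e_{a_{i-1}}), T(e_{a_{i-1}})) \\
& + (1 - \lambda_{C(e_{a_{i-1}}),T(e_{a_{i-1}})})\, p(a_i \mid a_i - a_{i-1}, C(e_{a_{i-1}}))
\end{aligned} \qquad (4)$$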
The $\lambda_{C,T}$ parameters can be initialized from a uniform
distribution and trained with the other parameters using
EM. In principle, any number of alternative distortion
models could be combined with this framework.
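To make the interpolation concrete, the sketch below computes the mixture of Equation 4 for a single transition and performs one EM update of a single weight. The update shown is the standard one for mixture weights and is an assumption about how such training could be done, not a record of our implementation; the names `interpolate` and `update_lambda` and the input format are hypothetical.

```python
def interpolate(lam, p_tree, p_surface):
    """Equation 4 for one transition: mix the two distortion probabilities."""
    return lam * p_tree + (1.0 - lam) * p_surface

def update_lambda(lam, transitions):
    """One EM update for a single interpolation weight.

    transitions: iterable of (gamma, p_tree, p_surface), all sharing the same
    word class and part of speech, where gamma is the posterior probability
    of the transition (for example, from the forward-backward algorithm).
    """
    num = den = 0.0
    for gamma, p_tree, p_surface in transitions:
        mix = interpolate(lam, p_tree, p_surface)
        if mix > 0.0:
            # Share of this transition's probability owed to the tree model.
            num += gamma * lam * p_tree / mix
            den += gamma
    return num / den if den > 0.0 else lam
```

When the tree model assigns higher probability than the surface model to the transitions that the posterior favors, the weight moves toward the tree model, and vice versa.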
2.2 Improving Initialization
Our HMM produces reasonable results if we draw our
initial parameter estimates from a uniform distribution.
However, we can do better. We estimate the initial
translation probability $t(f_j \mid e_i)$ from the smoothed
log-likelihood ratio $LLR(e_i, f_j)^{\phi_1}$ computed over sentence
cooccurrences. Since this method works well, we apply
$LLR(e_i, f_j)$ in a single reestimation step shown in
Equation 5.
$$t(f \mid e) = \frac{LLR(f \mid e)^{\phi_2} + n}{\sum_{e'} LLR(f \mid e')^{\phi_2} + n \cdot |V|} \qquad (5)$$
In reestimation, $LLR(f \mid e)$ is computed from the expected
counts of $f$ and $e$ produced by the EM algorithm. This is
similar to Moore (2004); as in that work, $|V| = 100{,}000$,
and $\phi_1$, $\phi_2$, and $n$ are estimated on development data.
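The reestimation step can be sketched in a few lines of Python. The sketch follows Equation 5 as printed and takes the LLR scores as given; the function name `reestimate_translation`, the dictionary layout, and the requirement that scores be nonnegative are assumptions made for the example.

```python
def reestimate_translation(llr, phi2, n, vocab_size=100_000):
    """Apply Equation 5 to a table of nonnegative LLR scores.

    llr[(f, e)] holds LLR(f | e); the result maps (f, e) to t(f | e).
    """
    # Denominator for each f: sum over e' of LLR(f | e')^phi2, plus n * |V|.
    totals = {}
    for (f, _e), score in llr.items():
        totals[f] = totals.get(f, 0.0) + score ** phi2
    return {
        (f, e): (score ** phi2 + n) / (totals[f] + n * vocab_size)
        for (f, e), score in llr.items()
    }
```

Raising the scores to the power `phi2` sharpens or flattens the distribution, while `n` reserves probability mass for pairs with little or no evidence.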
We can also use an improved initial estimate for distortion.
Consider a simple distortion model $p(a_i \mid a_i - a_{i-1})$.
We expect this distribution to have a maximum near
$P(a_i \mid 0)$ because we know that words tend to retain their
locality across translation. Rather than wait for this to
occur, we use an initial estimate for the distortion model
given in Equation 6.
corpus             $n$   $\phi_1$  $\phi_2$  $\alpha$  symmetrization                 $n^{-1}$  $\phi_1^{-1}$  $\phi_2^{-1}$  $\alpha^{-1}$
English-Inuktitut  1 4   1.0       1.75      -1.5      ∩                              5 4       1.0            1.75           -1.5
Romanian-English   5 4   1.5       1.0       -2.5      refined (Och and Ney, 2000b)   5 4       1.5            1.0            -2.5
English-Hindi      1 4   1.5       3.0       -2.5      ∪                              1 2       1.0            1.0            -1.0
Table 1: Training parameters for the workshop data (see Section 2.2). Parameters $n$, $\phi_1$, $\phi_2$, and $\alpha$ were used in the
initialization of the $P(\mathbf{f} \mid \mathbf{e})$ model, while $n^{-1}$, $\phi_1^{-1}$, $\phi_2^{-1}$, and $\alpha^{-1}$ were used in the initialization of the $P(\mathbf{e} \mid \mathbf{f})$ model.
corpus             type     HMM limited (Eq. 2)    HMM unlimited (Eq. 4)  IBM Model 4
                            P      R      AER      P      R      AER      P      R      AER
English-Inuktitut  P(f|e)   .4962  .6894  .4513    –      –      –        .4211  .6519  .5162
                   P(e|f)   .5789  .8635  .3856    –      –      –        .5971  .8089  .3749
                   ∩        .8916  .6280  .2251    –      –      –        .8682  .5700  .2801
English-Hindi      P(f|e)   .5079  .4769  .5081    .5057  .4748  .5102    .5219  .4223  .5332
                   P(e|f)   .5566  .4429  .5067    .5566  .4429  .5067    .5652  .3939  .5358
                   ∪        .4408  .5649  .5084    .4365  .5614  .5088    .4543  .5401  .5065
Romanian-English   P(f|e)   .6876  .6233  .3461    .6876  .6233  .3461    .6828  .5414  .3961
                   P(e|f)   .7168  .6217  .3341    .7155  .6205  .3354    .7520  .5496  .3649
                   refined  .7377  .6169  .3281    .7241  .6215  .3311    .7620  .5134  .3865
Table 2: Results on the workshop data. The systems highlighted in bold are the ones that were used in the shared task.
For each corpus, the last row shown represents the results that were actually submitted. Note that for English-Hindi,
our self-reported results in the unlimited task are slightly lower than the original results submitted for the workshop,
which contained an error.
$$d(a_i \mid a_{i-1}) = \begin{cases} |a_i - a_{i-1}|^{\alpha} / Z,\ \alpha < 0 & \text{if } a_i \neq a_{i-1} \\ 1/Z & \text{if } a_i = a_{i-1} \end{cases} \qquad (6)$$
We choose $Z$ to normalize the distribution. We must
optimize $\alpha$ on a development set. This distribution has
a maximum when $a_i - a_{i-1} \in \{-1, 0, 1\}$. Although we
could reasonably choose any of these three values as the
maximum for the initial estimate, we found in development
that the maximum of the surface distortion distribution
varied with $C(e_{a_{i-1}})$, though it was always in the
range $[-1, 2]$.
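The initial distortion estimate of Equation 6 is simple to tabulate. In the minimal sketch below, the function name `initial_distortion` and the truncation of jumps to a fixed window `max_jump` are assumptions made for illustration; in practice the range of possible jumps depends on the sentence length.

```python
def initial_distortion(max_jump, alpha):
    """Return {jump: probability} for jumps in [-max_jump, max_jump].

    A jump is a_i - a_{i-1}. Nonzero jumps get weight |jump|^alpha and a
    zero jump gets weight 1, following Equation 6.
    """
    if alpha >= 0:
        raise ValueError("alpha must be negative")
    weights = {
        jump: 1.0 if jump == 0 else abs(jump) ** alpha
        for jump in range(-max_jump, max_jump + 1)
    }
    z = sum(weights.values())  # the normalizer Z
    return {jump: weight / z for jump, weight in weights.items()}
```

With any negative `alpha`, jumps of -1, 0, and 1 share the maximum probability, and larger jumps decay as a power law.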
2.3 Does NULL Matter in Asymmetric Alignment?
Och and Ney (2000a) introduce a NULL-alignment capability
to the HMM alignment model. This allows any
word $f_j$ to link to a special NULL word, by convention
denoted $e_0$, instead of one of the words $e_1^N$. A link
$(f_j, e_0)$ indicates that $f_j$ does not correspond to any word
in $\mathbf{e}$. This improved alignment performance in the absence
of symmetrization, presumably because it allows
the model to be conservative when evidence for an alignment
is lacking.
We hypothesize that NULL alignment is unnecessary
for asymmetric alignment models when we symmetrize
using intersection-based methods (Och and Ney, 2000b).
The intuition is simple: if we don’t permit NULL align-
ments, then we expect to produce a high-recall, low-
precision alignment; the intersection of two such align-
ments should mainly improve precision, resulting in a
high-recall, high-precision alignment. If we allow NULL
alignments, we may be able to produce a high-precision,
low-recall asymmetric alignment, but symmetrization by
intersection will not improve recall.
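The set operations behind this argument can be sketched as follows. The function name `symmetrize` and the representation of an alignment as a set of `(j, i)` position pairs are assumptions for the example, and both directional alignments are assumed to have been converted to the same (French position, English position) orientation beforehand.

```python
def symmetrize(links_fe, links_ef, method="intersection"):
    """Combine two directional alignments given as sets of (j, i) links."""
    if method == "intersection":
        return links_fe & links_ef
    if method == "union":
        return links_fe | links_ef
    raise ValueError(f"unknown method: {method}")

# Toy example: each direction proposes one link the other does not.
links_fe = {(1, 1), (2, 2), (3, 2)}
links_ef = {(1, 1), (2, 2), (3, 3)}
print(sorted(symmetrize(links_fe, links_ef)))
```

Intersection can only remove links, so it raises precision at the expense of recall, which is why high-recall inputs suit it. The refined method of Och and Ney (2000b) is not covered by this sketch.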
3 Results with the Workshop Data
In our experiments, the dependency parse and parts of
speech are produced by minipar (Lin, 1998). This parser
has been used in a much different alignment model
(Cherry and Lin, 2003). Since we only had parses for
English, we did not use tree distortion in the application
of $P(\mathbf{e} \mid \mathbf{f})$, which is needed for symmetrization.
The parameter settings that we used in aligning the
workshop data are presented in Table 1. Although our
prior work with English and French indicated that in-
tersection was the best method for symmetrization, we
found in development that this varied depending on the
characteristics of the corpus and the type of annotation
(in particular, whether the annotation set included probable
alignments). The results are summarized in Table 2.
It shows results with our HMM model using both Equations
2 and 4 as our distortion model, which represent
the limited and unlimited resource tracks, respectively.
It also includes a comparison with IBM Model 4, for
which we use a training sequence of IBM Model 1 (5
iterations), HMM (6 iterations), and IBM Model 4 (5 it-
erations). This sequence performed well in an evaluation
of the IBM Models (Och and Ney, 2003).
For comparative purposes, we show results of applying
both $P(\mathbf{f} \mid \mathbf{e})$ and $P(\mathbf{e} \mid \mathbf{f})$ prior to symmetrization, along
with results of symmetrization. Comparison of the asymmetric
and symmetric results largely supports the hypothesis
presented in Section 2.3, as our system generally produces
much better recall than IBM Model 4, while offering
a competitive precision. Our symmetrized results
usually produced higher recall and precision, and lower
alignment error rate.
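Table 2 reports precision (P), recall (R), and AER without restating their definitions. The minimal sketch below assumes the standard definitions of Och and Ney (2003) over sure and probable gold links, with the sure links treated as a subset of the probable links; the function name `alignment_scores` and the set-of-pairs representation are hypothetical, and the inputs are assumed to be nonempty.

```python
def alignment_scores(predicted, sure, probable=None):
    """Return (precision, recall, AER) for sets of (j, i) links.

    If no probable links are given, the sure links are used for both roles.
    """
    probable = sure if probable is None else probable | sure
    hits_sure = len(predicted & sure)
    hits_probable = len(predicted & probable)
    precision = hits_probable / len(predicted)
    recall = hits_sure / len(sure)
    aer = 1.0 - (hits_sure + hits_probable) / (len(predicted) + len(sure))
    return precision, recall, aer
```

When an annotation set contains only sure links, AER under these definitions reduces to one minus the balanced F-measure of precision and recall.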
We found that the largest gain in performance came
from the improved initialization. The combined distor-
tion model (Equation 4), which provided a small benefit
over the surface distortion model (Equation 2) on the de-
velopment set, performed slightly worse on the test set.
We found that the dependencies on $C(e_{a_{i-1}})$ and
$T(e_{a_{i-1}})$ were harmful to the $P(\mathbf{f} \mid \mathbf{e})$ alignment for
Inuktitut, and did not submit results for the unlimited resources
configuration. However, we found that alignment was
generally difficult for all models on this particular task,
perhaps due to the agglutinative nature of Inuktitut.
4 Conclusions
We have proposed improvements to the largely over-
looked HMM word alignment model. Our improvements
yield good results on the workshop data. We have addi-
tionally shown that syntactic information can be incorpo-
rated into such a model; although the results are not su-
perior, they are competitive with surface distortion. In fu-
ture work we expect to explore additional parameteriza-
tions of the HMM model, and to perform extrinsic evalu-
ations of the resulting alignments by using them in the pa-
rameter estimation of a phrase-based translation model.
Acknowledgements
This research was supported in part by ONR MURI Con-
tract FCPO.810548265. The authors would like to thank
Bill Byrne, David Chiang, Okan Kolak, and the anony-
mous reviewers for their helpful comments.

References
Yaser Al-Onaizan, Jan Curin, Michael Jahr, Kevin
Knight, John Lafferty, Dan Melamed, Franz Josef Och,
David Purdy, Noah A. Smith, and David Yarowsky.
1999. Statistical machine translation: Final report. In
Johns Hopkins University 1999 Summer Workshop on
Language Engineering.
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della
Pietra, and Robert L. Mercer. 1993. The mathematics
of statistical machine translation: Parameter estima-
tion. Computational Linguistics, 19(2):263–311, Jun.
Colin Cherry and Dekang Lin. 2003. A probability
model to improve word alignment. In ACL Proceed-
ings, Jul.
Herve Dejean, Eric Gaussier, Cyril Goutte, and Kenji
Yamada. 2003. Reducing parameter space for word
alignment. In Proceedings of the Workshop on Build-
ing and Using Parallel Texts: Data Driven Machine
Translation and Beyond, pages 23–26, May.
Dekang Lin. 1998. Dependency-based evaluation of
MINIPAR. In Proceedings of the Workshop on the
Evaluation of Parsing Systems, May.
Rada Mihalcea and Ted Pedersen. 2003. An evaluation
exercise for word alignment. In Proceedings of the
Workshop on Building and Using Parallel Texts: Data
Driven Machine Translation and Beyond, pages 1–10,
May.
Robert C. Moore. 2004. Improving IBM word-
alignment model 1. In ACL Proceedings, pages 519–
526, Jul.
Franz Josef Och and Hermann Ney. 2000a. A compari-
son of alignment models for statistical machine trans-
lation. In COLING Proceedings, pages 1086–1090,
Jul.
Franz Josef Och and Hermann Ney. 2000b. Improved
statistical alignment models. In ACL Proceedings,
pages 440–447, Oct.
Franz Josef Och and Hermann Ney. 2003. A systematic
comparison of various statistical alignment models.
Computational Linguistics, 29(1):19–51.
Franz Josef Och. 1999. An efficient method for deter-
mining bilingual word classes. In EACL Proceedings,
pages 71–76, Jun.
Kristina Toutanova, H. Tolga Ilhan, and Christopher D.
Manning. 2002. Extensions to HMM-based statistical
word alignment models. In EMNLP, pages 87–94, Jul.
Stephan Vogel, Hermann Ney, and Christoph Tillmann.
1996. HMM-based word alignment in statistical machine
translation. In COLING Proceedings, pages
836–841, Aug.
Dekai Wu. 1995. Stochastic inversion transduction
grammars, with application to segmentation, bracket-
ing, and alignment of parallel corpora. In Proceedings
of the 14th International Joint Conference on Artificial
Intelligence, pages 1328–1335, Aug.
Kenji Yamada and Kevin Knight. 2001. A syntax-based
statistical translation model. In ACL Proceedings.
