A Comparison of Alignment Models for Statistical Machine 
Translation 
Franz Josef Och and Hermann Ney 
Lehrstuhl fiir Informatik VI, Comlmter Science Department 
RWTH Aachen - University of Technology 
D-52056 Aachen, Germany 
{och, ney}~inf ormat ik. ruth-aachen, de 
Abstract 
In this paper, we t)resent and compare various align- 
nmnt models for statistical machine translation. We 
propose to measure tile quality of an aligmnent 
model using the quality of the Viterbi alignment 
comt)ared to a manually-produced alignment and de- 
scribe a refined mmotation scheme to produce suit- 
able reference alignments. We also con,pare the im- 
pact of different; alignment models on tile translation 
quality of a statistical machine translation system. 
1 Introduction 
In statistical machine translation (SMT) it is neces- 
sm'y to model the translation probability Pr(fl a Ic~). 
Here .fi' = f denotes tile (15'ench) source and e{ = e 
denotes the (English) target string. Most SMT 
models (Brown et al., 1993; Vogel et al., 1996) 
try to model word-to-word corresl)ondences between 
source and target words using an alignment nmpl)ing 
from source l)osition j to target position i = aj. 
We can rewrite tim t)robal)ility Pr(fille~) t) 3, in- 
troducing the 'hidden' alignments ai 1 := al ...aj...a.l 
(aj C {0,...,/}): 
Pr(f~lel) = ~Pr(fi',a~le{) 
.1 
• j-1 I~ = E H Pr(fj'ajlfi'-"al 'el) 
q, j=l 
To allow fbr French words wlfich do not directly cor- 
respond to any English word an artificial 'empty' 
word c0 is added to the target sentence at position 
i=0. 
The different alignment models we present pro- 
vide different decoInt)ositions of Pr(f~,a~le(). An 
alignnlent 5~ for which holds 
a~ = argmax Pr(fi' , a'l'\[eI) at 
for a specific model is called Viterbi alignment of" 
this model. 
In this paper we will describe extensions to tile 
Hidden-Markov alignment model froln (Vogel et al., 
1.996) and compare tlmse to Models 1 - 4 of (Brown 
et al., 1993). We t)roI)ose to measure the quality of 
an alignment nlodel using the quality of tlle Viterbi 
alignment compared to a manually-produced align- 
ment. This has the advantage that once having pro- 
duced a reference alignlnent, the evaluation itself can 
be performed automatically. In addition, it results in 
a very precise and relia.ble evaluation criterion which 
is well suited to assess various design decisions in 
modeling and training of statistical alignment mod- 
els. 
It, is well known that manually pertbrming a word 
aligmnent is a COlnplicated and ambiguous task 
(Melamed, 1998). Therefore, to produce tlle refer- 
ence alignment we use a relined annotation scheme 
which reduces the complications and mnbiguities oc- 
curring in the immual construction of a word align- 
ment. As we use tile alignment models for machine 
translation purposes, we also evahlate the resulting 
translation quality of different nlodels. 
2 Alignment with HMM 
In the Hidden-Markov alignment model we assume 
a first-order dependence for tim aligmnents aj and 
that the translation probability depends Olfly on aj 
and not Oil (tj_l: 
- ~-' el) =p(ajl.j-,,Z)p(J~l%) Pr(fj,(glf~ ',% , 
Later, we will describe a refinement with a depen- 
dence on e,,j_, iu the alignment model. Putting 
everything together, we have the following basic 
HMM-based modeh 
.1 
*'(flJl~I) = ~ II \[~,(-jla~.-,, z). p(fjl%)\] (1) 
at j=l 
with the alignment I)robability p(ili',I ) and the 
translation probability p(fle). To find a Viterbi 
aligninent for the HMM-based model we resort to 
dynamic progralnming (Vogel et al., 1996). 
The training of tlm HMM is done by the EM- 
algorithm. In the E-step the lexical and alignment 
1086 
counts for one sentenee-i)air (f, e) are calculated: 
c(flc; f, e) = E P"(alf' e) ~ 5(f, f~)5(e, c~) 
a i,j 
,.:(ill', z; f, e) = E/','(air, e) aj) 
a j 
In the M-step the lexicon and translation probabili- 
ties are: 
p(fle) o< ~-~c(fle;f('~),e (~)) 
8 
P(ili',I) o(Ec(ili',I;fO),e(~)) 
8 
To avoid the smlunation ov(;r all possible aligmnents 
a, (Vogel et el., 1996) use the maximum apllroxima- 
tion where only the Viterbi alignlnent t)ath is used to 
collect counts. We used the Baron-Welch-algorithm 
(Baum, 1972) to train the model parameters in out' 
ext)eriments. Theret/y it is possible to t)erti)rm an 
efl-iciellt training using; all aligmnents. 
To make the alignlnenl; t)arameters indo,1)en(lent 
t'ronl absolute word i)ositions we assmne that the 
alignment i)robabilities p(i\[i', I) (lel)end only Oil the 
jmnp width (i - i'). Using a set of non-negative 
t)arameters {c(i -i')}, we can write the alignment 
probabilities ill the fl)rm: 
~'(i - i') (2) p(ili', I) = 
c(,,:" - i') 
This form ensures that for eadl word posilion it, 
i' = 1, ..., I, the aligmnent probat)ilities satis(y th(, 
normalization constraint. 
Extension: refined aligmnent model 
The count table e(i - i') has only 2.1 ......... - 1 en- 
tries. This might be suitable for small corpora, but 
fi)r large corpora it is possil)le to make a more re- 
fine(1 model of Pr(aj ~i-I i-I Ji ,% ,c'~). Est)ecially, we 
analyzed the effect of a det)endence on c,b_ ~ or .fj. 
As a dependence on all English words wouht result 
ill a huge mmflmr of aligmnent 1)arameters we use as 
(Brown et el., 1993) equivalence classes G over tlle 
English and the French words. Here G is a mallping 
of words to (:lasses. This real)ping is trained au- 
tonmtically using a modification of the method de- 
scrilled ill (Kneser and Ney, 1991.). We use 50 classes 
in our exlmriments. The most general form of align- 
ment distribution that we consider in the ItMM is 
p(aj - a.+_, la(%), G(f~), h- 
Extension: empty word 
In the original formulation of the HMM alignment 
model there ix no 'empty' word which generates 
Fren(:h words having no directly aligned English 
word. A direct inchlsion of an eml/ty wor(t ill the 
HMM model by adding all c o as in (Brown et al., 
1.993) is not 1)ossit)le if we want to model the junlp 
distances i - i', as the I)osition i = 0 of tim emt)ty 
word is chosen arbitrarily. Therefore, to introduce 
the eml)ty word we extend the HMM network by I 
empty words ci+ 1.'2I The English word ci has a col 
rest)onding eml)ty word el+ I. The I)osition of the 
eml)ty word encodes the previously visited English 
word. 
We enforce the following constraints for the tran- 
sitions in the HMM network (i _< I, i' _< I): 
p(i + Ili',I) = pff . 5(i,i') 
V(i + Ill' + I, I) = JJ. 5(i,i') 
p(ili' + I, 1) = p(iIi' ,1) 
The parameter pff is the 1)robability of a transition 
to the emt)ty word. In our extleriments we set pIl = 
0.2. 
Smoothing 
For a t)etter estimation of infrequent events we in- 
troduce the following smoothing of alignment t)rob- 
abilities: 
1 F(ajI~j-,,~) = ~" ~- + (1 -,,).p(ajlaj_l ,I) 
in our exlleriments we use (t = 0.4. 
3 Model 1 and Model 2 
l~cl)lacing the (l(~,t)endence on aj-l in the HMM 
alignment mo(M I)y a del)endence on j, we olltain 
a model wlfich (:an lie seen as a zero-order Hid(l(m- 
Markov Model which is similar to Model 2 1)rot)ose(t 
t/y (Brown et al., 1993). Assmning a mfiform align- 
ment prol)ability p(ilj, I) = 1/1, we obtain Model 
1. 
Assuming that the dominating factor in the align- 
ment model of Model 2 is the distance relative to the 
diagonal line of the (j, i) plane the too(tel p(ilj , I) can 
1)e structured as tbllows (Vogel et al., 1996): 
,'(i -, - (3) v(ilj, 5 = 
Ei,=t r('i' l 
This model will be referred to as diagonal-oriented 
Model 2. 
4 Model 3 and Model 4 
Model: The fertility models of (Brown et el., 1993) 
explicitly model the probability l,(¢lc) that the En- 
glish word c~ is aligned to 
4,, = E 
J 
\]~rench words. 
1087 
Model 3 of (Brown et al., 1993) is a zero-order 
alignment model like Model 2 including in addi- 
tion fertility paranmters. Model 4 of (Brown et al., 
1993) is also a first-order alignment model (along 
the source positions) like the HMM, trot includes 
also fertilities. In Model 4 the alignment position 
j of an English word depends on the alignment po- 
sition of tile previous English word (with non-zero 
fertility) j'. It models a jump distance j-j' (for con- 
secutive English words) while in the HMM a jump 
distance i-i' (for consecutive French words) is mod- 
eled. Tile full description of Model 4 (Brown et al., 
1993) is rather complica.ted as there have to be con- 
sidered tile cases that English words have fertility 
larger than one and that English words have fertil- 
ity zero. 
For training of Model 3 and Model 4, we use an 
extension of the program GlZA (A1-Onaizan et al., 
1999). Since there is no efficient way in these mod- 
els to avoid tile explicit summation over all align- 
ments in the EM-algorithin, the counts are collected 
only over a subset of promising alignments. It is not 
known an efficient algorithm to compute the Viterbi 
alignment for the Models 3 and 4. Therefore, the 
Viterbi alignment is comlmted only approximately 
using the method described in (Brown et al., 1993). 
The models 1-4 are trained in succession with the 
tinal parameter values of one model serving as the 
starting point tbr the next. 
A special problein in Model 3 and Model 4 con- 
cerns the deficiency of tile model. This results in 
problems in re-estimation of the parameter which 
describes the fertility of the empty word. In nor- 
real EM-training, this parameter is steadily decreas- 
ing, producing too many aligmnents with tile empty 
word. Therefore we set tile prot)ability for aligning 
a source word with tile emt)ty word at a suitably 
chosen constant value. 
As in tile HMM we easily can extend the depen- 
dencies in the alignment model of Model 4 easily 
using the word class of the previous English word 
E = G(ci,), or the word class of the French word 
F = G(Ij) (Brown et al., 1993). 
5 Including a Manual Dictionary 
We propose here a simple method to make use of 
a bilingual dictionary as an additional knowledge 
source in the training process by extending the train- 
ing corpus with the dictionary entries. Thereby, the 
dictionary is used already in EM-training and can 
improve not only the alignment fox" words which are 
in the dictionary but indirectly also for other words. 
The additional sentences in the training cortms are 
weighted with a factor Fl~x during the EM-training 
of the lexicon probabilities. 
We assign tile dictionary entries which really co- 
occur in the training corpus a high weight Fle.~. and 
the remaining entries a vex'y low weight. In our ex- 
periments we use Flex = 10 for the co-occurring dic- 
tionary entries which is equivalent to adding every 
dictionary entry ten times to the training cortms. 
6 The Alignment Template System 
The statistical machine-translation method descri- 
bed in (Och et al., 1999) is based on a word aligned 
traiifing corIms and thereby makes use of single- 
word based alignment models. Tile key element of 
tiffs apt/roach are the alignment templates which are 
pairs of phrases together with an alignment between 
the words within tile phrases. The advantage of 
the alignment template approach over word based 
statistical translation models is that word context 
and local re-orderings are explicitly taken into ac- 
count. We typically observe that this approach pro- 
duces better translations than the single-word based 
models. The alignment templates are automatically 
trailmd using a parallel trailxing corlms. For more 
information about the alignment template approach 
see (Och et at., 1999). 
7 Results 
We present results on the Verbmobil Task which is 
a speech translation task ill the donmin of appoint- 
nxent scheduling, travel planning, and hotel reserva- 
tion (Wahlster, 1993). 
We measure the quality of tile al)ove inentioned 
aligmnent models with x'espect to alignment quality 
and translation quality. 
To obtain a refereuce aligmnent for evaluating 
alignlnent quality, we manually aligned about 1.4 
percent of onr training corpus. We allowed the hu- 
mans who pertbrmed the alignment to specify two 
different kinds of alignments: an S (sure) a, lignment 
which is used for alignmelxts which are unambigu- 
ously and a P (possible) alignment which is used 
for alignments which might or might not exist. The 
P relation is used especially to align words within 
idiomatic expressions, free translations, and missing 
function words. It is guaranteed that S C P. Figure 
1 shows all example of a manually aligned sentence 
with S and P relations. The hunxan-annotated align- 
ment does not prefer rely translation direction and 
lnay therefore contain many-to-one and one-to-many 
relationships. The mmotation has been performed 
by two annotators, producing sets $1, 1~, S2, P2. 
Tile reference aliglunent is produced by forming the 
intersection of the sure aligmnents (S = $1 rqS2) and 
the ration of the possible atignumnts (P = P1 U P'2). 
Tim quality of an alignment A = { (j, aj) } is mea- 
sured using the following alignment error rate: 
AER(S, P; A) = 1 - IA o Sl + IA o Pl 
IAI + ISl 
1088 
that ......... \[\] 
at ......... \[\] 
....... V1V1. 
leave ....... \[---'l \[-"~ " 
....... liE\]. 
let ....... Cll-1 " 
e ...... • .... 
say ..... • ..... 
would " • ....... 
T .... • ...... 
then" " • ........ 
• \[\] ........ o 
yes • .......... 
-rn I:I '13 O • • -~t ~1 
J~ 
o 
Figure i: Exmnple of a manually annotated align- 
ment with sure (filled dots) and possible commotions. 
Obviously, if we colnpare the sure alignnlents of ev- 
ery sitigle annotator with the reference a.ligmnent we 
obtain an AEI{ of zero percent. 
~\[ifl)le l.: Cort)us characteristics for alignment quality 
experiments. 
Train SenteiH:(is 
Words 
Vocalmlary 
Dictionary Entries 
Words 
Test Sentences 
Words 
German I English 
34 446 
329 625 / 343 076 
5 936 \] 3 505 
4 183 
4 533 I 5 324 
354 
3 109 I 3 233 
Tal)le 1 shows the characteristics of training and 
test corlms used in the alignment quality ext)eri- 
inents. The test cortms for these ext)eriments (not 
for the translation exl)eriments) is 1)art of the train- 
ing corpus. 
Table 2 shows the aligmnent quality of different 
alignment models. Here the alignment models of 
IIMM and Model 4 do not include a dependence 
on word classes. We conclude that more sophisti- 
cated alignment lnodels are crtlcial tbr good align- 
ment quality. Consistently, the use of a first-order 
aligmnent model, modeling an elnpty word and fer- 
tilities result in better alignments. Interestingly, the 
siinl)ler HMM aligninent model outt)erforms Model 
3 which shows the importance of first-order align- 
ment models. The best t)erformanee is achieved 
with Model 4. The improvement by using a dictio- 
nary is small eomI)ared to the effect of using 1)etter 
a.lignmellt models. We see a significant dill'erence 
in alignment quality if we exchange source and tar- 
get s. This is due to the restriction in all 
alignment models that a source  word can 
1)e aligned to at most one target  word. If 
German is source  the t'requelltly occurring 
German word coml)ounds, camlot be aligned cor- 
rectly, as they typically correspond to two or more 
English words. 
WaNe 3 shows the effect of including a det)endence 
on word classes in the aligmnent model of ItMM or 
Model 4. By using word classes the results can be 
Table 3: Eft'cot of including a det)endence on word 
classes in the aligmnent model. 
AER \[%\] 
Det)endencies -IIMM I Model 4 
no 8.0 6.5 
source 7.5 6.0 
target 7.1 6.1 
source ¢ target 7.6 6.1 
improved by 0.9% when using the ItMM and by 0.5% 
when using Model 4. 
For the translation experiments we used a differ- 
ent training and an illdetmndent test corpus (Table 
4). 
Table 4: Corlms characteristics for translation (tual- 
it;.), exlmriments. 
Train 
S ~e,t 
Sentences 
Words 
Vocabulary 
Selltellees 
Words 
PP (trigram LM) 
I German English 
58332 
519523 549921 
7 940 4 673 
147 
1968 2173 
(40.3) 28.8 
For tile evMuation of the translation quality we 
used the automatically comlmtable Word Error Rate 
(WEll.) and the Subjective Sentence Error Rate 
(SSEll,) (Niefien et al., 2000). The WEll, corre- 
spomls to the edit distance t)etween the produced 
translation and one t)redefined reference translation. 
To obtain the SSER the translations are classified by 
human experts into a small number of quality classes 
ranging from "l)ertbet" to "at)solutely wrong". In 
comparison to the WEll,, this criterion is more mean- 
ingflfl, but it is also very exl)ensive to measure. The 
translations are produced by the aligmnent template 
system mentioned in the previous section. 
1089 
Table 2: Alignment error rate (AER \[%\]) of ditl~rent alignment models tbr the translations directions English 
into German (German words have fertilities) and German into English. 
English -+ German German -~ English 
Dictionary no yes no yes 
Empty Word no lYes yes no l yes yes 
Model 1 17.8 16.9 16.0 22.9 21.7 20.3 
Model 2 12.8 12.5 11.7 17.5 17.1 15.7 
Model 2(diag) 11.8 10.5 9.8 16.4 15.1 13.3 
Model 3 10.5 9.3 8.5 15.7 14.5 12.1 
HMM 10.5 9.2 8.0 14.1 12.9 11.5 
Model 4 9.0 7.8 6.5 14.0 12.5 10.8 
Table 5: Effect of different alignment models on 
translation quality. 
Alignlnent Model 
in Training WER\[%\] SSER\[%\] 
Model 1 49.8 22.2 
HMM 47.7 19.3 
Model 4 48.6 16.8 
The results are shown in Table 5. We see a clear 
improvement in translation quality as measured by 
SSER whereas WER is inore or less the same for all 
models. The imwovement is due to better lexicons 
and better alignment templates extracted from the 
resulting aliglunents. 
8 Conclusion 
We have evaluated vm'ious statistical alignment 
models by conlparing the Viterbi alignment of the 
model with a human-made alignment. We have 
shown that by using inore sophisticated models the 
quality of the alignments improves significantly. Fur- 
ther improvements in producing better alignments 
are expected from using the HMM alignment model 
to bootstrap the fertility models, fronl making use of 
cognates, and from statistical alignment models that 
are based on word groups rather than single words. 
Acknowledgment 
This article has been partially supported as 
part of the Verbmobil project (contract nmnber 
01 IV 701 T4) by the German Federal Ministry of 
Education, Science, Research and Technology. 

References 

Y. A1-Onaizan, J. Cur\]n, M. Jahr, K. Knight, J. Laf- 
ferty, I. D. Melamed, F. a. Och, D. Purdy, N. A. 
Smith, and D. Yarowsky. 1999. Statistical ina- 
chine translation, final report, JHU workshop. 
http ://www. clsp. j hu. edu/ws99/proj ects/mt/ 
f inal_report/mr- f inal-report, ps. 

L.E. Baum. 1972. An Inequality and Associated 
Maximization Technique in Statistical Estimation 
for Probabilistie Functions of Markov Processes. 
Inequalities, 3:1 8. 

P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, 
and R. L. Mercer. 1993. The mathenlatics of sta- 
tistical machine trmlslation: Parameter estima- 
tion. Computational Linguistics, 19(2):263-311. 

R. Kneser and H. Ney. 1991. Forming Word Classes 
by Statistical Clustering for Statistical Langm~ge 
Modelling. In 1. Quantitative Linguistics Conf. 

I. D. Melamed. 1998. Manual mmotation of transla- 
tional equivalence: The Blinker project. Technical 
Report 98-07, IRCS. 

S. Niegen, F. J. ()ch, G. Leusch, and H. Ney. 
2000. An evaluation tool \]'or machine translation: 
Fast evaluation for mt research. In Proceedings of 
the Second International Conference on Language 
Resources and Evaluation, pages 39-45, Athens, 
Greece, May June. 

F. J. Och, C. Tilhnalm, mid H. Ney. 1999. Improved 
alignment models for statistical machine transla- 
tion. In In Prec. of the Joint SIGDAT Co~¢ on 
Empirical Methods in Natural Language Process- 
ing and Very LaTye Corpora, pages 20-28, Univer- 
sity of Marylmld, College Park, MD, USA, June. 

S. Vogel, H. Ney, and C. Tilhnann. 1996. HMM- 
based word alignment in statistical translation. 
In COLING '96: The 16th Int. Conf. on Compu- 
tational Linguistics, pages 836-841, Copenhagen, 
August. 

W. Wahlster. 1993. Verbmobil: Translation of face- 
to-face dialogs. In P~vc. of the MT Summit IV, 
pages 127-135, Kobe, Jat)an. 
