In: Proceedings of CoNLL-2000 and LLL-2000, pages 37-42, Lisbon, Portugal, 2000. 
Incorporating Position Information into a Maximum 
Entropy/Minimum Divergence Translation Model 
George Foster 
RALI, Université de Montréal
foster@iro.umontreal.ca
Abstract 
I describe two methods for incorporating infor- 
mation about the relative positions of bilingual 
word pairs into a Maximum Entropy/Minimum 
Divergence translation model. The better of the 
two achieves over 40% lower test corpus perplexity
than an equivalent combination of a trigram
language model and the classical IBM translation
model 2.
1 Introduction 
Statistical Machine Translation (SMT) systems
use a model of p(t|s), the probability that a text
s in the source language will translate into a text
t in the target language, to determine the best
translation for a given source text. A straightforward
way of modeling this distribution is to
apply a chain-rule expansion of the form:
    p(\mathbf{t} \mid \mathbf{s}) = \prod_{i=1}^{|\mathbf{t}|} p(t_i \mid t_1 \ldots t_{i-1}, \mathbf{s}),    (1)
where t_i denotes the ith token in t.[1] The objects
to be modeled in this case belong to the family
of conditional distributions p(w|h_i, s), the probability
of the ith word in t, given the tokens
h_i = t_1 ... t_{i-1} which precede it and the source text.
[1] This ignores the issue of normalization over target
texts of all possible lengths, which can be easily enforced
when desired by using a stop token or a prior distribution
over lengths.
The main motivation for modeling p(t|s) in
terms of p(w|h_i, s) is that it simplifies the "decoding"
problem of finding the most likely target
text. In particular, if h_i is known, finding
the best word at the current position requires
only a straightforward search through the target
vocabulary, and efficient dynamic-programming
based heuristics can be used to extend this to
sequences of words. This is very important for
applications such as TransType (Foster et al.,
1997; Langlais et al., 2000), where the task is
to make real-time predictions of the text a human
translator will type next, based on the
source text under translation and some prefix
of the target text that has already been typed.
The standard "noisy channel" approach used in
SMT, where p(t|s) ∝ p(t)p(s|t), is generally too
expensive for such applications because it does
not permit direct calculation of the probability
of a word or sequence of words beginning at
the current position. Complex and expensive
search strategies are required to find the best
target text in this approach (García-Varea et
al., 1998; Niessen et al., 1998; Och et al., 1999;
Wang and Waibel, 1998).
The challenge in modeling p(w|h_i, s) is to
combine two disparate sources of conditioning
information in an effective way. One obvious
strategy is to use a linear combination of separate
language and translation components, of
the form:
    p(w \mid h_i, \mathbf{s}) = \lambda\, p(w \mid h_i) + (1 - \lambda)\, p(w \mid i, \mathbf{s}),    (2)
where p(w|h_i) is a language model, p(w|i, s) is
a translation model, and λ ∈ [0, 1] is a combining
weight. However, this appears to be
a weak technique (Langlais and Foster, 2000),
even when λ is allowed to depend on various
features of the context (h_i, s).
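As a rough illustration of equation (2), the sketch below mixes two toy distributions over a three-word target vocabulary. The function name `interpolate`, the toy probabilities, and the weight 0.6 are assumptions made for the example, not values from the experiments.

```python
def interpolate(p_lm, p_tm, lam):
    """Mix a language-model and a translation-model distribution as in equation (2).

    p_lm and p_tm map target words to probabilities; lam is the combining weight.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    vocab = set(p_lm) | set(p_tm)
    return {w: lam * p_lm.get(w, 0.0) + (1.0 - lam) * p_tm.get(w, 0.0)
            for w in vocab}


# Toy distributions (hypothetical values) over a three-word target vocabulary.
p_lm = {"la": 0.6, "maison": 0.1, "chat": 0.3}   # p(w|h_i)
p_tm = {"la": 0.2, "maison": 0.7, "chat": 0.1}   # p(w|i, s)
p_mix = interpolate(p_lm, p_tm, lam=0.6)
```

Because both components are normalized and the weights sum to one, the mixture is itself a distribution over the vocabulary; each component can only pull the other's estimate toward its own.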
In previous work (Foster, 2000), I de- 
scribed a Maximum Entropy/Minimum Diver- 
gence (MEMD) model (Berger et al., 1996) 
for p(w|h_i, s) which incorporates a trigram language
model and a translation component which
is an analog of the well-known IBM translation
model 1 (Brown et al., 1993). This model
significantly outperforms an equivalent linear 
combination of a trigram and model 1 in test- 
corpus perplexity, despite using several orders of 
magnitude fewer translation parameters. Like 
model 1, its translation component is based only 
on the occurrences in s of words which are po- 
tential translations for w, and does not take 
into account the positions of these words rel- 
ative to w. An obvious enhancement is to in- 
corporate such positional information into the 
MEMD model, thereby making its translation 
component analogous to the IBM model 2. This 
is the problem I address in this paper. 
2 Models 
2.1 Linear Model 
As a baseline for comparison I used a linear combination
as in (2) of a standard interpolated trigram
language model and the IBM translation
model 2 (IBM2), with the combining weight λ
optimized using the EM algorithm. IBM2 is derived
as follows:[2]
    p(w \mid i, \mathbf{s}) = \sum_{j=0}^{l} p(w, j \mid i, \mathbf{s}) = \sum_{j=0}^{l} p(w \mid s_j)\, p(j \mid i, l)
where l = |s|, and the hidden variable j gives
the position in s of the (single) source token s_j
assumed to give rise to w, or 0 if there is none.
The model consists of a set of word-pair parameters
p(t|s) and position parameters p(j|i, l); in
model 1 (IBM1) the latter are fixed at 1/(l + 1),
as each position, including the empty position
0, is considered equally likely to contain a translation
for w. Maximum likelihood estimates for
these parameters can be obtained with the EM
algorithm over a bilingual training corpus, as
described in (Brown et al., 1993).
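The IBM2 sum over source positions can be sketched as follows. This is only an illustration of the formula: the table layout, the function names, the `peaked_position` distribution, and all numeric values are hypothetical, and parameter estimation with EM is not shown.

```python
def ibm2_word_prob(w, i, source, word_pair, position):
    """Return p(w|i, s) = sum over j = 0..l of p(w|s_j) * p(j|i, l)."""
    l = len(source)
    tokens = [None] + list(source)   # index 0 stands for the empty position
    return sum(word_pair.get((w, tokens[j]), 0.0) * position(j, i, l)
               for j in range(l + 1))


def uniform_position(j, i, l):
    """IBM1 special case: all l + 1 positions are equally likely."""
    return 1.0 / (l + 1)


def peaked_position(j, i, l):
    """A made-up IBM2-style table favouring j == i (valid for 1 <= i <= l)."""
    return 0.9 if j == i else 0.1 / l


# Toy parameters (hypothetical values); keys are (target word, source word).
word_pair = {("la", "the"): 0.5, ("maison", "house"): 0.8, ("la", None): 0.1}
source = ["the", "house"]
p_ibm1 = ibm2_word_prob("maison", 2, source, word_pair, uniform_position)
p_ibm2 = ibm2_word_prob("maison", 2, source, word_pair, peaked_position)
```

With the uniform table the source word "house" contributes only a third of its word-pair probability, whereas a position table that favours the diagonal lets it contribute most of it.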
[2] Model 2 was originally formulated for p(t|s), but
since target words are predicted independently it can
also be used for p(w|h_i, s). The only necessary modification
in this case is that the position parameters can no
longer be conditioned on |t|.
2.2 MEMD Model 1
A MEMD model for p(w|h_i, s) has the general
form:
    p(w \mid h_i, \mathbf{s}) = \frac{q(w \mid h_i, \mathbf{s}) \exp(\alpha \cdot f(w, h_i, \mathbf{s}))}{Z(h_i, \mathbf{s})}
where q(w|h_i, s) is a reference distribution,
f(w, h_i, s) maps (w, h_i, s) into an n-dimensional
feature vector, α is a corresponding vector of
feature weights (the parameters of the model),
and Z(h_i, s) = Σ_w q(w|h_i, s) exp(α · f(w, h_i, s)) is
a normalizing factor. For a given choice of q
and f, the IIS algorithm (Berger et al., 1996)
can be used to find maximum likelihood values
for the parameters α. It can be shown (Della
Pietra et al., 1995) that these are also the
values which minimize the Kullback-Leibler divergence
D(p||q) between the model and the
reference distribution under the constraint that
the expectations of the features (i.e., the components
of f) with respect to the model must equal
their expectations with respect to the empirical
distribution derived from the training corpus.
Thus the reference distribution serves as a kind 
of prior, and should reflect some initial knowl- 
edge about the true distribution; and the use 
of any feature is justified to the extent that its 
empirical expectation is accurate. 
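The role of the reference distribution as a prior can be seen in a small sketch of the general MEMD form for a single context. The function `memd_distribution`, the toy vocabulary, and the single feature are illustrative assumptions, not part of the model described here.

```python
import math


def memd_distribution(q, features, alpha):
    """Return p(w) proportional to q(w) * exp(alpha . f(w)) for one context.

    q maps each target word to its reference probability in this context,
    features(w) returns the feature vector of w, and alpha holds the weights.
    """
    scores = {w: q_w * math.exp(sum(a * f for a, f in zip(alpha, features(w))))
              for w, q_w in q.items()}
    z = sum(scores.values())   # normalizing factor Z for this context
    return {w: score / z for w, score in scores.items()}


# Toy reference distribution and a single boolean feature (hypothetical).
q = {"la": 0.5, "maison": 0.2, "chat": 0.3}
features = lambda w: [1.0 if w == "maison" else 0.0]

p_zero = memd_distribution(q, features, alpha=[0.0])   # identical to q
p_boost = memd_distribution(q, features, alpha=[1.5])  # shifts mass to "maison"
```

With all weights at zero the model reduces to q, so the features only ever describe departures from the reference distribution.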
In the present context, the natural choice for 
the reference distribution q is a trigram lan- 
guage model. To create a MEMD analog to 
IBM model 1 (MEMD1), I used boolean fea- 
tures corresponding to bilingual word pairs: 
    f_{st}(w, \mathbf{s}) = \begin{cases} 1, & s \in \mathbf{s} \text{ and } t = w \\ 0, & \text{else} \end{cases}
where (s, t) is a (source, target) word pair. Using
the notational convention that α_st is 0 whenever
the corresponding feature f_st does not exist in
the model, MEMD1 can be written compactly
as:
    p(w \mid h_i, \mathbf{s}) = q(w \mid h_i) \exp\left(\sum_{s \in \mathbf{s}} \alpha_{sw}\right) / Z(h_i, \mathbf{s}).
Due to the theoretical properties of MEMD 
outlined above, it is necessary to select a sub- 
set of all possible features fst to avoid overfitting 
the training corpus. Using a reduced feature set 
is also computationally advantageous, since the 
time taken to calculate the normalization con- 
stant Z(hi, s) grows linearly with the expected 
number of features which are active per source 
word s ∈ s. This is in contrast to IBM1, where
use of all available word-pair parameters p(t|s)
is standard, and engenders only a very slight 
overfitting effect. In (Foster, 2000) I describe an 
effective technique for selecting MEMD word- 
pair features. 
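A minimal sketch of the MEMD1 computation for one context follows, assuming the trigram probabilities q(w|h_i) are already available as a dictionary. The names and toy values are hypothetical; feature selection and IIS training are not shown.

```python
import math


def memd1_distribution(q, source, alpha):
    """MEMD1: p(w|h_i, s) proportional to q(w|h_i) * exp(sum of alpha[(s, w)]).

    q maps target words to trigram probabilities q(w|h_i); alpha holds weights
    only for the selected word-pair features, so missing pairs contribute 0.
    """
    source_words = set(source)   # boolean features: s occurs somewhere in s
    scores = {w: q_w * math.exp(sum(alpha.get((s, w), 0.0) for s in source_words))
              for w, q_w in q.items()}
    z = sum(scores.values())     # Z(h_i, s): one pass over the target vocabulary
    return {w: score / z for w, score in scores.items()}


# Toy trigram probabilities and feature weights (hypothetical values).
q = {"la": 0.5, "maison": 0.2, "chat": 0.3}
alpha = {("house", "maison"): 2.0, ("the", "la"): 0.5}
p = memd1_distribution(q, ["the", "house"], alpha)
```

Target words with no active feature keep their trigram score before normalization, which is why a sparse feature set keeps the cost of computing Z low.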
2.3 MEMD Model 2 
IBM2 incorporates position information by introducing
a hidden position variable and making
independence hypotheses. This approach is
not applicable to MEMD models, whose features
must capture events which are directly
observable in the training corpus.[3] It would be
possible to use pure position features of the form
f_ijl, which capture the presence of any word
pair at position (i, j, l) and are superficially similar
to IBM2's position parameters, but these
would add almost no information to MEMD1.
On the other hand, features like f_stijl, indicating
the presence of a specific pair (s, t) at position
(i, j, l), would cause severe data sparseness
problems.
Encoding Positions as Feature Values 
A simple solution to this dilemma is to let the
value of a word-pair feature reflect the current
position of the pair rather than just its presence or
absence. A reasonable choice for this is the
value of the corresponding IBM2 position parameter
p(j|i, l):
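    f_{st}(w, i, \mathbf{s}) = \begin{cases} p(\hat{j}_s \mid i, l), & s \in \mathbf{s} \text{ and } t = w \\ 0, & \text{else} \end{cases}
where ĵ_s is the position of s in s, or the most
likely position according to IBM2 if it occurs
more than once: ĵ_s = argmax_{j: s_j = s} p(j|i, l). Using
the same convention as in the previous section,
the resulting model (MEMD2R) can be
written:
    p(w \mid h_i, \mathbf{s}) = \frac{q(w \mid h_i) \exp\left(\sum_{s \in \mathbf{s}} \alpha_{sw}\, p(\hat{j}_s \mid i, l)\right)}{Z(h_i, \mathbf{s})}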
MEMD2R is simple and compact but poses a
technical difficulty due to its use of real-valued
features, in that the IIS training algorithm requires
integer or boolean features for efficient
implementation. Since likelihood is a concave
function of α, any hillclimbing method such as
gradient ascent[4] is guaranteed to find maximum
likelihood parameter values, but convergence is
slower than IIS and requires tuning a gradient
step parameter. Unfortunately, apart from this
problem, MEMD2R also turns out to perform
slightly worse than MEMD1, as described below.
[3] Although it is possible to extend the basic framework
to allow for embedded Hidden Markov Models (Lafferty,
1995).
[4] I found that the "stochastic" variant of this algorithm,
in which model parameters are updated after each
training example, gave the best performance.
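The following sketch shows what one stochastic gradient-ascent update for MEMD2R might look like on a single training example. It relies only on the standard gradient of the conditional log-likelihood (observed feature value minus expected feature value); the function names, the dictionary layout, the step size, and the toy numbers are assumptions, and this is not the training code used in the experiments.

```python
import math


def memd2r_distribution(q, position_prob, alpha):
    """MEMD2R for one context; position_prob[s] is the IBM2 value p(j_s|i, l)."""
    scores = {w: q_w * math.exp(sum(alpha.get((s, w), 0.0) * p
                                    for s, p in position_prob.items()))
              for w, q_w in q.items()}
    z = sum(scores.values())
    return {w: score / z for w, score in scores.items()}


def sgd_step(q, position_prob, alpha, observed, step):
    """Update alpha in place with one gradient-ascent step on log p(observed)."""
    probs = memd2r_distribution(q, position_prob, alpha)
    for (s, t) in alpha:
        value = position_prob.get(s, 0.0)   # feature value when w == t
        empirical = value if t == observed else 0.0
        expected = probs.get(t, 0.0) * value
        alpha[(s, t)] += step * (empirical - expected)


# Toy context (hypothetical values): the observed target word is "maison".
q = {"la": 0.5, "maison": 0.3, "chat": 0.2}
position_prob = {"the": 0.1, "house": 0.7}
alpha = {("house", "maison"): 0.0, ("house", "la"): 0.0}

before = memd2r_distribution(q, position_prob, alpha)["maison"]
sgd_step(q, position_prob, alpha, observed="maison", step=1.0)
after = memd2r_distribution(q, position_prob, alpha)["maison"]
```

The step size plays the part of the gradient step parameter mentioned above: too large a value can overshoot, too small a value slows convergence.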
Using Class-based Position Features 
Since the basic problem with incorporating po- 
sition information is one of insufficient data, a 
natural solution is to try to group word pair and 
position combinations with similar behaviour 
into classes such that the frequency of each 
class in the training corpus is high enough for 
reliable estimation. To do this, I made two 
preliminary assumptions: 1) word pairs with 
similar MEMD1 weights should be grouped to- 
gether; and 2) position configurations with sim- 
ilar IBM2 probabilities should be grouped to- 
gether. This converts the problem from one 
of finding classes in the five-dimensional space 
(s, t, i, j, l) to one of identifying rectangular areas
on a 2-dimensional grid where one axis contains
position configurations (i, j, l), ordered by
p(j|i, l); and the other contains word pairs (s, t),
ordered by α_st. To simplify further, I partitioned
both axes so as to approximately balance
the total corpus frequency of all word pairs
or position configurations within each parti- 
tion. Thus the only parameters required to com- 
pletely specify a classification are the number of 
position and word-pair partitions. Each combi- 
nation of a position partition and a word pair 
partition corresponds to a class, and all classes 
can be expected to have roughly the same em- 
pirical counts. 
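One way to build such frequency-balanced partitions is sketched below. The exact balancing procedure is not specified above, so the cut rule used here (assign each item by the corpus mass accumulated before it) is an assumption, as are the function name and the toy counts.

```python
def balanced_partition(items, num_parts):
    """Split items into num_parts contiguous classes of roughly equal count mass.

    items is a list of (key, sort_value, count) with integer counts; the result
    maps each key to a partition index in 0..num_parts-1.
    """
    ordered = sorted(items, key=lambda item: item[1])
    total = sum(count for _, _, count in ordered)
    assignment, cumulative = {}, 0
    for key, _, count in ordered:
        # The index depends on the corpus mass accumulated before this item.
        assignment[key] = min(num_parts - 1, cumulative * num_parts // total)
        cumulative += count
    return assignment


# Position configurations (i, j, l) with p(j|i, l) and corpus counts (toy values).
positions = [((1, 1, 2), 0.6, 40), ((1, 2, 2), 0.3, 30),
             ((2, 1, 2), 0.2, 20), ((2, 2, 2), 0.7, 10)]
# Word pairs (s, t) with MEMD1 weights and corpus counts (toy values).
pairs = [(("house", "maison"), 2.0, 60), (("the", "la"), 0.5, 140)]

position_class = balanced_partition(positions, num_parts=2)
pair_class = balanced_partition(pairs, num_parts=2)
# A class is the combination of one position partition and one word-pair partition.
example_class = (position_class[(1, 1, 2)], pair_class[("house", "maison")])
```

In this toy case each position partition carries half of the total count, and a class is identified by the pair of partition indices.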
The model (MEMD2B) based on this scheme 
has one feature for each class; if A designates the 
set of triples (i, j, l) in a position partition and 
B designates the set of pairs (s, t) in a word-pair 
partition, then for all A, B there is a feature: 
    f_{A,B}(w, i, \mathbf{s}) = \sum_{j=1}^{l} \delta[(i, j, l) \in A \wedge (s_j, w) \in B \wedge j = \hat{j}_{s_j}],
where δ[X] is 1 when X is true and 0 otherwise.
For robustness, I used these position features
along with pure MEMD1-style word-pair
features f_st. The weights α_{A,B} on the position
features can thus be interpreted as correction
terms for the pure word-pair weights α_st which
reflect the proximity of the words in the pair.
The model is:
    p(w \mid h_i, \mathbf{s}) = \frac{q(w \mid h_i) \exp\left(\sum_{s \in \mathbf{s}} \alpha_{sw} + \alpha_{A(i, \hat{j}_s, l), B(s, t)}\right)}{Z(h_i, \mathbf{s})}
where A(i, ĵ_s, l) gives the partition for the current
position, B(s, t) gives the partition for the
current word pair, and following the usual convention,
α_{A(i, ĵ_s, l), B(s, t)} is zero if these are undefined.
To find the optimal number of position partitions
m and word-pair partitions n, I performed
a greedy search, beginning at a small initial
point (m, n) and at each iteration training
two MEMD2B models characterized by (km, n)
and (m, kn), where k > 1 is a scaling factor
(note that both these models contain kmn position
features). The model which gives the
best performance on a validation corpus is used
as the starting point for the next iteration.
Since training MEMD models is very expensive,
to speed up the search I relaxed the convergence
criterion from a training corpus perplexity[5]
drop of < .1% (requiring 20-30 IIS iterations)
to < .6% (requiring approximately 10
IIS iterations). I stopped the search when the
best model's perplexity on the validation corpus
did not decrease significantly from that of
the model at the previous step, indicating that
overtraining was beginning to occur.
[5] Defined in the next section.
3 Results
I tested the models on the Canadian Hansard
corpus, with English as the source language
and French as the target language. After sentence
alignment using the method described
in (Simard et al., 1992), the corpus was split
into disjoint segments as shown in table 1.
segment      file pairs   sentence pairs   English tokens   French tokens
train               922        1,639,250       29,547,936      31,826,112
held-out 1           30           54,758          978,394       1,082,350
held-out 2           30           59,435        1,111,454       1,241,581
test                 30           53,676          984,809       1,103,320
Table 1: Corpus segmentation. The train segment was the main training corpus; the held-out 1
segment was used for combining weights for the trigram and the overall linear model; and the
held-out 2 segment was used for the MEMD2B partition search.
To evaluate performance, I used perplexity:
p(T|S)^{-1/|T|}, where p is the model being evaluated,
and (S, T) is the test corpus, considered
to be a set of statistically independent sentence
pairs (s, t). Perplexity is a good indicator of
performance for the TransType application described
in the introduction, and it has also been
used in the evaluation of full-fledged SMT systems
(Al-Onaizan et al., 1999). To ensure a fair
comparison, all models used the same target vocabulary.
For all MEMD models, I used 20,000
word-pair features selected using the method
described in (Foster, 2000); this is suboptimal
but gives reasonably good performance and facilitates
experimentation.
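The perplexity measure p(T|S)^(-1/|T|) can be computed from per-token log probabilities, since the chain rule and the independence of sentence pairs turn log p(T|S) into a sum over all target tokens. The sketch below assumes natural logarithms and a flat list of token log probabilities; the function name and the toy input are illustrative.

```python
import math


def perplexity(token_log_probs):
    """Return exp(-(1/|T|) * sum of log p(t_i | h_i, s)) over all target tokens."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)


# Toy check: a model that assigns probability 1/20 to every token.
uniform_log_probs = [math.log(1.0 / 20.0)] * 5
ppl = perplexity(uniform_log_probs)
```

A model that spreads its probability evenly over 20 choices at every position has perplexity 20, which gives an intuitive reading of the figures reported in this section.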
Figures 1 and 2 show, respectively, the path 
taken by the MEMD2B partition search, and 
the validation corpus perplexities of each model 
tested during the search. As shown in figure 1, 
the search consisted of 6 iterations. Since on all 
previous iterations no increase in position parti- 
tions beyond the initial value of 10 was selected, 
on the 5th iteration I tried decreasing the num- 
ber of position partitions to 5. This model was 
not selected either, so on the final step only the 
number of word-pair partitions was augmented, 
yielding an optimal combination of 10 position 
partitions and 4000 word-pair partitions. 
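The greedy partition search can be sketched as a short loop. Here `evaluate` is a hypothetical callback standing in for training a MEMD2B model with m position partitions and n word-pair partitions and returning its validation perplexity; the fixed scaling factor `k`, the threshold `min_gain`, and the toy perplexity surface are assumptions. The sketch also omits the extra step, described above, of trying a smaller number of position partitions.

```python
def greedy_partition_search(evaluate, m, n, k=2, min_gain=0.0):
    """Greedily grow (m, n) while validation perplexity keeps dropping.

    Returns the list of accepted (m, n, perplexity) points, starting point first.
    """
    best = evaluate(m, n)
    path = [(m, n, best)]
    while True:
        # Both candidates have k * m * n position features.
        candidates = [(evaluate(k * m, n), k * m, n),
                      (evaluate(m, k * n), m, k * n)]
        ppl, cand_m, cand_n = min(candidates)
        if best - ppl <= min_gain:   # no significant improvement: stop
            break
        m, n, best = cand_m, cand_n, ppl
        path.append((m, n, best))
    return path


def toy_perplexity(m, n):
    """Made-up validation surface: more word-pair classes help, m is best at 10."""
    return 19.0 + 5.0 / n + 0.01 * abs(m - 10)


path = greedy_partition_search(toy_perplexity, m=10, n=10, k=2, min_gain=0.01)
```

On this toy surface the search never increases m and stops once doubling n no longer buys a gain above the threshold, which mirrors the shape of the search path reported here.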
Table 2 gives the final results for all mod- 
els. The IBM models tested here incorporate 
a reduced set of 1M word-pair parameters, se- 
lected using the method described in (Foster, 
2000), which gives slightly better test-corpus 
performance than the unrestricted set of all 35M 
word pairs which cooccur within aligned sen- 
tence pairs in the training corpus. 
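The relative improvements in Table 2 follow directly from the perplexities, taking trigram + IBM2 as the baseline. The short check below recomputes them; the dictionary keys and the helper name `relative_drop` are just labels chosen for this example.

```python
# Test-corpus perplexities as reported in Table 2.
perplexities = {
    "trigram + IBM1": 43.2,
    "trigram + IBM2": 35.2,
    "MEMD1": 24.5,
    "MEMD2R": 28.4,
    "MEMD2B 10 x 10": 22.1,
    "MEMD2B 10 x 4000": 20.2,
}


def relative_drop(reference, value):
    """Percentage by which value is lower than reference."""
    return 100.0 * (reference - value) / reference


baseline = perplexities["trigram + IBM2"]
improvement = {name: round(relative_drop(baseline, ppl), 1)
               for name, ppl in perplexities.items()
               if name.startswith("MEMD")}

# Pairwise comparisons between models with and without position information.
ibm1_to_ibm2 = round(relative_drop(perplexities["trigram + IBM1"], baseline), 1)
memd1_to_memd2b = round(relative_drop(perplexities["MEMD1"],
                                      perplexities["MEMD2B 10 x 4000"]), 1)
```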
The basic MEMD1 model (without position
parameters) attains about 30% lower perplexity
than the model 2 baseline, and MEMD2B
with an optimal-sized set of position parameters
achieves a further drop of over 10%.
Interestingly, the difference between IBM1 and
IBM2's performance (18.5% lower perplexity for
IBM2) is about the same as the difference between
MEMD1 and MEMD2B (17.6% lower for
MEMD2B).
model             word-pair parameters   position parameters   perplexity   improvement over baseline
trigram                              0                     0         61.0
trigram + IBM1               1,000,000                     0         43.2
trigram + IBM2               1,000,000               115,568         35.2   0%
MEMD1                           20,000                     0         24.5   30.4%
MEMD2R                          20,000                     0         28.4   19.3%
MEMD2B                          20,000               10 x 10         22.1   37.2%
MEMD2B                          20,000             10 x 4000         20.2   42.6%
Table 2: Model performances. Linear interpolation is designated with a + sign; and the MEMD2B
position parameters are given as m x n, where m and n are the numbers of position partitions and
word-pair partitions respectively.
[Plot: x axis, number of position partitions (5, 10, 20, 50); y axis ticks 250, 500, 1000, 2000, 4000.]
Figure 1: MEMD2B partition search path, beginning
at the point (10, 10). Arrows out of each
point show the configurations tested at each iteration.
[Plot: x axis, word-pair classes (50 to 4000); y axis ticks 19.2 to 20.6; one line each for 5, 10, 20, and 50 position classes.]
Figure 2: Validation corpus perplexities for various
MEMD2B models. Each connected line in
this graph corresponds to a vertical column of
search points in figure 1.
4 Conclusion
This paper deals with the problem of incorporating
information about the positions of bilingual
word pairs into a MEMD model which is
analogous to the classical IBM model 1, thereby
creating a MEMD analog to the IBM model 2. I
proposed and evaluated two methods for accomplishing
this: using IBM2 position parameter
probabilities as MEMD feature values, which
was unsuccessful; and adding features which
capture the occurrence of a word-pair with a
MEMD1 weight that falls into a specific range
of values at a position to which IBM2 assigns
a probability in a certain range. The second
model achieved over 40% lower test perplexity
than a linear combination of a trigram and
IBM2, despite using several orders of magnitude
fewer parameters.
This work represents a novel approach to 
translation modeling which is most appropriate 
for applications like TransType which need to 
make rapid predictions of upcoming text. How- 
ever, it is not inconceivable that it could also 
be used for full-fledged MT. One partial impediment
to this is that the MEMD framework lacks
a mechanism equivalent to the EM algorithm
for estimating probabilities associated with hid- 
den variables. The solution I have proposed 
here can be seen as a first step to investigat- 
ing ways of getting around this problem. 
Acknowledgements 
This work was carried out as part of the 
TransType project at RALI, funded by the Nat- 
ural Sciences and Engineering Research Council 
of Canada. 

References 
Yaser Al-Onaizan, Jan Curin, Michael Jahr, Kevin
Knight, John Lafferty, Dan Melamed, Franz-Josef 
Och, David Purdy, Noah A. Smith, and David 
Yarowsky. 1999. Statistical machine translation: 
Final report, JHU workshop 1999. Technical 
report, The Center for Language and Speech 
Processing, The Johns Hopkins University, 
www.clsp.jhu.edu/ws99/projects/mt/final_report. 
Adam L. Berger, Stephen A. Della Pietra, and Vin- 
cent J. Della Pietra. 1996. A Maximum Entropy 
approach to Natural Language Processing. Com- 
putational Linguistics, 22(1):39-71. 
Peter F. Brown, Stephen A. Della Pietra, Vincent
J. Della Pietra, and Robert L. Mercer. 1993.
The mathematics of Machine Translation: Pa- 
rameter estimation. Computational Linguistics, 
19(2):263-312, June. 
S. Della Pietra, V. Della Pietra, and J. Lafferty. 
1995. Inducing features of random fields. Tech- 
nical Report CMU-CS-95-144, CMU. 
George Foster, Pierre Isabelle, and Pierre Plamon- 
don. 1997. Target-text Mediated Interactive Ma- 
chine Translation. Machine Translation, 12:175- 
194. 
George Foster. 2000. A Maximum Entropy / Min- 
imum Divergence translation model. In Proceed- 
ings of the 38th Annual Meeting of the Association 
for Computational Linguistics (ACL-38), Hong 
Kong, October. 
Ismael García-Varea, Francisco Casacuberta, and
Hermann Ney. 1998. An iterative, DP-based 
search algorithm for statistical machine trans- 
lation. In Proceedings of the 5th International 
Conference on Spoken Language Processing (IC- 
SLP) 1998, Sydney, Australia, December. pages 
1135-1138. 
John D. Lafferty. 1995. Gibbs-Markov models. In
Computing Science and Statistics: Proceedings of 
the 27th Symposium on the Interface. Interface 
Foundation. 
Ph. Langlais and G. Foster. 2000. Using context- 
dependent interpolation to combine statistical 
language and translation models for interactive
MT. In Content-Based Multimedia Information 
Access (RIAO), Paris, France, April. 
Ph. Langlais, G. Foster, and G. Lapalme. 2000. Unit 
completion for a computer-aided translation typ- 
ing system. In Proceedings of the 5th Conference 
on Applied Natural Language Processing (ANLP- 
5), Seattle, Washington, May. 
S. Niessen, S. Vogel, H. Ney, and C. Tillmann. 
1998. A DP based search algorithm for statistical 
machine translation. In Proceedings of the 36th 
Annual Meeting of the Association for Computational
Linguistics (ACL) and 17th International
Conference on Computational Linguistics (COLING)
1998, pages 960-967, Montréal, Canada,
August. 
Franz Josef Och, Christoph Tillmann, and Hermann
Ney. 1999. Improved alignment models for statis- 
tical machine translation. In Proceedings of the 
~nd Conference on Empirical Methods in Natu- 
ral Language Processing (EMNLP), College Park, 
Maryland. 
Michel Simard, George F. Foster, and Pierre Is- 
abelle. 1992. Using cognates to align sentences in 
bilingual corpora. In Proceedings of the 4th Conference
on Theoretical and Methodological Issues
in Machine Translation (TMI), Montréal,
Québec.
Ye-yi Wang and Alex Waibel. 1998. Fast decod- 
ing for statistical machine translation. In Proceed- 
ings of the 5th International Conference on Spo- 
ken Language Processing (ICSLP) 1998, Sydney, 
Australia, December, pages 2775-2778. 
