Maximum Entropy Model Learning of the Translation Rules 
Kengo Sato and Masakazu Nakanishi 
Department of Computer Science 
Keio University 
3-14-1, Hiyoshi, Kohoku, Yokohama 223-8522, Japan 
e-mail: {satoken, czl}@nak.ics.keio.ac.jp
Abstract 
This paper proposes a method for learning translation rules from parallel corpora. The method applies the maximum entropy principle to a probabilistic model of translation rules. First, we define feature functions that express statistical properties of this model. Next, in order to optimize the model, the system iterates the following steps: (1) select the feature function that maximizes the log-likelihood, and (2) add this function to the model incrementally. Since the computational cost associated with this model is prohibitive, we propose several methods to suppress the overhead in order to realize the system. Experimental results show that the method attained a 69.54% recall rate.
1 Introduction 
Statistical natural language modeling can be viewed as estimating a probability distribution $X \times Y \to [0,1]$ from training data $\langle x_1, y_1 \rangle, \ldots, \langle x_T, y_T \rangle \in X \times Y$ observed in corpora. For this problem, Baum (1972) proposed the EM algorithm, which was the basis of the Forward-Backward algorithm for the hidden Markov model (HMM) and of the Inside-Outside algorithm (Lafferty, 1993) for the probabilistic context-free grammar (PCFG). However, these methods suffer from high optimization costs caused by their large number of parameters. Therefore, estimating natural language models based on the maximum entropy (ME) method (Pietra et al., 1995; Berger et al., 1996) has attracted attention recently.
On the other hand, dictionaries for multilingual natural language processing tasks such as machine translation have usually been constructed by hand. Since this work requires a great deal of labor and it is difficult to keep dictionary entries consistent, research on automatically building dictionaries for machine translation (translation rules) from corpora has become active recently (Kay and Röscheisen, 1993; Kaji and Aizono, 1996).

In this paper, we observe that estimating a language model based on the ME method is suitable for learning translation rules, and we propose several methods to resolve the problems that arise when adapting the ME method to this task.
2 Problem Setting 
If there exist $\langle x_1, y_1 \rangle, \ldots, \langle x_T, y_T \rangle \in X \times Y$ such that each $x_i$ is translated into $y_i$ in the parallel corpora $X, Y$, then the empirical probability distribution $\tilde{p}$ obtained from the observed training data is defined by:

  \tilde{p}(x, y) = \frac{c(x, y)}{\sum_{x, y} c(x, y)}    (1)

where $c(x, y)$ is the number of times that $x$ is translated into $y$ in the training data.
However, since it is difficult to observe actual translation correspondences between words, $c(x, y)$ is approximated by equation (2) for sentence-aligned parallel corpora:

  c(x, y) = \sum_{i=1}^{T} c_i(x, y)    (2)

where $X_i$ is the $i$-th sentence in $X$, sentence $X_i$ is translated into sentence $Y_i$ in the aligned parallel corpora, and $c_i(x, y)$ is the number of times that $x$ and $y$ appear together in the $i$-th sentence pair.
Our task is to learn the translation rules by estimating the probability distribution $p(y|x)$ that $x \in X$ is translated into $y \in Y$ from the $\tilde{p}(x, y)$ given above.
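As a concrete illustration of equations (1) and (2), the following Python sketch computes the empirical distribution from a sentence-aligned corpus; the pair-of-token-lists input format and the function name are our own assumptions, not part of the original system.

```python
from collections import Counter
from itertools import product

def empirical_distribution(aligned_pairs):
    """Approximate p~(x, y) from sentence-aligned corpora (equations (1)-(2)).
    aligned_pairs: list of (source_tokens, target_tokens) sentence pairs."""
    c = Counter()
    for src_tokens, tgt_tokens in aligned_pairs:
        # c_i(x, y): co-occurrences of x and y within the i-th sentence pair
        for x, y in product(src_tokens, tgt_tokens):
            c[(x, y)] += 1
    total = sum(c.values())  # normalizing constant of equation (1)
    return {pair: count / total for pair, count in c.items()}
```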
3 Maximum Entropy Method 
3.1 Feature Function 
We define a binary-valued indicator function $f : X \times Y \to \{0, 1\}$ which divides $X \times Y$ into two subsets. This is called a feature function, and it expresses a statistical property of a language model.

The expected value of $f$ with respect to $\tilde{p}(x, y)$ is defined as:

  \tilde{p}(f) = \sum_{x, y} \tilde{p}(x, y) f(x, y)    (3)

Thus the training data are summarized by the expected values of the feature functions $f$.
The expected value of a feature function $f$ with respect to the $p(y|x)$ we would like to estimate is defined as:

  p(f) = \sum_{x, y} \tilde{p}(x) p(y|x) f(x, y)    (4)

where $\tilde{p}(x)$ is the empirical probability distribution on $X$. The model we would like to estimate is then constrained to satisfy:

  p(f) = \tilde{p}(f)    (5)

This is called the constraint equation.
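The two expectations in equations (3) and (4) can be computed directly once the empirical distributions are available. The sketch below is a minimal illustration assuming feature functions are plain Python callables and the conditional model is a nested dictionary; none of these interfaces come from the paper.

```python
def empirical_expectation(p_tilde_xy, f):
    """p~(f) = sum_{x,y} p~(x,y) f(x,y)  -- equation (3)."""
    return sum(prob * f(x, y) for (x, y), prob in p_tilde_xy.items())

def model_expectation(p_tilde_x, p_y_given_x, f):
    """p(f) = sum_{x,y} p~(x) p(y|x) f(x,y)  -- equation (4).
    p_y_given_x maps each x to a dict {y: p(y|x)}."""
    return sum(p_tilde_x[x] * py * f(x, y)
               for x, cond in p_y_given_x.items()
               for y, py in cond.items())
```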
3.2 Maximum Entropy Principle 
When there are feature functions $f_i$ ($i \in \{1, 2, \ldots, n\}$) which are important to the modeling process, the distribution $p$ we estimate should be included in the set of distributions defined as:

  C = \{ p \in \mathcal{P} \mid p(f_i) = \tilde{p}(f_i) \text{ for } i \in \{1, 2, \ldots, n\} \}    (6)

where $\mathcal{P}$ is the set of all possible distributions on $X \times Y$.
For the distribution $p$, there is no assumption except equation (6), so it is reasonable that the most uniform such distribution is the most suitable for the training corpora. The conditional entropy defined in equation (7) is used as the mathematical measure of the uniformity of a conditional probability $p(y|x)$:

  H(p) = - \sum_{x, y} \tilde{p}(x) p(y|x) \log p(y|x)    (7)

That is, the model $p_*$ which maximizes the entropy $H$ should be selected from $C$:

  p_* = \operatorname*{argmax}_{p \in C} H(p)    (8)

This heuristic is called the maximum entropy principle.
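The conditional entropy of equation (7), which the maximum entropy principle asks us to maximize subject to the constraints, can be evaluated as below; the dictionary-based representation of the model is an assumption for illustration only.

```python
import math

def conditional_entropy(p_tilde_x, p_y_given_x):
    """H(p) = -sum_{x,y} p~(x) p(y|x) log p(y|x)  -- equation (7)."""
    h = 0.0
    for x, cond in p_y_given_x.items():
        for y, p in cond.items():
            if p > 0.0:  # skip zero-probability terms (0 log 0 = 0)
                h -= p_tilde_x[x] * p * math.log(p)
    return h
```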
3.3 Parameter Estimation 
In simple cases, we can find the solution to equation (8) analytically. Unfortunately, there is no analytical solution in general, and we need a numerical algorithm to find it.

By applying Lagrange multipliers to equation (7), we can introduce the parametric form of $p$:

  p_\lambda(y|x) = \frac{1}{Z_\lambda(x)} \exp\left( \sum_i \lambda_i f_i(x, y) \right)    (9)

  Z_\lambda(x) = \sum_y \exp\left( \sum_i \lambda_i f_i(x, y) \right)

where each $\lambda_i$ is the parameter for the feature $f_i$. $p_\lambda$ is known as a Gibbs distribution. Then, solving for $p_* \in C$ in equation (8) is equivalent to finding the $\lambda_*$ that maximizes the log-likelihood:

  \Psi(\lambda) = - \sum_x \tilde{p}(x) \log Z_\lambda(x) + \sum_i \lambda_i \tilde{p}(f_i)    (10)

  \lambda_* = \operatorname*{argmax}_{\lambda} \Psi(\lambda)
Such a $\lambda_*$ can be found by a numerical algorithm called the Improved Iterative Scaling algorithm (Berger et al., 1996):

1. Start with $\lambda_i = 0$ for all $i \in \{1, 2, \ldots, n\}$

2. Do for each $i \in \{1, 2, \ldots, n\}$:

   (a) Let $\Delta\lambda_i$ be the solution to

     \sum_{x, y} \tilde{p}(x) p(y|x) f_i(x, y) \exp\left( \Delta\lambda_i f^{\#}(x, y) \right) = \tilde{p}(f_i)    (11)

   where $f^{\#}(x, y) = \sum_{i=1}^{n} f_i(x, y)$

   (b) Update the value of $\lambda_i$ according to: $\lambda_i \leftarrow \lambda_i + \Delta\lambda_i$

3. Go to step 2 if not all the $\lambda_i$ have converged

To solve for $\Delta\lambda_i$ in step (2a), Newton's method is applied to equation (11).
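The Gibbs form of equation (9) and a single scaling update can be sketched as follows. The closed-form update shown applies only to the special case where $f^{\#}(x, y)$ is a constant $M$ for every pair; in the general case, equation (11) must be solved numerically with Newton's method as described above. Function names and data layouts are our own assumptions.

```python
import math

def gibbs_prob(x, candidate_ys, features, lam):
    """p_lambda(y|x) of equation (9) over a candidate set of target words.
    features: list of callables f_i(x, y); lam: list of parameters lambda_i."""
    scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lam, features)))
              for y in candidate_ys}
    z = sum(scores.values())  # Z_lambda(x)
    return {y: s / z for y, s in scores.items()}

def scaling_update_constant_fsharp(p_tilde_f, p_f, m):
    """Solution of equation (11) in the special case f#(x, y) = M (constant):
    delta_lambda_i = (1/M) log(p~(f_i) / p(f_i))."""
    return (1.0 / m) * math.log(p_tilde_f / p_f)
```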
3.4 Feature Selection 
In general, there exists a large collection $\mathcal{F}$ of candidate features, and because of the limits of machine resources, we cannot expect to estimate $\tilde{p}(f)$ for all of them in practice. However, the maximum entropy principle does not explicitly state how to select the particular constraints. We build a subset $S \subseteq \mathcal{F}$ incrementally by repeatedly adjoining to $S$ the feature $f \in \mathcal{F}$ that maximizes the log-likelihood of the model. This algorithm is called Basic Feature Selection (Berger et al., 1996):

1. Start with $S = \emptyset$

2. Do for each candidate feature $f \in \mathcal{F}$: compute the model $p_{S \cup f}$ using the Improved Iterative Scaling algorithm and the gain in the log-likelihood from adding this feature

3. Check the termination condition

4. Select the feature $\hat{f}$ with maximal gain

5. Adjoin $\hat{f}$ to $S$

6. Compute $p_S$ using the Improved Iterative Scaling algorithm

7. Go to step 2
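A greedy sketch of the Basic Feature Selection loop is given below. The `train_model` and `log_likelihood` callables stand in for the Improved Iterative Scaling training and the evaluation of equation (10); their exact interfaces are assumptions made for illustration.

```python
def basic_feature_selection(candidates, n_select, train_model, log_likelihood):
    """Incrementally build S by adjoining the feature with maximal gain.
    candidates: list of feature functions; n_select: number of features to keep."""
    selected = []
    current_ll = log_likelihood(train_model(selected))
    for _ in range(n_select):
        gains = {}
        for i, f in enumerate(candidates):
            if f in selected:
                continue
            # gain in log-likelihood from adding this single feature
            gains[i] = log_likelihood(train_model(selected + [f])) - current_ll
        if not gains:
            break
        best = max(gains, key=gains.get)
        selected.append(candidates[best])
        current_ll += gains[best]
    return selected
```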
4 Maximum Entropy Model Learning of the Translation Rules

The art of modeling with the maximum entropy method is to define an informative set of computationally feasible feature functions. In this section, we define two models of feature functions for learning the translation rules.

Model 1: Co-occurrence Information

The first model is defined with co-occurrence information between words appearing in the corpus $X$:

  f_w(x, y) = \begin{cases} 1 & (x \in W(d, w)) \\ 0 & (\text{otherwise}) \end{cases}    (12)

where $W(d, w)$ is the set of words which appear within $d$ words of $w \in X$ (in our experiments, $d = 5$). $f_w(x, y)$ expresses the information that $w$ provides for predicting that $x$ is translated into $y$ (Figure 1).

Figure 1: co-occurrence information

Model 2: Morphological Information

The second model is defined with morphological information such as part-of-speech:

  f_{t,s}(x, y) = \begin{cases} 1 & (\mathrm{POS}(x) = t \text{ and } \mathrm{POS}(y) = s) \\ 0 & (\text{otherwise}) \end{cases}    (13)

where $\mathrm{POS}(x)$ is the part-of-speech tag for $x$. $f_{t,s}(x, y)$ expresses the information that the part-of-speech pair $t, s$ provides for predicting that $x$ is translated into $y$ (Figure 2). If the part-of-speech taggers for each language are extremely accurate, then these feature functions can be generated automatically.

Figure 2: morphological information
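Feature functions for the two models can be generated mechanically once the co-occurrence windows and part-of-speech tables are available. The following sketch shows one way to construct them; the helper `window_of` and the POS lookup dictionaries are hypothetical, introduced only to make the example self-contained.

```python
def make_cooccurrence_feature(w, window_of):
    """Model 1 feature f_w (equation (12)): fires when x lies within d words
    of w in the corpus. window_of(w) returns the set W(d, w)."""
    w_window = window_of(w)
    def f_w(x, y):
        return 1 if x in w_window else 0
    return f_w

def make_pos_feature(t, s, pos_en, pos_ja):
    """Model 2 feature f_{t,s} (equation (13)): fires when POS(x) = t and
    POS(y) = s. pos_en and pos_ja are word-to-tag lookup tables."""
    def f_ts(x, y):
        return 1 if pos_en.get(x) == t and pos_ja.get(y) == s else 0
    return f_ts
```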
5 Implementation 
The computational cost associated with the model described above is too high to realize a practical system for learning translation rules. We propose several methods to suppress this overhead.
The estimated probability $p_\lambda(y|x)$ for a pair $(x, y) \in X \times Y$ which has not been observed in the sample data from the parallel corpora $X, Y$ should be kept low. According to equation (9), we may let $f_i(x, y) = 0$ (for all $i \in \{1, \ldots, n\}$) for non-observed $(x, y)$. Therefore, we accept only observed pairs $(x, y)$, instead of all possible pairs, in the summation in equation (11), so that $p_\lambda(y|x)$ can be calculated much more efficiently.
Suppose that the set of pairs $(x, y)$ that activate a feature function $f$ is defined by:

  D(f) = \{ (x, y) \in X \times Y \mid f(x, y) = 1 \}    (14)

Shirai et al. (1996) showed that if $D(f_i)$ and $D(f_j)$ are mutually exclusive, that is $D(f_i) \cap D(f_j) = \emptyset$, then $\lambda_i$ and $\lambda_j$ can be estimated independently. Therefore, we can split the set of candidate feature functions $\mathcal{F}$ into several mutually exclusive subsets and calculate $p_\lambda(y|x)$ more efficiently by estimating the parameters of each subset independently.
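One way to obtain the mutually exclusive subsets is to group together any features whose active sets $D(f)$ share an observed pair, for example with a small union-find pass as sketched below; this grouping strategy is our own reading of the optimization, not code from the original system.

```python
def split_into_exclusive_groups(features, observed_pairs):
    """Partition features so that different groups have disjoint D(f)
    (equation (14)) over the observed pairs and can be trained independently."""
    parent = list(range(len(features)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # features firing on the same observed pair must end up in the same group
    for x, y in observed_pairs:
        active = [i for i, f in enumerate(features) if f(x, y) == 1]
        for i in active[1:]:
            union(active[0], i)

    groups = {}
    for i, f in enumerate(features):
        groups.setdefault(find(i), []).append(f)
    return list(groups.values())
```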
6 Experiments and Results 
As the training corpora, we used 6,057 pairs of sentences included in the Kodansya Japanese-English Dictionary, a machine-readable dictionary made by the Electrotechnical Laboratory. By applying morphological analysis to the corpora, each word was transformed into its infinitive form. We excluded words that appeared fewer than 3 times or more than 1,000 times from the targets of learning. Consequently, our targets for the experiments included 1,375 English words and 1,195 Japanese words, and we prepared 1,375 feature functions for model 1 and 2,744 for model 2 (56 part-of-speech tags for English and 49 for Japanese).
We tried to learn translation rules from English to Japanese. We ran two experiments: one with model 1 as the set of feature functions, and one with models 1 + 2. For each experiment, 500 feature functions were selected according to the feature selection algorithm described in section 3.4, and we calculated $p(y|x)$ from equation (9), that is, the probability that English word $x$ is translated into Japanese word $y$. For each English word, all Japanese words were ordered by the estimated probability $p(y|x)$, and we evaluated the recall rates by comparison with the dictionary.
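The recall evaluation can be read as a top-k check against the dictionary: an English word counts as recalled if one of its dictionary translations appears among its k highest-ranked Japanese words. A minimal sketch, assuming the gold dictionary maps each English word to a set of reference translations:

```python
def recall_at_k(p_y_given_x, gold_dict, k):
    """Fraction of English words whose dictionary translation appears in the
    top-k Japanese words ranked by the estimated p(y|x)."""
    hits = 0
    for x, gold in gold_dict.items():
        cond = p_y_given_x.get(x, {})
        ranked = sorted(cond, key=cond.get, reverse=True)[:k]
        if any(y in gold for y in ranked):
            hits += 1
    return hits / len(gold_dict)
```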
Table 1 shows the recall rates for each experiment. The numbers for $\tilde{p}(x, y)$ are the recall rates when the empirical probability defined by equation (1) was used instead of the estimated probability. The table shows that model 1 + 2 attains higher recall rates than model 1 and $\tilde{p}(x, y)$.

Table 1: recall rates

                      ~1st      ~3rd      ~10th
  \tilde{p}(x, y)     44.55%    53.47%    58.42%
  model 1             41.58%    63.37%    76.24%
  model 1 + 2         58.29%    69.54%    80.13%
Figure 3 shows the log-likelihood for each model plotted against the number of feature functions selected by the feature selection algorithm. Notice that the log-likelihood for model 1 + 2 is always higher than that for model 1. Thus, model 1 + 2 is more effective than model 1 for learning translation rules.
Figure 3: log-likelihood

However, the results show that the recall rates in the '~1st' column are not favorable for any of the models. We consider the reason for this to be that the method implicitly assumes word-to-word translation rules.
7 Conclusions 
We have described an approach to learning translation rules from parallel corpora based on the maximum entropy method. As feature functions, we defined two models, one using co-occurrence information and the other using morphological information. Since the computational cost associated with this method is prohibitive, we proposed several methods to suppress the overhead in order to realize the system. We carried out experiments for each model of features, and the results showed the effectiveness of this method, especially for the model combining co-occurrence and morphological features.
Acknowledgments 
We would like to thank the Electrotechnical Laboratory for giving us the machine-readable dictionary which was used as the training data.

References 
L. E. Baum. 1972. An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities, 3:1-8.

Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71.

Hiroyuki Kaji and Toshiko Aizono. 1996. Extracting word correspondences from bilingual corpora based on word co-occurrence information. In Proceedings of the 16th International Conference on Computational Linguistics, pages 23-28.

M. Kay and M. Röscheisen. 1993. Text-translation alignment. Computational Linguistics, 19(1):121-142.

J. D. Lafferty. 1993. A derivation of the inside-outside algorithm from the EM algorithm. IBM Research Report. IBM T.J. Watson Research Center.

Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. 1995. Inducing features of random fields. Technical Report CMU-CS-95-144, Carnegie Mellon University, May.

Adwait Ratnaparkhi. 1997. A linear observed time statistical parser based on maximum entropy models. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing.

Jeffrey C. Reynar and Adwait Ratnaparkhi. 1997. A maximum entropy approach to identifying sentence boundaries. In Proceedings of the 5th Applied Natural Language Processing Conference.

Ronald Rosenfeld. 1996. A maximum entropy approach to adaptive statistical language modeling. Computer Speech and Language, 10:187-228.

Kiyoaki Shirai, Kentaro Inui, Takenobu Tokunaga, and Hozumi Tanaka. 1996. A maximum entropy model for estimating lexical bigrams (in Japanese). In SIG Notes of the Information Processing Society of Japan, number 96-NL-116.

Takehito Utsuro, Takashi Miyata, and Yuji Matsumoto. 1997. Maximum entropy model learning of subcategorization preference. In Proceedings of the 5th Workshop on Very Large Corpora, pages 246-260, August.
