A Maximum Entropy Approach 
to Natural Language Processing 
Adam L. Berger†
Columbia University
Vincent J. Della Pietra‡
Renaissance Technologies
Stephen A. Della Pietra‡
Renaissance Technologies
The concept of maximum entropy can be traced back along multiple threads to Biblical times. Only 
recently, however, have computers become powerful enough to permit the widescale application 
of this concept to real world problems in statistical estimation and pattern recognition. In this 
paper, we describe a method for statistical modeling based on maximum entropy. We present 
a maximum-likelihood approach for automatically constructing maximum entropy models and 
describe how to implement this approach efficiently, using as examples several problems in natural 
language processing. 
1. Introduction 
Statistical modeling addresses the problem of constructing a stochastic model to predict 
the behavior of a random process. In constructing this model, we typically have at our 
disposal a sample of output from the process. Given this sample, which represents an 
incomplete state of knowledge about the process, the modeling problem is to parlay 
this knowledge into a representation of the process. We can then use this representation 
to make predictions about the future behavior of the process.
Baseball managers (who rank among the better paid statistical modelers) employ 
batting averages, compiled from a history of at-bats, to gauge the likelihood that a 
player will succeed in his next appearance at the plate. Thus informed, they manipulate
their lineups accordingly. Wall Street speculators (who rank among the best paid
statistical modelers) build models based on past stock price movements to predict
tomorrow's fluctuations and alter their portfolios to capitalize on the predicted future.
At the other end of the pay scale reside natural language researchers, who design 
language and acoustic models for use in speech recognition systems and related
applications.
The past few decades have witnessed significant progress toward increasing the 
predictive capacity of statistical models of natural language. In language modeling, for 
instance, Bahl et al. (1989) have used decision tree models and Della Pietra et al. (1994) 
have used automatically inferred link grammars to model long range correlations in 
language. In parsing, Black et al. (1992) have described how to extract grammatical 
* This research, supported in part by ARPA under grant ONR N00014-91-C-0135, was conducted while the authors were at the IBM T. J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598 
† Now at Computer Science Department, Columbia University.
‡ Now at Renaissance Technologies, Stony Brook, NY.
© 1996 Association for Computational Linguistics 
Computational Linguistics Volume 22, Number 1 
rules from annotated text automatically and incorporate these rules into statistical 
models of grammar. In speech recognition, Lucassen and Mercer (1984) have introduced
a technique for automatically discovering relevant features for the translation
of word spelling to word pronunciation. 
These efforts, while varied in specifics, all confront two essential tasks of statistical 
modeling. The first task is to determine a set of statistics that captures the behavior of 
a random process. Given a set of statistics, the second task is to corral these facts into
an accurate model of the process--a model capable of predicting the future output 
of the process. The first task is one of feature selection; the second is one of model 
selection. In the following pages we present a unified approach to these two tasks 
based on the maximum entropy philosophy. 
In Section 2 we give an overview of the maximum entropy philosophy and work 
through a motivating example. In Section 3 we describe the mathematical structure of 
maximum entropy models and give an efficient algorithm for estimating the parameters
of such models. In Section 4 we discuss feature selection, and present an automatic
method for discovering facts about a process from a sample of output from the process.
We then present a series of refinements to the method to make it practical to implement.
Finally, in Section 5 we describe the application of maximum entropy ideas to
several tasks in stochastic language processing: bilingual sense disambiguation, word 
reordering, and sentence segmentation. 
2. A Maximum Entropy Overview 
We introduce the concept of maximum entropy through a simple example. Suppose we 
wish to model an expert translator's decisions concerning the proper French rendering 
of the English word in. Our model p of the expert's decisions assigns to each French 
word or phrase f an estimate, p(f), of the probability that the expert would choose f as 
a translation of in. To guide us in developing p, we collect a large sample of instances 
of the expert's decisions. Our goal is to extract a set of facts about the decision-making 
process from the sample (the first task of modeling) that will aid us in constructing a 
model of this process (the second task). 
One obvious clue we might glean from the sample is the list of allowed translations.
For example, we might discover that the expert translator always chooses
among the following five French phrases: {dans, en, à, au cours de, pendant}. With this
information in hand, we can impose our first constraint on our model p:
$$p(\textit{dans}) + p(\textit{en}) + p(\textit{à}) + p(\textit{au cours de}) + p(\textit{pendant}) = 1$$
This equation represents our first statistic of the process; we can now proceed to 
search for a suitable model that obeys this equation. Of course, there are an infinite 
number of models p for which this identity holds. One model satisfying the above 
equation is p(dans) = 1; in other words, the model always predicts dans. Another 
model obeying this constraint predicts pendant with a probability of 1/2, and à with a
probability of 1/2. But both of these models offend our sensibilities: knowing only that
the expert always chose from among these five French phrases, how can we justify
either of these probability distributions? Each seems to be making rather bold assumptions,
with no empirical justification. Put another way, these two models assume more
than we actually know about the expert's decision-making process. All we know is 
that the expert chose exclusively from among these five French phrases; given this, 

these questions, how do we go about finding the most uniform model subject to a set 
of constraints like those we have described? 
The maximum entropy method answers both of these questions, as we will demonstrate
in the next few pages. Intuitively, the principle is simple: model all that is known
and assume nothing about that which is unknown. In other words, given a collection 
of facts, choose a model consistent with all the facts, but otherwise as uniform as 
possible. This is precisely the approach we took in selecting our model p at each step 
in the above example. 
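To make "as uniform as possible" concrete before the formal definition, the following minimal sketch compares three models that all satisfy the first constraint (the five probabilities sum to 1). It uses the ordinary Shannon entropy as the measure of uniformity, anticipating Section 3; the function and variable names are illustrative assumptions, not part of the original presentation.

```python
import math

PHRASES = ["dans", "en", "à", "au cours de", "pendant"]

def entropy(p):
    """Shannon entropy (in nats) of a distribution given as {outcome: probability}."""
    return sum(-q * math.log(q) for q in p.values() if q > 0.0)

# Three models that all satisfy the first constraint (probabilities sum to 1).
always_dans = {"dans": 1.0}                       # always predicts dans
half_half = {"pendant": 0.5, "à": 0.5}            # pendant or à, each with probability 1/2
uniform = {f: 1.0 / len(PHRASES) for f in PHRASES}

for name, model in [("always dans", always_dans),
                    ("pendant / à", half_half),
                    ("uniform", uniform)]:
    assert abs(sum(model.values()) - 1.0) < 1e-12  # the constraint holds
    print(f"{name:12s} H = {entropy(model):.4f}")
```

Of the three, the uniform model has the largest entropy, which matches the intuition that it assumes the least beyond the constraint itself.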
The maximum entropy concept has a long history. Adopting the least complex 
hypothesis possible is embodied in Occam's razor ("Nunquam ponenda est pluralitas
sine necessitate.") and even appears earlier, in the Bible and the writings of Herodotus
(Jaynes 1990). Laplace might justly be considered the father of maximum entropy, 
having enunciated the underlying theme 200 years ago in his "Principle of Insufficient 
Reason:" when one has no information to distinguish between the probability of two 
events, the best strategy is to consider them equally likely (Guiasu and Shenitzer 
1985). As E. T. Jaynes, a more recent pioneer of maximum entropy, put it (Jaynes 
1990): 
... the fact that a certain probability distribution maximizes entropy 
subject to certain constraints representing our incomplete information, 
is the fundamental property which justifies use of that distribution 
for inference; it agrees with everything that is known, but carefully 
avoids assuming anything that is not known. It is a transcription into 
mathematics of an ancient principle of wisdom ... 
3. Maximum Entropy Modeling 
We consider a random process that produces an output value y, a member of a finite set
$\mathcal{Y}$. For the translation example just considered, the process generates a translation of the
word in, and the output y can be any word in the set {dans, en, à, au cours de, pendant}.
In generating y, the process may be influenced by some contextual information x, a
member of a finite set $\mathcal{X}$. In the present example, this information could include the
words in the English sentence surrounding in.
Our task is to construct a stochastic model that accurately represents the behavior 
of the random process. Such a model is a method of estimating the conditional probability
that, given a context x, the process will output y. We will denote by $p(y|x)$ the
probability that the model assigns to y in context x. With a slight abuse of notation, we
will also use $p(y|x)$ to denote the entire conditional probability distribution provided
by the model, with the interpretation that y and x are placeholders rather than specific
instantiations. The proper interpretation should be clear from the context. We will denote
by $\mathcal{P}$ the set of all conditional probability distributions. Thus a model $p(y|x)$ is,
by definition, just an element of $\mathcal{P}$.
3.1 Training Data 
To study the process, we observe the behavior of the random process for some time, 
collecting a large number of samples $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$. In the example we
have been considering, each sample would consist of a phrase x containing the words 
surrounding in, together with the translation y of in that the process produced. For 
now, we can imagine that these training samples have been generated by a human 
expert who was presented with a number of random phrases containing in and asked 
to choose a good translation for each. When we discuss real-world applications in 

Combining (1), (2) and (3) yields the more explicit equation
$$\sum_{x,y} \tilde{p}(x)\, p(y|x)\, f(x,y) = \sum_{x,y} \tilde{p}(x,y)\, f(x,y)$$
We call the requirement (3) a constraint equation or simply a constraint. By restricting
attention to those models $p(y|x)$ for which (3) holds, we are eliminating from
consideration those models that do not agree with the training sample on how often
the output of the process should exhibit the feature f.
To sum up so far, we now have a means of representing statistical phenomena
inherent in a sample of data (namely, $\tilde{p}(f)$), and also a means of requiring that our
model of the process exhibit these phenomena (namely, $p(f) = \tilde{p}(f)$).
One final note about features and constraints bears repeating: although the words 
"feature" and "constraint" are often used interchangeably in discussions of maximum 
entropy, we will be vigilant in distinguishing the two and urge the reader to do 
likewise. A feature is a binary-valued function of (x,y); a constraint is an equation 
between the expected value of the feature function in the model and its expected 
value in the training data. 
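The distinction between a feature and a constraint can be sketched in a few lines. The example below assumes a hypothetical six-pair training sample in which the context x is just the English word following in; the feature `f`, the helper names, and the data are all illustrative assumptions. It computes both sides of the explicit constraint equation above for a uniform model.

```python
from collections import Counter

# Hypothetical training sample of (context x, translation y) pairs.
# Here x is simply the English word that follows "in".
sample = [("April", "en"), ("April", "en"), ("April", "pendant"),
          ("the", "dans"), ("the", "dans"), ("the", "à")]
outputs = ["dans", "en", "à", "au cours de", "pendant"]

def f(x, y):
    """A feature: a binary-valued function of (x, y)."""
    return 1 if (x == "April" and y == "en") else 0

N = len(sample)
p_tilde_xy = {xy: c / N for xy, c in Counter(sample).items()}             # empirical p~(x, y)
p_tilde_x = {x: c / N for x, c in Counter(x for x, _ in sample).items()}  # empirical p~(x)

def empirical_expectation(feature):
    """Right-hand side: sum over (x, y) of p~(x, y) f(x, y)."""
    return sum(p * feature(x, y) for (x, y), p in p_tilde_xy.items())

def model_expectation(feature, p):
    """Left-hand side: sum over (x, y) of p~(x) p(y|x) f(x, y)."""
    return sum(p_tilde_x[x] * p(y, x) * feature(x, y)
               for x in p_tilde_x for y in outputs)

def uniform(y, x):
    """A model p(y|x) that ignores the context entirely."""
    return 1.0 / len(outputs)

lhs = model_expectation(f, uniform)
rhs = empirical_expectation(f)
# The constraint is the equation lhs == rhs; the uniform model violates it here.
constraint_satisfied = abs(lhs - rhs) < 1e-9
```

The feature is the function `f`; the constraint is the requirement that `lhs` equal `rhs`. A model is admissible only if that equation holds.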
3.3 The Maximum Entropy Principle 
Suppose that we are given n feature functions $f_i$, which determine statistics we feel
are important in modeling the process. We would like our model to accord with these
statistics. That is, we would like p to lie in the subset $\mathcal{C}$ of $\mathcal{P}$ defined by
$$\mathcal{C} \equiv \{\, p \in \mathcal{P} \mid p(f_i) = \tilde{p}(f_i) \text{ for } i \in \{1, 2, \ldots, n\} \,\} \tag{4}$$
Figure 1 provides a geometric interpretation of this setup. Here $\mathcal{P}$ is the space of all
(unconditional) probability distributions on three points, sometimes called a simplex.
If we impose no constraints (depicted in (a)), then all probability models are allowable.
Imposing one linear constraint $\mathcal{C}_1$ restricts us to those $p \in \mathcal{P}$ that lie on the region
defined by $\mathcal{C}_1$, as shown in (b). A second linear constraint could determine p exactly, if
the two constraints are satisfiable; this is the case in (c), where the intersection of $\mathcal{C}_1$ and
$\mathcal{C}_2$ is non-empty. Alternatively, a second linear constraint could be inconsistent with the
first--for instance, the first might require that the probability of the first point is 1/3
and the second that the probability of the third point is 3/4--this is shown in (d). In the
present setting, however, the linear constraints are extracted from the training sample
and cannot, by construction, be inconsistent. Furthermore, the linear constraints in our
applications will not even come close to determining $p \in \mathcal{P}$ uniquely as they do in (c);
instead, the set $\mathcal{C} = \mathcal{C}_1 \cap \mathcal{C}_2 \cap \cdots \cap \mathcal{C}_n$ of allowable models will be infinite.
Among the models $p \in \mathcal{C}$, the maximum entropy philosophy dictates that we select
the most uniform distribution. But now we face a question left open in Section 2: what 
does "uniform" mean? 
A mathematical measure of the uniformity of a conditional distribution $p(y|x)$ is
provided by the conditional entropy¹
$$H(p) \equiv -\sum_{x,y} \tilde{p}(x)\, p(y|x) \log p(y|x) \tag{5}$$
1 A more common notation for the conditional entropy is $H(Y \mid X)$, where Y and X are random variables
with joint distribution $\tilde{p}(x)\, p(y|x)$. To emphasize the dependence of the entropy on the probability
distribution p, we have adopted the alternate notation $H(p)$.
Berger, Della Pietra, and Della Pietra A Maximum Entropy Approach 
Figure 1
Four different scenarios in constrained optimization. $\mathcal{P}$ represents the space of all probability
distributions. In (a), no constraints are applied, and all $p \in \mathcal{P}$ are allowable. In (b), the
constraint $\mathcal{C}_1$ narrows the set of allowable models to those that lie on the line defined by the
linear constraint. In (c), two consistent constraints $\mathcal{C}_1$ and $\mathcal{C}_2$ define a single model $p \in \mathcal{C}_1 \cap \mathcal{C}_2$.
In (d), the two constraints are inconsistent (i.e., $\mathcal{C}_1 \cap \mathcal{C}_3 = \emptyset$); no $p \in \mathcal{P}$ can satisfy them both.
The entropy is bounded from below by zero, the entropy of a model with no uncertainty
at all, and from above by $\log |\mathcal{Y}|$, the entropy of the uniform distribution over
all possible $|\mathcal{Y}|$ values of y. With this definition in hand, we are ready to present the
principle of maximum entropy.
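The definition (5) and its two bounds can be checked numerically. The sketch below assumes a hypothetical empirical context distribution over two contexts and three possible outputs; the data and names are illustrative only.

```python
import math

def conditional_entropy(p_tilde_x, p_y_given_x):
    """H(p) = -sum_{x,y} p~(x) p(y|x) log p(y|x), in nats, as in equation (5)."""
    h = 0.0
    for x, px in p_tilde_x.items():
        for q in p_y_given_x[x].values():
            if q > 0.0:                      # 0 log 0 is taken to be 0
                h -= px * q * math.log(q)
    return h

outputs = ["dans", "en", "à"]
p_tilde_x = {"x1": 0.25, "x2": 0.75}         # hypothetical empirical context distribution

# A model with no uncertainty, the uniform model, and something in between.
deterministic = {x: {"dans": 1.0, "en": 0.0, "à": 0.0} for x in p_tilde_x}
uniform = {x: {y: 1.0 / len(outputs) for y in outputs} for x in p_tilde_x}
mixed = {"x1": {"dans": 0.5, "en": 0.5, "à": 0.0},
         "x2": {"dans": 0.2, "en": 0.3, "à": 0.5}}

upper_bound = math.log(len(outputs))         # log |Y|
h_det = conditional_entropy(p_tilde_x, deterministic)
h_uni = conditional_entropy(p_tilde_x, uniform)
h_mix = conditional_entropy(p_tilde_x, mixed)
```

Under these assumptions `h_det` sits at the lower bound, `h_uni` at the upper bound, and `h_mix` strictly between the two.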
Maximum Entropy Principle
To select a model from a set $\mathcal{C}$ of allowed probability distributions,
choose the model $p_* \in \mathcal{C}$ with maximum entropy $H(p)$:
$$p_* = \arg\max_{p \in \mathcal{C}} H(p) \tag{6}$$
It can be shown that $p_*$ is always well-defined; that is, there is always a unique model
$p_*$ with maximum entropy in any constrained set $\mathcal{C}$.
3.4 Parametric Form 
The maximum entropy principle presents us with a problem in constrained optimization:
find the $p_* \in \mathcal{C}$ that maximizes $H(p)$. In simple cases, we can find the solution to
this problem analytically. This was true for the example presented in Section 2 when 
we imposed the first two constraints on p. Unfortunately, the solution to the general 
problem of maximum entropy cannot be written explicitly, and we need a more indirect
approach. (The reader is invited to try to calculate the solution for the same
example when the third constraint is imposed.) 
To address the general problem, we apply the method of Lagrange multipliers 
from the theory of constrained optimization. The relevant steps are outlined here; 
the reader is referred to Della Pietra et al. (1995) for a more thorough discussion of 
constrained optimization as applied to maximum entropy. 
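• We will refer to the original constrained optimization problem,
$$\text{find } p_* = \arg\max_{p \in \mathcal{C}} H(p)$$
as the primal problem.
• For each feature $f_i$ we introduce a parameter $\lambda_i$ (a Lagrange multiplier).
We define the Lagrangian $\Lambda(p, \lambda)$ by
$$\Lambda(p, \lambda) \equiv H(p) + \sum_i \lambda_i \left( p(f_i) - \tilde{p}(f_i) \right) \tag{7}$$
• Holding $\lambda$ fixed, we compute the unconstrained maximum of the
Lagrangian $\Lambda(p, \lambda)$ over all $p \in \mathcal{P}$. We denote by $p_\lambda$ the p where $\Lambda(p, \lambda)$
achieves its maximum and by $\Psi(\lambda)$ the value at this maximum:
$$p_\lambda \equiv \arg\max_{p \in \mathcal{P}} \Lambda(p, \lambda) \tag{8}$$
$$\Psi(\lambda) \equiv \Lambda(p_\lambda, \lambda) \tag{9}$$
We call $\Psi(\lambda)$ the dual function. The functions $p_\lambda$ and $\Psi(\lambda)$ may be
calculated explicitly using simple calculus. We find
$$p_\lambda(y|x) = \frac{1}{Z_\lambda(x)} \exp\Big( \sum_i \lambda_i f_i(x,y) \Big) \tag{10}$$
$$\Psi(\lambda) = -\sum_x \tilde{p}(x) \log Z_\lambda(x) + \sum_i \lambda_i\, \tilde{p}(f_i) \tag{11}$$
where $Z_\lambda(x)$ is a normalizing constant determined by the requirement
that $\sum_y p_\lambda(y|x) = 1$ for all x:
$$Z_\lambda(x) = \sum_y \exp\Big( \sum_i \lambda_i f_i(x,y) \Big) \tag{12}$$
• Finally, we pose the unconstrained dual optimization problem:
$$\text{Find } \lambda^* = \arg\max_{\lambda} \Psi(\lambda)$$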
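The parametric form (10), the normalizer (12) and the dual function (11) translate directly into code. The following is a minimal sketch under stated assumptions: the toy features, the empirical distribution `p_tilde_xy`, and all function names are hypothetical and chosen only for illustration.

```python
import math

def score(lam, features, x, y):
    """Unnormalized weight exp(sum_i lambda_i f_i(x, y))."""
    return math.exp(sum(l * f(x, y) for l, f in zip(lam, features)))

def normalizer(lam, features, x, outputs):
    """Z_lambda(x) of equation (12)."""
    return sum(score(lam, features, x, y) for y in outputs)

def p_lambda(lam, features, x, y, outputs):
    """Exponential model p_lambda(y|x) of equation (10)."""
    return score(lam, features, x, y) / normalizer(lam, features, x, outputs)

def dual(lam, features, p_tilde_xy, outputs):
    """Dual function Psi(lambda) of equation (11)."""
    p_tilde_x = {}
    for (x, _), p in p_tilde_xy.items():
        p_tilde_x[x] = p_tilde_x.get(x, 0.0) + p          # marginal p~(x)
    log_z = sum(px * math.log(normalizer(lam, features, x, outputs))
                for x, px in p_tilde_x.items())
    linear = sum(l * sum(p * f(x, y) for (x, y), p in p_tilde_xy.items())
                 for l, f in zip(lam, features))          # sum_i lambda_i p~(f_i)
    return -log_z + linear

# Hypothetical toy problem: two contexts, three outputs, two binary features.
outputs = ["dans", "en", "à"]
features = [lambda x, y: 1 if (x == "April" and y == "en") else 0,
            lambda x, y: 1 if y == "dans" else 0]
p_tilde_xy = {("April", "en"): 0.375, ("April", "dans"): 0.125,
              ("the", "en"): 0.125, ("the", "dans"): 0.25, ("the", "à"): 0.125}

lam = [1.0, -0.5]
row = [p_lambda(lam, features, "April", y, outputs) for y in outputs]
```

With all parameters set to zero the model is uniform and, in this toy setting, the dual function reduces to $-\log 3$, the log-likelihood of the uniform model.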
At first glance it is not clear what these machinations achieve. However, a fundamental
principle in the theory of Lagrange multipliers, called generically the Kuhn-Tucker
theorem, asserts that under suitable assumptions, the primal and dual problems
are, in fact, closely related. This is the case in the present situation. Although a detailed
account of this relationship is beyond the scope of this paper, it is easy to state
the final result: Suppose that $\lambda^*$ is the solution of the dual problem. Then $p_{\lambda^*}$ is the
solution of the primal problem; that is, $p_{\lambda^*} = p_*$. In other words,
The maximum entropy model subject to the constraints $\mathcal{C}$ has the parametric
form² $p_{\lambda^*}$ of (10), where the parameter values $\lambda^*$ can be determined
by maximizing the dual function $\Psi(\lambda)$.
The most important practical consequence of this result is that any algorithm for
finding the maximum $\lambda^*$ of $\Psi(\lambda)$ can be used to find the maximum $p_*$ of $H(p)$ for
$p \in \mathcal{C}$.
3.5 Relation to Maximum Likelihood 
The log-likelihood $L_{\tilde{p}}(p)$ of the empirical distribution $\tilde{p}$ as predicted by a model p is
defined by³
$$L_{\tilde{p}}(p) \equiv \log \prod_{x,y} p(y|x)^{\tilde{p}(x,y)} = \sum_{x,y} \tilde{p}(x,y) \log p(y|x) \tag{13}$$
It is easy to check that the dual function $\Psi(\lambda)$ of the previous section is, in fact, just
the log-likelihood for the exponential model $p_\lambda$; that is
$$\Psi(\lambda) = L_{\tilde{p}}(p_\lambda) \tag{14}$$
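The check behind (14) takes only a few lines. The derivation below is a sketch that substitutes the parametric form (10) into the definition (13); it assumes the empirical expectation is $\tilde{p}(f_i) = \sum_{x,y} \tilde{p}(x,y) f_i(x,y)$ and that $\tilde{p}(x) = \sum_y \tilde{p}(x,y)$.

```latex
\begin{align*}
L_{\tilde{p}}(p_\lambda)
  &= \sum_{x,y} \tilde{p}(x,y) \log p_\lambda(y|x) \\
  &= \sum_{x,y} \tilde{p}(x,y) \Big( \sum_i \lambda_i f_i(x,y) - \log Z_\lambda(x) \Big) \\
  &= \sum_i \lambda_i\, \tilde{p}(f_i) - \sum_x \tilde{p}(x) \log Z_\lambda(x) \\
  &= \Psi(\lambda)
\end{align*}
```

The last step is just the expression (11) for the dual function read from right to left.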
With this interpretation, the result of the previous section can be rephrased as:
The model $p_* \in \mathcal{C}$ with maximum entropy is the model in the parametric
family $p_\lambda(y|x)$ that maximizes the likelihood of the training
sample $\tilde{p}$.
This result provides an added justification for the maximum entropy principle: If
the notion of selecting a model $p_*$ on the basis of maximum entropy isn't compelling
enough, it so happens that this same $p_*$ is also the model that can best account for the
training sample, from among all models of the same parametric form (10).
Table 1 summarizes the primal-dual framework we have established. 
2 It might be that the dual function $\Psi(\lambda)$ does not achieve its maximum at any finite $\lambda^*$. In this case, the
maximum entropy model will not have the form $p_\lambda$ for any $\lambda$. However, it will be the limit of models
of this form, as indicated by the following result, whose proof we omit:
Suppose $\lambda_n$ is any sequence such that $\Psi(\lambda_n)$ converges to the maximum of $\Psi(\lambda)$. Then $p_{\lambda_n}$
converges to $p_*$.
3 We will henceforth abbreviate $L_{\tilde{p}}(p)$ by $L(p)$ when the empirical distribution $\tilde{p}$ is clear from context.
Table 1
The duality of maximum entropy and maximum likelihood is an example
of the more general phenomenon of duality in constrained optimization.

                  Primal                                        Dual
problem           $\arg\max_{p \in \mathcal{C}} H(p)$           $\arg\max_{\lambda} \Psi(\lambda)$
description       maximum entropy                               maximum likelihood
type of search    constrained optimization                      unconstrained optimization
search domain     $p \in \mathcal{C}$                           real-valued vectors $\{\lambda_1, \lambda_2, \ldots\}$
solution          $p_*$                                         $\lambda^*$

Kuhn-Tucker theorem: $p_* = p_{\lambda^*}$
3.6 Computing the Parameters 
For all but the most simple problems, the $\lambda^*$ that maximize $\Psi(\lambda)$ cannot be found
analytically. Instead, we must resort to numerical methods. From the perspective of
numerical optimization, the function $\Psi(\lambda)$ is well behaved, since it is smooth and
convex-$\cap$ (that is, concave) in $\lambda$. Consequently, a variety of numerical methods can be used to calculate
$\lambda^*$. One simple method is coordinate-wise ascent, in which $\lambda^*$ is computed by
iteratively maximizing $\Psi(\lambda)$ one coordinate at a time. When applied to the maximum
entropy problem, this technique yields the popular Brown algorithm (Brown 1959).
Other general purpose methods that can be used to maximize $\Psi(\lambda)$ include gradient
ascent and conjugate gradient.
An optimization method specifically tailored to the maximum entropy problem
is the iterative scaling algorithm of Darroch and Ratcliff (1972). We present here a
version of this algorithm specifically designed for the problem at hand; a proof of the
monotonicity and convergence of the algorithm is given in Della Pietra et al. (1995).
The algorithm is applicable whenever the feature functions $f_i(x, y)$ are nonnegative:
$$f_i(x,y) \ge 0 \text{ for all } i, x, \text{ and } y \tag{15}$$
This is, of course, true for the binary-valued feature functions we are considering here.
The algorithm generalizes the Darroch-Ratcliff procedure, which requires, in addition
to the nonnegativity, that the feature functions satisfy $\sum_i f_i(x, y) = 1$ for all x, y.
Algorithm 1: Improved Iterative Scaling
Input: Feature functions $f_1, f_2, \ldots, f_n$; empirical distribution $\tilde{p}(x, y)$
Output: Optimal parameter values $\lambda^*_i$; optimal model $p_{\lambda^*}$
1. Start with $\lambda_i = 0$ for all $i \in \{1, 2, \ldots, n\}$
2. Do for each $i \in \{1, 2, \ldots, n\}$:
a. Let $\Delta\lambda_i$ be the solution to
$$\sum_{x,y} \tilde{p}(x)\, p(y|x)\, f_i(x,y) \exp\big( \Delta\lambda_i f^{\#}(x,y) \big) = \tilde{p}(f_i) \tag{16}$$
where
$$f^{\#}(x,y) \equiv \sum_{i=1}^{n} f_i(x,y) \tag{17}$$
b. Update the value of $\lambda_i$ according to: $\lambda_i \leftarrow \lambda_i + \Delta\lambda_i$
3. Go to step 2 if not all the $\lambda_i$ have converged
The key step in the algorithm is step (2a), the computation of the increments $\Delta\lambda_i$
that solve (16). If $f^{\#}(x,y)$ is constant ($f^{\#}(x,y) = M$ for all x, y, say) then $\Delta\lambda_i$ is given
explicitly as
$$\Delta\lambda_i = \frac{1}{M} \log \frac{\tilde{p}(f_i)}{p_\lambda(f_i)}$$
If $f^{\#}(x, y)$ is not constant, then $\Delta\lambda_i$ must be computed numerically. A simple and
effective way of doing this is by Newton's method. This method computes the solution
$\alpha_*$ of an equation $g(\alpha_*) = 0$ iteratively by the recurrence
$$\alpha_{n+1} = \alpha_n - \frac{g(\alpha_n)}{g'(\alpha_n)} \tag{18}$$
with an appropriate choice for $\alpha_0$ and suitable attention paid to the domain of g.
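The steps of Algorithm 1 can be sketched in plain Python. This is a minimal illustration, not a tuned implementation: it assumes every feature has a positive empirical expectation, takes $p(y|x)$ in (16) to be the current model $p_\lambda$, starts each Newton iteration (18) from zero without safeguards, and updates the parameters one at a time within a sweep. The toy features and data are hypothetical.

```python
import math

def conditional(lam, features, x, outputs):
    """Return {y: p_lambda(y|x)} for the exponential model of equation (10)."""
    scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lam, features))) for y in outputs}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

def model_expectation(lam, features, f, p_tilde_x, outputs):
    """sum_{x,y} p~(x) p_lambda(y|x) f(x, y)."""
    total = 0.0
    for x, px in p_tilde_x.items():
        probs = conditional(lam, features, x, outputs)
        total += px * sum(probs[y] * f(x, y) for y in outputs)
    return total

def improved_iterative_scaling(features, p_tilde_xy, outputs, sweeps=2000, tol=1e-10):
    """Sketch of Algorithm 1; assumes each feature has positive empirical expectation."""
    p_tilde_x = {}
    for (x, _), p in p_tilde_xy.items():
        p_tilde_x[x] = p_tilde_x.get(x, 0.0) + p
    targets = [sum(p * f(x, y) for (x, y), p in p_tilde_xy.items()) for f in features]
    lam = [0.0] * len(features)                            # step 1
    for _sweep in range(sweeps):
        largest = 0.0
        for i, f in enumerate(features):                   # step 2
            # Weights p~(x) p(y|x) f_i(x,y), each paired with f#(x,y) of equation (17).
            terms = []
            for x, px in p_tilde_x.items():
                probs = conditional(lam, features, x, outputs)
                for y in outputs:
                    w = px * probs[y] * f(x, y)
                    if w > 0.0:
                        terms.append((w, sum(h(x, y) for h in features)))
            # Step 2a: Newton's method on g(d) = sum w * exp(d * f#) - p~(f_i), from (16).
            delta = 0.0
            for _newton in range(50):
                g = sum(w * math.exp(delta * m) for w, m in terms) - targets[i]
                dg = sum(w * m * math.exp(delta * m) for w, m in terms)
                step = g / dg
                delta -= step
                if abs(step) < 1e-14:
                    break
            lam[i] += delta                                # step 2b
            largest = max(largest, abs(delta))
        if largest < tol:                                  # step 3
            break
    return lam, p_tilde_x

# Hypothetical toy problem with overlapping features, so that f# is not constant.
outputs = ["dans", "en", "à"]
features = [lambda x, y: 1 if (x == "April" and y == "en") else 0,
            lambda x, y: 1 if y == "en" else 0]
p_tilde_xy = {("April", "en"): 0.375, ("April", "dans"): 0.125,
              ("the", "en"): 0.125, ("the", "dans"): 0.25, ("the", "à"): 0.125}
lam, p_tilde_x = improved_iterative_scaling(features, p_tilde_xy, outputs)
```

At convergence the model expectation of each feature should match its empirical expectation, which is exactly the constraint the maximum entropy model must satisfy.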
4. Feature Selection 
Earlier we divided the statistical modeling problem into two steps: finding appropriate 
facts about the data, and incorporating these facts into the model. Up to this point we 
have proceeded by assuming that the first task was somehow performed for us. Even 
in the simple example of Section 2, we did not explicitly state how we selected those 
particular constraints. That is, why is the fact that dans or à was chosen by the expert
translator 50% of the time any more important than countless other facts contained in
the data? In fact, the principle of maximum entropy does not directly concern itself
with the issue of feature selection; it merely provides a recipe for combining constraints
into a model. But the feature selection problem is critical, since the universe of possible 
constraints is typically in the thousands or even millions. In this section we introduce a 
method for automatically selecting the features to be included in a maximum entropy 
model, and then offer a series of refinements to ease the computational burden. 
4.1 Motivation 
We begin by specifying a large collection $\mathcal{F}$ of candidate features. We do not require
a priori that these features are actually relevant or useful. Instead, we let the pool be
as large as practically possible. Only a small subset of this collection of features will
eventually be employed in our final model.
If we had a training sample of infinite size, we could determine the "true" expected
value for a candidate feature $f \in \mathcal{F}$ simply by computing the fraction of events in the
sample for which $f(x, y) = 1$. In real-life applications, however, we are provided with
only a small sample of N events, which cannot be trusted to represent the process
fully and accurately. Specifically, we cannot expect that for every feature $f \in \mathcal{F}$, the
estimate of $\tilde{p}(f)$ we derive from this sample will be close to its value in the limit as
N grows large. Employing a larger (or even just a different) sample of data from the
same process might result in different estimates of $\tilde{p}(f)$ for many candidate features.
We would like to include in the model only a subset $\mathcal{S}$ of the full set of candidate
features $\mathcal{F}$. We will call $\mathcal{S}$ the set of active features. The choice of $\mathcal{S}$ must capture
as much information about the random process as possible, yet only include features
whose expected values can be reliably estimated.

By adding feature f to $\mathcal{S}$, we obtain a new set of active features $\mathcal{S} \cup f$. Following (19),
this set of features determines a set of models
$$\mathcal{C}(\mathcal{S} \cup f) \equiv \{\, p \in \mathcal{P} \mid p(f) = \tilde{p}(f) \text{ for all } f \in \mathcal{S} \cup f \,\} \tag{21}$$
The optimal model in this space of models is
$$p_{\mathcal{S} \cup f} \equiv \arg\max_{p \in \mathcal{C}(\mathcal{S} \cup f)} H(p) \tag{22}$$
Adding the feature f allows the model $p_{\mathcal{S} \cup f}$ to better account for the training sample;
this results in a gain $\Delta L(\mathcal{S}, f)$ in the log-likelihood of the training data
$$\Delta L(\mathcal{S}, f) = L(p_{\mathcal{S} \cup f}) - L(p_{\mathcal{S}}) \tag{23}$$
At each stage of the model-construction process, our goal is to select the candidate
feature $f \in \mathcal{F}$ which maximizes the gain $\Delta L(\mathcal{S}, f)$; that is, we select the candidate
feature which, when adjoined to the set of active features $\mathcal{S}$, produces the greatest
increase in likelihood of the training sample. This strategy is implemented in the
algorithm below.
Algorithm 2: Basic Feature Selection
Input: Collection $\mathcal{F}$ of candidate features; empirical distribution $\tilde{p}(x, y)$
Output: Set $\mathcal{S}$ of active features; model $p_{\mathcal{S}}$ incorporating these features
1. Start with $\mathcal{S} = \emptyset$; thus $p_{\mathcal{S}}$ is uniform
2. Do for each candidate feature $f \in \mathcal{F}$:
Compute the model $p_{\mathcal{S} \cup f}$ using Algorithm 1
Compute the gain in the log-likelihood from adding this feature
using (23)
3. Check the termination condition
4. Select the feature f with maximal gain $\Delta L(\mathcal{S}, f)$
5. Adjoin f to $\mathcal{S}$
6. Compute $p_{\mathcal{S}}$ using Algorithm 1
7. Go to step 2
One issue left unaddressed by this algorithm is the termination condition. Obviously,
we would like a condition which applies exactly when all the "useful" features
have been selected. One reasonable stopping criterion is to subject each proposed feature
to cross-validation on a sample of data withheld from the initial data set. If the
feature does not lead to an increase in likelihood of the withheld sample of data, 
the feature is discarded. We will have more to say about the stopping criterion in 
Section 5.3. 
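The greedy loop of Algorithm 2 can be sketched as follows. Two substitutions are made for brevity and should be read as assumptions rather than as the method described above: the parameters are fitted by plain gradient ascent on the log-likelihood instead of Algorithm 1, and the termination condition is a hypothetical minimum-gain threshold `min_gain` instead of cross-validation on withheld data. The candidate features and training pairs are invented for illustration.

```python
import math

def conditional(lam, features, x, outputs):
    """Return {y: p_lambda(y|x)} for the exponential model with the given features."""
    scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lam, features))) for y in outputs}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

def log_likelihood(lam, features, data, outputs):
    """Average log-likelihood L(p) of the training pairs under p_lambda."""
    return sum(math.log(conditional(lam, features, x, outputs)[y]) for x, y in data) / len(data)

def fit(features, data, outputs, steps=500, rate=0.5):
    """Stand-in for Algorithm 1: plain gradient ascent on the log-likelihood."""
    lam = [0.0] * len(features)
    for _ in range(steps):
        grad = [0.0] * len(features)
        for x, y in data:
            probs = conditional(lam, features, x, outputs)
            for i, f in enumerate(features):
                expected = sum(probs[y2] * f(x, y2) for y2 in outputs)
                grad[i] += (f(x, y) - expected) / len(data)   # p~(f_i) - p_lambda(f_i)
        lam = [l + rate * g for l, g in zip(lam, grad)]
    return lam

def select_features(candidates, data, outputs, min_gain=1e-3):
    """Greedy loop of Algorithm 2 with a hypothetical minimum-gain stopping rule."""
    active, lam = [], []                                      # step 1: S is empty
    remaining = list(candidates)
    current = log_likelihood(lam, active, data, outputs)      # uniform model
    while remaining:
        gains = []
        for f in remaining:                                   # step 2: gain of each candidate
            trial = active + [f]
            gains.append(log_likelihood(fit(trial, data, outputs), trial, data, outputs) - current)
        best = max(range(len(remaining)), key=lambda k: gains[k])
        if gains[best] < min_gain:                            # step 3: termination
            break
        active.append(remaining.pop(best))                    # steps 4 and 5
        lam = fit(active, data, outputs)                      # step 6
        current = log_likelihood(lam, active, data, outputs)
    return active, lam

outputs = ["dans", "en", "à"]
data = ([("April", "en")] * 3 + [("April", "dans")] + [("the", "en")]
        + [("the", "dans")] * 2 + [("the", "à")])

def en_before_april(x, y):
    return 1 if (x == "April" and y == "en") else 0

def is_dans(x, y):
    return 1 if y == "dans" else 0

def context_only(x, y):
    return 1 if x == "April" else 0     # depends on x alone, so it cannot change p(y|x)

active, lam = select_features([context_only, is_dans, en_before_april], data, outputs)
```

Note that every candidate is refitted from scratch on every pass, which is the cost that motivates the approximation discussed next.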
4.3 Approximate Gains 
Algorithm 2 is not a practical method for incremental feature selection. For each candidate
feature $f \in \mathcal{F}$ considered in step 2, we must compute the maximum entropy
model $p_{\mathcal{S} \cup f}$, a task that is computationally costly even with the efficient iterative scaling
algorithm introduced earlier. We therefore introduce a modification to the algorithm,
making it greedy but much more feasible. We replace the computation of the gain
$\Delta L(\mathcal{S}, f)$ of a feature f with an approximation, which we will denote by $\sim\!\Delta L(\mathcal{S}, f)$.
Recall that a model $p_{\mathcal{S}}$ has a set of parameters $\lambda$, one for each feature in $\mathcal{S}$. The
model $p_{\mathcal{S} \cup f}$ contains this set of parameters, plus a single new parameter $\alpha$, corresponding
to f.⁴ Given this structure, we might hope that the optimal values for $\lambda$ do
not change as the feature f is adjoined to $\mathcal{S}$. Were this the case, imposing an additional
constraint would require only optimizing the single parameter $\alpha$ to maximize
the likelihood. Unfortunately, when a new constraint is imposed, the optimal values
of all parameters change.
However, to make the feature-ranking computation tractable, we make the approximation
that the addition of a feature f affects only $\alpha$, leaving the $\lambda$-values associated
with other features unchanged. That is, when determining the gain of f over the model
$p_{\mathcal{S}}$, we pretend that the best model containing features $\mathcal{S} \cup f$ has the form
$$p^{\alpha}_{\mathcal{S},f} = \frac{1}{Z_\alpha(x)}\, p_{\mathcal{S}}(y|x)\, e^{\alpha f(x,y)}, \text{ for some real valued } \alpha \tag{24}$$
$$\text{where } Z_\alpha(x) = \sum_y p_{\mathcal{S}}(y|x)\, e^{\alpha f(x,y)} \tag{25}$$
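The only parameter distinguishing models of the form (24) is $\alpha$. Among these models,
we are interested in the one that maximizes the approximate gain
$$G_{\mathcal{S},f}(\alpha) \equiv L(p^{\alpha}_{\mathcal{S},f}) - L(p_{\mathcal{S}}) = -\sum_x \tilde{p}(x) \log Z_\alpha(x) + \alpha\, \tilde{p}(f) \tag{26}$$
We will denote the gain of this model by
$$\sim\!\Delta L(\mathcal{S}, f) \equiv \max_{\alpha} G_{\mathcal{S},f}(\alpha) \tag{27}$$
and the optimal model by
$$\sim\!p_{\mathcal{S} \cup f} \equiv \arg\max_{p^{\alpha}_{\mathcal{S},f}} G_{\mathcal{S},f}(\alpha) \tag{28}$$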
Despite the rather unwieldy notation, the idea is simple. Computing the approximate
gain in likelihood from adding feature f to $p_{\mathcal{S}}$ has been reduced to a simple
one-dimensional optimization problem over the single parameter $\alpha$, which can be solved
by any popular line-search technique, such as Newton's method. This yields a great
savings in computational complexity over computing the exact gain, an n-dimensional
4 Another way to think of this is that the models $p_{\mathcal{S} \cup f}$ and $p_{\mathcal{S}}$ have the same number of parameters, but
$\alpha = 0$ for $p_{\mathcal{S}}$.
Figure 3
[Two panels, (a) and (b), each plotting L(p).]
The likelihood L(p) is a convex function of its parameters. If we start from a one-constraint
model whose optimal parameter value is $\lambda = \lambda_0$ and consider the increase in $L_{\tilde{p}}(p)$ from
adjoining a second constraint with the parameter $\alpha$, the exact answer requires a search over
$(\lambda, \alpha)$. We can simplify this task by holding $\lambda = \lambda_0$ constant and performing a line search over
the possible values of the new parameter $\alpha$. In (a), the darkened line represents the search
space we restrict attention to. In (b), we show the reduced problem: a line search over $\alpha$.
optimization problem requiring more sophisticated methods such as conjugate gradient.
But the savings comes at a price: for any particular feature f, we are probably
underestimating its gain, and there is a reasonable chance that we will select a feature
f whose approximate gain $\sim\!\Delta L(\mathcal{S}, f)$ was highest and pass over the feature f with
maximal gain $\Delta L(\mathcal{S}, f)$.
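The one-dimensional search behind the approximate gain can be sketched as below. The code maximizes $G_{\mathcal{S},f}(\alpha)$ of (26) by applying Newton's method to $G'(\alpha) = 0$, starting from $\alpha = 0$ and with no safeguards against overshooting, which is an assumption that holds for this toy case but may not in general. The base model `p_s`, the feature, and the data are hypothetical.

```python
import math

def approximate_gain(p_s, f, p_tilde_xy, outputs, iterations=50):
    """Return (max over alpha of G(alpha), maximizing alpha) for G of equation (26)."""
    p_tilde_x = {}
    for (x, _), p in p_tilde_xy.items():
        p_tilde_x[x] = p_tilde_x.get(x, 0.0) + p
    target = sum(p * f(x, y) for (x, y), p in p_tilde_xy.items())      # p~(f)

    def gain_and_derivatives(alpha):
        g, d1, d2 = alpha * target, target, 0.0
        for x, px in p_tilde_x.items():
            weights = {y: p_s(x, y) * math.exp(alpha * f(x, y)) for y in outputs}
            z = sum(weights.values())                                   # Z_alpha(x), equation (25)
            mean = sum(w * f(x, y) for y, w in weights.items()) / z
            mean_sq = sum(w * f(x, y) ** 2 for y, w in weights.items()) / z
            g -= px * math.log(z)                 # G(alpha)
            d1 -= px * mean                       # G'(alpha)  = p~(f) - expected f under (24)
            d2 -= px * (mean_sq - mean ** 2)      # G''(alpha) = -expected conditional variance
        return g, d1, d2

    alpha = 0.0
    for _step in range(iterations):
        _gain, d1, d2 = gain_and_derivatives(alpha)
        if abs(d1) < 1e-12 or d2 == 0.0:
            break
        alpha -= d1 / d2                          # Newton step on G'(alpha) = 0
    return gain_and_derivatives(alpha)[0], alpha

# Hypothetical setting: S is empty, so p_S is uniform over three outputs.
outputs = ["dans", "en", "à"]
p_tilde_xy = {("April", "en"): 0.375, ("April", "dans"): 0.125,
              ("the", "en"): 0.125, ("the", "dans"): 0.25, ("the", "à"): 0.125}

def p_s(x, y):
    return 1.0 / len(outputs)

def f(x, y):
    return 1 if (x == "April" and y == "en") else 0

gain, alpha = approximate_gain(p_s, f, p_tilde_xy, outputs)
```

Only the single parameter $\alpha$ is searched; the base model `p_s` is evaluated but never refitted, which is where the savings come from.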
A graphical representation of this approximation is provided in Figure 3. Here the
log-likelihood is represented as an arbitrary convex function over two parameters: $\lambda$
corresponds to the "old" parameter, and $\alpha$ to the "new" parameter. Holding $\lambda$ fixed
and adjusting $\alpha$ to maximize the log-likelihood involves a search over the darkened
line, rather than a search over the entire space of $(\lambda, \alpha)$.
The actual algorithms, along with the appropriate mathematical framework, are 
presented in the appendix. 
5. Case Studies 
In the next few pages we discuss several applications of maximum entropy modeling 
within Candide, a fully automatic French-to-English machine translation system under 
development at IBM. Over the past few years, we have used Candide as a test bed for 
exploring the efficacy of various techniques in modeling problems arising in machine 
translation. 
We begin in Section 5.1 with a review of the general theory of statistical translation, 
describing in some detail the models employed in Candide. In Section 5.2 we describe 
how we have applied maximum entropy modeling to predict the French translation of 
an English word in context. In Section 5.3 we describe maximum entropy models that 
predict differences between French word order and English word order. In Section 5.4 
we describe a maximum entropy model that predicts how to divide a French sentence 
into short segments that can be translated sequentially. 
The_1 dog_2 ate_3 my_4 homework_5
Le_1 chien_2 a_3 mangé_4 mes_5 devoirs_6
Figure 4
Alignment of a French-English sentence pair. The subscripts give the position of each word in
its sentence. Here $a_1 = 1$, $a_2 = 2$, $a_3 = a_4 = 3$, $a_5 = 4$, and $a_6 = 5$.
5.1 Review of Statistical Translation 
When presented with a French sentence F, Candide's task is to find the English sentence
$\hat{E}$ which is most likely given F:
$$\hat{E} = \arg\max_{E} p(E|F) \tag{29}$$
By Bayes' theorem, this is equivalent to finding $\hat{E}$ such that
$$\hat{E} = \arg\max_{E} p(F|E)\, p(E) \tag{30}$$
Candide estimates p(E)--the probability that a string E of English words is a well- 
formed English sentence--using a parametric model of the English language, com- 
monly referred to as a language model. The system estimates p(FIE)--the probability 
that a French sentence F is a translation of E--using a parametric model of the process 
of English-to-French translation known as a translation model. These two models, 
plus a search strategy for finding the $\hat{E}$ that maximizes (30) for some F, comprise the
engine of the translation system.
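The decomposition in (30) can be sketched in a few lines. This is only an illustration under stated assumptions: the candidate list, the function name `decode`, and all the scores are made up for the example, and Candide's actual search strategy is not described here.

```python
import math

def decode(french, candidates, translation_logprob, language_logprob):
    """Return the candidate E that maximizes log p(F|E) + log p(E), as in (30).

    `translation_logprob(french, e)` and `language_logprob(e)` are
    hypothetical callables standing in for the two models.
    """
    return max(
        candidates,
        key=lambda e: translation_logprob(french, e) + language_logprob(e),
    )

# Toy, made-up scores (not estimates from any corpus).
toy_translation = {"it appears": math.log(0.2), "he appears": math.log(0.3)}
toy_language = {"it appears": math.log(0.05), "he appears": math.log(0.01)}

best = decode(
    "il semble",
    list(toy_translation),
    lambda f, e: toy_translation[e],
    lambda e: toy_language[e],
)
```

In this toy setting the language model outweighs the slightly higher translation score of the second candidate, so the product favors the first.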
We now briefly describe the translation model for the probability $p(F \mid E)$; a more
thorough account is provided in Brown et al. (1991). We imagine that an English sentence
E generates a French sentence F in two steps. First, each word in E independently
generates zero or more French words. These words are then ordered to give a French
sentence F. We denote the ith word of E by $e_i$ and the jth word of F by $y_j$. (We employ
$y_j$ rather than the more intuitive $f_j$ to avoid confusion with the feature function
notation.) We denote the number of words in the sentence E by $|E|$ and the number
of words in the sentence F by $|F|$. The generative process yields not only the French
sentence F but also an association of the words of F with the words of E. We call this
association an alignment, and denote it by A. An alignment A is parametrized by a
sequence of $|F|$ numbers $a_j$, with $1 \le a_j \le |E|$. For every word position j in F, $a_j$ is the
word position in E of the English word that generates $y_j$. Figure 4 depicts a typical
alignment.
The probability $p(F \mid E)$ that F is the translation of E is expressed as the sum over
all possible alignments A between E and F of the probability of F and A given E:
$$p(F \mid E) = \sum_{A} p(F, A \mid E) \tag{31}$$
The sum in equation (31) is computationally unwieldy; it involves a sum over all $|E|^{|F|}$
possible alignments between the words in the two sentences. We sometimes make the
simplifying assumption that there exists one extremely probable alignment $\hat{A}$, called
the "Viterbi alignment," for which
$$p(F \mid E) \approx p(F, \hat{A} \mid E) \tag{32}$$
Given some alignment A (Viterbi or otherwise) between E and F, the probability
$p(F, A \mid E)$ is given by
$$p(F, A \mid E) = \prod_{i=1}^{|E|} p(n(e_i) \mid e_i) \cdot \prod_{j=1}^{|F|} p(y_j \mid e_{a_j}) \cdot d(A \mid E, F) \tag{33}$$
where $n(e_i)$ denotes the number of French words aligned with $e_i$. In this expression
• $p(n \mid e)$ is the probability that the English word e generates n French
words;
• $p(y \mid e)$ is the probability that the English word e generates the French
word y; and
• $d(A \mid E, F)$ is the probability of the particular order of French words.
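As an illustration of how the three factors in (33) combine, the following sketch evaluates $p(F, A \mid E)$ for one fixed alignment. The function name `joint_prob`, the table layout, and every probability value are assumptions made for the example; in particular the distortion term is passed in as a single number rather than modeled.

```python
from collections import Counter

def joint_prob(english, french, alignment, fertility, translation, distortion):
    """p(F, A | E) in the form of equation (33).

    alignment[j] is the 1-based position in `english` of the word that
    generates french[j]. fertility[(n, e)] and translation[(y, e)] are
    probability tables; `distortion` stands in for d(A | E, F).
    """
    counts = Counter(alignment)  # n(e_i): number of French words aligned with e_i
    prob = distortion
    for i, e in enumerate(english, start=1):
        prob *= fertility[(counts[i], e)]
    for y, a_j in zip(french, alignment):
        prob *= translation[(y, english[a_j - 1])]
    return prob

# Toy tables with made-up values.
english = ["my", "homework"]
french = ["mes", "devoirs"]
alignment = [1, 2]
fertility = {(1, "my"): 0.9, (1, "homework"): 0.8}
translation = {("mes", "my"): 0.5, ("devoirs", "homework"): 0.4}

p = joint_prob(english, french, alignment, fertility, translation, distortion=0.5)

# Size of the sum in (31) for the sentence pair of Figure 4: |E|^|F| alignments.
n_alignments_figure4 = 5 ** 6
```

Even for the six-word French sentence of Figure 4 the sum in (31) ranges over 15,625 alignments, which is why the Viterbi approximation (32) is attractive.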
We call the model described by equations (31) and (33) the basic translation model.
We take the probabilities $p(n \mid e)$ and $p(y \mid e)$ as the fundamental parameters of the
model, and parametrize the distortion probability in terms of simpler distributions.
Brown et al. (1991) describe a method of estimating these parameters to maximize the
likelihood of a large bilingual corpus of English and French sentences. Their method is
based on the Expectation-Maximization (EM) algorithm, a well-known iterative technique
for maximum likelihood training of a model involving hidden statistics. For the
basic translation model, the hidden information is the alignment A between E and F.
We employed the EM algorithm to estimate the parameters of the basic trans- 
lation model so as to maximize the likelihood of a bilingual corpus obtained from 
the proceedings of the Canadian Parliament. For historical reasons, these proceedings 
are sometimes called "Hansards." Our Hansard corpus contains 3.6 million English- 
French sentence pairs, for a total of a little under 100 million words in each language. 
Table 2 shows our parameter estimates for the translation probabilities $p(y \mid \textit{in})$. The
basic translation model has worked admirably: given only the bilingual corpus, with no
additional knowledge of the languages or any relation between them, it has uncovered 
some highly plausible translations. 
Nevertheless, the basic translation model has one major shortcoming: it does not 
take the English context into account. That is, the model does not account for surround- 
ing English words when predicting the appropriate French rendering of an English 
word. As we pointed out in Section 3, this is not how successful translation works. 
The best French translation of in is a function of the surrounding English words: if a 
month's time are the subsequent words, pendant might be more likely, but if the fiscal year
1992 are what follows, then dans is more likely. The basic model is blind to context, 
always assigning a probability of 0.3004 to dans and 0.0044 to pendant. 
This can yield errors when Candide is called upon to translate a French sentence. 
Examples of two such errors are shown in Figure 5. In the first example, the system has 
chosen an English sentence in which the French word supérieures has been rendered as
superior when greater or higher is a preferable translation. With no knowledge of context, 
an expert translator is also quite likely to select superior as the English word generating 
Table 2
Most frequent French translations of in as estimated using EM-training.
(OTHER) represents a catch-all classifier for any French phrase not listed,
none of which had a probability exceeding 0.0043.

Translation     Probability
dans            0.3004
à               0.2275
de              0.1428
en              0.1361
pour            0.0349
(OTHER)         0.0290
au cours de     0.0233
                0.0154
sur             0.0123
par             0.0101
pendant         0.0044
Je dirais même que les chances sont supérieures à 50%.
I would even say that the odds are superior to 50%.
Il semble que Bank of Boston ait pratiquement achevé son réexamen de Shawmut.
He appears that Bank of Boston has almost completed its review of Shawmut.
Figure 5
Typical errors encountered in using the EM-based model of Brown et al. in a French-to-English
translation system.
supérieures. But an expert privy to the fact that 50% was among the next few words
might be more inclined to select greater or higher. Similarly, in the second example, the
incorrect rendering of Il as He might have been avoided had the translation model
used the fact that the word following it is appears.
5.2 Context-Dependent Word Models 
In the hope of rectifying these errors, we consider the problem of context-sensitive 
modeling of word translation. We envision, in practice, a separate maximum entropy
model, $p_e(y \mid x)$, for each English word e, where $p_e(y \mid x)$ represents the probability that an
expert translator would choose y as the French rendering of e, given the surrounding
English context x. This is just a slightly recast version of a longstanding problem 
in computational linguistics, namely, sense disambiguation--the determination of a 
word's sense from its context. 
We begin with a training sample of English-French sentence pairs (E, F) randomly 
extracted from the Hansard corpus, such that E contains the English word in. For each 
sentence pair, we use the basic translation model to compute the Viterbi alignment 
between E and F. Using this alignment, we then construct an (x, y) training event. The 
event consists of a context x containing the six words in E surrounding in and a future 
Table 3
Several actual training events for the maximum entropy translation model for
in, extracted from the transcribed proceedings of the Canadian Parliament.

Translation    e-3     e-2          e-1          e+1       e+2        e+3
dans           the     committee    stated       a         letter     to
à              work    was          required     respect   of         the
au cours de                                      the       fiscal     year
dans           by      the          government   the       same       postal
               of      diphtheria   reported     Canada    ,          by
de             not     given        notice       the       ordinary   way
y equal to the French word which is (according to the Viterbi alignment $\hat{A}$) aligned
with in. A few actual examples of such events for in are depicted in Table 3. 
Next we define the set of candidate features. For this application, we employ 
features that are indicator functions of simply described sets. Specifically, we consider 
functions f(x,y) that are one if y is some particular French word and the context 
x contains a given English word, and are zero otherwise. We employ the following 
notation to represent these features: 
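$$f_1(x, y) = \begin{cases} 1 & \text{if } y = \textit{en} \text{ and } \textit{April} \in \{e_{+1}\} \\ 0 & \text{otherwise} \end{cases}$$
$$f_2(x, y) = \begin{cases} 1 & \text{if } y = \textit{pendant} \text{ and } \textit{weeks} \in \{e_{+1}, e_{+2}, e_{+3}\} \\ 0 & \text{otherwise} \end{cases}$$
Here $f_1 = 1$ when April follows in and en is the translation of in; $f_2 = 1$ when weeks is
one of the three words following in and pendant is the translation.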
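The two indicator features can be written as small functions. This is a minimal sketch: representing the context x as a dictionary keyed by window position ("e-3" through "e+3") is an assumption of the example, as are the helper names and the sample contexts.

```python
def make_next_word_feature(french_word, english_word):
    """Feature that is 1 when y == french_word and english_word is at e+1."""
    def f(x, y):
        return int(y == french_word and x["e+1"] == english_word)
    return f

def make_right_window_feature(french_word, english_word):
    """Feature that is 1 when y == french_word and english_word is one of
    the three words following the word being translated."""
    def f(x, y):
        window = (x["e+1"], x["e+2"], x["e+3"])
        return int(y == french_word and english_word in window)
    return f

f1 = make_next_word_feature("en", "April")
f2 = make_right_window_feature("pendant", "weeks")

# Made-up contexts around the word "in".
x_april = {"e-3": "will", "e-2": "be", "e-1": "held",
           "e+1": "April", "e+2": "of", "e+3": "next"}
x_weeks = {"e-3": "will", "e-2": "be", "e-1": "done",
           "e+1": "a", "e+2": "few", "e+3": "weeks"}
```

Each feature depends on both the context and the proposed translation, so `f1(x_april, "dans")` is 0 even though April follows in.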
The set of features under consideration is vast, but may be expressed in abbreviated
form in Table 4. In the table, the symbol $\diamond$ is a placeholder for a possible French
word and the symbol $\square$ is a placeholder for a possible English word. The feature $f_1$
mentioned above is thus derived from template 2 with $\diamond$ = en and $\square$ = April; the
feature $f_2$ is derived from template 5 with $\diamond$ = pendant and $\square$ = weeks. If there are $|\mathcal{V}_E|$
total English words and $|\mathcal{V}_F|$ total French words, there are $|\mathcal{V}_F|$ template-1 features,
and $|\mathcal{V}_E| \cdot |\mathcal{V}_F|$ features of each of templates 2, 3, 4, and 5.
Template 1 features give rise to constraints that enforce equality between the prob- 
ability of any French translation y of in according to the model and the probability of 
that translation in the empirical distribution. Examples of such constraints are 
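$$p(y = \textit{dans}) = \tilde{p}(y = \textit{dans})$$
$$p(y = \textit{de}) = \tilde{p}(y = \textit{de})$$
$$p(y = \textit{en}) = \tilde{p}(y = \textit{en})$$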
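Table 4
Feature templates for word-translation modeling. $|\mathcal{V}_E|$ is the size of the
English vocabulary; $|\mathcal{V}_F|$ the size of the French vocabulary.

Template   Number of actual features                 f(x,y) = 1 if and only if ...
1          $|\mathcal{V}_F|$                           $y = \diamond$
2          $|\mathcal{V}_F| \cdot |\mathcal{V}_E|$     $y = \diamond$ and $\square \in \{e_{+1}\}$
3          $|\mathcal{V}_F| \cdot |\mathcal{V}_E|$     $y = \diamond$ and $\square \in$ [...]
4          $|\mathcal{V}_F| \cdot |\mathcal{V}_E|$     $y = \diamond$ and $\square \in$ [...]
5          $|\mathcal{V}_F| \cdot |\mathcal{V}_E|$     $y = \diamond$ and $\square \in \{e_{+1}, e_{+2}, e_{+3}\}$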
A maximum entropy model that uses only template 1 features predicts each French 
translation y with the probability $\tilde{p}(y)$ determined by the empirical data. This is exactly
the distribution employed by the basic translation model. 
Since template 1 features are independent of x, the maximum entropy model that 
employs only constraints derived from template 1 features takes no account of contex- 
tual information in assigning a probability to y. When we include constraints derived 
from template 2 features, we take our first step towards a context-dependent model. 
Rather than simply constraining the expected probability of a French word y to equal 
its empirical probability, these constraints require that the expected joint probability of 
the English word immediately following in and the French rendering of in be equal 
to its empirical probability. An example of a template 2 constraint is 
$$p(y = \textit{pendant},\ e_{+1} = \textit{several}) = \tilde{p}(y = \textit{pendant},\ e_{+1} = \textit{several})$$
A maximum entropy model that incorporates this constraint will predict the transla- 
tions of in in a manner consistent with whether or not the following word is several. 
In particular, if in the empirical sample the presence of several led to a greater prob- 
ability for pendant, this will be reflected in a maximum entropy model incorporating 
this constraint. We have thus taken our first step toward context-sensitive translation 
modeling. 
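A constraint of this kind equates two expected values of the same feature, one under the empirical sample and one under the model. The sketch below computes both sides for a toy sample. The helper names, the four training events, and the toy model are all invented for the example, and the model-side expectation is assumed to take the usual form for conditional models, $\sum_x \tilde{p}(x) \sum_y p(y \mid x) f(x, y)$.

```python
def empirical_expectation(feature, events):
    """Average of f(x, y) over the observed (x, y) training events."""
    return sum(feature(x, y) for x, y in events) / len(events)

def model_expectation(feature, events, model, labels):
    """Sum over training contexts x (weighted by their empirical frequency)
    of sum_y p(y | x) f(x, y), where model(y, x) returns p(y | x)."""
    total = 0.0
    for x, _ in events:
        total += sum(model(y, x) * feature(x, y) for y in labels)
    return total / len(events)

def feature(x, y):
    # Template 2 feature: y = pendant and the next word is "several".
    return int(y == "pendant" and x["e+1"] == "several")

# Made-up training events: (context, observed translation of "in").
events = [
    ({"e+1": "several"}, "pendant"),
    ({"e+1": "several"}, "dans"),
    ({"e+1": "the"}, "dans"),
    ({"e+1": "the"}, "dans"),
]
labels = ["pendant", "dans"]

def toy_model(y, x):
    # A toy p(y | x) that happens to match the sample's conditional frequencies.
    if x["e+1"] == "several":
        return 0.5
    return 1.0 if y == "dans" else 0.0

lhs = model_expectation(feature, events, toy_model, labels)
rhs = empirical_expectation(feature, events)
```

For this toy model the two expectations coincide, so the model satisfies the constraint; a model that ignored the following word would in general not.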
Templates 3, 4, and 5 consider, each in a different way, various parts of the context. 
For instance, template 5 constraints allow us to model how an expert translator is 
biased by the appearance of a word somewhere in the three words following the word 
being translated. If house appears within the next three words (e.g., the phrases in the 
house and in the red house), then dans might be a more likely translation. On the other 
hand, if year appears within the same window of words (as in in the year 1941 or in 
that fateful year), then au cours de might be more likely. Together, the five constraint 
templates allow the model to condition its assignment of probabilities on a window 
of six words around $e_0$, the word in question.
We constructed a maximum entropy model $p_{\textit{in}}(y \mid x)$ by the iterative model-growing
method described in Section 4. The automatic feature selection algorithm first selected 
a template 1 constraint for each of the translations of in seen in the sample (12 in 
Table 5
Maximum entropy model to predict French translation of in. Features shown
here were the first features selected not from template 1. [verb marker] denotes a
morphological marker inserted to indicate the presence of a verb as the next
word.

Feature f(x,y)                          ~ΔL(S,f)       L(p)
y=à and Canada ∈ [...]                  [illegible]    -2.9674
y=à and House ∈ [...]                   0.0361         -2.9281
y=en and the ∈ [...]                    [illegible]    -2.8944
y=pour and order ∈ [...]                0.0224         -2.8703
y=dans and speech ∈ [...]               0.0190         -2.8525
y=dans and area ∈ [...]                 [illegible]    -2.8377
y=de and increase ∈ [...]               [illegible]    -2.8209
y=[verb marker] and my ∈ [...]          [illegible]    -2.8034
y=dans and case ∈ [...]                 [illegible]    -2.7918
y=au cours de and year ∈ [...]          [illegible]    -2.7792
all), thus constraining the model's expected probability of each of these translations 
to their empirical probabilities. The next few constraints selected by the algorithm are 
shown in Table 5. The first column gives the identity of the feature whose expected 
value is constrained; the second column gives $\sim\Delta L(S, f)$, the approximate increase in
the model's log-likelihood on the data as a result of imposing this constraint; the third 
column gives L(p), the log-likelihood after adjoining the feature and recomputing the 
model. 
Let us consider the fifth row in the table. This constraint requires that the model's 
expected probability of dans, if one of the three words to the right of in is the word 
speech, is equal to that in the empirical sample. Before imposing this constraint on the 
model during the iterative model-growing process, the log-likelihood of the current 
model on the empirical sample was -2.8703 bits. The feature selection algorithm de- 
scribed in Section 4 calculated that if this constraint were imposed on the model, the 
log-likelihood would rise by approximately 0.019059 bits; since this value was higher 
than for any other constraint considered, the constraint was selected. After applying 
iterative scaling to recompute the parameters of the new model, the likelihood of the 
empirical sample rose to -2.8525 bits, an increase of 0.0178 bits. 
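The selection step just described can be sketched as follows. This is a hedged illustration, not the algorithm of Section 4 itself: the function name `select_feature` is hypothetical, and the gains listed for the two competing candidates are made up. Only the figures for the speech feature (approximate gain 0.019059 bits, log-likelihood -2.8703 before and -2.8525 after) come from the text.

```python
def select_feature(approx_gains):
    """Pick the candidate constraint with the largest approximate gain."""
    return max(approx_gains, key=approx_gains.get)

approx_gains = {
    "y=dans and speech": 0.019059,   # value quoted in the text
    "candidate A": 0.015,            # made-up competing candidate
    "candidate B": 0.012,            # made-up competing candidate
}

chosen = select_feature(approx_gains)

# Realized gain after refitting the parameters by iterative scaling.
loglik_before = -2.8703
loglik_after = -2.8525
realized_gain = loglik_after - loglik_before
```

The realized gain of 0.0178 bits is close to, but not identical with, the approximate gain of about 0.0190 bits used to rank the candidates; the approximation serves only to choose among candidates, and the model is then refit.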
Table 6
Maximum entropy model to predict French translation of to run:
top-ranked features not from template 1.

Feature f(x,y)                     ~ΔL(S,f)       L(p)
y=épuiser and out ∈ [...]          0.0252         -4.8499
y=manquer and out ∈ [...]          0.0221         -4.8201
y=écouler and time ∈ [...]         0.0157         -4.7969
y=accumuler and up ∈ [...]         0.0149         -4.7771
y=nous and we ∈ [...]              0.0140         -4.7582
y=aller and counter ∈ [...]        0.0131         -4.7445
y=candidat and for ∈ [...]         0.0124         -4.7295
y=diriger and the ∈ [...]          [illegible]    -4.7146
Table 6 lists the first few selected features for the model for translating the English
word run. The "Hansard flavor"--the rather specific domain of parliamentary
discourse related to Canadian affairs--is easy to detect in many of the features in this
table.
It is not hard to incorporate the maximum entropy word translation models into a
translation model $p(F \mid E)$ for a French sentence given an English sentence. We merely
replace the simple context-independent models $p(y \mid e)$ used in the basic translation
model (33) with the more general context-dependent models $p_e(y \mid x)$:
$$p(F, A \mid E) = \prod_{i=1}^{|E|} p(n(e_i) \mid e_i) \cdot \prod_{j=1}^{|F|} p_{e_{a_j}}(y_j \mid x_{a_j}) \cdot d(A \mid E, F)$$
where $x_{a_j}$ denotes the context of the English word $e_{a_j}$.
Figure 6 illustrates how using this improved translation model in the Candide 
system led to improved translations for the two sample sentences given earlier. 
5.3 Segmentation 
Though an ideal machine translation system could devour input sentences of unre- 
stricted length, a typical stochastic system must cut the French sentences into polite 
lengths before digesting them. If the processing time is exponential in the length of the 
input passage (as is the case with the Candide system), then failing to split the French 
sentences into reasonably-sized segments would result in an exponential slowdown 
in translation. 
Thus, a common task in machine translation is to find safe positions at which 
Je dirais même que les chances sont supérieures à 50%.
I would even say that the odds are greater than 50%.
Il semble que Bank of Boston ait pratiquement achevé son réexamen de Shawmut.
It appears that Bank of Boston has almost completed its review of Shawmut.
Figure 6
Improved French-to-English translations resulting from the maximum entropy-based system.
The_1   dog_2     ate_3                 my_4    homework_5
Le_1    chien_2   a_3   ||   mangé_4    mes_5   ||   devoirs_6
Figure 7
Example of an unsafe segmentation. A word in the translated sentence ($e_3$) is aligned to words
($y_3$ and $y_4$) in two different segments of the input sentence.
to split input sentences in order to speed the translation process. "Safe" is a vague 
term; one might, for instance, reasonably define a safe segmentation as one which 
results in coherent blocks of words. For our purposes, however, a safe segmentation 
is dependent on the Viterbi alignment A between the input French sentence F and its 
English translation E. 
We define a rift as a position j in F such that for all k < j, $a_k \le a_j$ and for all k > j,
$a_k > a_j$. In other words, the words to the left of the French word $y_j$ are generated by
words to the left of the English word $e_{a_j}$, and the words to the right of $y_j$ are generated
by words to the right of $e_{a_j}$. In the alignment of Figure 4, for example, there are rifts
at positions j = 1, 2, 4, 5 in the French sentence. One visual method of determining
whether a rift occurs after the French word $y_j$ is to try to trace a line from the last
letter of $y_j$ up to the last letter of $e_{a_j}$; if the line can be drawn without intersecting any
alignment lines, position j is a rift.
Using our definition of rifts, we can redefine a safe segmentation as one in which
the segment boundaries are located only at rifts. Figure 7 illustrates an unsafe
segmentation, in which a segment boundary (denoted by the || symbol) lies between
a and mangé, where there is no rift. Figure 8, on the other hand, illustrates a safe
segmentation.
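The definitions of a rift and of a safe segmentation translate directly into code. The sketch below is an illustration with hypothetical function names; it skips the final French position, since no boundary can follow the last word, which matches the positions listed for Figure 4.

```python
def rifts(alignment):
    """Return the 1-based positions j (excluding the last word) that are rifts.

    alignment[j - 1] holds a_j. Position j is a rift when a_k <= a_j for
    every k < j and a_k > a_j for every k > j.
    """
    found = []
    for j in range(1, len(alignment)):
        a_j = alignment[j - 1]
        left_ok = all(a_k <= a_j for a_k in alignment[: j - 1])
        right_ok = all(a_k > a_j for a_k in alignment[j:])
        if left_ok and right_ok:
            found.append(j)
    return found

def is_safe(alignment, boundaries):
    """A segmentation is safe when every boundary (placed after position j)
    falls at a rift."""
    rift_set = set(rifts(alignment))
    return all(b in rift_set for b in boundaries)

# Alignment of Figure 4: a_1 = 1, a_2 = 2, a_3 = a_4 = 3, a_5 = 4, a_6 = 5.
figure4 = [1, 2, 3, 3, 4, 5]
figure4_rifts = rifts(figure4)

unsafe = is_safe(figure4, [3, 5])   # boundaries as drawn in Figure 7
safe = is_safe(figure4, [1, 5])     # boundaries as drawn in Figure 8
```

Position 3 fails the test because $a_4 = 3$ is not strictly greater than $a_3 = 3$: both a and mangé are generated by ate, so a boundary between them would split the French words of a single English word.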
The reader will notice that a safe segmentation does not necessarily result in se- 
mantically coherent segments: mes and devoirs are certainly part of one logical unit, 
yet are separated in this safe segmentation. Once such a safe segmentation has been 
applied to the French sentence, we can make the assumption while searching for the 
appropriate English translation that no word in the translated English sentence will 
have to account for French words located in multiple segments. Disallowing inter- 
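The_1          dog_2     ate_3            my_4    homework_5
Le_1   ||   chien_2   a_3   mangé_4    mes_5   ||   devoirs_6
Figure 8
Example of a safe segmentation.
[Diagram: the context x, made up of the words, part-of-speech tags, and word classes in a
window of three positions on either side of position j, together with the value y]
Figure 9
(x, y) for sentence segmentation.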
segment alignments dramatically reduces the scale of the computation involved in 
generating a translation, particularly for large sentences. We can consider each seg- 
ment sequentially while generating the translation, working from left to right in the 
French sentence. 
We now describe a maximum entropy model that assigns to each location in a 
French sentence a score that is a measure of the safety in cutting the sentence at 
that location. We begin as in the word translation problem, with a training sample of 
English-French sentence pairs (E, F) randomly extracted from the Hansard corpus. For 
each sentence pair we use the basic translation model to compute the Viterbi alignment 
between E and F. We also use a stochastic part-of-speech tagger as described in 
Merialdo (1990) to label each word in F with its part of speech. For each position j in F
we then construct an (x, y) training event. The value y is rift if a rift belongs at position
j and is no-rift otherwise. The context information x is reminiscent of that employed 
in the word translation application described earlier. It includes a six-word window 
of French words: three to the left of yj and three to the right of yj. It also includes the 
part-of-speech tags for these words, and the classes of these words as derived from a 
mutual-information clustering scheme described in Brown et al. (1990). The complete 
(x, y) pair is illustrated in Figure 9. 
In creating $p(\text{rift} \mid x)$, we are (at least in principle) modeling the decisions of an
expert French segmenter. We have a sample of his work in the training sample $\tilde{p}(x, y)$,
and we measure the worth of a model by the log-likelihood $L_{\tilde{p}}(p)$. During the iterative
model-growing procedure, the algorithm selects constraints on the basis of how
much they increase this objective function. As the algorithm proceeds, more and more
constraints are imposed on the model p, bringing it into ever-stricter compliance with
the empirical data $\tilde{p}(x, y)$. This is useful to a point; insofar as the empirical data
embody the expert knowledge of the French segmenter, we would like to incorporate
this knowledge into a model. But the data contain only so much expert knowledge;
the algorithm should terminate when it has extracted this knowledge. Otherwise, the
model $p(y \mid x)$ will begin to fit itself to quirks in the empirical data.
A standard approach in statistical modeling, to avoid the problem of overfitting 
the training data, is to employ cross-validation techniques. Separate the training data 
[Plot: log-likelihood (vertical axis, roughly -0.95 to -0.75) against Number of Features
(horizontal axis, 0 to 120), with one curve for Training data and one for Held-out data]
Figure 10
Change in log-likelihood during segmenting model-growing. (Overtraining begins to occur at
about 40 features.)
$\tilde{p}(x, y)$ into a training portion, $\tilde{p}_r$, and a withheld portion, $\tilde{p}_h$. Use only $\tilde{p}_r$ in the
model-growing process; that is, select features based on how much they increase the
likelihood $L_{\tilde{p}_r}(p)$. As the algorithm progresses, $L_{\tilde{p}_r}(p)$ thus increases monotonically. As long
as each new constraint imposed allows p to better account for the random process
that generated both $\tilde{p}_r$ and $\tilde{p}_h$, the quantity $L_{\tilde{p}_h}(p)$ also increases. At the point when
overfitting begins, however, the new constraints no longer help p model the random
process, but instead require p to model the noise in the sample $\tilde{p}_r$ itself. At this point,
$L_{\tilde{p}_r}(p)$ continues to rise, but $L_{\tilde{p}_h}(p)$ no longer does. It is at this point that the algorithm
should terminate.
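The stopping rule can be sketched as a scan over the held-out log-likelihood recorded after each feature is added. The function name `stopping_point` and the numbers in the example are invented; the sketch also assumes the simplest reading of the rule, stopping at the first step where the held-out value fails to increase.

```python
def stopping_point(heldout_loglik):
    """Return the number of features at which to stop.

    heldout_loglik[n] is the held-out log-likelihood of the model with n
    features. The scan stops at the first n whose value does not exceed
    the previous one and returns n - 1.
    """
    for n in range(1, len(heldout_loglik)):
        if heldout_loglik[n] <= heldout_loglik[n - 1]:
            return n - 1
    return len(heldout_loglik) - 1

# Made-up held-out curve: improves for two features, then degrades.
heldout = [-0.95, -0.90, -0.88, -0.89, -0.91]
n_features = stopping_point(heldout)
```

A held-out curve measured in practice is noisy, so an implementation might require several consecutive non-improving steps before stopping; that refinement is not part of the rule as stated above.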
Figure 10 illustrates the change in log-likelihood of training data $L_{\tilde{p}_r}(p)$ and withheld
data $L_{\tilde{p}_h}(p)$. Had the algorithm terminated when the log-likelihood of the withheld
data stopped increasing, the final model p would contain slightly fewer than 40 features.
We have employed this segmenting model as a component in a French-English machine
translation system in the following manner: The model assigns to each position
in the French sentence a score, $p(\text{rift} \mid x)$, which is a measure of how appropriate a
split would be at that location. A dynamic programming algorithm then selects, given
the "appropriateness" score at each position and the requirement that no segment may
contain more than 10 words, an optimal (or, at least, reasonable) splitting of the sentence.
Figure 11 shows the system's segmentation of four sentences selected at random
from the Hansard data. When evaluating Figure 11, the reader should keep in mind
that the segmenter's task is not to produce logically coherent blocks of words, but
to divide the sentence into blocks which can be translated sequentially from left to
right.
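The text does not state the objective that the dynamic programming algorithm optimizes, so the following is a sketch under an explicit assumption: each boundary contributes $\log p(\text{rift} \mid x)$ if it is cut and $\log(1 - p(\text{rift} \mid x))$ if it is not, and the segmentation maximizing the total is chosen subject to the length limit. The function name `segment` and the example scores are hypothetical.

```python
import math

def segment(rift_probs, max_len=10):
    """Choose cut positions for a sentence of len(rift_probs) + 1 words.

    rift_probs[j - 1] is the model's score p(rift | x) for the boundary
    after word j. Returns the sorted list of cut positions maximizing the
    assumed objective, with no segment longer than max_len words.
    """
    n = len(rift_probs) + 1
    eps = 1e-12
    log_cut = [math.log(max(p, eps)) for p in rift_probs]
    log_keep = [math.log(max(1.0 - p, eps)) for p in rift_probs]
    # prefix[i] = total log_keep over boundaries 1..i
    prefix = [0.0]
    for value in log_keep:
        prefix.append(prefix[-1] + value)
    best = [-math.inf] * (n + 1)   # best[i]: best score with a segment ending at word i
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            # Segment covers words start+1..end; its interior boundaries stay uncut.
            score = best[start] + prefix[end - 1] - prefix[start]
            if end < n:
                score += log_cut[end - 1]   # cut after word `end`
            if score > best[end]:
                best[end], back[end] = score, start
    cuts, pos = [], back[n]
    while pos > 0:
        cuts.append(pos)
        pos = back[pos]
    return cuts[::-1]

free_cuts = segment([0.9, 0.1, 0.8])                      # length limit not binding
forced_cuts = segment([0.1, 0.2, 0.4, 0.1], max_len=2)    # limit forces two cuts
```

When the length limit is not binding, this objective simply cuts wherever the score exceeds 0.5; when it is binding, the algorithm is forced to cut at the least unlikely positions.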
Monsieur l' Orateur 
l 
j'aimerais poser une question au 
Ministre des Transports. 
A quelle date le 
nouveau règlement devrait il entrer en vigueur?
Quels furent les critères utilisés
pour l'évaluation
de ces biens.
Nous
savons que si nous pouvions contrôler la folle avoine
dans l'ouest du Canada, en
un an nous
augmenterions notre rendement en
céréales de 1 milliard de dollars.
Figure 11 
Maximum entropy segmenter behavior on four sentences selected at random from the 
Hansard data. 
5.4 Word Reordering 
Translating a French sentence into English involves not only selecting appropriate 
English renderings of the words in the French sentence, but also selecting an ordering 
for the English words. This order is often very different from the French word order. 
One way Candide captures word-order differences in the two languages is to allow for 
alignments with crossing lines. In addition, Candide performs, during a preprocessing 
stage, a reordering step that shuffles the words in the input French sentence into an 
order more closely resembling English word order. 
One component of this word reordering step deals with French phrases which have
the NOUN de NOUN form. For some NOUN de NOUN phrases, the best English translation
is nearly word for word: conflit d'intérêt, for example, is almost always rendered as
conflict of interest. For other phrases, however, the best translation is obtained by
interchanging the two nouns and dropping the de. The French phrase taux d'intérêt, for
example, is best rendered as interest rate. Table 7 gives several examples of NOUN de NOUN
phrases together with their most appropriate English translations.
In this section we describe a maximum entropy model that, given a French NOUN
de NOUN phrase, estimates the probability that the best English translation involves
an interchange of the two nouns. We begin with a sample of English-French sentence
pairs (E, F) randomly extracted from the Hansard corpus, such that F contains
a NOUN de NOUN phrase. For each sentence pair we use the basic translation model to
compute the Viterbi alignment $\hat{A}$ between the words in E and F. Using $\hat{A}$ we construct
an (x, y) training event as follows: We let the context x be the pair of French nouns
($\text{NOUN}_L$, $\text{NOUN}_R$). We let y be no-interchange if the English translation is a word-for-word
translation of the French phrase and y = interchange if the order of the nouns
in the English and French phrases is interchanged.
We define candidate features based upon the template features shown in Table 8. 
Table 7
NOUN de NOUN phrases and their English equivalents.

Word-for-word Phrases
somme d'argent           sum of money
pays d'origine           country of origin
question de privilège    question of privilege
conflit d'intérêt        conflict of interest

Interchanged Phrases
bureau de poste          post office
taux d'intérêt           interest rate
compagnie d'assurance    insurance company
gardien de prison        prison guard
In this table, the symbol $\diamond$ is a placeholder for either interchange or no-interchange
and the symbols $\square_1$ and $\square_2$ are placeholders for possible French words. If there are
$|\mathcal{V}_F|$ total French words, there are $2|\mathcal{V}_F|$ possible features of each of templates 1 and 2 and
$2|\mathcal{V}_F|^2$ features of template 3.
Template 1 features consider only the left noun. We expect these features to be 
relevant when the decision of whether to interchange the nouns is influenced by the 
identity of the left noun. For example, including the template 1 feature
$$f(x, y) = \begin{cases} 1 & \text{if } y = \text{interchange and } \text{NOUN}_L = \textit{système} \\ 0 & \text{otherwise} \end{cases}$$
gives the model sensitivity to the fact that the nouns in French NOUN de NOUN phrases
beginning with système (such as système de surveillance and système de quota) are more
likely to be interchanged in the English translation. Similarly, including the template 1
feature
$$f(x, y) = \begin{cases} 1 & \text{if } y = \text{no-interchange and } \text{NOUN}_L = \textit{mois} \\ 0 & \text{otherwise} \end{cases}$$
gives the model sensitivity to the fact that French NOUN de NOUN phrases beginning
with mois, such as mois de mai (month of May), are more likely to be translated word for
word.
Template 3 features are useful in dealing with translating NOUN de NOUN phrases
in which the interchange decision is influenced by both nouns. For example, NOUN de
NOUN phrases ending in intérêt are sometimes translated word for word, as in conflit
d'intérêt (conflict of interest) and are sometimes interchanged, as in taux d'intérêt (interest
rate).
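The three templates can be sketched as small feature factories. Representing the context x as a (NOUN_L, NOUN_R) tuple follows the text; the function names and the helper `candidate_counts` are hypothetical and serve only to illustrate the feature counts given above.

```python
def left_noun_feature(label, noun):
    """Template 1: 1 when y == label and NOUN_L == noun."""
    return lambda x, y: int(y == label and x[0] == noun)

def right_noun_feature(label, noun):
    """Template 2: 1 when y == label and NOUN_R == noun."""
    return lambda x, y: int(y == label and x[1] == noun)

def noun_pair_feature(label, left, right):
    """Template 3: 1 when y == label and both nouns match."""
    return lambda x, y: int(y == label and x == (left, right))

def candidate_counts(vocab_size):
    """Number of candidate features per template for a French vocabulary
    of the given size (two possible values of y in each case)."""
    return {1: 2 * vocab_size, 2: 2 * vocab_size, 3: 2 * vocab_size ** 2}

f_systeme = left_noun_feature("interchange", "système")
f_conflit = noun_pair_feature("no-interchange", "conflit", "intérêt")
```

A template 3 feature such as `f_conflit` can separate conflit d'intérêt from taux d'intérêt, which a template 2 feature on intérêt alone cannot do.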
We used the feature-selection algorithm of Section 4 to construct a maximum entropy
model from candidate features derived from templates 1, 2, and 3. The model
was grown on 10,000 training events randomly selected from the Hansard corpus. The
final model contained 358 constraints.
To test the model, we constructed a NOUN de NOUN word-reordering module which
interchanges the order of the nouns if $p(\text{interchange} \mid x) > 0.5$ and keeps the order the
same otherwise. Table 9 compares performance on a suite of test data against a baseline
NOUN de NOUN reordering module that never swaps the word order.
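Table 8
Template features for NOUN de NOUN model.

Template   Number of actual features   f(x,y) = 1 if and only if ...
1          $2|\mathcal{V}_F|$            $y = \diamond$ and $\text{NOUN}_L = \square$
2          $2|\mathcal{V}_F|$            $y = \diamond$ and $\text{NOUN}_R = \square$
3          $2|\mathcal{V}_F|^2$          $y = \diamond$ and $\text{NOUN}_L = \square_1$ and $\text{NOUN}_R = \square_2$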
Table 9
NOUN de NOUN model performance: simple approach vs. maximum entropy.

Test data                   Simple model accuracy   Maximum entropy model accuracy
50,229 not interchanged     100%                    93.5%
21,326 interchanged         0%                      49.2%
71,555 total                70.2%                   80.4%
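The totals in Table 9 follow from the per-class rows, as the short check below shows. The decision helper `reorder` is a hypothetical rendering of the 0.5 threshold rule; the counts and per-class accuracies are taken from the table, and the overall figures are recomputed from them.

```python
def reorder(noun_left, noun_right, p_interchange):
    """Return the noun pair in the order chosen by the threshold rule:
    swapped when p(interchange | x) > 0.5, unchanged otherwise."""
    if p_interchange > 0.5:
        return (noun_right, noun_left)
    return (noun_left, noun_right)

not_interchanged, interchanged = 50_229, 21_326
total = not_interchanged + interchanged

# Baseline never swaps: right on every non-interchanged case, wrong on the rest.
baseline_accuracy = not_interchanged / total

# Maximum entropy model: weight the per-class accuracies by class size.
maxent_accuracy = (0.935 * not_interchanged + 0.492 * interchanged) / total
```

The recomputed overall accuracy for the maximum entropy model comes out near 80.3%, which agrees with the 80.4% in the table to within the rounding of the per-class percentages.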
[Diagram: NOUN de NOUN phrases arranged along an axis by p(interchange), from smaller (left)
to larger (right); marked values: .006 .018 .043 .195 .206 .224 .440 .566 .723 .845 .911 .922 .997]
Figure 12
Predictions of the NOUN de NOUN interchange model on phrases selected from a corpus unseen
during the training process.
Figure 12 shows some randomly chosen NOUN de NOUN phrases extracted from
this test suite along with $p(\text{interchange} \mid x)$, the probability assigned by the model to
inversion. On the right are phrases such as saison d'hiver for which the model strongly
predicted an inversion. On the left are phrases the model strongly prefers not to
interchange, such as somme d'argent, abus de privilège and chambre de commerce. Perhaps
most intriguing are those phrases that lie in the middle, such as taux d'inflation, which
can translate either to inflation rate or rate of inflation.
6. Conclusion 
We began by introducing the building blocks of maximum entropy modeling--real- 
valued features and constraints built from these features. We then discussed the max- 
imum entropy principle. This principle instructs us to choose, among all the models 
consistent with the constraints, the model with the greatest entropy. We observed that 
this model was a member of an exponential family with one adjustable parameter for 
each constraint. The optimal values of these parameters are obtained by maximizing 
the likelihood of the training data. Thus two different philosophical approaches-- 
maximum entropy and maximum likelihood--yield the same result: the model with 
the greatest entropy consistent with the constraints is the same as the exponential 
model which best predicts the sample of data. 
We next discussed algorithms for constructing maximum entropy models, concen- 
trating our attention on the two main problems facing would-be modelers: selecting 
a set of features to include in a model, and computing the parameters of a model 
containing these features. The general feature-selection process is too slow in practice, 
and we presented several techniques for making the algorithm feasible. 
In the second part of this paper we described several applications of our algo- 
rithms, concerning modeling tasks arising in Candide, an automatic machine transla- 
tion system under development at IBM. These applications demonstrate the efficacy 
of maximum entropy techniques for performing context-sensitive modeling. 
Acknowledgments 
The authors wish to thank Harry Printz and 
John Lafferty for suggestions and comments 
on a preliminary draft of this paper, and 
Jerome Bellegarda for providing expert 
French knowledge. 
References 
Bahl, L.; Brown, P.; de Souza, P.; and Mercer, 
R. (1989). A tree-based statistical language 
model for natural language speech 
recognition. IEEE Transactions on Acoustics, 
Speech, and Signal Processing, 37(7). 
Berger, A.; Brown, P.; Della Pietra, S.; Della 
Pietra, V.; Gillett, J.; Lafferty, J.; Printz, H.; 
and Urea, L. (1994). The Candide System 
for Machine Translation. In Proceedings, 
ARPA Conference on Human Language 
Technology, Plainsborough, New Jersey. 
Black, E.; Jelinek, E; Lafferty, J.; Magerman, 
D.; Mercer, R.; and Roukos, S. (1992). 
Towards History-based Grammars: Using 
Richer Models for Probabilistic Parsing. 
In Proceedings, DARPA Speech and Natural 
Language Workshop, Arden House, New 
York. 
Brown, D. (1959). A Note on 
Approximations to Discrete Probability 
Distributions. Information and Control, 
2:386-392. 
Brown, P.; Della Pietra, S.; Della Pietra, V.; 
and Mercer, R. (1993). The Mathematics of 
Statistical Machine Translation: Parameter 
Estimation. Computational Linguistics, 
19(2):263--311. 
Brown, P.; Cocke, J.; Della Pietra, S.; Della 
Pietra, V.; Jelinek, E; Lafferty, J.; Mercer, 
R.; and Roossin, P. (1990). A Statistical 
Approach to Machine Translation. 
Computational Linguistics, 16:79-85. 
Brown, P.; Della Pietra, V.; de Souza, P.; and 
Mercer, R. (1990). Class-based N-Gram 
Models of Natural Language. Proceedings, 
IBM Natural Language ITL, 283-298. 
Brown, P.; Della Pietra, S.; Della Pietra, V.; 
and Mercer, R. (1991). A Statistical 
Approach to Sense Disambiguation in 
Machine Translation. In Proceedings, 
DARPA Speech and Natural Language 
Workshop, 146-151. 
Cover, T. and Thomas, J. (1991). Elements of 
Information Theory. John Wiley & Sons. 
Csiszdr, I. (1975). I-Divergence Geometry of 
Probability Distributions and 
Minimization Problems, The Annals of 
Probability, 3(1):146-158. 
ibid. (1989). A Geometric Interpretation of 
Darroch and Ratcliff's Generalized 
Iterative Scaling. The Annals of Statistics, 
17(3):1409-1413. 
Csisz~ir, L. and Tusn~idy, G. (1984). 
Information Geometry and Alternating 
Minimization Procedures. Statistics & 
Decisions, Supplemental Issue, no. 1, 
205-237. 
Darroch, J. N. and Ratcliff, D. (1972). 
Generalized Iterative Scaling for 
Log-linear Models. Annals of Mathematical 
Statistics, no. 43, 1470-1480. 
Della Pietra, S.; Della Pietra, V.; Gillett, J.; 
Lafferty, J.; Printz, H.; and Urea, L. (1994). 
"Inference and Estimation of a 
Long-range Trigram Model." In Lecture 
Notes in Artificfal Intelligence, 862. 
Springer-Verlag, 78-92. 
Della Pietra, S.; Della Pietra, V.; and 
Lafferty, J. (1995). "Inducing features of 
67 
Computational Linguistics Volume 22, Number 1 
random fields" CMU Technical Report 
CMU-CS-95-144. 
Dempster, A. P.; Laird, N. M.; and Rubin, 
D. B. (1977). Maximum Likelihood from 
Incomplete Data via the EM Algorithm. 
Journal of the Royal Statistical Society, 
39(B):1-38. 
Guiasu, S. and Shenitzer, A. (1985). The 
Principle of Maximum Entropy. The 
Mathematical Intelligencer , 7(1). 
Jaynes, E. T. (1990) "Notes on Present Status 
and Future Prospects." In Maximum 
Entropy and Bayesian Methods, edited by 
W. T. Grandy and L. H. Schick. Kluwer, 
1-13. 
Jelinek, F. and Mercer, R. L. (1980). 
Interpolated Estimation of Markov Source 
Parameters from Sparse Data. In 
Proceedings, Workshop on Pattern Recognition 
in Practice, Amsterdam, The Netherlands. 
Lucassen, J. and Mercer, R. (1984). An 
Information Theoretic Approach to 
Automatic Determination of Phonemic 
Baseforms. In Proceedings, IEEE 
International Conference on Acoustics, Speech 
and Signal Processing, San Diego, CA, 
42.5.1-42.5.4. 
Merialdo, B. (1990). Tagging Text with a 
Probabilistic Model. In Proceedings, IBM 
Natural Language ITL, Paris, France, 
161-172. 
Nddas, A.; Mercer, R.; Bahl, L.; Bakis, R.; 
Cohen, P.; Cole, A.-; Jelinek, F.; and Lewis, 
B. (1981). Continuous Speech Recognition 
with Automatically Selected Acoustic 
Prototypes Obtained by either 
Bootstrapping or Clustering. In 
Proceedings, IEEE International Conference on 
Acoustics, Speech and Signal Processing, 
Atlanta, GA, 1153-1155. 
Sokolnikoff, I. S. and Redheffer, R. M. 
(1966). Mathematics of Physics and Modern 
Engineering, Second Edition, McGraw-Hill 
Book Company. 
Appendix: Efficient Algorithms for Feature Selection 
Computing the Approximate Gain of One Feature 
This section picks up where section 4 left off, describing in some detail a set of 
algorithms to implement the feature selection process efficiently. 
We first describe an iterative algorithm for computing 
${\sim}\Delta L(\mathcal{S}, f) \equiv \max_\alpha G_{\mathcal{S},f}(\alpha)$ 
for a candidate feature $f$. The algorithm is based on the fact that the maximum of 
$G_{\mathcal{S},f}(\alpha)$ occurs (except in rare cases) at the unique value $\alpha^*$ at which 
the derivative $G'_{\mathcal{S},f}(\alpha^*)$ is zero. To find this zero, we apply Newton's iterative 
root-finding method. An important twist is that we do not use the updates obtained by 
applying Newton's method directly in the variable $\alpha$. This is because there is no 
guarantee that $G_{\mathcal{S},f}(\alpha_n)$ increases monotonically for such updates. Instead, we use 
updates derived by applying Newton's method in the variables $e^{\alpha}$ or $e^{-\alpha}$. A convexity 
argument shows that using these updates, the sequence of $G_{\mathcal{S},f}(\alpha_n)$ converges 
monotonically to the maximum approximate gain 
${\sim}\Delta L(\mathcal{S}, f) \equiv G_{\mathcal{S},f}(\alpha^*)$ and that $\alpha_n$ moves 
monotonically toward $\alpha^*$. 
The value $\alpha^*$ that maximizes $G_{\mathcal{S},f}(\alpha)$ can be found by solving the equation 
$G'_{\mathcal{S},f}(\alpha^*) = 0$. Moreover, if $\alpha_n$ is any sequence for which $G'_{\mathcal{S},f}(\alpha_n)$ 
converges monotonically to 0, then $G_{\mathcal{S},f}(\alpha_n)$ will increase monotonically. This is a 
consequence of the convexity (in the convex-$\cap$ sense) of $G_{\mathcal{S},f}(\alpha)$ in $\alpha$. 
We can solve an equation $g(\alpha) = 0$ by Newton's method, which produces a sequence 
$\alpha_n$ by the recurrence given in (18), repeated here for convenience: 

$$\alpha_{n+1} = \alpha_n - \frac{g(\alpha_n)}{g'(\alpha_n)} \qquad (34)$$

If we start with $\alpha_0$ sufficiently close to $\alpha^*$, then the sequence $\alpha_n$ will converge 
to $\alpha^*$ and $g(\alpha_n)$ will converge to zero. In general, though, the $g(\alpha_n)$ will not be 
monotonic. However, it can be shown that the sequence is monotonic in the following 
important cases: if $\alpha_0 \le \alpha^*$ and $g(\alpha)$ is either decreasing and convex-$\cup$ or 
increasing and convex-$\cap$. 
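To make the monotonicity claim concrete, the following minimal Python sketch runs the recurrence (34) on an illustrative function of our own choosing, $g(\alpha) = e^{-\alpha} - 1/2$, which is decreasing and convex-$\cup$ with root $\alpha^* = \log 2$. The test function, the helper name `newton`, and the number of steps are assumptions made purely for illustration; they are not part of the method described above.

```python
import math

def newton(g, dg, a0, steps=5):
    """Iterate a_{n+1} = a_n - g(a_n) / g'(a_n) and return every iterate."""
    iterates = [a0]
    for _ in range(steps):
        a = iterates[-1]
        iterates.append(a - g(a) / dg(a))
    return iterates

# Hypothetical test function: decreasing and convex-U, with root log 2.
def g(a):
    return math.exp(-a) - 0.5

def dg(a):
    return -math.exp(-a)

iterates = newton(g, dg, a0=0.0)          # a0 = 0 lies to the left of the root
residuals = [g(a) for a in iterates]      # values g(a_n) along the way

for a, res in zip(iterates, residuals):
    print(f"a_n = {a:.10f}   g(a_n) = {res:.3e}")
```

Started to the left of the root, the iterates in this example climb toward $\log 2$ without overshooting, and the values $g(\alpha_n)$ shrink toward zero from above.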
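The function $G'_{\mathcal{S},f}(\alpha)$ is neither convex-$\cap$ nor convex-$\cup$ as a function of $\alpha$. 
However, it can be shown (by taking derivatives) that $G'_{\mathcal{S},f}(\alpha)$ is decreasing and 
convex-$\cup$ in $e^{\alpha}$, and is increasing and convex-$\cap$ in $e^{-\alpha}$. Thus, if $\alpha^* > 0$ so 
that $e^0 < e^{\alpha^*}$, we can apply Newton's method in $e^{\alpha}$, starting from $\alpha_0 = 0$, to obtain 
a sequence of $\alpha_n$ for which $G'_{\mathcal{S},f}(\alpha_n)$ decreases monotonically to zero (it starts 
at a positive value, since $G'_{\mathcal{S},f}$ is decreasing in $\alpha$ and vanishes at $\alpha^* > 0$). 
Similarly, if $\alpha^* < 0$ so that $e^0 < e^{-\alpha^*}$, we can apply Newton's method in $e^{-\alpha}$ to 
obtain a sequence $\alpha_n$ for which $G'_{\mathcal{S},f}(\alpha_n)$ increases monotonically to zero. In 
either case, $G_{\mathcal{S},f}(\alpha_n)$ increases monotonically to its maximum $G_{\mathcal{S},f}(\alpha^*)$. 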
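The need for the change of variable can be seen in a toy case. The sketch below assumes a single context and a binary feature whose probability under the model is 0.01 while its empirical expectation is 0.5 (both numbers are invented for illustration), so that the gain reduces to $\alpha\,\tilde{p}(f) - \log Z_\alpha$. It compares one plain Newton step in $\alpha$ with one Newton step taken in $e^{\alpha}$, both from $\alpha_0 = 0$. All names in the code are hypothetical.

```python
import math

A = 0.01      # assumed model probability p_S(f = 1 | x) in the single context
P_EMP = 0.5   # assumed empirical expectation of the binary feature f

def expectation(alpha):
    """Model expectation of f after reweighting by exp(alpha * f)."""
    u = A * math.exp(alpha)
    return u / (u + 1.0 - A)

def gain(alpha):
    """G(alpha) = alpha * p~(f) - log Z_alpha for this one-context case."""
    return alpha * P_EMP - math.log(A * math.exp(alpha) + 1.0 - A)

def d1(alpha):
    """First derivative of the gain."""
    return P_EMP - expectation(alpha)

def d2(alpha):
    """Second derivative: minus the variance of a Bernoulli(s) variable."""
    s = expectation(alpha)
    return -s * (1.0 - s)

# Closed-form maximizer for this toy case, used only as a reference point.
alpha_star = math.log(P_EMP * (1.0 - A) / ((1.0 - P_EMP) * A))

plain_step = 0.0 - d1(0.0) / d2(0.0)                 # Newton directly in alpha
exp_step = 0.0 + math.log(1.0 - d1(0.0) / d2(0.0))   # Newton in e^alpha

print(f"alpha* = {alpha_star:.3f}")
print(f"plain Newton:      alpha_1 = {plain_step:.3f}, G = {gain(plain_step):.3f}")
print(f"Newton in e^alpha: alpha_1 = {exp_step:.3f}, G = {gain(exp_step):.3f}")
```

Under these assumed numbers the plain step lands far beyond $\alpha^* = \log 99$ and lowers the gain below its starting value, whereas the step taken in $e^{\alpha}$ stays below $\alpha^*$ and raises the gain.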
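The updates resulting from Newton's method applied in the variable $e^{r\alpha}$, for $r = 1$ 
or $r = -1$, are easily computed: 

$$\alpha_{n+1} = \alpha_n + \frac{1}{r} \log\left(1 - \frac{1}{r}\,\frac{G'_{\mathcal{S},f}(\alpha_n)}{G''_{\mathcal{S},f}(\alpha_n)}\right) \qquad (35)$$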
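For completeness, the update (35) can be recovered from the recurrence (34) by a change of variable. The derivation below is our own reconstruction; the symbols $u$ and $h$ are introduced only for this purpose and do not appear elsewhere in the paper.

```latex
% Newton's method in the variable u = e^{r\alpha}, applied to
% h(u) = G'_{\mathcal{S},f}(\alpha(u)) with \alpha(u) = (1/r) \log u.
\begin{align*}
h'(u) &= G''_{\mathcal{S},f}(\alpha)\,\frac{d\alpha}{du}
       = \frac{G''_{\mathcal{S},f}(\alpha)}{r\,u}, \\
u_{n+1} &= u_n - \frac{h(u_n)}{h'(u_n)}
         = u_n \left(1 - r\,\frac{G'_{\mathcal{S},f}(\alpha_n)}{G''_{\mathcal{S},f}(\alpha_n)}\right), \\
\alpha_{n+1} &= \frac{1}{r}\log u_{n+1}
         = \alpha_n + \frac{1}{r}\log\left(1 - r\,\frac{G'_{\mathcal{S},f}(\alpha_n)}{G''_{\mathcal{S},f}(\alpha_n)}\right).
\end{align*}
% Because r = \pm 1 implies r = 1/r, the last line coincides with (35).
```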
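In order to solve the recurrence in (35), we need to compute $G'_{\mathcal{S},f}$ and $G''_{\mathcal{S},f}$. The 
zeroth, first, and second derivatives of $G$ are 

$$G_{\mathcal{S},f}(\alpha) = -\sum_x \tilde{p}(x) \log Z_\alpha(x) + \alpha\,\tilde{p}(f) \qquad (36)$$

$$G'_{\mathcal{S},f}(\alpha) = \tilde{p}(f) - \sum_x \tilde{p}(x)\, p^{\alpha}_{\mathcal{S},f}(f \mid x) \qquad (37)$$

$$G''_{\mathcal{S},f}(\alpha) = -\sum_x \tilde{p}(x)\, p^{\alpha}_{\mathcal{S},f}\left(\left(f - p^{\alpha}_{\mathcal{S},f}(f \mid x)\right)^2 \,\middle|\, x\right) \qquad (38)$$

where 

$$p^{\alpha}_{\mathcal{S},f}(h \mid x) \equiv \sum_y p^{\alpha}_{\mathcal{S},f}(y \mid x)\, h(x, y) \qquad (39)$$

With these in place, we are ready to enumerate 
Algorithm 3: Computing the Gain of a Single Feature 
Input: Empirical distribution $\tilde{p}(x, y)$; initial model $p_{\mathcal{S}}$; candidate feature $f$ 
Output: Approximate gain ${\sim}\Delta L(\mathcal{S}, f)$ of feature $f$ 
1. Let 
   $$r = \begin{cases} 1 & \text{if } \tilde{p}(f) \ge p_{\mathcal{S}}(f) \\ -1 & \text{otherwise} \end{cases} \qquad (40)$$
   so that $r$ carries the sign of $G'_{\mathcal{S},f}(0) = \tilde{p}(f) - p_{\mathcal{S}}(f)$, and hence of $\alpha^*$ 
2. Set $\alpha_0 \leftarrow 0$ 
3. Repeat the following until $G_{\mathcal{S},f}(\alpha_n)$ has converged: 
   Compute $\alpha_{n+1}$ from $\alpha_n$ using (35) 
   Compute $G_{\mathcal{S},f}(\alpha_{n+1})$ using (26) 
4. Set ${\sim}\Delta L(\mathcal{S}, f) \leftarrow G_{\mathcal{S},f}(\alpha_n)$ 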
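A minimal Python sketch of the single-feature computation follows. It is an illustration under stated assumptions rather than a reference implementation. The data layout (dictionaries `p_x` and `p_model`), the function names, the tolerance, and the toy data are all hypothetical. The sketch takes $p^{\alpha}_{\mathcal{S},f}(y \mid x)$ to be $p_{\mathcal{S}}(y \mid x)\, e^{\alpha f(x,y)} / Z_\alpha(x)$, and it picks $r$ from the sign of $G'(0) = \tilde{p}(f) - p_{\mathcal{S}}(f)$ so that the argument of the logarithm in (35) stays at least 1. The rare case in which the maximum is not attained at a finite $\alpha$ is not handled.

```python
import math

def gain_and_derivatives(alpha, p_x, p_model, feat, p_emp_f):
    """Evaluate G, G' and G'' of (36)-(38) at alpha in one pass over contexts."""
    G, G1, G2 = alpha * p_emp_f, p_emp_f, 0.0
    for x, px in p_x.items():
        # Unnormalized p_S(y|x) * exp(alpha * f(x, y)); their sum is Z_alpha(x).
        w = {y: p * math.exp(alpha * feat(x, y)) for y, p in p_model[x].items()}
        z = sum(w.values())
        mean = sum(wy * feat(x, y) for y, wy in w.items()) / z
        var = sum(wy * (feat(x, y) - mean) ** 2 for y, wy in w.items()) / z
        G -= px * math.log(z)
        G1 -= px * mean
        G2 -= px * var
    return G, G1, G2

def approximate_gain(p_x, p_model, feat, p_emp_f, tol=1e-12, max_iter=100):
    """Return (gain, alpha) after iterating update (35) from alpha_0 = 0."""
    alpha = 0.0
    gain, g1, g2 = gain_and_derivatives(alpha, p_x, p_model, feat, p_emp_f)
    r = 1.0 if g1 >= 0.0 else -1.0   # assumption: r follows the sign of G'(0)
    for _ in range(max_iter):
        if g2 == 0.0:                # f is constant under the model: no update
            break
        alpha += math.log(1.0 - g1 / (r * g2)) / r
        new_gain, g1, g2 = gain_and_derivatives(alpha, p_x, p_model, feat, p_emp_f)
        converged = abs(new_gain - gain) < tol
        gain = new_gain
        if converged:
            break
    return gain, alpha

# Hypothetical toy data: two contexts, binary output, uniform initial model.
p_x = {"a": 0.5, "b": 0.5}
p_model = {x: {0: 0.5, 1: 0.5} for x in p_x}

def feat(x, y):
    return 1.0 if y == 1 else 0.0

gain, alpha = approximate_gain(p_x, p_model, feat, p_emp_f=0.8)
print(f"alpha = {alpha:.6f}, approximate gain = {gain:.6f}")
```

For this toy case the optimum can be checked by hand: the model expectation $e^{\alpha}/(e^{\alpha}+1)$ must equal 0.8, which gives $\alpha^* = \log 4$.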
Computing Approximate Gains in Parallel 
For the purpose of incremental model growing as outlined in Algorithm 2, we need to 
compute the maximum approximate gain ${\sim}\Delta L(\mathcal{S}, f)$ for each candidate feature $f \in \mathcal{F}$. 
One obvious approach is to cycle through all candidate features and apply Algorithm 3 
for each one sequentially. Since Algorithm 3 requires one pass through every event 
in the training sample per iteration, this could entail millions of passes through the 
training sample. Because a significant cost often exists for reading the training data--if 
the data cannot be stored in memory but must be accessed from disk, for example--an 
algorithm that passes a minimal number of times through the data may be of some 
utility. We now give a parallel algorithm specifically tailored to this scenario. 
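Algorithm 4: Computing Approximate Gains for a Collection of Features 
Input: Collection $\mathcal{F}$ of candidate features; empirical distribution $\tilde{p}(x, y)$; 
initial model $p_{\mathcal{S}}$ 
Output: Approximate gain ${\sim}\Delta L(\mathcal{S}, f)$ for each candidate feature $f \in \mathcal{F}$ 
1. For each $f \in \mathcal{F}$, calculate $\tilde{p}(f)$, the expected value of $f$ in the training data 
2. For each $x$, determine the set $\mathcal{F}(x) \subseteq \mathcal{F}$ of $f$ that are active for $x$: 
   $$\mathcal{F}(x) \equiv \{ f \in \mathcal{F} \mid f(x, y)\, p_{\mathcal{S}}(y \mid x)\, \tilde{p}(x) > 0 \text{ for some } y \} \qquad (41)$$
3. For each $f$, let 
   $$r(f) = \begin{cases} 1 & \text{if } \tilde{p}(f) \ge p_{\mathcal{S}}(f) \\ -1 & \text{otherwise} \end{cases}$$
4. For each $f \in \mathcal{F}$, initialize $\alpha(f) \leftarrow 0$ 
5. Repeat the following until $\alpha(f)$ converges for each $f \in \mathcal{F}$: 
   (a) For each $f \in \mathcal{F}$, set 
       $$G'(f) \leftarrow \tilde{p}(f), \qquad G''(f) \leftarrow 0 \qquad (42)$$
   (b) For each $x$, do the following: 
       For each $f \in \mathcal{F}(x)$, update $G'(f)$ and $G''(f)$ by 
       $$G'(f) \leftarrow G'(f) - \tilde{p}(x)\, p^{\alpha}_{\mathcal{S},f}(f \mid x) \qquad (43)$$
       $$G''(f) \leftarrow G''(f) - \tilde{p}(x)\, p^{\alpha}_{\mathcal{S},f}\left(\left(f - p^{\alpha}_{\mathcal{S},f}(f \mid x)\right)^2 \,\middle|\, x\right) \qquad (44)$$
       where $p^{\alpha}_{\mathcal{S},f}(f \mid x) \equiv \sum_y p^{\alpha}_{\mathcal{S},f}(y \mid x)\, f(x, y)$ and $\alpha = \alpha(f)$ 
   (c) For each $f \in \mathcal{F}$, update $\alpha(f)$ by 
       $$\alpha(f) \leftarrow \alpha(f) + \frac{1}{r(f)} \log\left(1 - \frac{1}{r(f)}\,\frac{G'(f)}{G''(f)}\right) \qquad (45)$$
6. For each $f \in \mathcal{F}$, substitute $\alpha(f)$ into (26) to determine ${\sim}\Delta L(\mathcal{S}, f)$. 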
Convergence for this algorithm is guaranteed just as it was for Algorithm 3: after 
each iteration of step 5, the value of $\alpha(f)$ for each candidate feature $f$ is closer to its 
optimal value $\alpha^*(f)$ and, more importantly, the gain $G_{\mathcal{S},f}$ is closer to the maximal 
gain ${\sim}\Delta L(\mathcal{S}, f)$. 
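The parallel procedure can be sketched in Python as follows. This is a rough illustration under assumptions of our own, not a definitive implementation. The dictionary-based data layout, the names `moments` and `parallel_gains`, the stopping tolerance, and the toy data are hypothetical. As in the single-feature case, $p^{\alpha}_{\mathcal{S},f}(y \mid x)$ is taken to be $p_{\mathcal{S}}(y \mid x)\, e^{\alpha f(x,y)} / Z_\alpha(x)$, and $r(f)$ is set from the sign of $\tilde{p}(f) - p_{\mathcal{S}}(f)$ so that the logarithm in (45) receives an argument of at least 1. The point of the design is that every iteration of step 5 makes a single pass over the contexts and updates all active features during that pass.

```python
import math

def moments(x, f, alpha, p_model):
    """Mean, variance and normalizer of f(x, .) under the reweighted model."""
    w = {y: p * math.exp(alpha * f(x, y)) for y, p in p_model[x].items()}
    z = sum(w.values())
    mean = sum(wy * f(x, y) for y, wy in w.items()) / z
    var = sum(wy * (f(x, y) - mean) ** 2 for y, wy in w.items()) / z
    return mean, var, z

def parallel_gains(p_x, p_model, feats, p_emp, tol=1e-10, max_iter=100):
    # Step 2: the features active in each context, as in (41).
    active = {x: [k for k, f in feats.items()
                  if any(f(x, y) * p * px > 0 for y, p in p_model[x].items())]
              for x, px in p_x.items()}
    # Step 3: direction r(f) from the sign of p~(f) - p_S(f) (assumption).
    p_mod = {k: 0.0 for k in feats}
    for x, ks in active.items():
        for k in ks:
            p_mod[k] += p_x[x] * moments(x, feats[k], 0.0, p_model)[0]
    r = {k: 1.0 if p_emp[k] >= p_mod[k] else -1.0 for k in feats}
    # Step 4: start every feature at alpha = 0.
    alpha = {k: 0.0 for k in feats}
    # Step 5: one pass over the contexts per iteration, shared by all features.
    for _ in range(max_iter):
        g1 = dict(p_emp)
        g2 = {k: 0.0 for k in feats}
        for x, ks in active.items():
            for k in ks:
                mean, var, _ = moments(x, feats[k], alpha[k], p_model)
                g1[k] -= p_x[x] * mean
                g2[k] -= p_x[x] * var
        biggest_move = 0.0
        for k in feats:
            if g2[k] < 0.0:          # skip features with no variance to fit
                step = math.log(1.0 - g1[k] / (r[k] * g2[k])) / r[k]
                alpha[k] += step
                biggest_move = max(biggest_move, abs(step))
        if biggest_move < tol:
            break
    # Step 6: plug each alpha(f) into the gain; inactive contexts have Z = 1.
    gains = {k: alpha[k] * p_emp[k] - sum(
                 p_x[x] * math.log(moments(x, f, alpha[k], p_model)[2])
                 for x, ks in active.items() if k in ks)
             for k, f in feats.items()}
    return gains, alpha

# Hypothetical toy data: two contexts, binary output, uniform initial model.
p_x = {"a": 0.5, "b": 0.5}
p_model = {x: {0: 0.5, 1: 0.5} for x in p_x}
feats = {
    "y_is_1": lambda x, y: 1.0 if y == 1 else 0.0,
    "a_and_0": lambda x, y: 1.0 if (x == "a" and y == 0) else 0.0,
}
p_emp = {"y_is_1": 0.8, "a_and_0": 0.1}

gains, alpha = parallel_gains(p_x, p_model, feats, p_emp)
for k in feats:
    print(f"{k}: alpha = {alpha[k]:.6f}, approximate gain = {gains[k]:.6f}")
```

In this toy setting the first feature is active in both contexts and the second only in context "a", so the inner loop of step 5 touches the second feature half as often. Solving the two one-dimensional problems by hand gives $\alpha^* = \log 4$ and $\alpha^* = -\log 4$ respectively.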

