A Statistical Approach to Sense Disambiguation in 
Machine Translation 
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, Robert L. Mercer
IBM Research Division, Thomas J. Watson Research Center
Yorktown Heights, NY 10598 
ABSTRACT 
We describe a statistical technique for assigning senses to words. An instance of a word is assigned a sense by asking a question about the context in which the word appears. The question is constructed to have high mutual information with the word's translations.
INTRODUCTION 
An alluring aspect of the statistical approach to machine translation rejuvenated by Brown et al. [1] is the systematic framework it provides for attacking the problem of lexical disambiguation. For example, the system they describe translates the French sentence Je vais prendre la décision as I will make the decision, thereby correctly interpreting prendre as make. The statistical translation model, which supplies English translations of French words, prefers the more common translation take, but the trigram language model recognizes that the three-word sequence make the decision is much more probable than take the decision.
The system is not always so successful. It incorrectly renders Je vais prendre ma propre décision as I will take my own decision. Here, the language model does not realize that take my own decision is improbable because take and decision no longer fall within a single trigram.
Errors such as this are common because our statistical models only capture local phenomena; if the context necessary to determine a translation falls outside the scope of our models, the word is likely to be translated incorrectly. However, if the relevant context is encoded locally, the word should be translated correctly. We can achieve this within the traditional paradigm of analysis-transfer-synthesis by incorporating into the analysis phase a sense-disambiguation component that assigns sense labels to French words. If prendre is labeled with one sense in the context of décision but with a different sense in other contexts, then the translation model will learn from training data that the first sense usually translates to make, whereas the other sense usually translates to take.
In this paper, we describe a statistical procedure for constructing a sense-disambiguation component that labels words so as to elucidate their translations.
STATISTICAL TRANSLATION 
As described by Brown et al. [1], in the statistical approach to translation, one chooses for the translation of a French sentence F that English sentence E which has the greatest probability, Pr(E | F), according to a model of the translation process. By Bayes' rule, Pr(E | F) = Pr(E) Pr(F | E) / Pr(F). Since the denominator does not depend on E, the sentence for which Pr(E | F) is greatest is also the sentence for which the product Pr(E) Pr(F | E) is greatest. The first term in this product is a statistical characterization of the English language, and the second term is a statistical characterization of the process by which English sentences are translated into French. We can compute neither of these probabilities precisely. Rather, in statistical translation, we employ a language model $P_{model}(E)$ which provides an estimate of Pr(E) and a translation model which provides an estimate of Pr(F | E).
The performance of the system depends on the extent to which these statistical models approximate the actual probabilities. A useful gauge of this is the cross entropy¹

$$H(\mathbf{E} \mid \mathbf{F}) = -\sum_{E,F} \Pr(E, F) \log P_{model}(E \mid F) \qquad (1)$$

which measures the average uncertainty that the model has about the English translation E of a French sentence F. A better model has less uncertainty and thus a lower cross entropy.
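To make (1) concrete, the following sketch (ours, not from the paper) estimates the cross entropy of a conditional model against an empirical distribution over sentence pairs; the names pair_probs and model are hypothetical.

```python
import math

def cross_entropy(pair_probs, model):
    """Estimate H(E | F) per equation (1). pair_probs maps (E, F)
    sentence pairs to empirical probabilities Pr(E, F); model(e, f)
    returns the model's estimate of P(E = e | F = f). Result in bits."""
    return -sum(p * math.log2(model(e, f))
                for (e, f), p in pair_probs.items())
```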
A shortcoming of the architecture described above is that it requires the statistical models to deal directly with English and French sentences. Clearly the probability distributions Pr(E) and Pr(F | E) over sentences are immensely complicated. On the other hand, in practice the statistical models must be relatively simple in order that their parameters can be reliably estimated from a manageable amount of training data. This usually means that they are restricted to the modeling of local linguistic phenomena. As a result, the estimates $P_{model}(E)$ and $P_{model}(F \mid E)$ will be inaccurate.
This difficulty can be addressed by integrating sta- 
tistical models into the traditional machine transla- 
tion architecture of analysis-transfer-synthesis. The 
resulting system employs 
1. An analysis component which encodes a French sentence F into an intermediate structure F'.

2. A statistical transfer component which translates F' into a corresponding intermediate English structure E'. This component incorporates a language model, a translation model, and a decoder as before, but here these components deal with the intermediate structures rather than the sentences directly.

3. A synthesis component which reconstructs an English sentence E from E'.
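As a sketch of how the three components compose (the function names here are illustrative stand-ins, not from the paper):

```python
def translate(f_sentence, analyze, transfer, synthesize):
    """Analysis-transfer-synthesis: the statistical language and
    translation models only ever see the intermediate structures."""
    f_prime = analyze(f_sentence)   # 1. analysis: F -> F'
    e_prime = transfer(f_prime)     # 2. statistical transfer: F' -> E'
    return synthesize(e_prime)      # 3. synthesis: E' -> E
```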
For statistical modeling we require that the synthesis transformation E' → E be invertible. Typically, analysis and synthesis will involve a sequence of successive transformations in which F' is incrementally constructed from F, or E is incrementally recovered from E'.

¹ In this equation and in the remainder of the paper, we use boldface letters (e.g., $\mathbf{E}$) for random variables and roman letters (e.g., E) for the values of random variables.
The purpose of analysis and synthesis is to facilitate the task of statistical transfer. This will be the case if the probability distribution Pr(E', F') is easier to model than the original distribution Pr(E, F). In practice this means that E' and F' should encode global linguistic facts about E and F in a local form.

The utility of the analysis and synthesis transformations can be measured in terms of cross entropy. Thus transformations F → F' and E' → E are useful if we can construct models $P'_{model}(F' \mid E')$ and $P'_{model}(E')$ such that $H(\mathbf{E}' \mid \mathbf{F}') < H(\mathbf{E} \mid \mathbf{F})$.
SENSE DISAMBIGUATION 
In this paper we present a statistical method for automatically constructing analysis and synthesis transformations which perform cross-lingual word-sense labeling. The goal of such transformations is to label the words of a French sentence so as to elucidate their English translations, and, conversely, to label the words of an English sentence so as to elucidate their French translations. For example, in some contexts the French verb prendre translates as to take, but in other contexts it translates as to make. A sense-disambiguation transformation, by examining the contexts, might label occurrences of prendre that likely mean to take with one label, and other occurrences of prendre with another label. Then the uncertainty in the translation of prendre given the label would be less than the uncertainty in the translation of prendre without the label. Although the label does not provide any information that is not already present in the context, it encodes this information locally. Thus a local statistical model for the transfer of labeled sentences should be more accurate than one for the transfer of unlabeled ones.
While the translation of a word depends on many words in its context, we can often obtain information by looking at only a single word. For example, in the sentence Je vais prendre ma propre décision (I will make my own decision), the verb prendre should be translated as make because its object is décision. If we replace décision by voiture then prendre should be translated as take: Je vais prendre ma propre voiture (I will take my own car). Thus we can reduce the uncertainty in the translation of prendre by asking a question about its object, which is often the first noun to its right, and we might assign a sense to prendre based upon the answer to this question.
In Il doute que les nôtres gagnent (He doubts that we will win), the word il should be translated as he. On the other hand, if we replace doute by faut then il should be translated as it: Il faut que les nôtres gagnent (It is necessary that we win). Here, we might assign a sense label to il by asking about the identity of the first verb to its right.
These examples motivate a sense-labeling scheme in which the label of a word is determined by a question about an informant word in its context. In the first example, the informant of prendre is the first noun to the right; in the second example, the informant of il is the first verb to the right. If we want to assign n senses to a word then we can consider a question with n answers.
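A sense-labeling transformation of this kind is simple to sketch in code. The informant extractor and question table below are hypothetical stand-ins of our own; in practice the question is chosen automatically, as described later.

```python
def first_noun_to_right(sentence, i, is_noun):
    """Informant site: the first noun to the right of position i,
    or None if there is none."""
    return next((w for w in sentence[i + 1:] if is_noun(w)), None)

def label_word(word, informant, question):
    """Pair a word with the sense given by an n-ary question about
    its informant; question maps informant words to sense labels."""
    return (word, question.get(informant, 0))

# Hypothetical two-sense question for prendre (cf. the examples above):
question = {"décision": 1, "voiture": 2}
print(label_word("prendre", "décision", question))  # ('prendre', 1)
```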
We can fit this scheme into the framework of the previous section as follows:

The Intermediate Structures. The intermediate structures E' and F' consist of sequences of words labeled by their senses. Thus F' is a sentence over the expanded vocabulary whose 'words' f' are pairs (f, l) where f is a word in the original French vocabulary and l is its sense label. Similarly, E' is a sentence over the expanded vocabulary whose words e' are pairs (e, l) where e is an English word and l is its sense label.

The Analysis and Synthesis Transformations. For each French word and each English word we choose an informant site, such as first noun to the left, and an n-ary question about the value of the informant at that site. The analysis transformation F → F' and the inverse synthesis transformation E → E' map a sentence to the intermediate structure in which each word is labeled by a sense determined by the question about its informant. The synthesis transformation E' → E maps a labeled sentence to a sentence in which the labels have been removed.

The Probability Models. We use the translation model that was discussed in [1] for both $P_{model}(F' \mid E')$ and $P_{model}(F \mid E)$. We use a trigram language model [1] for both $P_{model}(E)$ and $P_{model}(E')$.
In order to construct these transformations we need to choose for each English and French word an informant and a question. As suggested in the previous section, a criterion for doing this is that of minimizing the cross entropy H(E' | F'). In the remainder of the paper we present an algorithm for doing this.
THE TRANSLATION MODEL 
We begin by reviewing our statistical model for the translation of a sentence from one language to another [1]. In a statistical French-to-English translation system, we need to model transformations from English sentences E to French sentences F, or from intermediate English structures E' to intermediate French structures F'. However, it is clarifying to consider transformations from an arbitrary source language to an arbitrary target language.
Review of the Model 
The purpose of a translation model is to compute the probability $P_{model}(T \mid S)$ of transforming a source sentence S into a target sentence T. For our simple model, we assume that each word of S independently produces zero or more words from the target vocabulary and that these words are then ordered to produce T. We use the term alignment to refer to an association between words in T and words in S. The probability $P_{model}(T \mid S)$ is the sum of the probabilities of all possible alignments A between S and T:

$$P_{model}(T \mid S) = \sum_A P_{model}(T, A \mid S). \qquad (2)$$
The joint probability $P_{model}(T, A \mid S)$ of T and a particular alignment is given by

$$P_{model}(T, A \mid S) = \prod_{t \in T} p(t \mid s_{A(t)}) \prod_{s \in S} p(n_A(s) \mid s) \cdot P_{distortion}(T, A \mid S). \qquad (3)$$

Here $s_{A(t)}$ is the word of S aligned with t in the alignment A, and $n_A(s)$ is the number of words of T aligned with s in A. The distortion model $P_{distortion}$ describes the ordering of the words of T. We will not give it explicitly. The parameters in (3) are

1. The probabilities $p(n \mid s)$ that a word s in the source language generates n target words;

2. The probabilities $p(t \mid s)$ that s generates the word t;

3. The parameters of the distortion model.
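To make (3) concrete, here is a sketch of ours for the log of this joint probability, given word-translation and fertility tables; since the paper does not specify the distortion model, its contribution is left as a caller-supplied constant.

```python
import math
from collections import Counter

def log_p_target_and_alignment(target, source, alignment,
                               p_word, p_fert, log_p_distortion=0.0):
    """log P(T, A | S) per equation (3). alignment[j] is the index of
    the source word aligned with target word j; p_word(t, s) = p(t | s)
    and p_fert(n, s) = p(n | s)."""
    fertility = Counter(alignment)          # n_A(s) for each source position
    logp = sum(math.log(p_word(t, source[alignment[j]]))
               for j, t in enumerate(target))
    logp += sum(math.log(p_fert(fertility[i], s))
                for i, s in enumerate(source))
    return logp + log_p_distortion
```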
We determine values for these parameters using maximum likelihood training. Thus we collect a large bilingual corpus consisting of pairs of sentences (S, T) which are translations of one another, and we seek parameter values that maximize the likelihood of this training data as computed by the model. This is equivalent to minimizing the cross entropy

$$H(\mathbf{T} \mid \mathbf{S}) = -\sum_{S,T} P_{train}(S, T) \log P_{model}(T \mid S) \qquad (4)$$

where $P_{train}(S, T)$ is the empirical distribution obtained by counting the number of times that the pair (S, T) occurs in the training corpus.
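The empirical distribution $P_{train}$ of (4) is just relative frequency over the bilingual corpus; a minimal sketch, with sentences represented as tuples of words so that pairs are hashable:

```python
from collections import Counter

def empirical_distribution(corpus):
    """P_train(S, T) of equation (4): the relative frequency of each
    sentence pair in the bilingual training corpus."""
    counts = Counter(corpus)                # corpus: iterable of (S, T) pairs
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}
```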
The Viterbi Approximation 
The sum over alignments in (2) is too expensive to compute directly since the number of alignments increases exponentially with sentence length. It is useful to approximate this sum by the single term corresponding to the alignment, A(S, T), with greatest probability. We refer to this approximation as the Viterbi approximation and to A(S, T) as the Viterbi alignment.
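Computing the exact Viterbi alignment under (3) is itself nontrivial. As a crude illustration only, the sketch below (ours) scores each target word independently and ignores the fertility and distortion terms, so it merely approximates the Viterbi alignment.

```python
def approximate_viterbi_alignment(target, source, p_word):
    """Align each target word with the source word maximizing p(t | s).
    This maximizes only the first product in (3), ignoring fertility
    and distortion, so it only approximates the Viterbi alignment."""
    return [max(range(len(source)), key=lambda i: p_word(t, source[i]))
            for t in target]
```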
Let c(s, t) be the expected number of times that s is aligned with t in the Viterbi alignment of a pair of sentences drawn at random from the training data. Let c(s, n) be the expected number of times that s is aligned with n words. Then

$$c(s, t) = \sum_{S,T} P_{train}(S, T)\, c(s, t \mid A(S, T))$$
$$c(s, n) = \sum_{S,T} P_{train}(S, T)\, c(s, n \mid A(S, T)) \qquad (5)$$

where c(s, t | A) is the number of times that s is aligned with t in the alignment A, and c(s, n | A) is the number of times that s generates n target words in A. It can be shown [2] that these counts are also averages with respect to the model:

$$c(s, t) = \sum_{S,T} P_{model}(S, T)\, c(s, t \mid A(S, T))$$
$$c(s, n) = \sum_{S,T} P_{model}(S, T)\, c(s, n \mid A(S, T)). \qquad (6)$$

By normalizing the counts c(s, t) and c(s, n) we obtain probability distributions p(s, t) and p(s, n):²

$$p(s, t) = \frac{1}{norm}\, c(s, t) \qquad p(s, n) = \frac{1}{norm}\, c(s, n). \qquad (7)$$
² In these equations and in the remainder of the paper, we use the generic symbol norm to denote a normalizing factor that converts counts to probabilities. We let the actual value of norm be implicit from the context. Thus, for example, in the left-hand equation of (7), the normalizing factor is $norm = \sum_{s,t} c(s, t)$, which equals the average length of target sentences. In the right-hand equation of (7), the normalizing factor is the average length of source sentences.
The conditional distributions p(t | s) and p(n | s) are the Viterbi approximation estimates for the parameters of the model. The marginals satisfy

$$\sum_n p(s, n) = u(s) \qquad \sum_s p(s, t) = u(t) \qquad \sum_t p(s, t) = \bar{n}(s)\, u(s) \qquad (8)$$

where u(s) and u(t) are the unigram distributions of s and t and $\bar{n}(s) = \sum_n p(n \mid s)\, n$ is the average number of target words aligned with s. These formulae reflect the fact that in any alignment each target word is aligned with exactly one source word.
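A small sketch of the normalization step of equation (7), with counts stored in dictionaries keyed by (s, t) pairs; the helper for p(t | s) is our addition.

```python
def normalize(counts):
    """Turn counts c(s, t) (or c(s, n)) into a joint distribution,
    per equation (7); norm is the total count."""
    norm = sum(counts.values())
    return {key: c / norm for key, c in counts.items()}

def conditional_t_given_s(joint):
    """Derive p(t | s) from the joint p(s, t)."""
    marginal = {}
    for (s, t), p in joint.items():
        marginal[s] = marginal.get(s, 0.0) + p
    return {(s, t): p / marginal[s] for (s, t), p in joint.items()}
```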
CROSS ENTROPY 
In this section we express the cross entropies H(S | T) and H(S' | T') in terms of the information between source and target words.

In the Viterbi approximation the cross entropy H(T | S) is given by

$$H(\mathbf{T} \mid \mathbf{S}) = L_T\, \{ H(t \mid s) + H(n \mid s) \} \qquad (9)$$

where $L_T$ is the average length of the target sentences in the training data, and H(t | s) and H(n | s) are the conditional entropies for the probability distributions p(s, t) and p(s, n):

$$H(t \mid s) = -\sum_{s,t} p(s, t) \log p(t \mid s)$$
$$H(n \mid s) = -\sum_{s,n} p(s, n) \log p(n \mid s). \qquad (10)$$
We want a similar expression for the cross entropy H(S | T). Since

$$P_{model}(S, T) = P_{model}(T \mid S)\, P_{model}(S),$$

this cross entropy depends on both the translation model, $P_{model}(T \mid S)$, and the language model, $P_{model}(S)$. We now show that with a suitable additional approximation,

$$H(\mathbf{S} \mid \mathbf{T}) = L_T\, \{ H(n \mid s) - I(s, t) \} + H(\mathbf{S}) \qquad (11)$$
where H(S) is the cross entropy of $P_{model}(S)$ and I(s, t) is the mutual information between t and s for the probability distribution p(s, t).

The additional approximation that we require is

$$H(\mathbf{T}) \approx L_T\, H(t) \equiv -L_T \sum_t p(t) \log p(t) \qquad (12)$$

where p(t) is the marginal of p(s, t). This amounts to approximating $P_{model}(T)$ by the unigram distribution that is closest to it in cross entropy. Granting this, formula (11) is a consequence of (9) and of the identities

$$H(\mathbf{S} \mid \mathbf{T}) = H(\mathbf{T} \mid \mathbf{S}) - H(\mathbf{T}) + H(\mathbf{S}),$$
$$H(t) = H(t \mid s) + I(s, t). \qquad (13)$$
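Spelling out this consequence (a short derivation we add for completeness; it uses only (9), (12), and (13)):

```latex
\begin{align*}
H(\mathbf{S}\mid\mathbf{T})
  &= H(\mathbf{T}\mid\mathbf{S}) - H(\mathbf{T}) + H(\mathbf{S})
     && \text{first identity of (13)} \\
  &\approx L_T\{H(t\mid s) + H(n\mid s)\} - L_T H(t) + H(\mathbf{S})
     && \text{by (9) and (12)} \\
  &= L_T\{H(n\mid s) - (H(t) - H(t\mid s))\} + H(\mathbf{S}) \\
  &= L_T\{H(n\mid s) - I(s, t)\} + H(\mathbf{S})
     && \text{second identity of (13),}
\end{align*}
```

which is (11).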
Next consider H(S' | T'). Let S → S' and T → T' be sense-labeling transformations of the type discussed earlier. Assume that these transformations preserve Viterbi alignments; that is, if the words s and t are aligned in the Viterbi alignment for (S, T), then their sensed versions s' and t' are aligned in the Viterbi alignment for (S', T'). It follows that the word translation probabilities obtained from the Viterbi alignments satisfy $p(s, t) = \sum_{t'} p(s, t') = \sum_{s'} p(s', t)$, where the sums range over the sensed versions t' of t and the sensed versions s' of s.

By applying (11) to the cross entropies H(S | T), H(S | T'), and H(S' | T), it is not hard to verify that

$$H(\mathbf{S} \mid \mathbf{T}') = H(\mathbf{S} \mid \mathbf{T}) - L_T \sum_t p(t)\, I(s, t' \mid t)$$
$$H(\mathbf{S}' \mid \mathbf{T}) = H(\mathbf{S} \mid \mathbf{T}) - L_T \sum_s p(s)\, \{ I(t, s' \mid s) + I(n, s' \mid s) \}. \qquad (14)$$

Here I(s, t' | t) is the conditional mutual information given a target word t between its translations s and its sensed versions t'; I(t, s' | s) is the conditional mutual information given a source word s between its translations t and its sensed versions s'; and I(n, s' | s) is the conditional mutual information given s between n and its sensed versions s'.
SELECTING QUESTIONS

We now present an algorithm for finding good informants and questions for sensing.

Target Questions

For sensing target sentences, a question about an informant is a function $\hat{c}$ from the target vocabulary into the set of possible senses. If the informant of t is x, then t is assigned the sense $\hat{c}(x)$. We want to choose the function $\hat{c}$ to minimize the cross entropy H(S | T'). From formula (14), we see that this is equivalent to maximizing the conditional mutual information I(s, t' | t) between s and t',

$$I(s, t' \mid t) = \sum_{s,x} p(s, x \mid t) \log \frac{p(s, \hat{c}(x) \mid t)}{p(s \mid t)\, p(\hat{c}(x) \mid t)} \qquad (15)$$

where p(s, t, x) is the probability distribution obtained by counting the number of times in the Viterbi alignments that s is aligned with t and the value of the informant of t is x:
$$p(s, t, x) = \frac{1}{norm} \sum_{S,T} P_{train}(S, T)\, c(s, t, x \mid A(S, T))$$
$$p(s, t, c) = \sum_{x:\, \hat{c}(x) = c} p(s, t, x). \qquad (16)$$
An exhaustive search for the best $\hat{c}$ requires a computation that is exponential in the number of values of x and is not practical. In previous work [3] we found a good $\hat{c}$ using the flip-flop algorithm [4], which is only applicable if the number of senses is restricted to two. Since then, we have developed a different algorithm that can be used to find $\hat{c}$ for any number of senses. The algorithm uses the technique of alternating minimization, and is similar to the k-means algorithm for determining pattern clusters and to the generalized Lloyd algorithm for designing vector quantizers. A discussion of alternating minimization, together with references, can be found in Chou [5].

The algorithm is based on the fact that, up to a constant independent of $\hat{c}$, the mutual information I(s, t' | t) can be expressed in terms of an infimum over conditional probability distributions q(s | c),

$$I(s, t' \mid t) = \mathrm{constant} - \inf_q \sum_x p(x)\, D\big(p(s \mid x, t)\, ;\, q(s \mid \hat{c}(x))\big) \qquad (17)$$
where

$$D\big(p(s)\, ;\, q(s)\big) \equiv \sum_s p(s) \log \frac{p(s)}{q(s)}. \qquad (18)$$
The best value of the information is thus obtained through an infimum over both the choice of $\hat{c}$ and the choice of the q. This suggests the following iterative procedure for obtaining a good $\hat{c}$:

1. For given q, find the best $\hat{c}$: $\hat{c}(x) = \arg\min_c D\big(p(s \mid x, t)\, ;\, q(s \mid c)\big)$.

2. For this $\hat{c}$, find the best q: $q(s \mid c) = p(s \mid c, t)$, the distribution obtained by averaging $p(s \mid x, t)$ over the x for which $\hat{c}(x) = c$.

3. Iterate steps (1) and (2) until no further increase in I(s, t' | t) results.
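The following is a sketch of this procedure for a single target word t, under assumed inputs: p_x maps each informant value x to p(x), and p_s_given_x[x] maps source words to p(s | x, t). Initialization and tie-breaking are arbitrary, and a real implementation would use the paper's convergence test on I(s, t' | t) rather than the question-stability test used here.

```python
import math
from collections import defaultdict

def kl_divergence(p, q):
    """D(p ; q) per equation (18); q values missing from the table are
    floored to avoid taking the log of zero."""
    return sum(ps * math.log(ps / q.get(s, 1e-12))
               for s, ps in p.items() if ps > 0)

def find_question(p_x, p_s_given_x, n_senses, max_iters=100):
    """Alternating minimization for a question c-hat with n_senses values."""
    xs = list(p_x)
    c = {x: i % n_senses for i, x in enumerate(xs)}  # arbitrary initial question
    for _ in range(max_iters):
        # Step 2: best q for the current question -- within each sense
        # class, average the distributions p(s | x, t) weighted by p(x).
        q = {cl: defaultdict(float) for cl in range(n_senses)}
        mass = defaultdict(float)
        for x in xs:
            mass[c[x]] += p_x[x]
            for s, ps in p_s_given_x[x].items():
                q[c[x]][s] += p_x[x] * ps
        for cl, dist in q.items():
            for s in dist:
                dist[s] /= mass[cl]
        # Step 1: best question for the current q -- send each x to the
        # sense class whose q is nearest in Kullback-Leibler divergence.
        new_c = {x: min(range(n_senses),
                        key=lambda cl: kl_divergence(p_s_given_x[x], q[cl]))
                 for x in xs}
        if new_c == c:                               # converged
            break
        c = new_c
    return c
```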
Source Questions

For sensing source sentences, a question about an informant is a function $\hat{c}$ from the source vocabulary into the set of possible senses. We want to choose $\hat{c}$ to minimize the entropy H(S' | T). From (14) this is equivalent to maximizing the sum I(t, s' | s) + I(n, s' | s). In analogy to (17), this sum can, up to a constant, be expressed in terms of an infimum over conditional probability distributions, and we can again find a good $\hat{c}$ by alternating minimization.
CONCLUSION 
In this paper we presented a general framework for integrating analysis and synthesis with statistical translation, and within this framework we investigated cross-lingual sense labeling. We gave an algorithm for automatically constructing a simple labeling transformation that assigns a sense to a word by asking a question about a single word of the context. In a companion paper [3] we present results of translation experiments using a sense-labeling component that employs a similar algorithm. We are currently studying the automatic construction of more complex transformations which utilize more detailed contextual information.
REFERENCES

[1] P. Brown, J. Cocke, S. DellaPietra, V. DellaPietra, F. Jelinek, J. Lafferty, R. Mercer, and P. Roossin, "A statistical approach to machine translation," Computational Linguistics, vol. 16, pp. 79-85, June 1990.

[2] P. Brown, S. DellaPietra, V. DellaPietra, and R. Mercer, "Initial estimates of word translation probabilities." In preparation.

[3] P. Brown, S. DellaPietra, V. DellaPietra, and R. Mercer, "Word sense disambiguation using statistical methods," in Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, (Berkeley, CA), June 1991.

[4] A. Nadas, D. Nahamoo, M. Picheny, and J. Powell, "An iterative 'flip-flop' approximation of the most informative split in the construction of decision trees," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, (Toronto, Canada), May 1991.

[5] P. Chou, Applications of Information Theory to Pattern Recognition and the Design of Decision Trees and Trellises. PhD thesis, Stanford University, June 1988.