But Dictionaries Are Data Too 
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, 
Meredith J. Goldsmith, Jan Hajic, Robert L. Mercer, and Surya Mohanty 
IBM Thomas J. Watson Research Center 
Yorktown Heights, NY 10598 
ABSTRACT 
Although empiricist approaches to machine translation 
depend vitally on data in the form of large bilingual cor- 
pora, bilingual dictionaries are also a source of information. 
We show how to model at least a part of the information 
contained in a bilingual dictionary so that we can treat a 
bilingual dictionary and a bilingual corpus as two facets of 
a unified collection of data from which to extract values 
for the parameters of a probabilistic machine translation 
system. We give an algorithm for obtaining maximum iike- 
fihood estimates of the parameters of a probabilistic model 
from this combined data and we show how these param- 
eters are affected by inclusion of the dictionary for some 
sample words. 
There is a sharp dichotomy today between ratio- 
nalist and empiricist approaches to machine transla- 
tion: rationalist systems are based on information ca- 
joled fact by reluctant fact from the minds of human 
experts; empiricist systems are based on information 
gathered wholesale from data. The data most readily 
digested by our translation system is from bilingual 
corpora, but bilingual dictiona.ries are data too, and 
in this paper we show how to weave information from 
them into the fabric of our statistical model of the 
translation process. 
When a lexicographer creates an entry in a bilin- 
gum dictionary, he describes in one language the 
meaning and use of a word from another language. 
Often, he includes a list. of simple translations. For ex- 
ample, tile entry for disingenuousness in the Harper- 
Collins Robert French Dictionary \[1\] lists the trans- 
lations d(loyautd, manque de sincdritd, and fourberie. 
In constructing such a list., the lexicographer gath- 
ers, either through introspection or extrospection, in-202 
stances in which disingenuousness has been used in 
various ways and records those of the different trans- 
lations that he deems of sufficient importance. Al- 
though a dictionary is more than just a collection of 
lists, we will concentrate here on that portion of it 
that is made up of lists. 
We formalize an intuitive account of lexicographic 
behavior as follows. We imagine that a lexicographer, 
when constructing an entry for the English word or 
phrase e, first chooses a random size s, and then se- 
lects at random a sample of s instances of the use of e, 
each with its French translation. We imagine, further, 
that he includes in his entry for e a list consisting of 
all of tile translations that occur at least once in his 
random sample. The probability that he will, in this 
way, obtain tile list fi, ..., f,,, is 
Pr(fl,..., f,,, le) = 
sl>O 0 SI''" 
(1) 
s,~/Pr(sle) 1-I~=1 Pr(file) s', 
where Pr(f, le) is the probability from our statistical 
model that the phrase f, occurs as a translation of e, 
and Pr(sle) is the probability that the lexicographer 
chooses to sample s instances of e. The multinomial 
coefficient is defined by 
s ) s! 
sl...sk -s1!...sk!' (2) 
and satisfies the recursion 
(3) 
$ where (,t) is the usual binomial coefficient. 
In genera.I, the sum in Equation (1) cannot be eval- 
uated in closed form, but we can organize an efficient 
calculation of it as follows. Let 
a H p(f~le)". (4) E-. E 
s~>0 s~>0 i=1 
Clearly, 
P(fl,"" ,fml e) = ~ P(sle)ot(s, m). (5) 
$ 
Using Equation (3), it is easy to show that 
= p(f, le) - - 1), (6) 
and therefore, we can compute P(fl,'",fmle) in time 
proportional to s2m. By judicious use of thresholds, 
even this can be substantiMly reduced. 
In the special case that Pr(s\[e) is a Poisson distri- 
bution with mean l(e), i.e., that 
Pr(sle ) -- A(e) 'e-~(e) s! ' (7) 
we can carry out the sum in Equation (1) explicitly, 
Tr~ 
Pr(fl,-.-,fmle) = e -x(e) H(e x(e)p(f'le) - 1). (8) 
i=1 
This is the form that we will assume throughout the 
remainder of the paper because of its simplicity. No- 
tice that in this case, the probability of an entry is 
a product of factors, one for each of the translations 
that it contains. 
The series fi, .-., fm represents the translations 
of e that are included in the dictionary. We call this 
set of translations De. Because we ignore everything 
about the dictionary except for these lists, a complete 
dictionary is just. a collection of De's, one for each 
of the English phrases that has an entry. We treat 
each of these entries as independent and write the 
probability of the entire dictionary as 
Pr(D) ~ H Pr(Del e)' (9) 
eED 
the product here running over all entries. 
Equation (9) gives the probability of the dictio- 
nary in terms of the probabilities of the entries that 
203 
make it up. The probabilities of these entries in turn 
are given by Equation (8) in terms of the probabilities, 
p(fle), of individual French phrases given individual 
English phrases. Combining these two equations, we 
can write 
Pr(D)= H (ex(e)p(rle) -1) H e-X(e)" (10) 
(e,f)ED e~D 
We take p(fle) to be given by the statistical model 
described in detail by Brown et al. \[2\]. Their model 
has a set of translation probabilities, t(fle), giving for 
each French word f and each English word e the prob- 
ability that f will appear as (part of) a translation of 
e; a set of fertility probabilities, n(¢le), giving for each 
integer ¢ and each English word e the probability that 
e will be translated as a phrase containing ¢ French 
words; and a set of distortion probabilities governing 
the placement of French words in the translation of 
an English phrase. They show how to estimate these 
parameters so as to maximize the probability, 
er(//)= H p(rl ), (1\]) 
(e,f)EH 
of a collection of pairs of aligned translations, (e, f) E 
//. 
Let O represent the complete set of parameters of 
the model of Brown et al. \[2\], and let 0 represent 
any one of the parameters. We extend the method 
of Brown et al. to develop a scheme for estimating 
O so as to maximize the joint probability of the cor- 
pus and the dictionary, Pro(//,D). We assume that 
Pro(//, D) = Pro(//)Pro(D). In general, it is pos- 
sible only to find local maxima of Pro(//,D) as a 
function of O, which we can do-by applying the EM 
algorithm \[3, 4\]. The EM algorithm adjusts an initial 
estimate of O in a series of iterations. Each itera- 
tion consists of an estimation step in which a count is 
determined for each parameter, followed by a maxi- 
mization step in which each parameter is replaced by 
a value proportional to its count. The count ce for a 
parameter 0 is defined by 
co = a~0 log Pro(//, D). (12) 
Because we assume that II and D are independent, 
we can write ce as the sum of a count for H and a 
count for D: 
ce = co(It) + co(D). 03) 
The corpus count is a sum of counts, one for each 
translation in the corpus. The dictionary count is also 
a sum of counts, but with each count weighted by a 
factor #(e,f) which we call the effective multiplicity 
of the translation. Thus, 
ce(H)= cde, f) 04) 
(e,f)¢~ 
and 
 (e,f)ce(e,f) 05) (e,f)ED 
with 0 
cs(e,f) = e~0 logpo(fle ). (16) 
The effective multiplicity is just the expected num- 
ber of times that our lexicographer observed the 
translation (e,f) given the dictionary and the cor- 
pus. In terms of the a priori multiplicity, p0(e,f) = 
A(e)p(fle), it is given by 
#0(e,f) 
/s(e,f) - 1 - e-~0(e,f)" (17) 
Figure I shows the effective multiplicity as a func- 
tion of the a priori multiplicity. For small values 
of lL0(e,f), /s(e,f) is approximately equal to 1 + 
#o(e, f)/2. For very large values, #0(e, f) and p(e, f) 
are approximately equal. Thus, if we expect a priori 
.that the lexicographer will see the translation (e,f) 
very many times, lhen the effective multiplicity will 
be nearly equal to this number, but even if we expect 
a priori that he will scarcely ever see a translation, the 
effective multiplicity for it cannot fall below 1. This 
is reasonable because in our model for the dictionary 
construction process, we assume that nothing can get 
into the dictionary unless it is seen at least once by 
the lexicographer. 
RESULTS 
We have used the algorithm described above to es- 
timate translation probabilities and fertilities for our 
statistical model in two different ways. First, we es- 
timated them from the corpus alone, then we esti- 
mated them from the corpus and the dictionary to- 
gether. The corpus that we used is the proceedings 
of the Canadian Parliament described elsewhere \[2\]. 
The dictionary is a machine readable version of the 
HarperCollins Robert French Dictionary \[1\]. 
We do not expect that including informa.tion hom 
the dictionary will have much effect on words that 
204 
5 
0 
- f~" - :/ -! 
_ • - 
// 
~//// . . . 
/ / 
./ 
/ I I I I 
0 1 2 3 4 5 
/1,0 
Figure I: Effective multiplicity vs P0 
occur frequently in the corpus, and this is borne out 
by the data. But. for words that are rare, we expect 
that there will be an effect. 
_f tCfle) \[ ¢ n(¢le) 
toundra .233 
duns .097 
antre .048 
poser .048 
ceux .048 
3 .644 
9 .160 
1 .144 
2 .021 
0 .029 
Table 1: Parameters for tundra, corpus only 
f t(fle) ~ -(4'le) 
toundra .665 
duns .040 
autre .020 
poser .020 
ceux .020 
I .855 
3 .089 
0 .029 
9 .022 
Table 2: Parameters for tundra, corpus and dictionary 
Tables 1 and 2 show the two results for the English 
word tundra. The entry for tundra in the Harper° 
Collins Robert French Dictionary \[1\] is simply the 
word toundra. We interpret this as a list with only 
one entry. We don't know how many times the lexi- 
cography ran across tundra, translated as toundra., but 
we know that it was at least once, and we know that 
he never ran across it translated as anything else. 
Even without the dictionary, toundra appears as the 
most probable translation, but with the dictionary, its 
probability is greatly improved. A more significant 
fact is the change in the fertility probabilities. Rare 
words have a tendency to act as garbage collectors 
in our system. This is why tundra, in the absence 
of guidance from the dictionary has, 3 as its most 
probable fertility and has a significant probability of 
fertility 9. With the dictionary, fertility 1 becomes 
the overwhelming favorite, and fertifity 9 dwindles to 
insignificance. 
Tables 3 and 4 show the trained parameters for 
jungle. The entry for jungle in the HarperCollins 
Robert French Dictionary is simply the word jun- 
gle. As with tundra using the dictionary enhances 
the probability of the dictionary translation of jungle 
and also improves the fertility substantially, 
f *(fl e ) ¢ n(¢l e ) 
jungle .277 
darts .072 
fouillis .045 
domaine .017 
devenir .017 
imbroglio .017 
2 .401 
1 .354 
5 .120 
3 .080 
4 .020 
6 .019 
Table 3: Parameters for jungle, corpus only 
f t(fle) , ¢ n(¢le) 
jungle .442 
dans .057 
fouillis .036 
domaine .014 
devenir .014 
imbroglio .014 
1 .598 
5 .074 
3 .049 
2 .024 
4 .012 
6 .012 
Table 4: Parameters for jungle, corpus and dictionary 
probabilistic functions of a Markov process," In- 
equalities, vol. 3, pp. 1-8, 1972. 
\[4\] A. Dempster, N. Laird, and D. Rubin, "Maximum 
likelihood from incomplete data via the EM algo- 
rithm," Journal of the Royal Statistical Society, 
vol. 39, no. B, pp. 1-38, 1977. 
REFERENCES 
\[1\] B.T. Atkins, A. Dnv~, R. C. Milne, P.-H. Cousin, 
H. M. A. Lewis, L. A. Sinclair, R. O. Birks, and 
M.-N. Lamy, HarperCollins Robert French Dictio- 
nary. New York: Harper & Row, 1990. 
\[2\] P. F. Brown, S. A. DellaPietra, V. J. DellaPietra, 
and R. L. Mercer, "The mathematics of machine 
translation: Parameter estimation." Submitted to 
Computational Linguistics, 1992. 
\[3\] L. Baum, "An inequality and associated max- 
imization technique in statistical estimation °~'05z 
