The Mathematics of Statistical Machine 
Translation: Parameter Estimation 
Peter F. Brown* 
IBM T.J. Watson Research Center 
Vincent J. Della Pietra* 
IBM T.J. Watson Research Center 
Stephen A. Della Pietra* 
IBM T.J. Watson Research Center 
Robert L. Mercer* 
IBM T.J. Watson Research Center 
We describe a series of five statistical models of the translation process and give algorithms for 
estimating the parameters of these models given a set of pairs of sentences that are translations 
of one another. We define a concept of word-by-word alignment between such pairs of sentences. 
For any given pair of such sentences each of our models assigns a probability to each of the 
possible word-by-word alignments. We give an algorithm for seeking the most probable of these 
alignments. Although the algorithm is suboptimal, the alignment thus obtained accounts well for 
the word-by-word relationships in the pair of sentences. We have a great deal of data in French 
and English from the proceedings of the Canadian Parliament. Accordingly, we have restricted 
our work to these two languages; but we feel that because our algorithms have minimal linguistic 
content they would work well on other pairs of languages. We also feel, again because of the 
minimal linguistic content of our algorithms, that it is reasonable to argue that word-by-word 
alignments are inherent in any sufficiently large bilingual corpus. 
1. Introduction 
The growing availability of bilingual, machine-readable texts has stimulated interest 
in methods for extracting linguistically valuable information from such texts. For ex- 
ample, a number of recent papers deal with the problem of automatically obtaining 
pairs of aligned sentences from parallel corpora (Warwick and Russell 1990; Brown, 
Lai, and Mercer 1991; Gale and Church 1991b; Kay 1991). Brown et al. (1990) assert, 
and Brown, Lai, and Mercer (1991) and Gale and Church (1991b) both show, that it is 
possible to obtain such aligned pairs of sentences without inspecting the words that 
the sentences contain. Brown, Lai, and Mercer base their algorithm on the number of 
words that the sentences contain, while Gale and Church base a similar algorithm on 
the number of characters that the sentences contain. The lesson to be learned from 
these two efforts is that simple, statistical methods can be surprisingly successful in 
achieving linguistically interesting goals. Here, we address a natural extension of that 
work: matching up the words within pairs of aligned sentences. 
In recent papers, Brown et al. (1988, 1990) propose a statistical approach to ma- 
chine translation from French to English. In the latter of these papers, they sketch an 
algorithm for estimating the probability that an English word will be translated into 
any particular French word and show that such probabilities, once estimated, can be 
used together with a statistical model of the translation process to align the words 
in an English sentence with the words in its French translation (see their Figure 3). 
* IBM T.J. Watson Research Center, Yorktown Heights, NY 10598 
© 1993 Association for Computational Linguistics 
Pairs of sentences with words aligned in this way offer a valuable resource for work 
in bilingual lexicography and machine translation. 
Section 2 is a synopsis of our statistical approach to machine translation. Following 
this synopsis, we develop some terminology and notation for describing the word-by- 
word alignment of pairs of sentences. In Section 4 we describe our series of models 
of the translation process and give an informal discussion of the algorithms by which 
we estimate their parameters from data. We have written Section 4 with two aims 
in mind: first, to provide the interested reader with sufficient detail to reproduce our 
results, and second, to hold the mathematics at the level of college calculus. A few 
more difficult parts of the discussion have been postponed to the Appendix. 
In Section 5, we present results obtained by estimating the parameters for these 
models from a large collection of aligned pairs of sentences from the Canadian Hansard 
data (Brown, Lai, and Mercer 1991). For a number of English words, we show trans- 
lation probabilities that give convincing evidence of the power of statistical methods 
to extract linguistically interesting correlations from large corpora. We also show au- 
tomatically derived word-by-word alignments for several sentences. 
In Section 6, we discuss some shortcomings of our models and propose modifica- 
tions to address some of them. In the final section, we discuss the significance of our 
work and the possibility of extending it to other pairs of languages. 
Finally, we include two appendices: one to summarize notation and one to collect 
the formulae for the various models that we describe and to fill an occasional gap in 
our development. 
2. Statistical Translation 
In 1949, Warren Weaver suggested applying the statistical and cryptanalytic techniques 
then emerging from the nascent field of communication theory to the problem of us- 
ing computers to translate text from one natural language to another (published in 
Weaver 1955). Efforts in this direction were soon abandoned for various philosophical 
and theoretical reasons, but at a time when the most advanced computers were of a 
piece with today's digital watch, any such approach was surely doomed to computa- 
tional starvation. Today, the fruitful application of statistical methods to the study of 
machine translation is within the computational grasp of anyone with a well-equipped 
workstation. 
A string of English words, e, can be translated into a string of French words in 
many different ways. Often, knowing the broader context in which e occurs may serve 
to winnow the field of acceptable French translations, but even so, many acceptable 
translations will remain; the choice among them is largely a matter of taste. In statistical 
translation, we take the view that every French string, f, is a possible translation of e. 
We assign to every pair of strings (e, f) a number Pr(f|e), which we interpret as the 
probability that a translator, when presented with e, will produce f as his translation. 
We further take the view that when a native speaker of French produces a string 
of French words, he has actually conceived of a string of English words, which he 
translated mentally. Given a French string f, the job of our translation system is to find 
the string e that the native speaker had in mind when he produced f. We minimize 
our chance of error by choosing that English string ê for which Pr(e|f) is greatest. 
Using Bayes' theorem, we can write 
$$\Pr(\mathbf{e} \mid \mathbf{f}) = \frac{\Pr(\mathbf{e})\,\Pr(\mathbf{f} \mid \mathbf{e})}{\Pr(\mathbf{f})}. \tag{1}$$
Since the denominator here is independent of e, finding ê is the same as finding e 
so as to make the product Pr(e) Pr(f|e) as large as possible. We arrive, then, at the 
Fundamental Equation of Machine Translation: 
$$\hat{\mathbf{e}} = \mathop{\mathrm{argmax}}_{\mathbf{e}} \Pr(\mathbf{e})\,\Pr(\mathbf{f} \mid \mathbf{e}). \tag{2}$$
As a representation of the process by which a human being translates a passage from 
French to English, this equation is fanciful at best. One can hardly imagine someone 
rifling mentally through the list of all English passages computing the product of the 
a priori probability of the passage, Pr(e), and the conditional probability of the French 
passage given the English passage, Pr(fle ). Instead, there is an overwhelming intuitive 
appeal to the idea that a translator proceeds by first understanding the French, and 
then expressing in English the meaning that he has thus grasped. Many people have 
been guided by this intuitive picture when building machine translation systems. 
From a purely formal point of view, on the other hand, Equation (2) is completely 
adequate. The conditional distribution Pr(f|e) is just an enormous table that associates 
a real number between zero and one with every possible pairing of a French passage 
and an English passage. With the proper choice for this distribution, translations of 
arbitrarily high quality can be achieved. Of course, to construct Pr(f|e) by examining 
individual pairs of French and English passages one by one is out of the question. 
Even if we restrict our attention to passages no longer than a typical novel, there are 
just too many such pairs. But this is only a problem in practice, not in principle. The 
essential question for statistical translation, then, is not a philosophical one, but an 
empirical one: Can one construct approximations to the distributions Pr(e) and Pr(f|e) 
that are good enough to achieve an acceptable quality of translation? 
Equation (2) summarizes the three computational challenges presented by the 
practice of statistical translation: estimating the language model probability, Pr(e); esti- 
mating the translation model probability, Pr(f|e); and devising an effective and efficient 
suboptimal search for the English string that maximizes their product. We call these 
the language modeling problem, the translation modeling problem, and the search 
problem. 
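To make the decomposition concrete, the following is a minimal sketch of Equation (2) in code, assuming toy probability tables and a tiny candidate list in place of a real language model, translation model, and search; all names and numbers below are illustrative, not part of the systems described in this paper.

```python
# A minimal sketch of Equation (2): pick the English string e maximizing
# Pr(e) * Pr(f|e).  The tables and candidates are purely illustrative
# stand-ins for a real language model, translation model, and search.
def decode(f, candidates, pr_e, pr_f_given_e):
    return max(candidates, key=lambda e: pr_e[e] * pr_f_given_e[(f, e)])

f = "le programme a ete mis en application"
candidates = ["the program has been implemented",
              "program the implemented been has"]
pr_e = {candidates[0]: 1e-2, candidates[1]: 1e-6}          # language model Pr(e)
pr_f_given_e = {(f, c): 1e-3 for c in candidates}          # translation model Pr(f|e)
print(decode(f, candidates, pr_e, pr_f_given_e))
```

Here the translation model gives both candidates the same probability, so the language model alone selects the well-formed string; this is the cooperation between the two factors discussed below.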
The language modeling problem for machine translation is essentially the same 
as that for speech recognition and has been dealt with elsewhere in that context (see, 
for example, the recent paper by Maltese and Mancini \[1992\] and references therein). 
We hope to deal with the search problem in a later paper. In this paper, we focus 
on the translation modeling problem. Before we turn to this problem, however, we 
should address an issue that may be a concern to some readers: Why do we estimate 
Pr(e) and Pr(f|e) rather than estimate Pr(e|f) directly? We are really interested in this 
latter probability. Wouldn't we reduce our problems from three to two by this direct 
approach? If we can estimate Pr(f|e) adequately, why can't we just turn the whole 
process around to estimate Pr(e|f)? 
To understand this, imagine that we divide French and English strings into those 
that are well-formed and those that are ill-formed. This is not a precise notion. We 
have in mind that strings like Il va à la bibliothèque, or I live in a house, or even Colorless 
green ideas sleep furiously are well-formed, but that strings like à la va Il bibliothèque or a 
I in live house are not. When we translate a French string into English, we can think of 
ourselves as springing from a well-formed French string into the sea of well-formed 
English strings with the hope of landing on a good one. It is important, therefore, 
that our model for Pr(e|f) concentrate its probability as much as possible on well- 
formed English strings. But it is not important that our model for Pr(f|e) concentrate 
its probability on well-formed French strings. If we were to reduce the probability 
of all well-formed French strings by the same factor, spreading the probability thus 
liberated over ill-formed French strings, there would be no effect on our translations: 
the argument that maximizes some function f(x) also maximizes cf(x) for any posi- 
tive constant c. As we shall see below, our translation models are prodigal, spraying 
probability all over the place, most of it on ill-formed French strings. In fact, as we 
discuss in Section 4.5, two of our models waste much of their probability on things 
that are not strings at all, having, for example, several different second words but no 
first word. If we were to turn one of these models around to model Pr(e|f) directly, 
the result would be a model with so little probability concentrated on well-formed 
English strings as to confound any scheme to discover one. 
The two factors in Equation (2) cooperate. The translation model probability is 
large for English strings, whether well- or ill-formed, that have the necessary words in 
them in roughly the right places to explain the French. The language model probability 
is large for well-formed English strings regardless of their connection to the French. 
Together, they produce a large probability for well-formed English strings that account 
well for the French. We cannot achieve this simply by reversing our translation models. 
3. Alignments 
We say that a pair of strings that are translations of one another form a translation, 
and we show this by enclosing the strings in parentheses and separating them by a 
vertical bar. Thus, we write the translation (Qu'aurions-nous pu faire? | What could we 
have done?) to show that What could we have done? is a translation of Qu'aurions-nous pu 
faire? When the strings end in sentences, we usually omit the final stop unless it is a 
question mark or an exclamation point. 
Brown et al. (1990) introduce the idea of an alignment between a pair of strings as 
an object indicating for each word in the French string that word in the English string 
from which it arose. Alignments are shown graphically, as in Figure 1, by drawing 
lines, which we call connections, from some of the English words to some of the French 
words. The alignment in Figure 1 has seven connections: (the, Le), (program, programme), 
and so on. Following the notation of Brown et al., we write this alignment as (Le 
programme a été mis en application | And the(1) program(2) has(3) been(4) implemented(5,6,7)). 
The list of numbers following an English word shows the positions in the French string 
of the words to which it is connected. Because And is not connected to any French 
words here, there is no list of numbers after it. We consider every alignment to be 
correct with some probability, and so we find (Le programme a été mis en application | 
And(1,2,3,4,5,6,7) the program has been implemented) perfectly acceptable. Of course, we 
expect it to be much less probable than the alignment shown in Figure 1. 
In Figure 1 each French word is connected to exactly one English word, but more 
general alignments are possible and may be appropriate for some translations. For 
example, we may have a French word connected to several English words as in Fig- 
ure 2, which we write as (Le reste appartenait aux autochtones | The(1) balance(2) was(3) 
the(3) territory(3) of(4) the(4) aboriginal(5) people(5)). More generally still, we may have 
several French words connected to several English words as in Figure 3, which we 
write as (Les pauvres sont démunis | The(1) poor(2) don't(3,4) have(3,4) any(3,4) money(3,4)). 
Here, the four English words don't have any money work together to generate the two 
French words sont démunis. 
In a figurative sense, an English passage is a web of concepts woven together 
according to the rules of English grammar. When we look at a passage, we cannot see 
the concepts directly but only the words that they leave behind. To show that these 
words are related to a concept but are not quite the whole story, we say that they form 
a cept. Some of the words in a passage may participate in more than one cept, while 
Figure 1 
An alignment with independent English words: (Le1 programme2 a3 été4 mis5 en6 application7 | And1 the2 program3 has4 been5 implemented6). 
Figure 2 
An alignment with independent French words: (Le1 reste2 appartenait3 aux4 autochtones5 | The1 balance2 was3 the4 territory5 of6 the7 aboriginal8 people9). 
Figure 3 
A general alignment: (Les1 pauvres2 sont3 démunis4 | The1 poor2 don't3 have4 any5 money6). 
others may participate in none, serving only as a sort of syntactic glue to bind the 
whole together. When a passage is translated into French, each of its cepts contributes 
some French words to the translation. We formalize this use of the term cept and relate 
it to the idea of an alignment as follows. 
We call the set of English words connected to a French word in a particular align- 
ment the cept that generates the French word. Thus, an alignment resolves an English 
string into a set of possibly overlapping cepts that we call the ceptual scheme of the 
English string with respect to the alignment. The alignment in Figure 3 contains the 
three cepts The, poor, and don't have any money. When one or more of the French words 
is connected to no English words, we say that the ceptual scheme includes the empty 
cept and that each of these words has been generated by this empty cept. 
Formally, a cept is a subset of the positions in the English string together with the 
words occupying those positions. When we write the words that make up a cept, we 
sometimes affix a subscript to each one showing its position. The alignment in Figure 2 
includes the cepts the1 and of6 the7, but not the cepts of6 the1 or the7. In (J'applaudis à la 
décision | I(1) applaud(2) the(4) decision(5)), à is generated by the empty cept. Although 
the empty cept has no position, we place it by convention in position zero, and write 
it as e0. Thus, we may also write the previous alignment as (J'applaudis à la décision 
| e0(3) I(1) applaud(2) the(4) decision(5)). 
We denote the set of alignments of (f|e) by A(e, f). If e has length l and f has 
length m, there are lm different connections that can be drawn between them because 
each of the m French words can be connected to any of the l English words. Since an 
alignment is determined by the connections that it contains, and since a subset of the 
possible connections can be chosen in 2^{lm} ways, there are 2^{lm} alignments in A(e, f). 
4. Translation Models 
In this section, we develop a series of five translation models together with the al- 
gorithms necessary to estimate their parameters. Each model gives a prescription for 
computing the conditional probability Pr(f|e), which we call the likelihood of the trans- 
lation (f, e). This likelihood is a function of a large number of free parameters that we 
must estimate in a process that we call training. The likelihood of a set of transla- 
tions is the product of the likelihoods of its members. In broad outline, our plan is to 
guess values for these parameters and then to apply the EM algorithm (Baum 1972; 
Dempster, Laird, and Rubin 1977) iteratively so as to approach a local maximum of 
the likelihood of a particular set of translations that we call the training data. When 
the likelihood of the training data has more than one local maximum, the one that we 
approach will depend on our initial guess. 
In Models 1 and 2, we first choose a length for the French string, assuming all 
reasonable lengths to be equally likely. Then, for each position in the French string, we 
decide how to connect it to the English string and what French word to place there. 
In Model 1 we assume all connections for each French position to be equally likely. 
Therefore, the order of the words in e and f does not affect Pr(f|e). In Model 2 we 
make the more realistic assumption that the probability of a connection depends on 
the positions it connects and on the lengths of the two strings. Therefore, for Model 2, 
Pr(f|e) does depend on the order of the words in e and f. Although it is possible 
to obtain interesting correlations between some pairs of frequent words in the two 
languages using Models 1 and 2, as we will see later (in Figure 5), these models often 
lead to unsatisfactory alignments. 
In Models 3, 4, and 5, we develop the French string by choosing, for each word in 
the English string, first the number of words in the French string that will be connected 
to it, then the identity of these French words, and finally the actual positions in the 
French string that these words will occupy. It is this last step that determines the 
connections between the English string and the French string and it is here that these 
three models differ. In Model 3, as in Model 2, the probability of a connection depends 
on the positions that it connects and on the lengths of the English and French strings. 
In Model 4 the probability of a connection depends in addition on the identities of the 
French and English words connected and on the positions of any other French words 
that are connected to the same English word. Models 3 and 4 are deficient, a technical 
concept defined and discussed in Section 4.5. Briefly, this means that they waste some 
of their probability on objects that are not French strings at all. Model 5 is very much 
like Model 4, except that it is not deficient. 
Models 1-4 serve as stepping stones to the training of Model 5. Models 1 and 2 
have an especially simple mathematical form so that iterations of the EM algorithm 
can be computed exactly. That is, we can explicitly perform sums over all possible 
alignments for these two models. In addition, Model 1 has a unique local maximum so 
that parameters derived for it in a series of EM iterations do not depend on the starting 
point for the iterations. As explained below, we use Model 1 to provide initial estimates 
for the parameters of Model 2. In Model 2 and subsequent models, the likelihood 
function does not have a unique local maximum, but by initializing each model from 
the parameters of the model before it, we arrive at estimates of the parameters of the 
final model that do not depend on our initial estimates of the parameters for Model 1. 
In Models 3 and 4, we must be content with approximate EM iterations because it is 
not feasible to carry out sums over all possible alignments for these models. But, while 
approaching more closely the complexity of Model 5, they retain enough simplicity 
to allow an efficient investigation of the neighborhood of probable alignments and 
therefore allow us to include what we hope are all of the important alignments in 
each EM iteration. 
In the remainder of this section, we give an informal but reasonably precise de- 
scription of each of the five models and an intuitive account of the EM algorithm as 
applied to them. We assume the reader to be comfortable with Lagrange multipliers, 
partial differentiation, and constrained optimization as they are presented in a typical 
college calculus text, and to have a nodding acquaintance with random variables. On 
the first time through, the reader may wish to jump from here directly to Section 5, 
returning to this Section when and if he should desire to understand more deeply 
how the results reported later are achieved. 
The basic mathematical object with which we deal here is the joint probability 
distribution Pr(F = f, A = a, E = e), where the random variables F and E are a French 
string and an English string making up a translation, and the random variable A is 
an alignment between them. We also consider various marginal and conditional prob- 
ability distributions that can be constructed from Pr(F = f, A = a, E = e), especially 
the distribution Pr(F = f | E = e). We generally follow the common convention of using 
uppercase letters to denote random variables and the corresponding lowercase letters 
to denote specific values that the random variables may take. We have already used l 
and m to represent the lengths of the strings e and f, and so we use L and M to denote 
the corresponding random variables. When there is no possibility for confusion, or, 
more properly, when the probability of confusion is not thereby materially increased, 
we write Pr(f, a, e) for Pr(F = f, A = a, E = e), and use similar shorthands throughout. 
We can write the likelihood of (f|e) in terms of the conditional probability Pr(f, a|e) 
as 
$$\Pr(\mathbf{f} \mid \mathbf{e}) = \sum_{\mathbf{a}} \Pr(\mathbf{f}, \mathbf{a} \mid \mathbf{e}). \tag{3}$$
The sum here, like all subsequent sums over a, is over the elements of A(e, f). We 
restrict ourselves in this section to alignments like the one shown in Figure 1 where 
each French word has exactly one connection. In this kind of alignment, each cept is 
either a single English word or it is empty. Therefore, we can assign cepts to positions 
in the English string, reserving position zero for the empty cept. If the English string, 
e = e_1^l ≡ e_1 e_2 ... e_l, has l words, and the French string, f = f_1^m ≡ f_1 f_2 ... f_m, has m 
words, then the alignment, a, can be represented by a series, a_1^m = a_1 a_2 ... a_m, of m 
values, each between 0 and l, such that if the word in position j of the French string 
is connected to the word in position i of the English string, then a_j = i, and if it is not 
connected to any English word, then a_j = 0. 
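As a concrete illustration of this representation, the following sketch (in Python) encodes the alignment of Figure 1 as such a series; the strings and the vector are taken from the figure, and the printed report is purely illustrative.

```python
# The alignment of Figure 1 as a vector a of length m, with a_j = i meaning the
# French word in position j is connected to the English word in position i
# (position 0 is the empty cept).  Positions are 1-based, as in the text.
e = ["And", "the", "program", "has", "been", "implemented"]        # l = 6
f = ["Le", "programme", "a", "été", "mis", "en", "application"]    # m = 7
a = [2, 3, 4, 5, 6, 6, 6]   # "implemented" generates "mis en application"

for j, (fj, aj) in enumerate(zip(f, a), start=1):
    source = e[aj - 1] if aj > 0 else "<empty cept>"
    print(f"f[{j}] = {fj!r:15} generated by e[{aj}] = {source!r}")
```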
Without loss of generality, we can write 
$$\Pr(\mathbf{f}, \mathbf{a} \mid \mathbf{e}) = \Pr(m \mid \mathbf{e}) \prod_{j=1}^{m} \Pr(a_j \mid a_1^{j-1}, f_1^{j-1}, m, \mathbf{e})\, \Pr(f_j \mid a_1^{j}, f_1^{j-1}, m, \mathbf{e}). \tag{4}$$
This is only one of many ways in which Pr(f, a|e) can be written as the product of a 
series of conditional probabilities. It is important to realize that Equation (4) is not an 
approximation. Regardless of the form of Pr(f, a|e), it can always be analyzed into a 
product of terms in this way. We are simply asserting in this equation that when we 
generate a French string together with an alignment from an English string, we can 
first choose the length of the French string given our knowledge of the English string. 
Then we can choose where to connect the first position in the French string given 
our knowledge of the English string and the length of the French string. Then we can 
choose the identity of the first word in the French string given our knowledge of the 
English string, the length of the French string, and the position in the English string 
to which the first position in the French string is connected, and so on. As we step 
through the French string, at each point we make our next choice given our complete 
knowledge of the English string and of all our previous choices as to the details of the 
French string and its alignment. 
4.1 Model 1 
The conditional probabilities on the right-hand side of Equation (4) cannot all be 
taken as independent parameters because there are too many of them. In Model 1, we 
assume that Pr(m|e) is independent of e and m; that Pr(a_j | a_1^{j-1}, f_1^{j-1}, m, e) depends 
only on l, the length of the English string, and therefore must be (l + 1)^{-1}; and that 
Pr(f_j | a_1^{j}, f_1^{j-1}, m, e) depends only on f_j and e_{a_j}. The parameters, then, are ε ≡ Pr(m|e), 
and t(f_j | e_{a_j}) ≡ Pr(f_j | a_1^{j}, f_1^{j-1}, m, e), which we call the translation probability of f_j given 
e_{a_j}. We think of ε as some small, fixed number. The distribution of M, the length of the 
French string, is unnormalized, but this is a minor technical issue of no significance 
to our computations. If we wish, we can think of M as having some finite range. As 
long as this range encompasses everything that actually occurs in training data, no 
problems arise. 
We turn now to the problem of estimating the translation probabilities for Model 1. 
The joint likelihood of a French string and an alignment given an English string is 
$$\Pr(\mathbf{f}, \mathbf{a} \mid \mathbf{e}) = \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} t(f_j \mid e_{a_j}). \tag{5}$$
The alignment is determined by specifying the values of a_j for j from 1 to m, each of 
which can take any value from 0 to l. Therefore, 
$$\Pr(\mathbf{f} \mid \mathbf{e}) = \frac{\epsilon}{(l+1)^m} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j \mid e_{a_j}). \tag{6}$$
We wish to adjust the translation probabilities so as to maximize Pr(f|e) subject to 
the constraints that for each e, 
$$\sum_{f} t(f \mid e) = 1. \tag{7}$$
Following standard practice for constrained maximization, we introduce Lagrange 
multipliers λ_e, and seek an unconstrained extremum of the auxiliary function 
$$h(t, \lambda) \equiv \frac{\epsilon}{(l+1)^m} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j \mid e_{a_j}) - \sum_{e} \lambda_e \Bigl(\sum_{f} t(f \mid e) - 1\Bigr). \tag{8}$$
An extremum occurs when all of the partial derivatives of h with respect to the compo- 
nents of t and λ are zero. That the partial derivatives with respect to the components 
of λ be zero is simply a restatement of the constraints on the translation probabilities. 
The partial derivative of h with respect to t(f|e) is 
$$\frac{\partial h}{\partial t(f \mid e)} = \frac{\epsilon}{(l+1)^m} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \sum_{j=1}^{m} \delta(f, f_j)\, \delta(e, e_{a_j})\, t(f \mid e)^{-1} \prod_{k=1}^{m} t(f_k \mid e_{a_k}) - \lambda_e, \tag{9}$$
where δ is the Kronecker delta function, equal to one when both of its arguments are 
the same and equal to zero otherwise. This partial derivative will be zero provided 
that 
$$t(f \mid e) = \lambda_e^{-1} \frac{\epsilon}{(l+1)^m} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \sum_{j=1}^{m} \delta(f, f_j)\, \delta(e, e_{a_j}) \prod_{k=1}^{m} t(f_k \mid e_{a_k}). \tag{10}$$
Superficially, Equation (10) looks like a solution to the extremum problem, but 
it is not because the translation probabilities appear on both sides of the equal sign. 
Nonetheless, it suggests an iterative procedure for finding a solution: given an initial 
guess for the translation probabilities, we can evaluate the right-hand side of Equation 
(10) and use the result as a new estimate for t(f|e). (Here and elsewhere, the Lagrange 
multipliers simply serve as a reminder that we need to normalize the translation 
probabilities so that they satisfy Equation (7).) This process, when applied repeatedly, 
is called the EM algorithm. That it converges to a stationary point of h in situations 
like this was first shown by Baum (1972) and later by others (Dempster, Laird, and 
Rubin 1977). 
With the aid of Equation (5), we can reexpress Equation (10) as 
$$t(f \mid e) = \lambda_e^{-1} \sum_{\mathbf{a}} \Pr(\mathbf{f}, \mathbf{a} \mid \mathbf{e}) \sum_{j=1}^{m} \delta(f, f_j)\, \delta(e, e_{a_j}). \tag{11}$$
The inner sum here is the number of times that e connects to f in the alignment a. 
We call the expected number of times that e connects to f in the translation (f|e) the 
count of f given e for (f|e) and denote it by c(f|e; f, e). By definition, 
$$c(f \mid e; \mathbf{f}, \mathbf{e}) = \sum_{\mathbf{a}} \Pr(\mathbf{a} \mid \mathbf{e}, \mathbf{f}) \sum_{j=1}^{m} \delta(f, f_j)\, \delta(e, e_{a_j}), \tag{12}$$
where Pr(a|e, f) = Pr(f, a|e)/Pr(f|e). If we replace λ_e by λ_e Pr(f|e), then Equation (11) 
can be written very compactly as 
$$t(f \mid e) = \lambda_e^{-1}\, c(f \mid e; \mathbf{f}, \mathbf{e}). \tag{13}$$
In practice, our training data consists of a set of translations, (f^{(1)}|e^{(1)}), (f^{(2)}|e^{(2)}), ..., 
(f^{(S)}|e^{(S)}), so this equation becomes 
$$t(f \mid e) = \lambda_e^{-1} \sum_{s=1}^{S} c(f \mid e; \mathbf{f}^{(s)}, \mathbf{e}^{(s)}). \tag{14}$$
Here, λ_e serves only as a reminder that the translation probabilities must be normal- 
ized. 
Usually, it is not feasible to evaluate the expectation in Equation (12) exactly. Even 
when we exclude multi-word cepts, there are still (l + 1)^m alignments possible for 
(f|e). Model 1, however, is special because by recasting Equation (6), we arrive at 
an expression that can be evaluated efficiently. The right-hand side of Equation (6) 
is a sum of terms each of which is a monomial in the translation probabilities. Each 
monomial contains m translation probabilities, one for each of the words in f. Different 
monomials correspond to different ways of connecting words in f to cepts in e with 
every way appearing exactly once. By direct evaluation, we see that 
$$\sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j \mid e_{a_j}) = \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i). \tag{15}$$
An example may help to clarify this. Suppose that m = 3 and l = 1, and that we write 
t_{ji} as a shorthand for t(f_j | e_i). Then the left-hand side of Equation (15) is t_{10} t_{20} t_{30} + 
t_{10} t_{20} t_{31} + ... + t_{11} t_{21} t_{30} + t_{11} t_{21} t_{31}, and the right-hand side is (t_{10} + t_{11})(t_{20} + t_{21})(t_{30} + t_{31}). 
It is routine to verify that these are the same. Therefore, we can interchange the sums 
in Equation (6) with the product to obtain 
$$\Pr(\mathbf{f} \mid \mathbf{e}) = \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i). \tag{16}$$
If we use this expression in place of Equation (6) when we write the auxiliary function 
in Equation (8), we find that 
$$c(f \mid e; \mathbf{f}, \mathbf{e}) = \frac{t(f \mid e)}{t(f \mid e_0) + \cdots + t(f \mid e_l)} \sum_{j=1}^{m} \delta(f, f_j) \sum_{i=0}^{l} \delta(e, e_i). \tag{17}$$
Here, the first sum is the number of times that f appears in f, and the second is the 
number of times that e appears in e. Thus, the number of operations necessary to 
calculate a count is proportional to l + m rather than to (l + 1)^m as Equation (12) 
might suggest. 
Using Equations (14) and (17), we can estimate the parameters t(f|e) as follows. 
1. Choose initial values for t(f|e). 
2. For each pair of sentences (f^{(s)}, e^{(s)}), 1 ≤ s ≤ S, use Equation (17) to 
compute the counts c(f|e; f^{(s)}, e^{(s)}). Notice that these counts will be 
different from zero only when f is one of the words in f^{(s)} and e is one 
of the words in e^{(s)}. Notice, also, that c(f|e; f^{(s)}, e^{(s)}) does not depend on 
the order of the words in the sentences, but only on the number of times 
that the words appear in their respective sentences. 
3. For each e that appears in at least one of the e^{(s)}, 
• Compute λ_e according to the equation 
$$\lambda_e = \sum_{f} \sum_{s=1}^{S} c(f \mid e; \mathbf{f}^{(s)}, \mathbf{e}^{(s)}). \tag{18}$$
• For each f that appears in at least one f^{(s)}, use Equation (14) to 
obtain a new value for t(f|e). 
4. Repeat steps 2 and 3 until the values of t(f|e) have converged to the 
desired degree. 
The details of our initial guesses for t(f|e) are unimportant because Pr(f|e) has a 
unique local maximum for Model 1, as is shown in Appendix B. We start with all of 
the t(f|e) equal, but any other choice that avoids zeros would lead to the same final 
solution. 
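The procedure above translates almost line for line into code. The following is a minimal sketch of Model 1 training using the count of Equation (17); the tiny corpus, the number of iterations, and the representation of the empty cept e_0 as `None` are illustrative assumptions.

```python
from collections import defaultdict

def train_model1(corpus, iterations=10):
    """corpus: list of (f_words, e_words) sentence pairs."""
    # Uniform initial estimates; Model 1 has a single local maximum, so the
    # starting point does not matter as long as no value is zero.
    f_vocab = {f for fs, _ in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)      # c(f|e), accumulated over the corpus
        total = defaultdict(float)      # lambda_e, the normalizer of Eq. (18)
        for fs, es in corpus:
            es0 = [None] + es           # position 0 is the empty cept e_0
            for f in fs:
                denom = sum(t[(f, e)] for e in es0)   # t(f|e_0)+...+t(f|e_l)
                for e in es0:
                    c = t[(f, e)] / denom             # one term of Eq. (17)
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():               # reestimate, Eq. (14)
            t[(f, e)] = c / total[e]
    return t

# Tiny illustrative corpus (not the Hansard data).
corpus = [(["la", "maison"], ["the", "house"]),
          (["la", "fleur"], ["the", "flower"]),
          (["une", "maison"], ["a", "house"])]
t = train_model1(corpus)
print(round(t[("maison", "house")], 3), round(t[("la", "the")], 3))
```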
4.2 Model 2 
In Model 1, we take no cognizance of where words appear in either string. The first 
word in the French string is just as likely to be connected to a word at the end of the 
English string as to one at the beginning. In Model 2 we make the same assumptions 
as in Model 1 except that we assume that Pr(a_j | a_1^{j-1}, f_1^{j-1}, m, e) depends on j, a_j, and 
m, as well as on l. We introduce a set of alignment probabilities, 
$$a(a_j \mid j, m, l) \equiv \Pr(a_j \mid a_1^{j-1}, f_1^{j-1}, m, l), \tag{19}$$
which satisfy the constraints 
$$\sum_{i=0}^{l} a(i \mid j, m, l) = 1 \tag{20}$$
for each triple jml. In place of Equation (6), we have 
$$\Pr(\mathbf{f} \mid \mathbf{e}) = \epsilon \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j \mid e_{a_j})\, a(a_j \mid j, m, l). \tag{21}$$
Therefore, we seek an unconstrained extremum of the auxiliary function 
$$h(t, a, \lambda, \mu) \equiv \epsilon \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j \mid e_{a_j})\, a(a_j \mid j, m, l) - \sum_{e} \lambda_e \Bigl(\sum_{f} t(f \mid e) - 1\Bigr) - \sum_{jml} \mu_{jml} \Bigl(\sum_{i} a(i \mid j, m, l) - 1\Bigr). \tag{22}$$
The reader will easily verify that Equations (11), (13), and (14) carry over from 
Model 1 to Model 2 unchanged. We need a new count, c(i|j, m, l; f, e), the expected 
number of times that the word in position j of f is connected to the word in position 
i of e. Clearly, 
$$c(i \mid j, m, l; \mathbf{f}, \mathbf{e}) = \sum_{\mathbf{a}} \Pr(\mathbf{a} \mid \mathbf{e}, \mathbf{f})\, \delta(i, a_j). \tag{23}$$
In analogy with Equations (13) and (14), we have, for a single translation, 
$$a(i \mid j, m, l) = \mu_{jml}^{-1}\, c(i \mid j, m, l; \mathbf{f}, \mathbf{e}), \tag{24}$$
and, for a set of translations, 
$$a(i \mid j, m, l) = \mu_{jml}^{-1} \sum_{s=1}^{S} c(i \mid j, m, l; \mathbf{f}^{(s)}, \mathbf{e}^{(s)}). \tag{25}$$
Notice that if f^{(s)} does not have length m or if e^{(s)} does not have length l, then the 
corresponding count is zero. As with the λs in earlier equations, the μs here serve 
simply to remind us that the alignment probabilities must be normalized. 
Model 2 shares with Model 1 the important property that the sums in Equations 
(12) and (23) can be obtained efficiently. We can rewrite Equation (21) as 
$$\Pr(\mathbf{f} \mid \mathbf{e}) = \epsilon \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)\, a(i \mid j, m, l). \tag{26}$$
Using this form for Pr(f|e), we find that 
$$c(f \mid e; \mathbf{f}, \mathbf{e}) = \sum_{j=1}^{m} \sum_{i=0}^{l} \frac{t(f \mid e)\, a(i \mid j, m, l)\, \delta(f, f_j)\, \delta(e, e_i)}{t(f \mid e_0)\, a(0 \mid j, m, l) + \cdots + t(f \mid e_l)\, a(l \mid j, m, l)}, \tag{27}$$
and 
$$c(i \mid j, m, l; \mathbf{f}, \mathbf{e}) = \frac{t(f_j \mid e_i)\, a(i \mid j, m, l)}{t(f_j \mid e_0)\, a(0 \mid j, m, l) + \cdots + t(f_j \mid e_l)\, a(l \mid j, m, l)}. \tag{28}$$
Equation (27) has a double sum rather than the product of two single sums, as in 
Equation (17), because in Equation (27) i and j are tied together through the alignment 
probabilities. 
Model 1 is the special case of Model 2 in which a(i|j, m, l) is held fixed at (l+1)^{-1}. 
Therefore, any set of parameters for Model 1 can be reinterpreted as a set of parameters 
for Model 2. Taking as our initial estimates of the parameters for Model 2 the parameter 
values that result from training Model 1 is equivalent to computing the probabilities 
of all alignments as if we were dealing with Model 1, but then collecting the counts 
as if we were dealing with Model 2. The idea of computing the probabilities of the 
alignments using one model, but collecting the counts in a way appropriate to a second 
model is very general and can always be used to transfer a set of parameters from 
one model to another. 
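A minimal sketch of one pass of Model 2 count collection using Equations (27) and (28): the dictionaries `t` and `a`, and the toy sentence pair, are illustrative assumptions, with `a` held at (l + 1)^{-1} to realize the transfer from Model 1 just described.

```python
from collections import defaultdict

def model2_counts(fs, es, t, a):
    """Accumulate Model 2 counts for one sentence pair (Equations 27 and 28).

    fs, es : French and English word lists (the empty cept is prepended here)
    t      : dict (f, e) -> translation probability
    a      : dict (i, j, m, l) -> alignment probability
    """
    l, m = len(es), len(fs)
    es0 = [None] + es
    c_t = defaultdict(float)      # counts c(f|e; f, e)
    c_a = defaultdict(float)      # counts c(i|j, m, l; f, e)
    for j, f in enumerate(fs, start=1):
        denom = sum(t[(f, es0[i])] * a[(i, j, m, l)] for i in range(l + 1))
        for i in range(l + 1):
            p = t[(f, es0[i])] * a[(i, j, m, l)] / denom
            c_t[(f, es0[i])] += p
            c_a[(i, j, m, l)] += p
    return c_t, c_a

# Transfer from Model 1: alignment probabilities held uniform at (l+1)^(-1).
fs, es = ["la", "maison"], ["the", "house"]
l, m = len(es), len(fs)
t = defaultdict(lambda: 0.25)      # stand-in for trained Model 1 parameters
a = {(i, j, m, l): 1.0 / (l + 1) for i in range(l + 1) for j in range(1, m + 1)}
c_t, c_a = model2_counts(fs, es, t, a)
print(c_a[(2, 2, m, l)])
```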
4.3 Intermodel Interlude 
We created Models 1 and 2 by making various assumptions about the conditional 
probabilities that appear in Equation (4). As we have mentioned, Equation (4) is an 
exact statement, but it is only one of many ways in which the joint likelihood of f 
and a can be written as a product of conditional probabilities. Each such product 
corresponds in a natural way to a generative process for developing f and a from e. 
In the process corresponding to Equation (4), we first choose a length for f. Next, we 
decide which position in e is connected to fl and what the identity of fl is. Then, we 
decide which position in e is connected to f2, and so on. For Models 3, 4, and 5, we 
write the joint likelihood as a product of conditional probabilities in a different way. 
Casual inspection of some translations quickly establishes that the is usually trans- 
lated into a single word (le, la, or l'), but is sometimes omitted; or that only is often 
translated into one word (for example, seulement), but sometimes into two (for exam- 
ple, ne ... que), and sometimes into none. The number of French words to which e 
is connected in a randomly selected alignment is a random variable, Φ_e, that we call 
the fertility of e. Each choice of the parameters in Model 1 or Model 2 determines a 
distribution, Pr(Φ_e = φ), for this random variable. But the relationship is remote: just 
what change will be wrought in the distribution of Φ_the if, say, we adjust a(1 | 2, 8, 9) is 
not immediately clear. In Models 3, 4, and 5, we parameterize fertilities directly. 
As a prolegomenon to a detailed discussion of Models 3, 4, and 5, we describe 
the generative process upon which they are based. Given an English string, e, we first 
decide the fertility of each word and a list of French words to connect to it. We call 
this list, which may be empty, a tablet. The collection of tablets is a random variable, T, 
that we call the tableau of e; the tablet for the i th English word is a random variable, Ti; 
and the k th French word in the i th tablet is a random variable, Tik. After choosing the 
tableau, we permute its words to produce f. This permutation is a random variable, 
Π. The position in f of the k-th word in the i-th tablet is yet another random variable, 
Π_ik. 
The joint likelihood for a tableau, τ, and a permutation, π, is 
$$\begin{aligned}
\Pr(\tau, \pi \mid \mathbf{e}) = {} & \prod_{i=1}^{l} \Pr(\phi_i \mid \phi_1^{i-1}, \mathbf{e})\; \Pr(\phi_0 \mid \phi_1^{l}, \mathbf{e}) \;\times \\
& \prod_{i=0}^{l} \prod_{k=1}^{\phi_i} \Pr(\tau_{ik} \mid \tau_{i1}^{k-1}, \tau_0^{i-1}, \phi_0^{l}, \mathbf{e}) \;\times \\
& \prod_{i=1}^{l} \prod_{k=1}^{\phi_i} \Pr(\pi_{ik} \mid \pi_{i1}^{k-1}, \pi_1^{i-1}, \tau_0^{l}, \phi_0^{l}, \mathbf{e}) \;\times \\
& \prod_{k=1}^{\phi_0} \Pr(\pi_{0k} \mid \pi_{01}^{k-1}, \pi_1^{l}, \tau_0^{l}, \phi_0^{l}, \mathbf{e}).
\end{aligned} \tag{29}$$
In this equation, τ_{i1}^{k-1} represents the series of values τ_{i1}, ..., τ_{ik-1}; π_{i1}^{k-1} represents the 
series of values π_{i1}, ..., π_{ik-1}; and φ_i is shorthand for Φ_{e_i}. 
Knowing τ and π determines a French string and an alignment, but in general 
several different pairs τ, π may lead to the same pair f, a. We denote the set of such 
pairs by ⟨f, a⟩. Clearly, then, 
$$\Pr(\mathbf{f}, \mathbf{a} \mid \mathbf{e}) = \sum_{(\tau, \pi) \in \langle \mathbf{f}, \mathbf{a} \rangle} \Pr(\tau, \pi \mid \mathbf{e}). \tag{30}$$
Figure 4 
Two tableaux for one alignment: for (bon marché | cheap(1,2)), the tablet of cheap may contain its two French words in either order, ⟨bon, marché⟩ or ⟨marché, bon⟩. 
The number of elements in ⟨f, a⟩ is ∏_{i=0}^{l} φ_i! because for each τ_i there are φ_i! arrange- 
ments that lead to the pair f, a. Figure 4 shows the two tableaux for (bon marché | 
cheap(1,2)). 
Except for degenerate cases, there is one alignment in A(e, f) for which Pr(a|e, f) 
is greatest. We call this the Viterbi alignment for (f|e) and denote it by V(f|e). We 
know of no practical algorithm for finding V(f|e) for a general model. Indeed, if 
someone were to claim that he had found V(f|e), we know of no practical algorithm 
for demonstrating that he is correct. But for Model 2 (and, thus, also for Model 1), 
finding V(f|e) is straightforward. For each j, we simply choose a_j so as to make the 
product t(f_j | e_{a_j}) a(a_j | j, m, l) as large as possible. The Viterbi alignment depends on the 
model with respect to which it is computed. When we need to distinguish between 
the Viterbi alignments for different models, we write V(f|e; 1), V(f|e; 2), and so on. 
We denote by A_{i↔j}(e, f) the set of alignments for which a_j = i. We say that ij 
is pegged in these alignments. By the pegged Viterbi alignment for ij, which we write 
V_{i↔j}(f|e), we mean that element of A_{i↔j}(e, f) for which Pr(a|e, f) is greatest. Obviously, 
we can find V_{i↔j}(f|e; 1) and V_{i↔j}(f|e; 2) quickly with a straightforward modification 
of the algorithm described above for finding V(f|e; 1) and V(f|e; 2). 
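Because the Model 2 Viterbi alignment factorizes over French positions, it can be computed in a single pass. A minimal sketch of the computation just described; the parameter tables in the usage example are illustrative assumptions.

```python
def viterbi_model2(fs, es, t, a):
    """Return the Model 2 Viterbi alignment as a list (a_1, ..., a_m).

    For each French position j, pick the i in 0..l maximizing
    t(f_j | e_i) * a(i | j, m, l); i = 0 is the empty cept.
    """
    l, m = len(es), len(fs)
    es0 = [None] + es
    alignment = []
    for j, f in enumerate(fs, start=1):
        best_i = max(range(l + 1),
                     key=lambda i: t[(f, es0[i])] * a[(i, j, m, l)])
        alignment.append(best_i)
    return alignment

# Illustrative usage with toy parameter tables.
fs, es = ["la", "maison"], ["the", "house"]
t = {("la", None): 0.05, ("la", "the"): 0.7, ("la", "house"): 0.05,
     ("maison", None): 0.05, ("maison", "the"): 0.1, ("maison", "house"): 0.8}
a = {(i, j, 2, 2): 1.0 / 3 for i in range(3) for j in (1, 2)}
print(viterbi_model2(fs, es, t, a))   # expected: [1, 2]
```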
4.4 Model 3 
Model 3 is based on Equation (29). Earlier, we were unable to treat each of the con- 
ditional probabilities on the right-hand side of Equation (4) as a separate parameter. 
With Equation (29) we are no better off and must again make assumptions to reduce 
the number of independent parameters. There are many different sets of assumptions 
that we might make, each leading to a different model for the translation process. 
In Model 3, we assume that, for i between 1 and l, Pr(φ_i | φ_1^{i-1}, e) depends only on 
φ_i and e_i; that, for all i, Pr(τ_ik | τ_{i1}^{k-1}, τ_0^{i-1}, φ_0^{l}, e) depends only on τ_ik and e_i; and that, 
for i between 1 and l, Pr(π_ik | π_{i1}^{k-1}, π_1^{i-1}, τ_0^{l}, φ_0^{l}, e) depends only on π_ik, i, m, and l. The 
parameters of Model 3 are thus a set of fertility probabilities, n(φ | e_i) ≡ Pr(φ | φ_1^{i-1}, e); a set 
of translation probabilities, t(f | e_i) ≡ Pr(T_ik = f | τ_{i1}^{k-1}, τ_0^{i-1}, φ_0^{l}, e); and a set of distortion 
probabilities, d(j | i, m, l) ≡ Pr(Π_ik = j | π_{i1}^{k-1}, π_1^{i-1}, τ_0^{l}, φ_0^{l}, e). 
We treat the distortion and fertility probabilities for e0 differently. The empty cept 
conventionally occupies position 0, but actually has no position. Its purpose is to 
account for those words in the French string that cannot readily be accounted for by 
other cepts in the English string. Because we expect these words to be spread uniformly 
throughout the French string, and because they are placed only after all of the other 
words in the string have been placed, we assume that Pr(Π_{0k+1} = j | π_{01}^{k}, π_1^{l}, τ_0^{l}, φ_0^{l}, e) 
equals 0 unless position j is vacant, in which case it equals (φ_0 − k)^{-1}. Therefore, the 
contribution of the distortion probabilities for all of the words in τ_0 is 1/φ_0!. 
We expect φ_0 to depend on the length of the French string because longer strings 
should have more extraneous words. Therefore, we assume that 
$$\Pr(\phi_0 \mid \phi_1^{l}, \mathbf{e}) = \binom{\phi_1 + \cdots + \phi_l}{\phi_0}\, p_0^{\,\phi_1 + \cdots + \phi_l - \phi_0}\, p_1^{\,\phi_0} \tag{31}$$
for some pair of auxiliary parameters p_0 and p_1. The expression on the left-hand side 
of this equation depends on φ_1^l only through the sum φ_1 + ... + φ_l and defines a 
probability distribution over φ_0 whenever p_0 and p_1 are nonnegative and sum to 1. 
We can interpret Pr(φ_0 | φ_1^l, e) as follows. We imagine that each of the words from τ_1^l 
requires an extraneous word with probability p_1 and that this extraneous word must 
be connected to the empty cept. The probability that exactly φ_0 of the words from τ_1^l 
will require an extraneous word is just the expression given in Equation (31). 
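Viewed this way, Equation (31) is simply a binomial distribution. The following is a minimal sketch of evaluating it, with illustrative values for p_0, p_1, and the fertilities.

```python
from math import comb

def p_phi0(phi0, phis, p0, p1):
    """Equation (31): probability of phi_0 extraneous words given phi_1..phi_l."""
    n = sum(phis)                       # phi_1 + ... + phi_l
    return comb(n, phi0) * p0 ** (n - phi0) * p1 ** phi0

# Illustrative values: five French words generated by nonempty cepts, p1 = 0.1.
print(sum(p_phi0(k, [2, 1, 2], 0.9, 0.1) for k in range(6)))   # sums to 1.0
```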
As with Models 1 and 2, an alignment of (f|e) is determined by specifying a_j for 
each position in the French string. The fertilities, φ_0 through φ_l, are functions of the 
a_j's: φ_i is equal to the number of j's for which a_j equals i. Therefore, 
$$\begin{aligned}
\Pr(\mathbf{f} \mid \mathbf{e}) &= \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \Pr(\mathbf{f}, \mathbf{a} \mid \mathbf{e}) \\
&= \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \binom{m - \phi_0}{\phi_0}\, p_0^{\,m - 2\phi_0}\, p_1^{\,\phi_0} \prod_{i=1}^{l} \phi_i!\; n(\phi_i \mid e_i) \prod_{j=1}^{m} t(f_j \mid e_{a_j})\, d(j \mid a_j, m, l),
\end{aligned} \tag{32}$$
with Σ_f t(f|e) = 1, Σ_j d(j|i, m, l) = 1, Σ_φ n(φ|e) = 1, and p_0 + p_1 = 1. The assumptions 
that we make for Model 3 are such that each of the pairs (τ, π) in ⟨f, a⟩ makes an 
identical contribution to the sum in Equation (30). The factorials in Equation (32) 
come from carrying out this sum explicitly. There is no factorial for the empty cept 
because it is exactly canceled by the contribution from the distortion probabilities. 
By now, the reader will be able to provide his or her own auxiliary function for 
seeking a constrained maximum of the likelihood of a translation with Model 3, but 
for completeness and to establish notation, we write 
$$\begin{aligned}
h(t, d, n, p, \lambda, \mu, \nu, \xi) \equiv {} & \Pr(\mathbf{f} \mid \mathbf{e}) - \sum_{e} \lambda_e \Bigl(\sum_{f} t(f \mid e) - 1\Bigr) - \sum_{iml} \mu_{iml} \Bigl(\sum_{j} d(j \mid i, m, l) - 1\Bigr) \\
& - \sum_{e} \nu_e \Bigl(\sum_{\phi} n(\phi \mid e) - 1\Bigr) - \xi (p_0 + p_1 - 1).
\end{aligned} \tag{33}$$
Following the trail blazed with Models 1 and 2, we define the counts 
$$c(f \mid e; \mathbf{f}, \mathbf{e}) = \sum_{\mathbf{a}} \Pr(\mathbf{a} \mid \mathbf{e}, \mathbf{f}) \sum_{j=1}^{m} \delta(f, f_j)\, \delta(e, e_{a_j}), \tag{34}$$
$$c(j \mid i, m, l; \mathbf{f}, \mathbf{e}) = \sum_{\mathbf{a}} \Pr(\mathbf{a} \mid \mathbf{e}, \mathbf{f})\, \delta(i, a_j), \tag{35}$$
and 
$$c(\phi \mid e; \mathbf{f}, \mathbf{e}) = \sum_{\mathbf{a}} \Pr(\mathbf{a} \mid \mathbf{e}, \mathbf{f}) \sum_{i=1}^{l} \delta(\phi, \phi_i)\, \delta(e, e_i), \tag{36}$$
$$c(0; \mathbf{f}, \mathbf{e}) = \sum_{\mathbf{a}} \Pr(\mathbf{a} \mid \mathbf{e}, \mathbf{f})\, (m - 2\phi_0), \tag{37}$$
$$c(1; \mathbf{f}, \mathbf{e}) = \sum_{\mathbf{a}} \Pr(\mathbf{a} \mid \mathbf{e}, \mathbf{f})\, \phi_0. \tag{38}$$
The counts in these last two equations correspond to the parameters p_0 and p_1 that de- 
termine the fertility of the empty cept in the English string. The reestimation formulae 
for Model 3 are 
$$t(f \mid e) = \lambda_e^{-1} \sum_{s=1}^{S} c(f \mid e; \mathbf{f}^{(s)}, \mathbf{e}^{(s)}), \tag{39}$$
$$d(j \mid i, m, l) = \mu_{iml}^{-1} \sum_{s=1}^{S} c(j \mid i, m, l; \mathbf{f}^{(s)}, \mathbf{e}^{(s)}), \tag{40}$$
$$n(\phi \mid e) = \nu_e^{-1} \sum_{s=1}^{S} c(\phi \mid e; \mathbf{f}^{(s)}, \mathbf{e}^{(s)}), \tag{41}$$
and 
$$p_k = \xi^{-1} \sum_{s=1}^{S} c(k; \mathbf{f}^{(s)}, \mathbf{e}^{(s)}). \tag{42}$$
Equations (34) and (39) are identical to Equations (12) and (14) and are repeated here 
only for convenience. Equations (35) and (40) are similar to Equations (23) and (25), 
but a(i|j, m, l) differs from d(j|i, m, l) in that the former sums to unity over all i for 
fixed j while the latter sums to unity over all j for fixed i. Equations (36), (37), (38), 
(41), and (42), for the fertility parameters, are new. 
The trick that allows us to evaluate the right-hand sides of Equations (12) and (23) 
efficiently for Model 2 does not work for Model 3. Because of the fertility parameters, 
we cannot exchange the sums over a_1 through a_m with the product over j in Equation 
(32) as we were able to for Equations (6) and (21). We are not, however, entirely bereft 
of hope. The alignment is a useful device precisely because some alignments are much 
more probable than others. Our strategy is to carry out the sums in Equations (32) 
and (34)-(38) only over some of the more probable alignments, ignoring the vast sea 
of much less probable ones. Specifically, we begin with the most probable alignment 
that we can find and then include all alignments that can be obtained from it by small 
changes. 
To define unambiguously the subset, S, of the elements of A(e, f) over which we 
evaluate the sums, we need yet more terminology. We say that two alignments, a and 
a′, differ by a move if there is exactly one value of j for which a_j ≠ a′_j. We say that 
they differ by a swap if a_j = a′_j except at two values, j_1 and j_2, for which a_{j_1} = a′_{j_2} and 
a_{j_2} = a′_{j_1}. We say that two alignments are neighbors if they are identical or differ by a 
move or by a swap. We denote the set of all neighbors of a by N(a). 
Let b(a) be that neighbor of a for which the likelihood Pr(b(a)|e, f) is greatest. 
Suppose that ij is pegged for a. Among the neighbors of a for which ij is also pegged, 
let b_{i↔j}(a) be that for which the likelihood is greatest. The sequence of alignments a, 
b(a), b²(a) ≡ b(b(a)), ..., converges in a finite number of steps to an alignment that we 
write as b^∞(a). Similarly, if ij is pegged for a, the sequence of alignments a, b_{i↔j}(a), 
b²_{i↔j}(a), ..., converges in a finite number of steps to an alignment that we write as 
b^∞_{i↔j}(a). The simple form of the distortion probabilities in Model 3 makes it easy to 
find b(a) and b_{i↔j}(a). If a′ is a neighbor of a obtained from it by the move of j from i 
to i′, and if neither i nor i′ is 0, then 
$$\Pr(\mathbf{a}' \mid \mathbf{e}, \mathbf{f}) = \Pr(\mathbf{a} \mid \mathbf{e}, \mathbf{f}) \cdot \frac{(\phi_{i'} + 1)}{\phi_i} \cdot \frac{n(\phi_{i'} + 1 \mid e_{i'})}{n(\phi_{i'} \mid e_{i'})} \cdot \frac{n(\phi_i - 1 \mid e_i)}{n(\phi_i \mid e_i)} \cdot \frac{t(f_j \mid e_{i'})}{t(f_j \mid e_i)} \cdot \frac{d(j \mid i', m, l)}{d(j \mid i, m, l)}. \tag{43}$$
Notice that φ_{i'} is the fertility of the word in position i′ for alignment a. The fertility 
of this word in alignment a′ is φ_{i'} + 1. Similar equations can be easily derived when 
either i or i′ is zero, or when a and a′ differ by a swap. We leave the details to the 
reader. 
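A minimal sketch of the neighborhood and of the iteration a, b(a), b²(a), ...; here `score(a)` stands in for Pr(a|e, f) under whichever model is in use and is assumed to be supplied by the caller, and the toy score at the end is purely illustrative.

```python
def neighbors(a, l):
    """All alignments that differ from a by one move or one swap (plus a itself)."""
    out = [list(a)]
    m = len(a)
    for j in range(m):                       # moves: change a single a_j
        for i in range(l + 1):
            if i != a[j]:
                b = list(a); b[j] = i; out.append(b)
    for j1 in range(m):                      # swaps: exchange a_{j1} and a_{j2}
        for j2 in range(j1 + 1, m):
            if a[j1] != a[j2]:
                b = list(a); b[j1], b[j2] = b[j2], b[j1]; out.append(b)
    return out

def b_infinity(a, l, score):
    """Iterate a -> b(a) until no neighbor scores higher (the alignment b^inf(a))."""
    while True:
        best = max(neighbors(a, l), key=score)
        if score(best) <= score(a):
            return a
        a = best

# Toy usage: prefer alignments where a_j == j (purely illustrative).
score = lambda a: -sum(abs(aj - (j + 1)) for j, aj in enumerate(a))
print(b_infinity([0, 0, 0], l=3, score=score))    # climbs to [1, 2, 3]
```

The pegged variant b_{i↔j} is obtained by the same iteration restricted to neighbors in which a_j is held fixed at i.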
With these preliminaries, we define S by 
$$S = N\bigl(b^{\infty}(V(\mathbf{f} \mid \mathbf{e}; 2))\bigr) \;\cup\; \bigcup_{ij} N\bigl(b^{\infty}_{i \leftrightarrow j}(V_{i \leftrightarrow j}(\mathbf{f} \mid \mathbf{e}; 2))\bigr). \tag{44}$$
In this equation, we use b^∞(V(f|e; 2)) and b^∞_{i↔j}(V_{i↔j}(f|e; 2)) as handy approximations 
to V(f|e; 3) and V_{i↔j}(f|e; 3), neither of which we are able to compute efficiently. 
In one iteration of the EM algorithm for Model 3, we compute the counts in 
Equations (34)-(38), summing only over elements of S, and then use these counts in 
Equations (39)-(42) to obtain a new set of parameters. If the error made by including 
only some of the elements of A(e, f) is not too great, this iteration will lead to values 
of the parameters for which the likelihood of the training data is at least as large as 
for the first set of parameters. 
We make no initial guess of the parameters for Model 3, but instead adapt the 
parameters from the final iteration of the EM algorithm for Model 2. That is, we com- 
pute the counts in Equations (34)-(38) using Model 2 to evaluate Pr(a|e, f). The simple 
form of Model 2 again makes exact calculation feasible. We can readily adapt Equa- 
tions (27) and (28) to compute counts for the translation and distortion probabilities; 
efficient calculation of the fertility counts is more involved, and we defer a discussion 
of it to Appendix B. 
4.5 Deficiency 
The reader will have noticed a problem with our parameterization of the distortion 
probabilities in Model 3: whereas we can see by inspection that the sum over all pairs 
τ, π of the expression on the right-hand side of Equation (29) is unity, it is equally clear 
that this can no longer be the case if we assume that Pr(Π_ik = j | π_{i1}^{k-1}, π_1^{i-1}, τ_0^{l}, φ_0^{l}, e) 
depends only on j, i, m, and l for i > 0. Because the distortion probabilities for assigning 
positions to later words do not depend on the positions assigned to earlier words, 
Model 3 wastes some of its probability on what we might call generalized strings, i.e., 
strings that have some positions with several words and others with none. When a 
model has this property of not concentrating all of its probability on events of interest, 
we say that it is deficient. Deficiency is the price that we pay for the simplicity that 
allows us to write Equation (43). 
Deficiency poses no serious problem here. Although Models 1 and 2 are not tech- 
nically deficient, they are surely spiritually deficient. Each assigns the same probability 
to the alignments (Je n'ai pas de stylo | I(1) do not(2,4) have(3) a(5) pen(6)) and (Je pas ai ne de 
stylo | I(1) do not(2,4) have(3) a(5) pen(6)), and, therefore, essentially the same probability 
to the translations (Je n'ai pas de stylo | I do not have a pen) and (Je pas ai ne de stylo | I do 
not have a pen). In each case, not produces two words, ne and pas, and in each case, 
one of these words ends up in the second position of the French string and the other 
in the fourth position. The first translation should be much more probable than the 
second, but this defect is of little concern because while we might have to translate 
the first string someday, we will never have to translate the second. We do not use 
our translation models to predict French given English but rather as a component of 
a system designed to predict English given French. They need only be accurate to 
within a constant factor over well-formed strings of French words. 
4.6 Model 4 
Often the words in an English string constitute phrases that are translated as units 
into French. Sometimes, a translated phrase may appear at a spot in the French string 
different from that at which the corresponding English phrase appears in the English 
string. The distortion probabilities of Model 3 do not account well for this tendency of 
phrases to move around as units. Movement of a long phrase will be much less likely 
than movement of a short phrase because each word must be moved independently. In 
Model 4, we modify our treatment of Pr(Π_ik = j | π_{i1}^{k-1}, π_1^{i-1}, τ_0^{l}, φ_0^{l}, e) so as to alleviate 
this problem. Words that are connected to the empty cept do not usually form phrases, 
and so we continue to assume that these words are spread uniformly throughout the 
French string. 
As we have described, an alignment resolves an English string into a ceptual 
scheme consisting of a set of possibly overlapping cepts. Each of these cepts then ac- 
counts for one or more French words. In Model 3 the ceptual scheme for an alignment 
is determined by the fertilities of the words: a word is a cept if its fertility is greater 
than zero. The empty cept is a part of the ceptual scheme if ¢0 is greater than zero. 
As before we exclude multi-word cepts. Among the one-word cepts, there is a natural 
order corresponding to the order in which they appear in the English string. Let [i] 
denote the position in the English string of the i-th one-word cept. We define the center 
of this cept, ⊙_i, to be the ceiling of the average value of the positions in the French 
string of the words from its tablet. We define its head to be that word in its tablet for 
which the position in the French string is smallest. 
In Model 4, we replace d(j|i, m, l) by two sets of parameters: one for placing the 
head of each cept, and one for placing any remaining words. For [i] > 0, we require 
that the head for cept i be τ_{[i]1} and we assume that 
$$\Pr(\Pi_{[i]1} = j \mid \pi_1^{[i]-1}, \tau_0^{l}, \phi_0^{l}, \mathbf{e}) = d_1\bigl(j - \odot_{i-1} \mid \mathcal{A}(e_{[i-1]}), \mathcal{B}(f_j)\bigr). \tag{45}$$
Here, A and B are functions of the English and French words that take on a small 
number of different values as their arguments range over their respective vocabularies. 
Brown et al. (1990) describe an algorithm for dividing a vocabulary into classes so as to 
preserve mutual information between adjacent classes in running text. We construct 
A and B as functions with 50 distinct values by dividing the English and French 
vocabularies each into 50 classes according to this algorithm. By assuming that the 
probability depends on the previous cept and on the identity of the French word 
being placed, we can account for such facts as the appearance of adjectives before 
nouns in English but after them in French. We call j − ⊙_{i−1} the displacement for the 
head of cept i. It may be either positive or negative. We expect d_1(−1 | A(e), B(f)) to 
be larger than d_1(+1 | A(e), B(f)) when e is an adjective and f is a noun. Indeed, 
this is borne out in the trained distortion probabilities for Model 4, where we find 
that d_1(−1 | A(government's), B(développement)) is 0.7986, while d_1(+1 | A(government's), 
B(développement)) is 0.0168. 
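A minimal sketch of the head-placement lookup of Equation (45): the class assignments below are hypothetical, and the two probabilities merely echo the trained values quoted above; nothing here is taken from the actual trained tables beyond those two numbers.

```python
# Head placement in Model 4 (Equation 45): the probability of putting the head
# of cept i at position j depends only on the displacement j - center(i-1) and
# on the classes of the previous cept's English word and of the French word.
english_class = {"government's": 17}          # hypothetical class map A
french_class = {"développement": 41}          # hypothetical class map B (50 classes)

# d1[(displacement, A(e), B(f))] -> probability; numbers echo the text above.
d1 = {(-1, 17, 41): 0.7986, (+1, 17, 41): 0.0168}

def head_placement_prob(j, center_prev, e_prev_cept, f_word):
    key = (j - center_prev, english_class[e_prev_cept], french_class[f_word])
    return d1.get(key, 0.0)

print(head_placement_prob(4, 5, "government's", "développement"))   # d1(-1 | ...)
```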
Suppose, now, that we wish to place the k-th word of cept i for [i] > 0, k > 1. We 
assume that 
$$\Pr(\Pi_{[i]k} = j \mid \pi_{[i]1}^{k-1}, \pi_1^{[i]-1}, \tau_0^{l}, \phi_0^{l}, \mathbf{e}) = d_{>1}\bigl(j - \pi_{[i]k-1} \mid \mathcal{B}(f_j)\bigr). \tag{46}$$
We require that π_{[i]k} be greater than π_{[i]k−1}. Some English words tend to produce a 
series of French words that belong together, while others tend to produce a series of 
words that should be separate. For example, implemented can produce mis en application, 
which usually occurs as a unit, but not can produce ne pas, which often occurs with 
an intervening verb. We expect d_{>1}(2 | B(pas)) to be relatively large compared with 
d_{>1}(2 | B(en)). After training, we find that d_{>1}(2 | B(pas)) is 0.6847 and d_{>1}(2 | B(en)) is 
0.1533. 
Whereas we assume that τ_{[i]1} can be placed either before or after any previously 
positioned words, we require subsequent words from τ_{[i]} to be placed in order. This 
does not mean that they must occupy consecutive positions but only that the second 
word from τ_{[i]} must lie to the right of the first, the third to the right of the second, and 
so on. Because of this, only one of the φ_{[i]}! arrangements of τ_{[i]} is possible. 
We leave the routine details of deriving the count and reestimation formulae for 
Model 4 to the reader. He may find the general formulae in Appendix B helpful. 
Once again, the several counts for a translation are expectations of various quantities 
over the possible alignments with the probability of each alignment computed from an 
earlier estimate of the parameters. As with Model 3, we know of no trick for evaluating 
these expectations and must rely on sampling some small set, S, of alignments. As 
described above, the simple form that we assume for the distortion probabilities in 
Model 3 makes it possible for us to find b^∞(a) rapidly for any a. The analog of Equation 
(43) for Model 4 is complicated by the fact that when we move a French word from cept 
to cept we change the centers of two cepts and may affect the contribution of several 
words. It is nonetheless possible to evaluate the adjusted likelihood incrementally, 
although it is substantially more time-consuming. 
Faced with this unpleasant situation, we proceed as follows. Let the neighbors 
of a be ranked so that the first is the neighbor for which Pr(a | e, f; 3) is greatest, the
second the one for which Pr(a | e, f; 3) is next greatest, and so on. We define b̃(a) to be the
highest-ranking neighbor of a for which Pr(b̃(a) | e, f; 4) is at least as large as Pr(a | e, f; 4).
We define b̃_{i↔j}(a) analogously. Here, Pr(a | e, f; 3) means Pr(a | e, f) as computed with
Model 3, and Pr(a | e, f; 4) means Pr(a | e, f) as computed with Model 4. We define S for
Model 4 by

S = \mathcal{N}(\tilde{b}^{\infty}(V(f \mid e; 2))) \cup \bigcup_{ij} \mathcal{N}_{i \leftrightarrow j}(\tilde{b}^{\infty}_{i \leftrightarrow j}(V_{i \leftrightarrow j}(f \mid e; 2))).    (47)

This equation is identical to Equation (44) except that b has been replaced by b̃.
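A minimal sketch of this ranked hill-climb, in our own notation: prob3 and prob4 are placeholders for Pr(a | e, f; 3) and Pr(a | e, f; 4), and neighbors(a) enumerates the moves and swaps that define the neighborhood of a.

    def b_tilde(a, neighbors, prob3, prob4):
        """One step of the Model 4 hill-climb.

        Neighbors are ranked by their Model 3 probability (cheap to compute);
        we return the highest-ranking one whose Model 4 probability is at least
        that of the current alignment, or the current alignment if none is.
        """
        ranked = sorted(neighbors(a), key=prob3, reverse=True)
        p4_a = prob4(a)
        for candidate in ranked:
            if prob4(candidate) >= p4_a:
                return candidate
        return a

    def b_tilde_infinity(a, neighbors, prob3, prob4):
        """Apply b_tilde repeatedly; stop once the Model 4 probability stops
        strictly increasing, which guarantees termination."""
        while True:
            nxt = b_tilde(a, neighbors, prob3, prob4)
            if prob4(nxt) <= prob4(a):
                return a
            a = nxt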
4.7 Model 5 
Models 3 and 4 are both deficient. In Model 4, not only can several words lie on top 
of one another, but words can be placed before the first position or beyond the last 
position in the French string. We remove this deficiency in Model 5. 
After we have placed the words for τ_1^{[i]-1} and τ_{[i]1}^{k-1} there will remain some vacant
positions in the French string. Obviously, τ_{[i]k} should be placed in one of these
vacancies. Models 3 and 4 are deficient precisely because we fail to enforce this
constraint for the one-word cepts. Let v(j, τ_1^{[i]-1}, τ_{[i]1}^{k-1}) be the number of vacancies up to
and including position j just before we place τ_{[i]k}. In the interest of notational brevity,
a noble but elusive goal, we write this simply as v_j. We retain two sets of distortion
parameters, as in Model 4, and continue to refer to them as d_1 and d_{>1}. We assume
that, for [i] > 0,

\Pr(\Pi_{[i]1} = j \mid \pi_1^{[i]-1}, \tau_0^l, \phi_0^l, e) = d_1(v_j \mid \mathcal{B}(f_j), v_{\odot_{i-1}}, v_m - \phi_{[i]} + 1)(1 - \delta(v_j, v_{j-1})).    (48)
The number of vacancies up to j is the same as the number of vacancies up to j - 1 
only when j is not itself vacant. The last factor, therefore, is 1 when j is vacant and 0 
otherwise. In the final parameter of d_1, v_m is the number of vacancies remaining in the
French string. If φ_{[i]} = 1, then τ_{[i]1} may be placed in any of these vacancies; if φ_{[i]} = 2,
τ_{[i]1} may be placed in any but the last of these vacancies; in general, τ_{[i]1} may be placed
in any but the rightmost φ_{[i]} − 1 of the remaining vacancies. Because τ_{[i]1} must occupy
the leftmost place of any of the words from τ_{[i]}, we must take care to leave room at
the end of the string for the remaining words from this tablet. As with Model 4, we
allow d_1 to depend on the center of the previous cept and on f_j, but we suppress the
dependence on e_{[i−1]} since we should otherwise have too many parameters.
For \[i\] > 0 and k > 1, we assume 
\Pr(\Pi_{[i]k} = j \mid \pi_{[i]1}^{k-1}, \pi_1^{[i]-1}, \tau_0^l, \phi_0^l, e)
    = d_{>1}(v_j - v_{\pi_{[i]k-1}} \mid \mathcal{B}(f_j), v_m - v_{\pi_{[i]k-1}} - \phi_{[i]} + k)(1 - \delta(v_j, v_{j-1})).    (49)
Again, the final factor enforces the constraint that τ_{[i]k} land in a vacant position, and,
again, we assume that the probability depends on f_j only through its class. Model 5 is
described in more detail in Appendix B. 
As with Model 4, we leave the details of the count and reestimation formulae 
to the reader. No incremental evaluation of the likelihood of neighbors is possible 
with Model 5 because a move or swap may require wholesale recomputation of the 
likelihood of an alignment. Therefore, when we evaluate expectations for Model 5, we 
include only the alignments in S as defined in Equation (47). We further trim these 
alignments by removing any alignment a for which Pr(a | e, f; 4) is too much smaller
than Pr(b̃^∞(V(f | e; 2)) | e, f; 4).
Model 5 is a powerful but unwieldy ally in the battle to align translations. It must 
be led to the battlefield by its weaker but more agile brethren Models 2, 3, and 4. In fact, 
this is the raison d'être of these models. To keep them aware of the lay of the land, we
adjust their parameters as we carry out iterations of the EM algorithm for Model 5. That 
is, we collect counts for Models 2, 3, and 4 by summing over alignments as determined 
by the abbreviated S described above, using Model 5 to compute Pr(ale, f). Although 
this appears to increase the storage necessary for maintaining counts as we proceed 
through the training data, the extra burden is small because the overwhelming majority 
of the storage is devoted to counts for t(f | e), and these are the same for Models 2, 3,
4, and 5. 
5. Results 
We have used a large collection of training data to estimate the parameters of the 
models described above. Brown, Lai, and Mercer (1991) have described an algorithm 
with which one can reliably extract French and English sentences that are translations 
of one another from parallel corpora. They used the algorithm to extract a large number 
of translations from several years of the proceedings of the Canadian parliament. From 
these translations, we have chosen as our training data those for which both the English 
sentence and the French sentence are 30 or fewer words in length. This is a collection 
Table 1 
A summary of the training iterations. 
Iteration   In → Out    Survivors    Alignments   Perplexity
    1        1 → 2     12,017,609                  71,550.56
    2        2 → 2     12,160,475                     202.99
    3        2 → 2      9,403,220                      89.41
    4        2 → 2      6,837,172                      61.59
    5        2 → 2      5,303,312                      49.77
    6        2 → 2      4,397,172                      46.36
    7        2 → 3      3,841,470                      45.15
    8        3 → 5      2,057,033        291          124.28
    9        5 → 5      1,850,665         95           39.17
   10        5 → 5      1,763,665         48           32.91
   11        5 → 5      1,703,393         39           31.29
   12        5 → 5      1,658,364         33           30.65
of 1,778,620 translations. In an effort to eliminate some of the typographical errors that 
abound in the text, we have chosen as our English vocabulary all of those words that 
appear at least twice in English sentences in our data, and as our French vocabulary 
all of those words that appear at least twice in French sentences in our data. All 
other words we replace with a special unknown English word or unknown French word 
accordingly as they appear in an English sentence or a French sentence. We arrive 
in this way at an English vocabulary of 42,005 words and a French vocabulary of 
58,016 words. Some typographical errors are quite frequent, for example, momento for 
memento, and so our vocabularies are not completely free of them. At the same time, 
some words are truly rare, and so we have, in some cases, snubbed legitimate words. 
Adding e0 to the English vocabulary brings it to 42,006 words. 
We have carried out 12 iterations of the EM algorithm for this data. We initialized 
the process by setting each of the 2,437,020,096 translation probabilities, t(f | e), to
1/58,016. That is, we assume each of the 58,016 words in the French vocabulary to be 
equally likely as a translation for each of the 42,006 words in the English vocabulary. 
For t(f | e) to be greater than zero at the maximum likelihood solution for one of our
models, f and e must occur together in at least one of the translations in our training 
data. This is the case for only 25,427,016 pairs, or about one percent of all translation
probabilities. On the average, then, each English word appears with about 605 French 
words. 
Table 1 summarizes our training computation. At each iteration, we compute the 
probabilities of the various alignments of each translation using one model, and collect 
counts using a second (possibly different) model. These are referred to in the table as 
the In model and the Out model, respectively. After each iteration, we retain individual 
values only for those translation probabilities that surpass a threshold; the remainder 
we set to a small value (10^{-12}). This value is so small that it does not affect the
normalization conditions, but is large enough that translation probabilities can be 
resurrected during later iterations. We see in columns 4 and 5 that even though we 
lower the threshold as iterations progress, fewer and fewer probabilities survive. By 
the final iteration, only 1,658,364 probabilities survive, an average of about 39 French 
words for each English word. 
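A sketch of this pruning step, with the threshold left as a parameter (the schedule of thresholds actually used is not given in the text) and 10^{-12} as the floor value mentioned above:

    FLOOR = 1e-12   # value implicitly assigned to pruned entries

    def prune_translation_table(t, threshold):
        """Keep explicit entries only for probabilities above the threshold.

        t is a sparse table mapping (f, e) -> probability.  Pruned entries are
        simply not stored; lookups fall back to FLOOR, which is too small to
        disturb the normalization but large enough for an entry to be
        resurrected in a later iteration.
        """
        return {pair: p for pair, p in t.items() if p > threshold}

    def t_prob(t, f, e):
        return t.get((f, e), FLOOR)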
Although the entire t array has 2,437,020,096 entries, and we need to store it
twice, once as probabilities and once as counts, it is clear from the preceding remarks
that we need never deal with more than about 25 million counts or about 12 million 
probabilities. We store these two arrays using standard sparse matrix techniques. We 
keep counts as pairs of bytes, but allow for overflow into 4 bytes if necessary. In 
this way, it is possible to run the training program in less than 100 megabytes of 
memory. While this number would have seemed extravagant a few years ago, today 
it is available at modest cost in a personal workstation. 
As we have described, when the In model is neither Model 1 nor Model 2, we 
evaluate the count sums over only some of the possible alignments. Many of these 
alignments have a probability much smaller than that of the Viterbi alignment. The 
column headed Alignments in Table 1 shows the average number of alignments for 
which the probability is within a factor of 25 of the probability of the Viterbi align- 
ment in each iteration. As this number drops, the model concentrates more and more 
probability onto fewer and fewer alignments so that the Viterbi alignment becomes 
ever more dominant. 
The last column in the table shows the perplexity of the French text given the 
English text for the In model of the iteration. We expect the likelihood of the training 
data to increase with each iteration. We can think of this likelihood as arising from a 
product of factors, one for each French word in the training data. We have 28,850,104 
French words in our training data, so the 28,850,104 th root of the likelihood is the 
average factor by which the likelihood is reduced for each additional French word. 
The reciprocal of this root is the perplexity shown in the table. As the likelihood 
increases, the perplexity decreases. We see a steady decrease in perplexity as the itera- 
tions progress except when we switch from Model 2 as the In model to Model 3. This 
sudden jump is not because Model 3 is a poorer model than Model 2, but because 
Model 3 is deficient: the great majority of its probability is squandered on objects that 
are not strings of French words. As we have argued, deficiency is not a problem. In 
our description of Model 1, we left Pr(m | e) unspecified. In quoting perplexities for
Models 1 and 2, we have assumed that the length of the French string is Poisson with
a mean that is a linear function of the length of the English string. Specifically, we
have assumed that Pr(M = m | e) = (λl)^m e^{-λl}/m!, with λ equal to 1.09.
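In code, the perplexity reported in Table 1 is just an exponential of the negative log-likelihood per French word; a small sketch, assuming the per-pair log-likelihoods log Pr(f | e) have already been computed under the In model:

    import math

    def perplexity(log_probs, total_french_words):
        """Perplexity of the French text given the English text.

        log_probs          -- natural-log likelihoods log Pr(f | e), one per pair
        total_french_words -- number of French tokens in the sample
                              (28,850,104 for the full training data)
        """
        total_log_likelihood = sum(log_probs)
        return math.exp(-total_log_likelihood / total_french_words)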
It is interesting to see how the Viterbi alignments change as the iterations progress. 
In Figure 5, we show for several sentences the Viterbi alignment after iterations 1, 6, 
7, and 12. Iteration 1 is the first iteration for Model 2, and iterations 6, 7, and 12 are 
the final iterations for Models 2, 3, and 5, respectively. In each example, we show 
the French sentence with a subscript affixed to each word to ease the reader's task 
in interpreting the list of numbers after each English word. In the first example, (Il
me semble faire signe que oui | It seems to me that he is nodding), two interesting changes
evolve over the course of the iterations. In the alignment for Model 1, Il is correctly
connected to he, but in all later alignments Il is incorrectly connected to It. Models 2, 3,
and 5 discount a connection of he to Il because it is quite far away. We do not yet have
a model with sufficient linguistic sophistication to make this connection properly. On 
the other hand, we see that nodding, which in Models 1, 2, and 3 is connected only to 
signe and oui, is correctly connected to the entire phrase faire signe que oui in Model 5. 
In the second example, (Voyez les profits que ils ont réalisés | Look at the profits they have
made), Models 1, 2, and 3 incorrectly connect profits4 to both profits3 and réalisés7, but
with Model 5, profits4 is correctly connected only to profits3, and made7 is connected to
réalisés7. Finally, in (De les promesses, de les promesses! | Promises, promises.), Promises1 is
connected to both instances of promesses with Model 1; promises3 is connected to most 
of the French sentence with Model 2; the final punctuation of the English sentence is 
connected to both the exclamation point and, curiously, to des with Model 3; and only 
with Model 5 do we have a satisfying alignment of the two sentences. The orthography 
for the French sentence in the second example is Voyez les profits qu'ils ont réalisés and
in the third example is Des promesses, des promesses! We have restored the e to the end
Il1 me2 semble3 faire4 signe5 que6 oui7
It seems(3) to(4) me(2) that(6) he(1) is nodding(5,7)
It(1) seems(3) to me(2) that he is nodding(5,7)
It(1) seems(3) to(4) me(2) that(6) he is nodding(5,7)
It(1) seems(3) to me(2) that he is nodding(4,5,6,7)

Voyez1 les2 profits3 que4 ils5 ont6 réalisés7
Look(1) at the(2) profits(3,7) they(5) have(6) made
Look(1) at the(2,4) profits(3,7) they(5) have(6) made
Look(1) at the profits(3,7) they(5) have(6) made
Look(1) at the(2) profits(3) they(5) have(6) made(7)

De1 les2 promesses3 ,4 de5 les6 promesses7 !8
Promises(3,7) ,(4) promises .(8)
Promises ,(4) promises(2,3,6,7) .(8)
Promises(3) ,(4) promises(7) .(5,8)
Promises(2,3) ,(4) promises(6,7) .(8)
Figure 5 
The progress of alignments with iteration. 
of qu' and have twice analyzed des into its constituents, de and les. We commit these 
and other petty pseudographic improprieties in the interest of regularizing the French 
text. In all cases, orthographic French can be recovered by rule from our corrupted 
versions. 
Figures 6-15 show the translation probabilities and fertilities after the final iteration 
of training for a number of English words. We show all and only those probabilities 
that are greater than 0.01. Some words, like nodding, in Figure 6, do not slip gracefully 
into French. Thus, we have translations like (Il fait signe que oui | He is nodding), (Il fait
un signe de la tête | He is nodding), (Il fait un signe de tête affirmatif | He is nodding), or (Il
hoche la tête affirmativement | He is nodding). As a result, nodding frequently has a large
fertility and spreads its translation probability over a variety of words. In French, what 
is worth saying is worth saying in many different ways. We see another facet of this 
with words like should, in Figure 7, which rarely has a fertility greater than one but still 
produces many different words, among them devrait, devraient, devrions, doit, doivent, 
devons, and devrais. These are (just a fraction of the many) forms of the French verb 
devoir. Adjectives fare a little better: national, in Figure 8, almost never produces more 
than one word and confines itself to one of nationale, national, nationaux, and nationales, 
respectively the feminine, the masculine, the masculine plural, and the feminine plural 
of the corresponding French adjective. It is clear that our models would benefit from 
some kind of morphological processing to rein in the lexical exuberance of French. 
We see from the data for the, in Figure 9, that it produces le, la, les, and I' as we 
would expect. Its fertility is usually 1, but in some situations English prefers an article 
where French does not and so about 14% of the time its fertility is 0. Sometimes, as 
with farmers, in Figure 10, it is French that prefers the article. When this happens, the 
English noun trains to produce its translation together with an article. Thus, farmers 
nodding

f          t(f | e)      φ    n(φ | e)
signe      0.164         4    0.342
la         0.123         3    0.293
tête       0.097         2    0.167
oui        0.086         1    0.163
fait       0.073         0    0.023
que        0.073
hoche      0.054
hocher     0.048
faire      0.030
me         0.024
approuve   0.019
qui        0.019
un         0.012
faites     0.011

Figure 6
Translation and fertility probabilities for nodding.
typically has a fertility 2 and usually produces either agriculteurs or les. We include 
additional examples in Figures 11 through 15, which show the translation and fertility 
probabilities for external, answer, oil, former, and not. Although we show the various 
probabilities to three decimal places, one must realize that the specific numbers that 
appear are peculiar to the training data that we used in obtaining them. They are not 
constants of nature relating the Platonic ideals of eternal English and eternal French. 
Had we used different sentences as training data, we might well have arrived at 
different numbers. For example, in Figure 9, we see that t(le | the) = 0.497 while the
corresponding number from Figure 4 of Brown et al. (1990) is 0.610. The difference 
arises not from some instability in the training algorithms or some subtle shift in 
the languages in recent years, but from the fact that we have used 1,778,620 pairs of 
sentences covering virtually the complete vocabulary of the Hansard data for training, 
while they used only 40,000 pairs of sentences and restricted their attention to the 9,000 
most common words in each of the two vocabularies. 
Figures 16, 17, and 18 show automatically derived alignments for three translations.
In the terminology of Section 4.6, each alignment is b̃^∞(V(f | e; 2)). We stress that
these alignments have been found by an algorithm that involves no explicit knowledge 
of either French or English. Every fact adduced to support them has been discovered 
algorithmically from the 1,778,620 translations that constitute our training data. This 
data, in turn, is the product of an algorithm the sole linguistic input of which is a set 
of rules explaining how to find sentence boundaries in the two languages. We may 
justifiably claim, therefore, that these alignments are inherent in the Canadian Hansard 
data itself. 
In the alignment shown in Figure 16, all but one of the English words has fertility 1. 
The final prepositional phrase has been moved to the front of the French sentence, but 
otherwise the translation is almost verbatim. Notice, however, that the new proposal 
has been translated into les nouvelles propositions, demonstrating that number is not an 
invariant under translation. The empty cept has fertility 5 here. It generates en1, de3,
the comma, de16, and de18.
should

f           t(f | e)      φ    n(φ | e)
devrait     0.330         1    0.649
devraient   0.123         0    0.336
devrions    0.109         2    0.014
faudrait    0.073
faut        0.058
doit        0.058
aurait      0.041
doivent     0.024
devons      0.017
devrais     0.013

Figure 7
Translation and fertility probabilities for should.
national

f            t(f | e)      φ    n(φ | e)
nationale    0.469         1    0.905
national     0.418         0    0.094
nationaux    0.054
nationales   0.029

Figure 8
Translation and fertility probabilities for national.
the

f       t(f | e)      φ    n(φ | e)
le      0.497         1    0.746
la      0.207         0    0.254
les     0.155
l'      0.086
ce      0.018
cette   0.011

Figure 9
Translation and fertility probabilities for the.
farmers

f              t(f | e)      φ    n(φ | e)
agriculteurs   0.442         2    0.731
les            0.418         1    0.228
cultivateurs   0.046         0    0.039
producteurs    0.021

Figure 10
Translation and fertility probabilities for farmers.
external

f             t(f | e)      φ    n(φ | e)
extérieures   0.944         1    0.967
extérieur     0.015         0    0.028
externe       0.011
extérieurs    0.010

Figure 11
Translation and fertility probabilities for external.
answer

f           t(f | e)      φ    n(φ | e)
réponse     0.442         1    0.809
répondre    0.233         2    0.115
répondu     0.041         0    0.074
            0.038
solution    0.027
répondez    0.021
répondrai   0.016
réponde     0.014
y           0.013
ma          0.010

Figure 12
Translation and fertility probabilities for answer.
oil

f             t(f | e)      φ    n(φ | e)
pétrole       0.558         1    0.760
pétrolières   0.138         0    0.181
pétrolière    0.109         2    0.057
le            0.054
pétrolier     0.030
pétroliers    0.024
huile         0.020
Oil           0.013

Figure 13
Translation and fertility probabilities for oil.
former

f           t(f | e)      φ    n(φ | e)
ancien      0.592         1    0.866
anciens     0.092         0    0.074
ex          0.092         2    0.060
précédent   0.054
l'          0.043
ancienne    0.018
été         0.013

Figure 14
Translation and fertility probabilities for former.
not

f      t(f | e)      φ    n(φ | e)
ne     0.497         2    0.735
pas    0.442         0    0.154
non    0.029         1    0.107
rien   0.011

Figure 15
Translation and fertility probabilities for not.
In Figure 17, two of the English words have fertility 0, one has fertility 2, and one, 
embattled, has fertility 5. Embattled is another word, like nodding, that eludes the French 
grasp and comes with a panoply of multi-word translations. 
The final example, in Figure 18, has several features that bear comment. The second 
word, Speaker, is connected to the sequence l'Orateur. Like farmers above, it has trained 
to produce both the word that we naturally think of as its translation and the associated 
article. In our data, Speaker always has fertility 2 and produces equally often l'Orateur
and le président. Later in the sentence, starred is connected to the phrase marquées de un
astérisque. From an initial situation in which each French word is equally probable as
a translation of starred, we have arrived, through training, at a situation where it is 
possible to connect starred to just the right string of four words. Near the end of the 
sentence, give is connected to donnerai, the first person singular future of donner, which 
means to give. We should be more comfortable if both will and give were connected 
to donnerai, but by limiting cepts to no more than one word, we have precluded this 
possibility. Finally, the last 12 words of the English sentence, I now have the answer and 
will give it to the House, clearly correspond to the last 7 words of the French sentence,
je donnerai la réponse à la Chambre, but, literally, the French is I will give the answer to
the House. There is nothing about now, have, and, or it, and each of these words has 
fertility 0. Translations that are as far as this from the literal are rather more the rule 
than the exception in our training data. One might cavil at the connection of la réponse
to the answer rather than to it. We do not.
6. Better Translation Models 
Models 1-5 provide an effective means for obtaining word-by-word alignments of 
translations, but as a means to achieve our real goal, which is translation, there is 
English: What1 is2 the3 anticipated4 cost5 of6 administering7 and8 collecting9 fees10 under11 the12 new13 proposal14 ?15
French: En1 vertu2 de3 les4 nouvelles5 propositions6 ,7 quel8 est9 le10 coût11 prévu12 de13 administration14 et15 de16 perception17 de18 les19 droits20 ?21

Figure 16
The best of 1.9 x 10^25 alignments.
English: The1 secretary2 of3 state4 for5 external6 affairs7 comes8 in9 as10 the11 one12 supporter13 of14 the15 embattled16 minister17 of18 yesterday19
French: Le1 secrétaire2 de3 État4 à5 les6 Affaires7 extérieures8 se9 présente10 comme11 le12 seul13 défenseur14 de15 le16 ministre17 qui18 se19 est20 fait21 bousculer22 hier23

Figure 17
The best of 8.4 x 10^29 alignments.
English: Mr.1 Speaker2 ,3 if4 we5 might6 return7 to8 starred9 questions10 ,11 I12 now13 have14 the15 answer16 and17 will18 give19 it20 to21 the22 House23
French: Monsieur1 l'2 Orateur3 ,4 si5 nous6 pouvons7 revenir8 à9 les10 questions11 marquées12 de13 un14 astérisque15 ,16 je17 donnerai18 la19 réponse20 à21 la22 Chambre23

Figure 18
The best of 5.6 x 10^31 alignments.
room for improvement. We have seen that by ignoring the morphological structure 
of the two languages we dilute the strength of our statistical model, explaining, for 
example, each of the several tens of forms of each French verb independently. We have 
seen that by ignoring multi-word cepts, we are forced to give a false, or at least an 
unsatisfactory, account of some features in many translations. And finally, we have 
seen that our models are deficient, either in fact, as with Models 3 and 4, or in spirit, 
as with Models 1, 2, and 5. 
6.1 The Truth about Deficiency 
We have argued in Section 2 that neither spiritual nor actual deficiency poses a serious 
problem, but this is not entirely true. Let w(e) be the sum of Pr(f | e) over well-formed
French strings and let i(e) be the sum over ill-formed French strings. In a deficient
model, w(e) + i(e) < 1. We say that the remainder of the probability is concentrated
on the event failure and we write w(e) + i(e) + Pr(failure | e) = 1. Clearly, a model is
deficient precisely when Pr(failure | e) > 0. If Pr(failure | e) = 0, but i(e) > 0, then the
model is spiritually deficient. If w(e) were independent of e, neither form of deficiency
would pose a problem, but because our models have no long-term constraints, w(e)
decreases exponentially with l. When computing alignments, even this creates no
problem because e and f are known. If, however, we are given f and asked to discover
ê, then we will find that the product Pr(e) Pr(f | e) is too small for long English strings
as compared with short ones. As a result, we will improperly favor short English
strings. We can counteract this tendency in part by replacing Pr(f | e) with c^l Pr(f | e)
for some empirically chosen constant c. This is treatment of the symptom rather than
treatment of the disease itself, but it offers some temporary relief. The cure lies in 
better modeling. 
6.2 Viterbi Training 
As we progress from Model 1 to Model 5, evaluating the expectations that give us 
counts becomes increasingly difficult. For Models 1 and 2, we are able to include the
contribution of each of the (l + 1)^m possible alignments exactly. For later models,
we include the contributions of fewer and fewer alignments. Because most of the 
probability for each translation is concentrated by these models on a small number 
of alignments, this suboptimal procedure, mandated by the complexity of the models, 
yields acceptable results. 
In the limit, we can contemplate evaluating the expectations using only a single, 
probable alignment for each translation. When that alignment is the Viterbi alignment, 
we call this Viterbi training. It is easy to see that Viterbi training converges: at each 
step, we reestimate parameters so as to make the current set of Viterbi alignments as 
probable as possible; when we use these parameters to compute a new set of Viterbi 
alignments, we find either the old set or a set that is yet more probable. Since the 
probability can never be greater than one, this process must converge. In fact, unlike 
the EM algorithm in general, it must converge in a finite, though impractically large, 
number of steps because each translation has only a finite number of alignments. 
In practice, we are never sure that we have found the Viterbi alignment. If we 
reinterpret the Viterbi alignment to mean the most probable alignment that we can 
find rather than the most probable alignment that exists, then a similarly reinterpreted 
Viterbi training algorithm still converges. We have already used this algorithm suc- 
cessfully as a part of a system to assign senses to English and French words on the 
basis of the context in which they appear (Brown et al. 1991a, 1991b). We expect to 
use it in models that we develop beyond Model 5. 
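A schematic rendering of this (reinterpreted) Viterbi training loop; best_alignment and reestimate are placeholders for the alignment search and the count-based parameter updates described in Appendix B.

    def viterbi_training(corpus, params, best_alignment, reestimate, max_iters=20):
        """Viterbi training: reestimate from one alignment per translation.

        corpus          -- list of (f, e) translation pairs
        best_alignment  -- returns the most probable alignment we can find for
                           (f, e) under the current parameters
        reestimate      -- maps (f, e, alignment) triples to the parameter values
                           that make those alignments as probable as possible
        """
        for _ in range(max_iters):
            alignments = [(f, e, best_alignment(f, e, params)) for f, e in corpus]
            new_params = reestimate(alignments)
            if new_params == params:   # the Viterbi alignments and counts are stable
                return new_params
            params = new_params
        return params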
6.3 Multi-Word Cepts 
In Models 1-5, we restrict our attention to alignments with cepts containing no more 
than one word each. Except in Models 4 and 5, cepts play little rôle in our development.
Even in these models, cepts are determined implicitly by the fertilities of the words 
in the alignment: words for which the fertility is greater than zero make up one-word 
cepts; those for which it is zero do not. We can easily extend the generative process 
upon which Models 3, 4, and 5 are based to encompass multi-word cepts. We need 
only include a step for selecting the ceptual scheme and ascribe fertilities to cepts 
rather than to words, requiring that the fertility of each cept be greater than zero. 
Then, in Equation (29), we can replace the products over words in an English string 
with products over cepts in the ceptual scheme. 
When we venture beyond one-word cepts, however, we must tread lightly. An 
English string can contain any of 42,005 one-word cepts, but there are more than 
1.7 billion possible two-word cepts, more than 74 trillion three-word cepts, and so 
on. Clearly, one must be discriminating in choosing potential multi-word cepts. The 
caution that we have displayed thus far in limiting ourselves to cepts with fewer 
than two words was motivated primarily by our respect for the featureless desert that 
multi-word cepts offer a priori. The Viterbi alignments that we have computed with 
Model 5 give us a frame of reference from which to expand our horizons to multi-word 
cepts. By inspecting them, we can find translations for a given multi-word sequence. 
We need only promote a multi-word sequence to cepthood when these translations 
differ substantially from what we might expect on the basis of the individual words 
that it contains. In English, either a boat or a person can be left high and dry, but in 
French, un bateau is not left haut et sec, nor une personne haute et sèche. Rather, a boat is
left échoué and a person en plan. High and dry, therefore, is a promising three-word cept
because its translation is not compositional. 
6.4 Morphology 
We treat each distinct sequence of letters as a distinct word. In English, for example, 
we recognize no kinship among the several forms of the verb to eat (eat, ate, eaten, 
eats, and eating). In French, irregular verbs have many forms. In Figure 7, we have 
already seen 7 forms of devoir. Altogether, it has 41 different forms. And there would 
be 42 if the French did not inexplicably drop the circumflex from the masculine plural 
past participle (dus), thereby causing it to collide with the first and second person 
singular in the passé simple, no doubt a source of endless confusion for the beleaguered
francophone. 
The French make do with fewer forms for the multitude of regular verbs that are 
the staple diet of everyday speech. Thus, manger (to eat) has only 39 forms (manger,
mange, manges, ..., mangeassent). Models 1-5 must learn to connect the 5 forms of to eat
to the 39 forms of manger. In the 28,850,104 French words that make up our training
data, only 13 of the 39 forms of manger actually appear. Of course, it is only natural 
that in the proceedings of a parliament, forms of manger are less numerous than forms 
of parler (to speak), but even for parler, only 28 of the 39 forms occur in our data. If we 
were to encounter a rare form of one of these words, say, parlassions or mangeassent, we 
would have no inkling of its relationship to speak or eat. A similar predicament besets 
nouns and adjectives as well. For example, composition is among the most common
words in our English vocabulary, but compositions is among the least common words. 
We plan to ameliorate these problems with a simple inflectional analysis of verbs, 
nouns, adjectives, and adverbs, so that the relatedness of the several forms of the same 
word is manifest in our representation of the data. For example, we wish to make 
evident the common pedigree of the different conjugations of a verb in French and 
in English; of the singular and plural, and singular possessive and plural possessive 
forms of a noun in English; of the singular, plural, masculine, and feminine forms of 
a noun or adjective in French; and of the positive, comparative, and superlative forms 
of an adjective or adverb in English. 
Thus, our intention is to transform (Je mange la pêche | I eat the peach) into, e.g., (je
manger,13spres la pêche | I eat,x3spres the peach). Here, eat is analyzed into a root, eat, and
an ending, x3spres, that indicates the present tense form used except in the third person
singular. Similarly, mange is analyzed into a root, manger, and an ending, 13spres, that
indicates the present tense form used for the first and third persons singular.
These transformations are invertible and should reduce the French vocabulary by 
about 50% and the English vocabulary by about 20%. We hope that this will signifi- 
cantly improve the statistics in our models. 
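Purely as an illustration of the intended transformation (the inflectional analyzer itself is not described here), a toy version with a tiny hand-built lexicon standing in for the real analysis might look like this:

    # Toy analysis lexicons: surface form -> (root, ending tag).  The real
    # analysis would cover verbs, nouns, adjectives, and adverbs.
    FRENCH_ANALYSES = {"mange": ("manger", "13spres")}
    ENGLISH_ANALYSES = {"eat": ("eat", "x3spres")}

    def analyze(sentence, lexicon):
        """Replace each inflected form by 'root,ending'; leave other words alone."""
        out = []
        for word in sentence.split():
            if word in lexicon:
                root, ending = lexicon[word]
                out.append(root + "," + ending)
            else:
                out.append(word)
        return " ".join(out)

    # analyze("je mange la pêche", FRENCH_ANALYSES) -> "je manger,13spres la pêche"

Because the lexicon maps each surface form to a unique analysis, the transformation is invertible, as required above.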
7. Discussion 
That interesting bilingual lexical correlations can be extracted automatically from a 
large bilingual corpus was pointed out by Brown et al. (1988). The algorithm that 
they describe is, roughly speaking, equivalent to carrying out the first iteration of the 
EM algorithm for our Model 1 starting from an initial guess in which each French 
word is equally probable as a translation for each English word. They were unaware 
of a connection to the EM algorithm, but they did realize that their method is not 
entirely satisfactory. For example, once it is clearly established that in (La porte est
rouge | The door is red), it is red that produces rouge, one is uncomfortable using this
sentence as support for red producing porte or door producing rouge. They suggest
removing words once a correlation between them has been clearly established and then 
reprocessing the resulting impoverished translations hoping to recover less obvious 
correlations now revealed by the departure of their more prominent relatives. From 
our present perspective, we see that the proper way to proceed is simply to carry out 
more iterations of the EM algorithm. The likelihood for Model 1 has a unique local 
maximum for any set of training data. As iterations proceed, the count for porte as a 
translation of red will dwindle away. 
In a later paper, Brown et al. (1990) describe a model that is essentially the same as 
our Model 3. They sketch the EM algorithm and show that, once trained, their model 
can be used to extract word-by-word alignments for pairs of sentences. They did not 
realize that the logarithm of the likelihood for Model 1 is concave and, hence, has a 
unique local maximum. They were also unaware of the trick by which we are able to 
sum over all alignments when evaluating the counts for Models 1 and 2, and of the 
trick by which we are able to sum over all alignments when transferring parameters 
from Model 2 to Model 3. As a result, they were unable to handle large vocabularies 
and so restricted themselves to vocabularies of only 9,000 words. Nonetheless, they 
were able to align phrases in French with the English words that produce them as 
illustrated in their Figure 3. 
More recently, Gale and Church (1991a) describe an algorithm similar to the one 
described in Brown et al. (1988). Like Brown et al., they consider only the simulta- 
neous appearance of words in pairs of sentences that are translations of one another. 
Although algorithms like these are extremely simple, many of the correlations between 
English and French words are so pronounced as to fall prey to almost any effort to 
expose them. Thus, the correlation of pairs like (eau | water), (lait | milk), (pourquoi | why),
(chambre | house), and many others, simply cannot be missed. They shout from the data,
and any method that is not stone deaf will hear them. But many of the correlations 
speak in a softer voice: to hear them clearly, we must model the translation process, as 
Brown et al. (1988) suggest and as Brown et al. (1990) and the current paper actually 
do. Only in this way can one hope to hear the quiet call of (marquées d'un astérisque |
starred) or the whisper of (qui s'est fait bousculer | embattled).
The series of models that we have described constitutes a mathematical embodi- 
ment of the powerfully compelling intuitive feeling that a word in one language can 
be translated into a word or phrase in another language. In some cases, there may 
be several or even several tens of translations depending on the context in which the 
word appears, but we should be quite surprised to find a word with hundreds of 
mutually exclusive translations. Although we use these models as part of an auto- 
matic system for translating French into English, they provide, as a byproduct, very 
satisfying accounts of the word-by-word alignment of pairs of French and English 
strings. 
Our work has been confined to French and English, but we believe that this is 
purely adventitious: had the early Canadian trappers been Manchurians later to be 
outnumbered by swarms of conquistadores, and had the two cultures clung stubbornly 
each to its native tongue, we should now be aligning Spanish and Chinese. We con- 
jecture that local alignment of the component parts of any corpus of parallel texts is 
inherent in the corpus itself, provided only that it be large enough. Between any pair 
of languages where mutual translation is important enough that the rate of accumula- 
tion of translated examples sufficiently exceeds the rate of mutation of the languages 
involved, there must eventually arise such a corpus. 
The linguistic content of our program thus far is scant indeed. It is limited to one 
set of rules for analyzing a string of characters into a string of words, and another 
set of rules for analyzing a string of words into a string of sentences. Doubtless even 
these can be recast in terms of some information theoretic objective function. But it is 
not our intention to ignore linguistics, nor to replace it. Rather, we hope to enfold
it in the embrace of a secure probabilistic framework so that the two together may 
draw strength from one another and guide us to better natural language processing 
systems in general and to better machine translation systems in particular. 
Acknowledgments 
We would like to thank many of our 
colleagues who read and commented on 
early versions of the manuscript, especially 
John Lafferty. We would also like to thank 
the reviewers, who made a number of 
invaluable suggestions about the 
organization of the paper and pointed out 
many weaknesses in our original 
manuscript. If any weaknesses remain, it is 
not because of their failure to point them 
out, but because of our ineptness at 
responding adequately to their criticisms. 
References 
Baum, L. E. (1972). "An inequality and 
associated maximization technique in 
statistical estimation of probabilistic 
functions of a Markov process." 
Inequalities, 3, 1-8. 
Brown, Peter F.; Cocke, John; Della Pietra,
Stephen A.; Della Pietra, Vincent J.; 
Jelinek, Frederick; Lafferty, John D.; 
Mercer, Robert L.; and Roossin, Paul S. 
(1990). "A statistical approach to machine 
translation." Computational Linguistics, 
16(2), 79-85.
Brown, Peter F.; Cocke, John; Della Pietra, 
Stephen A.; Della Pietra, Vincent J.; 
Jelinek, Frederick; Mercer, Robert L.; and 
Roossin, Paul S. (1988). "A statistical 
approach to language translation." In 
Proceedings, 12th International Conference on 
Computational Linguistics (COLING-88). 
Budapest, Hungary, August 1988, 71-76. 
Brown, Peter F.; Della Pietra, Stephen A.; 
Della Pietra, Vincent J.; and Mercer, 
Robert L. (1991a). "A statistical approach 
to sense disambiguation in machine 
translation." In Fourth DARPA Workshop on 
Speech and Natural Language. Morgan 
Kaufmann Publishers, Inc., 146-151. 
Brown, Peter F.; Della Pietra, Stephen A.; 
Della Pietra, Vincent J.; and Mercer, 
Robert L. (1991b). "Word sense 
disambiguation using statistical 
methods." In Proceedings, 29th Annual 
Meeting of the Association for Computational 
Linguistics. Berkeley CA, June 1991, 
265-270. 
Brown, Peter F.; Della Pietra, Vincent J.; 
deSouza, Peter V.; and Mercer, Robert L. 
(1990). "Class-based n-gram models of 
natural language." In Proceedings of the IBM 
Natural Language ITL. Paris, France, March
1990, 283-298. Also in Computational 
Linguistics 18(4), 1992, 467-479. 
Brown, Peter F.; Lai, Jennifer C.; and Mercer, 
Robert L. (1991). "Aligning sentences in 
parallel corpora." In Proceedings, 29th 
Annual Meeting of the Association for 
Computational Linguistics. Berkeley CA, 
June 1991, 169-176. 
Dempster, A. P.; Laird, N. M.; and Rubin,
D. B. (1977). "Maximum likelihood from 
incomplete data via the EM algorithm." 
Journal of the Royal Statistical Society, 39(B), 
1-38. 
Gale, William A.; and Church, Kenneth W. 
(1991a). "Identifying word 
correspondences in parallel texts." In 
Fourth DARPA Workshop on Speech and 
Natural Language. Morgan Kaufmann 
Publishers, Inc., 152-157. 
Gale, William A.; and Church, Kenneth W. 
(1991b). "A program for aligning 
sentences in bilingual corpora." In 
Proceedings, 29th Annual Meeting of the 
Association for Computational Linguistics. 
Berkeley CA, June 1991, 177-184. 
Itô, Kiyoshi, editor. (1987). Encyclopedic
Dictionary of Mathematics, Second Edition. 
MIT Press. 
Kay, Martin (1991). "Text-translation 
alignment." In ACH/ALLC "91: "Making 
Connections" Conference Handbook. Tempe 
AZ, March 1991. 
Maltese, G., and Mancini, F. (1992). "An 
automatic technique to include 
grammatical and morphological 
information in a trigram-based statistical 
language model." In Proceedings, IEEE 
International Conference on Acoustics, Speech 
and Signal Processing. San Francisco CA, 
March 1992, I-157-I-160.
Warwick, Susan, and Russell, Graham 
(1990). "Bilingual concordancing and 
bilingual lexicography." In EURALEX 4th 
International Congress. Málaga, Spain.
Weaver, W. (1955). Translation (1949). In 
Machine Translation of Languages. MIT 
Press. 
Appendix A: Table of Notation 
ℰ        English vocabulary
e        English word
e        English string
E        random English string
l        length of e
L        random length of E
i        position in e, i = 0, 1, ..., l
e_i      word i of e
e_0      the empty cept
e_1^i    e_1 e_2 ... e_i

ℱ        French vocabulary
f        French word
f        French string
F        random French string
m        length of f
M        random length of F
j        position in f, j = 1, 2, ..., m
f_j      word j of f
f_1^j    f_1 f_2 ... f_j

a        alignment
a_j              position in e connected to position j of f for alignment a
a_1^j            a_1 a_2 ... a_j
φ_i              number of positions of f connected to position i of e
φ_1^i            φ_1 φ_2 ... φ_i
τ                tableau: a sequence of tablets, where a tablet is a sequence of French words
τ_i              tablet i of τ
τ_0^i            τ_0 τ_1 ... τ_i
φ_i              length of τ_i
k                position within a tablet, k = 1, 2, ..., φ_i
τ_ik             word k of τ_i
π                a permutation of the positions of a tableau
π_ik             position in f for word k of τ_i for permutation π
π_i1^k           π_i1 π_i2 ... π_ik
V(f | e)         Viterbi alignment for (f | e)
V_{i↔j}(f | e)   Viterbi alignment for (f | e) with ij pegged
N(a)             neighboring alignments of a
N_{i↔j}(a)       neighboring alignments of a with ij pegged
b(a)             alignment in N(a) with greatest probability
b^∞(a)           alignment obtained by applying b repeatedly to a
b_{i↔j}(a)       alignment in N_{i↔j}(a) with greatest probability
b^∞_{i↔j}(a)     alignment obtained by applying b_{i↔j} repeatedly to a
A(e)             class of English word e
B(f)             class of French word f
Δj               displacement of a word in f
v                vacancies in f
ρ_i              first position in e to the left of i that has non-zero fertility
c_i              average position in f of the words connected to position i of e
[i]              position in e of the i-th one-word cept
⊙_i              c_[i]
P_θ              translation model P with parameter values θ
C(f, e)          empirical distribution of a sample
ψ(P_θ)           log-likelihood objective function
R(P̃_θ̃, P_θ)      relative objective function

t(f | e)         translation probabilities (all models)
ε(m | l)              string length probabilities (Models 1 and 2)
n(φ | e)              fertility probabilities (Models 3, 4, and 5)
p_0, p_1              fertility probabilities for e_0 (Models 3, 4, and 5)
a(i | j, l, m)        alignment probabilities (Model 2)
d(j | i, l, m)        distortion probabilities (Model 3)
d_1(Δj | A, B)        distortion probabilities for the first word of a tablet (Model 4)
d_{>1}(Δj | B)        distortion probabilities for the other words of a tablet (Model 4)
d_1(Δj | B, v)        distortion probabilities for the first word of a tablet (Model 5)
d_{>1}(Δj | B, v)     distortion probabilities for the other words of a tablet (Model 5)
Appendix B: Summary of Models 
We collect here brief descriptions of our various translation models and the formulae 
needed for training them. 
B.1 Translation Models 
An English-to-French translation model P with parameters θ is a formula for calculating
a conditional probability, or likelihood, P_θ(f | e), for any string f of French words and
any string e of English words. These probabilities satisfy

P_\theta(f \mid e) \ge 0, \qquad P_\theta(\textit{failure} \mid e) \ge 0,
P_\theta(\textit{failure} \mid e) + \sum_{f} P_\theta(f \mid e) = 1,    (50)

where the sum ranges over all French strings f, and failure is a special symbol not in
the French vocabulary. We interpret P_θ(f | e) as the probability that a translator will
produce f when given e, and P_θ(failure | e) as the probability that he will produce no
translation when given e. We call a model deficient if P_θ(failure | e) is greater than zero
for some e.
Log-Likelihood Objective Function. The log-likelihood of a sample of translations
(f^{(s)}, e^{(s)}), s = 1, 2, ..., S, is

\psi(P_\theta) = S^{-1} \sum_{s=1}^{S} \log P_\theta(f^{(s)} \mid e^{(s)}) = \sum_{f,e} C(f, e) \log P_\theta(f \mid e).    (51)

Here C(f, e) is the empirical distribution of the sample, so that C(f, e) is 1/S times the
number of times (usually 0 or 1) that the pair (f, e) occurs in the sample. We determine
values for the parameters θ so as to maximize this log-likelihood for a large training
sample of translations.
Hidden Alignment Models. For the models that we present here, we can express
P_θ(f | e) as a sum of the probabilities of hidden alignments a between e and f:

P_\theta(f \mid e) = \sum_{a} P_\theta(f, a \mid e).    (52)

For our models, the only alignments that have positive probability are those for which
each word of f is connected to at most one word of e.
Relative Objective Function. We can compare hidden alignment models P̃_θ̃ and P_θ
using the relative objective function¹

R(\tilde{P}_{\tilde\theta}, P_\theta) \equiv \sum_{f,e} C(f, e) \sum_{a} \tilde{P}_{\tilde\theta}(a \mid f, e) \log \frac{P_\theta(f, a \mid e)}{\tilde{P}_{\tilde\theta}(f, a \mid e)},    (54)

where P̃_θ̃(a | f, e) = P̃_θ̃(a, f | e)/P̃_θ̃(f | e). Note that R(P̃_θ̃, P̃_θ̃) = 0. R is related to ψ by
Jensen's inequality

\psi(P_\theta) \ge \psi(\tilde{P}_{\tilde\theta}) + R(\tilde{P}_{\tilde\theta}, P_\theta),    (55)

which follows because the logarithm is concave. In fact, for any e and f,

\sum_{a} \tilde{P}_{\tilde\theta}(a \mid f, e) \log \frac{P_\theta(f, a \mid e)}{\tilde{P}_{\tilde\theta}(f, a \mid e)} \le \log \sum_{a} \tilde{P}_{\tilde\theta}(a \mid f, e) \frac{P_\theta(f, a \mid e)}{\tilde{P}_{\tilde\theta}(f, a \mid e)}    (56)

= \log \frac{P_\theta(f \mid e)}{\tilde{P}_{\tilde\theta}(f \mid e)} = \log P_\theta(f \mid e) - \log \tilde{P}_{\tilde\theta}(f \mid e).    (57)

Summing over e and f and using the Definitions (51) and (54) we arrive at Equation (55).
B.2 Iterative Improvement 
We cannot create a good model or find good parameter values at a stroke. Rather 
we employ a process of iterative improvement. For a given model we use current 
parameter values to find better ones, and in this way, from initial values we find 
locally optimal ones. Then, given good parameter values for one model, we use them 
to find initial parameter values for another model. By alternating between these two 
steps we proceed through a sequence of gradually more sophisticated models. 
Improving Parameter Values. From Jensen's inequality (55), we see that ψ(P_θ) is
greater than ψ(P̃_θ̃) if R(P̃_θ̃, P_θ) is positive. With P̃ = P, this suggests the following
iterative procedure, known as the EM Algorithm (Baum 1972; Dempster, Laird, and
Rubin 1977), for finding locally optimal parameter values θ for a model P:

0. Choose some initial values θ̃.
1. Repeat Steps 2-3 until convergence.
2. With θ̃ fixed, find the values θ that maximize R(P_θ̃, P_θ).
3. Replace θ̃ by θ.

Note that for any θ̃, R(P_θ̃, P_θ) is non-negative at its maximum in θ, since it is zero for
θ = θ̃. Thus ψ(P_θ) will not decrease from one iteration to the next.

¹ The reader will notice a similarity between R(P̃_θ̃, P_θ) and the relative entropy

D(p, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}    (53)

between probability distributions p and q. However, whereas the relative entropy is never negative, R can take any value. The inequality (55) for R is the analog of the inequality D ≥ 0 for D.
Going From One Model to Another. Jensen's inequality also suggests a method for
using parameter values θ̃ for one model P̃ to find initial parameter values θ for another
model P:

4. With P̃ and θ̃ fixed, find the values θ that maximize R(P̃_θ̃, P_θ).

In contrast to the case where P̃ = P, there may not be any θ for which R(P̃_θ̃, P_θ) is
non-negative. Thus, it could be that, even for the best θ, ψ(P_θ) < ψ(P̃_θ̃).
Parameter Reestimation Formulae. In order to apply these algorithms, we need to 
solve the maximization problems of Steps 2 and 4. For the models that we consider, 
we can do this explicitly. To exhibit the basic form of the solution, we suppose P_θ is a
model given by

P_\theta(f, a \mid e) = \prod_{\omega \in \Omega} \theta(\omega)^{c(\omega;\, a, f, e)},    (58)

where the θ(ω), ω ∈ Ω, are real-valued parameters satisfying the constraints

\theta(\omega) \ge 0, \qquad \sum_{\omega \in \Omega} \theta(\omega) = 1,    (59)

and for each ω and (a, f, e), c(ω; a, f, e) is a non-negative integer.² We interpret θ(ω) as
the probability of the event ω and c(ω; a, f, e) as the number of times that this event
occurs in (a, f, e). Note that

c(\omega;\, a, f, e) = \theta(\omega) \frac{\partial}{\partial \theta(\omega)} \log P_\theta(f, a \mid e).    (60)

² More generally, we can allow c(ω; a, f, e) to be a non-negative real number.

The values for θ that maximize the relative objective function R(P̃_θ̃, P_θ) subject to
the constraints (59) are determined by the Kuhn-Tucker conditions

\frac{\partial}{\partial \theta(\omega)} R(\tilde{P}_{\tilde\theta}, P_\theta) - \lambda = 0, \qquad \omega \in \Omega,    (61)

where λ is a Lagrange multiplier, the value of which is determined by the equality
constraint in Equation (59). These conditions are both necessary and sufficient for a
maximum since R(P̃_θ̃, P_θ) is a concave function of the θ(ω). By multiplying Equation
(61) by θ(ω) and using Equation (60) and Definition (54) of R, we obtain the parameter
reestimation formulae

\theta(\omega) = \lambda^{-1} \tilde{c}_{\tilde\theta}(\omega), \qquad \lambda = \sum_{\omega \in \Omega} \tilde{c}_{\tilde\theta}(\omega),    (62)

\tilde{c}_{\tilde\theta}(\omega) = \sum_{f,e} C(f, e)\, \tilde{c}_{\tilde\theta}(\omega;\, f, e),    (63)

\tilde{c}_{\tilde\theta}(\omega;\, f, e) = \sum_{a} \tilde{P}_{\tilde\theta}(a \mid f, e)\, c(\omega;\, a, f, e).    (64)

We interpret c̃_θ̃(ω; f, e) as the expected number of times, as computed by the model
P̃_θ̃, that the event ω occurs in the translation of e to f. Thus θ(ω) is the (normalized)
expected number of times, as computed by model P̃_θ̃, that ω occurs in the translations
of the training sample.
We can easily generalize these formulae to allow models of the form (58) for which
the single equality constraint in Equation (59) is replaced by multiple constraints

\sum_{\omega \in \Omega_\mu} \theta(\omega) = 1, \qquad \mu = 1, 2, \ldots,    (65)

where the subsets Ω_μ, μ = 1, 2, ... form a partition of Ω. We need only modify Equation
(62) by allowing a different λ_μ for each μ: if ω ∈ Ω_μ, then

\theta(\omega) = \lambda_\mu^{-1} \tilde{c}_{\tilde\theta}(\omega), \qquad \lambda_\mu = \sum_{\omega \in \Omega_\mu} \tilde{c}_{\tilde\theta}(\omega).    (66)
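In code, the reestimation step (62)-(66) is simply a normalization of expected counts within each constraint block Ω_μ; a generic sketch, with the partition represented by a function mapping each event to its block:

    from collections import defaultdict

    def reestimate(expected_counts, block_of):
        """Normalize expected counts within each constraint block (Equation (66)).

        expected_counts -- dict: omega -> expected count c~(omega)
        block_of        -- function: omega -> mu, the block Omega_mu containing omega
        """
        block_totals = defaultdict(float)
        for omega, count in expected_counts.items():
            block_totals[block_of(omega)] += count
        return {omega: count / block_totals[block_of(omega)]
                for omega, count in expected_counts.items()}

For the translation probabilities t(f | e), for example, the block of an event (f, e) is just e, so the expected counts are normalized separately for each English word.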
B.3 Model 1 
Parameters.

ε(m | l)    string length probabilities
t(f | e)    translation probabilities

Here f ∈ ℱ; e ∈ ℰ or e = e_0; l = 1, 2, ...; and m = 1, 2, ....

General Formula.

P_\theta(f, a \mid e) = P_\theta(m \mid e)\, P_\theta(a \mid m, e)\, P_\theta(f \mid a, m, e)    (67)

Assumptions.

P_\theta(m \mid e) = \epsilon(m \mid l)    (68)

P_\theta(a \mid m, e) = (l + 1)^{-m}    (69)

P_\theta(f \mid a, m, e) = \prod_{j=1}^{m} t(f_j \mid e_{a_j})    (70)

This model is not deficient.
Generation. Equations (67)-(70) describe the following process for producing f from e:

1. Choose a length m for f according to the probability distribution ε(m | l).
2. For each j = 1, 2, ..., m, choose a_j from 0, 1, 2, ..., l according to the
   uniform distribution.
3. For each j = 1, 2, ..., m, choose a French word f_j according to the
   distribution t(f_j | e_{a_j}).
Useful Formulae. Because of the independence assumptions (68)-(70), we can calculate
the sum over alignments (52) in closed form:

P_\theta(f \mid e) = \sum_{a} P_\theta(f, a \mid e)    (71)

= \epsilon(m \mid l)(l + 1)^{-m} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j \mid e_{a_j})    (72)

= \epsilon(m \mid l)(l + 1)^{-m} \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i).    (73)

Equation (73) is useful in computations since it involves only O(lm) arithmetic opera-
tions, whereas the original sum over alignments (72) involves O(l^m) operations.
Concavity. The objective function (51) for this model is a strictly concave function of
the parameters. In fact, from Equations (51) and (73),

\psi(P_\theta) = \sum_{f,e} C(f, e) \sum_{j=1}^{m} \log \sum_{i=0}^{l} t(f_j \mid e_i) + \sum_{f,e} C(f, e) \log \epsilon(m \mid l) + \text{constant},    (74)

which is clearly concave in ε(m | l) and t(f | e) since the logarithm of a sum is concave,
and the sum of concave functions is concave.
Because ψ is concave, it has a unique local maximum. Moreover, we will find this
maximum using the EM algorithm, provided that none of our initial parameter values
is zero.
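As an illustration of how Equations (62)-(64) combine with the factorization (73), here is a compact, unoptimized EM iteration for Model 1; the empty cept e_0 is represented by the token None, and 10^{-12} is used as a floor for unseen pairs:

    from collections import defaultdict

    def model1_em_iteration(corpus, t, floor=1e-12):
        """One EM iteration for Model 1.

        corpus -- list of (f_words, e_words) pairs; the empty cept e_0 is
                  represented internally by None at position 0.
        t      -- dict mapping (f, e) -> current t(f | e).
        Returns the reestimated translation table.
        """
        counts = defaultdict(float)   # expected counts c~(f | e)
        totals = defaultdict(float)   # normalizers, one per English word
        for f_words, e_words in corpus:
            es = [None] + list(e_words)           # positions 0 .. l
            for f in f_words:
                # By the factorization (73), the posterior over the English
                # position of each French word is proportional to t(f | e_i).
                denom = sum(t.get((f, e), floor) for e in es)
                for e in es:
                    p = t.get((f, e), floor) / denom
                    counts[(f, e)] += p
                    totals[e] += p
        return {(f, e): c / totals[e] for (f, e), c in counts.items()}

Starting from a uniform table and iterating this function realizes the EM algorithm of Section B.2 for Model 1; because the objective is concave, the starting point does not matter so long as no initial value is zero.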
B.4 Model 2 
Parameters.

ε(m | l)         string length probabilities
t(f | e)         translation probabilities
a(i | j, l, m)   alignment probabilities

Here i = 0, ..., l; and j = 1, ..., m.

General Formula.

P_\theta(f, a \mid e) = P_\theta(m \mid e)\, P_\theta(a \mid m, e)\, P_\theta(f \mid a, m, e)    (75)
Assumptions.

P_\theta(m \mid e) = \epsilon(m \mid l)    (76)

P_\theta(a \mid m, e) = \prod_{j=1}^{m} a(a_j \mid j, l, m)    (77)

P_\theta(f \mid a, m, e) = \prod_{j=1}^{m} t(f_j \mid e_{a_j})    (78)

This model is not deficient. Model 1 is the special case of this model in which the
alignment probabilities are uniform: a(i | j, l, m) = (l + 1)^{-1} for all i.
Generation. Equations (75)-(78) describe the following process for producing f from e:

1. Choose a length m for f according to the distribution ε(m | l).
2. For each j = 1, 2, ..., m, choose a_j from 0, 1, 2, ..., l according to the
   distribution a(a_j | j, l, m).
3. For each j, choose a French word f_j according to the distribution t(f_j | e_{a_j}).
Useful Formulae. Just as for Model 1, the independence assumptions allow us to
calculate the sum over alignments (52) in closed form:

P_\theta(f \mid e) = \sum_{a} P_\theta(f, a \mid e)    (79)

= \epsilon(m \mid l) \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j \mid e_{a_j})\, a(a_j \mid j, l, m)    (80)

= \epsilon(m \mid l) \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)\, a(i \mid j, l, m).    (81)

By assumption (77) the connections of a are independent given the length m of f.
Using Equation (81) we find that they are also independent given f:

P_\theta(a \mid f, e) = \prod_{j=1}^{m} p_\theta(a_j \mid j, f, e),    (82)

where

p_\theta(i \mid j, f, e) = \frac{\gamma(i, j, f, e)}{\sum_{i'} \gamma(i', j, f, e)} \quad \text{with} \quad \gamma(i, j, f, e) = t(f_j \mid e_i)\, a(i \mid j, l, m).    (83)
Viterbi Alignment. For this model, and thus also for Model 1, we can express in
closed form the Viterbi alignment V(f | e) between a pair of strings (f, e):

V(f \mid e)_j = \mathop{\mathrm{argmax}}_{i} \; t(f_j \mid e_i)\, a(i \mid j, l, m).    (84)
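A direct transcription of Equation (84) as code, under the assumption that the trained parameters are stored in dictionaries t and a (our own data layout):

    def viterbi_alignment_model2(f_words, e_words, t, a):
        """Viterbi alignment for Model 2 (Equation (84)).

        f_words -- French sentence as a list of words (positions 1 .. m)
        e_words -- English sentence as a list of words (positions 1 .. l);
                   position 0 is the empty cept e_0, represented by None
        t       -- dict: (f, e) -> t(f | e)
        a       -- dict: (i, j, l, m) -> a(i | j, l, m)
        Returns, for each French position j, the connected English position a_j.
        """
        es = [None] + list(e_words)
        l, m = len(e_words), len(f_words)
        alignment = []
        for j, f in enumerate(f_words, start=1):
            best_i = max(range(l + 1),
                         key=lambda i: t.get((f, es[i]), 0.0) * a.get((i, j, l, m), 0.0))
            alignment.append(best_i)
        return alignment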
Parameter Reestimation Formulae. We can find the parameter values θ that maximize
the relative objective function R(P̃_θ̃, P_θ) by applying the considerations of Section B.2.
The counts c(ω; a, f, e) of Equation (58) are

c(f \mid e;\, a, f, e) = \sum_{j=1}^{m} \delta(e, e_{a_j})\, \delta(f, f_j),    (85)

c(i \mid j, l, m;\, a, f, e) = \delta(i, a_j).    (86)

We obtain the parameter reestimation formulae for t(f | e) and a(i | j, l, m) by using
these counts in Equations (62)-(66).
Equation (64) requires a sum over alignments. If P̃_θ̃ satisfies

\tilde{P}_{\tilde\theta}(a \mid f, e) = \prod_{j=1}^{m} \tilde{p}_{\tilde\theta}(a_j \mid j, f, e),    (87)

as is the case for Models 1 and 2 (see Equation (82)), then this sum can be calculated
explicitly:

\tilde{c}_{\tilde\theta}(f \mid e;\, f, e) = \sum_{i=0}^{l} \sum_{j=1}^{m} \tilde{p}_{\tilde\theta}(i \mid j, f, e)\, \delta(e, e_i)\, \delta(f, f_j),    (88)

\tilde{c}_{\tilde\theta}(i \mid j;\, f, e) = \tilde{p}_{\tilde\theta}(i \mid j, f, e).    (89)

Equations (85)-(89) involve only O(lm) arithmetic operations, whereas the sum over
alignments involves O(l^m) operations.
B.5 Model 3 
Parameters.

t(f | e)         translation probabilities
n(φ | e)         fertility probabilities
p_0, p_1         fertility probabilities for e_0
d(j | i, l, m)   distortion probabilities

Here φ = 0, 1, 2, ....

General Formulae.

P_\theta(\tau, \pi \mid e) = P_\theta(\phi \mid e)\, P_\theta(\tau \mid \phi, e)\, P_\theta(\pi \mid \tau, \phi, e)    (90)

P_\theta(f, a \mid e) = \sum_{(\tau, \pi) \in \langle f, a \rangle} P_\theta(\tau, \pi \mid e)    (91)

Here ⟨f, a⟩ is the set of all (τ, π) consistent with (f, a):

(\tau, \pi) \in \langle f, a \rangle \text{ if for all } i = 0, 1, \ldots, l \text{ and } k = 1, 2, \ldots, \phi_i, \quad f_{\pi_{ik}} = \tau_{ik} \text{ and } a_{\pi_{ik}} = i.    (92)
Assumptions.

P_\theta(\phi \mid e) = n_0\!\left(\phi_0 \,\Big|\, \sum_{i=1}^{l} \phi_i\right) \prod_{i=1}^{l} n(\phi_i \mid e_i)    (93)

P_\theta(\tau \mid \phi, e) = \prod_{i=0}^{l} \prod_{k=1}^{\phi_i} t(\tau_{ik} \mid e_i)    (94)

P_\theta(\pi \mid \tau, \phi, e) = \frac{1}{\phi_0!} \prod_{i=1}^{l} \prod_{k=1}^{\phi_i} d(\pi_{ik} \mid i, l, m)    (95)

where

n_0(\phi_0 \mid m') = \binom{m'}{\phi_0} p_0^{\,m' - \phi_0}\, p_1^{\,\phi_0}.    (96)

In Equation (95) the factor of 1/φ_0! accounts for the choices of π_{0k}, k = 1, 2, ..., φ_0. This
model is deficient, since

P_\theta(\textit{failure} \mid \tau, \phi, e) = 1 - \sum_{\pi} P_\theta(\pi \mid \tau, \phi, e) > 0.    (97)
Generation. Equations (90)-(95) describe the following process for producing f or failure from e:

1. For each $i = 1, 2, \dots, l$, choose a length $\phi_i$ for $\tau_i$ according to the distribution $n(\phi_i \mid e_i)$.
2. Choose a length $\phi_0$ for $\tau_0$ according to the distribution $n_0(\phi_0 \mid \sum_{i=1}^{l} \phi_i)$.
3. Let $m = \phi_0 + \sum_{i=1}^{l} \phi_i$.
4. For each $i = 0, 1, \dots, l$ and each $k = 1, 2, \dots, \phi_i$, choose a French word $\tau_{ik}$ according to the distribution $t(\tau_{ik} \mid e_i)$.
5. For each $i = 1, 2, \dots, l$ and each $k = 1, 2, \dots, \phi_i$, choose a position $\pi_{ik}$ from $1, \dots, m$ according to the distribution $d(\pi_{ik} \mid i, l, m)$.
6. If any position has been chosen more than once then return failure.
7. For each $k = 1, 2, \dots, \phi_0$, choose a position $\pi_{0k}$ from the $\phi_0 - k + 1$ remaining vacant positions in $1, 2, \dots, m$ according to the uniform distribution.
8. Let f be the string with $f_{\pi_{ik}} = \tau_{ik}$.
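The deficiency of Model 3 shows up directly when one samples from it: Step 6 can discard a draw as a failure. A minimal sketch of Steps 1-8 in Python, with hypothetical tables `n_tab`, `t_tab`, and `d_tab` standing in for $n(\phi \mid e)$, $t(f \mid e)$, and $d(j \mid i, l, m)$, and numbers `p0`, `p1` for the $e_0$ fertility parameters:

import math
import random

def sample_model3(e, n_tab, t_tab, d_tab, p0, p1, rng=random.Random(0)):
    """One draw of f (or 'failure') from e under Model 3 (Steps 1-8 above).
    All parameter tables are hypothetical dictionaries; a sketch only."""
    l = len(e) - 1
    # Steps 1-2: fertilities for e_1..e_l, then for the null word e_0.
    phi = [0] + [rng.choices(*zip(*n_tab[e[i]].items()))[0] for i in range(1, l + 1)]
    m1 = sum(phi[1:])
    w = [math.comb(m1, k) * p0 ** (m1 - k) * p1 ** k for k in range(m1 + 1)]
    phi[0] = rng.choices(range(m1 + 1), weights=w)[0]      # n_0 of Equation (96)
    m = m1 + phi[0]                                        # Step 3
    # Step 4: choose the French words of each tablet.
    tau = [[rng.choices(*zip(*t_tab[e[i]].items()))[0] for _ in range(phi[i])]
           for i in range(l + 1)]
    # Step 5: positions for the words of tablets 1..l; Step 6: fail on collision.
    f = [None] * m
    for i in range(1, l + 1):
        for k in range(phi[i]):
            pos = rng.choices(range(1, m + 1),
                              weights=[d_tab[(j, i, l, m)] for j in range(1, m + 1)])[0]
            if f[pos - 1] is not None:
                return "failure"
            f[pos - 1] = tau[i][k]
    # Step 7: place the words of tablet 0 uniformly in the remaining vacancies.
    vacant = [j for j in range(m) if f[j] is None]
    rng.shuffle(vacant)
    for k in range(phi[0]):
        f[vacant[k]] = tau[0][k]
    return f                                               # Step 8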
Useful Formulae. From Equations (93)-(95) it follows that if $(\tau, \pi)$ is consistent with $(\mathbf{f}, \mathbf{a})$ then
\[ P_\theta(\tau \mid \phi, \mathbf{e}) = \prod_{j=1}^{m} t(f_j \mid e_{a_j}), \tag{98} \]
\[ P_\theta(\pi \mid \tau, \phi, \mathbf{e}) = \frac{1}{\phi_0!} \prod_{j : a_j \neq 0} d(j \mid a_j, l, m). \tag{99} \]
In Equation (99), the product runs over all $j = 1, 2, \dots, m$ except those for which $a_j = 0$.
By summing over all pairs $(\tau, \pi)$ consistent with $(\mathbf{f}, \mathbf{a})$ we find
\[ P_\theta(\mathbf{f}, \mathbf{a} \mid \mathbf{e}) = \sum_{(\tau, \pi) \in \langle \mathbf{f}, \mathbf{a} \rangle} P_\theta(\tau, \pi \mid \mathbf{e}) \tag{100} \]
\[ = n_0\Bigl(\phi_0 \Bigm| \sum_{i=1}^{l} \phi_i\Bigr) \prod_{i=1}^{l} n(\phi_i \mid e_i)\, \phi_i! \;\prod_{j=1}^{m} t(f_j \mid e_{a_j}) \prod_{j : a_j \neq 0} d(j \mid a_j, l, m). \tag{101} \]
The factors of $\phi_i!$ in Equation (101) arise because there are $\prod_{i=0}^{l} \phi_i!$ equally probable terms in the sum (100).
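Equation (101) gives $P_\theta(\mathbf{f}, \mathbf{a} \mid \mathbf{e})$ directly from the alignment, with the fertilities read off as $\phi_i = \sum_j \delta(i, a_j)$. A short sketch (hypothetical tables as before):

import math

def model3_joint(f, a, e, n_tab, t_tab, d_tab, p0, p1):
    """P(f, a | e) under Model 3 via Equation (101).
    a[j-1] is the connection of French position j; tables are hypothetical."""
    l, m = len(e) - 1, len(f)
    phi = [a.count(i) for i in range(l + 1)]              # fertilities
    m1 = m - phi[0]                                       # sum of phi_1..phi_l
    p = math.comb(m1, phi[0]) * p0 ** (m1 - phi[0]) * p1 ** phi[0]   # n_0, Eq. (96)
    for i in range(1, l + 1):
        p *= n_tab[e[i]].get(phi[i], 0.0) * math.factorial(phi[i])
    for j in range(1, m + 1):
        p *= t_tab[e[a[j - 1]]].get(f[j - 1], 0.0)
        if a[j - 1] != 0:
            p *= d_tab[(j, a[j - 1], l, m)]
    return p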
Parameter Reestimation Formulae. We can find the parameter values $\theta$ that maximize the relative objective function $R(P_{\tilde\theta}, P_\theta)$ by applying the considerations of Section B.2. The counts $c(\omega; \mathbf{a}, \mathbf{f}, \mathbf{e})$ of Equation (58) are
\[ c(f \mid e; \mathbf{a}, \mathbf{f}, \mathbf{e}) = \sum_{j=1}^{m} \delta(e, e_{a_j})\, \delta(f, f_j), \tag{102} \]
\[ c(j \mid i, l, m; \mathbf{a}, \mathbf{f}, \mathbf{e}) = \delta(i, a_j), \tag{103} \]
\[ c(\phi \mid e; \mathbf{a}, \mathbf{f}, \mathbf{e}) = \sum_{i=1}^{l} \delta(e, e_i)\, \delta(\phi, \phi_i). \tag{104} \]
We obtain the parameter reestimation formulae for $t(f \mid e)$, $d(j \mid i, l, m)$, and $n(\phi \mid e)$ by using these counts in Equations (62)-(66).
Equation (64) requires a sum over alignments. If $P_{\tilde\theta}$ satisfies
\[ P_{\tilde\theta}(\mathbf{a} \mid \mathbf{f}, \mathbf{e}) = \prod_{j=1}^{m} p_{\tilde\theta}(a_j \mid j, \mathbf{f}, \mathbf{e}), \tag{105} \]
as is the case for Models 1 and 2 (see Equation (82)), then this sum can be calculated explicitly for $c_{\tilde\theta}(f \mid e; \mathbf{f}, \mathbf{e})$ and $c_{\tilde\theta}(j \mid i; \mathbf{f}, \mathbf{e})$:
\[ c_{\tilde\theta}(f \mid e; \mathbf{f}, \mathbf{e}) = \sum_{i=0}^{l} \sum_{j=1}^{m} p_{\tilde\theta}(i \mid j, \mathbf{f}, \mathbf{e})\, \delta(e, e_i)\, \delta(f, f_j), \tag{106} \]
\[ c_{\tilde\theta}(j \mid i; \mathbf{f}, \mathbf{e}) = p_{\tilde\theta}(i \mid j, \mathbf{f}, \mathbf{e}). \tag{107} \]
Unfortunately, there is no analogous formula for $c_{\tilde\theta}(\phi \mid e; \mathbf{f}, \mathbf{e})$. Instead we must be content with
\[ c_{\tilde\theta}(\phi \mid e; \mathbf{f}, \mathbf{e}) = \sum_{i=1}^{l} \delta(e, e_i) \prod_{j=1}^{m} \bigl(1 - p_{\tilde\theta}(i \mid j, \mathbf{f}, \mathbf{e})\bigr) \sum_{\gamma \in \Gamma_\phi} \prod_{k=1}^{\phi} \frac{\alpha_{ik}(\mathbf{f}, \mathbf{e})^{\gamma_k}}{\gamma_k!}, \tag{108} \]
\[ \alpha_{ik}(\mathbf{f}, \mathbf{e}) = \frac{(-1)^{k+1}}{k} \sum_{j=1}^{m} \beta_{ij}(\mathbf{f}, \mathbf{e})^k, \tag{109} \]
\[ \beta_{ij}(\mathbf{f}, \mathbf{e}) = \frac{p_{\tilde\theta}(i \mid j, \mathbf{f}, \mathbf{e})}{1 - p_{\tilde\theta}(i \mid j, \mathbf{f}, \mathbf{e})}. \tag{110} \]
In Equation (108), $\Gamma_\phi$ denotes the set of all partitions of $\phi$.
Recall that a partition of $\phi$ is a decomposition of $\phi$ as a sum of positive integers. For example, $\phi = 5$ has 7 partitions, since $1+1+1+1+1 = 1+1+1+2 = 1+1+3 = 1+2+2 = 1+4 = 2+3 = 5$. For a partition $\gamma$, we let $\gamma_k$ be the number of times that $k$ appears in the sum, so that $\phi = \sum_{k=1}^{\phi} k\,\gamma_k$. If $\gamma$ is the partition corresponding to $1+1+3$, then $\gamma_1 = 2$, $\gamma_3 = 1$, and $\gamma_k = 0$ for $k$ other than 1 or 3. We adopt the convention that $\Gamma_0$ consists of the single element $\gamma$ with $\gamma_k = 0$ for all $k$.
Equation (108) allows us to compute the counts $c_{\tilde\theta}(\phi \mid e; \mathbf{f}, \mathbf{e})$ in $O(lm + \phi g)$ operations, where $g$ is the number of partitions of $\phi$. Although $g$ grows with $\phi$ like $(4\sqrt{3}\,\phi)^{-1} \exp \pi\sqrt{2\phi/3}$ [11], it is manageably small for small $\phi$. For example, $\phi = 10$ has 42 partitions.
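For the fertility counts, then, Equation (108) trades the exponential sum over alignments for a sum over the partitions of $\phi$. A sketch of that computation, assuming the posteriors $p_{\tilde\theta}(i \mid j, \mathbf{f}, \mathbf{e})$ have already been computed (for Models 1 and 2 they come from Equation (83)); the partition enumerator and the name `fertility_count` are illustrative, not the paper's:

import math

def partitions(phi):
    """Enumerate Gamma_phi as multiplicity vectors (gamma_1, ..., gamma_phi),
    where gamma_k is the number of times k appears in the partition.
    For phi = 0 this yields the single empty partition, as in the text."""
    def parts(n, max_k):
        if n == 0:
            yield []
            return
        for k in range(min(n, max_k), 0, -1):
            for rest in parts(n - k, k):
                yield [k] + rest
    for p in parts(phi, phi):
        gamma = [0] * phi
        for k in p:
            gamma[k - 1] += 1
        yield gamma

def fertility_count(phi, e_word, e, post):
    """c(phi | e_word; f, e) via Equations (108)-(110).
    post[j-1][i] is a posterior p(i | j, f, e), assumed to lie strictly
    between 0 and 1; e[0] is the null word. A sketch only."""
    l, m = len(e) - 1, len(post)
    total = 0.0
    for i in range(1, l + 1):
        if e[i] != e_word:                                 # the delta(e, e_i) factor
            continue
        beta = [post[j][i] / (1.0 - post[j][i]) for j in range(m)]   # Eq. (110)
        prod = 1.0
        for j in range(m):
            prod *= 1.0 - post[j][i]
        alpha = [((-1) ** (k + 1) / k) * sum(b ** k for b in beta)   # Eq. (109)
                 for k in range(1, phi + 1)]
        s = 0.0
        for gamma in partitions(phi):                      # the sum over Gamma_phi
            term = 1.0
            for k, g in enumerate(gamma, start=1):
                if g:
                    term *= alpha[k - 1] ** g / math.factorial(g)
            s += term
        total += prod * s
    return total

As a quick check, `sum(1 for _ in partitions(5))` is 7 and `sum(1 for _ in partitions(10))` is 42, matching the counts quoted above.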
Proof of Formula (108). Introduce the generating functional
\[ G(x \mid e, \mathbf{f}, \mathbf{e}) = \sum_{\phi=0}^{\infty} c_{\tilde\theta}(\phi \mid e; \mathbf{f}, \mathbf{e})\, x^{\phi}, \tag{111} \]
where $x$ is an indeterminant. Then
\[ G(x \mid e, \mathbf{f}, \mathbf{e}) = \sum_{\phi=0}^{\infty} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} p_{\tilde\theta}(a_j \mid j, \mathbf{f}, \mathbf{e}) \sum_{i=1}^{l} \delta(e, e_i)\, \delta(\phi, \phi_i)\, x^{\phi} \tag{112} \]
\[ = \sum_{i=1}^{l} \delta(e, e_i) \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} p_{\tilde\theta}(a_j \mid j, \mathbf{f}, \mathbf{e})\, x^{\phi_i} \tag{113} \]
\[ = \sum_{i=1}^{l} \delta(e, e_i) \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} p_{\tilde\theta}(a_j \mid j, \mathbf{f}, \mathbf{e})\, x^{\delta(i, a_j)} \tag{114} \]
\[ = \sum_{i=1}^{l} \delta(e, e_i) \prod_{j=1}^{m} \sum_{a=0}^{l} p_{\tilde\theta}(a \mid j, \mathbf{f}, \mathbf{e})\, x^{\delta(i, a)} \tag{115} \]
\[ = \sum_{i=1}^{l} \delta(e, e_i) \prod_{j=1}^{m} \bigl(1 - p_{\tilde\theta}(i \mid j, \mathbf{f}, \mathbf{e}) + x\, p_{\tilde\theta}(i \mid j, \mathbf{f}, \mathbf{e})\bigr) \tag{116} \]
\[ = \sum_{i=1}^{l} \delta(e, e_i) \prod_{j=1}^{m} \bigl(1 - p_{\tilde\theta}(i \mid j, \mathbf{f}, \mathbf{e})\bigr) \prod_{j=1}^{m} \bigl(1 + \beta_{ij}(\mathbf{f}, \mathbf{e})\, x\bigr). \tag{117} \]
To obtain Equation (113), rearrange the order of summation and sum over $\phi$ to eliminate the $\delta$-function of $\phi$. To obtain Equation (114), note that $\phi_i = \sum_{j=1}^{m} \delta(i, a_j)$ and so $x^{\phi_i} = \prod_{j=1}^{m} x^{\delta(i, a_j)}$. To obtain Equation (115), interchange the order of the sums on $a_j$ and the product on $j$. To obtain Equation (116), note that in the sum on $a$, the only term for which the power of $x$ is nonzero is the one for which $a = i$.
Now note that for any indeterminants $x, y_1, y_2, \dots, y_m$,
\[ \prod_{j=1}^{m} (1 + y_j x) = \sum_{\phi=0}^{m} \sum_{\gamma \in \Gamma_\phi} x^{\phi} \prod_{k=1}^{\phi} \frac{z_k^{\gamma_k}}{\gamma_k!}, \tag{118} \]
where
\[ z_k = \frac{(-1)^{k+1}}{k} \sum_{j=1}^{m} (y_j)^k. \tag{119} \]
This follows from the calculation
\[ \prod_{j=1}^{m} (1 + y_j x) = \exp \sum_{j=1}^{m} \log(1 + y_j x) = \exp \sum_{j=1}^{m} \sum_{k=1}^{\infty} \frac{(-1)^{k+1}}{k} (y_j x)^k \tag{120} \]
\[ = \exp \sum_{k=1}^{\infty} z_k x^k = \sum_{n=0}^{\infty} \frac{1}{n!} \Bigl( \sum_{k=1}^{\infty} z_k x^k \Bigr)^{\!n} \tag{121} \]
\[ = \sum_{n=0}^{\infty} \frac{1}{n!} \sum_{\substack{\gamma_1, \gamma_2, \dots \\ \sum_k \gamma_k = n}} \frac{n!}{\gamma_1!\, \gamma_2! \cdots} \prod_{k} (z_k x^k)^{\gamma_k} = \sum_{\phi=0}^{\infty} x^{\phi} \sum_{\gamma \in \Gamma_\phi} \prod_{k} \frac{z_k^{\gamma_k}}{\gamma_k!}. \tag{122} \]
The reader will notice that the left-hand side of Equation (120) involves only powers of $x$ up to $m$, while Equations (121)-(122) involve all powers of $x$. This is because the $z_k$ are not algebraically independent. In fact, for $\phi > m$, the coefficient of $x^{\phi}$ on the right-hand side of Equation (122) must be zero. It follows that we can express $z_\phi$ as a polynomial in $z_k$, $k = 1, 2, \dots, m$.
Using Equation (118) we can identify the coefficient of $x^{\phi}$ in Equation (117). We obtain Equation (108) by combining Equations (117), (118), and the definitions (109)-(111) and (119).
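Because the closed form is easy to get wrong, a brute-force cross-check is useful: for tiny $l$ and $m$, the same count can be computed directly from Equation (112) as an expectation over all $(l+1)^m$ alignments and compared with the output of the partition-based routine above. A sketch, reusing the hypothetical `post` table and `fertility_count`:

import itertools

def fertility_count_bruteforce(phi, e_word, e, post):
    """c(phi | e_word; f, e) computed directly by enumerating all (l+1)^m
    alignments (Equation (112) before any rearrangement). Feasible only
    for very small sentences, but it checks Equation (108)."""
    l, m = len(e) - 1, len(post)
    total = 0.0
    for a in itertools.product(range(l + 1), repeat=m):
        p_a = 1.0
        for j in range(m):
            p_a *= post[j][a[j]]            # the factorized posterior of Eq. (105)
        for i in range(1, l + 1):
            if e[i] == e_word and a.count(i) == phi:
                total += p_a
    return total

For posteriors that lie strictly between 0 and 1 and are normalized over $i$ for each $j$, the two routines should agree up to floating-point error.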
B.6 Model 4 
Parameters.

$t(f \mid e)$                                       translation probabilities
$n(\phi \mid e)$                                    fertility probabilities
$p_0, p_1$                                          fertility probabilities for $e_0$
$d_1(\Delta j \mid \mathcal{A}, \mathcal{B})$       distortion probabilities for the first word of a tablet
$d_{>1}(\Delta j \mid \mathcal{B})$                 distortion probabilities for the other words of a tablet

Here $\Delta j$ is an integer; $\mathcal{A}$ is an English class; and $\mathcal{B}$ is a French class.
General Formulae.
\[ P_\theta(\tau, \pi \mid \mathbf{e}) = P_\theta(\phi \mid \mathbf{e})\, P_\theta(\tau \mid \phi, \mathbf{e})\, P_\theta(\pi \mid \tau, \phi, \mathbf{e}) \tag{123} \]
\[ P_\theta(\mathbf{f}, \mathbf{a} \mid \mathbf{e}) = \sum_{(\tau, \pi) \in \langle \mathbf{f}, \mathbf{a} \rangle} P_\theta(\tau, \pi \mid \mathbf{e}) \tag{124} \]
Assumptions.
\[ P_\theta(\phi \mid \mathbf{e}) = n_0\Bigl(\phi_0 \Bigm| \sum_{i=1}^{l} \phi_i\Bigr) \prod_{i=1}^{l} n(\phi_i \mid e_i) \tag{125} \]
\[ P_\theta(\tau \mid \phi, \mathbf{e}) = \prod_{i=0}^{l} \prod_{k=1}^{\phi_i} t(\tau_{ik} \mid e_i) \tag{126} \]
\[ P_\theta(\pi \mid \tau, \phi, \mathbf{e}) = \frac{1}{\phi_0!} \prod_{i=1}^{l} \prod_{k=1}^{\phi_i} p_{ik}(\pi_{ik}) \tag{127} \]
where
\[ n_0(\phi_0 \mid m') = \binom{m'}{\phi_0}\, p_0^{\,m' - \phi_0}\, p_1^{\,\phi_0}, \tag{128} \]
\[ p_{ik}(j) = \begin{cases} d_1\bigl(j - c_{\rho_i} \mid \mathcal{A}(e_{\rho_i}), \mathcal{B}(\tau_{i1})\bigr) & \text{if } k = 1 \\[4pt] d_{>1}\bigl(j - \pi_{ik-1} \mid \mathcal{B}(\tau_{ik})\bigr) & \text{if } k > 1 \end{cases} \tag{129} \]
In Equation (129), $\rho_i$ is the first position to the left of $i$ for which $\phi_{\rho_i} > 0$, and $c_\rho$ is the ceiling of the average position of the words of $\tau_\rho$:
\[ \rho_i = \max\{i' : i' < i,\ \phi_{i'} > 0\}, \qquad c_\rho = \Bigl\lceil \phi_\rho^{-1} \sum_{k=1}^{\phi_\rho} \pi_{\rho k} \Bigr\rceil. \tag{130} \]
This model is deficient, since
\[ P_\theta(\text{failure} \mid \tau, \phi, \mathbf{e}) = 1 - \sum_{\pi} P_\theta(\pi \mid \tau, \phi, \mathbf{e}) > 0. \tag{131} \]
Note that Equations (125), (126), and (128) are identical to the corresponding formulae (93), (94), and (96) for Model 3.
Generation. Equations (123)-(127) describe the following process for producing f or failure from e:

1-4. Choose a tableau $\tau$ by following Steps 1-4 for Model 3.
5. For each $i = 1, 2, \dots, l$ and each $k = 1, 2, \dots, \phi_i$, choose a position $\pi_{ik}$ as follows.
   If $k = 1$ then choose $\pi_{i1}$ according to the distribution $d_1(\pi_{i1} - c_{\rho_i} \mid \mathcal{A}(e_{\rho_i}), \mathcal{B}(\tau_{i1}))$.
   If $k > 1$ then choose $\pi_{ik}$ greater than $\pi_{ik-1}$ according to the distribution $d_{>1}(\pi_{ik} - \pi_{ik-1} \mid \mathcal{B}(\tau_{ik}))$.
6-8. Finish generating f by following Steps 6-8 for Model 3.
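Relative to Model 3, only Step 5 changes: positions are chosen through class-conditioned relative distortions rather than absolute ones. A minimal sketch of Step 5 for Model 4, with hypothetical class maps `A_cls` and `B_cls` for $\mathcal{A}(\cdot)$ and $\mathcal{B}(\cdot)$ and displacement tables `d1` and `dg1` for $d_1$ and $d_{>1}$; the tableau `tau` and fertilities `phi` are assumed to come from Steps 1-4 of Model 3, `dg1` is assumed to put its mass on positive displacements (the model requires $\pi_{ik} > \pi_{ik-1}$), and the treatment of the very first cept (no fertile cept to its left) as the null word with center 0 is an assumption of this sketch.

import math
import random

def model4_place_words(tau, phi, e, m, A_cls, B_cls, d1, dg1, rng=random.Random(0)):
    """Step 5 of Model 4: choose positions for the words of tablets 1..l.
    Returns pi[i][k] (1-based positions), or 'failure' if a position is
    reused or falls outside 1..m (the deficiency of the model)."""
    l = len(e) - 1
    pi = [[None] * phi[i] for i in range(l + 1)]
    used = set()
    center = {0: 0}          # c_rho for each fertile cept; 0 for the assumed start
    prev = 0                 # rho_i: the previous cept with nonzero fertility
    for i in range(1, l + 1):
        for k in range(phi[i]):
            if k == 0:
                # first word of the tablet: displacement from c_{rho_i}
                key = (A_cls[e[prev]], B_cls[tau[i][0]])
                delta = rng.choices(*zip(*d1[key].items()))[0]
                j = center[prev] + delta
            else:
                # later words: displacement to the right of the previous word
                delta = rng.choices(*zip(*dg1[B_cls[tau[i][k]]].items()))[0]
                j = pi[i][k - 1] + delta
            if j < 1 or j > m or j in used:
                return "failure"
            pi[i][k] = j
            used.add(j)
        if phi[i] > 0:
            center[i] = math.ceil(sum(pi[i]) / phi[i])     # c_i of Equation (130)
            prev = i
    return pi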
B.7 Model 5 
Parameters.

$t(f \mid e)$                              translation probabilities
$n(\phi \mid e)$                           fertility probabilities
$p_0, p_1$                                 fertility probabilities for $e_0$
$d_1(\Delta j \mid \mathcal{B}, v)$        distortion probabilities for the first word of a tablet
$d_{>1}(\Delta j \mid \mathcal{B}, v)$     distortion probabilities for the other words of a tablet

Here $v = 1, 2, \dots, m$.
General Formulae.
\[ P_\theta(\tau, \pi \mid \mathbf{e}) = P_\theta(\phi \mid \mathbf{e})\, P_\theta(\tau \mid \phi, \mathbf{e})\, P_\theta(\pi \mid \tau, \phi, \mathbf{e}) \tag{132} \]
\[ P_\theta(\mathbf{f}, \mathbf{a} \mid \mathbf{e}) = \sum_{(\tau, \pi) \in \langle \mathbf{f}, \mathbf{a} \rangle} P_\theta(\tau, \pi \mid \mathbf{e}) \tag{133} \]
Assumptions.
\[ P_\theta(\phi \mid \mathbf{e}) = n_0\Bigl(\phi_0 \Bigm| \sum_{i=1}^{l} \phi_i\Bigr) \prod_{i=1}^{l} n(\phi_i \mid e_i) \tag{134} \]
\[ P_\theta(\tau \mid \phi, \mathbf{e}) = \prod_{i=0}^{l} \prod_{k=1}^{\phi_i} t(\tau_{ik} \mid e_i) \tag{135} \]
\[ P_\theta(\pi \mid \tau, \phi, \mathbf{e}) = \frac{1}{\phi_0!} \prod_{i=1}^{l} \prod_{k=1}^{\phi_i} p_{ik}(\pi_{ik}) \tag{136} \]
where
\[ n_0(\phi_0 \mid m') = \binom{m'}{\phi_0}\, p_0^{\,m' - \phi_0}\, p_1^{\,\phi_0}, \tag{137} \]
\[ p_{ik}(j) = \epsilon_{ik}(j)\, d_1\bigl(v_{i1}(j) - v_{i1}(c_{\rho_i}) \mid \mathcal{B}(\tau_{i1}),\, v_{i1}(m) - \phi_i + k\bigr) \quad \text{if } k = 1, \tag{138} \]
\[ p_{ik}(j) = \epsilon_{ik}(j)\, d_{>1}\bigl(v_{ik}(j) - v_{ik}(\pi_{ik-1}) \mid \mathcal{B}(\tau_{ik}),\, v_{ik}(m) - v_{ik}(\pi_{ik-1}) - \phi_i + k\bigr) \quad \text{if } k > 1. \tag{139} \]
In Equations (138) and (139), $\rho_i$ is the first position to the left of $i$ which has a non-zero fertility, and $c_\rho$ is the ceiling of the average position of the words of tablet $\rho$ (see Equation (130)). Also, $\epsilon_{ik}(j)$ is 1 if position $j$ is vacant after all the words of tablets $i' < i$ and the first $k-1$ words of tablet $i$ have been placed, and 0 otherwise. $v_{ik}(j)$ is the number of vacancies not to the right of $j$ at this time: $v_{ik}(j) = \sum_{j' \leq j} \epsilon_{ik}(j')$.
This model is not deficient. Note that Equations (134), (135), and (137) are identical to the corresponding formulae (93), (94), and (96) for Model 3.
Generation. Equations (132)-(136) describe the following process for producing f from e:

1-4. Choose a tableau $\tau$ by following Steps 1-4 for Model 3.
5. For each $i = 1, 2, \dots, l$ and each $k = 1, 2, \dots, \phi_i$, choose a position $\pi_{ik}$ as follows:
   If $k = 1$ then choose a vacant position $\pi_{i1}$ according to the distribution $d_1(v_{i1}(\pi_{i1}) - v_{i1}(c_{\rho_i}) \mid \mathcal{B}(\tau_{i1}), v_{i1}(m) - \phi_i + k)$.
   If $k > 1$ then choose a vacant position $\pi_{ik}$ greater than $\pi_{ik-1}$ according to the distribution $d_{>1}(v_{ik}(\pi_{ik}) - v_{ik}(\pi_{ik-1}) \mid \mathcal{B}(\tau_{ik}), v_{ik}(m) - v_{ik}(\pi_{ik-1}) - \phi_i + k)$.
6-8. Finish generating f by following Steps 6-8 for Model 3.
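What distinguishes Model 5 is the vacancy bookkeeping: $\epsilon_{ik}(j)$ restricts every choice to positions that are still vacant, and distances are measured in vacancies $v_{ik}(j)$ rather than in absolute positions. A small sketch of just this bookkeeping (names are illustrative; `occupied` is the set of positions filled so far by earlier tablets and by the first $k-1$ words of the current tablet):

def vacancy_state(m, occupied):
    """Return (eps, v) for the current placement step of Model 5:
    eps[j] is epsilon(j), 1 if position j in 1..m is vacant, else 0;
    v[j] is the number of vacancies not to the right of j.
    Index 0 of both lists is unused so that positions stay 1-based."""
    eps = [0] * (m + 1)
    v = [0] * (m + 1)
    count = 0
    for j in range(1, m + 1):
        eps[j] = 0 if j in occupied else 1
        count += eps[j]
        v[j] = count
    return eps, v

With this state in hand, the $k = 1$ choice above is a distribution over the vacant positions $j$ with weights $d_1(v(j) - v(c_{\rho_i}) \mid \dots)$, so the normalization runs over exactly the positions that can still be filled; no probability is assigned to impossible placements, which is the sense in which the model is not deficient.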