<?xml version="1.0" standalone="yes"?> <Paper uid="J93-2003"> <Title>The Mathematics of Statistical Machine Translation: Parameter Estimation</Title> <Section position="5" start_page="267" end_page="281" type="metho"> <SectionTitle> 4. Translation Models </SectionTitle> <Paragraph position="0"> In this section, we develop a series of five translation models together with the algorithms necessary to estimate their parameters. Each model gives a prescription for computing the conditional probability Pr(f\[e), which we call the likelihood of the translation (f, e). This likelihood is a function of a large number of free parameters that we must estimate in a process that we call training. The likelihood of a set of translations is the product of the likelihoods of its members. In broad outline, our plan is to guess values for these parameters and then to apply the EM algorithm (Baum 1972; Dempster, Laird, and Rubin 1977) iteratively so as to approach a local maximum of the likelihood of a particular set of translations that we call the training data. When the likelihood of the training data has more than one local maximum, the one that we approach will depend on our initial guess.</Paragraph> <Paragraph position="1"> In Models 1 and 2, we first choose a length for the French string, assuming all reasonable lengths to be equally likely. Then, for each position in the French string, we decide how to connect it to the English string and what French word to place there.</Paragraph> <Paragraph position="2"> In Model 1 we assume all connections for each French position to be equally likely.</Paragraph> <Paragraph position="3"> Therefore, the order of the words in e and f does not affect Pr(f\]e). In Model 2 we make the more realistic assumption that the probability of a connection depends on the positions it connects and on the lengths of the two strings. Therefore, for Model 2, Pr(f\[e) does depend on the order of the words in e and f. Although it is possible to obtain interesting correlations between some pairs of frequent words in the two languages using Models 1 and 2, as we will see later (in Figure 5), these models often lead to unsatisfactory alignments.</Paragraph> <Paragraph position="4"> In Models 3, 4, and 5, we develop the French string by choosing, for each word in the English string, first the number of words in the French string that will be connected Peter E Brown et al. The Mathematics of Statistical Machine Translation to it, then the identity of these French words, and finally the actual positions in the French string that these words will occupy. It is this last step that determines the connections between the English string and the French string and it is here that these three models differ. In Model 3, as in Model 2, the probability of a connection depends on the positions that it connects and on the lengths of the English and French strings.</Paragraph> <Paragraph position="5"> In Model 4 the probability of a connection depends in addition on the identities of the French and English words connected and on the positions of any other French words that are connected to the same English word. Models 3 and 4 are deficient, a technical concept defined and discussed in Section 4.5. Briefly, this means that they waste some of their probability on objects that are not French strings at all. Model 5 is very much like Model 4, except that it is not deficient.</Paragraph> <Paragraph position="6"> Models 1-4 serve as stepping stones to the training of Model 5. 
Models 1 and 2 have an especially simple mathematical form so that iterations of the EM algorithm can be computed exactly. That is, we can explicitly perform sums over all possible alignments for these two models. In addition, Model 1 has a unique local maximum so that parameters derived for it in a series of EM iterations do not depend on the starting point for the iterations. As explained below, we use Model 1 to provide initial estimates for the parameters of Model 2. In Model 2 and subsequent models, the likelihood function does not have a unique local maximum, but by initializing each model from the parameters of the model before it, we arrive at estimates of the parameters of the final model that do not depend on our initial estimates of the parameters for Model 1.</Paragraph> <Paragraph position="7"> In Models 3 and 4, we must be content with approximate EM iterations because it is not feasible to carry out sums over all possible alignments for these models. But, while approaching more closely the complexity of Model 5, they retain enough simplicity to allow an efficient investigation of the neighborhood of probable alignments and therefore allow us to include what we hope are all of the important alignments in each EM iteration.</Paragraph> <Paragraph position="8"> In the remainder of this section, we give an informal but reasonably precise description of each of the five models and an intuitive account of the EM algorithm as applied to them. We assume the reader to be comfortable with Lagrange multipliers, partial differentiation, and constrained optimization as they are presented in a typical college calculus text, and to have a nodding acquaintance with random variables. On the first time through, the reader may wish to jump from here directly to Section 5, returning to this Section when and if he should desire to understand more deeply how the results reported later are achieved.</Paragraph> <Paragraph position="9"> The basic mathematical object with which we deal here is the joint probability distribution Pr(F = f, A = a, E = e), where the random variables F and E are a French string and an English string making up a translation, and the random variable A is an alignment between them. We also consider various marginal and conditional probability distributions that can be constructed from Pr(F = f, A = a, E = e), especially the distribution Pr(F = fie = e). We generally follow the common convention of using uppercase letters to denote random variables and the corresponding lowercase letters to denote specific values that the random variables may take. We have already used I and m to represent the lengths of the strings e and L and so we use L and M to denote the corresponding random variables. When there is no possibility for confusion, or, more properly, when the probability of confusion is not thereby materially increased, we write Pr(f, a, e) for Pr(F = f, A = a, E = e), and use similar shorthands throughout. We can write the likelihood of (fie) in terms of the conditional probability Pr(f, ale ) as</Paragraph> <Paragraph position="11"> Computational Linguistics Volume 19, Number 2 The sum here, like all subsequent sums over a, is over the elements of M(e, f). We restrict ourselves in this section to alignments like the one shown in Figure I where each French word has exactly one connection. In this kind of alignment, each cept is either a single English word or it is empty. 
Therefore, we can assign cepts to positions in the English string, reserving position zero .for the empty cept. If the English string, e = e~ - el e2... el, has 1 words, and the French string, f = f~ =_ flf2.., fro, has m words, then the alignment, a, can be represented by a series, a~ = ala2...am, of m values, each between 0 and I such that if the word in position j of the French string is connected to the word in position i of the English string, then aj = i, and if it is not connected to any English word, then aj = O.</Paragraph> <Paragraph position="12"> Without loss of generality, we can write</Paragraph> <Paragraph position="14"> This is only one of many ways in which Pr(f, ale) can be written as the product of a series of conditional probabilities. It is important to realize that Equation (4) is not an approximation. Regardless of the form of Pr(f, ale ), it can always be analyzed into a product of terms in this way. We are simply asserting in this equation that when we generate a French string together with an alignment from an English string, we can first choose the length of the French string given our knowledge of the English string.</Paragraph> <Paragraph position="15"> Then we can choose where to connect the first position in the French string given our knowledge of the English string and the length of the French string. Then we can choose the identity of the first word in the French string given our knowledge of the English string, the length of the French string, and the position in the English string to which the first position in the French string is connected, and so on. As we step through the French string, at each point we make our next choice given our complete knowledge of the English string and of all our previous choices as to the details of the French string and its alignment.</Paragraph> <Section position="1" start_page="269" end_page="272" type="sub_section"> <SectionTitle> 4.1 Model 1 </SectionTitle> <Paragraph position="0"> The conditional probabilities on the right-hand side of Equation (4) cannot all be taken as independent parameters because there are too many of them. In Model 1, we assume that Pr(mle ) is independent of e and m; that Pr(ajlalJ-l, J -1, m, e), depends only on 1, the length of the English string, and therefore must be (l + 1)-1; and that Pr(fj\[alJ,fl j-l, m~ e) depends only on j~ and %. The parameters, then, are ~ -_ Pr(mle ), and t(~\]%) -- Pr(djlalJ,AJ-1, m, e), which we call the translation probability of ~ given eaj. We think of ~ as some small, fixed number. The distribution of M, the length of the French string, is unnormalized but this is a minor technical issue of no significance to our computations. If we wish, we can think of M as having some finite range. As long as this range encompasses everything that actually occurs in training data, no problems arise.</Paragraph> <Paragraph position="1"> We turn now to the problem of estimating the translation probabilities for Model 1.</Paragraph> <Paragraph position="2"> The joint likelihood of a French string and an alignment given an English string is</Paragraph> <Paragraph position="4"> The alignment is determined by specifying the values of aj for j from 1 to m, each of Peter F. Brown et al. The Mathematics of Statistical Machine Translation which can take any value from 0 to I. 
Therefore,</Paragraph> <Paragraph position="6"> We wish to adjust the translation probabilities so as to maximize Pr(fIe ) subject to the constraints that for each C/,</Paragraph> <Paragraph position="8"> Following standard practice for constrained maximization, we introduce Lagrange multipliers )%, and seek an unconstrained extrernum of the auxiliary function</Paragraph> <Paragraph position="10"> An extremum occurs when all of the partial derivatives of h with respect to the components of t and ,~ are zero. That the partial derivatives with respect to the components of I be zero is simply a restatement of the constraints on the translation probabilities.</Paragraph> <Paragraph position="11"> The partial derivative of h with respect to t(f\] e) is l l re re Oh e</Paragraph> <Paragraph position="13"> where 6 is the Kronecker delta function, equal to one when both of its arguments are the same and equal to zero otherwise. This partial derivative will be zero provided</Paragraph> <Paragraph position="15"> Superficially, Equation (10) looks like a solution to the extremum problem, but it is not because the translation probabilities appear on both sides of the equal sign.</Paragraph> <Paragraph position="16"> Nonetheless, it suggests an iterative procedure for finding a solution: given an initial guess for the translation probabilities, we can evaluate the right-hand side of Equation (10) and use the result as a new estimate for t(ff e). (Here and elsewhere, the Lagrange multipliers simply serve as a reminder that we need to normalize the translation probabilities so that they satisfy Equation (7).) This process, when applied repeatedly, is called the EM algorithm. That it converges to a stationary point of h in situations like this was first shown by Baum (1972) and later by others (Dempster, Laird, and Rubin 1977).</Paragraph> <Paragraph position="17"> With the aid of Equation (5), we can reexpress Equation (10) as</Paragraph> <Paragraph position="19"> number of times econnects to f in a We call the expected number of times that e connects to f in the translation (fie) the count of f given e for (fie) and denote it by c(fle; f, e). By definition, c(f I e; f, e) = E Pr(ale' f) E 6(f, fj)5(e, eat) , (12) a j=l Computational Linguistics Volume 19, Number 2 where Pr(ale, f) = Pr(f, ale)/Pr(fle ). If we replace Ae by ~C/ Pr(fle ), then Equation (11) can be written very compactly as t(fle ) = )~-jlc(yle; f, e). (13) In practice, our training data consists of a set of translations, (f(1) leO)), (f(2)le(2)), ..., (f(S)\[e(S)), so this equation becomes</Paragraph> <Paragraph position="21"> Here, )% serves only as a reminder that the translation probabilities must be normalized. null Usually, it is not feasible to evaluate the expectation in Equation (12) exactly. Even when we exclude multi-word cepts, there are still (1 + 1) m alignments possible for (fie). Model 1, however, is special because by recasting Equation (6), we arrive at an expression that can be evaluated efficiently. The right-hand side of Equation (6) is a sum of terms each of which is a monomial in the translation probabilities. Each monomial contains m translation probabilities, one for each of the words in f. Different monomials correspond to different ways of connecting words in f to cepts in e with every way appearing exactly once. By direct evaluation, we see that</Paragraph> <Paragraph position="23"> An example may help to clarify this. Suppose that m = 3 and 1 = 1, and that we write tji as a shorthand for t(d~le~). 
Then the left-hand side of Equation (15) is ho t20 t30 + tlo t20 t31 +&quot;&quot; q- tn t21 t30 + tll t21 t31, and the right-hand side is (ho + tn) (t20 + t21 ) (t30 q- t31 ). It is routine to verify that these are the same. Therefore, we can interchange the sums in Equation (6) with the product to obtain</Paragraph> <Paragraph position="25"> If we use this expression in place of Equation (6) when we write the auxiliary function in Equation (8), we find that count of e in e</Paragraph> <Paragraph position="27"> Thus, the number of operations necessary to calculate a count is proportional to 1 + m rather than to (I + 1) m as Equation (12) might suggest.</Paragraph> <Paragraph position="28"> Peter F. Brown et al. The Mathematics of Statistical Machine Translation Using Equations (14) and (17), we can estimate the parameters t(f I e) as follows. 1. Choose initial values for t(fle ).</Paragraph> <Paragraph position="29"> 2. For each pair of sentences if(s), e(S)), 1 < s < S, use Equation (17) to compute the counts c(f\] e; f(s), e(S)). Notice that these counts will be different from zero only when f is one of the words in f(s) and e is one of the words in e (~). Notice, also, that c(f I e; f(s), e(~)) does not depend on the order of the words in the sentences, but only on the number of times that the words appear in their respective sentences.</Paragraph> <Paragraph position="30"> 3. For each e that appears in at least one of the e (s), * Compute ,~ according to the equation</Paragraph> <Paragraph position="32"> * For each f that appears in at least one f('), use Equation (14) to obtain a new value for t(f\] e).</Paragraph> <Paragraph position="33"> 4. Repeat steps 2 and 3 until the values of t(dle) have converged to the desired degree.</Paragraph> <Paragraph position="34"> The details of our initial guesses for t(fl e) are unimportant because Pr(fle ) has a unique local maximum for Model 1, as is shown in Appendix B. We start with all of the t(fle) equal, but any other choice that avoids zeros would lead to the same final solution.</Paragraph> </Section> <Section position="2" start_page="272" end_page="274" type="sub_section"> <SectionTitle> 4.2 Model 2 </SectionTitle> <Paragraph position="0"> In Model 1, we take no cognizance of where words appear in either string. The first word in the French string is just as likely to be connected to a word at the end of the English string as to one at the beginning. In Model 2 we make the same assumptions as in Model 1 except that we assume that Pr(aj\]~-l,f~ -1, m, e) depends on j, aj, and m, as well as on I. We introduce a set of alignment probabilities,</Paragraph> <Paragraph position="2"> Therefore, we seek an unconstrained extremum of the auxiliary function</Paragraph> <Paragraph position="4"> Computational Linguistics Volume 19, Number 2 The reader will easily verify that Equations (11), (13), and (14) carry over from Model 1 to Model 2 unchanged. We need a new count, c(ilj, m, l; f, e), the expected number of times that the word in position j of f is connected to the word in position i of e. Clearly, c(ilj, m, l; f, e) = ~ Pr(ale, f)6(i, aj). (23) a In analogy with Equations (13) and (14), we have, for a single translation, a(ilj, m, l) = #~lc(ilj, m, l; f, e), (24) and, for a set of translations,</Paragraph> <Paragraph position="6"> Notice that if f(s) does not have length m or if e (s) does not have length l, then the corresponding count is zero. 
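To make the Model 1 procedure above concrete, the following is a minimal sketch in Python (our own function and variable names, not anything from the paper) of steps 2 and 3 of a single EM iteration. Here t is assumed to hold a current value of t(f|e) for every pair of words that co-occur in some training translation, and each English sentence is assumed to carry the empty cept e0 (the token None) in position 0.

```python
from collections import defaultdict

def model1_em_iteration(bitext, t):
    """One EM iteration for Model 1 (steps 2 and 3 of the procedure above).

    bitext : list of (f_sent, e_sent) pairs, each a list of words; e_sent is
             assumed to carry the empty cept e0 (here the token None) in
             position 0.
    t      : dict mapping (f_word, e_word) -> current translation probability,
             defined for every pair that co-occurs in some training pair.
    Returns a new, renormalized translation table.
    """
    count = defaultdict(float)   # c(f|e), summed over all training pairs
    total = defaultdict(float)   # lambda_e, the normalizer for each e

    for f_sent, e_sent in bitext:
        for f in f_sent:
            # denominator of Equation (17): t(f|e0) + ... + t(f|el)
            denom = sum(t[(f, e)] for e in e_sent)
            for e in e_sent:
                c = t[(f, e)] / denom      # expected count of this connection
                count[(f, e)] += c
                total[e] += c

    # Equation (14): renormalize so that sum over f of t(f|e) = 1 for every e
    return {(f, e): c / total[e] for (f, e), c in count.items()}
```

Weighting each term t(f_j|e_i) by a(i|j, m, l) in both the numerator and the denominator of the inner computation yields the Model 2 counts of Equation (27); the normalization step is unchanged.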
As with the As in earlier equations, the #s here serve simply to remind us that the alignment probabilities must be normalized.</Paragraph> <Paragraph position="7"> Model 2 shares with Model 1 the important property that the sums in Equations (12) and (23) can be obtained efficiently. We can rewrite Equation (21) as</Paragraph> <Paragraph position="9"> Using this form for Pr(fle ), we find that</Paragraph> <Paragraph position="11"> Equation (27) has a double sum rather than the product of two single sums, as in Equation (17), because in Equation (27) i and j are tied together through the alignment probabilities.</Paragraph> <Paragraph position="12"> Model 1 is the special case of Model 2 in which a(ilj , m, I) is held fixed at (1+1) -1. Therefore, any set of parameters for Model I can be reinterpreted as a set of parameters for Model 2. Taking as our initial estimates of the parameters for Model 2 the parameter values that result from training Model 1 is equivalent to computing the probabilities of all alignments as if we were dealing with Model 1, but then collecting the counts as if we were dealing with Model 2. The idea of computing the probabilities of the alignments using one model, but collecting the counts in a way appropriate to a second model is very general and can always be used to transfer a set of parameters from one model to another.</Paragraph> <Paragraph position="13"> Peter F. Brown et al. The Mathematics of Statistical Machine Translation</Paragraph> </Section> <Section position="3" start_page="274" end_page="275" type="sub_section"> <SectionTitle> 4.3 Intermodel Interlude </SectionTitle> <Paragraph position="0"> We created Models 1 and 2 by making various assumptions about the conditional probabilities that appear in Equation (4). As we have mentioned, Equation (4) is an exact statement, but it is only one of many ways in which the joint likelihood of f and a can be written as a product of conditional probabilities. Each such product corresponds in a natural way to a generative process for developing f and a from e.</Paragraph> <Paragraph position="1"> In the process corresponding to Equation (4), we first choose a length for f. Next, we decide which position in e is connected to fl and what the identity of fl is. Then, we decide which position in e is connected to f2, and so on. For Models 3, 4, and 5, we write the joint likelihood as a product of conditional probabilities in a different way.</Paragraph> <Paragraph position="2"> Casual inspection of some translations quickly establishes that the is usually translated into a single word (le, la, or l'), but is sometimes omitted; or that only is often translated into one word (for example, seulement), but sometimes into two (for example, ne ... que), and sometimes into none. The number of French words to which e is connected in a randomly selected alignment is a random variable, Ce, that we call the fertility of e. Each choice of the parameters in Model 1 or Model 2 determines a distribution, Pr(C/e = C/), for this random variable. But the relationship is remote: just what change will be wrought in the distribution of ~th~ if, say, we adjust a(1 \[2, 8, 9) is not immediately clean In Models 3, 4, and 5, we parameterize fertilities directly.</Paragraph> <Paragraph position="3"> As a prolegomenon to a detailed discussion of Models 3, 4, and 5, we describe the generative process upon which they are based. Given an English string, e, we first decide the fertility of each word and a list of French words to connect to it. 
We call this list, which may be empty, a tablet. The collection of tablets is a random variable, T, that we call the tableau of e; the tablet for the i th English word is a random variable, Ti; and the k th French word in the i th tablet is a random variable, Tik. After choosing the tableau, we permute its words to produce f. This permutation is a random variable, H. The position in f of the k th word in the l *h tablet is yet another a random variable, I~ik.</Paragraph> <Paragraph position="4"> The joint likelihood for a tableau, T, and a permutation, 7r, is</Paragraph> <Paragraph position="6"> In this equation, rik1-1 represents the series of values Til,... , &quot;l-ik_l; 7ri k-1 represents the series of values 7ril,..., 7rik-1; and C/i is shorthand for Cei.</Paragraph> <Paragraph position="7"> Knowing T and 7r determines a French string and an alignment, but in general several different pairs r, 7r may lead to the same pair f, a. We denote the set of such pairs by (f, a). Clearly, then</Paragraph> <Paragraph position="9"> l 1 The number of elements in (f, a} is I-\[i=0 C/i- because for each ri there are C/i! arrangements that lead to the pair f, a. Figure 4 shows the two tableaux for (bon march~ \[ cheap(I,2)).</Paragraph> <Paragraph position="10"> Except for degenerate cases, there is one alignment in A(e, f) for which Pr(ale, f) is greatest. We call this the Viterbi alignment for (fie) and denote it by V(f\[e). We know of no practical algorithm for finding V(fle ) for a general model. Indeed, if someone were to claim that he had found V(f\]e), we know of no practical algorithm for demonstrating that he is correct. But for Model 2 (and, thus, also for Model 1), finding V(f\[e) is straightforward. For each j, we simply choose aj so as to make the product t(fj\[%)a(ajlj, ra, l) as large as possible. The Viterbi alignment depends on the model with respect to which it is computed. When we need to distinguish between the Viterbi alignments for different models, we write V(f\[e; 1), V(fle; 2), and so on.</Paragraph> <Paragraph position="11"> We denote by .Ai,_-j(e, f) the set of alignments for which aj = i. We say that ij is pegged in these alignments. By the pegged Viterbi alignment for ij, which we write Vi~_j(fle), we mean that element of Ai~-j(e, f) for which Pr(a\[e, f) is greatest. Obviously, we can find Vi~j(fle; 1) and Viii(fie;2) quickly with a straightforward modification of the algorithm described above for finding V(f\]e; 1) and V(fle; 2).</Paragraph> </Section> <Section position="4" start_page="275" end_page="278" type="sub_section"> <SectionTitle> 4.4 Model 3 </SectionTitle> <Paragraph position="0"> Model 3 is based on Equation (29). Earlier, we were unable to treat each of the conditional probabilities on the right-hand side of Equation (4) as a separate parameter.</Paragraph> <Paragraph position="1"> With Equation (29) we are no better off and must again make assumptions to reduce the number of independent parameters. There are many different sets of assumptions that we might make, each leading to a different model for the translation process.</Paragraph> <Paragraph position="2"> In Model 3, we assume that, for i between 1 and 1, Pr(C/i\[C/~ -1, e) depends only on C/i and e/; that, for all i, Pr(~-iklTit -1, T~ -1, C/~, e) depends only on Tik and e/; and that, for i between 1 and 1, Pr0rik\[~rik-1,7r 1i-1, ~'0 z, C/~, e) depends only on ~rik, i, m, and 1. 
The parameters of Model 3 are thus a set of fertility probabilities, n(C/\[ e/) = Pr(C/\]C/~ -1, e); a set of translation probabilities, t(f\[e~) - Pr(Tik =f\[~_ik-1, 7-0i-1, %,1~ e); and a set of distortion probabilities, dq\[i, m, 1) =- Pr(IIik = j\[w/k-l, 7r~ -1, TO t, C/~, e).</Paragraph> <Paragraph position="3"> We treat the distortion and fertility probabilities for e0 differently. The empty cept conventionally occupies position 0, but actually has no position. Its purpose is to account for those words in the French string that cannot readily be accounted for by other cepts in the English string. Because we expect these words to be spread uniformly throughout the French string, and because they are placed only after all of the other Peter F. Brown et al. The Mathematics of Statistical Machine Translation = 11~-01, 71&quot;1, T~, 40, e) words in the string have been placed, we assume that Pr(H0k+l * k l t t equals 0 unless position j is vacant in which case it equals (40 - k) -1. Therefore, the contribution of the distortion probabilities for all of the words in TO is 1/40\[.</Paragraph> <Paragraph position="4"> We expect 40 to depend on the length of the French string because longer strings should have more extraneous words. Therefore, we assume that</Paragraph> <Paragraph position="6"> for some pair of auxiliary parameters p0 and pl. The expression on the left-hand side of this equation depends on C/~ only through the sum C/1 + &quot;'&quot; + C/1 and defines a probability distribution over C/0 whenever P0 and pl are nonnegative and sum to 1.</Paragraph> <Paragraph position="7"> We can interpret Pr(C/01C/~,e) as follows. We imagine that each of the words from T1 t requires an extraneous word with probability pl and that this extraneous word must be connected to the empty cept. The probability that exactly C/0 of the words from T~ will require an extraneous word is just the expression given in Equation (31).</Paragraph> <Paragraph position="8"> As with Models 1 and 2, an alignment of (fie) is determined by specifying aj for each position in the French string. The fertilities, C/0 through C/l, are functions of the ajs: C/i is equal to the number of js for which aj equals i. Therefore,</Paragraph> <Paragraph position="10"> with y~f t(fle ) = 1, Y~qd(jli, m, 1) = 1, ~-~C/ n(C/le) = 1, and po+pl = 1. The assumptions that we make for Model 3 are such that each of the pairs (% 70 in If, a) makes an identical contribution to the sum in Equation (30). The factorials in Equation (32) come from carrying out this sum explicitly. There is no factorial for the empty cept because it is exactly canceled by the contribution from the distortion probabilities.</Paragraph> <Paragraph position="11"> By now, the reader will be able to provide his or her own auxiliary function for seeking a constrained minimum of the likelihood of a translation with Model 3, but for completeness and to establish notation, we write</Paragraph> <Paragraph position="13"> Following the trail blazed with Models 1 and 2, we define the counts</Paragraph> <Paragraph position="15"> Computational Linguistics Volume 19, Number 2</Paragraph> <Paragraph position="17"> The counts in these last two equations correspond to the parameters p0 and pl that determine the fertility of the empty cept in the English string. The reestimation formulae for Model 3 are</Paragraph> <Paragraph position="19"> Equations (34) and (39) are identical to Equations (12) and (14) and are repeated here only for convenience. 
Equations (35) and (40) are similar to Equations (23) and (25), but a(i\[j, m, 1) differs from d(jti , m, 1) in that the former sums to unity over all i for fixed j while the latter sums to unity over all j for fixed i. Equations (36), (37), (38), (41), and (42), for the fertility parameters, are new.</Paragraph> <Paragraph position="20"> The trick that allows us to evaluate the right-hand sides of Equations (12) and (23) efficiently for Model 2 does not work for Model 3. Because of the fertility parameters, we cannot exchange the sums over al through am with the product over j in Equation (32) as we were able to for Equations (6) and (21). We are not, however, entirely bereft of hope. The alignment is a useful device precisely because some alignments are much more probable than others. Our strategy is to carry out the sums in Equations (32) and (34)-(38) only over some of the more probable alignments, ignoring the vast sea of much less probable ones. Specifically, we begin with the most probable alignment that we can find and then include all alignments that can be obtained from it by small changes.</Paragraph> <Paragraph position="21"> To define unambiguously the subset, S, of the elements of A(fle) over which we evaluate the sums, we need yet more terminology. We say that two alignments, a and a', differ by a move if there is exactly one value of j for which aj ~ aj'. We say that they differ by a swap if aj = aj' except at two values, jl and j2, for which a h = a h' and aj 2 = aj 11. We say that two alignments are neighbors if they are identical or differ by a move or by a swap. We denote the set of all neighbors of a by A/'(a).</Paragraph> <Paragraph position="22"> Let b(a) be that neighbor of a for which the likelihood Pr(b(a)l L e) is greatest.</Paragraph> <Paragraph position="23"> Suppose that ij is pegged for a. Among the neighbors of a for which/j is also pegged, let bi~_;(a) be that for which the likelihood is greatest. The sequence of alignments a, b(a), b~(a) =-- b(b(a)), ..., converges in a finite number of steps to an alignment that we write as bdegdeg(a). Similarly, if/j is pegged for a, the sequence of alignments a, bi,_-j(a), Peter F. Brown et al. The Mathematics of Statistical Machine Translation b2,__j(a), ..., converges in a finite number of steps to an alignment that we write as bi~deg~j(a). The simple form of the distortion probabilities in Model 3 makes it easy to find b(a) and bi~-j(a). If a' is a neighbor of a obtained from it by the move of j from i to i ~, and if neither i nor i ~ is 0, then</Paragraph> <Paragraph position="25"> Notice that C/i, is the fertility of the word in position i ~ for alignment a. The fertility of this word in alignment a ~ is C/i, + 1. Similar equations can be easily derived when either i or i ~ is zero, or when a and a ~ differ by a swap. We leave the details to the reader.</Paragraph> <Paragraph position="26"> With these preliminaries, we define S by</Paragraph> <Paragraph position="28"> In this equation, we use b~(V(fle; 2)) and b~j(Vi,__j(fle; 2)) as handy approximations to V(fle; 3) and Vi,__j(fle; 3), neither of which we are able to compute efficiently.</Paragraph> <Paragraph position="29"> In one iteration of the EM algorithm for Model 3, we compute the counts in Equations (34)-(38), summing only over elements of S, and then use these counts in Equations (39)-(42) to obtain a new set of parameters. 
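As an illustration of how the set S can be assembled in practice, the sketch below (our own Python, with hypothetical names; score(a) stands for Pr(f, a|e) as computed with Model 3 and is assumed to be supplied) finds V(f|e; 2) by the position-wise maximization described in Section 4.3 and then hill-climbs through moves and swaps to b(V(f|e; 2)) iterated to convergence. The pegged alignments are obtained the same way, with position j held fixed at i throughout both the maximization and the neighborhood search.

```python
import itertools

def viterbi_model2(f_sent, e_sent, t, a):
    """V(f|e; 2): for each French position j (1-based in the text), choose the
    English position i that maximizes t(f_j|e_i) * a(i|j, m, l).
    e_sent carries the empty cept in position 0."""
    l, m = len(e_sent) - 1, len(f_sent)
    return [max(range(l + 1),
                key=lambda i: t[(f_sent[j], e_sent[i])] * a[(i, j + 1, m, l)])
            for j in range(m)]

def neighbors(align, l):
    """All alignments that differ from `align` by one move or one swap."""
    m = len(align)
    for j in range(m):                                   # moves
        for i in range(l + 1):
            if i != align[j]:
                yield align[:j] + [i] + align[j + 1:]
    for j1, j2 in itertools.combinations(range(m), 2):   # swaps
        if align[j1] != align[j2]:
            swapped = list(align)
            swapped[j1], swapped[j2] = swapped[j2], swapped[j1]
            yield swapped

def hillclimb(align, l, score):
    """Iterate b(.): keep replacing the alignment by its best neighbor until
    no neighbor scores higher.  `score(a)` stands for Pr(f, a|e) under
    Model 3."""
    while True:
        best = max(neighbors(align, l), key=score, default=align)
        if score(best) <= score(align):
            return align
        align = best
```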
If the error made by including only some of the elements of A(e, f) is not too great, this iteration will lead to values of the parameters for which the likelihood of the training data is at least as large as for the first set of parameters.</Paragraph> <Paragraph position="30"> We make no initial guess of the parameters for Model 3, but instead adapt the parameters from the final iteration of the EM algorithm for Model 2. That is, we compute the counts in Equations (34)-(38) using Model 2 to evaluate Pr(a\[e, f). The simple form of Model 2 again makes exact calculation feasible. We can readily adapt Equations (27) and (28) to compute counts for the translation and distortion probabilities; efficient calculation of the fertility counts is more involved, and we defer a discussion of it to Appendix B.</Paragraph> </Section> <Section position="5" start_page="278" end_page="279" type="sub_section"> <SectionTitle> 4.5 Deficiency </SectionTitle> <Paragraph position="0"> The reader will have noticed a problem with our parameterization of the distortion probabilities in Model 3: whereas we can see by inspection that the sum over all pairs % 7r of the expression on the right-hand side of Equation (29) is unity, it is equally clear that this can no longer be the case if we assume that Pr(IIik :. k-1 ~_i-1 ~l ~l e) z \]\[TFi1 ~ 1 ~ o~rPo~ depends only on j, i, m, and l for i > 0. Because the distortion probabilities for assigning positions to later words do not depend on the positions assigned to earlier words, Model 3 wastes some of its probability on what we might call generalized strings, i.e., strings that have some positions with several words and others with none. When a model has this property of not concentrating all of its probability on events of interest, we say that it is deficient. Deficiency is the price that we pay for the simplicity that allows us to write Equation (43).</Paragraph> <Paragraph position="1"> Deficiency poses no serious problem here. Although Models 1 and 2 are not technically deficient, they are surely spiritually deficient. Each assigns the same probability to the alignments (Je n'ai pas de stylo I I(1) do not(2,4) have(3) a(5) pen(6)) and (Je pas ai ne de stylo I I(1) do not(2,4) have(3) a(5) pen(6)), and, therefore, essentially the same probability to the translations (Je n&quot; ai pas de stylo I I do not have a pen) and (Je pas ai ne de stylo \[ I do not have a pen). In each case, not produces two words, ne and pas, and in each case, Computational Linguistics Volume 19, Number 2 one of these words ends up in the second position of the French string and the other in the fourth position. The first translation should be much more probable than the second, but this defect is of little concern because while we might have to translate the first string someday, we will never have to translate the second. We do not use our translation models to predict French given English but rather as a component of a system designed to predict English given French. They need only be accurate to within a constant factor over well-formed strings of French words.</Paragraph> </Section> <Section position="6" start_page="279" end_page="280" type="sub_section"> <SectionTitle> 4.6 Model 4 </SectionTitle> <Paragraph position="0"> Often the words in an English string constitute phrases that are translated as units into French. Sometimes, a translated phrase may appear at a spot in the French string different from that at which the corresponding English phrase appears in the English string. 
The distortion probabilities of Model 3 do not account well for this tendency of phrases to move around as units. Movement of a long phrase will be much less likely than movement of a short phrase because each word must be moved independently. In Model 4, we modify our treatment of Pr(IIik = jlTri k-l, ~i-l,,1 , '0,~'0,-I ,~ e) so as to alleviate this problem. Words that are connected to the empty cept do not usually form phrases, and so we continue to assume that these words are spread uniformly throughout the French string.</Paragraph> <Paragraph position="1"> As we have described, an alignment resolves an English string into a ceptual scheme consisting of a set of possibly overlapping cepts. Each of these cepts then accounts for one or more French words. In Model 3 the ceptual scheme for an alignment is determined by the fertilities of the words: a word is a cept if its fertility is greater than zero. The empty cept is a part of the ceptual scheme if C/0 is greater than zero.</Paragraph> <Paragraph position="2"> As before we exclude multi-word cepts. Among the one-word cepts, there is a natural order corresponding to the order in which they appear in the English string. Let \[i\] denote the position in the English string of the/th one-word cept. We define the center of this cept, (r)i, to be the ceiling of the average value of the positions in the French string of the words from its tablet. We define its head to be that word in its tablet for which the position in the French string is smallest.</Paragraph> <Paragraph position="3"> In Model 4, we replace d(jli , m, l) by two sets of parameters: one for placing the head of each cept, and one for placing any remaining words. For \[i\] > 0, we require that the head for cept i be r\[i\]l and we assume that * \[t 7-1 TO / ,~/,e) = d 1 Pr(II\[i\]l = 1 7rl , (J - (r)i-llA(e\[i-1\]),/3(fJ))&quot; (45) Here, A and B are functions of the English and French words that take on a small number of different values as their arguments range over their respective vocabularies.</Paragraph> <Paragraph position="4"> Brown et al. (1990) describe an algorithm for dividing a vocabulary into classes so as to preserve mutual information between adjacent classes in running text. We construct ,A and /3 as functions with 50 distinct values by dividing the English and French vocabularies each into 50 classes according to this algorithm. By assuming that the probability depends on the previous cept and on the identity of the French word being placed, we can account for such facts as the appearance of adjectives before nouns in English but after them in French. We call j - (r)i-1 the displacement for the head of cept i. It may be either positive or negative. We expect dl(-lI.A(e),/3(f)) to be larger than dl(+ llA(e),/3(f)) when e is an adjective and d is a noun. Indeed, this is borne out in the trained distortion probabilities for Model 4, where we find that dl (-llA(government's),/3(d~veloppement)) is 0.7986, while dl (+ llM(government's), /3(d~veloppement)) is 0.0168.</Paragraph> <Paragraph position="5"> Peter F. Brown et al. The Mathematics of Statistical Machine Translation Suppose, now, that we wish to place the k th word of cept i for \[i\] > 0, k > 1. We assume that</Paragraph> <Paragraph position="7"> We require that ~r\[i\]k be greater than ~rI,\]k-1. Some English words tend to produce a series of French words that belong together, while others tend to produce a series of words that should be separate. 
For example, implemented can produce mis en application, which usually occurs as a unit, but not can produce ne pas, which often occurs with an intervening verb* We expect d>l(2\[B(pas)) to be relatively large compared with d>l(2\[/J(en)). After training, we find that d>l(2\[B(pas)) is 0.6847 and d>l(2II3(en)) is 0.1533.</Paragraph> <Paragraph position="8"> Whereas we assume that T\[i\]l can be placed either before or after any previously positioned words, we require subsequent words from 7\[i\] to be placed in order. This does not mean that they must occupy consecutive positions but only that the second word from T\[~\] must lie to the right of the first, the third to the right of the second, and so on. Because of this, only one of the C/\[i\]! arrangements of 71i\] is possible.</Paragraph> <Paragraph position="9"> We leave the routine details of deriving the count and reestimation formulae for Model 4 to the reader. He may find the general formulae in Appendix B helpful.</Paragraph> <Paragraph position="10"> Once again, the several counts for a translation are expectations of various quantities over the possible alignments with the probability of each alignment computed from an earlier estimate of the parameters. As with Model 3, we know of no trick for evaluating these expectations and must rely on sampling some small set, S, of alignments. As described above, the simple form that we assume for the distortion probabilities in Model 3 makes it possible for us to find b dego (a) rapidly for any a. The analog of Equation (43) for Model 4 is complicated by the fact that when we move a French word from cept to cept we change the centers of two cepts and may affect the contribution of several words. It is nonetheless possible to evaluate the adjusted likelihood incrementally, although it is substantially more time-consuming.</Paragraph> <Paragraph position="11"> Faced with this unpleasant situation, we proceed as follows. Let the neighbors of a be ranked so that the first is the neighbor for which Pr(a\[e, f; 3) is greatest, the second the one for which Pr(a\[e, f; 3) is next greatest, and so on. We define b(a) to be the highest-ranking neighbor of a for which Pr(b(a)\[e, f; 4) is at least as large as Pr(aIe, f; 4). We define bi,._j(a) analogously. Here, Pr(a\[e, f;3) means Pr(a\[e, f) as computed with Model 3, and Pr(ale, f;4) means Pr(a\[e, f) as computed with Model 4. We define S for Model 4 by</Paragraph> <Paragraph position="13"> This equation is identical to Equation (47) except that b has been replaced by/~.</Paragraph> </Section> <Section position="7" start_page="280" end_page="281" type="sub_section"> <SectionTitle> 4.7 Model 5 </SectionTitle> <Paragraph position="0"> Models 3 and 4 are both deficient. In Model 4, not only can several words lie on top of one another, but words can be placed before the first position or beyond the last position in the French string. We remove this deficiency in Model 5.</Paragraph> <Paragraph position="1"> After we have placed the words for r~ i\]-1 and T\[i\] k-1 there will remain some vacant positions in the French string. Obviously, T\[i\]k should be placed in one of these vacancies. Models 3 and 4 are deficient precisely because we fail to enforce this constraint for the one-word cepts. Let v(j, T~ i\]-1, T\[i\]I k-l) be the number of vacancies up to and including position j just before we place T\[,lk. In the interest of notational brevity, a noble but elusive goal, we write this simply as vj. 
We retain two sets of distortion Computational Linguistics Volume 19, Number 2 parameters, as in Model 4, and continue to refer to them as dl and d>l. We assume that, for \[i\] > 0, ;1~\[i\]-1 ~1 C//, e) dl(vjlt~(Z), vo,_,, vm - Ctil + 1)(1 - 5(vj, vj-1)). (48) Pr (II\[i\]l = 11&quot;1 ,'o, --The number of vacancies up to j is the same as the number of vacancies up to j - 1 only when j is not itself vacant. The last factor, therefore, is 1 when j is vacant and 0 otherwise. In the final parameter of dl, vm is the number of vacancies remaining in the French string. If ~b\[i\] = 1, then 7.\[i11 may be placed in any of these vacancies; if ~b\[i\] = 2, 7-\[i\]1 may be placed in any but the last of these vacancies; in general, 7-\[,11 may be placed in any but the rightmost ~b\[,\] - 1 of the remaining vacancies. Because 7-\[/\]1 must occupy the leftmost place of any of the words from T\[,\], we must take care to leave room at the end of the string for the remaining words from this tablet. As with Model 4, we allow dl to depend on the center of the previous cept and on ~, but we suppress the dependence on eli-l\] since we should otherwise have too many parameters.</Paragraph> <Paragraph position="2"> For \[i\] > 0 and k > 1, we assume</Paragraph> <Paragraph position="4"> Again, the final factor enforces the constraint that 7.\[i\]k land in a vacant position, and, again, we assume that the probability depends on ~ only through its class. Model 5 is described in more detail in Appendix B.</Paragraph> <Paragraph position="5"> As with Model 4, we leave the details of the count and reestimation formulae to the reader. No incremental evaluation of the likelihood of neighbors is possible with Model 5 because a move or swap may require wholesale recomputation of the likelihood of an alignment. Therefore, when we evaluate expectations for Model 5, we include only the alignments in S as defined in Equation (47). We further trim these alignments by removing any alignment a, for which Pr(ale, f;4) is too much smaller than Pr(bdegdeg(V(fle; 2)le, f; 4).</Paragraph> <Paragraph position="6"> Model 5 is a powerful but unwieldy ally in the battle to align translations. It must be led to the battlefield by its weaker but more agile brethren Models 2, 3, and 4. In fact, this is the raison d'etre of these models. To keep them aware of the lay of the land, we adjust their parameters as we carry out iterations of the EM algorithm for Model 5. That is, we collect counts for Models 2, 3, and 4 by summing over alignments as determined by the abbreviated S described above, using Model 5 to compute Pr(ale, f). Although this appears to increase the storage necessary for maintaining counts as we proceed through the training data, the extra burden is small because the overwhelming majority of the storage is devoted to counts for t(fle ), and these are the same for Models 2, 3, 4, and 5.</Paragraph> </Section> </Section> <Section position="6" start_page="281" end_page="286" type="metho"> <SectionTitle> 5. Results </SectionTitle> <Paragraph position="0"> We have used a large collection of training data to estimate the parameters of the models described above. Brown, Lai, and Mercer (1991) have described an algorithm with which one can reliably extract French and English sentences that are translations of one another from parallel corpora. They used the algorithm to extract a large number of translations from several years of the proceedings of the Canadian parliament. 
From these translations, we have chosen as our training data those for which both the English sentence and the French sentence are 30 or fewer words in length. This is a collection of 1,778,620 translations. In an effort to eliminate some of the typographical errors that abound in the text, we have chosen as our English vocabulary all of those words that appear at least twice in English sentences in our data, and as our French vocabulary all of those words that appear at least twice in French sentences in our data. All other words we replace with a special unknown English word or unknown French word accordingly as they appear in an English sentence or a French sentence. We arrive in this way at an English vocabulary of 42,005 words and a French vocabulary of 58,016 words. Some typographical errors are quite frequent, for example, momento for memento, and so our vocabularies are not completely free of them. At the same time, some words are truly rare, and so we have, in some cases, snubbed legitimate words.</Paragraph> <Paragraph position="1"> Adding e0 to the English vocabulary brings it to 42,006 words.</Paragraph> <Paragraph position="2"> We have carried out 12 iterations of the EM algorithm for this data. We initialized the process by setting each of the 2,437, 020,096 translation probabilities, t(fle), to 1/58,016. That is, we assume each of the 58,016 words in the French vocabulary to be equally likely as a translation for each of the 42,006 words in the English vocabulary.</Paragraph> <Paragraph position="3"> For t(f\[e) to be greater than zero at the maximum likelihood solution for one of our models, f and e must occur together in at least one of the translations in our training data. This is the case for only 25,427, 016 pairs, or about one percent of all translation probabilities. On the average, then, each English word appears with about 605 French words.</Paragraph> <Paragraph position="4"> Table 1 summarizes our training computation. At each iteration, we compute the probabilities of the various alignments of each translation using one model, and collect counts using a second (possibly different) model. These are referred to in the table as the In model and the Out model, respectively. After each iteration, we retain individual values only for those translation probabilities that surpass a threshold; the remainder we set to a small value (10-12). This value is so small that it does not affect the normalization conditions, but is large enough that translation probabilities can be resurrected during later iterations. We see in columns 4 and 5 that even though we lower the threshold as iterations progress, fewer and fewer probabilities survive. By the final iteration, only 1,658,364 probabilities survive, an average of about 39 French words for each English word.</Paragraph> <Paragraph position="5"> Although the entire t array has 2,437, 020,096 entries, and we need to store it twice, once as probabilities and once as counts, it is clear from the preceeding remarks that we need never deal with more than about 25 million counts or about 12 million probabilities. We store these two arrays using standard sparse matrix techniques. We Computational Linguistics Volume 19, Number 2 keep counts as pairs of bytes, but allow for overflow into 4 bytes if necessary. In this way, it is possible to run the training program in less than 100 megabytes of memory. 
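The bookkeeping just described can be pictured with a small sketch (ours, not the authors' implementation): only co-occurring pairs are ever given explicit entries, and after each iteration probabilities below the threshold are pruned back to the shared floor of 10^-12, from which they can later be resurrected.

```python
UNIFORM = 1.0 / 58016     # initial t(f|e): one over the French vocabulary size
FLOOR   = 1e-12           # value to which pruned probabilities are collapsed

def initialize_t(bitext):
    """Give an explicit, uniform entry only to the (f, e) pairs that actually
    co-occur in some training translation; all other pairs stay implicit,
    since they can never rise above zero at the maximum likelihood solution."""
    t = {}
    for f_sent, e_sent in bitext:
        for f in f_sent:
            for e in e_sent:
                t[(f, e)] = UNIFORM
    return t

def prune(t, threshold):
    """After an iteration, keep explicit entries only for probabilities above
    the threshold; everything else falls back to the shared floor, from which
    it can be resurrected if a later iteration pushes its count back up."""
    return {pair: p for pair, p in t.items() if p > threshold}

def lookup(t, f, e):
    return t.get((f, e), FLOOR)
```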
While this number would have seemed extravagant a few years ago, today it is available at modest cost in a personal workstation.</Paragraph> <Paragraph position="6"> As we have described, when the In model is neither Model 1 nor Model 2, we evaluate the count sums over only some of the possible alignments. Many of these alignments have a probability much smaller than that of the Viterbi alignment. The column headed Alignments in Table 1 shows the average number of alignments for which the probability is within a factor of 25 of the probability of the Viterbi alignment in each iteration. As this number drops, the model concentrates more and more probability onto fewer and fewer alignments so that the Viterbi alignment becomes ever more dominant.</Paragraph> <Paragraph position="7"> The last column in the table shows the perplexity of the French text given the English text for the In model of the iteration. We expect the likelihood of the training data to increase with each iteration. We can think of this likelihood as arising from a product of factors, one for each French word in the training data. We have 28,850,104 French words in our training data, so the 28,850,104 th root of the likelihood is the average factor by which the likelihood is reduced for each additional French word.</Paragraph> <Paragraph position="8"> The reciprocal of this root is the perplexity shown in the table. As the likelihood increases, the perplexity decreases. We see a steady decrease in perplexity as the iterations progress except when we switch from Model 2 as the In model to Model 3. This sudden jump is not because Model 3 is a poorer model than Model 2, but because Model 3 is deficient: the great majority of its probability is squandered on objects that are not strings of French words. As we have argued, deficiency is not a problem. In our description of Model 1, we left Pr(mle ) unspecified. In quoting perplexities for Models 1 and 2, we have assumed that the length of the French string is Poisson with a mean that is a linear function of the length of the English string. Specifically, we have assumed that Pr(M : role ) = (Al)me-~Z/m!, with A equal to 1.09.</Paragraph> <Paragraph position="9"> It is interesting to see how the Viterbi alignments change as the iterations progress. In Figure 5, we show for several sentences the Viterbi alignment after iterations 1, 6, 7, and 12. Iteration 1 is the first iteration for Model 2, and iterations 6, 7, and 12 are the final iterations for Models 2, 3, and 5, respectively. In each example, we show the French sentence with a subscript affixed to each word to ease the reader's task in interpreting the list of numbers after each English word. In the first example, (Il me semble faire signe que oui I It seems to me that he is nodding), two interesting changes evolve over the course of the iterations. In the alignment for Model 1,//is correctly connected to he, but in all later alignments II is incorrectly connected to It. Models 2, 3, and 5 discount a connection of he to II because it is quite far away. We do not yet have a model with sufficient linguistic sophistication to make this connection properly. On the other hand, we see that nodding, which in Models 1, 2, and 3 is connected only to signe and oui, is correctly connected to the entire phrase faire signe que oui in Model 5. 
In the second example, (Voyez les profits que ils ont r~alis~s \[ Look at the profits they have made), Models 1, 2, and 3 incorrectly connect profits4 to both profits3 and rdalisds7, but with Model 5, profits4 is correctly connected only to profits3, and made7 is connected to r~alis~s7. Finally, in (De les promesses, de les promesses! I Promises, promises.), Promises1 is connected to both instances of promesses with Model 1; promises3 is connected to most of the French sentence with Model 2; the final punctuation of the English sentence is connected to both the exclamation point and, curiously, to des with Model 3; and only with Model 5 do we have a satisfying alignment of the two sentences. The orthography for the French sentence in the second example is Voyez les profits qu'ils ont rdalisds and in the third example is Des promesses, des promesses! We have restored the e to the end</Paragraph> <Paragraph position="11"> Figure 5 The progress of alignments with iteration. of qu' and have twice analyzed des into its constituents, de and les. We commit these and other petty pseudographic improprieties in the interest of regularizing the French text. In all cases, orthographic French can be recovered by rule from our corrupted versions.</Paragraph> <Paragraph position="12"> Figures 6-15 show the translation probabilities and fertilities after the final iteration of training for a number of English words. We show all and only those probabilities that are greater than 0.01. Some words, like nodding, in Figure 6, do not slip gracefully into French. Thus, we have translations like (Il fait signe que oui I He is nodding), (Il fait un signe de la t~te } He is nodding), (Il fait un signe de t~te affirmatif l He is nodding), or (II hoche la t~te affirmativement I He is nodding). As a result, nodding frequently has a large fertility and spreads its translation probability over a variety of words. In French, what is worth saying is worth saying in many different ways. We see another facet of this with words like should, in Figure 7, which rarely has a fertility greater than one but still produces many different words, among them devrait, devraient, devrions, doit, doivent, devons, and devrais. These are (just a fraction of the many) forms of the French verb devoir. Adjectives fare a little better: national, in Figure 8, almost never produces more than one word and confines itself to one of nationale, national, nationaux, and nationales, respectively the feminine, the masculine, the masculine plural, and the feminine plural of the corresponding French adjective. It is clear that our models would benefit from some kind of morphological processing to rein in the lexical exuberance of French.</Paragraph> <Paragraph position="13"> We see from the data for the, in Figure 9, that it produces le, la, les, and I' as we would expect. Its fertility is usually 1, but in some situations English prefers an article where French does not and so about 14% of the time its fertility is 0. Sometimes, as with farmers, in Figure 10, it is French that prefers the article. When this happens, the English noun trains to produce its translation together with an article. Thus, farmers typically has a fertility 2 and usually produces either agriculteurs or les. We include additional examples in Figures 11 through 15, which show the translation and fertility probabilities for external, answer, oil, former, and not. 
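The entries in these figures are simply the trained parameters read off above the 0.01 cutoff; a minimal sketch of that selection, with our own variable names (t for the translation table, n_fert for the fertility table), is:

```python
def table_for(e, t, n_fert, cutoff=0.01):
    """Collect, for one English word e, the translation probabilities t(f|e)
    and fertility probabilities n(phi|e) that exceed the cutoff, sorted by
    decreasing probability -- the selection rule used for the figures."""
    trans = sorted(((f, p) for (f, ee), p in t.items()
                    if ee == e and p > cutoff),
                   key=lambda item: item[1], reverse=True)
    ferts = sorted(((phi, p) for (phi, ee), p in n_fert.items()
                    if ee == e and p > cutoff),
                   key=lambda item: item[1], reverse=True)
    return trans, ferts
```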
Although we show the various probabilities to three decimal places, one must realize that the specific numbers that appear are peculiar to the training data that we used in obtaining them. They are not constants of nature relating the Platonic ideals of eternal English and eternal French. Had we used different sentences as training data, we might well have arrived at different numbers. For example, in Figure 9, we see that t(lelthe ) = 0.497 while the corresponding number from Figure 4 of Brown et al. (1990) is 0.610. The difference arises not from some instability in the training algorithms or some subtle shift in the languages in recent years, but from the fact that we have used 1,778,620 pairs of sentences covering virtually the complete vocabulary of the Hansard data for training, while they used only 40,000 pairs of sentences and restricted their attention to the 9,000 most common words in each of the two vocabularies.</Paragraph> <Paragraph position="14"> Figures 16, 17, and 18 show automatically derived alignments for three translations. In the terminology of Section 4.6, each alignment is ~o~ (V(fle; 2)). We stress that these alignments have been found by an algorithm that involves no explicit knowledge of either French or English. Every fact adduced to support them has been discovered algorithmically from the 1,778,620 translations that constitute our training data. This data, in turn, is the product of an algorithm the sole linguistic input of which is a set of rules explaining how to find sentence boundaries in the two languages. We may justifiably claim, therefore, that these alignments are inherent in the Canadian Hansard data itself.</Paragraph> <Paragraph position="15"> In the alignment shown in Figure 16, all but one of the English words has fertility 1.</Paragraph> <Paragraph position="16"> The final prepositional phrase has been moved to the front of the French sentence, but otherwise the translation is almost verbatim. Notice, however, that the new proposal has been translated into les nouvelles propositions, demonstrating that number is not an invariant under translation. The empty cept has fertility 5 here. It generates enl, de3, the comma, del6, and de18.</Paragraph> <Paragraph position="17"> In Figure 17, two of the English words have fertility 0, one has fertility 2, and one, embattled, has fertility 5. Embattled is another word, like nodding, that eludes the French grasp and comes with a panoply of multi-word translations.</Paragraph> <Paragraph position="18"> The final example, in Figure 18, has several features that bear comment. The second word, Speaker, is connected to the sequence l'Orateur. Like farmers above, it has trained to produce both the word that we naturally think of as its translation and the associated article. In our data, Speaker always has fertility 2 and produces equally often l'Orateur and le president. Later in the sentence, starred is connected to the phrase marquees de un astdrisque. From an initial situation in which each French word is equally probable as a translation of starred, we have arrived, through training, at a situation where it is possible to connect starred to just the right string of four words. Near the end of the sentence, give is connected to donnerai, the first person singular future of donner, which means to give. We should be more comfortable if both will and give were connected to donnerai, but by limiting cepts to no more than one word, we have precluded this possibility. 
Finally, the last 12 words of the English sentence, I now have the answer and will give it to the House, clearly correspond to the last 7 words of the French sentence, je donnerai la réponse à la Chambre, but, literally, the French is I will give the answer to the House. There is nothing about now, have, and, or it, and each of these words has fertility 0. Translations that are as far as this from the literal are rather more the rule than the exception in our training data. One might cavil at the connection of la réponse to the answer rather than to it. We do not.</Paragraph> </Section> <Section position="7" start_page="286" end_page="294" type="metho"> <SectionTitle> 6. Better Translation Models </SectionTitle> <Paragraph position="0"> Models 1-5 provide an effective means for obtaining word-by-word alignments of translations, but as a means to achieve our real goal, which is translation, there is room for improvement. We have seen that by ignoring the morphological structure of the two languages we dilute the strength of our statistical model, explaining, for example, each of the several tens of forms of each French verb independently. We have seen that by ignoring multi-word cepts, we are forced to give a false, or at least an unsatisfactory, account of some features in many translations. And finally, we have seen that our models are deficient, either in fact, as with Models 3 and 4, or in spirit, as with Models 1, 2, and 5.</Paragraph> <Section position="1" start_page="292" end_page="292" type="sub_section"> <SectionTitle> 6.1 The Truth about Deficiency </SectionTitle> <Paragraph position="0"> We have argued in Section 2 that neither spiritual nor actual deficiency poses a serious problem, but this is not entirely true. Let w(e) be the sum of Pr(f|e) over well-formed French strings and let i(e) be the sum over ill-formed French strings. In a deficient model, w(e) + i(e) < 1. We say that the remainder of the probability is concentrated on the event failure and we write w(e) + i(e) + Pr(failure|e) = 1. Clearly, a model is deficient precisely when Pr(failure|e) > 0. If Pr(failure|e) = 0, but i(e) > 0, then the model is spiritually deficient. If w(e) were independent of e, neither form of deficiency would pose a problem, but because our models have no long-term constraints, w(e) decreases exponentially with l. When computing alignments, even this creates no problem because e and f are known. If, however, we are given f and asked to discover ê, then we will find that the product Pr(e) Pr(f|e) is too small for long English strings as compared with short ones. As a result, we will improperly favor short English strings. We can counteract this tendency in part by replacing Pr(f|e) with c^l Pr(f|e) for some empirically chosen constant c. This is treatment of the symptom rather than treatment of the disease itself, but it offers some temporary relief. The cure lies in better modeling.</Paragraph>
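As a concrete illustration of this stopgap, the following sketch rescores competing English hypotheses with the length bonus. It assumes, as reconstructed above, that the correction multiplies the likelihood by c raised to the length l of the English string; the function name, the candidate tuples, and the particular value of log c are ours, chosen only for illustration.

    def corrected_log_score(log_pr_e, log_pr_f_given_e, l, log_c=0.5):
        # log Pr(e) + l*log(c) + log Pr(f|e); the l*log(c) term offsets the tendency
        # of Pr(f|e) to shrink with the length l of the English string e.
        # log_c is an arbitrary placeholder for the empirically chosen constant.
        return log_pr_e + l * log_c + log_pr_f_given_e

    # Rank two hypothetical English candidates for the same French string:
    candidates = [
        ("short hypothesis", -5.0, -22.0, 2),    # (gloss, log Pr(e), log Pr(f|e), l)
        ("longer hypothesis", -9.0, -24.0, 6),
    ]
    best = max(candidates, key=lambda c: corrected_log_score(c[1], c[2], c[3]))

With log_c = 0 the score reduces to the uncorrected product Pr(e) Pr(f|e); larger values of log_c shift the balance toward longer English hypotheses.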
</Section> <Section position="2" start_page="292" end_page="293" type="sub_section"> <SectionTitle> 6.2 Viterbi Training </SectionTitle> <Paragraph position="0"> As we progress from Model 1 to Model 5, evaluating the expectations that give us counts becomes increasingly difficult. For Models 1 and 2, we are able to include the contribution of each of the (l + 1)^m possible alignments exactly. For later models, we include the contributions of fewer and fewer alignments. Because most of the probability for each translation is concentrated by these models on a small number of alignments, this suboptimal procedure, mandated by the complexity of the models, yields acceptable results.</Paragraph> <Paragraph position="1"> In the limit, we can contemplate evaluating the expectations using only a single, probable alignment for each translation. When that alignment is the Viterbi alignment, we call this Viterbi training. It is easy to see that Viterbi training converges: at each step, we reestimate parameters so as to make the current set of Viterbi alignments as probable as possible; when we use these parameters to compute a new set of Viterbi alignments, we find either the old set or a set that is yet more probable. Since the probability can never be greater than one, this process must converge. In fact, unlike the EM algorithm in general, it must converge in a finite, though impractically large, number of steps because each translation has only a finite number of alignments.</Paragraph> <Paragraph position="2"> In practice, we are never sure that we have found the Viterbi alignment. If we reinterpret the Viterbi alignment to mean the most probable alignment that we can find rather than the most probable alignment that exists, then a similarly reinterpreted Viterbi training algorithm still converges. We have already used this algorithm successfully as a part of a system to assign senses to English and French words on the basis of the context in which they appear (Brown et al. 1991a, 1991b). We expect to use it in models that we develop beyond Model 5.</Paragraph>
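The loop itself is simple. The following is a toy, self-contained sketch of Viterbi training in a Model 1-like setting, where the only parameters are translation probabilities t(f|e) and the Viterbi alignment can be found exactly; the function names, the table format, and this simplification are ours and stand in for the model-specific search and count-collection routines used with the later models.

    from collections import defaultdict

    def viterbi_align(f_words, e_words, t):
        # For a Model 1-like table t, the Viterbi alignment simply connects each
        # French word to the English word (position 0 is the empty cept) that
        # maximizes t(f | e).
        cepts = ["NULL"] + list(e_words)
        return tuple(max(range(len(cepts)),
                         key=lambda i: t.get(cepts[i], {}).get(f, 1e-12))
                     for f in f_words)

    def reestimate(translations, alignments):
        # Relative-frequency estimates of t(f | e) from the aligned word pairs:
        # the choice that makes the current set of alignments as probable as possible.
        counts = defaultdict(lambda: defaultdict(float))
        for (f_words, e_words), a in zip(translations, alignments):
            cepts = ["NULL"] + list(e_words)
            for j, i in enumerate(a):
                counts[cepts[i]][f_words[j]] += 1.0
        return {e: {f: c / sum(cs.values()) for f, c in cs.items()}
                for e, cs in counts.items()}

    def viterbi_training(translations, t, max_iterations=20):
        alignments = None
        for _ in range(max_iterations):
            new = [viterbi_align(f, e, t) for f, e in translations]
            if new == alignments:      # no alignment changed: converged
                break
            alignments = new
            t = reestimate(translations, alignments)
        return t, alignments

The initial table t can be any assignment of translation distributions to English words, for example one transferred from a previously trained model; the loop stops as soon as a pass leaves every alignment unchanged.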
</Section> <Section position="3" start_page="293" end_page="293" type="sub_section"> <SectionTitle> 6.3 Multi-Word Cepts </SectionTitle> <Paragraph position="0"> In Models 1-5, we restrict our attention to alignments with cepts containing no more than one word each. Except in Models 4 and 5, cepts play little rôle in our development.</Paragraph> <Paragraph position="1"> Even in these models, cepts are determined implicitly by the fertilities of the words in the alignment: words for which the fertility is greater than zero make up one-word cepts; those for which it is zero do not. We can easily extend the generative process upon which Models 3, 4, and 5 are based to encompass multi-word cepts. We need only include a step for selecting the ceptual scheme and ascribe fertilities to cepts rather than to words, requiring that the fertility of each cept be greater than zero.</Paragraph> <Paragraph position="2"> Then, in Equation (29), we can replace the products over words in an English string with products over cepts in the ceptual scheme.</Paragraph> <Paragraph position="3"> When we venture beyond one-word cepts, however, we must tread lightly. An English string can contain any of 42,005 one-word cepts, but there are more than 1.7 billion possible two-word cepts, more than 74 trillion three-word cepts, and so on. Clearly, one must be discriminating in choosing potential multi-word cepts. The caution that we have displayed thus far in limiting ourselves to cepts with fewer than two words was motivated primarily by our respect for the featureless desert that multi-word cepts offer a priori. The Viterbi alignments that we have computed with Model 5 give us a frame of reference from which to expand our horizons to multi-word cepts. By inspecting them, we can find translations for a given multi-word sequence.</Paragraph> <Paragraph position="4"> We need only promote a multi-word sequence to cepthood when these translations differ substantially from what we might expect on the basis of the individual words that it contains. In English, either a boat or a person can be left high and dry, but in French, un bateau is not left haut et sec, nor une personne haute et sèche. Rather, a boat is left échoué and a person en plan. High and dry, therefore, is a promising three-word cept because its translation is not compositional.</Paragraph> </Section> <Section position="4" start_page="293" end_page="294" type="sub_section"> <SectionTitle> 6.4 Morphology </SectionTitle> <Paragraph position="0"> We treat each distinct sequence of letters as a distinct word. In English, for example, we recognize no kinship among the several forms of the verb to eat (eat, ate, eaten, eats, and eating). In French, irregular verbs have many forms. In Figure 7, we have already seen 7 forms of devoir. Altogether, it has 41 different forms. And there would be 42 if the French did not inexplicably drop the circumflex from the masculine plural past participle (dus), thereby causing it to collide with the first and second person singular in the passé simple, no doubt a source of endless confusion for the beleaguered francophone.</Paragraph> <Paragraph position="1"> The French make do with fewer forms for the multitude of regular verbs that are the staple diet of everyday speech. Thus, manger (to eat) has only 39 forms (manger, mange, manges, ..., mangeassent). Models 1-5 must learn to connect the 5 forms of to eat to the 39 forms of manger. In the 28,850,104 French words that make up our training data, only 13 of the 39 forms of manger actually appear. Of course, it is only natural that in the proceedings of a parliament, forms of manger are less numerous than forms of parler (to speak), but even for parler, only 28 of the 39 forms occur in our data. If we were to encounter a rare form of one of these words, say, parlassions or mangeassent, we would have no inkling of its relationship to speak or eat. A similar predicament besets nouns and adjectives as well. For example, composition is among the most common words in our English vocabulary, but compositions is among the least common words.</Paragraph> <Paragraph position="2"> We plan to ameliorate these problems with a simple inflectional analysis of verbs, nouns, adjectives, and adverbs, so that the relatedness of the several forms of the same word is manifest in our representation of the data. For example, we wish to make evident the common pedigree of the different conjugations of a verb in French and in English; of the singular and plural, and singular possessive and plural possessive forms of a noun in English; of the singular, plural, masculine, and feminine forms of a noun or adjective in French; and of the positive, comparative, and superlative forms of an adjective or adverb in English.</Paragraph> <Paragraph position="3"> Thus, our intention is to transform (je mange la pêche | I eat the peach) into, e.g., (je manger,13spres la pêche | I eat,x3spres the peach). Here, eat is analyzed into a root, eat, and an ending, x3spres, that indicates the present tense form used except in the third person singular. Similarly, mange is analyzed into a root, manger, and an ending, 13spres, that indicates the present tense form used for the first and third persons singular.</Paragraph>
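A minimal sketch of such an invertible analysis, covering only this one example (the table entries and ending codes mirror the example above; a real analyzer would of course cover the full inflectional systems of both languages):

    # Toy invertible inflectional analysis for the single example above.
    # x3spres: present tense form used except in the third person singular;
    # 13spres: present tense form used for the first and third persons singular.
    ANALYSES = {
        "eat":   ("eat", "x3spres"),
        "mange": ("manger", "13spres"),
    }
    SYNTHESES = {v: k for k, v in ANALYSES.items()}

    def analyze(word):
        root, ending = ANALYSES.get(word, (word, None))
        return f"{root},{ending}" if ending else word

    def synthesize(token):
        if "," in token:
            root, ending = token.split(",", 1)
            return SYNTHESES.get((root, ending), token)
        return token

    sentence = "je mange la pêche".split()
    analyzed = [analyze(w) for w in sentence]      # ['je', 'manger,13spres', 'la', 'pêche']
    restored = [synthesize(w) for w in analyzed]   # ['je', 'mange', 'la', 'pêche']
    assert restored == sentence

Applying analyze to each word of the pair above yields (je manger,13spres la pêche | I eat,x3spres the peach), and synthesize recovers the original strings, as the assertion checks.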
<Paragraph position="4"> These transformations are invertible and should reduce the French vocabulary by about 50% and the English vocabulary by about 20%. We hope that this will significantly improve the statistics in our models.</Paragraph> </Section> </Section> <Section position="8" start_page="294" end_page="295" type="metho"> <SectionTitle> 7. Discussion </SectionTitle> <Paragraph position="0"> That interesting bilingual lexical correlations can be extracted automatically from a large bilingual corpus was pointed out by Brown et al. (1988). The algorithm that they describe is, roughly speaking, equivalent to carrying out the first iteration of the EM algorithm for our Model 1 starting from an initial guess in which each French word is equally probable as a translation for each English word. They were unaware of a connection to the EM algorithm, but they did realize that their method is not entirely satisfactory. For example, once it is clearly established that in (La porte est rouge | The door is red), it is red that produces rouge, one is uncomfortable using this sentence as support for red producing porte or door producing rouge. They suggest removing words once a correlation between them has been clearly established and then reprocessing the resulting impoverished translations, hoping to recover less obvious correlations now revealed by the departure of their more prominent relatives. From our present perspective, we see that the proper way to proceed is simply to carry out more iterations of the EM algorithm. The likelihood for Model 1 has a unique local maximum for any set of training data. As iterations proceed, the count for porte as a translation of red will dwindle away.</Paragraph> <Paragraph position="1"> In a later paper, Brown et al. (1990) describe a model that is essentially the same as our Model 3. They sketch the EM algorithm and show that, once trained, their model can be used to extract word-by-word alignments for pairs of sentences. They did not realize that the logarithm of the likelihood for Model 1 is concave and, hence, has a unique local maximum. They were also unaware of the trick by which we are able to sum over all alignments when evaluating the counts for Models 1 and 2, and of the trick by which we are able to sum over all alignments when transferring parameters from Model 2 to Model 3. As a result, they were unable to handle large vocabularies and so restricted themselves to vocabularies of only 9,000 words. Nonetheless, they were able to align phrases in French with the English words that produce them as illustrated in their Figure 3.</Paragraph> <Paragraph position="2"> More recently, Gale and Church (1991a) describe an algorithm similar to the one described in Brown et al. (1988). Like Brown et al., they consider only the simultaneous appearance of words in pairs of sentences that are translations of one another. Although algorithms like these are extremely simple, many of the correlations between English and French words are so pronounced as to fall prey to almost any effort to expose them. Thus, the correlation of pairs like (eau | water), (lait | milk), (pourquoi | why), (chambre | house), and many others, simply cannot be missed. They shout from the data, and any method that is not stone deaf will hear them.
But many of the correlations speak in a softer voice: to hear them clearly, we must model the translation process, as Brown et al. (1988) suggest and as Brown et al. (1990) and the current paper actually do. Only in this way can one hope to hear the quiet call of (marquées d'un astérisque | starred) or the whisper of (qui s'est fait bousculer | embattled).</Paragraph> <Paragraph position="3"> The series of models that we have described constitutes a mathematical embodiment of the powerfully compelling intuitive feeling that a word in one language can be translated into a word or phrase in another language. In some cases, there may be several or even several tens of translations, depending on the context in which the word appears, but we should be quite surprised to find a word with hundreds of mutually exclusive translations. Although we use these models as part of an automatic system for translating French into English, they provide, as a byproduct, very satisfying accounts of the word-by-word alignment of pairs of French and English strings.</Paragraph> <Paragraph position="4"> Our work has been confined to French and English, but we believe that this is purely adventitious: had the early Canadian trappers been Manchurians later to be outnumbered by swarms of conquistadores, and had the two cultures clung stubbornly each to its native tongue, we should now be aligning Spanish and Chinese. We conjecture that local alignment of the component parts of any corpus of parallel texts is inherent in the corpus itself, provided only that it be large enough. Between any pair of languages where mutual translation is important enough that the rate of accumulation of translated examples sufficiently exceeds the rate of mutation of the languages involved, there must eventually arise such a corpus.</Paragraph> <Paragraph position="5"> The linguistic content of our program thus far is scant indeed. It is limited to one set of rules for analyzing a string of characters into a string of words, and another set of rules for analyzing a string of words into a string of sentences. Doubtless even these can be recast in terms of some information-theoretic objective function. But it is not our intention to ignore linguistics, nor to replace it. Rather, we hope to enfold it in the embrace of a secure probabilistic framework so that the two together may draw strength from one another and guide us to better natural language processing systems in general and to better machine translation systems in particular.</Paragraph> </Section> <Section position="9" start_page="295" end_page="295" type="metho"> <SectionTitle> Acknowledgments </SectionTitle> <Paragraph position="0"> We would like to thank many of our colleagues who read and commented on early versions of the manuscript, especially John Lafferty. We would also like to thank the reviewers, who made a number of invaluable suggestions about the organization of the paper and pointed out many weaknesses in our original manuscript. If any weaknesses remain, it is not because of their failure to point them out, but because of our ineptness at responding adequately to their criticisms.</Paragraph> </Section> </Paper>