<?xml version="1.0" standalone="yes"?> <Paper uid="J96-1002"> <Title>Renaissance Technologies</Title> <Section position="3" start_page="0" end_page="40" type="metho"> <SectionTitle> 2. A Maximum Entropy Overview </SectionTitle> <Paragraph position="0"> We introduce the concept of maximum entropy through a simple example. Suppose we wish to model an expert translator's decisions concerning the proper French rendering of the English word in. Our model p of the expert's decisions assigns to each French word or phrase f an estimate, p(f), of the probability that the expert would choose f as a translation of in. To guide us in developing p, we collect a large sample of instances of the expert's decisions. Our goal is to extract a set of facts about the decision-making process from the sample (the first task of modeling) that will aid us in constructing a model of this process (the second task).</Paragraph> <Paragraph position="1"> One obvious clue we might glean from the sample is the list of allowed translations. For example, we might discover that the expert translator always chooses among the following five French phrases: {dans, en, ?l, au cours de, pendant}. With this information in hand, we can impose our first constraint on our model p: p(dans) + p(en) + p(h) + p(au cours de) + p(pendant) = 1 This equation represents our first statistic of the process; we can now proceed to search for a suitable model that obeys this equation. Of course, there are an infinite number of models p for which this identity holds. One model satisfying the above equation is p(dans) = 1; in other words, the model always predicts dans. Another model obeying this constraint predicts pendant with a probability of 1/2, and ~ with a probability of 1/2. But both of these models offend our sensibilities: knowing only that the expert always chose from among these five French phrases, how can we justify either of these probability distributions? Each seems to be making rather bold assumptions, with no empirical justification. Put another way, these two models assume more than we actually know about the expert's decision-making process. All we know is that the expert chose exclusively from among these five French phrases; given this, Computational Linguistics Volume 22, Number I these questions, how do we go about finding the most uniform model subject to a set of constraints like those we have described? The maximum entropy method answers both of these questions, as we will demonstrate in the next few pages. Intuitively, the principle is simple: model all that is known and assume nothing about that which is unknown. In other words, given a collection of facts, choose a model consistent with all the facts, but otherwise as uniform as possible. This is precisely the approach we took in selecting our model p at each step in the above example.</Paragraph> <Paragraph position="2"> The maximum entropy concept has a long history. Adopting the least complex hypothesis possible is embodied in Occam's razor (&quot;Nunquam ponenda est pluralitas sine necesitate.') and even appears earlier, in the Bible and the writings of Herotodus (Jaynes 1990). Laplace might justly be considered the father of maximum entropy, having enunciated the underlying theme 200 years ago in his &quot;Principle of Insufficient Reason:&quot; when one has no information to distinguish between the probability of two events, the best strategy is to consider them equally likely (Guiasu and Shenitzer 1985). As E. T. 
Jaynes, a more recent pioneer of maximum entropy, put it (Jaynes 1990): ... the fact that a certain probability distribution maximizes entropy subject to certain constraints representing our incomplete information, is the fundamental property which justifies use of that distribution for inference; it agrees with everything that is known, but carefully avoids assuming anything that is not known. It is a transcription into mathematics of an ancient principle of wisdom ...</Paragraph> </Section> <Section position="4" start_page="40" end_page="48" type="metho"> <SectionTitle> 3. Maximum Entropy Modeling </SectionTitle> <Paragraph position="0"> We consider a random process that produces an output value y, a member of a finite set 3;. For the translation example just considered, the process generates a translation of the word in, and the output y can be any word in the set {dans, en, ?~, au cours de, pendant}.</Paragraph> <Paragraph position="1"> In generating y, the process may be influenced by some contextual information x, a member of a finite set X. In the present example, this information could include the words in the English sentence surrounding in.</Paragraph> <Paragraph position="2"> Our task is to construct a stochastic model that accurately represents the behavior of the random process. Such a model is a method of estimating the conditional probability that, given a context x, the process will output y. We will denote by p(ylx) the probability that the model assigns to y in context x. With a slight abuse of notation, we will also use p(ylx) to denote the entire conditional probability distribution provided by the model, with the interpretation that y and x are placeholders rather than specific instantiations. The proper interpretation should be clear from the context. We will denote by/~ the set of all conditional probability distributions. Thus a model p(y\[x) is, by definition, just an element of ~v.</Paragraph> <Section position="1" start_page="40" end_page="42" type="sub_section"> <SectionTitle> 3.1 Training Data </SectionTitle> <Paragraph position="0"> To study the process, we observe the behavior of the random process for some time, collecting a large number of samples (xl,yl), (x2, y2) ..... (XN, YN). In the example we have been considering, each sample would consist of a phrase x containing the words surrounding in, together with the translation y of in that the process produced. For now, we can imagine that these training samples have been generated by a human expert who was presented with a number of random phrases containing in and asked to choose a good translation for each. When we discuss real-world applications in Computational Linguistics Volume 22, Number 1 Combining (1), (2) and (3) yields the more explicit equation</Paragraph> <Paragraph position="2"> We call the requirement (3) a constraint equation or simply a constraint. 
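The relationship between a feature and its constraint is easy to state concretely in code. The sketch below is an illustration only, not taken from the paper: it uses a hypothetical toy sample and a hypothetical feature to show how the empirical expectation p̃(f) and the model expectation p(f) of a binary feature are computed and compared.

# Illustrative sketch (not from the paper): empirical vs. model expectation of a
# binary feature f, the two sides of the constraint p(f) = p̃(f) discussed above.
from collections import Counter

# A toy training sample of (context, translation) events; the contexts and words
# are hypothetical stand-ins for the (x, y) pairs of Section 3.1.
sample = [("in the house", "dans"), ("in April", "en"),
          ("in the house", "dans"), ("in a month", "pendant")]
N = len(sample)

# Empirical distributions p̃(x, y) and p̃(x).
p_xy = {pair: c / N for pair, c in Counter(sample).items()}
p_x = {x: c / N for x, c in Counter(x for x, _ in sample).items()}

def f(x, y):
    """Example binary feature: fires when 'April' appears in x and y == 'en'."""
    return 1 if ("April" in x and y == "en") else 0

def empirical_expectation(f):
    # p̃(f) = sum over (x, y) of p̃(x, y) f(x, y)
    return sum(p * f(x, y) for (x, y), p in p_xy.items())

def model_expectation(f, p_y_given_x, Y):
    # p(f) = sum over (x, y) of p̃(x) p(y|x) f(x, y)
    return sum(p_x[x] * p_y_given_x(y, x) * f(x, y) for x in p_x for y in Y)

Y = ["dans", "en", "à", "au cours de", "pendant"]
uniform = lambda y, x: 1.0 / len(Y)           # the maximally uniform model
print(empirical_expectation(f))                # p̃(f) = 0.25
print(model_expectation(f, uniform, Y))        # p(f) = 0.05, so the constraint is violated

A model satisfying the constraint would have to shift probability toward en in contexts containing April until the two expectations agree.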
By restricting attention to those models p(y|x) for which (3) holds, we are eliminating from consideration those models that do not agree with the training sample on how often the output of the process should exhibit the feature f.</Paragraph> <Paragraph position="3"> To sum up so far, we now have a means of representing statistical phenomena inherent in a sample of data (namely, p̃(f)), and also a means of requiring that our model of the process exhibit these phenomena (namely, p(f) = p̃(f)).</Paragraph> <Paragraph position="4"> One final note about features and constraints bears repeating: although the words "feature" and "constraint" are often used interchangeably in discussions of maximum entropy, we will be vigilant in distinguishing the two and urge the reader to do likewise. A feature is a binary-valued function of (x, y); a constraint is an equation between the expected value of the feature function in the model and its expected value in the training data.</Paragraph> </Section> <Section position="2" start_page="42" end_page="44" type="sub_section"> <SectionTitle> 3.3 The Maximum Entropy Principle </SectionTitle> <Paragraph position="0"> Suppose that we are given n feature functions f_i, which determine statistics we feel are important in modeling the process. We would like our model to accord with these statistics. That is, we would like p to lie in the subset C of P defined by

C ≡ { p ∈ P | p(f_i) = p̃(f_i) for i ∈ {1, 2, ..., n} }   (4)

Figure 1 provides a geometric interpretation of this setup. Here P is the space of all (unconditional) probability distributions on three points, sometimes called a simplex. If we impose no constraints (depicted in (a)), then all probability models are allowable. Imposing one linear constraint C1 restricts us to those p ∈ P that lie on the region defined by C1, as shown in (b). A second linear constraint could determine p exactly, if the two constraints are satisfiable; this is the case in (c), where the intersection of C1 and C2 is non-empty. Alternatively, a second linear constraint could be inconsistent with the first--for instance, the first might require that the probability of the first point is 1/3 and the second that the probability of the third point is 3/4--this is shown in (d). In the present setting, however, the linear constraints are extracted from the training sample and cannot, by construction, be inconsistent. Furthermore, the linear constraints in our applications will not even come close to determining p ∈ C uniquely as they do in (c); instead, the set C = C1 ∩ C2 ∩ ... ∩ Cn of allowable models will be infinite.</Paragraph> <Paragraph position="1"> Among the models p ∈ C, the maximum entropy philosophy dictates that we select the most uniform distribution. But now we face a question left open in Section 2: what does "uniform" mean? A mathematical measure of the uniformity of a conditional distribution p(y|x) is provided by the conditional entropy</Paragraph> <Paragraph position="3">

H(p) ≡ - Σ_{x,y} p̃(x) p(y|x) log p(y|x)   (5)

that is, the entropy of y given x with respect to the joint distribution p̃(x) p(y|x). To emphasize the dependence of the entropy on the probability distribution p, we have adopted the alternate notation H(p).</Paragraph> <Paragraph position="4"> Figure 1 Four different scenarios in constrained optimization. P represents the space of all probability distributions. In (a), no constraints are applied, and all p ∈ P are allowable.
In (b), the constraint C1 narrows the set of allowable models to those that lie on the line defined by the linear constraint. In (c), two consistent constraints C1 and C2 define a single model p ∈ C1 ∩ C2. In (d), the two constraints are inconsistent (i.e., C1 ∩ C3 = ∅); no p ∈ P can satisfy them both. The entropy is bounded from below by zero, the entropy of a model with no uncertainty at all, and from above by log |Y|, the entropy of the uniform distribution over all possible |Y| values of y. With this definition in hand, we are ready to present the principle of maximum entropy.</Paragraph> <Paragraph position="5"> Maximum Entropy Principle To select a model from a set C of allowed probability distributions, choose the model p* ∈ C with maximum entropy H(p):

p* = argmax_{p ∈ C} H(p)   (6)

</Paragraph> <Paragraph position="7"> It can be shown that p* is always well-defined; that is, there is always a unique model p* with maximum entropy in any constrained set C.</Paragraph> </Section> <Section position="3" start_page="44" end_page="46" type="sub_section"> <SectionTitle> 3.4 Parametric Form </SectionTitle> <Paragraph position="0"> The maximum entropy principle presents us with a problem in constrained optimization: find the p* ∈ C that maximizes H(p). In simple cases, we can find the solution to this problem analytically. This was true for the example presented in Section 2 when we imposed the first two constraints on p. Unfortunately, the solution to the general problem of maximum entropy cannot be written explicitly, and we need a more indirect approach. (The reader is invited to try to calculate the solution for the same example when the third constraint is imposed.) To address the general problem, we apply the method of Lagrange multipliers from the theory of constrained optimization. The relevant steps are outlined here; the reader is referred to Della Pietra et al. (1995) for a more thorough discussion of constrained optimization as applied to maximum entropy.</Paragraph> <Paragraph position="1"> * We will refer to the original constrained optimization problem,

find p* = argmax_{p ∈ C} H(p)

as the primal problem.</Paragraph> <Paragraph position="4"> * For each feature f_i we introduce a parameter λ_i (a Lagrange multiplier). We define the Lagrangian Λ(p, λ) by

Λ(p, λ) ≡ H(p) + Σ_i λ_i ( p(f_i) - p̃(f_i) )   (7)

</Paragraph> <Paragraph position="6"> * Holding λ fixed, we compute the unconstrained maximum of the Lagrangian Λ(p, λ) over all p ∈ P. We denote by p_λ the p where Λ(p, λ) achieves its maximum and by Ψ(λ) the value at this maximum:

p_λ ≡ argmax_{p ∈ P} Λ(p, λ)   (8)
Ψ(λ) ≡ Λ(p_λ, λ)   (9)

We call Ψ(λ) the dual function. The functions p_λ and Ψ(λ) may be calculated explicitly using simple calculus. We find

p_λ(y|x) = (1 / Z_λ(x)) exp( Σ_i λ_i f_i(x, y) )   (10)
Ψ(λ) = - Σ_x p̃(x) log Z_λ(x) + Σ_i λ_i p̃(f_i)   (11)

where Z_λ(x) is the normalizing constant Z_λ(x) = Σ_y exp( Σ_i λ_i f_i(x, y) ).</Paragraph> <Paragraph position="10"> * Finally, we pose the unconstrained dual optimization problem:

find λ* = argmax_λ Ψ(λ)   (12)

</Paragraph> <Paragraph position="12"> At first glance it is not clear what these machinations achieve. However, a fundamental principle in the theory of Lagrange multipliers, called generically the Kuhn-Tucker theorem, asserts that under suitable assumptions, the primal and dual problems are, in fact, closely related. This is the case in the present situation. Although a detailed account of this relationship is beyond the scope of this paper, it is easy to state the final result: Suppose that λ* is the solution of the dual problem. Then p_{λ*} is the solution of the primal problem; that is, p_{λ*} = p*.
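To make this parametric form concrete, the following short sketch (not drawn from the authors' software) shows how a model of the form (10) assigns conditional probabilities once a parameter vector λ is fixed; the feature functions and parameter values here are purely hypothetical.

# Illustrative sketch (not from the paper): the exponential form of equation (10),
# p_lambda(y|x) = exp(sum_i lambda_i f_i(x, y)) / Z_lambda(x).
import math

def p_lambda(y, x, lambdas, features, Y):
    """Conditional log-linear model; `features` is a list of binary functions f_i(x, y)."""
    def score(y_prime):
        return math.exp(sum(l * f(x, y_prime) for l, f in zip(lambdas, features)))
    Z = sum(score(y_prime) for y_prime in Y)    # normalizing constant Z_lambda(x)
    return score(y) / Z

# Hypothetical example: two features over the translations of "in".
Y = ["dans", "en", "à", "au cours de", "pendant"]
features = [
    lambda x, y: 1 if ("April" in x and y == "en") else 0,
    lambda x, y: 1 if ("house" in x and y == "dans") else 0,
]
lambdas = [1.2, 0.7]                            # parameter values chosen arbitrarily
print(p_lambda("en", "in April", lambdas, features, Y))

With every λ_i equal to zero the model reduces to the uniform distribution, which is exactly the maximum entropy model when no constraints are imposed; training amounts to finding the λ* that maximizes the dual function Ψ(λ).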
In other words: The maximum entropy model subject to the constraints C has the parametric form p_{λ*} of (10), where the parameter values λ* can be determined by maximizing the dual function Ψ(λ).</Paragraph> <Paragraph position="13"> The most important practical consequence of this result is that any algorithm for finding the maximum λ* of Ψ(λ) can be used to find the maximum p* of H(p) for p ∈ C.</Paragraph> </Section> <Section position="4" start_page="46" end_page="47" type="sub_section"> <SectionTitle> 3.5 Relation to Maximum Likelihood </SectionTitle> <Paragraph position="0"> The log-likelihood L_p̃(p) of the empirical distribution p̃ as predicted by a model p is defined by

L_p̃(p) ≡ log Π_{x,y} p(y|x)^{p̃(x,y)} = Σ_{x,y} p̃(x, y) log p(y|x)   (13)

</Paragraph> <Paragraph position="2"> It is easy to check that the dual function Ψ(λ) of the previous section is, in fact, just the log-likelihood for the exponential model p_λ; that is

Ψ(λ) = L_p̃(p_λ)   (14)

</Paragraph> <Paragraph position="4"> With this interpretation, the result of the previous section can be rephrased as: The model p* ∈ C with maximum entropy is the model in the parametric family p_λ(y|x) that maximizes the likelihood of the training sample p̃.</Paragraph> <Paragraph position="5"> This result provides an added justification for the maximum entropy principle: If the notion of selecting a model p* on the basis of maximum entropy isn't compelling enough, it so happens that this same p* is also the model that can best account for the training sample, from among all models of the same parametric form (10).</Paragraph> <Paragraph position="6"> Table 1 summarizes the primal-dual framework we have established.</Paragraph> <Paragraph position="7"> Table 1 The duality of maximum entropy and maximum likelihood is an example of the more general phenomenon of duality in constrained optimization.</Paragraph> </Section> <Section position="5" start_page="47" end_page="47" type="sub_section"> <SectionTitle> Primal Dual </SectionTitle> <Paragraph position="0">

                    Primal                         Dual
problem             argmax_{p ∈ C} H(p)            argmax_λ Ψ(λ)
description         maximum entropy                maximum likelihood
type of search      constrained optimization       unconstrained optimization
search domain       p ∈ C                          real-valued vectors {λ1, λ2, ...}
solution            p*                             λ*

Kuhn-Tucker theorem: p* = p_{λ*}</Paragraph> </Section> <Section position="6" start_page="47" end_page="48" type="sub_section"> <SectionTitle> 3.6 Computing the Parameters </SectionTitle> <Paragraph position="0"> For all but the most simple problems, the λ* that maximize Ψ(λ) cannot be found analytically. Instead, we must resort to numerical methods. From the perspective of numerical optimization, the function Ψ(λ) is well behaved, since it is smooth and convex-∩ (that is, concave) in λ. Consequently, a variety of numerical methods can be used to calculate λ*. One simple method is coordinate-wise ascent, in which λ* is computed by iteratively maximizing Ψ(λ) one coordinate at a time. When applied to the maximum entropy problem, this technique yields the popular Brown algorithm (Brown 1959).</Paragraph> <Paragraph position="1"> Other general purpose methods that can be used to maximize Ψ(λ) include gradient ascent and conjugate gradient.</Paragraph> <Paragraph position="2"> An optimization method specifically tailored to the maximum entropy problem is the iterative scaling algorithm of Darroch and Ratcliff (1972). We present here a version of this algorithm specifically designed for the problem at hand; a proof of the monotonicity and convergence of the algorithm is given in Della Pietra et al.
(1995).</Paragraph> <Paragraph position="3"> The algorithm is applicable whenever the feature functions f_i(x, y) are nonnegative:

f_i(x, y) ≥ 0 for all i, x, and y   (15)

This is, of course, true for the binary-valued feature functions we are considering here. The algorithm generalizes the Darroch-Ratcliff procedure, which requires, in addition to the nonnegativity, that the feature functions satisfy Σ_i f_i(x, y) = 1 for all x, y.</Paragraph> <Paragraph position="4"> Algorithm 1: Improved Iterative Scaling
Input: Feature functions f_1, f_2, ..., f_n; empirical distribution p̃(x, y)
Output: Optimal parameter values λ*_i; optimal model p_{λ*}
1. Start with λ_i = 0 for all i ∈ {1, 2, ..., n}
2. Do for each i ∈ {1, 2, ..., n}:
   a. Let Δλ_i be the solution to

      Σ_{x,y} p̃(x) p_λ(y|x) f_i(x, y) exp( Δλ_i f#(x, y) ) = p̃(f_i)   (16)

      where f#(x, y) ≡ Σ_i f_i(x, y)
   b. Update the value of λ_i according to: λ_i ← λ_i + Δλ_i
3. Go to step 2 if not all the λ_i have converged
The key step in the algorithm is step (2a), the computation of the increments Δλ_i that solve (16). If f#(x, y) is constant (f#(x, y) = M for all x, y, say) then Δλ_i is given explicitly as

Δλ_i = (1 / M) log( p̃(f_i) / p_λ(f_i) )   (17)

If f#(x, y) is not constant, then Δλ_i must be computed numerically. A simple and effective way of doing this is by Newton's method. This method computes the solution α* of an equation g(α*) = 0 iteratively by the recurrence

α_{n+1} = α_n - g(α_n) / g'(α_n)   (18)

with an appropriate choice for α_0 and suitable attention paid to the domain of g.</Paragraph> </Section> </Section> <Section position="5" start_page="48" end_page="51" type="metho"> <SectionTitle> 4. Feature Selection </SectionTitle> <Paragraph position="0"> Earlier we divided the statistical modeling problem into two steps: finding appropriate facts about the data, and incorporating these facts into the model. Up to this point we have proceeded by assuming that the first task was somehow performed for us. Even in the simple example of Section 2, we did not explicitly state how we selected those particular constraints. That is, why is the fact that dans or à was chosen by the expert translator 50% of the time any more important than countless other facts contained in the data? In fact, the principle of maximum entropy does not directly concern itself with the issue of feature selection; it merely provides a recipe for combining constraints into a model. But the feature selection problem is critical, since the universe of possible constraints is typically in the thousands or even millions. In this section we introduce a method for automatically selecting the features to be included in a maximum entropy model, and then offer a series of refinements to ease the computational burden.</Paragraph> <Section position="1" start_page="48" end_page="51" type="sub_section"> <SectionTitle> 4.1 Motivation </SectionTitle> <Paragraph position="0"> We begin by specifying a large collection F of candidate features. We do not require a priori that these features are actually relevant or useful. Instead, we let the pool be as large as practically possible. Only a small subset of this collection of features will eventually be employed in our final model.</Paragraph> <Paragraph position="1"> If we had a training sample of infinite size, we could determine the "true" expected value for a candidate feature f ∈ F simply by computing the fraction of events in the sample for which f(x, y) = 1.
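Before continuing, it may help to see the update just described in executable form. The following is a minimal sketch, not the authors' implementation: it treats f#(x, y) as the constant M of equation (17), so when f# is not actually constant the increment is only an approximation and should really be computed numerically, as the text notes. All names and the training interface are hypothetical.

# Illustrative sketch (not from the paper): iterative scaling with the closed-form
# increment of equation (17).
import math

def train_iterative_scaling(samples, features, Y, iterations=200):
    """samples: list of (x, y) pairs; features: list of binary functions f_i(x, y)."""
    N = len(samples)
    xs = sorted(set(x for x, _ in samples))
    p_tilde_x = {x: sum(1 for s, _ in samples if s == x) / N for x in xs}
    # Empirical expectations p̃(f_i), the right-hand side of equation (16).
    p_tilde_f = [sum(f(x, y) for x, y in samples) / N for f in features]

    # M stands in for f#(x, y) = sum_i f_i(x, y); if f# is not constant, the exact
    # Δλ_i of (16) should instead be found numerically (e.g., by Newton's method).
    M = max(1, max(sum(f(x, y) for f in features) for x in xs for y in Y))

    lambdas = [0.0] * len(features)

    def p_lambda(y, x):
        def score(yy):
            return math.exp(sum(l * f(x, yy) for l, f in zip(lambdas, features)))
        return score(y) / sum(score(yy) for yy in Y)

    for _ in range(iterations):
        # Model expectations p_lambda(f_i) under the current parameters.
        p_f = [sum(p_tilde_x[x] * p_lambda(y, x) * f(x, y) for x in xs for y in Y)
               for f in features]
        for i in range(len(features)):
            if p_tilde_f[i] > 0 and p_f[i] > 0:
                lambdas[i] += math.log(p_tilde_f[i] / p_f[i]) / M   # equation (17)
    return lambdas, p_lambda

With these mechanics in hand, we return to the question of how reliably the expected value of a candidate feature can be estimated from a finite sample.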
In real-life applications, however, we are provided with only a small sample of N events, which cannot be trusted to represent the process fully and accurately. Specifically, we cannot expect that for every feature f ∈ F, the estimate of p̃(f) we derive from this sample will be close to its value in the limit as N grows large. Employing a larger (or even just a different) sample of data from the same process might result in different estimates of p̃(f) for many candidate features.</Paragraph> <Paragraph position="2"> We would like to include in the model only a subset S of the full set of candidate features F. We will call S the set of active features. The choice of S must capture as much information about the random process as possible, yet only include features whose expected values can be reliably estimated.</Paragraph> <Paragraph position="3"> By adding feature f to S, we obtain a new set of active features S ∪ f. Following (19), this set of features determines a set of models

C(S ∪ f) ≡ { p ∈ P | p(g) = p̃(g) for all g ∈ S ∪ f }   (20)

Adding the feature f allows the model p_{S∪f} to better account for the training sample; this results in a gain ΔL(S, f) in the log-likelihood of the training data

ΔL(S, f) ≡ L_p̃(p_{S∪f}) - L_p̃(p_S)   (23)

</Paragraph> <Paragraph position="7"> At each stage of the model-construction process, our goal is to select the candidate feature f ∈ F which maximizes the gain ΔL(S, f); that is, we select the candidate feature which, when adjoined to the set of active features S, produces the greatest increase in likelihood of the training sample. This strategy is implemented in the algorithm below.</Paragraph> <Paragraph position="8"> Algorithm 2: Basic Feature Selection
Input: Collection F of candidate features; empirical distribution p̃(x, y)
Output: Set S of active features; model p_S incorporating these features
1. Start with S = ∅; thus p_S is uniform
2. Do for each candidate feature f ∈ F:
   Compute the model p_{S∪f} using Algorithm 1
   Compute the gain in the log-likelihood from adding this feature using (23)
3. Check the termination condition
4. Select the feature f with maximal gain ΔL(S, f)
5. Adjoin f to S
6. Compute p_S using Algorithm 1
7. Go to step 2
One issue left unaddressed by this algorithm is the termination condition. Obviously, we would like a condition which applies exactly when all the "useful" features have been selected. One reasonable stopping criterion is to subject each proposed feature to cross-validation on a sample of data withheld from the initial data set. If the feature does not lead to an increase in likelihood of the withheld sample of data, the feature is discarded. We will have more to say about the stopping criterion in Section 5.3.</Paragraph> </Section> <Section position="2" start_page="51" end_page="51" type="sub_section"> <SectionTitle> 4.3 Approximate Gains </SectionTitle> <Paragraph position="0"> Algorithm 2 is not a practical method for incremental feature selection. For each candidate feature f ∈ F considered in step 2, we must compute the maximum entropy model p_{S∪f}, a task that is computationally costly even with the efficient iterative scaling algorithm introduced earlier. We therefore introduce a modification to the algorithm, making it greedy but much more feasible. We replace the computation of the gain ΔL(S, f) of a feature f with an approximation, which we will denote by ∼ΔL(S, f).</Paragraph> <Paragraph position="1"> Recall that the model p_S has a set of parameters λ, one for each feature in S.
The model p_{S∪f} contains this set of parameters, plus a single new parameter α, corresponding to f. Given this structure, we might hope that the optimal values for λ do not change as the feature f is adjoined to S. Were this the case, imposing an additional constraint would require only optimizing the single parameter α to maximize the likelihood. Unfortunately, when a new constraint is imposed, the optimal values of all parameters change.</Paragraph> <Paragraph position="2"> However, to make the feature-ranking computation tractable, we make the approximation that the addition of a feature f affects only α, leaving the λ-values associated with other features unchanged. That is, when determining the gain of f over the model p_S, we pretend that the best model containing features S ∪ f has the form

p^α_{S∪f}(y|x) = (1 / Z_α(x)) p_S(y|x) exp( α f(x, y) )   (24)

where Z_α(x) = Σ_y p_S(y|x) exp( α f(x, y) )   (25)

</Paragraph> <Paragraph position="4"> The only parameter distinguishing models of the form (24) is α. Among these models, we are interested in the one that maximizes the approximate gain

G_{S∪f}(α) ≡ L_p̃(p^α_{S∪f}) - L_p̃(p_S)   (26)

We will denote the gain of this model by

∼ΔL(S, f) ≡ max_α G_{S∪f}(α)   (27)

and the optimal model by

p̂_{S∪f} = argmax_{p^α_{S∪f}} G_{S∪f}(α)   (28)

Despite the rather unwieldy notation, the idea is simple. Computing the approximate gain in likelihood from adding feature f to p_S has been reduced to a simple one-dimensional optimization problem over the single parameter α, which can be solved by any popular line-search technique, such as Newton's method. This yields a great savings in computational complexity over computing the exact gain, an n-dimensional optimization problem requiring more sophisticated methods such as conjugate gradient. But the savings comes at a price: for any particular feature f, we are probably underestimating its gain, and there is a reasonable chance that we will select a feature f whose approximate gain ∼ΔL(S, f) was highest and pass over the feature f with maximal gain ΔL(S, f).</Paragraph> <Paragraph position="9"> A graphical representation of this approximation is provided in Figure 3. Here the log-likelihood is represented as an arbitrary convex function over two parameters: λ corresponds to the "old" parameter, and α to the "new" parameter. Holding λ fixed and adjusting α to maximize the log-likelihood involves a search over the darkened line, rather than a search over the entire space of (λ, α).

Figure 3 The likelihood L(p) is a convex function of its parameters. If we start from a one-constraint model whose optimal parameter value is λ = λ_0 and consider the increase in L_p̃(p) from adjoining a second constraint with the parameter α, the exact answer requires a search over (λ, α). We can simplify this task by holding λ = λ_0 constant and performing a line search over the possible values of the new parameter α. In (a), the darkened line represents the search space we restrict attention to. In (b), we show the reduced problem: a line search over α.</Paragraph> <Paragraph position="10"> The actual algorithms, along with the appropriate mathematical framework, are presented in the appendix.</Paragraph> </Section> </Section> <Section position="6" start_page="51" end_page="65" type="metho"> <SectionTitle> 5. Case Studies </SectionTitle> <Paragraph position="0"> In the next few pages we discuss several applications of maximum entropy modeling within Candide, a fully automatic French-to-English machine translation system under development at IBM.
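Before turning to these case studies, here is a brief sketch, not the authors' code, of the one-dimensional approximate-gain computation of Section 4.3: the parameters of the current model p_S are frozen and only the single new parameter α is searched, here over a crude grid rather than the Newton-style line search mentioned above. The interface (p_S as a callable, p̃(x, y) as a dictionary of weights) is an assumption made for the illustration.

# Illustrative sketch (not from the paper): approximate gain of a candidate feature f,
# as in equations (24)-(27), holding the old parameters fixed inside p_S.
import math

def approximate_gain(p_S, f, p_tilde_xy, Y, alphas=None):
    """Return (best alpha, approximate gain ~ΔL(S, f)) for a candidate binary feature f."""
    if alphas is None:
        alphas = [i / 10.0 for i in range(-30, 31)]   # grid stand-in for a line search

    def log_likelihood(p):
        # L_p̃(p) = sum over (x, y) of p̃(x, y) log p(y|x)
        return sum(w * math.log(p(y, x)) for (x, y), w in p_tilde_xy.items())

    base = log_likelihood(p_S)
    best_alpha, best_gain = 0.0, 0.0
    for a in alphas:
        def p_alpha(y, x, a=a):
            # p^α_{S∪f}(y|x) is proportional to p_S(y|x) exp(α f(x, y)), as in (24)
            Z = sum(p_S(yy, x) * math.exp(a * f(x, yy)) for yy in Y)
            return p_S(y, x) * math.exp(a * f(x, y)) / Z
        gain = log_likelihood(p_alpha) - base
        if gain > best_gain:
            best_alpha, best_gain = a, gain
    return best_alpha, best_gain

In an incremental feature-selection loop, a routine of this kind would be called once per candidate feature, and only the winning feature would trigger a full refit of all parameters with Algorithm 1.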
Over the past few years, we have used Candide as a test bed for exploring the efficacy of various techniques in modeling problems arising in machine translation.</Paragraph> <Paragraph position="1"> We begin in Section 5.1 with a review of the general theory of statistical translation, describing in some detail the models employed in Candide. In Section 5.2 we describe how we have applied maximum entropy modeling to predict the French translation of an English word in context. In Section 5.3 we describe maximum entropy models that predict differences between French word order and English word order. In Section 5.4 we describe a maximum entropy model that predicts how to divide a French sentence into short segments that can be translated sequentially.</Paragraph> <Paragraph position="2"> its sentence. Here al = 1, a2 = 2, a3 = a4 = 3, as = 4, and a6 = 5.</Paragraph> <Section position="1" start_page="53" end_page="55" type="sub_section"> <SectionTitle> 5.1 Review of Statistical Translation </SectionTitle> <Paragraph position="0"> When presented with a French sentence F, Candide's task is to find the English sentence E which is most likely given F:</Paragraph> <Paragraph position="2"> Candide estimates p(E)--the probability that a string E of English words is a well-formed English sentence--using a parametric model of the English language, commonly referred to as a language model. The system estimates p(FIE)--the probability that a French sentence F is a translation of E--using a parametric model of the process of English-to-French translation known as a translation model. These two models, plus a search strategy for finding the/~ that maximizes (30) for some F, comprise the engine of the translation system.</Paragraph> <Paragraph position="3"> We now briefly describe the translation model for the probability P(FIE); a more thorough account is provided in Brown et al. (1991). We imagine that an English sentence E generates a French sentence F in two steps. First, each word in E independently generates zero or more French words. These words are then ordered to give a French sentence F. We denote the ith word of E by ei and the jth word of F by yj. (We employ yj rather than the more intuitive }~ to avoid confusion with the feature function notation.) We denote the number of words in the sentence E by IEI and the number of words in the sentence F by IFI. The generative process yields not only the French sentence F but also an association of the words of F with the words of E. We call this association an alignment, and denote it by A. An alignment A is parametrized by a sequence of IFI numbers aj, with 1 < ai < IE\[. For every word position j in F, aj is the word position in E of the English word that generates yj. Figure 4 depicts a typical alignment.</Paragraph> <Paragraph position="4"> The probability p(FIE ) that F is the translation of E is expressed as the sum over all possible alignments A between E and F of the probability of F and A given E: p(FIE ) = ~_, p(F, AIE ) (31) A The sum in equation (31) is computationally unwieldy; it involves a sum over all IE\] IFI possible alignments between the words in the two sentences. We sometimes make the Berger, Della Pietra, and Della Pietra A Maximum Entropy Approach simplifying assumption that there exists one extremely probable alignment ,4, called the &quot;Viterbi alignment,&quot; for which</Paragraph> <Paragraph position="6"> where n(ei) denotes the number of French words aligned with ei. 
In this expression</Paragraph> <Paragraph position="8"> We call the model described by equations (31) and (33) the basic translation model.</Paragraph> <Paragraph position="9"> We take the probabilities p(nle ) and p(yr e) as the fundamental parameters of the model, and parametrize the distortion probability in terms of simpler distributions.</Paragraph> <Paragraph position="10"> Brown et al. (1991) describe a method of estimating these parameters to maximize the likelihood of a large bilingual corpus of English and French sentences. Their method is based on the Estimation-Maximization (EM) algorithm, a well-known iterative technique for maximum likelihood training of a model involving hidden statistics. For the basic translation model, the hidden information is the alignment A between E and F.</Paragraph> <Paragraph position="11"> We employed the EM algorithm to estimate the parameters of the basic translation model so as to maximize the likelihood of a bilingual corpus obtained from the proceedings of the Canadian Parliament. For historical reasons, these proceedings are sometimes called &quot;Hansards.&quot; Our Hansard corpus contains 3.6 million English-French sentence pairs, for a total of a little under 100 million words in each language. Table 2 shows our parameter estimates for the translation probabilities p(y\[in). The basic translation model has worked admirably: given only the bilingual corpus, with no additional knowledge of the languages or any relation between them, it has uncovered some highly plausible translations.</Paragraph> <Paragraph position="12"> Nevertheless, the basic translation model has one major shortcoming: it does not take the English context into account. That is, the model does not account for surrounding English words when predicting the appropriate French rendering of an English word. As we pointed out in Section 3, this is not how successful translation works.</Paragraph> <Paragraph position="13"> The best French translation of in is a function of the surrounding English words: if a month's time are the subsequent words, pendant might be more likely, but if thefiscal year 1992 are what follows, then dans is more likely. The basic model is blind to context, always assigning a probability of 0.3004 to dans and 0.0044 to pendant.</Paragraph> <Paragraph position="14"> This can yield errors when Candide is called upon to translate a French sentence.</Paragraph> <Paragraph position="15"> Examples of two such errors are shown in Figure 5. In the first example, the system has chosen an English sentence in which the French word sup&ieures has been rendered as superior when greater or higher is a preferable translation. With no knowledge of context, an expert translator is also quite likely to select superior as the English word generating Typical errors encountered in using EM-based model of Brown et al. in a French-to-English translation system.</Paragraph> <Paragraph position="16"> sup~rieures. But an expert privy to the fact that 50% was among the next few words might be more inclined to select greater or higher. 
Similarly, in the second example, the incorrect rendering of II as He might have been avoided had the translation model used the fact that the word following it is appears.</Paragraph> </Section> <Section position="2" start_page="55" end_page="59" type="sub_section"> <SectionTitle> 5.2 Context-Dependent Word Models </SectionTitle> <Paragraph position="0"> In the hope of rectifying these errors, we consider the problem of context-sensitive modeling of word translation. We envision, in practice, a separate maximum entropy model, pe(y\]x), for each English word e, where pe(ylx) represents the probability that an expert translator would choose y as the French rendering of e, given the surrounding English context x. This is just a slightly recast version of a longstanding problem in computational linguistics, namely, sense disambiguation--the determination of a word's sense from its context.</Paragraph> <Paragraph position="1"> We begin with a training sample of English-French sentence pairs (E, F) randomly extracted from the Hansard corpus, such that E contains the English word in. For each sentence pair, we use the basic translation model to compute the Viterbi alignment between E and F. Using this alignment, we then construct an (x, y) training event. The event consists of a context x containing the six words in E surrounding in and a future Berger, Della Pietra, and Della Pietra A Maximum Entropy Approach Table 3 Several actual training events for the maximum entropy translation model for in, extracted from the transcribed proceedings of the Canadian Parliament.</Paragraph> <Paragraph position="2"> de not given notice a letter to respect of the the fiscal year the same postal Canada , by the ordinary way y equal to the French word which is (according to the Viterbi alignment A) aligned with in. A few actual examples of such events for in are depicted in Table 3. Next we define the set of candidate features. For this application, we employ features that are indicator functions of simply described sets. Specifically, we consider functions f(x,y) that are one if y is some particular French word and the context x contains a given English word, and are zero otherwise. We employ the following notation to represent these features:</Paragraph> <Paragraph position="4"> fa(x,y) {10 otherwiseifY pen antand weeks I I I&quot; I&quot; I&quot; J Here fl = 1 when April follows in and en is the translation of in; f2 = 1 when weeks is one of the three words following in and pendant is the translation.</Paragraph> <Paragraph position="5"> The set of features under consideration is vast, but may be expressed in abbreviated form in Table 4. In the table, the symbol O is a placeholder for a possible French word and the symbol \[\] is a placeholder for a possible English word. The feature h mentioned above is thus derived from template 2 with O = en and \[\] = April; the feature f2 is derived from template 5 with O = pendant and \[\] = weeks. If there are IVEI total English words and IVy:I total French words, there are \]Vfl template-1 features, and IVEI. IVy l features of templates 2, 3, 4, and 5.</Paragraph> <Paragraph position="6"> Template 1 features give rise to constraints that enforce equality between the probability of any French translation y of in according to the model and the probability of that translation in the empirical distribution. 
Examples of such constraints are</Paragraph> <Paragraph position="8"> A maximum entropy model that uses only template 1 features predicts each French translation y with the probability ~(y) determined by the empirical data. This is exactly the distribution employed by the basic translation model.</Paragraph> <Paragraph position="9"> Since template 1 features are independent of x, the maximum entropy model that employs only constraints derived from template 1 features takes no account of contextual information in assigning a probability to y. When we include constraints derived from template 2 features, we take our first step towards a context-dependent model.</Paragraph> <Paragraph position="10"> Rather than simply constraining the expected probability of a French word y to equal its empirical probability, these constraints require that the expected joint probability of the English word immediately following in and the French rendering of in be equal to its empirical probability. An example of a template 2 constraint is p(y = pendant, e+l = several) = ~(y = pendant, e+\] = several) A maximum entropy model that incorporates this constraint will predict the translations of in in a manner consistent with whether or not the following word is several. In particular, if in the empirical sample the presence of several led to a greater probability for pendant, this will be reflected in a maximum entropy model incorporating this constraint. We have thus taken our first step toward context-sensitive translation modeling.</Paragraph> <Paragraph position="11"> Templates 3, 4, and 5 consider, each in a different way, various parts of the context.</Paragraph> <Paragraph position="12"> For instance, template 5 constraints allow us to model how an expert translator is biased by the appearance of a word somewhere in the three words following the word being translated. If house appears within the next three words (e.g., the phrases in the house and in the red house), then dans might be a more likely translation. On the other hand, if year appears within the same window of words (as in in the year 1941 or in that fateful year), then au cours de might be more likely. Together, the five constraint templates allow the model to condition its assignment of probabilities on a window of six words around e0, the word in question.</Paragraph> <Paragraph position="13"> We constructed a maximum entropy model Pin (ylx) by the iterative model-growing method described in Section 4. The automatic feature selection algorithm first selected a template 1 constraint for each of the translations of in seen in the sample (12 in Berger, Della Pietra, and Della Pietra A Maximum Entropy Approach Table 5 Maximum entropy model to predict French translation of in. Features shown here were the first features selected not from template 1. \[verb markerJ denotes a morphological marker inserted to indicate the presence of a verb as the next word.</Paragraph> <Paragraph position="14"> Featuref(x,y) ,,~AL($,f) L(p) y=~ and Canada E y=~ and House E y=en and the E y=pour and order E y=dans and speech C y=dans and area E y=de and increase C y=\[verb marker\] and my E y=dans and case E y=au cours de and year C all), thus constraining the model's expected probability of each of these translations to their empirical probabilities. The next few constraints selected by the algorithm are shown in Table 5. 
The first column gives the identity of the feature whose expected value is constrained; the second column gives ,-~AL($,f), the approximate increase in the model's log-likelihood on the data as a result of imposing this constraint; the third column gives L(p), the log-likelihood after adjoining the feature and recomputing the model.</Paragraph> <Paragraph position="15"> Let us consider the fifth row in the table. This constraint requires that the model's expected probability of dans, if one of the three words to the right of in is the word speech, is equal to that in the empirical sample. Before imposing this constraint on the model during the iterative model-growing process, the log-likelihood of the current model on the empirical sample was -2.8703 bits. The feature selection algorithm described in Section 4 calculated that if this constraint were imposed on the model, the log-likelihood would rise by approximately 0.019059 bits; since this value was higher than for any other constraint considered, the constraint was selected. After applying iterative scaling to recompute the parameters of the new model, the likelihood of the empirical sample rose to -2.8525 bits, an increase of 0.0178 bits.</Paragraph> <Paragraph position="16"> Table 6 lists the first few selected features for the model for translating the English word run. The &quot;Hansard flavor'--the rather specific domain of parliamentary discourse related to Canadian affairs--is easy to detect in many of the features in this table.</Paragraph> <Paragraph position="17"> It is not hard to incorporate the maximum entropy word translation models into a translation model P(FIE ) for a French sentence given an English sentence. We merely replace the simple context-independent models p(yI e) used in the basic translation model (33) with the more general context-dependent models pe(y\]x):</Paragraph> <Paragraph position="19"> where xaj denotes the context of the English word eaj.</Paragraph> <Paragraph position="20"> Figure 6 illustrates how using this improved translation model in the Candide system led to improved translations for the two sample sentences given earlier.</Paragraph> </Section> <Section position="3" start_page="59" end_page="63" type="sub_section"> <SectionTitle> 5.3 Segmentation </SectionTitle> <Paragraph position="0"> Though an ideal machine translation system could devour input sentences of unrestricted length, a typical stochastic system must cut the French sentences into polite lengths before digesting them. If the processing time is exponential in the length of the input passage (as is the case with the Candide system), then failing to split the French sentences into reasonably-sized segments would result in an exponential slowdown in translation.</Paragraph> <Paragraph position="1"> Thus, a common task in machine translation is to find safe positions at which Example of an unsafe segmentation. A word in the translated sentence (e3) is aligned to words (y3 and y4) in two different segments of the input sentence.</Paragraph> <Paragraph position="2"> to split input sentences in order to speed the translation process. &quot;Safe&quot; is a vague term; one might, for instance, reasonably define a safe segmentation as one which results in coherent blocks of words. 
For our purposes, however, a safe segmentation is dependent on the Viterbi alignment A between the input French sentence F and its English translation E.</Paragraph> <Paragraph position="3"> We define a rift as a position j in F such that for all k < j, ak <_ aj and for all k > j, ak > aj. In other words, the words to the left of the French word yj are generated by words to the left of the English word %, and the words to the right of yj are generated by words to the right of %. In the alignment of figure 4, for example, there are rifts at positions j = 1, 2, 4, 5 in the French sentence. One visual method of determining whether a rift occurs after the French word j is to try to trace a line from the last letter of yj up to the last letter of e~; if the line can be drawn without intersecting any alignment lines, position f is a rift.</Paragraph> <Paragraph position="4"> Using our definition of rifts, we can redefine a safe segmentation as one in which the segment boundaries are located only at rifts. Figure 7 illustrates an unsafe segmentation, in which a segment boundary (denoted by the II symbol) lies between a and mangd, where there is no rift. Figure 8, on the other hand, illustrates a safe segmentation.</Paragraph> <Paragraph position="5"> The reader will notice that a safe segmentation does not necessarily result in semantically coherent segments: mes and devoirs are certainly part of one logical unit, yet are separated in this safe segmentation. Once such a safe segmentation has been applied to the French sentence, we can make the assumption while searching for the appropriate English translation that no word in the translated English sentence will have to account for French words located in multiple segments. Disallowing inter- null (x,y) for sentence segmentation.</Paragraph> <Paragraph position="6"> segment alignments dramatically reduces the scale of the computation involved in generating a translation, particularly for large sentences. We can consider each segment sequentially while generating the translation, working from left to right in the French sentence.</Paragraph> <Paragraph position="7"> We now describe a maximum entropy model that assigns to each location in a French sentence a score that is a measure of the safety in cutting the sentence at that location. We begin as in the word translation problem, with a training sample of English-French sentence pairs (E, F) randomly extracted from the Hansard corpus. For each sentence pair we use the basic translation model to compute the Viterbi alignment between E and F. We also use a stochastic part-of-speech tagger as described in Merialdo (1990) to label each word in F with its part of speech. For each position j in F we then construct a (x, y) training event. The value y is rift if a rift belongs at position j and is no-rift otherwise. The context information x is reminiscent of that employed in the word translation application described earlier. It includes a six-word window of French words: three to the left of yj and three to the right of yj. It also includes the part-of-speech tags for these words, and the classes of these words as derived from a mutual-information clustering scheme described in Brown et al. (1990). The complete (x, y) pair is illustrated in Figure 9.</Paragraph> <Paragraph position="8"> In creating p(riftlx), we are (at least in principle) modeling the decisions of an expert French segmenter. 
We have a sample of his work in the training sample ~(x, y), and we measure the worth of a model by the log-likelihood L~(p). During the iterative model-growing procedure, the algorithm selects constraints on the basis of how much they increase this objective function. As the algorithm proceeds, more and more constraints are imposed on the model p, bringing it into ever-stricter compliance with the empirical data ~(x,y). This is useful to a point; insofar as the empirical data embodies the expert knowledge of the French segmenter, we would like to incorporate this knowledge into a model. But the data contains only so much expert knowledge; the algorithm should terminate when it has extracted this knowledge. Otherwise, the model p(ylx) will begin to fit itself to quirks in the empirical data.</Paragraph> <Paragraph position="9"> A standard approach in statistical modeling, to avoid the problem of overfitting the training data, is to employ cross-validation techniques. Separate the training data \]~(x, y) into a training portion, pr, and a withheld portion, \]~h. Use only \]gr in the model-growing process; that is, select features based on how much they increase the likelihood L~r (p). As the algorithm progresses, L~, (p) thus increases monotonically. As long as each new constraint imposed allows p to better account for the random process that generated both Pr and Ph, the quantity L~h(p ) also increases. At the point when overfitting begins, however, the new constraints no longer help p model the random process, but instead require p to model the noise in the sample Pr itself. At this point, L~r (p) continues to rise, but L~h (p) no longer does. It is at this point that the algorithm should terminate.</Paragraph> <Paragraph position="10"> Figure 10 illustrates the change in log-likelihood of training data L~r(p ) and withheld data L~h (p). Had the algorithm terminated when the log-likelihood of the withheld data stopped increasing, the final model p would contain slightly less than 40 features. We have employed this segmenting model as a component in a French-English machine translation system in the following manner: The model assigns to each position in the French sentence a score, p(r+-ft I x), which is a measure of how appropriate a split would be at that location. A dynamic programming algorithm then selects, given the &quot;appropriateness&quot; score at each position and the requirement that no segment may contain more than 10 words, an optimal (or, at least, reasonable) splitting of the sentence. Figure 11 shows the system's segmentation of four sentences selected at random from the Hansard data. We remind the reader to keep in mind when evaluating Figure 11 that the segmenter's task is not to produce logically coherent blocks of words, but to divide the sentence into blocks which can be translated sequentially from left to right.</Paragraph> </Section> <Section position="4" start_page="63" end_page="64" type="sub_section"> <SectionTitle> 5.4 Word Reordering </SectionTitle> <Paragraph position="0"> Translating a French sentence into English involves not only selecting appropriate English renderings of the words in the French sentence, but also selecting an ordering for the English words. This order is often very different from the French word order.</Paragraph> <Paragraph position="1"> One way Candide captures word-order differences in the two languages is to allow for alignments with crossing lines. 
In addition, Candide performs, during a preprocessing stage, a reordering step that shuffles the words in the input French sentence into an order more closely resembling English word order.</Paragraph> <Paragraph position="2"> One component of this word reordering step deals with French phrases which have the NOUN de NOUN form. For some NOUN de NOUN phrases, the best English translation is nearly word for word: conflit d'intOr~t, for example, is almost always rendered as conflict of interest. For other phrases, however, the best translation is obtained by interchanging the two nouns and dropping the de. The French phrase taux d'int&Ot, for example, is best rendered as interest rate. Table 7 gives several examples of NOUN de NOUN phrases together with their most appropriate English translations.</Paragraph> <Paragraph position="3"> In this section we describe a maximum entropy model that, given a French NOUN de NOUN phrase, estimates the probability that the best English translation involves an interchange of the two nouns. We begin with a sample of English-French sentence pairs (E, F) randomly extracted from the Hansard corpus, such that F contains a NOUN de NOUN phrase. For each sentence pair we use the basic translation model to compute the Viterbi alignment ,~ between the words in E and F. Using A we construct an (x, y) training event as follows: We let the context x be the pair of French nouns (NOUNL, NOUNR). We let y be no-interchange if the English translation is a word-for-word translation of the French phrase and y = interchange if the order of the nouns in the English and French phrases are interchanged.</Paragraph> <Paragraph position="4"> We define candidate features based upon the template features shown in Table 8.</Paragraph> </Section> <Section position="5" start_page="64" end_page="64" type="sub_section"> <SectionTitle> Word-for-word Phrases </SectionTitle> <Paragraph position="0"> somme d'argent sum of money pays d'origin country of origin question de privilege question of privilege conflit d'intdrOt conflict of interest</Paragraph> </Section> <Section position="6" start_page="64" end_page="65" type="sub_section"> <SectionTitle> Interchanged Phrases </SectionTitle> <Paragraph position="0"> bureau de poste post office taux d'int&Ot interest rate compagnie d'assurance insurance company gardien de prison prison guard In this table, the symbol ~ is a placeholder for either interchange or no-interchange and the symbols \[\]1 and 02 are placeholders for possible French words. If there are 1~71 total French words, there are 2IVy- I possible features of templates 1 and 2 and 2Ddy, I 2 features of template 3.</Paragraph> <Paragraph position="1"> Template 1 features consider only the left noun. We expect these features to be relevant when the decision of whether to interchange the nouns is influenced by the identity of the left noun. For example, including the template 1 feature</Paragraph> <Paragraph position="3"> gives the model sensitivity to the fact that the nouns in French NOUN de NOUN phrases beginning with systOme (such as syst~me de surveillance.and systOme de quota) are more likely to be interchanged in the English translation. 
Similarly, including the template 1 feature

f(x, y) = 1 if y = no-interchange and NOUN_L = mois; 0 otherwise

gives the model sensitivity to the fact that French NOUN de NOUN phrases beginning with mois, such as mois de mai (month of May), are more likely to be translated word for word.</Paragraph> <Paragraph position="4"> Template 3 features are useful in dealing with translating NOUN de NOUN phrases in which the interchange decision is influenced by both nouns. For example, NOUN de NOUN phrases ending in intérêt are sometimes translated word for word, as in conflit d'intérêt (conflict of interest) and are sometimes interchanged, as in taux d'intérêt (interest rate).</Paragraph> <Paragraph position="5"> We used the feature-selection algorithm of Section 4 to construct a maximum entropy model from candidate features derived from templates 1, 2, and 3. The model was grown on 10,000 training events randomly selected from the Hansard corpus. The final model contained 358 constraints.</Paragraph> <Paragraph position="6"> To test the model, we constructed a NOUN de NOUN word-reordering module which interchanges the order of the nouns if p(interchange|x) > 0.5 and keeps the order the same otherwise. Table 9 compares performance on a suite of test data against a baseline NOUN de NOUN reordering module that never swaps the word order.</Paragraph> <Paragraph position="7"> Predictions of the NOUN de NOUN interchange model on phrases selected from a corpus unseen during the training process.</Paragraph> <Paragraph position="8"> Table 12 shows some randomly chosen NOUN de NOUN phrases extracted from this test suite along with p(interchange|x), the probability assigned by the model to inversion. On the right are phrases such as saison d'hiver for which the model strongly predicted an inversion. On the left are phrases the model strongly prefers not to interchange, such as somme d'argent, abus de privilège and chambre de commerce. Perhaps most intriguing are those phrases that lie in the middle, such as taux d'inflation, which can translate either to inflation rate or rate of inflation.</Paragraph> </Section> </Section> </Paper>