<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1095"> <Title>Translating with non-contiguous phrases</Title> <Section position="3" start_page="0" end_page="756" type="metho"> <SectionTitle> 2 Non-contiguous phrases </SectionTitle> <Paragraph position="0"> Why should it be beneficial to use phrases composed of possibly non-contiguous sequences of words? In doing so we expect to improve translation quality by better accounting for additional linguistic phenomena, as well as by extending the effect of contextual semantic disambiguation and example-based translation inherent in phrase-based MT. An example of a phenomenon best described using non-contiguous units is provided by English phrasal verbs. Consider the sentence "Mary switches her table lamp off". Word-based statistical models would be at a loss when selecting the appropriate translation of the verb. If French were the target language, for instance, corpus evidence would come both from examples in which "switch" is translated as "allumer" (to switch on) and from examples in which it is translated as "éteindre" (to switch off). If many-to-one word alignments are not allowed from English to French, as is usually the case, then the best a word-based model could do here would be to align "off" to the empty word and hope to select the correct translation from "switch" alone, basically a 50-50 bet. While they handle inseparable phrasal verbs such as "to run out" correctly, previously proposed phrase-based models would be helpless in this case. German separable verbs display a comparable behavior. Moreover, non-contiguous linguistic units are not limited to verbs. In French, negation is formed by inserting the words "ne" and "pas" before and after the verb, respectively. The sentence "Pierre ne mange pas" and its English translation therefore display a complex word-level alignment (Figure 1) that current models cannot account for.</Paragraph> <Paragraph position="1"> Flexible idioms, which allow for the insertion of linguistic material, are another phenomenon best modeled with non-contiguous units.</Paragraph> <Section position="1" start_page="755" end_page="756" type="sub_section"> <SectionTitle> 2.1 Definition and library construction </SectionTitle> <Paragraph position="0"> We define a bi-phrase as a pair comprising a source phrase and a target phrase: $b = \langle \tilde{s}, \tilde{t} \rangle$. Each of the source and target phrases is a sequence of words and gaps (indicated by the symbol $\diamond$); each gap acts as a placeholder for exactly one unspecified word. For example, $\tilde{w} = w_1 w_2 \diamond w_3 \diamond \diamond w_4$ is a phrase of length 7, made up of two contiguous words $w_1$ and $w_2$, a first gap, a third word $w_3$, two consecutive gaps, and a final word $w_4$. To avoid redundancy, phrases may not begin or end with a gap. If a phrase does not contain any gaps, we say it is contiguous; otherwise it is non-contiguous. Likewise, a bi-phrase is said to be contiguous if both its phrases are contiguous.</Paragraph>
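To make the representation concrete, here is a minimal sketch in Python of the phrase and bi-phrase encoding just defined, assuming that a gap is represented by None; the names and helper functions are illustrative, not taken from the paper.

```python
# Minimal sketch of the phrase / bi-phrase representation described above.
# A gap (the diamond placeholder for exactly one unspecified word) is
# encoded as None; all names here are illustrative.

GAP = None

def is_valid_phrase(phrase):
    """Phrases may not be empty and may not begin or end with a gap."""
    return len(phrase) > 0 and phrase[0] is not GAP and phrase[-1] is not GAP

def is_contiguous(phrase):
    """A phrase is contiguous iff it contains no gaps."""
    return GAP not in phrase

def bi_phrase_is_contiguous(bi_phrase):
    """A bi-phrase is contiguous only if both of its phrases are."""
    return all(is_contiguous(p) for p in bi_phrase)

# Example: w~ = w1 w2 <gap> w3 <gap> <gap> w4, a phrase of length 7.
w = ["w1", "w2", GAP, "w3", GAP, GAP, "w4"]
assert is_valid_phrase(w) and not is_contiguous(w)

# A bi-phrase pairs a source phrase with a target phrase.
bi_phrase = (["switch", GAP, "off"], ["éteindre"])
assert not bi_phrase_is_contiguous(bi_phrase)
```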
<Paragraph position="1"> The translation of a source sentence s is produced by combining bi-phrases so as to cover the source sentence and produce a well-formed target-language sentence (i.e. one without gaps). A complete translation for s can be described as an ordered sequence of bi-phrases $b_1 \ldots b_K$. When piecing together the final translation, the target-language portion $\tilde{t}_1$ of the first bi-phrase $b_1$ is laid down first; each subsequent $\tilde{t}_k$ is then positioned at the first "free" position in the target-language sentence, i.e. either the leftmost gap or the right end of the sequence. Figure 2 illustrates this process with an example.</Paragraph>
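The placement rule just described (each target phrase goes to the leftmost gap, or to the right end if no gap remains) can be sketched as follows; this is one plausible reading of the process illustrated in Figure 2, reusing the None-as-gap encoding from the previous sketch, and the dovetailing example at the end is illustrative rather than the paper's exact figure.

```python
GAP = None  # gap placeholder, as in the previous sketch

def place_target_phrase(partial, t_k):
    """Overlay target phrase t_k onto the partial translation, anchoring its
    first token at the first 'free' position (the leftmost gap, or the right
    end if there is no gap). Where t_k has a word, the position must be free
    and receives that word; where t_k has a gap, existing material is kept."""
    target = list(partial)
    start = target.index(GAP) if GAP in target else len(target)
    for offset, token in enumerate(t_k):
        pos = start + offset
        while pos >= len(target):
            target.append(GAP)      # grow the sequence as needed
        if token is GAP:
            continue                # keep whatever is already there
        assert target[pos] is GAP, "words may only fill free positions"
        target[pos] = token
    return target

# Dovetailing example in the spirit of Figure 2 (illustrative only):
step1 = place_target_phrase([], ["do", GAP, "want"])
# step1 == ["do", None, "want"]
step2 = place_target_phrase(step1, ["not", GAP, GAP, GAP, "anymore"])
# step2 == ["do", "not", "want", None, None, "anymore"]
step3 = place_target_phrase(step2, ["to", "work"])
# step3 == ["do", "not", "want", "to", "work", "anymore"]
```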
<Paragraph position="2"> To produce translations, our approach therefore relies on a collection of bi-phrases, which we call a bi-phrase library. Such a library is constructed from a corpus of existing translations, aligned at the word level.</Paragraph> <Paragraph position="3"> Two strategies come to mind to produce non-contiguous bi-phrases for these libraries. The first is to align the words using a "standard" word-alignment technique, such as the Refined Method described in (Och and Ney, 2003) (the intersection of two IBM Viterbi alignments, forward and reverse, enriched with alignments from the union), and then to generate bi-phrases by combining individual alignments that co-occur within the same sentence pair. This is the strategy usually adopted in other phrase-based MT approaches (Zens and Ney, 2003; Och and Ney, 2004). Here, the difference is that we are not restricted to combinations that produce strictly contiguous bi-phrases.</Paragraph> <Paragraph position="4"> The second strategy is to rely on a word-alignment method that naturally produces many-to-many alignments between non-contiguous words, such as the method described in (Goutte et al., 2004). By means of a matrix factorization, this method produces a parallel partition of the two texts, seen as sets of word tokens. Each token therefore belongs to one, and only one, subset within this partition, and corresponding subsets in the source and target make up what are called cepts. For example, in Figure 1, these cepts are represented by the circles numbered 1, 2 and 3; each cept thus connects word tokens in the source and the target, regardless of position or contiguity. These cepts naturally constitute bi-phrases, and can be used directly to produce a bi-phrase library.</Paragraph> <Paragraph position="5"> Obviously, the two strategies can be combined, and it is always possible to produce increasingly large and complex bi-phrases by combining co-occurring bi-phrases, contiguous or not. One problem with this approach, however, is that the resulting libraries can become very large. With contiguous phrases, the number of bi-phrases that can be extracted from a single pair of sentences typically grows quadratically with the size of the sentences; with non-contiguous phrases, this growth is exponential. As it turns out, the number of available bi-phrases for the translation of a sentence has a direct impact on the time required to compute the translation; we will therefore typically rely on various filtering techniques aimed at keeping only those bi-phrases that are most likely to be useful. For example, we may retain only the most frequently observed bi-phrases, or impose limits on the number of cepts, the size of gaps, etc.</Paragraph> </Section> </Section>
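As a rough illustration of the second strategy and of the need for filtering, the sketch below turns the cepts of one word-aligned sentence pair directly into bi-phrases and also combines co-occurring cepts, subject to simple limits on the number of cepts and the number of gaps; both the representation and the particular limits are assumptions made for the example, not the paper's exact procedure, and the toy alignment is only indicative of the kind shown in Figure 1.

```python
from itertools import combinations

GAP = None

def cept_to_phrase(positions, tokens):
    """Turn a set of token positions into a phrase: words at those positions,
    with one gap per skipped position in between (no leading/trailing gaps)."""
    pos = sorted(positions)
    phrase = [tokens[pos[0]]]
    for prev, cur in zip(pos, pos[1:]):
        phrase.extend([GAP] * (cur - prev - 1))
        phrase.append(tokens[cur])
    return phrase

def bi_phrases_from_cepts(cepts, src_tokens, tgt_tokens,
                          max_cepts=2, max_gaps=2):
    """Each cept (a pair of source/target position sets) yields a bi-phrase;
    co-occurring cepts are also combined, subject to simple filtering limits."""
    out = []
    for k in range(1, max_cepts + 1):
        for group in combinations(cepts, k):
            src = set().union(*(s for s, _ in group))
            tgt = set().union(*(t for _, t in group))
            sp = cept_to_phrase(src, src_tokens)
            tp = cept_to_phrase(tgt, tgt_tokens)
            if sp.count(GAP) <= max_gaps and tp.count(GAP) <= max_gaps:
                out.append((sp, tp))
    return out

# Toy sentence pair with an illustrative cept alignment:
src = "Pierre ne mange pas".split()
tgt = "Pierre does not eat".split()
cepts = [({0}, {0}),        # Pierre      <-> Pierre
         ({1, 3}, {2}),     # ne ... pas  <-> not
         ({2}, {1, 3})]     # mange       <-> does ... eat
for sp, tp in bi_phrases_from_cepts(cepts, src, tgt):
    print(sp, "->", tp)
```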
<Section position="4" start_page="756" end_page="757" type="metho"> <SectionTitle> 3 The Model </SectionTitle> <Paragraph position="0"> In statistical machine translation, we are given a source-language input $s_1^J = s_1 \ldots s_J$, and seek the target-language sentence $t_1^I = t_1 \ldots t_I$ that is its most probable translation:
$$\hat{t}_1^I = \operatorname*{argmax}_{t_1^I} \Pr(t_1^I \mid s_1^J) \quad (1)$$
Our approach is based on a direct approximation of the posterior probability $\Pr(t_1^I \mid s_1^J)$, using a log-linear model:
$$\Pr(t_1^I \mid s_1^J) = \frac{1}{Z_{s_1^J}} \exp\left( \sum_m \lambda_m h_m(t_1^I, s_1^J) \right)$$
In such a model, the contribution of each feature function $h_m$ is determined by the corresponding model parameter $\lambda_m$; $Z_{s_1^J}$ denotes a normalization constant. This type of model is now quite widely used for machine translation (Tillmann and Xia, 2003; Zens and Ney, 2003). (A related line of work addresses similar concerns to those motivating ours by introducing a synchronous CFG over bi-phrases; while SCFGs allow better control over the order of the material inserted in the gaps, gap size does not seem to be taken into account, and phrase dovetailing such as that involving "do $\diamond$ want" and "not $\diamond \diamond \diamond$ anymore" in Figure 2 is disallowed.)</Paragraph> <Paragraph position="5"> Additional variables can be introduced in such a model, so as to account for hidden characteristics, and the feature functions can be extended accordingly. For example, our model must take into account the actual set of bi-phrases used to produce the translation:
$$\Pr(t_1^I, b_1^K \mid s_1^J) = \frac{1}{Z_{s_1^J}} \exp\left( \sum_m \lambda_m h_m(t_1^I, b_1^K, s_1^J) \right)$$
Our model currently relies on seven feature functions, which we describe here.</Paragraph> <Paragraph position="8"> * The bi-phrase feature function $h_{bp}$: it represents the probability of producing $t_1^I$ using some set of bi-phrases, under the assumption that each source phrase produces a target phrase independently of the others:
$$h_{bp}(t_1^I, b_1^K, s_1^J) = \log \prod_{k=1}^{K} \Pr(\tilde{t}_k \mid \tilde{s}_k) \quad (2)$$
Individual bi-phrase probabilities $\Pr(\tilde{t}_k \mid \tilde{s}_k)$ are estimated from occurrence counts in the word-aligned training corpus.</Paragraph> <Paragraph position="11"> * A compositional bi-phrase feature function: it compensates for $h_{bp}$'s strong tendency to overestimate the probability of rare bi-phrases; it is computed as in equation (2), except that bi-phrase probabilities are themselves computed from individual word translation probabilities, somewhat as in IBM model 1.</Paragraph> <Paragraph position="13"> * The target-language feature function $h_{tl}$: this is based on an N-gram language model of the target language. As such, it ignores the source-language sentence and the decomposition of the target into bi-phrases, and focuses on the actual sequence of target-language words produced by the combination of bi-phrases:
$$h_{tl}(t_1^I, b_1^K, s_1^J) = \log \prod_{i=1}^{I} \Pr(t_i \mid t_{i-N+1} \ldots t_{i-1})$$
</Paragraph> <Paragraph position="15"> * The word-count and bi-phrase-count feature functions $h_{wc}$ and $h_{bc}$: these control the length of the translation and the number of bi-phrases used to produce it, taking as values the number of target-language words $I$ and the number of bi-phrases $K$, respectively. * A distortion feature function: it accounts for the amount of reordering between bi-phrases of the source and target sentences.</Paragraph> <Paragraph position="16"> * The gap-count feature function $h_{gc}$: it takes as value the total number of gaps (source and target) within the bi-phrases of $b_1^K$, thus allowing the model some control over the nature of the bi-phrases it uses, in terms of the discontiguities they contain.</Paragraph> </Section>
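For concreteness, here is a minimal sketch of how a log-linear model of this kind scores one candidate translation given its bi-phrase decomposition; the feature set is a simplified stand-in for the seven features above (the compositional and distortion features are omitted), and the dummy probability models are assumptions for illustration.

```python
import math

def log_linear_score(lambdas, features):
    """Unnormalized log-linear score: sum_m lambda_m * h_m(t, b, s).
    The normalization constant Z_s is the same for all candidate translations
    of one source sentence, so it can be ignored when comparing them."""
    return sum(lambdas[name] * value for name, value in features.items())

def feature_values(src, tgt, bi_phrases, bp_logprob, lm_logprob):
    """Illustrative versions of some of the feature functions described above.
    src is unused here; a real model would also have source-conditioned features."""
    return {
        "bp": sum(bp_logprob(s, t) for s, t in bi_phrases),   # eq. (2)
        "tl": lm_logprob(tgt),                                # N-gram LM
        "wc": len(tgt),                                       # word count
        "bc": len(bi_phrases),                                # bi-phrase count
        "gc": sum(p.count(None) for s, t in bi_phrases for p in (s, t)),
    }

# Toy usage with dummy probability models:
lambdas = {"bp": 1.0, "tl": 0.5, "wc": -0.1, "bc": -0.2, "gc": -0.3}
score = log_linear_score(
    lambdas,
    feature_values(
        src="Pierre ne mange pas".split(),
        tgt="Pierre does not eat".split(),
        bi_phrases=[(["Pierre"], ["Pierre"]),
                    (["ne", None, "pas"], ["not"]),
                    (["mange"], ["does", None, "eat"])],
        bp_logprob=lambda s, t: math.log(0.5),
        lm_logprob=lambda t: -4.0 * len(t),
    ),
)
```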
<Section position="5" start_page="757" end_page="758" type="metho"> <SectionTitle> 4 Parameter Estimation </SectionTitle> <Paragraph position="0"> The values of the $\lambda$ parameters of the log-linear model can be set so as to optimize a given criterion. For instance, one can maximize the likelihood of some set of training sentences. Instead, as suggested by Och (2003), we chose to directly maximize the quality of the translations produced by the system, as measured by a machine translation evaluation metric.</Paragraph> <Paragraph position="1"> Say we have a set of source-language sentences S. For a given value of $\lambda$, we can compute the set of corresponding target-language translations T. Given a set of reference ("gold-standard") translations R for S and a function E(T,R) which measures the "error" in T relative to R (for the sake of simplicity, we consider a single reference translation per source sentence, but the argument can easily be extended to multiple references), we can formulate the parameter estimation problem as:
$$\hat{\lambda} = \operatorname*{argmin}_{\lambda} E(T(\lambda), R)$$
</Paragraph> <Paragraph position="3"> As pointed out by Och, one notable difficulty with this approach is that, because the computation of T is based on an argmax operation (see eq. 1), it is not continuous with respect to $\lambda$, and standard gradient-descent methods cannot be used to solve the optimization. Och proposes two workarounds for this problem: the first relies on a direct optimization method derived from Powell's algorithm; the second introduces a smoothed (continuous) version of the error function E(T,R) and then relies on a gradient-based optimization method.</Paragraph> <Paragraph position="4"> We have opted for the latter approach. Och shows how to implement it when the error function can be computed as the sum of errors on individual sentences. Unfortunately, this is not the case for such widely used MT evaluation metrics as BLEU (Papineni et al., 2002) and NIST (Doddington, 2002).</Paragraph> <Paragraph position="5"> We show here how it can be done for NIST; a similar derivation is possible for BLEU.</Paragraph> <Paragraph position="6"> The NIST evaluation metric computes a weighted n-gram precision between T and R, multiplied by a factor B(S,T,R) that penalizes short translations.</Paragraph> <Paragraph position="7"> It can be formulated as:
$$\mathrm{NIST}(T, R) = B(S, T, R) \times \sum_{n=1}^{N} \frac{\sum_{s \in S} I_n(t_s, r_s)}{\sum_{s \in S} C_n(t_s)} \quad (3)$$
where N is the largest n-gram length considered (usually N = 4), $I_n(t_s, r_s)$ is a weighted count of common n-grams between the target ($t_s$) and reference ($r_s$) translations of sentence s, and $C_n(t_s)$ is the total number of n-grams in $t_s$.</Paragraph> <Paragraph position="8"> To derive a version of this formula that is a continuous function of $\lambda$, we need multiple translations $t_{s,1}, \ldots, t_{s,K}$ for each source sentence s. The general idea is to weight each of these translations by a factor $w(\lambda, s, k)$, proportional to the score $m_\lambda(t_{s,k} \mid s)$ that $t_{s,k}$ is assigned by the log-linear model for a given $\lambda$:
$$w(\lambda, s, k) = \frac{m_\lambda(t_{s,k} \mid s)^{\alpha}}{\sum_{k'} m_\lambda(t_{s,k'} \mid s)^{\alpha}}$$
where $\alpha$ is the smoothing factor. Thus, in the smoothed version of the NIST function, the term $I_n(t_s, r_s)$ in equation (3) is replaced by $\sum_k w(\lambda,s,k)\, I_n(t_{s,k}, r_s)$, and the term $C_n(t_s)$ is replaced by $\sum_k w(\lambda,s,k)\, C_n(t_{s,k})$. As for the brevity penalty factor B(S,T,R), it depends on the total length of the translation T, i.e. $\sum_s |t_s|$. In the smoothed version, this term is replaced by $\sum_s \sum_k w(\lambda,s,k)\, |t_{s,k}|$. Note that, when $\alpha \to \infty$, $w(\lambda,s,k) \to 0$ for all translations of s except the one to which the model gives the highest score, so the smoothed and original NIST functions produce the same value. In practice, we determine a "good" value for $\alpha$ by trial and error (5 works fine). We thus obtain a scoring function whose derivative with respect to $\lambda$ can be computed, and which can be optimized using gradient-based methods. In practice, we use the OPT++ implementation of a quasi-Newton optimization (Meza, 1994). As observed by Och, the smoothed error function is not convex, and this sort of minimum-error-rate training is therefore quite sensitive to the initialization values of the $\lambda$ parameters. Our approach is to use a random set of initializations for the parameters, perform the optimization for each initialization, and select the model which gives the best overall performance.</Paragraph> <Paragraph position="12"> Globally, parameter estimation proceeds along these steps: 1. Initialize the training set: using random parameter values $\lambda_0$, we compute multiple translations for each source sentence of some given set of sentences S. (In practice, we use the M-best translations produced by our decoder; see Section 5.)</Paragraph> <Paragraph position="13"> 2. Optimize the parameters: using the method described above, we find the $\lambda$ that produces the best smoothed NIST score on the training set.</Paragraph> <Paragraph position="14"> 3. Iterate: we then re-translate the sentences of S with this new $\lambda$, combine the resulting multiple translations with those already in the training set, and go back to step 2.</Paragraph> <Paragraph position="15"> Steps 2 and 3 can be repeated until the smoothed NIST score no longer increases.</Paragraph> </Section>
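The smoothing construction lends itself to a short sketch: the functions below compute the weights w(lambda, s, k) and the resulting smoothed NIST score from per-sentence n-gram statistics, in the spirit of the replacements just described; the positive model scores, the data layout and the simplified brevity penalty are assumptions, not the paper's implementation.

```python
import math

def smoothing_weights(model_scores, alpha):
    """w(lambda, s, k): normalized weights over the K candidate translations of
    one source sentence (positive model scores assumed); sharper as alpha grows,
    approaching the argmax in the limit."""
    powered = [m ** alpha for m in model_scores]
    z = sum(powered)
    return [p / z for p in powered]

def brevity_penalty(sys_len, ref_len):
    # Placeholder for B(S, T, R); the exact NIST penalty is more involved.
    return min(1.0, math.exp(1.0 - ref_len / max(sys_len, 1e-9)))

def smoothed_nist(all_scores, all_I, all_C, all_len, ref_total_len, N, alpha):
    """Smoothed NIST over a set S of source sentences.

    all_scores[s][k] : model score of the k-th candidate for sentence s
    all_I[s][k][n]   : weighted common n-gram count I_n(t_{s,k}, r_s)
    all_C[s][k][n]   : n-gram count C_n(t_{s,k})
    all_len[s][k]    : |t_{s,k}|, used for the brevity penalty
    """
    num = [0.0] * N
    den = [0.0] * N
    total_sys_len = 0.0
    for scores, Is, Cs, lens in zip(all_scores, all_I, all_C, all_len):
        w = smoothing_weights(scores, alpha)
        for n in range(N):
            num[n] += sum(w_k * Is[k][n] for k, w_k in enumerate(w))
            den[n] += sum(w_k * Cs[k][n] for k, w_k in enumerate(w))
        total_sys_len += sum(w_k * lens[k] for k, w_k in enumerate(w))
    precision = sum(num[n] / den[n] for n in range(N))
    return precision * brevity_penalty(total_sys_len, ref_total_len)
```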
<Section position="6" start_page="758" end_page="758" type="metho"> <SectionTitle> 5 Decoder </SectionTitle> <Paragraph position="0"> We implemented a version of the beam-search stack decoder described in (Koehn, 2003), extended to cope with non-contiguous phrases. Each translation is the result of a sequence of decisions, each of which involves the selection of a bi-phrase and of a target position. The final result is obtained by combining decisions, as in Figure 2. Hypotheses, corresponding to partial translations, are organized in a sequence of priority stacks, one for each number of source words covered. Hypotheses are extended by filling the first available uncovered position in the target sentence; each extended hypothesis is then inserted in the stack corresponding to the updated number of covered source words. Each hypothesis is assigned a score obtained by combining the actual feature-function values with admissible heuristics, adapted to deal with gaps in phrases, that estimate the future cost of completing the translation. Each stack undergoes both threshold and histogram pruning. Whenever two hypotheses are indistinguishable with respect to their potential for further extension, they are merged and only the highest-scoring one is further extended. Complete translations are eventually recovered in the "last" priority stack, i.e. the one corresponding to the total number of source words: the best translation is the highest-scoring one that has no remaining gaps in the target.</Paragraph> </Section> </Paper>
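To close, here is a highly simplified skeleton of a stack-based beam-search decoder of the kind described in Section 5, with one priority stack per number of covered source words and both threshold and histogram pruning; the hypothesis fields and the bi_phrase_matches and extend hooks are assumptions for the sketch, and hypothesis recombination and future-cost heuristics are left out.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Hypothesis:
    score: float                                  # model score (+ heuristic, in a full decoder)
    covered: frozenset = field(compare=False)     # covered source positions
    target: tuple = field(compare=False)          # partial target sentence, None = gap

def decode(src_len, bi_phrase_matches, extend, beam_size=100, threshold=1.0):
    """bi_phrase_matches(hyp) yields applicable bi-phrases; extend(hyp, bp)
    returns the extended hypothesis (placing the target phrase at the first
    free position, as in Section 2.1) or None if the bi-phrase does not apply."""
    stacks = [[] for _ in range(src_len + 1)]     # one stack per #covered source words
    stacks[0].append(Hypothesis(0.0, frozenset(), ()))
    for n in range(src_len):
        # threshold pruning followed by histogram pruning of the current stack
        best = max((h.score for h in stacks[n]), default=float("-inf"))
        kept = [h for h in stacks[n] if h.score >= best - threshold]
        kept = heapq.nlargest(beam_size, kept)
        for hyp in kept:
            for bp in bi_phrase_matches(hyp):
                new = extend(hyp, bp)
                if new is not None:
                    stacks[len(new.covered)].append(new)
    # complete translations: all source words covered, no gaps left in the target
    complete = [h for h in stacks[src_len] if None not in h.target]
    return max(complete, key=lambda h: h.score, default=None)
```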