<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1096"> <Title>An End-to-End Discriminative Approach to Machine Translation</Title> <Section position="5" start_page="761" end_page="761" type="metho"> <SectionTitle> 3 Dataset </SectionTitle> <Paragraph position="0"> Our experiments were done on the French-English portion of the Europarl corpus (Koehn, 2002), which consists of European parliamentary proceedings from 1996 to 2003.</Paragraph> <Paragraph position="1"> [Table 1: Various statistics on length 5-15 sentences. The number of French word tokens is given, along with the number that were not seen among the 414K total sentences in TRAIN (which includes all lengths).]</Paragraph> <Paragraph position="2"> We split the data into three sets according to Table 1. TRAIN served two purposes: it was used to construct the features, and its length 5-15 sentences were used for tuning the parameters of those features. DEV, which consisted of the first 1K length 5-15 sentences in 2002, was used to evaluate the performance of the system as we developed it. Note that the DEV set was not used to tune any parameters; tuning was done exclusively on TRAIN. At the end, we ran our models once on TEST to get final numbers.2</Paragraph> <Paragraph position="3"> [Footnote 2: We also experimented with several combinations of jackknifing to prevent overfitting, in which we selected features on TRAIN-OLD (the 1996-1998 Europarl corpus) and tuned the parameters on TRAIN, or vice versa. However, using TRAIN-OLD turned out to be suboptimal, since that data is less relevant to DEV. Another alternative is to combine TRAIN-OLD and TRAIN into one dual-purpose dataset. The differences between this and our current approach were inconclusive.]</Paragraph> </Section> <Section position="6" start_page="761" end_page="763" type="metho"> <SectionTitle> 4 Models </SectionTitle> <Paragraph position="0"> Our experiments used phrase-based models (Koehn et al., 2003), which require a translation table and a language model for decoding and feature computation. To facilitate comparison with previous work, we created the translation tables using the same techniques as Koehn et al. (2003).3 The language model was a Kneser-Ney interpolated trigram model generated using the SRILM toolkit (Stolcke, 2002). We built our own phrase-based beam decoder that can handle arbitrary features.4 The contributions of features are incrementally added into the score as decoding proceeds.</Paragraph> <Paragraph position="1"> We experimented with two levels of distortion: monotonic, where the phrasal alignment is monotonic (but word reordering is still possible within a phrase), and limited distortion, where only adjacent phrases are allowed to exchange positions (Zens and Ney, 2004). In the future, we plan to explore our discriminative framework on a full distortion model (Koehn et al., 2003) or even a hierarchical model (Chiang, 2005).</Paragraph> <Paragraph position="2"> Throughout the following experiments, we trained the perceptron algorithm for 10 iterations. The weights were initialized to 1 on the translation table, 1 on the language model (the blanket features in Section 6), and 0 elsewhere.</Paragraph>
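A minimal sketch of the training loop just described, assuming a decoder interface, a sparse feature map, and an update-strategy hook; all names here (decoder.best, decoder.choose_target, features) are illustrative, not the authors' implementation:

from collections import defaultdict

def train_perceptron(examples, decoder, features, num_iters=10):
    # Weights start at 1 on the two blanket features and 0 elsewhere.
    w = defaultdict(float)
    w["tm_log_prob"] = 1.0  # translation table score
    w["lm_log_prob"] = 1.0  # language model score

    for _ in range(num_iters):
        for x, y_ref in examples:
            # Model's current best output and hidden correspondence.
            y_hat, h_hat = decoder.best(x, w)
            # Update target (y_t, h_t) chosen by the strategy of Section 5;
            # None means the example is skipped (e.g. unreachable reference).
            target = decoder.choose_target(x, y_ref, w)
            if target is None:
                continue
            y_t, h_t = target
            # Perceptron update: w += Phi(x, y_t, h_t) - Phi(x, y_hat, h_hat).
            for f, v in features(x, y_t, h_t).items():
                w[f] += v
            for f, v in features(x, y_hat, h_hat).items():
                w[f] -= v
    return w

The subtraction mirrors the usual perceptron rule: features present in the target but absent from the prediction gain weight, and vice versa.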
<Paragraph position="3"> The next two sections give experiments on the two key components of a discriminative machine translation system: choosing the proper update strategy (Section 5) and including powerful features (Section 6).</Paragraph> </Section> <Section position="7" start_page="763" end_page="764" type="metho"> <SectionTitle> 5 Update strategies </SectionTitle> <Paragraph position="0"> This section describes the importance of choosing a good update strategy: the difference in BLEU score between strategies can be as large as 1.2. An update strategy specifies the target (y_t, h_t) that we update towards (Equation 3), given the current set of parameters and a provided reference translation (x_i, y_i). As mentioned in Section 2.2, faithful output (i.e., y_t = y_i) does not imply that updating towards (y_t, h_t) is desirable.</Paragraph> <Paragraph position="1"> In fact, such a constrained target might not even be reachable by the decoder, for example, if the reference is very non-literal.</Paragraph> <Paragraph position="2"> We explored the following three ways to choose the target (y_t, h_t): * Bold updating: Update towards the highest scoring option (y, h), where y is constrained to be the reference y_i but h is unconstrained. Examples not reachable by the decoder are skipped.</Paragraph> <Paragraph position="3"> * Local updating: Generate an n-best list using the current parameters. Update towards the option with the highest BLEU score.5</Paragraph> <Paragraph position="4"> * Hybrid updating: Do a bold update when the reference translation is reachable; otherwise, do a local update.</Paragraph> <Paragraph position="5"> [Footnote 5: Since BLEU score (k-BLEU with k = 4) involves computing a geometric mean over i-grams, i = 1, ..., k, it is zero if the translation does not have at least one k-gram in common with the reference translation. Since a BLEU score of zero is both unhelpful for choosing from the n-best list and common when computed on just a single example, we instead used a smoothed version for choosing the target: $\sum_{i=1}^{4} \text{i-BLEU}(x,y) / 2^{4-i+1}$. We still report NIST's usual 4-gram BLEU.]</Paragraph> <Paragraph position="6"> [Figure 2: The decoder produces an n-best list. The reference translation may or may not be reachable.]</Paragraph>
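The smoothed score in footnote 5 and the three strategies can be sketched as follows; here i-BLEU is read as BLEU truncated at n-gram order i, and the decoder calls (reachable_target, nbest) are assumed interfaces rather than the paper's API:

import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[j:j + n]) for j in range(len(tokens) - n + 1))

def i_bleu(hyp, ref, i):
    # Geometric mean of 1..i-gram precisions, with the usual brevity penalty.
    precisions = []
    for n in range(1, i + 1):
        hyp_counts, ref_counts = ngram_counts(hyp, n), ngram_counts(ref, n)
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if overlap == 0:
            return 0.0
        precisions.append(overlap / max(1, len(hyp) - n + 1))
    bp = min(1.0, math.exp(1 - len(ref) / max(1, len(hyp))))
    return bp * math.exp(sum(map(math.log, precisions)) / i)

def smoothed_bleu(hyp, ref, k=4):
    # sum_{i=1..k} i-BLEU(hyp, ref) / 2^(k - i + 1), as in footnote 5.
    return sum(i_bleu(hyp, ref, i) / 2 ** (k - i + 1) for i in range(1, k + 1))

def choose_target(x, y_ref, w, decoder, strategy):
    """Pick the update target (y_t, h_t); None means skip the example."""
    bold = decoder.reachable_target(x, y_ref, w)  # best (y_ref, h), or None
    if strategy == "bold":
        return bold
    if strategy == "hybrid" and bold is not None:
        return bold
    # Local update: best-BLEU candidate on the n-best list.
    nbest = decoder.nbest(x, w, n=10)
    return max(nbest, key=lambda yh: smoothed_bleu(yh[0], y_ref))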
<Paragraph position="7"> Bold updating most resembles the traditional perceptron update rule (Equation 2). We are ensured that the target output y will be correct, although the correspondence h might be bad. Another weakness of bold updating is that we might not make full use of the training data.</Paragraph> <Paragraph position="8"> Local updating uses every example, but its steps are more cautious. It can be viewed as "dynamic reranking," where parameters are updated using the best option on the n-best list, similar to standard static reranking. The key difference is that, unlike static reranking, the parameter updates propagate back to the baseline classifier, so that the n-best list improves over time. In this regard, dynamic reranking remedies one of the main weaknesses of static reranking, namely that the performance of the system is directly limited by the quality of the baseline classifier.</Paragraph> <Paragraph position="9"> Hybrid updating combines the two strategies: it makes full use of the training data as in local updating, but still tries to make swift progress towards the reference translation as in bold updating.</Paragraph> <Paragraph position="10"> We conducted experiments to see which of the updating strategies worked best. We trained on 5000 of the 67K available examples, using the BLANKET+LEX+POS feature set (Section 6).</Paragraph> <Paragraph position="11"> [Table 2: Results of the different updating strategies for the monotonic and limited distortion decoders on DEV.]</Paragraph> <Paragraph position="12"> Table 2 shows that local updating is the most effective, especially when using the limited distortion decoder.</Paragraph> <Paragraph position="13"> In bold updating, only a small fraction of the 5000 examples (1296 for the monotonic decoder and 1601 for the limited distortion decoder) had reachable reference translations and therefore contributed to parameter updates. One might therefore hypothesize that local updating performs better simply because it is able to leverage more data. This is not the full story, however, since the hybrid approach (which makes the same number of updates) performs significantly worse than local updating when using the limited distortion decoder.</Paragraph> <Paragraph position="14"> To see the problem with bold updating, recall the example in Figure 1. Bold updating tries to reach the reference at all costs, even if it means abusing the hidden correspondence in the process. In the example, the alignment (', a) is unreasonable, but the algorithm has no way to recognize this. Local updating is much more stable since it only updates towards sentences on the n-best list. When using the limited distortion decoder, bold updating is even more problematic because the added flexibility of phrase swaps allows more preposterous alignments to be produced. Limited distortion decoding actually performs worse than monotonic decoding with bold updating, but better with local updating.</Paragraph> <Paragraph position="15"> Another difference between bold and local updating is that the BLEU score on the training data is dramatically higher for bold updating than for local (or hybrid) updating: 80 for the former versus 40 for the latter. This is not surprising, given that bold updating aggressively tries to obtain the references. What is surprising is that although bold updating appears to be overfitting severely, its BLEU score on DEV does not suffer much in the monotonic case.</Paragraph> </Section> <Section position="8" start_page="764" end_page="766" type="metho"> <SectionTitle> 6 Features </SectionTitle> <Paragraph position="0"> This section shows that by adding an array of expressive features and discriminatively learning their weights, we can obtain a 2.3-point increase in BLEU score on DEV. We add these features incrementally, first tuning blanket features (Section 6.1), then adding lexical features (Section 6.2), and finally adding part-of-speech (POS) features (Section 6.3). Table 3 summarizes the performance gains.</Paragraph> <Paragraph position="1"> For the experiments in this section, we used the local updating strategy and the monotonic decoder for efficiency. We trained on all 67K of the length 5-15 sentences in TRAIN.6</Paragraph> <Section position="1" start_page="764" end_page="765" type="sub_section"> <SectionTitle> 6.1 Blanket features </SectionTitle> <Paragraph position="0"> The blanket features (BLANKET) consist of the translation log-probability and the language model log-probability, two of the components of the Pharaoh model (Section 2.1). After discriminative training, the relative weight of these two features is roughly 2:1, resulting in a BLEU score increase from 33.0 (setting both weights to 1) to 33.4.</Paragraph>
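As a rough illustration, a candidate's blanket-feature score is just a weighted sum of two log-probabilities. The sketch below assumes hypothetical tm/lm lookup interfaces (not Pharaoh's or the paper's actual code) and uses the roughly 2:1 learned weighting:

import math

def blanket_score(phrase_pairs, output_tokens, tm, lm, w_tm=2.0, w_lm=1.0):
    """Score a derivation with the two blanket features.

    `tm[(src, tgt)]` is a phrase translation probability and `lm(context)`
    a trigram probability; both lookups are assumed interfaces.
    """
    tm_logprob = sum(math.log(tm[(src, tgt)]) for src, tgt in phrase_pairs)
    lm_logprob = sum(
        math.log(lm(tuple(output_tokens[max(0, i - 2):i + 1])))
        for i in range(len(output_tokens))
    )
    return w_tm * tm_logprob + w_lm * lm_logprob

Halving the language model's relative weight is exactly the kind of change that lets the translation table's preference win out in the example that follows.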
<Paragraph position="1"> The following simple example gives a flavor of the discriminative approach. The untuned system translated the French phrase trente-cinq langues into five languages in a DEV example.</Paragraph> <Paragraph position="2"> Although the probability P(five | trente-cinq) = 0.065 is rightly much smaller than P(thirty-five | trente-cinq) = 0.279, the language model favors five languages over thirty-five languages. The trained system downweights the language model and recovers the correct translation.</Paragraph> <Paragraph position="3"> [Footnote 6: We used sentences of length 5-15 to facilitate comparisons with Koehn et al. (2003) and to enable rapid experimentation with various feature sets. Experiments on sentences of length 5-50 showed similar gains in performance.]</Paragraph> </Section> <Section position="2" start_page="765" end_page="765" type="sub_section"> <SectionTitle> 6.2 Lexical features </SectionTitle> <Paragraph position="0"> The blanket features provide a rough guide for translation, but they are far too coarse to fix specific mistakes. We therefore add lexical features (LEX) to allow for more fine-grained control. These features come in two varieties. Lexical phrase features indicate the presence of a specific translation phrase, such as (y a-t-il, are there), and lexical language model features indicate the presence of a specific output n-gram, such as "of the".</Paragraph> <Paragraph position="1"> Lexical language model features have been exploited successfully in discriminative language modeling to improve speech recognition performance (Roark et al., 2004). We confirm the utility of the two kinds of lexical features: BLANKET+LEX achieves a BLEU score of 35.0, an improvement of 1.6 over BLANKET.</Paragraph> <Paragraph position="2"> To understand the effect of adding lexical features, consider the ten features with the highest and lowest weights. These can in fact be traced back to the following example:

Input: y a-t-il des observations ?
B: are there any of comments ?
B+L: are there any comments ?

The second and third rows are the outputs of BLANKET (wrong) and BLANKET+LEX (correct), respectively. The correction can be credited to two changes in feature weights. First, the lexical feature (y a-t-il, are there any) has been assigned a negative weight and (y a-t-il, are there) a positive weight, to counter the fact that the former phrase incorrectly had a higher score in the original translation table. Second, (des, of) is preferred over (des, any), even though the former is a better translation in isolation. This apparent degradation causes no problems: when des should actually be translated to of, these words are usually embedded in larger phrases, in which case the isolated translation probability plays no role.</Paragraph>
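A minimal sketch of how the two kinds of lexical features could be read off a candidate translation as sparse indicator counts; the feature-name scheme is illustrative, not the paper's internal representation:

def lexical_features(phrase_pairs, output_tokens, max_order=3):
    """One indicator per phrase pair used, one per output n-gram."""
    feats = {}
    for src, tgt in phrase_pairs:  # e.g. ("y a-t-il", "are there")
        name = f"phrase:{src}|{tgt}"
        feats[name] = feats.get(name, 0.0) + 1.0
    for order in range(1, max_order + 1):  # output n-grams, e.g. "of the"
        for i in range(len(output_tokens) - order + 1):
            name = "lm:" + " ".join(output_tokens[i:i + order])
            feats[name] = feats.get(name, 0.0) + 1.0
    return feats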
<Paragraph position="3"> Another example of a related phenomenon is the following:

Input: ... pour cela que j ' ai voté favorablement .
B: ... for that i have voted in favour .
B+L: ... for this reason i voted in favour .

Counterintuitively, the phrase pair (j ' ai, I have) ends up with a very negative weight. The reason is that in French, j ' ai is often used in a paraphrastic construction which should be translated into the simple past in English. For that to happen, j ' ai needs to be aligned with I. Since (j ' ai, I) has a small score compared to (j ' ai, I have) in the original translation table, downweighting the latter pair allows this sentence to be translated correctly.</Paragraph> <Paragraph position="4"> A general trend is that literal phrase translations are downweighted. Lessening the pressure to translate certain phrases literally allows the language model to fill in the gaps appropriately with suitable non-literal translations. This point highlights the strength of discriminative training: weights are jointly tuned to account for the intricate interactions between overlapping phrases, something not achievable by estimating the weights directly from surface statistics.</Paragraph> </Section> <Section position="3" start_page="765" end_page="766" type="sub_section"> <SectionTitle> 6.3 Part-of-speech features </SectionTitle> <Paragraph position="0"> While lexical features are useful for eliminating specific errors, they have limited ability to generalize to related phrases. This suggests the use of similar features abstracted to the POS level.7 In our experiments, we used the TreeTagger POS tagger (Schmid, 1994), which ships pre-trained on several languages, to map each word to its majority POS tag. We could also relatively easily base our features on context-dependent POS tags: the entire input sentence is available before decoding begins, and the output sentence is decoded left-to-right and could be tagged incrementally.</Paragraph> <Paragraph position="1"> Where we had lexical phrase features, such as (la réalisation du droit, the right), we now also have their POS abstractions, for instance (DT NN IN NN, DT NN). This phrase pair is undesirable not because of particular lexical facts about la réalisation, but because dropping a nominal head is generally to be avoided. The lexical language model features have similar POS counterparts. With these two kinds of POS features, we obtained a 0.3 increase in BLEU score from BLANKET+LEX to BLANKET+LEX+POS.</Paragraph> <Paragraph position="2"> Finally, when we use the limited distortion decoder, it is important to learn when to swap adjacent phrases. Unlike Pharaoh, which simply has a uniform penalty for swaps, we would like to use context, in particular POS information. For example, we would like to know that if a (JJ, JJ) phrase is constructed after a (NN, NN) phrase, the two are reasonable candidates for swapping because of regular word-order differences between French and English. While the bulk of our results are presented for the monotonic case, the limited distortion results of Table 2 use these swap features; without parameterized swap features, accuracy was below the untuned monotonic baseline.</Paragraph> <Paragraph position="3"> [Figure 3: Three example phrase pairs. Constellations (a) and (b) have large positive weights and (c) has a large negative weight.]</Paragraph> <Paragraph position="4"> An interesting statistic is the number of nonzero feature weights learned using each feature set. BLANKET has only 4 features, while BLANKET+LEX has 1.55 million features.8 Remarkably, BLANKET+LEX+POS has fewer features, only 1.24 million. This is an effect of generalization ability: POS information somewhat reduces the need for specific lexical features.</Paragraph> <Paragraph position="5"> [Footnote 8: Both the language model and translation table components have two features, one for known words and one for unknown words.]</Paragraph>
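A sketch of the POS abstraction: each word is mapped to its majority tag, and every lexical phrase feature gets a POS-level twin. The tag lookup is assumed here to be a plain dict, not TreeTagger's actual interface:

def phrase_features(src_phrase, tgt_phrase, src_tags, tgt_tags):
    """Return the lexical feature and its POS abstraction for a phrase pair.

    `src_tags` / `tgt_tags` map each word to its majority POS tag, e.g.
    ("la réalisation du droit", "the right") -> ("DT NN IN NN", "DT NN").
    """
    src_pos = " ".join(src_tags.get(w, "UNK") for w in src_phrase.split())
    tgt_pos = " ".join(tgt_tags.get(w, "UNK") for w in tgt_phrase.split())
    return [
        f"phrase:{src_phrase}|{tgt_phrase}",  # lexical feature (Section 6.2)
        f"pos_phrase:{src_pos}|{tgt_pos}",    # POS abstraction (Section 6.3)
    ]

# e.g. phrase_features("la réalisation du droit", "the right", fr_tags, en_tags)
# -> ["phrase:la réalisation du droit|the right", "pos_phrase:DT NN IN NN|DT NN"]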
</Section> <Section position="4" start_page="766" end_page="766" type="sub_section"> <SectionTitle> 6.4 Alignment constellation features </SectionTitle> <Paragraph position="0"> Koehn et al. (2003) demonstrated that choosing the appropriate heuristic for extracting phrases is very important. They showed that the difference in BLEU score between various heuristics was as large as 2.0.</Paragraph> <Paragraph position="1"> The process of phrase extraction is difficult to optimize in a non-discriminative setting: many heuristics have been proposed (Koehn et al., 2003), but it is not obvious which one should be chosen for a given language pair. We propose a natural way to handle this part of the translation pipeline. The idea is to push the learning process all the way down to phrase extraction by parameterizing the phrase extraction heuristic itself. The heuristics in Koehn et al. (2003) decide whether to extract a given phrase pair based on the underlying word alignments (see Figure 3 for three examples), which we call constellations. Since we do not know which constellations correspond to good phrase pairs, we introduce an alignment constellation feature to indicate the presence of a particular alignment constellation.9</Paragraph> <Paragraph position="2"> Table 4 details the effect of adding constellation features on top of our previous feature sets.10 We get a minor increase in BLEU score from each feature set, although there is no gain from adding POS features on top of constellation features, probably because POS and constellation features provide redundant information for French-English translation.</Paragraph> <Paragraph position="3"> It is interesting to look at the constellations with the highest and lowest weights, which are perhaps surprising at first glance. At the top of the list are word inversions (Figure 3 (a) and (b)), while long monotonic constellations fall at the bottom of the list (c). Although monotonic translations are much more frequent than word inversions in our dataset, when translations are monotonic, shorter segmentations are preferred. This phenomenon is another manifestation of the complex interaction of phrase segmentations.</Paragraph>
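A sketch of how a constellation could be turned into an indicator feature: the word-alignment links inside a candidate phrase pair are serialized into a feature name, so training can reward or penalize particular extraction patterns. The alignment representation here is an assumption, not the paper's data structure:

def constellation_feature(alignment, src_span, tgt_span):
    """Serialize the alignment pattern inside a phrase pair.

    `alignment` is a set of (src_idx, tgt_idx) word-alignment links;
    `src_span` / `tgt_span` are (start, end) offsets of the phrase pair.
    """
    (s0, s1), (t0, t1) = src_span, tgt_span
    # Keep only links inside the phrase pair, re-indexed to the phrase.
    links = sorted((i - s0, j - t0) for i, j in alignment
                   if s0 <= i < s1 and t0 <= j < t1)
    shape = f"{s1 - s0}x{t1 - t0}"  # phrase dimensions, e.g. "2x2"
    return "constellation:" + shape + ":" + ",".join(f"{i}-{j}" for i, j in links)

# A monotonic 2x2 pattern and an inverted one yield distinct features:
# constellation_feature({(0, 0), (1, 1)}, (0, 2), (0, 2)) -> "constellation:2x2:0-0,1-1"
# constellation_feature({(0, 1), (1, 0)}, (0, 2), (0, 2)) -> "constellation:2x2:0-1,1-0"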
</Section> </Section> <Section position="9" start_page="766" end_page="767" type="metho"> <SectionTitle> 7 Final results </SectionTitle> <Paragraph position="0"> The last column of Table 3 shows the performance of our methods on the final TEST set. Our best test BLEU score is 29.6, using BLANKET+LEX+POS, an increase of 1.3 BLEU over our untuned feature set BLANKET. The discrepancy between DEV and TEST performance is due to temporal distance from TRAIN and high variance in BLEU score.11</Paragraph> <Paragraph position="1"> We also compared our model with Pharaoh (Koehn et al., 2003). We tuned Pharaoh's four parameters using minimum error rate training (Och, 2003) on DEV.12 We obtained an increase of 0.8 BLEU over Pharaoh run with the monotone flag.13 Even though we are using a monotonic decoder, our best results are still slightly better than the version of Pharaoh that permits arbitrary distortion.</Paragraph> </Section> <Section position="10" start_page="767" end_page="767" type="metho"> <SectionTitle> 8 Related work </SectionTitle> <Paragraph position="0"> In machine translation, most discriminative approaches currently fall into two general categories.</Paragraph> <Paragraph position="1"> The first approach is to reuse the components of a generative model but tune their relative weights in a discriminative fashion (Och and Ney, 2002; Och, 2003; Chiang, 2005). In practice, this approach only works with a small handful of parameters.</Paragraph> <Paragraph position="2"> The second approach is to use reranking, in which a baseline classifier generates an n-best list of candidate translations and a separate discriminative classifier chooses among them (Shen et al., 2004; Och et al., 2004). The major limitation of a reranking system is its dependence on the underlying baseline system, which bounds the potential improvement from discriminative training. In machine translation, this limitation is a real concern: it is common for all translations on moderately-sized n-best lists to be of poor quality. For instance, Och et al. (2004) reported that a 1000-best list was required to achieve performance gains from reranking. In contrast, the decoder in our system can use the feature weights learned in the previous iteration.</Paragraph> <Paragraph position="3"> Tillmann and Zhang (2005) present a discriminative approach based on local models. Their formulation explicitly decomposes the score of a translation into a sequence of local decisions, while our formulation allows global estimation.</Paragraph> </Section> </Paper>