<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1023">
<Title>Discriminative Reranking for Machine Translation</Title>
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 Discriminative Reranking for MT </SectionTitle>
<Paragraph position="0"> The reranking approach for MT is defined as follows: First, a baseline system generates $n$-best candidate translations. Features that can potentially discriminate between good and bad translations are extracted from these $n$-best candidates.</Paragraph>
<Paragraph position="1"> These features are then used to determine a new ranking for the $n$-best list. The new top-ranked candidate in this $n$-best list is our new best candidate translation.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.1 Advantages of Discriminative Reranking </SectionTitle>
<Paragraph position="0"> First, discriminative reranking allows us to use global features that are unavailable to the baseline system. Second, we can use features of various kinds and need not worry about fine-grained smoothing issues. Finally, the statistical machine learning approach has been shown to be effective in many NLP tasks. Reranking also enables rapid experimentation with complex feature functions, because the complex decoding steps in SMT are performed only once, to generate the $n$-best list of translations.</Paragraph>
</Section>
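<Paragraph> For illustration, the short sketch below shows only the rescoring step of this pipeline: each candidate in the $n$-best list is scored with a linear function of its feature vector and the list is re-sorted by that score. The feature values and the weight vector here are placeholders, not the features or the training procedure used in this paper.

    import numpy as np

    def rerank(nbest_features, w):
        # Score every candidate with the linear function f(x) = w . x
        # and return (index of the new best candidate, new ranking).
        scores = nbest_features @ w
        order = np.argsort(-scores)      # best candidate first
        return order[0], order

    # Toy usage with made-up numbers (three candidates, three features each).
    feats = np.array([[0.2, 1.0, 3.0],
                      [0.4, 0.5, 2.0],
                      [0.1, 2.0, 1.0]])
    w = np.array([1.0, 0.5, -0.2])
    best, new_order = rerank(feats, w)
</Paragraph>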
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.2 Problems applying reranking to MT </SectionTitle>
<Paragraph position="0"> First, we consider how to apply discriminative reranking to machine translation. We might directly use the algorithms that have been successful in parse reranking, but we immediately find that they are not as appropriate for machine translation. Let $t_i$ be the candidate ranked at the $i$th position for a source sentence, where the ranking is defined on the quality of the candidates. In parse reranking, we look for parallel hyperplanes that successfully separate $t_1$ from $t_2, \ldots, t_n$ for all the source sentences. In MT, however, each source sentence has a set of reference translations instead of a single gold standard, so it is hard to define which candidate translation is the best. Suppose we have two translations, one of which is close to reference translation $\mathrm{ref}_1$ while the other is close to reference translation $\mathrm{ref}_2$; it is difficult to say that one candidate is better than the other.</Paragraph>
<Paragraph position="1"> Even if we invent a metric to define the quality of a translation, standard reranking algorithms still cannot be applied to MT directly. In parse reranking, each training sentence has a ranked list of 27 candidates on average (Collins, 2000), but for machine translation the number of candidate translations in the $n$-best list is much higher: (SMT Team, 2003) show that at least 1000 candidates need to be considered in the $n$-best list to obtain a reasonable improvement in the BLEU score.</Paragraph>
<Paragraph position="2"> In addition, parallel hyperplanes separating $t_1$ from $t_2, \ldots, t_n$ are actually unable to distinguish good translations from bad translations, since they are not trained to distinguish among the translations in $t_2, \ldots, t_n$. Furthermore, many good translations in $t_2, \ldots, t_n$ may differ greatly from $t_1$, since there are multiple references. These facts cause problems for the applicability of standard reranking algorithms.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.3 Splitting </SectionTitle>
<Paragraph position="0"> Our first attempt to handle this problem is to redefine the notion of good translations versus bad translations. Instead of separating $t_1$ from $t_2, \ldots, t_n$, we say that the top $r$ translations of the $n$-best list are good translations and the bottom $k$ translations are bad translations, and we look for hyperplanes that separate the good translations from the bad translations of each sentence.</Paragraph>
</Section>
<Section position="4" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.4 Ordinal Regression </SectionTitle>
<Paragraph position="0"> Furthermore, if we only look for hyperplanes that separate the good and the bad translations, we in fact discard the order information among translations of the same class. Knowing that $t_{100}$ is better than $t_{101}$ may be of little use for training, but knowing that $t_2$ is better than $t_{300}$ is useful if $r = 100$. Although we cannot give an affirmative answer at this time, it is at least reasonable to use the ordering information; the problem is how to use it. In addition, we only want to maintain the order of two candidates if their ranks are far away from each other. We do not care about the order of two translations whose ranks are very close, e.g. 100 and 101. Thus insensitive ordinal regression is more desirable and is the approach we follow in this paper.</Paragraph>
</Section>
<Section position="5" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.5 Uneven Margins </SectionTitle>
<Paragraph position="0"> However, reranking is not an ordinal regression problem. In reranking evaluation, we are only interested in the quality of the translation with the highest score, and we do not care about the order of the bad translations. Therefore we cannot simply regard a reranking problem as an ordinal regression problem, since they have different definitions of the loss function.</Paragraph>
<Paragraph position="1"> As far as linear classifiers are concerned, we want to maintain a larger margin between translations of high rank and translations of low rank, and a smaller margin between translations whose ranks are both low. For example, $\mathrm{margin}(t_1, t_{30}) > \mathrm{margin}(t_1, t_{10}) > \mathrm{margin}(t_{21}, t_{30})$. The reason is that the scoring function will be penalized if it cannot separate $t_1$ from $t_{10}$, but not if it fails to order $t_{21}$ and $t_{30}$ correctly.</Paragraph>
</Section>
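<Paragraph> As a minimal illustration of such an uneven margin, the function below makes the required margin shrink as the ranks involved get worse. The particular choice $g(p, q) = 1/p - 1/q$ is an assumption made only for this illustration; the margin function actually used is the one discussed in Section 4.2.

    def margin(p, q):
        # Required margin between the candidates ranked p and q (p better than q).
        # It is large when a high-ranked candidate is involved and small when
        # both ranks are low; g(p, q) = 1/p - 1/q is one simple such choice.
        return 1.0 / p - 1.0 / q

    # The ordering required above holds for this choice:
    # margin(1, 30) > margin(1, 10) > margin(21, 30)
    assert margin(1, 30) > margin(1, 10) > margin(21, 30)
</Paragraph>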
<Section position="6" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.6 Large Margin Classifiers </SectionTitle>
<Paragraph position="0"> There are quite a few linear classifiers1 that can separate samples with a large margin, such as SVMs (Vapnik, 1998), Boosting (Schapire et al., 1997), Winnow (Zhang, 2000) and Perceptron (Krauth and Mezard, 1987). The performance of SVMs is superior to that of other linear classifiers because of their ability to maximize the margin.</Paragraph>
<Paragraph position="1"> However, SVMs are extremely slow to train, since they require solving a quadratic programming problem. For example, SVMs could not even be trained on the whole Penn Treebank for parse reranking (Shen and Joshi, 2003).</Paragraph>
<Paragraph position="2"> Taking this into account, we use perceptron-like algorithms: the perceptron algorithm is fast to train, which allows us to run experiments on real-world data, and its large margin version is able to provide relatively good results in general.</Paragraph>
<Paragraph position="3"> 1 Here we only consider linear kernels such as polynomial kernels.</Paragraph>
</Section>
<Section position="7" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 3.7 Pairwise Samples </SectionTitle>
<Paragraph position="0"> In previous work on the PRank algorithm, ranks are defined on the entire training and test data, so boundaries between consecutive ranks can be defined on the entire data set. But in MT reranking, ranks are defined over every single source sentence: in our data set, the rank of a translation is only its rank among all the translations for the same sentence. The training data includes about 1000 sentences, each of which normally has 1000 candidate translations, with the exception of short sentences that have fewer candidate translations. As a result, we cannot use the PRank algorithm for the reranking task, since there are no global ranks or boundaries for all the samples.</Paragraph>
<Paragraph position="1"> However, the approach of using pairwise samples does work. By pairing up two samples, we compute the relative distance between these two samples in the scoring metric; in the training phase, we are only interested in whether this relative distance is positive or negative.</Paragraph>
<Paragraph position="2"> However, the number of generated training samples can be very large: for $n$ samples, the total number of pairwise samples in (Herbrich et al., 2000) is roughly $n^2$. In the next section, we will introduce two perceptron-like algorithms that utilize pairwise samples while keeping the complexity of the data space unchanged.</Paragraph>
</Section>
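<Paragraph> For illustration, the sketch below builds pairwise samples from the $n$-best list of a single sentence in the naive style of (Herbrich et al., 2000): every pair of candidates whose ranks are far enough apart yields one difference vector that a linear model should score positively. The rank-gap threshold and the data layout are assumptions made for this sketch; the algorithms in Section 4 use pairwise comparisons without materializing all of these pairs.

    import numpy as np

    def pairwise_samples(feats, ranks, min_gap=10):
        # feats: (n, d) array of candidate feature vectors for one sentence.
        # ranks: length-n array of ranks under the evaluation metric (1 = best).
        # Only pairs whose ranks differ by at least min_gap are used, a stand-in
        # for the insensitivity discussed in Section 3.4.
        pairs = []
        n = len(ranks)
        for a in range(n):
            for b in range(n):
                if ranks[b] - ranks[a] >= min_gap:   # candidate a is much better than b
                    # A linear model w should satisfy w . (x_a - x_b) > 0.
                    pairs.append(feats[a] - feats[b])
        return pairs
</Paragraph>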
</Section>
<Section position="5" start_page="0" end_page="0" type="metho">
<SectionTitle> 4 Reranking Algorithms </SectionTitle>
<Paragraph position="0"> Considering the desiderata discussed in the last section, we present two perceptron-like algorithms for MT reranking. The first one is a splitting algorithm specially designed for MT reranking, which has similarities to a classification algorithm. We also experimented with an ordinal regression algorithm proposed in (Shen and Joshi, 2004). For the sake of completeness, we will briefly describe the algorithm here.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.1 Splitting </SectionTitle>
<Paragraph position="0"> In this section, we propose a splitting algorithm that separates the translations of each sentence into two parts, the top $r$ translations and the bottom $k$ translations. All the separating hyperplanes are parallel, since they share the same weight vector $w$. The margin is defined on the distance between the top $r$ items and the bottom $k$ items in each cluster, as shown in Figure 1.</Paragraph>
<Paragraph position="1"> Let $x_{i,j}$ be the feature vector of the $j$th translation of the $i$th sentence, and let $y_{i,j}$ be the rank of this translation among all the translations for the $i$th sentence. Then the set of training samples is $S = \{(x_{i,j}, y_{i,j}) : 1 \le i \le m,\ 1 \le j \le n\}$, where $m$ is the number of clusters and $n$ is the length of the rank list for each cluster.</Paragraph>
<Paragraph position="2"> Let $f(x) = w \cdot x$ be a linear function, where $x$ is the feature vector of a translation and $w$ is a weight vector. We construct a hypothesis function $h_f$ with $f$ as follows: $h_f(x_{i,1}, \ldots, x_{i,n}) = \mathrm{rank}(f(x_{i,1}), \ldots, f(x_{i,n}))$, where $\mathrm{rank}$ is a function that takes a list of scores for the candidate translations and returns the rank of each candidate in that list. For example, $\mathrm{rank}(0.3, 0.9, 0.5) = (3, 1, 2)$.</Paragraph>
<Paragraph position="3"> For each sentence, we require that (1) $f$ assign each of the top $r$ translations a higher score than each of the bottom $k$ translations, and (2) the two groups be separated by a positive margin, which means that $f$ can successfully separate the good translations from the bad translations. If there exists a linear function $f$ satisfying (1) and (2), we say that the training samples $\{(x_{i,j}, y_{i,j})\}$ are splittable by $f$.</Paragraph>
<Paragraph position="4"> Algorithm 1 is a perceptron-like algorithm that looks for a function that splits the training data. The idea of the algorithm is as follows. For every two translations of the same sentence, one from the top $r$ and one from the bottom $k$, the algorithm checks whether the current weight vector separates them by the required margin; pairs that violate this condition are inconsistent and trigger an update of the weight vector. However, the update is not executed until all the inconsistent pairs in a sentence have been found, for the purpose of speeding up the algorithm. When sentence $i$ is selected, we first compute and store $w_t \cdot x_{i,j}$ for all $j$, so that we do not need to recompute these scores in the inner loop. With this caching, the complexity of one repeat iteration is $O(mn^2 + mnd)$, where $d$ is the average number of active features in a vector $x_{i,j}$. If we updated the weight vector whenever an inconsistent pair was found, the complexity of one iteration would be $O(mn^2 d)$.</Paragraph>
<Paragraph position="5"> The following theorem shows that Algorithm 1 stops after a finite number of steps and outputs a function that splits the training data with a large margin, provided the training data are splittable. Due to lack of space, we omit the proof of Theorem 1 in this paper.</Paragraph>
<Paragraph position="6"> Theorem 1 Suppose the training samples $\{(x_{i,j}, y_{i,j})\}$ are splittable by a linear function defined on the weight vector $w^*$ with a splitting margin $\tau$, where $||w^*|| = 1$. Let $R = \max_{i,j} ||x_{i,j}||$. Then Algorithm 1 makes a finite number of mistakes, bounded in terms of $R$ and $\tau$, before it finds a function that splits the training data.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 4.2 Ordinal Regression </SectionTitle>
<Paragraph position="0"> The second algorithm that we will use for MT reranking is the $\epsilon$-insensitive ordinal regression with uneven margin proposed in (Shen and Joshi, 2004), shown as Algorithm 2.</Paragraph>
<Paragraph position="1"> In Algorithm 2, the function $\mathrm{dis}$ is used to control the level of insensitivity, and the function $g$ is used to control the learning margin between pairs of translations with different ranks, as described in Section 3.5. There are many candidates for $g$; the definition we adopt is one of the simplest, and we use this function in our experiments on MT reranking.</Paragraph>
</Section>
</Section>
</Paper>