<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1023"> <Title>Discriminative Reranking for Machine Translation</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Experiments and Analysis </SectionTitle> <Paragraph position="0"> We provide experimental results on the NIST 2003 Chinese-English large data track evaluation. We use the data set used in (SMT Team, 2003). The training data consists of about 170M English words, on which the baseline translation system is trained. The training data is also used to build the language models that define feature functions on various syntactic levels. The development data consists of 993 Chinese sentences, each associated with the 1000-best English translations generated by the baseline MT system. The development data set is used to estimate the parameters of the feature functions for reranking. (Every single feature was combined with the 6 baseline features for training and testing; minimum error training (Och, 2003) was used on the development data for parameter estimation.)</Paragraph> <Paragraph position="1"> The test data consists of 878 Chinese sentences, each likewise associated with its 1000-best English translations. The test set is used to assess the quality of the reranking output.</Paragraph> <Paragraph position="2"> In (SMT Team, 2003), 450 features were generated.</Paragraph> <Paragraph position="3"> Six features from (Och, 2003) were used as baseline features. Each of the 450 features was evaluated independently by combining it with the 6 baseline features and assessing on the test data with minimum error training. The baseline BLEU score on the test set is 31.6%. Table 1 shows some of the best performing features.</Paragraph> <Paragraph position="4"> In (SMT Team, 2003), aggressive search was used to combine features.
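The aggressive search is essentially a greedy forward selection over candidate features, scored by a BLEU-like metric on held-out data. A minimal sketch of this kind of search (the function and feature names, and the scoring interface, are our own illustrations, not the actual system):

```python
# Hedged sketch of an aggressive (greedy forward) feature-combination
# search: repeatedly add the single candidate feature that most improves
# the held-out score, and stop when no candidate helps. All names here
# are hypothetical.

def greedy_feature_search(candidates, evaluate, baseline_features):
    """Greedily grow a feature set; `evaluate` returns a BLEU-like score."""
    selected = list(baseline_features)
    best_score = evaluate(selected)
    improved = True
    while improved and candidates:
        improved = False
        best_feat = None
        for feat in candidates:
            score = evaluate(selected + [feat])
            if score > best_score:
                best_score, best_feat = score, feat
                improved = True
        if best_feat is not None:
            selected.append(best_feat)
            candidates = [f for f in candidates if f != best_feat]
    return selected, best_score
```

Each round requires one retuning-and-evaluation run per remaining candidate, which is why such a search stops being worthwhile once added features no longer move the score.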
After combining about a dozen features, the BLEU score did not improve any more; the score was 32.9%. It was also noticed that the major improvement came from the Model 1 feature. By combining four features, Model 1, matched parentheses, matched quotation marks, and the POS language model, the system achieved a BLEU score of 32.6%.</Paragraph> <Paragraph position="5"> In our experiments, we use 4 different kinds of feature combinations: - Baseline: the 6 baseline features used in (Och, 2003), such as the cost of the word penalty and the cost of the aligned template penalty.</Paragraph> <Paragraph position="6"> - Best Feature: Baseline + IBM Model 1 + matched parentheses + matched quotation marks + POS language model.</Paragraph> <Paragraph position="7"> - Top Twenty: Baseline + 14 features with an individual BLEU score of no less than 31.9% under minimum error training.</Paragraph> <Paragraph position="8"> - Large Set: Baseline + 50 features with an individual BLEU score of no less than 31.7% under minimum error training. Since the baseline is 31.6% and the 95% confidence range is ±0.9%, most of the features in this set are not individually discriminative with respect to the BLEU metric.</Paragraph> <Paragraph position="9"> We apply Algorithms 1 and 2 to the four feature sets.</Paragraph> <Paragraph position="10"> For Algorithm 1, the splitting algorithm, we set the split point within the 1000-best translations given by the baseline MT system. For Algorithm 2, the ordinal regression algorithm, we set the updating condition so that a pair triggers an update only when one hypothesis's rank number is at most half of the other's and there are at least 20 ranks in between. Figures 2-9 show the results of using Algorithms 1 and 2 with the four feature sets. The x-axis represents the number of iterations in the training.
The left y-axis stands for the BLEU% score on the test data, and the right y-axis stands for the log of the loss function on the development data.</Paragraph> <Paragraph position="11"> Algorithm 1, the splitting algorithm, converges on the first three feature sets. The smaller the feature set, the faster the algorithm converges. It achieves a BLEU score of 31.7% on the Baseline, 32.8% on the Best Feature set, but only 32.6% on the Top Twenty features. However, this difference is within the 95% confidence range. Unfortunately, on the Large Set, Algorithm 1 converges very slowly.</Paragraph> <Paragraph position="12"> The Top Twenty set contains fewer individually non-discriminative features, making the pool of features &quot;better&quot;. In addition, generalization performance on the Top Twenty set is better than on the Large Set due to the smaller set of &quot;better&quot; features, cf. (Shen and Joshi, 2004). If the number of non-discriminative features is large enough, the data set becomes unsplittable. We have tried using the λ trick as in (Li et al., 2002) to make the data separable artificially, but performance could not be improved with such features.</Paragraph> <Paragraph position="13"> We achieve similar results with Algorithm 2, the ordinal regression algorithm with uneven margins. It converges on the first 3 feature sets too. On the Baseline, it achieves 31.4%. We notice that the model is over-trained on the development data according to the learning curve. On the Best Feature set, it achieves 32.7%, and on the Top Twenty features, it achieves 32.9%. This algorithm does not converge on the Large Set within 10000 iterations.</Paragraph> <Paragraph position="14"> We compare our perceptron-like algorithms with the minimum error training used in (SMT Team, 2003), as shown in Table 2.
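The perceptron-like training compared here can be sketched, very roughly, as a pairwise update over each n-best list. The updating condition below mirrors the verbal description given earlier (one rank number at most half of the other's, at least 20 ranks apart); the learning rate, margin constant, and all names are our own illustrative choices, not the authors' actual settings:

```python
# Hedged sketch of a perceptron-style pairwise reranking update over one
# n-best list. Hyperparameters and names are hypothetical illustrations.
import numpy as np

def perceptron_rerank_epoch(nbest_feats, ranks, w, tau=0.01, margin=1.0):
    """One pass over one n-best list.

    nbest_feats: (n, d) feature matrix; ranks: metric-based rank per
    hypothesis (0 = best). When hypothesis i is ranked far above j but
    does not outscore j by `margin`, nudge w toward i's features.
    """
    n = len(ranks)
    for i in range(n):
        for j in range(n):
            # Stand-in for the updating condition described in the text:
            # i's rank is at most half of j's, with a gap of at least 20.
            if ranks[j] > 2 * ranks[i] and ranks[j] - ranks[i] >= 20:
                if w @ nbest_feats[j] + margin > w @ nbest_feats[i]:
                    w = w + tau * (nbest_feats[i] - nbest_feats[j])
    return w
```

In contrast to minimum error training, which tunes the weights directly against the corpus-level metric, this style of update optimizes a pairwise loss whose decrease on the development data is expected to track test-set improvement.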
The splitting algorithm achieves slightly better results on the Baseline and the Best Feature set, while the minimum error training and the regression algorithm tie for first place on the other feature combinations. However, the differences are not significant.</Paragraph> <Paragraph position="15"> We notice that on the separable feature sets, performance on the development data and the test data is tightly consistent: whenever the log-loss on the development set decreases, the BLEU score on the test set goes up, and vice versa. This tells us the merit of these two algorithms: by minimizing the loss function on the development data, we can improve performance on the test data. This property is guaranteed by the theoretical analysis and is borne out in the experimental results.</Paragraph> </Section> </Paper>