<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3227"> <Title>Phrase Pair Rescoring with Term Weightings for Statistical Machine Translation</Title> <Section position="7" start_page="31" end_page="31" type="evalu"> <SectionTitle> 6 Experiments </SectionTitle> <Paragraph position="0"> Experiments were carried out on the so-called large data track Chinese-English TIDES translation task, using the June 2002 test data. The training data used to train the statistical lexicon and to extract the phrase translation pairs was selected from a 120 million word parallel corpus in such a way as to cover the phrases in the test sentences. The restricted training corpus then contained approximately 10 million words. A trigram model was built on 20 million words of general newswire text, using the SRILM toolkit (Stolcke, 2002). Decoding was carried out as described in section 2.2. The test data consists of 878 Chinese sentences, or 24,337 words after word segmentation. There are four human translations per Chinese sentence as references. Both the NIST score and the Bleu score (in percentage) are reported, covering the adequacy and fluency aspects of translation quality.</Paragraph> <Section position="1" start_page="31" end_page="31" type="sub_section"> <SectionTitle> 6.1 Transducers </SectionTitle> <Paragraph position="0"> Four transducers were used in our experiments: LDC, BiBr, HMM, and ISA.</Paragraph> <Paragraph position="1"> LDC was built from the LDC Chinese-English dictionary in two steps. First, morphological variations were created: for nouns and noun phrases, plural forms and entries with definite and indefinite determiners were generated; for verbs, additional word forms with -s, -ed, and -ing were generated, as well as the infinitive form with 'to'. Second, a large monolingual English corpus was used to filter the new word forms: if they did not appear in the corpus, the new entries were not added to the transducer (Vogel, 2004).</Paragraph> <Paragraph position="2"> BiBr extracts sub-tree mappings from Bilingual Bracketing alignments (Wu, 1997); HMM extracts partial path mappings from the Viterbi path in the Hidden Markov Model alignments (Vogel et al., 1996). ISA is an integrated segmentation and alignment for phrases (Zhang et al., 2003), which is an extension of (Marcu and Wong, 2002).</Paragraph> <Paragraph position="3"> Table-1 summarizes the transducers extracted for the translation task. N is the total number of phrase pairs in the transducer. LDC is the largest one, with 425K entries, as the other transducers are restricted to 'useful' entries, i.e. those translation pairs whose source phrase matches a sequence of words in one of the test sentences. Notice that the LDC dictionary has a large number of long translations, leading to a high source-to-target length ratio.</Paragraph> </Section> <Section position="2" start_page="31" end_page="31" type="sub_section"> <SectionTitle> 6.2 Cosine vs BM25 </SectionTitle> <Paragraph position="0"> The normalized cosine and bm25 distances defined in (8) and (9), respectively, are plugged into (11) to calculate the translation probabilities.</Paragraph> <Paragraph position="1"> Initial experiments are reported on the LDC transducer, which already gives good translations and therefore allows for fast yet meaningful experimentation.</Paragraph> <Paragraph position="2"> In the first setting, uniform probabilities are assigned to each phrase pair in the transducer.
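As a rough illustration of how the term-weighted distances introduced above might be computed over a phrase pair, the sketch below treats both sides of a phrase pair as bags of words, projects the source words onto target words through a Model-1 style lexicon p(e|f), and compares the two weighted vectors. This is a minimal, hypothetical sketch: equations (8), (9), and (11) are defined earlier in the paper and are not reproduced here, so the idf smoothing, the lexicon projection, and the BM25 parameters k1 and b are assumptions rather than the authors' exact formulation.

import math
from collections import Counter

# Hypothetical sketch of term-weighted phrase-pair scoring (not the paper's
# exact equations (8), (9), (11)): bag-of-words vectors with idf weights,
# source words projected onto target words via a Model-1 style lexicon.

def idf(word, doc_freq, n_docs):
    # inverse document frequency with add-one smoothing (assumed form)
    return math.log((n_docs + 1.0) / (doc_freq.get(word, 0) + 1.0))

def bm25_weight(tf, word, doc_freq, n_docs, length, avg_length, k1=1.2, b=0.75):
    # standard BM25 term weight; k1 and b are the usual tunable parameters
    return idf(word, doc_freq, n_docs) * tf * (k1 + 1.0) / (
        tf + k1 * (1.0 - b + b * length / avg_length))

def cosine_score(src_phrase, tgt_phrase, lexicon, doc_freq, n_docs):
    # project the source phrase into target-word space and weight both sides
    v_src, v_tgt = Counter(), Counter()
    for f, tf in Counter(src_phrase).items():
        for e, p in lexicon.get(f, {}).items():   # p is p(e|f) from the lexicon
            v_src[e] += p * tf * idf(f, doc_freq, n_docs)
    for e, tf in Counter(tgt_phrase).items():
        v_tgt[e] += tf * idf(e, doc_freq, n_docs)
    # cosine similarity between the two term-weighted vectors
    dot = sum(v_src[e] * v_tgt[e] for e in v_tgt)
    norm = math.sqrt(sum(x * x for x in v_src.values())) * \
           math.sqrt(sum(x * x for x in v_tgt.values()))
    return dot / norm if norm > 0.0 else 0.0

Swapping the plain idf weights for bm25_weight would give the bm25 variant; either score can then be normalized into a translation probability in the spirit of (11).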
The second one (Base-m1) uses Equation (1) with a statistical lexicon trained with IBM Model-1, and Base-m4 uses the lexicon from IBM Model-4.</Paragraph> <Paragraph position="3"> Base-m4S also uses IBM Model-4, but skips 194 high-frequency English stop words in the calculation of Equation (1).</Paragraph> <Paragraph position="4"> Table-2 shows that the translation score defined by Equation (1) is much better than the uniform model, as expected. Base-m4 is slightly worse than Base-m1 on the NIST score, but slightly better on the Bleu metric. Neither difference is statistically significant. The result for Base-m4S shows that skipping English stop words in Equation (1) hurts performance. One reason is that skipping discards too many non-trivial statistics from the parallel corpus, especially for short phrases. These high-frequency words already account for more than 40% of the tokens in the corpus.</Paragraph> <Paragraph position="5"> Using the vector model, both with the cosine distance d_cos and the bm25 distance d_bm25, is significantly better than the Base-m1 and Base-m4 models, which confirms our intuition that the vector model provides additional useful evidence for translation quality. The length regularization (12) helps only slightly for LDC. Since bm25's parameters could be tuned for potentially better performance, we selected bm25 with length regularization as the model tested in further experiments.</Paragraph> <Paragraph position="6"> A fully loaded system is tested using the LM020 with and without word-reordering in decoding. The results are presented in Table-3, which shows consistent improvements across all configurations: the individual transducers, combinations of transducers, and different decoder settings for word-reordering. Because each phrase pair is treated as a &quot;bag-of-words&quot;, grammatical structure is not well represented in the vector model. Thus our model is tuned more towards the adequacy aspect, which corresponds to the NIST score improvement.</Paragraph> <Paragraph position="7"> Because the BiBr, HMM, and ISA transducers are extracted from the same training data, they overlap significantly with each other. This is why we observe only small improvements when adding more transducers.</Paragraph> <Paragraph position="8"> The final NIST score of the full system is 8.24, and the Bleu score is 22.37. This corresponds to 3.1% and 11.8% relative improvements over the baseline. These improvements are statistically significant according to a previous study (Zhang et al., 2004), which shows that a 2% improvement in the NIST score and a 5% improvement in the Bleu score are significant for our translation system on the June 2002 test data.</Paragraph> </Section> <Section position="3" start_page="31" end_page="31" type="sub_section"> <SectionTitle> 6.3 Mean Reciprocal Rank </SectionTitle> <Paragraph position="0"> To further investigate the effects of the rescoring function in (11), Mean Reciprocal Rank (MRR) experiments were carried out. MRR for a labeled set is the mean of the reciprocal of the rank at which the best candidate translation of a source phrase is found (Kantor and Voorhees, 1996).</Paragraph> <Paragraph position="1"> In total, 9,641 phrase pairs were selected, containing 216 distinct source phrases. Each source phrase was labeled with its best translation candidate without ambiguity. The rank of the labeled candidate is calculated according to the translation scores. The results are shown in Table-4.
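For concreteness, the MRR computation just described can be sketched as follows. This is a minimal, hypothetical sketch: the data layout (a labeled best translation plus a candidate list per source phrase) and the score function are illustrative assumptions, not the system's actual interfaces.

def mean_reciprocal_rank(labeled_set, score):
    # labeled_set maps a source phrase to (labeled_best_translation, candidates);
    # score(src, tgt) returns the model's translation score for a phrase pair.
    total = 0.0
    for src, (best, candidates) in labeled_set.items():
        # rank candidates by descending translation score
        ranked = sorted(candidates, key=lambda tgt: score(src, tgt), reverse=True)
        rank = ranked.index(best) + 1   # 1-based rank of the labeled candidate
        total += 1.0 / rank
    return total / len(labeled_set)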
The rescoring functions improve the MRR from 0.40 to 0.58 using the cosine distance, and to 0.75 using bm25. This confirms our intuition that good translation candidates move up in rank after rescoring.</Paragraph> </Section> </Section> </Paper>