<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-1003">
<Title>Improved Statistical Machine Translation Using Paraphrases</Title>
<Section position="6" start_page="19" end_page="22" type="evalu">
<SectionTitle> 5 Results </SectionTitle>
<Paragraph position="0"> We produced translations under five conditions for each of our training corpora: a set of baseline translations without any additional entries in the phrase table, a condition where we added the translations of paraphrases for unseen source words along with paraphrase probabilities, a condition where we added the translations of paraphrases of unseen multi-word phrases along with paraphrase probabilities, and two further conditions where we added the translations of single word and multi-word paraphrases without paraphrase probabilities.</Paragraph>
<Section position="1" start_page="21" end_page="21" type="sub_section">
<SectionTitle> 5.1 Bleu scores </SectionTitle>
<Paragraph position="0"> Table 2 gives the Bleu scores for each of these conditions. We measured a translation improvement for all sizes of training corpora, under both the single word and multi-word conditions, except for the largest Spanish-English corpus. For the single word condition, it would have been surprising if we had seen a decrease in Bleu score. Because we are translating words that were previously untranslatable, it is unlikely that we could do any worse. In the worst case we would be replacing one word that did not occur in the reference translation with another, and thus have no effect on Bleu.</Paragraph>
<Paragraph position="1"> More interesting is the fact that by paraphrasing unseen multi-word units we get an increase in quality above and beyond the single word paraphrases.</Paragraph>
<Paragraph position="2"> These multi-word units may not have been observed in the training data as a unit, but each of the component words may have been. In this case translating a paraphrase is not guaranteed to receive an improved or identical Bleu score, as in the single word case. Thus the improved Bleu score is notable.</Paragraph>
<Paragraph position="3"> Table 3 shows that incorporating the paraphrase probability into the model's feature functions plays a critical role. Without it, the multi-word paraphrases harm translation performance when compared to the baseline.</Paragraph>
</Section>
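The role of the paraphrase probability can be made concrete with a short sketch. The following Python fragment is illustrative only, not the authors' implementation: it assumes a phrase table mapping source phrases to (target, feature-vector) pairs and a paraphrase table mapping source phrases to (paraphrase, probability) pairs, and it adds entries for otherwise-untranslatable source phrases, attaching the paraphrase probability as an extra feature (with a neutral value of 1 for entries already in the table).

```python
# Illustrative sketch (not the authors' code): expanding a phrase table with
# translations of paraphrases for source phrases that have no translations.

def expand_phrase_table(phrase_table, paraphrase_table):
    """phrase_table: {source phrase: [(target phrase, [feature values])]}
    paraphrase_table: {source phrase: [(paraphrase, p(f2|f1))]}
    Returns a new table in which every entry carries one extra feature:
    the paraphrase probability for entries created by paraphrasing,
    and a neutral 1.0 for entries from the original table."""
    expanded = {f: [(e, feats + [1.0]) for e, feats in entries]
                for f, entries in phrase_table.items()}
    for f1, paraphrases in paraphrase_table.items():
        if f1 in phrase_table:
            continue  # only paraphrase phrases that are otherwise untranslatable
        for f2, p_paraphrase in paraphrases:
            # inherit the translations (and features) of the paraphrase f2,
            # weighted by how good a paraphrase of f1 it is
            for e, feats in phrase_table.get(f2, ()):
                expanded.setdefault(f1, []).append((e, feats + [p_paraphrase]))
    return expanded
```

Under this reading, dropping the extra feature (or pinning it at 1.0 for paraphrase entries as well) corresponds to the "without paraphrase probabilities" conditions, which Table 3 shows harm performance for multi-word paraphrases.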
<Section position="2" start_page="21" end_page="22" type="sub_section">
<SectionTitle> 5.2 Manual evaluation </SectionTitle>
<Paragraph position="0"> We performed a manual evaluation by judging the accuracy of phrases for 100 paraphrased translations from each of the sets, using the manual word alignments.1 Table 4 gives the percentage of the translations of paraphrases that were judged to have the same meaning as the equivalent target phrase. In the case of the translations of single word paraphrases for Spanish, accuracy ranged from just below 50% to just below 70%. This number is impressive in light of the fact that none of those items are correctly translated in the baseline model, which simply inserts the foreign language word. As with the Bleu scores, the translations of multi-word paraphrases were judged to be more accurate than the translations of single word paraphrases.</Paragraph>
<Paragraph position="1"> In performing the manual evaluation we were additionally able to determine how often Bleu was capable of measuring an actual improvement in translation. For those items judged to have the same meaning as the gold standard phrases, we could track how many would have contributed to a higher Bleu score (that is, which of them were exactly the same as the reference translation phrase, or had some words in common with the reference translation phrase). By counting how often a correct phrase would have contributed to an increased Bleu score, and how often it would fail to do so, we were able to determine how frequently Bleu was sensitive to our improvements. We found that Bleu was insensitive to our translation improvements 60-75% of the time, thus reinforcing our belief that it is not an appropriate measure for translation improvements of this sort.</Paragraph>
</Section>
<Section position="3" start_page="22" end_page="22" type="sub_section">
<SectionTitle> 5.3 Increase in coverage </SectionTitle>
<Paragraph position="0"> As illustrated in Figure 1, translation models suffer from sparse data. When only a very small parallel corpus is available for training, translations are learned for very few of the unique phrases in a test set. If we exclude 451 words' worth of names, numbers, and foreign language text in the 2,000 sentences that comprise the Spanish portion of the Europarl test set, then the numbers of unique n-grams in the text are: 7,331 unigrams, 28,890 bigrams, 44,194 trigrams, and 48,259 4-grams. Table 5 gives the percentage of these which have translations in each of the three training corpora when we do not use paraphrasing.</Paragraph>
<Paragraph position="1"> In contrast, after expanding the phrase table using the translations of paraphrases, the coverage of the unique test set phrases goes up dramatically (shown in Table 6). For the first training corpus, with 10,000 sentence pairs and roughly 200,000 words of text in each language, the coverage goes up from less than 50% of the vocabulary items being covered to 90%. The coverage of unique 4-grams jumps from 3% to 16%, a level reached only after observing more than 100,000 sentence pairs, or roughly three million words of text, without using paraphrases.</Paragraph>
</Section>
</Section>
</Paper>