<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1006"> <Title>Phrase-Based Backoff Models for Machine Translation of Highly Inflected Languages</Title> <Section position="8" start_page="44" end_page="46" type="evalu"> <SectionTitle> 7 Experiments and Results </SectionTitle> <Paragraph position="0"> We first investigated to what extent the OOV rate on the development data could be reduced by our backoff procedure. Table 3 shows the percentage of words that are still untranslatable after backoff. A comparison with Table 1 shows that the backoff model reduces the OOV rate, with a larger reduction observed when the training set is smaller. We next performed translation with backoff systems trained on each data partition. In each case, the combination weights for the individual model scores were re-optimized. [Table 3 caption fragment: ... and test sets under the backoff model (word types/tokens).]</Paragraph> <Paragraph position="1"> Table 4 shows the evaluation results on the dev set. Since the BLEU score alone is often not a good indicator of successful translations of unknown words (the unigram or bigram precision may be increased but may not have a strong effect on the overall BLEU score), the position-independent word error rate (PER) was measured as well. We see improvements in BLEU score and PER in almost all cases. Statistical significance was measured on PER using a difference-of-proportions significance test and on BLEU using a segment-level paired t-test. PER improvements are significant in almost all training conditions for both languages; BLEU improvements are significant in all conditions for Finnish and for the two smallest training sets for German. The effect on the overall development set (consisting of both sentences with known words only and sentences with unknown words) is shown in Table 5. As expected, the impact on overall performance is smaller, especially for larger training data sets, due to the relatively small percentage of OOV tokens (see Table 1). 
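The PER measurement and the difference-of-proportions test discussed above can be sketched as follows. This is an illustrative formulation, not the paper's implementation: the bag-of-words normalization for PER and the pooled-variance z statistic are common choices, not taken from the paper.

```python
import math
from collections import Counter

def per(hyp_sent, ref_sent):
    # One common formulation of position-independent word error rate:
    # compare hypothesis and reference as bags of words, ignoring order.
    hyp = Counter(hyp_sent.split())
    ref = Counter(ref_sent.split())
    matched = sum((hyp & ref).values())  # multiset intersection
    errors = max(sum(hyp.values()), sum(ref.values())) - matched
    return errors / sum(ref.values())

def two_proportion_z(errors1, n1, errors2, n2):
    # Difference-of-proportions z statistic for comparing two error
    # rates (errors over total reference words), pooled variance.
    p1, p2 = errors1 / n1, errors2 / n2
    p = (errors1 + errors2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se
```

Unlike WER, reordering a correct translation leaves the PER unchanged, which is why it better isolates the effect of translating unknown words.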
The evaluation results for the test set are shown in Tables 6 (for the subset of sentences with OOVs) and 7 (for the entire test set), with similar conclusions.</Paragraph> <Paragraph position="2"> Examples A and B in Figure 2 demonstrate higher-scoring translations produced by the backoff system as opposed to the baseline system. [Caption note: Here and in the following tables, statistically significant differences to the baseline model are shown in boldface (p < 0.05).]</Paragraph> <Paragraph position="3"> [Table 6 caption fragment: ... word error rate (PER) for the test set (subset with OOV words).]</Paragraph> <Paragraph position="4"> An analysis of the backoff system output showed that in some cases (e.g. examples C and D in Figure 2) the backoff model produced a good translation, but the translation was a paraphrase rather than an identical match to the reference translation. Since only a single reference translation is available for the Europarl data (preventing the computation of a BLEU score based on multiple hand-annotated references), good but non-matching translations are not taken into account by our evaluation method. In other cases the unknown word was translated correctly, but since it was translated as a single-word phrase, the segmentation of the entire sentence was affected. This may cause greater distortion effects, since the sentence is segmented into a larger number of smaller phrases, each of which can be reordered.</Paragraph> <Paragraph position="5"> We therefore added the possibility of translating an unknown word in its phrasal context by stemming up to m words to its left and right in the original sentence and finding translations for the entire stemmed phrase (i.e. the function stem() is now applied to the entire phrase). This step is inserted before the stemming of a single word f in the backoff model described above. However, since translations for entire stemmed phrases were found only in about 1% of all cases, there was no significant effect on the BLEU score. 
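The phrasal-context backoff step just described can be sketched as follows. This is a minimal illustration under stated assumptions: stem() here is a crude truncation stand-in for the paper's morphological stemmer, and the dictionary-based phrase-table format is assumed, not taken from the paper.

```python
def stem(word):
    # Hypothetical stand-in for the paper's stem() function:
    # crude prefix truncation, for illustration only.
    return word[:4]

def phrasal_backoff(words, oov_index, phrase_table, m=2):
    # Try to translate the OOV word in its phrasal context: stem up to
    # m words to its left and right and look up the whole stemmed
    # phrase, preferring wider contexts over narrower ones.  The
    # (left=0, right=0) case is the original single-word backoff.
    for left in range(m, -1, -1):
        for right in range(m, -1, -1):
            start = max(0, oov_index - left)
            end = min(len(words), oov_index + right + 1)
            key = tuple(stem(w) for w in words[start:end])
            if key in phrase_table:
                return phrase_table[key]
    return None  # no translation found, even for the stemmed word alone
```

Because a phrase-level match replaces several source words with one target phrase, it avoids the extra segmentation boundaries (and hence the extra reordering freedom) introduced by single-word OOV translations.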
Another possibility for limiting the reordering effects resulting from single-word translations of OOVs is to restrict the distortion limit of the decoder. [Table 7 caption fragment: ... word error rate (PER) for the test set (entire test set).]</Paragraph> <Paragraph position="6"> Our experiments showed that this improves the BLEU score slightly for both the baseline and the backoff system; the relative difference, however, remained the same.</Paragraph> </Section> </Paper>