<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1021">
  <Title>Minimum Error Rate Training in Statistical Machine Translation</Title>
  <Section position="7" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
7 Results
</SectionTitle>
    <Paragraph position="0"> We present results on the 2002 TIDES Chinese-English small data track task. The goal is the translation of news text from Chinese to English. Table 1 provides some statistics on the training, development and test corpus used. The system we use does not include rule-based components to translate numbers, dates or names. The basic feature functions were trained using the training corpus. The development corpus was used to optimize the parameters of the log-linear model. Translation results are reported on the test corpus.</Paragraph>
    <Paragraph position="1"> Table 2 shows the results obtained on the development corpus and Table 3 shows the results obtained  correspond to larger BLEU and NIST scores and to smaller error rates. Italic numbers refer to results for which the difference to the best result (indicated in bold) is not statistically significant. error criterion used in training mWER [%] mPER [%] BLEU [%] NIST # words  on the test corpus. Italic numbers refer to results for which the difference to the best result (indicated in bold) is not statistically significant. For all error rates, we show the maximal occurring 95% confidence interval in any of the experiments for that column. The confidence intervals are computed using bootstrap resampling (Press et al., 2002). The last column provides the number of words in the produced translations which can be compared with the average number of reference words occurring in the development and test corpora given in Table 1.</Paragraph>
    <Paragraph position="2"> We observe that if we choose a certain error criterion in training, we obtain in most cases the best results using the same criterion as the evaluation metric on the test data. The differences can be quite large: If we optimize with respect to word error rate, the results are mWER=68.3%, which is better than if we optimize with respect to BLEU or NIST and the difference is statistically significant. Between BLEU and NIST, the differences are more moderate, but by optimizing on NIST, we still obtain a large improvement when measured with NIST compared to optimizing on BLEU.</Paragraph>
    <Paragraph position="3"> The MMI criterion produces significantly worse results on all error rates besides mWER. Note that, due to the re-definition of the notion of reference translation by using minimum edit distance, the results of the MMI criterion are biased toward mWER.</Paragraph>
    <Paragraph position="4"> It can be expected that by using a suitably defined a46 gram precision to define the pseudo-references for MMI instead of using edit distance, it is possible to obtain better BLEU or NIST scores.</Paragraph>
    <Paragraph position="5"> An important part of the differences in the translation scores is due to the different translation length (last column in Table 3). The mWER and MMI criteria prefer shorter translations which are heavily penalized by the BLEU and NIST brevity penalty.</Paragraph>
    <Paragraph position="6"> We observe that the smoothed error count gives almost identical results to the unsmoothed error count. This might be due to the fact that the number of parameters trained is small and no serious overfitting occurs using the unsmoothed error count.</Paragraph>
  </Section>
class="xml-element"></Paper>