<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1022">
  <Title>Minimum Bayes-Risk Decoding for Statistical Machine Translation</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4 Performance of MBR Decoders
</SectionTitle>
    <Paragraph position="0"> We performed our experiments on the Large-Data Track of the NIST Chinese-to-English MT task (NIST, 2003).</Paragraph>
    <Paragraph position="1"> The goal of this task is the translation of news stories from Chinese to English. The test set has a total of 1791 sentences, consisting of 993 sentences from the NIST 2001 MT-eval set and 878 sentences from the NIST 2002 MT-eval set. Each Chinese sentence in this set has four reference translations.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Evaluation Metrics
</SectionTitle>
      <Paragraph position="0"> The performance of the baseline and the MBR decoders under the different loss functions was measured with respect to the four reference translations provided for the test set. Four evaluation metrics were used. These were multi-reference Word Error Rate (mWER) (Och, 2002), multi-reference Position-independent word Error Rate (mPER) (Och, 2002) , BLEU and multi-reference BiTree Error Rate.</Paragraph>
      <Paragraph position="1"> Among these evaluation metrics, the BLEU score directly takes into account multiple reference translations (Papineni et al., 2001). In case of the other metrics, we consider multiple references in the following way. For each sentence, we compute the error rate of the hypothesis translation with respect to the most similar reference translation under the corresponding loss function.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Decoder Performance
</SectionTitle>
      <Paragraph position="0"> In our experiments, a baseline translation model (JHU, 2003), trained on a Chinese-English parallel corpus (NIST, 2003) (a46a58a57a77a87a53a59 English words and a46a23a60a53a57a7a59 Chinese words), was used to generate 1000-best translation hypotheses for each Chinese sentence in the test set. The 1000-best lists were then rescored using the different translation loss functions described in Section 2.</Paragraph>
      <Paragraph position="1"> The English sentences in the a50 -best lists were parsed using the Collins parser (Collins, 1999), and the Chinese sentences were parsed using a Chinese parser provided to us by D. Bikel (Bikel and Chiang, 2000). The English parser was trained on the Penn Treebank and the Chinese parser on the Penn Chinese treebank.</Paragraph>
      <Paragraph position="2"> Under each loss function, the MBR decoding was performed using Equation 3. We say we have a matched condition when the same loss function is used in both the error rate and the decoder design. The performance of the MBR decoders on the NIST 2001+2002 test set is reported in Table 3. For all performance metrics, we show the 70% confidence interval with respect to the MAP baseline computed using bootstrap resampling (Press et al., 2002; Och, 2003). We note that this significance level does meet the customary criteria for minimum significance intervals of 68.3% (Press et al., 2002).</Paragraph>
      <Paragraph position="3"> We observe in most cases that the MBR decoder under a loss function performs the best under the corresponding error metric i.e. matched conditions perform the best. The gains from MBR decoding under matched conditions are statistically significant in most cases. We note that the MAP decoder is not optimal in any of the cases. In particular, the translation performance under the BLEU metric can be improved by using MBR relative to MAP decoding. This shows the value of finding decoding procedure matched to the performance criterion of interest.</Paragraph>
      <Paragraph position="4"> We also notice some affinity among the loss functions.</Paragraph>
      <Paragraph position="5"> The MBR decoding under the Bitree Loss function performs better under the WER relative to the MAP decoder, but perform poorly under the BLEU metric. The MBR decoder under WER and PER perform better than the MAP decoder under all error metrics. The MBR decoder under BLEU loss function obtains a similar (or worse) performance relative to MAP decoder on all metrics other than BLEU.</Paragraph>
    </Section>
  </Section>
</Paper>