<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1606">
<Title>SPMT: Statistical Machine Translation with Syntactified Target Language Phrases</Title>
<Section position="8" start_page="48" end_page="50" type="evalu">
<SectionTitle> 4 Experiments </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="48" end_page="49" type="sub_section">
<SectionTitle> 4.1 Automatic evaluation of the models </SectionTitle>
<Paragraph position="0"> We evaluate our models on a Chinese-to-English machine translation task. We use the same training corpus, 138.7M words of parallel Chinese-English data released by LDC, to train several statistical MT systems: PBMT, a strong, state-of-the-art phrase-based system that implements the alignment template model (Och and Ney, 2004); this is the system ISI used in the 2004 and 2005 NIST evaluations.</Paragraph>
<Paragraph position="1"> four SPMT systems (M1, M1C, M2, M2C) that implement each of the models discussed in this paper; and an SPMT system, Comb, that combines the outputs of all SPMT models using the procedure described in Section 3.2.</Paragraph>
<Paragraph position="2"> In all systems, we use a rule extraction algorithm that limits the size of the foreign/source phrases to four words. For all systems, we use a Kneser-Ney (1995) smoothed trigram language model trained on 2.3 billion words of English. As development data for the SPMT systems, we used the sentences in the 2002 NIST development corpus that are shorter than 20 words; we made this choice in order to finish all experiments in time for this submission. The PBMT system used all sentences in the 2002 NIST corpus for development. As test data, we used the 2003 NIST test set.</Paragraph>
<Paragraph position="3"> Table 1 shows the number of string-to-string or tree-to-string rules extracted by each system, together with performance both on the subset of test sentences shorter than 20 words and on the entire test corpus. Performance is measured with the Bleu metric (Papineni et al., 2002) on lowercased, tokenized outputs and references.</Paragraph>
<Paragraph position="4"> The results show that the SPMT models clearly outperform the phrase-based baseline: the 95% confidence intervals, computed via bootstrap resampling, are in all cases around 1 Bleu point. The results also show that the simple system combination procedure we employed is effective in our setting; the improvement on the development corpus carries over to the test set as well. A visual inspection of the outputs shows significant differences between the outputs of the four models. The models that use composed rules prefer to produce outputs using mostly lexicalized rules; in contrast, the simple M1 and M2 models produce outputs in which content is translated primarily by lexicalized rules, while reorderings and word insertions are explained primarily by the non-lexical rules. It appears that the two strategies are complementary, succeeding and failing in different instances. We believe that this complementarity, together with the correction of some decoder search errors during the model rescoring phase, explains the success of the system combination experiments.</Paragraph>
<Paragraph position="5"> We suspect that our decoder still makes many search errors. In spite of this, the SPMT outputs are still significantly better than the PBMT outputs.</Paragraph>
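To make the bootstrap-based confidence intervals mentioned above concrete, the following is a minimal sketch of how 95% intervals around corpus-level Bleu can be obtained by resampling test sentences with replacement. This is not the authors' evaluation code: the function name, the use of NLTK's corpus_bleu, the smoothing choice, and the 1,000-sample default are illustrative assumptions; hypotheses and references are assumed to be already lowercased and tokenized, matching the setup described above.

    # Hedged sketch (assumptions noted above): bootstrap confidence interval for corpus Bleu.
    import random
    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

    def bleu_confidence_interval(hyps, refs, n_samples=1000, alpha=0.05, seed=0):
        """hyps: list of tokenized hypotheses (each a list of words);
        refs: parallel list whose i-th entry is the list of tokenized
        reference translations for hyps[i]."""
        rng = random.Random(seed)
        smooth = SmoothingFunction().method1  # guard against zero n-gram counts on small samples
        n = len(hyps)
        scores = []
        for _ in range(n_samples):
            # Resample sentence indices with replacement and re-score the whole sample.
            idx = [rng.randrange(n) for _ in range(n)]
            sample_refs = [refs[i] for i in idx]
            sample_hyps = [hyps[i] for i in idx]
            scores.append(corpus_bleu(sample_refs, sample_hyps, smoothing_function=smooth))
        scores.sort()
        lower = scores[int((alpha / 2.0) * n_samples)]
        upper = scores[int((1.0 - alpha / 2.0) * n_samples) - 1]
        return lower, upper

Each resampled test set is re-scored in full, so the width of the resulting interval reflects sentence-level variability in the test data.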
</Section>
<Section position="2" start_page="49" end_page="50" type="sub_section">
<SectionTitle> 4.2 Human-based evaluation of the models </SectionTitle>
<Paragraph position="0"> We also tested whether the Bleu score improvements translate into improvements that can be perceived by humans. To this end, we randomly selected 138 sentences of fewer than 20 words from our development corpus; we expected the translation quality of sentences of this length to be easier to assess than that of very long sentences.</Paragraph>
<Paragraph position="1"> We prepared a web-based evaluation interface that showed, for each input sentence: the Chinese input; three English reference translations; and the output of seven "MT systems".</Paragraph>
<Paragraph position="2"> The evaluated "MT systems" were the six systems shown in Table 1 and one of the reference translations. The reference translation presented as automatically produced output was selected from the set of four reference translations provided by NIST so as to be representative of human translation quality. More precisely, we chose the second-best reference translation in the NIST corpus according to its Bleu score against the other three reference translations. The seven outputs were randomly shuffled and presented to three English speakers for assessment.</Paragraph>
<Paragraph position="3"> The judges who participated in our experiment were instructed to carefully read the three reference translations and the seven machine translation outputs, and to assign each translation output a score between 1 and 5 on the basis of its quality. The judges were told that the quality assessment should take into consideration both the grammatical fluency of the outputs and their translation adequacy. Table 2 shows the average scores obtained by each system according to each judge. For convenience, the table also shows the Bleu scores of all systems (including the human translations) measured against three reference translations.</Paragraph>
<Paragraph position="4"> The results in Table 2 show that the human judges are remarkably consistent in preferring the syntax-based outputs over the phrase-based outputs. On a 1-to-5 quality scale, the difference between the phrase-based and syntax-based systems was, on average, between 0.2 and 0.3 points. All differences between the phrase-based baseline and the syntax-based outputs were statistically significant. For example, when comparing the phrase-based baseline against the combined system, the improvement in human scores was significant at</Paragraph>
<Paragraph position="6"> The results also show that the LDC reference translations are far from perfect. Although we selected the second-best of the four references according to the Bleu metric, this human reference was judged to be at a quality level of only 4.67 on a scale from 1 to 5. Most of the translation errors were fluency errors: although the human outputs conveyed the correct meaning most of the time, their syntax was sometimes incorrect.</Paragraph>
<Paragraph position="7"> In order to give readers a flavor of the types of reorderings enabled by the SPMT models, we present in Table 3 several translation outputs produced by the phrase-based baseline and the combined SPMT system.</Paragraph>
</Section>
</Section>
</Paper>