<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1050">
  <Title>Towards a Unified Approach to Memory- and Statistical-Based Machine Translation</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> We extracted from the test corpus a collection of 505 French sentences, uniformly distributed across the lengths 6, 7, 8, 9, and 10. For each French sentence, we had access to the human-generated English translation in the test corpus, and to translations generated by two commercial systems. We produced translations using three versions of the greedy decoder: one used only the statistical translation model, one used the translation model and the FTMEM, and one used the translation model and the PTMEM.</Paragraph>
    <Paragraph position="1"> We initially assessed how often the translations obtained from TMEM seeds had higher probability than the translations obtained from simple glosses.</Paragraph>
    <Paragraph position="2"> [Tables 4 and 5, column headers only; the table bodies were not recovered: Sent. length | Found in | Higher prob. | Same result | Higher prob.]</Paragraph>
    <Paragraph position="3"> Tables 4 and 5 show that the translation memories significantly help the decoder find translations of high probability. In about 30% of the cases, the translations are simply copied from a TMEM, and in about 13% of the cases the translations obtained from a TMEM seed have higher probability than the best translations obtained from a simple gloss. In 40% of the cases both seeds (the TMEM and the gloss) yield the same translation. In only about 15-18% of the cases are the translations obtained from the gloss better than the translations obtained from the TMEM seeds. It appears that both TMEMs help the decoder find translations of higher probability consistently, across all sentence lengths.</Paragraph>
    <Paragraph position="4"> In a second experiment, a bilingual judge scored the human translations extracted from the automatically aligned test corpus; the translations produced by a greedy decoder that uses both TMEM and gloss seeds; the translations produced by a greedy decoder that uses only the statistical model and the gloss seed; and translations produced by two commercial systems (A and B).</Paragraph>
    <Paragraph position="5"> If an English translation had the very same meaning as the French original, it was considered semantically correct. If the meaning was just a little different, the translation was considered semantically incorrect. For example, &quot;this is rather provision disturbing&quot; was judged as a semantically correct translation of &quot;voilà une disposition plutôt inquiétante&quot;, but &quot;this disposal is rather disturbing&quot; was judged as incorrect.</Paragraph>
    <Paragraph position="6"> If a translation was perfect from a grammatical perspective, it was considered to be grammatical. Otherwise, it was considered grammatically incorrect. For example, &quot;this is rather provision disturbing&quot; was judged as ungrammatical, although one may very easily make sense of it.</Paragraph>
    <Paragraph position="7"> We decided to use such harsh evaluation criteria because, in previous experiments, we repeatedly found that harsh criteria can be applied consistently. To ensure consistency during evaluation, the judge used a specialized interface: once the correctness of a translation produced by a system S was judged, the same judgment was automatically recorded for every other system that produced the same translation. This way, it became impossible for a translation to be judged as correct when produced by one system and incorrect when produced by another system.</Paragraph>
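The consistency mechanism described above amounts to storing one verdict per translation and reusing it whenever another system produces the same output. The sketch below is a minimal illustration, not the interface the authors used: the class `JudgmentStore`, the `ask_judge` callback, the `"src"` identifier, and the encoding of a verdict as a (semantically correct, grammatical) pair of booleans are all assumptions.

```python
class JudgmentStore:
    """Keeps one verdict per (source, translation) pair, shared by all systems."""

    def __init__(self, ask_judge):
        # ask_judge is a hypothetical callback returning the human verdict,
        # here a (semantically_correct, grammatical) pair of booleans.
        self._ask_judge = ask_judge
        self._verdicts = {}

    def judge(self, source, translation):
        key = (source, translation)
        if key not in self._verdicts:
            # First time this exact translation is seen: consult the judge.
            self._verdicts[key] = self._ask_judge(source, translation)
        # Any system that produced the same string gets the same verdict.
        return self._verdicts[key]


calls = []


def toy_judge(source, translation):
    # Stand-in for the human judge; records how often it is consulted.
    calls.append(translation)
    return (True, False)


store = JudgmentStore(toy_judge)
first = store.judge("src", "this is rather provision disturbing")   # e.g. system A
second = store.judge("src", "this is rather provision disturbing")  # e.g. system B
print(first == second, len(calls))
```

Because the key is the exact output string, two systems can receive different verdicts only when their translations differ, which is the property the paragraph above relies on.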
    <Paragraph position="8"> Table 6, which summarizes the results, displays the percentage of perfect translations (both semantically and grammatically correct) produced by a variety of systems. Table 6 shows that translations produced using both TMEM and gloss seeds are much better than translations that do not use TMEMs.</Paragraph>
    <Paragraph position="9"> The translation systems that use both a TMEM and the statistical model significantly outperform the two commercial systems. The figures in Table 6 also reflect the harshness of our evaluation metric: only 82% of the human translations extracted from the test corpus were considered perfect translations. A few of the errors were genuine, and could be explained by failures of the sentence alignment program that was used to create the corpus (Melamed, 1999). Most of the errors were judged as semantic, reflecting directly the harshness of our evaluation metric.</Paragraph>
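The score reported in Table 6 counts a translation as perfect only when it passes both the semantic and the grammatical criterion. A minimal sketch of that computation follows; the function name `percent_perfect` and the boolean-pair encoding of a judgment are assumptions, and the sample judgments are invented rather than taken from the paper.

```python
def percent_perfect(judgments):
    """Percentage of translations judged both semantically and grammatically correct."""
    if not judgments:
        raise ValueError("no judgments to score")
    perfect = sum(
        1 for semantic_ok, grammatical_ok in judgments
        if semantic_ok and grammatical_ok
    )
    return 100.0 * perfect / len(judgments)


# Invented toy judgments: (semantically_correct, grammatical).
toy_judgments = [(True, True), (True, False), (False, True), (True, True)]
print(percent_perfect(toy_judgments))  # 50.0
```

Requiring both criteria at once is what makes the metric harsh: a translation that is easy to understand but ungrammatical contributes nothing to the score.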
  </Section>
</Paper>