<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3114">
  <Title>Out-of-domain test set</Title>
  <Section position="6" start_page="107" end_page="109" type="evalu">
    <SectionTitle>
4 Results and Analysis
</SectionTitle>
    <Paragraph position="0"> The results of the manual and automatic evaluation of the participating system translations is detailed in the figures at the end of this paper. The scores and confidence intervals are detailed first in the Figures 7-10 in table form (including ranks), and then in graphical form in Figures 11-16. In the graphs, system scores are indicated by a point, the confidence intervals by shaded areas around the point.</Paragraph>
    <Paragraph position="1"> In all figures, we present the per-sentence normalized judgements. The normalization on a per-judge basis gave very similar ranking, only slightly less consistent with the ranking from the pairwise comparisons. null The confidence intervals are computed by bootstrap resampling for BLEU, and by standard significance testing for the manual scores, as described earlier in the paper.</Paragraph>
    <Paragraph position="2"> Pairwise comparison is done using the sign test.</Paragraph>
    <Paragraph position="3"> Often, two systems can not be distinguished with a confidence of over 95%, so there are ranked the same. This actually happens quite frequently (more below), so that the rankings are broad estimates. For instance: if 10 systems participate, and one system does better than 3 others, worse then 2, and is not significant different from the remaining 4, its rank is in the interval 3-7.</Paragraph>
    <Paragraph position="4">  of-domain test sets, averaged over all systems</Paragraph>
    <Section position="1" start_page="107" end_page="107" type="sub_section">
      <SectionTitle>
4.1 Close results
</SectionTitle>
      <Paragraph position="0"> At first glance, we quickly recognize that many systems are scored very similar, both in terms of manual judgement and BLEU. There may be occasionally a system clearly at the top or at the bottom, but most systems are so close that it is hard to distinguish them.</Paragraph>
      <Paragraph position="1"> In Figure 4, we displayed the number of system comparisons, for which we concluded statistical significance. For the automatic scoring method BLEU, we can distinguish three quarters of the systems.</Paragraph>
      <Paragraph position="2"> While the Bootstrap method is slightly more sensitive, it is very much in line with the sign test on text blocks.</Paragraph>
      <Paragraph position="3"> For the manual scoring, we can distinguish only half of the systems, both in terms of fluency and adequacy. More judgements would have enabled us to make better distinctions, but it is not clear what the upper limit is. We can check, what the consequences of less manual annotation of results would have been: With half the number of manual judgements, we can distinguish about 40% of the systems, 10% less.</Paragraph>
    </Section>
    <Section position="2" start_page="107" end_page="108" type="sub_section">
      <SectionTitle>
4.2 In-domain vs. out-of-domain
</SectionTitle>
      <Paragraph position="0"> The test set included 2000 sentences from the Europarl corpus, but also 1064 sentences out-of-domain test data. Since the inclusion of out-of-domain test data was a very late decision, the participants were not informed of this. So, this was a surprise element due to practical reasons, not malice. null All systems (except for Systran, which was not tuned to Europarl) did considerably worse on out-of-domain training data. This is demonstrated by average scores over all systems, in terms of BLEU, fluency and adequacy, as displayed in Figure 5.</Paragraph>
      <Paragraph position="1"> The manual scores are averages over the raw un-normalized scores.</Paragraph>
    </Section>
    <Section position="3" start_page="108" end_page="108" type="sub_section">
      <SectionTitle>
4.3 Language pairs
</SectionTitle>
      <Paragraph position="0"> It is well know that language pairs such as English-German pose more challenges to machine translation systems than language pairs such as French-English. Different sentence structure and rich target language morphology are two reasons for this.</Paragraph>
      <Paragraph position="1"> Again, we can compute average scores for all systems for the different language pairs (Figure 6). The differences in difficulty are better reflected in the BLEU scores than in the raw un-normalized manual judgements. The easiest language pair according to BLEU (English-French: 28.33) received worse manual scores than the hardest (English-German: 14.01). This is because different judges focused on different language pairs. Hence, the different averages of manual scores for the different language pairs reflect the behaviour of the judges, not the quality of the systems on different language pairs.</Paragraph>
    </Section>
    <Section position="4" start_page="108" end_page="108" type="sub_section">
      <SectionTitle>
4.4 Manual judgement vs. BLEU
</SectionTitle>
      <Paragraph position="0"> Given the closeness of most systems and the wide over-lapping confidence intervals it is hard to make strong statements about the correlation between human judgements and automatic scoring methods such as BLEU.</Paragraph>
      <Paragraph position="1"> We confirm the finding by Callison-Burch et al. (2006) that the rule-based system of Systran is not adequately appreciated by BLEU. In-domain Systran scores on this metric are lower than all statistical systems, even the ones that have much worse human scores. Surprisingly, this effect is much less obvious for out-of-domain test data. For instance, for out-of-domain English-French, Systran has the best BLEU and manual scores.</Paragraph>
      <Paragraph position="2"> Our suspicion is that BLEU is very sensitive to jargon, to selecting exactly the right words, and not synonyms that human judges may appreciate as equally good. This is can not be the only explanation, since the discrepancy still holds, for instance, for out-of-domain French-English, where Systran receives among the best adequacy and fluency scores, but a worse BLEU score than all but one statistical system.</Paragraph>
      <Paragraph position="3"> This data set of manual judgements should provide a fruitful resource for research on better automatic scoring methods.</Paragraph>
    </Section>
    <Section position="5" start_page="108" end_page="108" type="sub_section">
      <SectionTitle>
4.5 Best systems
</SectionTitle>
      <Paragraph position="0"> So, who won the competition? The best answer to this is: many research labs have very competitive systems whose performance is hard to tell apart. This is not completely surprising, since all systems use very similar technology.</Paragraph>
      <Paragraph position="1"> For some language pairs (such as German-English) system performance is more divergent than for others (such as English-French), at least as measured by BLEU.</Paragraph>
      <Paragraph position="2"> The statistical systems seem to still lag behind the commercial rule-based competition when translating into morphological rich languages, as demonstrated by the results for English-German and English-French.</Paragraph>
      <Paragraph position="3"> The predominate focus of building systems that translate into English has ignored so far the difficult issues of generating rich morphology which may not be determined solely by local context.</Paragraph>
    </Section>
    <Section position="6" start_page="108" end_page="109" type="sub_section">
      <SectionTitle>
4.6 Comments on Manual Evaluation
</SectionTitle>
      <Paragraph position="0"> This is the first time that we organized a large-scale manual evaluation. While we used the standard metrics of the community, the we way presented translations and prompted for assessment differed from other evaluation campaigns. For instance, in the recent IWSLT evaluation, first fluency annotations were solicited (while withholding the source sentence), and then adequacy annotations.</Paragraph>
      <Paragraph position="1"> Almost all annotators reported difficulties in maintaining a consistent standard for fluency and adequacy judgements, but nevertheless most did not explicitly move towards a ranking-based evaluation.</Paragraph>
      <Paragraph position="2"> Almost all annotators expressed their preference to move to a ranking-based evaluation in the future. A few pointed out that adequacy should be broken up  into two criteria: (a) are all source words covered? (b) does the translation have the same meaning, including connotations? Annotators suggested that long sentences are almost impossible to judge. Since all long sentence translation are somewhat muddled, even a contrastive evaluation between systems was difficult. A few annotators suggested to break up long sentences into clauses and evaluate these separately.</Paragraph>
      <Paragraph position="3"> Not every annotator was fluent in both the source and the target language. While it is essential to be fluent in the target language, it is not strictly necessary to know the source language, if a reference translation was given. However, ince we extracted the test corpus automatically from web sources, the reference translation was not always accurate -- due to sentence alignment errors, or because translators did not adhere to a strict sentence-by-sentence translation (say, using pronouns when referring to entities mentioned in the previous sentence). Lack of correct reference translations was pointed out as a short-coming of our evaluation. One annotator suggested that this was the case for as much as 10% of our test sentences. Annotators argued for the importance of having correct and even multiple references. It was also proposed to allow annotators to skip sentences that they are unable to judge.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>