<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1072"> <Title>ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation</Title> <Section position="5" start_page="2" end_page="3" type="evalu"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> Comparing automatic evaluation metrics using the ORANGE evaluation method is straightforward.</Paragraph> <Paragraph position="1"> To simulate a real-world scenario, we use n-best lists from ISI's state-of-the-art statistical machine translation system, AlTemp (Och 2003), and the 2002 NIST Chinese-English evaluation corpus as the test corpus. There are 878 source sentences in Chinese and 4 sets of reference translations provided by LDC (the Linguistic Data Consortium prepared these manual translations as part of DARPA's TIDES project). For this exploratory study, we generate a 1024-best list using AlTemp for 872 source sentences; AlTemp generates fewer than 1024 alternative translations for the remaining 6 of the 878 source sentences, and these 6 sentences are excluded from the 1024-best set. In order to compute BLEU at the sentence level, we apply the following smoothing technique: add one count to the n-gram hit count and the total n-gram count for n > 1. Candidate translations with fewer than n words can therefore still get a positive smoothed BLEU score from shorter n-gram matches; however, if nothing matches at all, they still get a zero score.</Paragraph> <Paragraph position="2"> We call the smoothed BLEU BLEUS. For each candidate translation in the 1024-best list and each reference, we compute the following scores:
1. BLEUS1 to BLEUS9
2. NIST, PER, and WER
3. ROUGE-L
4. ROUGE-W with weight ranging from 1.1 to 2.0 in increments of 0.1
5. ROUGE-S with maximum skip distance ranging from 0 to 9 (ROUGE-S0 to S9) and without any skip distance limit (ROUGE-S*)
We compute the average score of the references and then rank the candidate translations and the references according to these automatic scores. The ORANGE score for each metric is the average rank of the average reference (oracle) score over the whole corpus (872 sentences), divided by the length of the n-best list plus 1. Assuming the length of the n-best list is N and the size of the corpus is S (in number of sentences), we compute ORANGE as follows:
\[ \mathrm{ORANGE} = \frac{\sum_{i=1}^{S} \mathrm{Rank}(\mathrm{Oracle}_i)}{S\,(N+1)} \]
where Rank(Oracle_i) is the average rank of source sentence i's reference translations in n-best list i. Table 2 shows the results for BLEUS1 to BLEUS9. To assess the reliability of the results, 95% confidence intervals (95%-CI-L for the lower bound and CI-U for the upper bound) of the average rank of the oracles are estimated using bootstrap resampling (Davison and Hinkley).</Paragraph> <Paragraph position="3"> According to Table 2, BLEUS6 (dark/green cell) is the best performer among all BLEUS variants, but it is statistically equivalent to BLEUS3, 4, 5, 7, 8, and 9 at the 95% confidence level. Table 3 shows Pearson's correlation coefficients of BLEUS1 to BLEUS9 with human adequacy and fluency judgments over the 8 participants in the 2003 NIST Chinese-English machine translation evaluation. According to Table 3, BLEUS with shorter n-grams correlates better with adequacy, whereas correlation with fluency increases as longer n-grams are considered but decreases after BLEUS5. There is no consensus winner that achieves the best correlation with both adequacy and fluency at the same time. So which version of BLEUS should we use? A reasonable answer is to choose BLEUS1 if we would like to optimize for adequacy, and BLEUS4 or BLEUS5 if we would like to optimize for fluency.</Paragraph>
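<Paragraph> As a concrete illustration of the smoothing described above, the following Python sketch computes sentence-level BLEUS against a single reference. It is an illustrative reimplementation, not the scoring code used in our experiments; the function names ngram_counts and bleus are ours, and the brevity penalty follows standard BLEU. BLEUSn in the text corresponds to max_n = n.
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleus(candidate, reference, max_n=4):
    """Sentence-level smoothed BLEU (BLEUS): add one to the n-gram hit count
    and the total n-gram count for every n > 1, so that short candidates can
    still receive credit from lower-order matches."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngram_counts(cand, n)
        ref_ngrams = ngram_counts(ref, n)
        hits = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
        total = sum(cand_ngrams.values())
        if n > 1:                       # add-one smoothing for higher-order n-grams
            hits, total = hits + 1, total + 1
        if hits == 0:                   # nothing matches at the unigram level: zero score
            return 0.0
        log_precisions.append(math.log(hits / total))
    # standard brevity penalty; cand is non-empty here because some unigram matched
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)
For example, bleus("the cat sat", "the cat sat on the mat") returns roughly 0.37, while a candidate sharing no words with the reference scores 0.</Paragraph>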
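<Paragraph> The ORANGE computation defined above is equally compact. The sketch below is illustrative only: it assumes that metric scores for each sentence's N candidates (nbest_scores[i]) and for its reference translations (reference_scores[i]) have already been computed, and the function name orange_score is ours. It follows the definition of Rank(Oracle_i) given above, i.e. the average rank of the references when ranked among the candidates.
def orange_score(nbest_scores, reference_scores):
    """ORANGE = sum_i Rank(Oracle_i) / (S * (N + 1)), where Rank(Oracle_i) is
    the average rank of sentence i's reference translations when they are
    ranked together with its N candidate translations by metric score."""
    S = len(nbest_scores)               # corpus size in sentences
    N = len(nbest_scores[0])            # length of each n-best list
    total = 0.0
    for cand_scores, ref_scores in zip(nbest_scores, reference_scores):
        # rank of each reference among the candidates (rank 1 = top of the list)
        ranks = [1 + sum(1 for s in cand_scores if s > r) for r in ref_scores]
        total += sum(ranks) / len(ranks)      # Rank(Oracle_i)
    return total / (S * (N + 1))              # lower ORANGE is better
Ties are broken in favor of the references here; tie-averaging would be an equally reasonable choice.</Paragraph>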
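<Paragraph> The 95% confidence intervals reported with Table 2 come from bootstrap resampling over the 872 per-sentence oracle ranks. A minimal sketch of that resampling step is given below; the function name bootstrap_ci and the choice of 1000 resamples are illustrative assumptions, and Davison and Hinkley describe the method itself.
import random

def bootstrap_ci(per_sentence_ranks, n_boot=1000, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean oracle rank:
    resample sentences with replacement, recompute the mean each time, and
    take the alpha/2 and 1 - alpha/2 percentiles of the resampled means."""
    means = []
    for _ in range(n_boot):
        sample = [random.choice(per_sentence_ranks) for _ in per_sentence_ranks]
        means.append(sum(sample) / len(sample))
    means.sort()
    lower = means[int(n_boot * alpha / 2)]
    upper = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper
</Paragraph>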
<Paragraph position="4"> According to Table 2, BLEUS6 on average places the reference translations at rank 235 in a 1024-best list of machine translations, which is significantly better than BLEUS1 and BLEUS2. We therefore have a better chance of finding more human-like translations at the top of an n-best list by choosing BLEUS6 instead of BLEUS2. To design automatic metrics better than BLEUS6, we can carry out error analysis over the machine translations that are ranked higher than their references; based on the results of this error analysis, promising modifications can be identified. This indicates that the ORANGE evaluation method provides a natural development cycle for automatic evaluation metrics.</Paragraph> <Paragraph position="5"> Table 4 shows the ORANGE scores for ROUGE-L and ROUGE-W-1.1 to 2.0. ROUGE-W-1.1 does have a better ORANGE score, but it is statistically equivalent to the other ROUGE-W variants and to ROUGE-L. Table 5 lists the performance of the different ROUGE-S variants.</Paragraph> <Paragraph position="6"> ROUGE-S4 is the best performer, but it is only significantly better than ROUGE-S0 (bigram), ROUGE-S1, ROUGE-S9, and ROUGE-S*. The relatively worse performance of ROUGE-S* might be due to spurious matches such as "the the" or "the of".</Paragraph> <Paragraph position="7"> Table 6 summarizes the performance of 7 different metrics. ROUGE-S4 (dark/green cell) is the best, with an ORANGE score of 19.66% that is statistically equivalent to ROUGE-L and ROUGE-W-1.1 (gray cells) and significantly better than BLEUS6, NIST, PER, and WER. Among them, PER is the worst.</Paragraph> <Paragraph position="8"> To examine the effect of n-best list length on the relative performance of the automatic metrics, we use the AlTemp SMT system to generate a 16384-best list and compute ORANGE scores for BLEUS4, PER, WER, ROUGE-L, ROUGE-W-1.2, and ROUGE-S4. Only the 474 source sentences that have more than 16384 alternative translations are used in this experiment. Table 7 shows the results. It confirms that when we extend the n-best list to 16 times the length of the 1024-best list, the relative performance of each automatic evaluation metric group stays the same, and ROUGE-S4 is still the best performer. Figure 1 shows the trend of ORANGE scores for these metrics over n-best lists with N ranging from 1 to 16384 in increments of 64. It is clear that the relative performance of these metrics stays the same over the entire range.</Paragraph>
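<Paragraph> The length study above amounts to recomputing ORANGE on successively longer prefixes of each ranked n-best list. A minimal sketch of that loop, reusing the illustrative orange_score helper from the earlier sketch; the function name orange_by_prefix_length, the step of 64, and the cap of 16384 simply mirror the setup described for Figure 1.
def orange_by_prefix_length(nbest_scores, reference_scores, step=64, max_n=16384):
    """Recompute ORANGE on the top-k prefix of every n-best list for
    k = 1, step, 2 * step, ..., max_n, mirroring the trend plotted in Figure 1."""
    lengths = [1] + list(range(step, max_n + 1, step))
    return {k: orange_score([scores[:k] for scores in nbest_scores], reference_scores)
            for k in lengths}
</Paragraph> </Section> </Paper>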