<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1072"> <Title>ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation</Title> <Section position="4" start_page="1" end_page="2" type="metho"> <SectionTitle> 3 Three New Metrics </SectionTitle> <Paragraph position="0"> ROUGE-L and ROUGE-S are described in detail in Lin and Och (2004). Since these two metrics are relatively new, we provide short summaries of them in Section 3.1 and Section 3.3 respectively.</Paragraph> <Paragraph position="1"> ROUGE-W, an extension of ROUGE-L, is new and is explained in detail in Section 3.2.</Paragraph> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> 3.1 ROUGE-L: Longest Common Subsequence </SectionTitle> <Paragraph position="0"> Given two sequences X and Y, the longest common subsequence (LCS) of X and Y is a common subsequence with maximum length (Cormen et al. 1989). To apply LCS in machine translation evaluation, we view a translation as a sequence of words. The intuition is that the longer the LCS of two translations is, the more similar the two translations are. We propose using an LCS-based F-measure to estimate the similarity between two translations X of length m and Y of length n, assuming X is a reference translation and Y is a candidate translation, as follows:</Paragraph> <Paragraph position="2"> R_lcs = LCS(X,Y)/m (1), P_lcs = LCS(X,Y)/n (2), F_lcs = (1 + b^2) R_lcs P_lcs / (R_lcs + b^2 P_lcs) (3), where LCS(X,Y) is the length of a longest common subsequence of X and Y, and b controls the relative weight of recall and precision. We call the LCS-based F-measure, i.e. Equation 3, ROUGE-L. Notice that ROUGE-L is 1 when X = Y since LCS(X,Y) = m or n, while ROUGE-L is zero when LCS(X,Y) = 0, i.e. there is nothing in common between X and Y.</Paragraph> <Paragraph position="3"> One advantage of using LCS is that it does not require consecutive matches, only in-sequence matches that reflect sentence-level word order, as n-grams do. The other advantage is that it automatically includes the longest in-sequence common n-grams; therefore no predefined n-gram length is necessary. 
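As an illustration (not part of the original paper), the LCS-based F-measure can be sketched in Python; the function names lcs_len and rouge_l are our own, and b is the recall/precision weight used in Equation 3:

```python
def lcs_len(x, y):
    """Length of a longest common subsequence of word lists x and y (standard DP)."""
    m, n = len(x), len(y)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]

def rouge_l(reference, candidate, b=1.0):
    """LCS-based F-measure (Equation 3); b weights recall against precision."""
    x, y = reference.split(), candidate.split()
    lcs = lcs_len(x, y)
    if lcs == 0:
        return 0.0
    r_lcs = lcs / len(x)  # recall over reference length m
    p_lcs = lcs / len(y)  # precision over candidate length n
    return (1 + b**2) * r_lcs * p_lcs / (r_lcs + b**2 * p_lcs)
```

With S1 "police killed the gunman" as the reference, this sketch gives 0.75 for S2 "police kill the gunman" and 0.5 for S3 "the gunman kill police", matching the sentence-level example in this section.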
By only awarding credit to in-sequence unigram matches, ROUGE-L also captures sentence-level structure in a natural way. Consider the following example: S1. police killed the gunman S2. police kill the gunman S3. the gunman kill police Using S1 as the reference translation, S2 has a ROUGE-L score of 3/4 = 0.75 and S3 has a ROUGE-L score of 2/4 = 0.5, with b = 1. Therefore S2 is better than S3 according to ROUGE-L. This example illustrates that ROUGE-L can work reliably at the sentence level. However, LCS suffers one disadvantage: it only counts the main in-sequence words; therefore, other alternative LCSes and shorter sequences are not reflected in the final score. In the next section, we introduce ROUGE-W.</Paragraph> </Section> <Section position="2" start_page="1" end_page="2" type="sub_section"> <SectionTitle> 3.2 ROUGE-W: Weighted Longest Common Subsequence </SectionTitle> <Paragraph position="0"> LCS has many nice properties, as we have described in the previous sections. Unfortunately, the basic LCS does not differentiate LCSes of different spatial relations within their embedding sequences. For example, given a reference sequence X = [A B C D E F G] and two candidate sequences Y1 = [A B C D H I K] and Y2 = [A H B K C I D], Y1 and Y2 have the same ROUGE-L score; however, Y1 should be the better choice because it has consecutive matches.</Paragraph> <Paragraph position="1"> To improve the basic LCS method, we can simply add to the regular two-dimensional dynamic programming table for computing LCS a record of the length of the consecutive matches encountered so far. We call this weighted LCS (WLCS) and use k to indicate the length of the current run of consecutive matches ending at words x_i and y_j.</Paragraph> <Paragraph position="3"> Given two sentences X and Y, the recurrent relations can be written as follows: if x_i = y_j, then k = w(i-1,j-1), c(i,j) = c(i-1,j-1) + f(k+1) - f(k), and w(i,j) = k + 1; otherwise, c(i,j) = max(c(i-1,j), c(i,j-1)) and w(i,j) = 0.</Paragraph> <Paragraph position="5"> Here c is the dynamic programming table, c(i,j) stores the WLCS score ending at word x_i of X and y_j of Y, w(i,j) stores the length of consecutive matches ended at table position i and j, and f is a function of consecutive matches at the table position, c(i,j). 
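A minimal sketch of the WLCS recurrence, together with the ROUGE-W normalization via the inverse of the weighting function; f(k) = k^2 and the letter sequences in the test below are illustrative assumptions, not prescribed by the paper:

```python
def wlcs(x, y, f):
    """Weighted LCS score of word lists x, y under weighting function f.

    c[i][j] holds the WLCS score ending at x[:i], y[:j];
    w[i][j] holds the length of the consecutive match ending there.
    """
    m, n = len(x), len(y)
    c = [[0.0] * (n + 1) for _ in range(m + 1)]
    w = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                k = w[i - 1][j - 1]                      # length of the current run
                c[i][j] = c[i - 1][j - 1] + f(k + 1) - f(k)
                w[i][j] = k + 1
            elif c[i - 1][j] > c[i][j - 1]:
                c[i][j] = c[i - 1][j]                    # w[i][j] stays 0
            else:
                c[i][j] = c[i][j - 1]
    return c[m][n]

def rouge_w(x, y, a=2.0, b=1.0):
    """WLCS-based F-measure with f(k) = k**a, normalized by f's close-form inverse."""
    f = lambda k: k ** a
    f_inv = lambda s: s ** (1.0 / a)
    score = wlcs(x, y, f)
    if score == 0:
        return 0.0
    r = f_inv(score / f(len(x)))   # recall, normalized so a perfect match gives 1
    p = f_inv(score / f(len(y)))   # precision
    return (1 + b**2) * r * p / (r + b**2 * p)
```

With a = 2, consecutive runs earn quadratic credit, so a candidate that matches the reference in one unbroken block outscores one that matches the same words scattered apart.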
Notice that by providing a different weighting function f, we can parameterize the WLCS algorithm to assign different credit to consecutive in-sequence matches.</Paragraph> <Paragraph position="6"> The weighting function f must have the property that f(x+y) > f(x) + f(y) for any positive integers x and y. In other words, consecutive matches are awarded more credit than non-consecutive matches. For example, f(k) = ak - b when k >= 0, and a, b > 0. This function charges a gap penalty of -b for each non-consecutive n-gram sequence.</Paragraph> <Paragraph position="7"> Another possible function family is the polynomial family of the form k^a where a > 1. However, in order to normalize the final ROUGE-W score, we also prefer a function that has a close-form inverse function; for example, f(k) = k^2 has the close-form inverse f^(-1)(k) = k^(1/2). The ROUGE-W score is then the WLCS-based F-measure with R_wlcs = f^(-1)(WLCS(X,Y)/f(m)) and P_wlcs = f^(-1)(WLCS(X,Y)/f(n)). With f(k) = k^2 as the weighting function, the ROUGE-W scores for sequences Y1 and Y2 are 0.571 and 0.286 respectively. Therefore, Y1 would be ranked higher than Y2 using WLCS. We use the polynomial function of the form k^a in the experiments described in Section 4 with the weighting factor a varying from 1.1 to 2.0 in increments of 0.1. ROUGE-W is the same as ROUGE-L when a is set to 1.</Paragraph> <Paragraph position="8"> In the next section, we introduce the skip-bigram co-occurrence statistics.</Paragraph> </Section> <Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.3 ROUGE-S: Skip-Bigram Co-Occurrence Statistics </SectionTitle> <Paragraph position="0"> A skip-bigram is any pair of words in their sentence order, allowing for arbitrary gaps. Skip-bigram co-occurrence statistics measure the overlap of skip-bigrams between a candidate translation and a set of reference translations. Using the example given in Section 3.1: S1. police killed the gunman S2. police kill the gunman S3. the gunman kill police S4. the gunman police killed each sentence has C(4,2) = 6 skip-bigrams. 
For example, S1 has the following skip-bigrams: (&quot;police killed&quot;, &quot;police the&quot;, &quot;police gunman&quot;, &quot;killed the&quot;, &quot;killed gunman&quot;, &quot;the gunman&quot;) Given translations X of length m and Y of length n, assuming X is a reference translation and Y is a candidate translation, we compute the skip-bigram-based F-measure as follows: R_skip2 = SKIP2(X,Y)/C(m,2) (7), P_skip2 = SKIP2(X,Y)/C(n,2) (8), F_skip2 = (1 + b^2) R_skip2 P_skip2 / (R_skip2 + b^2 P_skip2) (9), where SKIP2(X,Y) is the number of skip-bigram matches between X and Y and C is the combination function. We call the skip-bigram-based F-measure, i.e. Equation 9, ROUGE-S. Using Equation 9 with b = 1 and S1 as the reference, S2's ROUGE-S score is 0.5, S3's is 0.167, and S4's is 0.333. Therefore, S2 is better than S3 and S4, and S4 is better than S3.</Paragraph> <Paragraph position="1"> One advantage of skip-bigram over BLEU is that it does not require consecutive matches but is still sensitive to word order. Comparing skip-bigram with LCS, skip-bigram counts all in-order matching word pairs while LCS counts only one longest common subsequence. We can limit the maximum skip distance between two in-order words to control the admission of a skip-bigram. We use skip distances of 1 to 9 in increments of 1 (ROUGE-S1 to ROUGE-S9) and without any skip distance constraint (ROUGE-S*).</Paragraph> <Paragraph position="2"> In the next section, we present the evaluations of BLEU, NIST, PER, WER, ROUGE-L, ROUGE-W, and ROUGE-S using the ORANGE evaluation method described in Section 2.</Paragraph> <Paragraph position="3"> Combinations: C(4,2) = 4!/(2!*2!) = 6.</Paragraph> </Section> </Section> </Paper>
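The skip-bigram extraction and the F-measure of Equation 9 can be sketched as follows; this is an illustrative single-reference implementation, and the assumption here is that the optional d_skip parameter counts the number of words allowed between the pair, so d_skip = 0 reduces to ordinary bigrams:

```python
from collections import Counter
from itertools import combinations

def skip_bigrams(words, d_skip=None):
    """All in-order word pairs of a word list; d_skip caps the gap if given."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(words), 2):
        if d_skip is None or j - i - 1 <= d_skip:
            pairs.append((a, b))
    return pairs

def rouge_s(reference, candidate, b=1.0, d_skip=None):
    """Skip-bigram-based F-measure (Equation 9) against a single reference."""
    ref = Counter(skip_bigrams(reference.split(), d_skip))
    cand = Counter(skip_bigrams(candidate.split(), d_skip))
    skip2 = sum((ref & cand).values())      # SKIP2(X,Y): matching skip-bigrams
    if skip2 == 0:
        return 0.0
    r_skip2 = skip2 / sum(ref.values())     # recall,    Equation 7
    p_skip2 = skip2 / sum(cand.values())    # precision, Equation 8
    return (1 + b**2) * r_skip2 * p_skip2 / (r_skip2 + b**2 * p_skip2)
```

With S1 as the reference and b = 1, this sketch reproduces the worked scores in Section 3.3: S2 scores 0.5 (3 of 6 skip-bigrams match), S4 scores 0.333, and S3 scores 0.167.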