<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1077"> <Title>Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics</Title> <Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> 3 Longest Common Subsequence </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.1 ROUGE-L </SectionTitle> <Paragraph position="0"> A sequence Z = [z_1, z_2, ..., z_k] is a subsequence of another sequence X = [x_1, x_2, ..., x_m] if there exists a strictly increasing sequence [i_1, i_2, ..., i_k] of indices of X such that x_{i_j} = z_j for all j = 1, 2, ..., k (Cormen et al. 1989). Given two sequences X and Y, the longest common subsequence (LCS) of X and Y is a common subsequence with maximum length. We can find the LCS of two sequences of length m and n using the standard dynamic programming technique in O(mn) time.</Paragraph> <Paragraph position="5"> LCS has been used to identify cognate candidates during construction of N-best translation lexicons from parallel text. Melamed (1995) used the ratio (LCSR) between the length of the LCS of two words and the length of the longer of the two words to measure the cognateness between them, using LCS as an approximate string matching algorithm. Saggion et al. (2002) used normalized pairwise LCS (NP-LCS) to compare similarity between two texts in automatic summarization evaluation. NP-LCS can be shown to be a special case of Equation (6) with b = 1. However, they did not provide a correlation analysis of NP-LCS with human judgments or assess its effectiveness as an automatic evaluation measure.</Paragraph> <Paragraph position="8"> To apply LCS in machine translation evaluation, we view a translation as a sequence of words. The intuition is that the longer the LCS of two translations is, the more similar the two translations are. We propose using an LCS-based F-measure to estimate the similarity between two translations X of length m and Y of length n, assuming X is a reference translation and Y is a candidate translation, as follows: R_lcs = LCS(X,Y)/m, P_lcs = LCS(X,Y)/n, and F_lcs = ((1 + b^2) R_lcs P_lcs) / (R_lcs + b^2 P_lcs) (6), where LCS(X,Y) is the length of the longest common subsequence of X and Y and b controls the relative importance of recall and precision. We call the LCS-based F-measure, i.e. Equation 6, ROUGE-L. Notice that ROUGE-L is 1 when X = Y, since LCS(X,Y) = m = n, while ROUGE-L is zero when LCS(X,Y) = 0, i.e. there is nothing in common between X and Y. The F-measure and its equivalents have been shown to meet several theoretical criteria for measuring accuracy involving more than one factor (Van Rijsbergen 1979). The composite factors are LCS-based recall and precision in this case. Melamed et al. (2003) used unigram F-measure to estimate machine translation quality and showed that unigram F-measure was as good as BLEU.</Paragraph> <Paragraph position="12"> One advantage of using LCS is that it does not require consecutive matches, only in-sequence matches that reflect sentence-level word order as n-grams. The other advantage is that it automatically includes the longest in-sequence common n-grams, so no predefined n-gram length is necessary.</Paragraph> <Paragraph position="13"> ROUGE-L as defined in Equation 6 has the property that its value is less than or equal to the minimum of the unigram F-measures of X and Y. 
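As an illustration of Equation 6 (this is a minimal sketch, not the released ROUGE package; sentence strings and whitespace tokenization are simplifying assumptions), the following Python code computes the LCS length with the standard O(mn) dynamic program and derives the LCS-based F-measure from it:

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of token lists x and y (O(mn) DP)."""
    m, n = len(x), len(y)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]


def rouge_l(reference, candidate, beta=1.0):
    """LCS-based F-measure (Equation 6) for a single reference/candidate pair."""
    x, y = reference.split(), candidate.split()
    lcs = lcs_length(x, y)
    if lcs == 0:
        return 0.0
    r_lcs = lcs / len(x)          # LCS-based recall
    p_lcs = lcs / len(y)          # LCS-based precision
    return (1 + beta ** 2) * r_lcs * p_lcs / (r_lcs + beta ** 2 * p_lcs)


# Sentence-level example from Sections 2 and 3.1:
s1 = "police killed the gunman"                 # reference
print(rouge_l(s1, "police kill the gunman"))    # 0.75
print(rouge_l(s1, "the gunman kill police"))    # 0.5
```

Running the sketch on the example sentences reproduces the scores discussed below (0.75 for S2 and 0.5 for S3 with b = 1).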
Unigram recall reflects the proportion of words in X (the reference translation) that are also present in Y (the candidate translation), while unigram precision is the proportion of words in Y that are also in X. Unigram recall and precision count all co-occurring words regardless of their order, while ROUGE-L counts only in-sequence co-occurrences.</Paragraph> <Paragraph position="14"> By awarding credit only to in-sequence unigram matches, ROUGE-L also captures sentence-level structure in a natural way. Consider again the example given in Section 2, copied here for convenience (S1 is a real machine translation output; note that &quot;kill&quot; in S2 or S3 does not match &quot;killed&quot; in S1 under strict word-to-word comparison):
S1. police killed the gunman
S2. police kill the gunman
S3. the gunman kill police
As we have shown earlier, BLEU-2 cannot differentiate S2 from S3. However, S2 has a ROUGE-L score of 3/4 = 0.75 and S3 has a ROUGE-L score of 2/4 = 0.5, with b = 1. Therefore S2 is better than S3 according to ROUGE-L. This example also illustrates that ROUGE-L can work reliably at the sentence level.</Paragraph> <Paragraph position="15"> However, LCS only counts the main in-sequence words; therefore, other longest common subsequences and shorter sequences are not reflected in the final score. For example, consider the following candidate sentence:
S4. the gunman police killed
Using S1 as its reference, LCS counts either &quot;the gunman&quot; or &quot;police killed&quot;, but not both; therefore, S4 has the same ROUGE-L score as S3, whereas BLEU-2 would prefer S4 over S3. In Section 4, we will introduce skip-bigram co-occurrence statistics that do not have this problem while still keeping the advantage of in-sequence (not necessarily consecutive) matching that reflects sentence-level word order.</Paragraph> </Section> <Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.2 Multiple References </SectionTitle> <Paragraph position="0"> So far, we have only demonstrated how to compute ROUGE-L using a single reference. When multiple references are used, we take the maximum LCS matches between a candidate translation, c, of n words and a set of u reference translations of m_j words. The LCS-based F-measure can be computed as follows: R_lcs-multi = max_{j=1..u} LCS(r_j, c)/m_j, P_lcs-multi = max_{j=1..u} LCS(r_j, c)/n, and F_lcs-multi = ((1 + b^2) R_lcs-multi P_lcs-multi) / (R_lcs-multi + b^2 P_lcs-multi), where r_j is the j-th reference translation.</Paragraph> <Paragraph position="1"> This procedure is also applied to the computation of ROUGE-S when multiple references are used. In the next section, we describe how to extend ROUGE-L to assign more credit to longest common subsequences with consecutive words.</Paragraph> </Section> <Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.3 ROUGE-W: Weighted Longest Common Subsequence </SectionTitle> <Paragraph position="0"> LCS has many nice properties, as we have described in the previous sections. Unfortunately, the basic LCS also has a problem: it does not differentiate LCSes of different spatial relations within their embedding sequences. For example, given a reference sequence X and two candidate sequences Y1 and Y2 with the same LCS length, where Y1's common words occur consecutively but Y2's are spread across the sentence, basic LCS assigns them the same score even though Y1 should be the better choice because it has consecutive matches. To improve the basic LCS method, we can simply remember the length of consecutive matches encountered so far in a regular two-dimensional dynamic programming table computing LCS. We call this weighted LCS (WLCS) and use k to indicate the length of the current consecutive matches ending at words x_i of X and y_j of Y. 
Given two sentences X and Y, the WLCS score of X and Y can be computed using the following dynamic programming procedure: initialize c(i,0) = c(0,j) = 0 and w(i,0) = w(0,j) = 0 for all i and j; then, for i = 1..m and j = 1..n, if x_i = y_j, set k = w(i-1,j-1), c(i,j) = c(i-1,j-1) + f(k+1) - f(k), and w(i,j) = k + 1; otherwise set c(i,j) = max(c(i-1,j), c(i,j-1)) and w(i,j) = 0; finally, WLCS(X,Y) = c(m,n).</Paragraph> <Paragraph position="4"> Here c is the dynamic programming table, c(i,j) stores the WLCS score ending at word x_i of X and word y_j of Y, w is the table storing the length of the consecutive matches ending at table position (i,j), and f is a weighting function of the length of the consecutive matches at that position. Notice that by providing different weighting functions f, we can parameterize the WLCS algorithm to assign different credit to consecutive in-sequence matches.</Paragraph> <Paragraph position="7"> The weighting function f must have the property that f(x+y) > f(x) + f(y) for any positive integers x and y. In other words, consecutive matches are awarded more credit than non-consecutive matches. For example, f(k) = a*k - b for k >= 0 and a, b > 0 has this property; it charges a gap penalty of -b for each non-consecutive n-gram sequence.</Paragraph> <Paragraph position="8"> Another possible function family is the polynomial family of the form k^a with a > 1. However, in order to normalize the final ROUGE-W score, we also prefer a function that has a closed-form inverse. For example, f(k) = k^2 has the closed-form inverse f^{-1}(k) = k^{1/2}. ROUGE-W, the WLCS-based F-measure, can then be computed as R_wlcs = f^{-1}(WLCS(X,Y)/f(m)), P_wlcs = f^{-1}(WLCS(X,Y)/f(n)), and F_wlcs = ((1 + b^2) R_wlcs P_wlcs) / (R_wlcs + b^2 P_wlcs), so that a candidate with consecutive matches is ranked higher than one without when using WLCS. We use the polynomial function of the form k^a in the ROUGE evaluation package. In the next section, we introduce the skip-bigram co-occurrence statistics.</Paragraph> </Section> </Section> <Section position="5" start_page="2" end_page="3" type="metho"> <SectionTitle> 4 ROUGE-S: Skip-Bigram Co-Occurrence </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="2" end_page="3" type="sub_section"> <SectionTitle> Statistics </SectionTitle> <Paragraph position="0"> A skip-bigram is any pair of words in their sentence order, allowing for arbitrary gaps. Skip-bigram co-occurrence statistics measure the overlap of skip-bigrams between a candidate translation and a set of reference translations. Using the example given in Section 3.1:
S1. police killed the gunman
S2. police kill the gunman
S3. the gunman kill police
S4. the gunman police killed
Each sentence has C(4,2) = 4!/(2!*2!) = 6 skip-bigrams, where C is the combination function. For example, S1 has the following skip-bigrams: (&quot;police killed&quot;, &quot;police the&quot;, &quot;police gunman&quot;, &quot;killed the&quot;, &quot;killed gunman&quot;, &quot;the gunman&quot;). S2 has three skip-bigram matches with S1 (&quot;police the&quot;, &quot;police gunman&quot;, &quot;the gunman&quot;), S3 has one skip-bigram match with S1 (&quot;the gunman&quot;), and S4 has two skip-bigram matches with S1 (&quot;police killed&quot;, &quot;the gunman&quot;). Given translations X of length m and Y of length n, assuming X is a reference translation and Y is a candidate translation, we compute the skip-bigram-based F-measure as follows: R_skip2 = SKIP2(X,Y)/C(m,2) (13), P_skip2 = SKIP2(X,Y)/C(n,2) (14), and F_skip2 = ((1 + b^2) R_skip2 P_skip2) / (R_skip2 + b^2 P_skip2) (15), where SKIP2(X,Y) is the number of skip-bigram matches between X and Y and C is the combination function. We call the skip-bigram-based F-measure, i.e. Equation 15, ROUGE-S.</Paragraph> <Paragraph position="1"> Using Equation 15 with b = 1 and S1 as the reference, S2's ROUGE-S score is 0.5, S3's is 0.167, and S4's is 0.333. Therefore, S2 is better than S3 and S4, and S4 is better than S3. This result is more intuitive than using BLEU-2 and ROUGE-L. One advantage of skip-bigram over BLEU is that it does not require consecutive matches but is still sensitive to word order. 
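The skip-bigram F-measure of Equation 15 can be sketched in a few lines of Python (an illustration rather than the released ROUGE code; whitespace tokenization is assumed, and the optional d_skip argument anticipates the maximum-skip-distance variant discussed in the next paragraphs, with None meaning no limit):

```python
from collections import Counter


def skip_bigrams(tokens, d_skip=None):
    """All in-order word pairs; d_skip limits the number of words allowed
    between the two members of a pair (None = arbitrary gaps)."""
    pairs = []
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if d_skip is None or (j - i - 1) <= d_skip:
                pairs.append((tokens[i], tokens[j]))
    return pairs


def rouge_s(reference, candidate, beta=1.0, d_skip=None):
    """Skip-bigram-based F-measure (Equation 15 when d_skip is None)."""
    ref_sb = skip_bigrams(reference.split(), d_skip)
    cand_sb = skip_bigrams(candidate.split(), d_skip)
    # SKIP2(X, Y): number of skip-bigram matches, counted as a multiset intersection.
    matches = sum((Counter(ref_sb) & Counter(cand_sb)).values())
    if matches == 0:
        return 0.0
    r = matches / len(ref_sb)    # denominator is C(m, 2) when d_skip is None
    p = matches / len(cand_sb)   # denominator is C(n, 2) when d_skip is None
    return (1 + beta ** 2) * r * p / (r + beta ** 2 * p)


s1 = "police killed the gunman"
print(round(rouge_s(s1, "police kill the gunman"), 3))    # 0.5
print(round(rouge_s(s1, "the gunman kill police"), 3))    # 0.167
print(round(rouge_s(s1, "the gunman police killed"), 3))  # 0.333
```

The printed values match the S2, S3, and S4 scores given above.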
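Looking back at the weighted LCS of Section 3.3, the recurrence can likewise be sketched directly from the procedure above. The default exponent alpha = 2 follows the k^2 example in the text; the released ROUGE package may use a different exponent, so treat this as an illustrative implementation only:

```python
def wlcs(x, y, f):
    """Weighted LCS score of token lists x and y under weighting function f."""
    m, n = len(x), len(y)
    c = [[0.0] * (n + 1) for _ in range(m + 1)]  # WLCS scores
    w = [[0] * (n + 1) for _ in range(m + 1)]    # length of consecutive matches ending at (i, j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                k = w[i - 1][j - 1]
                c[i][j] = c[i - 1][j - 1] + f(k + 1) - f(k)
                w[i][j] = k + 1
            elif c[i - 1][j] > c[i][j - 1]:
                c[i][j] = c[i - 1][j]
                w[i][j] = 0                      # no match at (i, j)
            else:
                c[i][j] = c[i][j - 1]
                w[i][j] = 0                      # no match at (i, j)
    return c[m][n]


def rouge_w(reference, candidate, alpha=2.0, beta=1.0):
    """WLCS-based F-measure, normalized with the inverse of f(k) = k**alpha."""
    f = lambda k: k ** alpha
    f_inv = lambda v: v ** (1.0 / alpha)
    x, y = reference.split(), candidate.split()
    score = wlcs(x, y, f)
    if score == 0:
        return 0.0
    r = f_inv(score / f(len(x)))   # weighted LCS-based recall
    p = f_inv(score / f(len(y)))   # weighted LCS-based precision
    return (1 + beta ** 2) * r * p / (r + beta ** 2 * p)
```

Because f is convex, a run of consecutive matches contributes more than the same number of scattered matches, which is exactly the behavior that distinguishes ROUGE-W from ROUGE-L.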
Comparing skip-bigram with LCS, skip-bigram counts all in-order matching word pairs while LCS counts only one longest common subsequence.</Paragraph> <Paragraph position="2"> We can limit the maximum skip distance, d_skip, between two in-order words that are allowed to form a skip-bigram. Applying such a constraint limits skip-bigram formation to a fixed window size. Therefore, computation time can be reduced, and hopefully performance remains as good as the version without the constraint. For example, if we set d_skip to 0, then ROUGE-S is equivalent to bigram overlap; if we set d_skip to 4, then only word pairs at most 4 words apart can form skip-bigrams.</Paragraph> <Paragraph position="3"> Adjusting Equations 13, 14, and 15 to use a maximum skip distance limit is straightforward: we only count the skip-bigram matches, SKIP2(X,Y), within the maximum skip distance, and replace the denominators of Equations 13, C(m,2), and 14, C(n,2), with the actual numbers of within-distance skip-bigrams from the reference and the candidate respectively.</Paragraph> <Paragraph position="4"> In the next section, we present the evaluations of ROUGE-L and ROUGE-S and compare their performance with other automatic evaluation measures.</Paragraph> </Section> </Section> <Section position="6" start_page="3" end_page="4" type="metho"> <SectionTitle> 5 Evaluations </SectionTitle> <Paragraph position="0"> One of the goals of developing automatic evaluation measures is to replace labor-intensive human evaluations. Therefore, the first criterion for assessing the usefulness of an automatic evaluation measure is to show that it correlates highly with human judgments in different evaluation settings. However, high-quality large-scale human judgments are hard to come by. Fortunately, we have access to eight MT systems' outputs, their human assessment data, and the reference translations from the 2003 NIST Chinese MT evaluation (NIST 2002a). There were 919 sentence segments in the corpus. We first computed averages of the adequacy and fluency scores assigned to each system by human evaluators. For the input of the automatic evaluation methods, we created three evaluation sets from the MT outputs: 1. Case set: The original system outputs with case information.</Paragraph> <Paragraph position="1"> 2. NoCase set: All words were converted into lower case, i.e. no case information was used. This set was used to examine whether human assessments were affected by case information, since not all MT systems generate properly cased output.</Paragraph> <Paragraph position="2"> 3. Stem set: All words were converted into lower case and stemmed using the Porter stemmer (Porter 1980). Since ROUGE computes similarity at the surface word level, the stemmed version allows ROUGE to perform more lenient matches.</Paragraph> <Paragraph position="3"> To accommodate multiple references, we use a jackknifing procedure. Given N references, we compute the best score over N sets of N-1 references. The final score is the average of the N best scores using N different sets of N-1 references. The jackknifing procedure is adopted since we often need to compare system and human performance, and the reference translations are usually the only human translations available. Using this procedure, we are also able to estimate average human performance by averaging the N best scores of one reference vs. the rest of the N-1 references.</Paragraph>
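A minimal sketch of this jackknifing procedure follows. Here score_fn stands for any single-reference scoring function (for example the rouge_l sketch above); as a simplification of the multi-reference equations of Section 3.2, the best score within each leave-one-out set is taken as the maximum single-reference score. These helper names are illustrative, not part of the ROUGE package.

```python
def jackknife_score(score_fn, references, candidate):
    """Average of the best scores over the N leave-one-out sets of N-1 references.
    With a single reference this reduces to the ordinary single-reference score."""
    n = len(references)
    if n == 1:
        return score_fn(references[0], candidate)
    best_scores = []
    for held_out in range(n):
        subset = [r for i, r in enumerate(references) if i != held_out]
        best_scores.append(max(score_fn(ref, candidate) for ref in subset))
    return sum(best_scores) / n


def human_estimate(score_fn, references):
    """Estimate average human performance: score each reference against the
    remaining N-1 references and average the N best scores."""
    n = len(references)
    scores = []
    for i, held_out in enumerate(references):
        rest = [r for j, r in enumerate(references) if j != i]
        scores.append(max(score_fn(ref, held_out) for ref in rest))
    return sum(scores) / n
```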
<Paragraph position="4"> We then computed average BLEU1-12, GTM with exponents of 1.0, 2.0, and 3.0, NIST, WER, and PER scores over these three sets. (BLEUN computes BLEU over n-grams up to length N; only BLEU1, BLEU4, and BLEU12 are shown in Table 1.) Finally, we applied ROUGE-L, ROUGE-W with the polynomial weighting function of Section 3.3, and ROUGE-S without a skip distance limit and with skip distance limits of 0, 4, and 9. Correlation analyses based on two different correlation statistics, Pearson's r and Spearman's r, with respect to adequacy and fluency are shown in Table 1.</Paragraph> <Paragraph position="1"> The Pearson's correlation coefficient measures the strength and direction of a linear relationship between any two variables, i.e. automatic metric score and human-assigned mean coverage score in our case. It ranges from +1 to -1. A correlation of 1 means that there is a perfect positive linear relationship between the two variables, a correlation of -1 means that there is a perfect negative linear relationship between them, and a correlation of 0 means that there is no linear relationship between them. (For a quick overview of the Pearson's coefficient, see: http://davidmlane.com/hyperstat/A34739.html.) Since we would like to use automatic evaluation metrics not only in comparing systems but also in in-house system development, a good linear correlation with human judgment would enable us to use automatic scores to predict corresponding human judgment scores. Therefore, Pearson's correlation coefficient is a good measure to look at.</Paragraph> <Paragraph position="2"> Spearman's correlation coefficient is also a measure of correlation between two variables. It is a non-parametric measure and is a special case of the Pearson's correlation coefficient in which the data values are converted into ranks before computing the coefficient. Spearman's correlation coefficient does not assume that the correlation between the variables is linear. Therefore it is a useful correlation indicator even when a good linear correlation, for example according to Pearson's correlation coefficient, cannot be found between two variables. It also suits the NIST MT evaluation scenario, where multiple systems are ranked according to some performance metric.</Paragraph> <Paragraph position="3"> To estimate the significance of these correlation statistics, we applied bootstrap resampling, generating random samples of the 919 different sentence segments. The lower and upper values of the 95% confidence interval are also shown in the table. Dark (green) cells are the best correlation numbers in their categories and light gray cells are statistically equivalent to the best numbers in their categories. [Table 1 caption: correlations of automatic evaluation measures with adequacy and fluency. BLEU1, 4, and 12 are BLEU with a maximum n-gram length of 1, 4, and 12; NIST is the NIST score; ROUGE-L is the LCS-based F-measure (b = 1); ROUGE-W is the weighted LCS-based F-measure (b = 1); ROUGE-S* is the skip-bigram-based F-measure (b = 1) with no skip distance limit; ROUGE-SN is the skip-bigram-based F-measure (b = 1) with a maximum skip distance of N; PER is position-independent word error rate; WER is word error rate; GTM 10, 20, and 30 are the General Text Matcher with exponents of 1.0, 2.0, and 3.0. Only BLEU1, 4, and 12 are shown to preserve space.]</Paragraph>
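Assuming per-system metric scores and human judgment averages are available as parallel lists (the numbers below are made up purely for illustration), both coefficients and a percentile-bootstrap confidence interval can be sketched as follows; scipy.stats provides pearsonr and spearmanr, and the statistic passed to the bootstrap helper would, in the setting above, recompute a system-level correlation from a resampled set of segment indices.

```python
import random
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: one average score per MT system (eight systems).
metric_scores   = [0.61, 0.55, 0.58, 0.49, 0.52, 0.60, 0.47, 0.50]
adequacy_scores = [3.2, 2.9, 3.0, 2.5, 2.7, 3.1, 2.4, 2.6]

pearson_r, _ = pearsonr(metric_scores, adequacy_scores)
spearman_rho, _ = spearmanr(metric_scores, adequacy_scores)
print(pearson_r, spearman_rho)


def bootstrap_ci(n_segments, statistic, n_resamples=1000, alpha=0.05):
    """Percentile bootstrap CI: resample segment indices with replacement and
    recompute the statistic (e.g. a system-level correlation) on each resample."""
    stats = []
    for _ in range(n_resamples):
        sample = [random.randrange(n_segments) for _ in range(n_segments)]
        stats.append(statistic(sample))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```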
<Paragraph position="4"> Analyzing all runs according to the adequacy and fluency tables, we make the following observations. Applying the stemmer achieves higher correlation with adequacy, but keeping case information achieves higher correlation with fluency, except for BLEU7-12 (only BLEU12 is shown). For example, the Pearson's r correlation of ROUGE-S* with adequacy increases from 0.85 (Case) to 0.95 (Stem), while its Pearson's r correlation with fluency drops from 0.84 (Case) to 0.78 (Stem). We will therefore focus our discussion on the Stem set for adequacy and the Case set for fluency.</Paragraph> <Paragraph position="5"> The Pearson's r correlation values in the Stem set of the Adequacy Table indicate that ROUGE-L and ROUGE-S with a skip distance longer than 0 correlate highly and linearly with adequacy and outperform BLEU and NIST. ROUGE-S* achieves the best correlation, with a Pearson's r of 0.95.</Paragraph> <Paragraph position="6"> Measures favoring consecutive matches, i.e. BLEU4 and 12, ROUGE-W, GTM20 and 30, ROUGE-S0 (bigram), and WER, have lower Pearson's r. Among them, WER (0.48), which tends to penalize even small word movements, is the worst performer. One interesting observation is that longer BLEU has lower correlation with adequacy.</Paragraph> <Paragraph position="7"> Spearman's r values generally agree with Pearson's r but have more equivalents.</Paragraph> <Paragraph position="8"> The Pearson's r correlation values in the Case set of the Fluency Table indicate that BLEU12 has the highest correlation (0.93) with fluency. However, it is statistically indistinguishable with 95% confidence from all other metrics shown in the Case set of the Fluency Table except for WER and GTM10.</Paragraph> <Paragraph position="9"> GTM10 has good correlation with human judgments in adequacy but not fluency, while GTM20 and GTM30, i.e. GTM with exponents larger than 1.0, have good correlation with human judgments in fluency but not adequacy.</Paragraph> <Paragraph position="10"> ROUGE-L and ROUGE-S*, 4, and 9 are good automatic evaluation metric candidates, since they perform as well as BLEU in the fluency correlation analysis and outperform BLEU4 and 12 significantly in adequacy. Among them, ROUGE-L is the best metric in both adequacy and fluency correlation with human judgment according to Spearman's correlation coefficient, and it is statistically indistinguishable from the best metrics in both adequacy and fluency correlation with human judgment according to Pearson's correlation coefficient.</Paragraph> </Section> <Section position="8" start_page="6" end_page="6" type="metho"> <SectionTitle> 6 Conclusion </SectionTitle> <Paragraph position="0"> In this paper we presented two new objective automatic evaluation methods for machine translation: ROUGE-L, based on longest common subsequence (LCS) statistics between a candidate translation and a set of reference translations, and ROUGE-S, based on skip-bigram co-occurrence statistics.</Paragraph> <Paragraph position="1"> The longest common subsequence takes sentence-level structure similarity into account naturally and identifies the longest co-occurring in-sequence n-grams automatically, whereas the n-gram length is a free parameter in BLEU.</Paragraph> <Paragraph position="2"> To give proper credit to shorter common sequences that are ignored by LCS while still retaining the flexibility of non-consecutive matches, we proposed counting skip-bigram co-occurrence. The skip-bigram-based ROUGE-S* (without skip distance restriction) had the best Pearson's r correlation of 0.95 in adequacy when all words were lower-cased and stemmed. 
ROUGE-L, ROUGE-W, ROUGE-S*, ROUGE-S4, and ROUGE-S9 were equal performers to BLEU in measuring fluency.</Paragraph> <Paragraph position="3"> However, they have the advantage that we can apply them at the sentence level, while a longer BLEU such as BLEU12 cannot differentiate any sentences shorter than 12 words (i.e. with no 12-gram matches). We plan to explore their correlation with human judgments at the sentence level in the future.</Paragraph> <Paragraph position="4"> We also confirmed empirically that adequacy and fluency focus on different aspects of machine translations. Adequacy placed more emphasis on terms that co-occur in the candidate and reference translations, as shown by the higher correlations in the Stem set than in the Case set in Table 1, while the reverse was true for fluency.</Paragraph> <Paragraph position="5"> The evaluation results of ROUGE-L, ROUGE-W, and ROUGE-S in machine translation evaluation are very encouraging. However, these measures in their current forms still apply only string-to-string matching. We have shown that better correlation with adequacy can be reached by applying a stemmer. As a next step, we plan to extend them to accommodate synonyms and paraphrases. For example, we can use an existing thesaurus such as WordNet (Miller 1990) or create a customized one by applying automated synonym set discovery methods (Pantel and Lin 2002) to identify potential synonyms. Paraphrases can also be automatically acquired using statistical methods, as shown by Barzilay and Lee (2003).</Paragraph> <Paragraph position="6"> Once we have acquired synonym and paraphrase data, we will then need to design a soft matching function that assigns partial credit to these approximate matches. In this scenario, statistically generated data has the advantage of being able to provide scores reflecting the strength of similarity between synonyms and paraphrases.</Paragraph> <Paragraph position="7"> ROUGE-L, ROUGE-W, and ROUGE-S have also been applied in the automatic evaluation of summarization and achieved very promising results (Lin 2004). In Lin and Och (2004), we proposed a framework that automatically evaluated automatic MT evaluation metrics using only manual translations, without further human involvement. According to the results reported in that paper, ROUGE-L, ROUGE-W, and ROUGE-S also outperformed BLEU.</Paragraph> </Section> </Paper>