<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1013">
  <Title>ROUGE: A Package for Automatic Evaluation of Summaries</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 ROUGE-N: N-gram Co-Occurrence Statistics
</SectionTitle>
    <Paragraph position="0"> Formally, ROUGE-N is an n-gram recall between a candidate summary and a set of reference summaries. ROUGE-N is computed as follows:  Where n stands for the length of the n-gram, gramn, and Countmatch(gramn) is the maximum number of n-grams co-occurring in a candidate summary and a set of reference summaries.</Paragraph>
    <Paragraph position="1"> It is clear that ROUGE-N is a recall-related measure because the denominator of the equation is the total sum of the number of n-grams occurring at the reference summary side. A closely related measure, BLEU, used in automatic evaluation of machine translation, is a precision-based measure. BLEU measures how well a candidate translation matches a set of reference translations by counting the percentage of n-grams in the candidate translation overlapping wit h the references. Please see Papineni et al. (2001) for details about BLEU.</Paragraph>
    <Paragraph position="2"> Note that the number of n-grams in the denominator of the ROUGE-N formula increases as we add more references. This is intuitive and reasonable because there might exist multiple good summaries.</Paragraph>
    <Paragraph position="3"> Every time we add a reference into the pool, we expand the space of alternative summaries. By controlling what types of references we add to the reference pool, we can design evaluations that focus on different aspects of summarization. Also note that the numerator sums over all reference summaries. This effectively gives more weight to matching n-grams occurring in multiple references. Therefore a candidate summary that contains words shared by more references is favored by the ROUGE-N measure. This is again very intuitive and reasonable because we normally prefer a candidate summary that is more similar to consensus among reference summaries. null</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Multiple References
</SectionTitle>
      <Paragraph position="0"> So far, we only demonstrated how to compute ROUGE-N using a single reference. When mult iple references are used, we compute pairwise summary-level ROUGE-N between a candidate summary s and every reference, ri, in the reference set. We then take the maximum of pairwise summary-level ROUGE-N scores as the final multiple reference ROUGE-N score. This can be written as follows:</Paragraph>
      <Paragraph position="2"> This procedure is also applied to computation of ROUGE-L (Section 3), ROUGE-W (Section 4) , and ROUGE-S (Section 5). In the implementation, we use a Jackknifing procedure. Given M references, we compute the best score over M sets of M-1 references. The final ROUGE-N score is the average of the M ROUGE-N scores using different M-1 references. The Jackknifing procedure is adopted since we often need to compare system and human performance and the reference summaries are usually the only human summaries available. Using this procedure, we are able to estimate average human performance by averaging M ROUGE-N scores of one reference vs. the rest M-1 references. Although the Jackknif ing procedure is not necessary when we just want to compute ROUGE scores using multiple references, it is applied in all ROUGE score computations in the ROUGE evaluation package.</Paragraph>
      <Paragraph position="3"> In the next section, we describe a ROUGE measure based on longest common subsequences between two summaries.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 ROUGE-L: Longest Common Subsequence
</SectionTitle>
    <Paragraph position="0"> A sequence Z = [z1, z2, ..., zn] is a subsequence of another sequence X = [x1, x2, ..., xm], if there exists a strict increasing sequence [i1, i2, ..., ik] of indices of X such that for all j = 1, 2, ..., k, we have xij = zj (Cormen et al., 1989). Given two sequences X and Y, the longest common subsequence (LCS) of X and Y is a common subsequence with maximum length.</Paragraph>
    <Paragraph position="1"> LCS has been used in identifying cognate candidates during construction of N-best translation lexicon from parallel text. Melamed (1995) used the ratio (LCSR) between the length of the LCS of two words and the length of the longer word of the two words to measure the cognateness between them.</Paragraph>
    <Paragraph position="2"> He used LCS as an approximate string matching algorithm. Saggion et al. (2002) used normalized pairwise LCS to compare simila rity between two texts in automatic summarization evaluation.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Sentence-Level LCS
</SectionTitle>
      <Paragraph position="0"> To apply LCS in summarization evaluation, we view a summary sentence as a sequence of words.</Paragraph>
      <Paragraph position="1"> The intuition is that the longer the LCS of two summary sentences is, the more similar the two summaries are. We propose using LCS-based F-measure to estimate the similarity between two summaries X of length m and Y of length n, assuming X is a reference summary sentence and Y is a candidate summary sentence, as follows:</Paragraph>
      <Paragraph position="3"> Where LCS (X,Y) is the length of a longest common subsequence of X and Y, and ss = Plcs/Rlcs when ?Flcs/?Rlcs_=_?Flcs/?Plcs. In DUC, ss is set to a very big number (? 8). Therefore, only Rlcs is considered. We call the LCS-based F-measure, i.e. Equation 4, ROUGE-L. Notice that ROUGE-L is 1 when X = Y; while ROUGE-L is zero when LCS(X,Y) = 0, i.e.</Paragraph>
      <Paragraph position="4"> there is nothing in common between X and Y. F-measure or its equivalents has been shown to have met several theoretical criteria in measuring accuracy involving more than one factor (Van Rijsbergen, 1979). The composite factors are LCS-based recall and precision in this case. Melamed et al.</Paragraph>
      <Paragraph position="5"> (2003) used unigram F-measure to estimate machine translation quality and showed that unigram F-measure was as good as BLEU.</Paragraph>
      <Paragraph position="6"> One advantage of using LCS is that it does not require consecutive matches but in-sequence matches that reflect sentence level word order as n-grams.</Paragraph>
      <Paragraph position="7"> The other advantage is that it automatically includes longest in-sequence common n-grams, therefore no predefined n-gram length is necessary.</Paragraph>
      <Paragraph position="8"> ROUGE-L as defined in Equation 4 has the prop-erty that its value is less than or equal to the min imum of unigram F-measure of X and Y. Unigram recall reflects the proportion of words in X (reference summary sentence) that are also present in Y (candidate summary sentence); while unigram precision is the proportion of words in Y that are also in X. Unigram recall and precision count all co-occurring words regardless their orders; while ROUGE-L counts only in-sequence co-occurrences.</Paragraph>
      <Paragraph position="9"> By only awarding credit to in-sequence unigram matches, ROUGE-L also captures sentence level structure in a natural way. Consider the following example: S1. police killed the gunman S2. police kill the gunman S3. the gunman kill police We only consider ROUGE-2, i.e. N=2, for the purpose of explanation. Using S1 as the reference and S2 and S3 as the candidate summary sentences, S2 and S3 would have the same ROUGE-2 score, since they both have one bigram, i.e. &amp;quot;the gunman&amp;quot;. However, S2 and S3 have very different meanings. In the case of ROUGE-L, S2 has a score of 3/4 = 0.75 and S3 has a score of 2/4 = 0.5, with ss = 1. Therefore S2 is better than S3 according to ROUGE-L. This example also illustrated that ROUGE-L can work reliably at sentence level.</Paragraph>
      <Paragraph position="10"> However, LCS suffers one disadvantage that it only counts the main in-sequence words; therefore, other alternative LCSes and shorter sequences are not reflected in the final score. For example, given the following candidate sentence: S4. the gunman police killed Using S1 as its reference, LCS counts either &amp;quot;the gunman&amp;quot; or &amp;quot;police killed&amp;quot;, but not both; therefore, S4 has the same ROUGE-L score as S3. ROUGE-2 would prefer S4 than S3.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Summary-Level LCS
</SectionTitle>
      <Paragraph position="0"> Previous section described how to compute sentence-level LCS-based F-measure score. When applying to summary-level, we take the union LCS matches between a reference summary sentence, ri, and every candidate summary sentence, cj. Given a reference summary of u sentences containing a total of m words and a candidate summary of v sentences containing a total of n words, the summary-level LCS-based F-measure can be computed as follows:</Paragraph>
      <Paragraph position="2"> Again ss is set to a very big number (? 8) in DUC, i.e. only Rlcs is considered. ),( CrLCS i[?] is the LCS score of the union longest common subsequence between reference sentence ri and candidate summary C. For example, if ri = w1 w2 w3 w4 w5, and C contains two sentences: c1 = w1 w2 w6 w7 w8 and c2 = w1 w3 w8 w9 w5, then the longest common subsequence of ri and c1 is &amp;quot;w1 w2&amp;quot; and the longest common subsequence of ri and c2 is &amp;quot;w1 w3 w5&amp;quot;. The union longest common subsequence of ri, c1, and c2 is &amp;quot;w1 w2 w3 w5&amp;quot; and ),( CrLCS i[?] = 4/5.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 ROUGE-L vs. Normalized Pairwise LCS
</SectionTitle>
      <Paragraph position="0"> The normalized pairwise LCS proposed by Radev et al. (page 51, 2002) between two summaries S1 and S2, LCS(S1 ,S2)MEAD , is written as follows:</Paragraph>
      <Paragraph position="2"> Assuming S1 has m words and S2 has n words, Equation 8 can be rewritten as Equation 9 due to symmetry:</Paragraph>
      <Paragraph position="4"> We then define MEAD LCS recall (Rlcs-MEAD) and</Paragraph>
      <Paragraph position="6"> Equation 12 shows that normalized pairwise LCS as defined in Radev et al. (2002) and implemented in MEAD is also a F-measure with ss = 1. Sentence-level normalized pairwise LCS is the same as ROUGE-L with ss = 1. Besides setting ss = 1, summary-level normalized pairwise LCS is different from ROUGE-L in how a sentence gets its LCS score from its references. Normalized pairwise LCS takes the best LCS score while ROUGE-L takes the union LCS score.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 ROUGE-W: Weighted Longest Common Subsequence
</SectionTitle>
    <Paragraph position="0"> sequence LCS has many nice properties as we have described in the previous sections. Unfortunately, the basic LCS also has a problem that it does not differentiate LCSes of different spatial relations within their embedding sequences. For example, given a reference  sequence X and two candidate sequences Y1 and Y2 as follows: X: [A B C D E F G] Y1: [A B C D H I K] Y2: [A H B K C I D] Y1 and Y2 have the same ROUGE-L score. However, in this case, Y1 should be the better choice than Y2 because Y1 has consecutive matches. To improve the basic LCS method, we can simply remember the  length of consecutive matches encountered so far to a regular two dimensional dynamic program table computing LCS. We call this weighted LCS (WLCS) and use k to indicate the length of the current consecutive matches ending at words xi and yj. Given two sentences X and Y, the WLCS score of X and Y can be computed using the following dynamic programming procedure:</Paragraph>
    <Paragraph position="2"> Where c is the dynamic programming table, c(i,j) stores the WLCS score ending at word xi of X and yj of Y, w is the table storing the length of consecutive matches ended at c table position i and j, and f is a function of consecutive matches at the table position, c(i,j). Notice that by providing different weighting function f, we can parameterize the WLCS algorithm to assign different credit to consecutive in-sequence matches.</Paragraph>
    <Paragraph position="3"> The weighting function f must have the property that f(x+y) &gt; f(x) + f(y) for any positive integers x and y. In other words, consecutive matches are awarded more scores than non-consecutive matches.</Paragraph>
    <Paragraph position="4"> For example, f(k)-=-ak - b when k &gt;= 0, and a, b &gt; 0. This function charges a gap penalty of -b for each non-consecutive n-gram sequences. Another possible function family is the polynomial family of the form ka where -a &gt; 1. However, in order to norma lize the final ROUGE-W score, we also prefer to have a function that has a close form inverse function. For example, f(k)-=-k2 has a close form inverse function f -1(k)-=-k1/2. F-measure based on WLCS can be computed as follows, given two sequences X of length m and Y of length n:</Paragraph>
    <Paragraph position="6">
$$R_{wlcs} = f^{-1}\!\left(\frac{WLCS(X,Y)}{f(m)}\right) \quad (13)$$
$$P_{wlcs} = f^{-1}\!\left(\frac{WLCS(X,Y)}{f(n)}\right) \quad (14)$$
$$F_{wlcs} = \frac{(1+\beta^2)\, R_{wlcs}\, P_{wlcs}}{R_{wlcs} + \beta^2 P_{wlcs}} \quad (15)$$

where $f^{-1}$ is the inverse function of f. In DUC, $\beta$ is set to a very big number ($\to \infty$); therefore, only $R_{wlcs}$ is considered. We call the WLCS-based F-measure, i.e. Equation 15, ROUGE-W. Using Equation 15 and f(k) = k^2 as the weighting function, the ROUGE-W scores for sequences Y1 and Y2 are 0.571 and 0.286 respectively. Therefore, Y1 would be ranked higher than Y2 using WLCS. We use the polynomial function of the form k^a in the ROUGE evaluation package. In the next section, we introduce the skip-bigram co-occurrence statistics.</Paragraph>
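Putting the WLCS recurrence and Equation 15 together, a sketch with f(k) = k^2 reproduces the Y1/Y2 ranking above. The function names are illustrative, not the ROUGE package's API.

```python
import math

def wlcs(x, y, f):
    """Weighted LCS score (WLCS) of token lists x and y.
    c[i][j]: WLCS score ending at x[i-1], y[j-1];
    w[i][j]: length of the consecutive-match run ending there."""
    m, n = len(x), len(y)
    c = [[0.0] * (n + 1) for _ in range(m + 1)]
    w = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                k = w[i - 1][j - 1]
                # Extending a run of length k to k+1 earns f(k+1) - f(k),
                # so long consecutive runs accumulate more credit.
                c[i][j] = c[i - 1][j - 1] + f(k + 1) - f(k)
                w[i][j] = k + 1
            elif c[i - 1][j] > c[i][j - 1]:
                c[i][j], w[i][j] = c[i - 1][j], 0
            else:
                c[i][j], w[i][j] = c[i][j - 1], 0
    return c[m][n]

def rouge_w(reference, candidate, beta=1.0,
            f=lambda k: k ** 2, f_inv=math.sqrt):
    """ROUGE-W (Equation 15) with weighting function f and its inverse f_inv."""
    x, y = reference.split(), candidate.split()
    score = wlcs(x, y, f)
    if score == 0:
        return 0.0
    r = f_inv(score / f(len(x)))
    p = f_inv(score / f(len(y)))
    return (1 + beta ** 2) * r * p / (r + beta ** 2 * p)

x  = "A B C D E F G"
y1 = "A B C D H I K"
y2 = "A H B K C I D"
print(round(rouge_w(x, y1), 3))  # 0.571
print(round(rouge_w(x, y2), 3))  # 0.286
```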
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 ROUGE-S: Skip-Bigram Co-Occurrence Statistics
</SectionTitle>
    <Paragraph position="0"> tistics Skip-bigram is any pair of words in their sentence order, allowing for arbitrary gaps. Skip-bigram co-occurrence statistics measure the overlap of skip-bigrams between a candidate translation and a set of reference translations. Using the example given in Section 3.1: S1. police killed the gunman S2. police kill the gunman S3. the gunman kill police S4. the gunman police killed each sentence has C(4,2)1 = 6 skip-bigrams. For example, S1 has the following skip-bigrams: (&amp;quot;police killed&amp;quot;, &amp;quot;police the&amp;quot;, &amp;quot;police gunman&amp;quot;, &amp;quot;killed the&amp;quot;, &amp;quot;killed gunman&amp;quot;, &amp;quot;the gunman&amp;quot;) S2 has three skip-bigram matches with S1 (&amp;quot;police the&amp;quot;, &amp;quot;police gunman&amp;quot;, &amp;quot;the gunman&amp;quot;), S3 has one skip-bigram match with S1 (&amp;quot;the gunman&amp;quot;), and S4 has two skip-bigram matches with S1 (&amp;quot;police kille d&amp;quot;, &amp;quot;the gunman&amp;quot;). Given translations X of length m and Y of length n, assuming X is a reference translation and Y is a candidate translation, we compute skip-bigram-based F-measure as follows:</Paragraph>
    <Paragraph position="2"/>
    <Paragraph position="4"> Where SKIP2(X,Y) is the number of skip-bigram matches between X and Y, ss controlling the relative importance of Pskip2 and Rskip2, and C is the combination function. We call the skip-bigram-based Fmeasure, i.e. Equation 18, ROUGE-S.</Paragraph>
    <Paragraph position="5"> Using Equation 18 with ss = 1 and S1 as the reference, S2's ROUGE-S score is 0.5, S3 is 0.167, and S4 is 0.333. Therefore, S2 is better than S3 and S4, and S4 is better than S3. This result is more intuitive than using BLEU-2 and ROUGE-L. One advantage of skip-bigram vs. BLEU is that it does not require consecutive matches but is still sensitive to word order. Comparing skip-bigram with LCS, skip-bigram counts all in-order matching word pairs while LCS only counts one longest common subsequence.</Paragraph>
    <Paragraph position="6"> Applying skip-bigram without any constraint on the distance between the words, spurious matches such as &amp;quot;the the&amp;quot; or &amp;quot;of in &amp;quot; might be counted as valid matches. To reduce these spurious matches, we can limit the maximum skip distance, dskip, between two in-order words that is allowed to form a skip-bigram. For example, if we set dskip to 0 then ROUGE-S is equivalent to bigram overlap Fmeasure. If we set dskip to 4 then only word pairs of at most 4 words apart can form skip-bigrams.</Paragraph>
    <Paragraph position="7"> Adjusting Equations 16, 17, and 18 to use maximum skip distance limit is straightforward: we only count the skip-bigram matches, SKIP2 (X,Y), within the maximum skip distance and replace denominators of Equations 16, C(m,2), and 17, C(n,2), with the actual numbers of within distance skip-bigrams from the reference and the candidate respectively.</Paragraph>
    <Paragraph position="9"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 ROUGE-SU: Extension of ROUGE-S
</SectionTitle>
      <Paragraph position="0"> One potential problem for ROUGE-S is that it does not give any credit to a candidate sentence if the sentence does not have any word pair co-occurring with its references. For example, the following sentence has a ROUGE-S score of zero: S5. gunman the killed police S5 is the exact reverse of S1 and there is no skip bigram match between them. However, we would like to differentiate sentences similar to S5 from sentences that do not have single word co-occurrence with S1. To achieve this, we extend ROUGE-S with the addition of unigram as counting unit. The extended version is called ROUGE-SU. We can also obtain ROUGE-SU from ROUGE-S by adding a begin-of-sentence marker at the beginning of candidate and reference sentences.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Evaluations of ROUGE
</SectionTitle>
    <Paragraph position="0"> To assess the effectiveness of ROUGE measures, we compute the correlation between ROUGE assigned summary scores and human assigned summary scores. The intuition is that a good evaluation measure should assign a good score to a good summary and a bad score to a bad summary. The ground truth is based on human assigned scores. Acquiring human judgments are usually very expensive; fortunately, we have DUC 2001, 2002, and 2003 evaluation data that include human judgments for the following: * Single document summaries of about 100 words: 12 systems 2 for DUC 2001 and 14 systems for 2002. 149 single document summaries were judged per system in DUC 2001 and 295 were judged in DUC 2002.</Paragraph>
    <Paragraph position="1">  for DUC 2001 and 10 systems for DUC 2002; 100 words: 14 systems for DUC 2001, 10 systems for DUC 2002, and 18 systems for DUC 2003; 200 words: 14 systems for DUC 2001 and 10 systems for DUC 2002; 400 words: 14 systems for DUC 2001. 29 summaries were judged per system per summary size in DUC 2001, 59 were judged in DUC 2002, and 30 were judged in DUC 2003.</Paragraph>
    <Paragraph position="2"> 2 All systems include 1 or 2 baselines. Please see DUC website for details.</Paragraph>
    <Paragraph position="3"> Besides these human judgments, we also have 3 sets of manual summaries for DUC 2001, 2 sets for DUC 2002, and 4 sets for DUC 2003. Human judges assigned content coverage scores to a candidate summary by examining the percentage of content overlap between a manual summary unit, i.e. elementary discourse unit or sentence, and the candidate summary using Summary Evaluation Environment3 (SEE) developed by the University of</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Southern California's Information Sciences Institute
</SectionTitle>
      <Paragraph position="0"> (ISI). The overall candidate summary score is the average of the content coverage scores of all the units in the manual summary. Note that human judges used only one manual summary in all the evaluations although multiple alternative summaries were available.</Paragraph>
      <Paragraph position="1"> With the DUC data, we computed Pearson's product moment correlation coefficients, Spearman's rank order correlation coefficients, and Kendall's correlation coefficients between systems' average ROUGE scores and their human assigned average coverage scores using single reference and multiple references. To investigate the effect of stemming and inclusion or exclusion of stopwords,  manual summaries (CASE set), stemmed4 version of the summaries (STEM set), and stopped version of the summaries (STOP set). For example, we computed ROUGE scores for the 12 systems participated in the DUC 2001 single document summarization evaluation using the CASE set with single reference and then calculated the three correlation scores for these 12 systems' ROUGE scores vs. human assigned average coverage scores. After that we repeated the process using multiple references and then using STEM and STOP sets. Therefore, 2 (multi or single) x 3 (CASE, STEM, or STOP) x 3 (Pearson, Spearman, or Kendall) = 18 data points were collected for each ROUGE measure and each DUC task. To assess the significance of the results, we applied bootstrap resampling technique (Davison and Hinkley, 1997) to estimate 95% confidence intervals for every correlation computation.</Paragraph>
      <Paragraph position="2"> 17 ROUGE measures were tested for each run using ROUGE evaluation package v1.2.1: ROUGE-N with N = 1 to 9, ROUGE-L, ROUGE-W with weighting factor a = 1.2, ROUGE-S and ROUGE-SU with maximum skip distance dskip = 1, 4, and 9. Due to limitation of space, we only report correlation analysis results based on Pearson's correlation coefficient. Correlation analyses based on Spearman's and Kendall's correlation coefficients are tracking Pearson's very closely and will be posted later at the ROUGE website5 for reference. The critical value6 for Pearson's correlation is 0.632 at 95% confidence with 8 degrees of freedom.</Paragraph>
      <Paragraph position="3"> Table 1 shows the Pearson's correlation coefficients of the 17 ROUGE measures vs. human judgments on DUC 2001 and 2002 100 words single document summarization data. The best values in each column are marked with dark (green) color and statistically equivalent values to the best values are marked with gray. We found that correlations were not affected by stemming or removal of stopwords in this data set, ROUGE-2 performed better among the ROUGE-N variants, ROUGE-L, ROUGE-W, and ROUGE-S were all performing well, and using multiple references improved performance though not much. All ROUGE measures achieved very good correlation with human judgments in the DUC 2002 data. This might due to the double sample size in DUC 2002 (295 vs. 149 in DUC 2001) for each system. null Table 2 shows the correlation analysis results on the DUC 2003 single document very short summary data. We found that ROUGE-1, ROUGE-L, ROUGE- null measure scores vs. human judgments for the DUC 2003 very short summary task SU4 and 9, and ROUGE-W were very good measures in this category, ROUGE-N with N &gt; 1 performed significantly worse than all other measures, and exclusion of stopwords improved performance in general except for ROUGE-1. Due to the large number of samples (624) in this data set, using multiple references did not improve correlations.</Paragraph>
      <Paragraph position="4"> In Table 3 A1, A2, and A3, we show correlation analysis results on DUC 2001, 2002, and 2003 100 words multi-document summarization data. The results indicated that using multiple references improved correlation and exclusion of stopwords usually improved performance. ROUGE-1, 2, and 3 performed fine but were not consistent. ROUGE-1, ROUGE-S4, ROUGE-SU4, ROUGE-S9, and ROUGE-SU9 with stopword removal had correlation above 0.70. ROUGE-L and ROUGE-W did not work well in this set of data.</Paragraph>
      <Paragraph position="5"> Table 3 C, D1, D2, E1, E2, and F show the correlation analyses using multiple references on the rest of DUC data. These results again suggested that exclusion of stopwords achieved better performance especially in multi-document summaries of 50 words. Better correlations (&gt; 0.70) were observed on long summary tasks, i.e. 200 and 400 words summaries. The relative performance of ROUGE measures followed the pattern of the 100 words multi-document summarization task.</Paragraph>
      <Paragraph position="6"> Comparing the results in Table 3 with Table s 1 and 2, we found that correlation values in the multi-document tasks rarely reached high 90% except in long summary tasks. One possible explanation of this outcome is that we did not have large amount of samples for the multi-document tasks. In the single document summarization tasks we had over 100 samples; while we only had about 30 samples in the multi-document tasks. The only tasks that had over 30 samples was from DUC 2002 and the correlations of ROUGE measures with human judgments on the 100 words summary task were much better and more stable than similar tasks in DUC 2001 and 2003. Statistically stable human judgments of system performance might not be obtained due to lack of samples and this in turn caused instability of correlation analyses.</Paragraph>
  </Section>
class="xml-element"></Paper>