<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1013">
  <Title>ROUGE: A Package for Automatic Evaluation of Summaries</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Traditionally evaluation of summarization involves human judgments of different quality metrics, for example, coherence, conciseness, grammaticality, readability, and content (Mani, 2001). However, even simple manual evaluation of summaries on a large scale over a few linguistic quality questions and content coverage as in the Document Understanding Conference (DUC) (Over and Yen, 2003) would require over 3,000 hours of human efforts.</Paragraph>
    <Paragraph position="1"> This is very expensive and difficult to conduct in a frequent basis. Therefore, how to evaluate summaries automatically has drawn a lot of attention in the summarization research community in recent years.</Paragraph>
    <Paragraph position="2"> For example, Saggion et al. (2002) proposed three content-based evaluation methods that measure similarity between summaries. These methods are: cosine similarity, unit overlap (i.e. unigram or bigram), and longest common subsequence. However, they did not show how the results of these automatic evaluation methods correlate to human judgments.</Paragraph>
    <Paragraph position="3"> Following the successful applic ation of automatic evaluation methods, such as BLEU (Papineni et al., 2001), in machine translation evaluation, Lin and Hovy (2003) showed that methods similar to BLEU, i.e. n-gram co-occurrence statistics, could be applied to evaluate summaries. In this paper, we introduce a package, ROUGE, for automatic evaluation of summaries and its evaluations. ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It includes several automatic evaluation methods that measure the similarity between summaries. We describe ROUGE-N in Section 2, ROUGE-L in Section 3, ROUGE-W in Section 4, and ROUGE-S in Section 5. Section 6 shows how these measures correlate with human judgments using DUC 2001, 2002, and 2003 data. Section 7 concludes this paper and discusses future directions.</Paragraph>
  </Section>
class="xml-element"></Paper>