<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0901">
<Title>A Methodology for Extrinsic Evaluation of Text Summarization: Does ROUGE Correlate?</Title>
<Section position="3" start_page="0" end_page="0" type="intro">
<SectionTitle>1 Introduction</SectionTitle>
<Paragraph position="0">People often prefer to read a summary of a text document, e.g., news headlines, scientific abstracts, movie previews and reviews, and meeting minutes. Correspondingly, the explosion of online textual material has prompted advanced research in document summarization. Although researchers have demonstrated that users can read summaries faster than full text, with some loss of accuracy (Mani et al., 2002), it has been difficult to draw strong conclusions about the usefulness of summarization because of the low interannotator agreement in the gold standards used. Definitive conclusions about the usefulness of summaries would justify continued research and development of new summarization methods.</Paragraph>
<Paragraph position="1">To investigate whether text summarization is useful in an extrinsic task, we examined human performance in a relevance assessment task using a text surrogate (i.e., text intended to stand in place of a document). We use single-document English summaries, as these are sufficient for investigating task-based usefulness, although more elaborate surrogates are possible, e.g., those that span more than one document (Radev and McKeown, 1998; Mani and Bloedorn, 1998).</Paragraph>
<Paragraph position="2">The next section motivates the need for a new framework for measuring task-based usefulness. Section 3 presents a novel extrinsic measure called Relevance-Prediction. Section 4 demonstrates that this measure is more reliable than previous gold-standard methods, e.g., the LDC-Agreement method used for SUMMAC-style evaluations, and that this reliability allows us to make stronger statistical statements about the benefits of summarization. We expect these findings to be important for future summarization evaluations.</Paragraph>
<Paragraph position="3">Section 5 presents the results of a correlation analysis between task usefulness and the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric (Lin and Hovy, 2003). While we show that ROUGE correlates with task usefulness (using our Relevance-Prediction measure), we detect a slight difference between informative, extractive headlines (containing words from the full document) and less informative, non-extractive "eye-catchers" (containing words that might not appear in the full document and that are intended to entice a reader to read the entire document).</Paragraph>
<Paragraph position="4">Section 6 further highlights the importance of this point and discusses the implications for automatic evaluation of non-extractive summaries. To evaluate non-extractive summaries reliably, an automatic measure may require knowledge of sophisticated meaning units. It is our hope that the conclusions drawn herein will prompt investigation into more sophisticated automatic metrics as researchers shift their focus to non-extractive summaries.</Paragraph>
</Section>
</Paper>
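To make the Relevance-Prediction idea previewed above concrete, here is a minimal sketch of an agreement-based score of this kind, assuming the measure compares a subject's relevance judgment made from the surrogate (headline or summary) against the same subject's judgment made from the full document, so that each subject serves as his or her own gold standard. The function name, data layout, and toy judgments are hypothetical illustrations, not code or data from the paper.

```python
# Hypothetical sketch of a Relevance-Prediction-style score: the fraction of
# documents for which a subject's relevance judgment on the surrogate
# (headline/summary) matches the same subject's judgment on the full text.
# Names and data layout are illustrative, not taken from the paper.

def relevance_prediction(surrogate_judgments, fulltext_judgments):
    """Agreement rate between surrogate-based and full-text-based judgments.

    Both arguments map document IDs to boolean relevance judgments made by
    the SAME subject, so the subject is his or her own gold standard.
    """
    shared = surrogate_judgments.keys() & fulltext_judgments.keys()
    if not shared:
        raise ValueError("no documents judged under both conditions")
    matches = sum(surrogate_judgments[d] == fulltext_judgments[d] for d in shared)
    return matches / len(shared)

# Example: one subject judged four documents under both conditions.
surrogate = {"d1": True, "d2": False, "d3": True, "d4": False}
fulltext  = {"d1": True, "d2": True,  "d3": True, "d4": False}
print(relevance_prediction(surrogate, fulltext))  # 0.75
```

Because the comparison is within-subject, the score sidesteps the low interannotator agreement that the introduction identifies as the weakness of earlier gold-standard methods.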
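Section 5's correlation analysis involves two ingredients that can be sketched briefly: a ROUGE-style score for a summary against a reference, and a correlation coefficient between per-system ROUGE scores and per-system task scores. The sketch below implements only the simplest ROUGE-1-style unigram recall (the actual ROUGE package of Lin and Hovy supports n-grams, stemming, and multiple references) together with a Pearson correlation; all names and the toy numbers are illustrative assumptions, not the paper's data.

```python
from collections import Counter
from math import sqrt

def rouge1_recall(candidate, reference):
    """Simplified ROUGE-1-style unigram recall: the count-clipped fraction
    of reference unigrams that also appear in the candidate. The official
    ROUGE toolkit adds n-grams, stemming, and multi-reference support."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)

def pearson(xs, ys):
    """Pearson correlation between paired score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy per-system scores (illustrative only): automatic ROUGE scores paired
# with task-usefulness scores such as Relevance-Prediction.
rouge_scores = [0.42, 0.35, 0.51, 0.28]
task_scores  = [0.70, 0.61, 0.78, 0.55]
print(pearson(rouge_scores, task_scores))
```

Note how a recall measure of this kind rewards word overlap with the source, which is one way to see why extractive headlines and non-extractive "eye-catchers" could behave differently under ROUGE, as the introduction anticipates.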