<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1048">
  <Title>Evaluation challenges in large-scale document summarization</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>1 Introduction</SectionTitle>
    <Paragraph position="0">Automatic document summarization is a field that has seen increasing attention from the NLP community in recent years. In part, this is because summarization incorporates many important aspects of both natural language understanding and natural language generation; in part, it is because effective automatic summarization would be useful in a variety of areas. Unfortunately, evaluating automatic summarization in a standard and inexpensive way is a difficult task (Mani et al., 2001). Traditional large-scale evaluations are either too simplistic (using measures like precision, recall, and percent agreement, which (1) don't take chance agreement into account and (2) don't account for the fact that human judges don't agree on which sentences should be in a summary) or too expensive (an approach using manual judgements can scale up to a few hundred summaries, but not to tens or hundreds of thousands).</Paragraph>
    <Paragraph position="1">In this paper, we present a comparison of six summarizers as well as a meta-evaluation including eight measures: Precision/Recall, Percent Agreement, Kappa, Relative Utility, Relevance Correlation, and three types of Content-Based measures (cosine, longest common subsequence, and word overlap). We found that while the different measures tend to rank the summarizers in different orders, measures like Kappa, Relative Utility, Relevance Correlation, and the Content-Based measures each offer significant advantages over the more simplistic methods.</Paragraph>
  </Section>
</Paper>
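To make the contrast between the simplistic and the chance-corrected measures concrete, here is a minimal, self-contained Python sketch of a few of the measures named above: precision/recall and percent agreement over binary sentence-selection decisions, Kappa in its two-judge (Cohen) form, which corrects percent agreement for chance, and a cosine Content-Based measure over bag-of-words count vectors. This is not the paper's implementation, and the sentence labels and summary texts are hypothetical; the paper's actual setting may involve more judges and different preprocessing.

# Minimal sketch of some of the measures named above; not the paper's
# implementation. Sentence labels and summary texts below are hypothetical.
import math
from collections import Counter


def precision_recall(system, reference):
    """Precision/recall for binary sentence-selection decisions (1 = selected)."""
    tp = sum(1 for s, r in zip(system, reference) if s == 1 and r == 1)
    selected, relevant = sum(system), sum(reference)
    return (tp / selected if selected else 0.0,
            tp / relevant if relevant else 0.0)


def percent_agreement(a, b):
    """Fraction of sentences on which two judges made the same decision."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def cohens_kappa(a, b):
    """Agreement corrected for chance: kappa = (p_o - p_e) / (1 - p_e).

    Two-judge (Cohen) variant; p_e is the agreement expected from the
    judges' marginal selection rates alone.
    """
    n = len(a)
    p_o = percent_agreement(a, b)
    ra, rb = sum(a) / n, sum(b) / n
    p_e = ra * rb + (1 - ra) * (1 - rb)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)


def cosine(text_a, text_b):
    """Content-Based measure: cosine over bag-of-words count vectors."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Hypothetical 10-sentence document: 1 = sentence included in the extract.
    system_extract = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
    human_extract  = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]
    p, r = precision_recall(system_extract, human_extract)
    print(f"precision={p:.2f}  recall={r:.2f}")
    print(f"percent agreement={percent_agreement(system_extract, human_extract):.2f}")
    print(f"kappa={cohens_kappa(system_extract, human_extract):.2f}")
    print(f"cosine={cosine('the cat sat on the mat', 'a cat sat on a mat'):.2f}")

On this toy example percent agreement is 0.80 while Kappa is roughly 0.52, illustrating the chance-agreement correction that the simpler measures lack.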