<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1003">
  <Title>The Effects of Human Variation in DUC Summarization Evaluation</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Research in summarization was one of the first efforts to use computers to &amp;quot;understand&amp;quot; language.</Paragraph>
    <Paragraph position="1"> Work was done back in the 1950s by many groups, including commercial services, to automatically produce abstracts or lists of pertinent keywords for documents. The interest in automatic summarization of text has continued, and currently is enjoying increased emphasis as demonstrated by the numerous summarization workshops held during the last five years. The DUC summarization evaluations (2001 - 2004)(http://duc.nist.gov) sponsored by the DARPA TIDES project (Translingual Information Detection, Extraction, and Summarization) are prominent examples. DUC has been guided by a roadmap developed by members of the summarization research community.</Paragraph>
    <Paragraph position="2"> Along with the research has come efforts to evaluate automatic summarization performance. Two major types of evaluation have been used: extrinsic evaluation, where one measures indirectly how well the summary performs by measuring performance in a task putatively dependent on the quality of the summary, and intrinsic evaluation, where one measures the quality of the created summary directly.</Paragraph>
    <Paragraph position="3"> Extrinsic evaluation requires the selection of a task that could use summarization and measurement of the effect of using automatic summaries instead of the original text. Critical issues here are the selection of a real task and the metrics that will be sensitive to differences in the quality of the summaries. This paper concerns itself with intrinsic evaluations. Intrinsic evaluation requires some standard or model against which to judge summarization quality and usually this standard is operationalized by finding an existing abstract/text data set or by having humans create model summaries (Jing et al., 1998).</Paragraph>
    <Paragraph position="4"> Intrinsic evaluations have taken two main forms: manual, in which one or more people evaluate the system-produced summary and automatic, in which the summary is evaluated without the human in the loop. But both types involve human judgments of some sort and with them their inherent variability.</Paragraph>
    <Paragraph position="5"> Humans vary in what material they choose to include in a summary and in how they express the content. Humans judgments of summary quality vary from one person to another and across time for one person.</Paragraph>
    <Paragraph position="6"> In DUC 2001 - 2003 human judgments have formed the foundation of the evaluations and information has been collected each year on one or more sorts of variation in those judgments. The following sections examine this information and how the variation in human input affected or did not affect the results of those evaluations.</Paragraph>
  </Section>
class="xml-element"></Paper>