<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0508"> <Title>Examining the consensus between human summaries: initial experiments with factoid analysis</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> It is an understatement to say that measuring the quality of summaries is hard. In fact, there is unanimous consensus in the summarisation community that the evaluation of summaries is a monstrously difficult task. In recent years, a considerable amount of summarisation work has effectively been aimed at finding viable evaluation strategies (Spärck Jones, 1999; Jing et al., 1998; Donaway et al., 2000). Large-scale conferences like SUMMAC (Mani et al., 1999) and DUC (2002) have unfortunately shown weak results, in that current evaluation measures could not distinguish between automatic summaries, although they are effective enough to distinguish them from human-written summaries.</Paragraph>
<Paragraph position="1"> In principle, the best way to evaluate a summary is to try to perform the task for which the summary was meant in the first place, and to measure the quality of the summary on the basis of the degree of success in executing that task. However, such extrinsic evaluations are so time-consuming to set up that they cannot be used for the day-to-day evaluation needed during system development. So in practice, a method for intrinsic evaluation is needed, where the properties of the summary itself are examined, independent of its application.</Paragraph>
<Paragraph position="2"> We think one of the reasons for the difficulty of intrinsic evaluation is that summarisation has to call upon at least two hard subtasks: selection of information and production of new text. Both tasks are known from various NLP fields (e.g. information retrieval and information extraction for selection; generation and machine translation (MT) for production) to be not only hard to execute but also hard to evaluate. This is caused in large part by the fact that in both cases there is no single &quot;best&quot; result, but rather various &quot;good&quot; results. It is hence no wonder that the evaluation of summarisation, which combines these two subtasks, is even harder. The general approach for intrinsic evaluations (Mani, 2001) is therefore to separate the evaluation of the form of the text (quality) from that of its information content (informativeness).</Paragraph>
<Paragraph position="3"> In this paper, we focus on the latter, the intrinsic evaluation of informativeness, and we address two aspects: the (in)sufficiency of a single human summary to measure against, and the information unit on which similarity measures are based.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.1 Gold standards </SectionTitle>
<Paragraph position="0"> In various NLP fields, such as POS tagging, systems are tested by comparison against a &quot;gold standard&quot;, a manually produced result which is supposed to be the &quot;correct&quot;, &quot;true&quot; or &quot;best&quot; result. This presupposes, however, that there is a single &quot;best&quot; result. In summarisation there appears to be no &quot;one truth&quot;, as is evidenced by the low agreement between humans in producing gold standard summaries by sentence selection (Rath et al., 1961; Jing et al., 1998; Zechner, 1996), and by the low overlap measures between humans when gold standard summaries are created by reformulation in the summarisers' own words (e.g. the average overlap for the 542 single-document summary pairs in DUC-02 was only about 47%).</Paragraph>
<Paragraph position="1"> But even though the non-existence of any one gold standard is generally acknowledged in the summarisation community, actual practice nevertheless ignores this. Comparisons against a single gold standard are widely used, due to the expense of compiling summary gold standards and the lack of composite measures for comparison against more than one gold standard.</Paragraph>
<Paragraph position="2"> In a related field, information retrieval (IR), the problem of the subjectivity of relevance judgements is circumvented by extensive sampling: many different queries are collected to level out the differences humans show in suggesting queries and in selecting relevant documents. While relevance judgements between humans remain different, Voorhees (2000) shows that the relative rankings of systems are nevertheless stable across annotators, which means that meaningful IR measures have been found despite the inherent subjectivity of relevance judgements.</Paragraph>
<Paragraph position="3"> Similarly, in MT, the recent Bleu measure also uses the idea that one gold standard is not enough.</Paragraph>
<Paragraph position="4"> In an experiment, Papineni et al. (2001) based an evaluation on a collection of four reference translations of 40 general news stories and showed the evaluation to be comparable to human judgement.</Paragraph>
<Paragraph position="5"> Lin and Hovy (2002) examine the use of multiple gold standards for summarisation evaluation, and conclude that &quot;we need more than one model summary although we cannot estimate how many model summaries are required to achieve reliable automated summary evaluation&quot;. We explore the differences and similarities between various human summaries in order to create a basis for such an estimate and, as a side-effect, also re-examine the degree of difference between the use of a single summary gold standard and the use of a compound gold standard.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 1.2 Similarity measures </SectionTitle>
<Paragraph position="0"> The second aspect we examine is the similarity measure to be used for gold standard comparison.</Paragraph>
<Paragraph position="1"> In principle, the comparison can be done via co-selection of extracted sentences (Rath et al., 1961; Jing et al., 1998; Zechner, 1996), by string-based surface measures (Lin and Hovy, 2002; Saggion et al., 2002), or by subjective judgements of the amount of information overlap (DUC, 2002). The rationale for using information overlap judgement as the main evaluation metric for DUC is the wish to measure the meaning of sentences, rather than to use surface-based similarity such as co-selection (which does not even take identical information expressed in different sentences into account) or string-based measures.</Paragraph>
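As a concrete illustration of the first two comparison strategies, the minimal Python sketch below computes a co-selection F-score over selected sentence indices and a simple unigram-overlap score over summary strings. The sketch and its function names are ours, added for illustration only; the exact co-selection and string-based measures used in the cited evaluations differ in their details.

    # Illustrative sketch only: two simple ways of comparing a system summary
    # against a single gold standard summary. These particular definitions are
    # assumptions made for clarity, not the measures used in the cited work.

    def coselection_f1(system_sentences, gold_sentences):
        """Co-selection: compare the sets of sentence indices chosen by the
        system and by the human (extractive summaries only)."""
        system, gold = set(system_sentences), set(gold_sentences)
        if not system or not gold:
            return 0.0
        overlap = len(system.intersection(gold))
        precision = overlap / len(system)
        recall = overlap / len(gold)
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    def unigram_overlap(system_text, gold_text):
        """String-based surface measure: fraction of gold-standard word types
        that also occur in the system summary."""
        system_words = set(system_text.lower().split())
        gold_words = set(gold_text.lower().split())
        if not gold_words:
            return 0.0
        return len(system_words.intersection(gold_words)) / len(gold_words)

    # Example: the system extracts sentences 1, 3 and 7; the human extracts 1, 2 and 7.
    print(coselection_f1([1, 3, 7], [1, 2, 7]))                     # 0.666...
    print(unigram_overlap("the cat sat", "the cat slept soundly"))  # 0.5

Neither of these measures credits identical information expressed in different words, which is exactly the limitation that motivates the DUC information overlap judgements described next.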
<Paragraph position="2"> In the DUC competitions, assessors judge the informational overlap between &quot;model units&quot; (elementary discourse units (EDUs), i.e. clause-like units, taken from the gold standard summary) and &quot;peer units&quot; (sentences taken from the participating summaries) on the basis of the question: &quot;How much of the information in a model unit is contained in a peer unit: all of it, most, some, any, or none?&quot; This overlap judgement is made for each system-produced summary, and weighted recall measures report how much gold standard information is present in the summaries.</Paragraph>
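To make the general shape of such a metric concrete, here is a minimal sketch of a weighted-recall style coverage score computed from per-model-unit overlap judgements. The sketch is ours and for illustration only; the numeric weights attached to the categorical judgements and the averaging scheme are assumptions, not the official DUC definitions.

    # Illustrative sketch of a DUC-style weighted recall ("coverage") score.
    # The weight values below are an assumption made for illustration;
    # DUC defines its own weighting and aggregation scheme.

    JUDGEMENT_WEIGHTS = {"all": 1.0, "most": 0.75, "some": 0.5, "any": 0.25, "none": 0.0}

    def weighted_recall(judgements):
        """judgements: one categorical overlap judgement per model unit of the
        gold standard summary, e.g. ["all", "some", "none", "most"].
        Returns the average weight, i.e. how much of the gold standard's
        information the peer summary was judged to contain."""
        if not judgements:
            return 0.0
        return sum(JUDGEMENT_WEIGHTS[j] for j in judgements) / len(judgements)

    # Example: a peer summary judged against a gold standard with four model units.
    print(weighted_recall(["all", "most", "some", "none"]))  # 0.5625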
<Paragraph position="3"> However, Lin and Hovy (2002) report low agreement for two tasks: producing the human summaries (around 40%), and assigning information overlap between them. In those cases where annotators had to judge a pair consisting of a gold standard sentence and a system sentence more than once (because different systems returned the same sentence), they agreed with their own prior judgement in only 82% of the cases. This relatively low intra-annotator agreement points to the fact that the overlap judgement remains a subjective task on which judges will disagree. Lin and Hovy also show the resulting instability of the evaluation, as expressed in system rankings.</Paragraph>
<Paragraph position="4"> We propose a gold standard comparison based on factoids, a pseudo-semantic representation of the text, which, like DUC, measures information rather than string similarity, but which is more objective than DUC-style information overlap judgement.</Paragraph> </Section> </Section> </Paper>