<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1048">
  <Title>Evaluation challenges in large-scale document summarization</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Data, Annotation, and Experimental Design
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <Paragraph position="0"> We performed our experiments on the Hong Kong News corpus provided by the Hong Kong SAR of the People's Republic of China (LDC catalog number LDC2000T46). It contains 18,146 pairs of parallel documents in English and Chinese. The texts are not typical news articles. The Hong Kong Newspaper mainly publishes announcements of the local administration and descriptions of municipal events, such as an anniversary of the fire department, or seasonal festivals. We tokenized the corpus to identify headlines and sentence boundaries. For the English text, we used a lemmatizer for nouns and verbs.</Paragraph>
      <Paragraph position="1"> We also segmented the Chinese documents using the tool provided at http://www.mandarintools.com.</Paragraph>
      <Paragraph position="2"> Several steps of the meta evaluation that we performed involved human annotator support. First, we Cluster 2 Meetings with foreign leaders  this experiment.</Paragraph>
      <Paragraph position="3"> asked LDC to build a set of queries (Figure 1). Each of these queries produced a cluster of relevant documents. Twenty of these clusters were used in the experiments in this paper.</Paragraph>
      <Paragraph position="4"> Additionally, we needed manual summaries or extracts for reference. The LDC annotators produced summaries for each document in all clusters. In order to produce human extracts, our judges also labeled sentences with &amp;quot;relevance judgements&amp;quot;, which indicate the relevance of sentence to the topic of the document. The relevance judgements for sentences range from 0 (irrelevant) to 10 (essential). As in (Radev et al., 2000), in order to create an extract of a certain length, we simply extract the top scoring sentences that add up to that length.</Paragraph>
      <Paragraph position="5"> For each target summary length, we produce an extract using a summarizer or baseline. Then we compare the output of the summarizer or baseline with the extract produced from the human relevance judgements. Both the summarizers and the evaluation measures are described in greater detail in the next two sections.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Summarizers and baselines
</SectionTitle>
      <Paragraph position="0"> This section briefly describes the summarizers we used in the evaluation. All summarizers take as input a target length (n%) and a document (or cluster) split into sentences. Their output is an n% extract of the document (or cluster).</Paragraph>
      <Paragraph position="1"> MEAD (Radev et al., 2000): MEAD is a centroid-based extractive summarizer that scores sentences based on sentence-level and inter-sentence features which indicate the quality of the sentence as a summary sentence. It then chooses the top-ranked sentences for inclusion in the output summary. MEAD runs on both English documents and on BIG5-encoded Chinese. We tested the summarizer in both languages. null WEBS (Websumm (Mani and Bloedorn, 2000)): can be used to produce generic and query-based summaries. Websumm uses a graph-connectivity model and operates under the assumption that nodes which are connected to many other nodes are likely to carry salient information.</Paragraph>
      <Paragraph position="2"> SUMM (Summarist (Hovy and Lin, 1999)): an extractive summarizer based on topic signatures. null ALGN (alignment-based): We ran a sentence alignment algorithm (Gale and Church, 1993) for each pair of English and Chinese stories. We used it to automatically generate Chinese &amp;quot;manual&amp;quot; extracts from the English manual extracts we received from LDC.</Paragraph>
      <Paragraph position="3"> LEAD (lead-based): n% sentences are chosen from the beginning of the text.</Paragraph>
      <Paragraph position="4"> RAND (random): n% sentences are chosen at random.</Paragraph>
      <Paragraph position="5"> The six summarizers were run at ten different target lengths to produce more than 100 million summaries (Figure 2). For the purpose of this paper, we only focus on a small portion of the possible experiments that our corpus can facilitate.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="2" type="metho">
    <SectionTitle>
3 Summary Evaluation Techniques
</SectionTitle>
    <Paragraph position="0"> We used three general types of evaluation measures: co-selection, content-based similarity, and relevance correlation. Co-selection measures include precision and recall of co-selected sentences, relative utility (Radev et al., 2000), and Kappa (Siegel and Castellan, 1988; Carletta, 1996). Co-selection methods have some restrictions: they only work for extractive summarizers. Two manual summaries of the same input do not in general share many identical sentences. We address this weakness of co-selection</Paragraph>
    <Paragraph position="2"/>
    <Paragraph position="4"> sentence-based, W = word-based; #dj = number of &amp;quot;docjudges&amp;quot; (ranked lists of documents and summaries).</Paragraph>
    <Paragraph position="5"> Target lengths above 50% are not shown in this table for lack of space. Each run is available using two different retrieval schemes. We report results using the cross-lingual retrievals in a separate paper.</Paragraph>
    <Paragraph position="6"> measures with several content-based similarity measures. The similarity measures we use are word overlap, longest common subsequence, and cosine.</Paragraph>
    <Paragraph position="7"> One advantage of similarity measures is that they can compare manual and automatic extracts with manual abstracts. To our knowledge, no systematic experiments about agreement on the task of summary writing have been performed before. We use similarity measures to measure interjudge agreement among three judges per topic. We also apply the measures between human extracts and summaries, which answers the question if human extracts are more similar to automatic extracts or to human summaries.</Paragraph>
    <Paragraph position="8"> The third group of evaluation measures includes relevance correlation. It shows the relative performance of a summary: how much the performance of document retrieval decreases when indexing summaries rather than full texts.</Paragraph>
    <Paragraph position="9"> Task-based evaluations (e.g., SUMMAC (Mani et al., 2001), DUC (Harman and Marcu, 2001), or (Tombros et al., 1998) measure human performance using the summaries for a certain task (after the summaries are created). Although they can be a very effective way of measuring summary quality, task-based evaluations are prohibitively expensive at large scales. In this project, we didn't perform any task-based evaluations as they would not be appropriate at the scale of millions of summaries.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Evaluation by sentence co-selection
</SectionTitle>
      <Paragraph position="0"> For each document and target length we produce three extracts from the three different judges, which we label throughout as J1, J2, and J3.</Paragraph>
      <Paragraph position="1"> We used the rates 5%, 10%, 20%, 30%, 40% for  most experiments. For some experiments, we also consider summaries of 50%, 60%, 70%, 80% and 90% of the original length of the documents. Figure 3 shows some abbreviations for co-selection that we will use throughout this section.</Paragraph>
      <Paragraph position="2"> 3.1.1 Precision and Recall Precision and recall are defined as:</Paragraph>
      <Paragraph position="4"> extracted by the system and the judges.</Paragraph>
      <Paragraph position="5"> In our case, each set of documents which is compared has the same number of sentences and also the same number of sentences are extracted; thus</Paragraph>
      <Paragraph position="7"> The average precision Pavg(SYSTEM) and recall Ravg(SYSTEM) are calculated by summing over individual judges and normalizing. The average interjudge precision and recall is computed by averaging over all judge pairs.</Paragraph>
      <Paragraph position="8"> However, precision and recall do not take chance agreement into account. The amount of agreement one would expect two judges to reach by chance depends on the number and relative proportions of the categories used by the coders. The next section on Kappa shows that chance agreement is very high in extractive summarization.</Paragraph>
      <Paragraph position="9">  Kappa (Siegel and Castellan, 1988) is an evaluation measure which is increasingly used in NLP annotation work (Krippendorff, 1980; Carletta, 1996). Kappa has the following advantages over P and R: It factors out random agreement. Random agreement is defined as the level of agreement which would be reached by random annotation using the same distribution of categories as the real annotators.</Paragraph>
      <Paragraph position="10"> It allows for comparisons between arbitrary numbers of annotators and items.</Paragraph>
      <Paragraph position="11"> It treats less frequent categories as more important (in our case: selected sentences), similarly to precision and recall but it also considers (with a smaller weight) more frequent categories as well.</Paragraph>
      <Paragraph position="12"> The Kappa coefficient controls agreement P(A) by taking into account agreement by chance P(E) :</Paragraph>
      <Paragraph position="14"> No matter how many items or annotators, or how the categories are distributed, K = 0 when there is no agreement other than what would be expected by chance, and K = 1 when agreement is perfect. If two annotators agree less than expected by chance, Kappa can also be negative.</Paragraph>
      <Paragraph position="15"> We report Kappa between three annotators in the case of human agreement, and between three humans and a system (i.e. four judges) in the next section. null  Relative Utility (RU) (Radev et al., 2000) is tested on a large corpus for the first time in this project. RU takes into account chance agreement as a lower bound and interjudge agreement as an upper bound of performance. RU allows judges and summarizers to pick different sentences with similar content in their summaries without penalizing them for doing so. Each judge is asked to indicate the importance of each sentence in a cluster on a scale from 0 to 10. Judges also specify which sentences subsume or paraphrase each other. In relative utility, the score of an automatic summary increases with the importance of the sentences that it includes but goes down with the inclusion of redundant sentences.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Content-based Similarity measures
</SectionTitle>
      <Paragraph position="0"> Content-based similarity measures compute the similarity between two summaries at a more fine-grained level than just sentences. For each automatic extract S and similarity measure M we compute the following number:</Paragraph>
      <Paragraph position="2"> We used several content-based similarity measures that take into account different properties of the text: Cosine similarity is computed using the follow-</Paragraph>
      <Paragraph position="4"> where X and Y are text representations based on the vector space model.</Paragraph>
      <Paragraph position="5"> Longest Common Subsequence is computed as follows:</Paragraph>
      <Paragraph position="7"> where X and Y are representations based on sequences and where lcs(X;Y) is the length of the longest common subsequence between X and Y, length(X) is the length of the string X, and d(X;Y) is the minimum number of deletion and insertions needed to transform X into Y (Crochemore and Rytter, 1994).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="2" type="sub_section">
      <SectionTitle>
3.3 Relevance Correlation
</SectionTitle>
      <Paragraph position="0"> Relevance correlation (RC) is a new measure for assessing the relative decrease in retrieval performance when indexing summaries instead of full documents.</Paragraph>
      <Paragraph position="1"> The idea behind it is similar to (Sparck-Jones and Sakai, 2001). In that experiment, Sparck-Jones and Sakai determine that short summaries are good substitutes for full documents at the high precision end.</Paragraph>
      <Paragraph position="2"> With RC we attempt to rank all documents given a query.</Paragraph>
      <Paragraph position="3"> Suppose that given a queryQand a corpus of documents Di, a search engine ranks all documents in Di according to their relevance to the query Q. If instead of the corpus Di, the respective summaries of all documents are substituted for the full documents and the resulting corpus of summaries Si is ranked by the same retrieval engine for relevance to the query, a different ranking will be obtained. If the summaries are good surrogates for the full documents, then it can be expected that rankings will be similar.</Paragraph>
      <Paragraph position="4"> There exist several methods for measuring the similarity of rankings. One such method is Kendall's tau and another is Spearman's rank correlation. Both methods are quite appropriate for the task that we want to perform; however, since search engines produce relevance scores in addition to rankings, we can use a stronger similarity test, linear correlation between retrieval scores. When two identical rankings are compared, their correlation is 1. Two completely independent rankings result in a score of 0 while two rankings that are reverse versions of one another have a score of -1. Although rank correlation seems to be another valid measure, given the large number of irrelevant documents per query resulting in a large number of tied ranks, we opted for linear correlation. Interestingly enough, linear correlation and rank correlation agreed with each other. Relevance correlation r is defined as the linear correlation of the relevance scores (x and y) assigned by two different IR algorithms on the same set of documents or by the same IR algorithm on different data sets:</Paragraph>
      <Paragraph position="6"> Herexandy are the means of the relevance scores for the document sequence.</Paragraph>
      <Paragraph position="7"> We preprocess the documents and use Smart to index and retrieve them. After the retrieval process, each summary is associated with a score indicating the relevance of the summary to the query. The relevance score is actually calculated as the inner product of the summary vector and the query vector. Based on the relevance score, we can produce a full ranking of all the summaries in the corpus.</Paragraph>
      <Paragraph position="8"> In contrast to (Brandow et al., 1995) who run 12 Boolean queries on a corpus of 21,000 documents and compare three types of documents (full documents, lead extracts, and ANES extracts), we measure retrieval performance under more than 300 conditions (by language, summary length, retrieval policy for 8 summarizers or baselines).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>