<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0406"> <Title>Manual and Automatic Evaluation of Summaries</Title> <Section position="3" start_page="0" end_page="3" type="metho"> <SectionTitle> 2 Document Understanding Conference (DUC) </SectionTitle> <Paragraph position="0"> DUC2001 included three tasks: * Fully automatic single-document summarization: given a document, participants were required to create a generic 100-word summary. The training set comprised 30 sets of approximately 10 documents each, together with their 100-word human written summaries. The test set comprised 30 unseen documents.</Paragraph> <Paragraph position="1"> * Fully automatic multi-document summarization: given a set of documents about a single subject, participants were required to create 4 generic summaries of the entire set, containing 50, 100, 200, and 400 words respectively. The document sets were of four types: a single natural disaster event; a single event; multiple instances of a type of event; and information about an individual. The training set comprised 30 sets of approximately 10 documents, each provided with their 50, 100, 200, and 400word human written summaries. The test set comprised 30 unseen sets.</Paragraph> <Paragraph position="2"> Philadelphia, July 2002, pp. 45-51. Association for Computational Linguistics. Proceedings of the Workshop on Automatic Summarization (including DUC 2002), * Exploratory summarization: participants were encouraged to investigate alternative approaches to evaluating summarization and report their results.</Paragraph> <Paragraph position="3"> A total of 11 systems participated in the single-document summarization task and 12 systems participated in the multi-document task.</Paragraph> <Paragraph position="4"> The training data were distributed in early March of 2001 and the test data were distributed in mid-June of 2001. Results were submitted to NIST for evaluation by July 1 st 2001.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Evaluation Materials </SectionTitle> <Paragraph position="0"> For each document or document set, one human summary was created as the 'ideal' model summary at each specified length. Two other human summaries were also created at each length. In addition, baseline summaries were created automatically for each length as reference points. For the multi-document summarization task, one baseline, lead baseline, took the first 50, 100, 200, and 400 words in the last document in the collection. A second baseline, coverage baseline, took the first sentence in the first document, the first sentence in the second document and so on until it had a summary of 50, 100, 200, or 400 words. Only one baseline (baseline1) was created for the single document summarization task.</Paragraph> </Section> <Section position="2" start_page="0" end_page="3" type="sub_section"> <SectionTitle> 2.2 Summary Evaluation Environment </SectionTitle> <Paragraph position="0"> NIST assessors who created the 'ideal' written summaries did pairwise comparisons of their summaries to the system-generated summaries, other assessors' summaries, and baseline summaries. They used the Summary Evaluation Environment (SEE) 2.0 developed by one of the authors (Lin 2001) to support the process.</Paragraph> <Paragraph position="1"> Using SEE, the assessors compared the system's text (the peer text) to the ideal (the model text). As shown in Figure 1, each text was decomposed into a list of units and displayed in separate windows. 
<Section position="2" start_page="0" end_page="3" type="sub_section">
<SectionTitle> 2.2 Summary Evaluation Environment </SectionTitle>
<Paragraph position="0"> NIST assessors who created the 'ideal' written summaries did pairwise comparisons of their summaries to the system-generated summaries, other assessors' summaries, and baseline summaries. They used the Summary Evaluation Environment (SEE) 2.0, developed by one of the authors (Lin 2001), to support the process.</Paragraph>
<Paragraph position="1"> Using SEE, the assessors compared the system's text (the peer text) to the ideal (the model text). As shown in Figure 1, each text was decomposed into a list of units and displayed in separate windows. In DUC-2001 the sentence was used as the smallest unit of evaluation.</Paragraph>
<Paragraph position="2"> Figure 1. SEE in an evaluation session.</Paragraph>
<Paragraph position="3"> SEE 2.0 provides interfaces for assessors to judge both the content and the quality of summaries. To measure content, assessors step through each model unit, mark all system units sharing content with the current model unit (shown in green highlight in the model summary window), and specify whether the marked system units express all, most, some, or hardly any of the content of the current model unit. To measure quality, assessors rate grammaticality (does the summary observe English grammatical rules independent of its content?), cohesion (do sentences in the summary fit in with their surrounding sentences?), and coherence (is the content of the summary expressed and organized in an effective way?) at five different levels: all, most, some, hardly any, or none.</Paragraph>
<Paragraph position="4"> For example, as shown in Figure 1, an assessor marked system units 1.1 and 10.4 (shown in red underlines) as sharing some content with the current model unit 2.2 (highlighted green).</Paragraph>
</Section>
</Section>
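<!--
Illustrative note (not part of the original paper): a SEE content judgment, as described in Section 2.2, pairs one model unit with the system units marked as sharing its content plus a coverage level. A minimal sketch of such a record follows; the class name and field names are hypothetical, not taken from SEE.

from dataclasses import dataclass, field
from typing import List

# Coverage levels an assessor can assign to the marked system units.
COVERAGE_LEVELS = ("all", "most", "some", "hardly any", "none")

@dataclass
class ContentJudgment:
    model_unit_id: str                                          # e.g. "2.2" in Figure 1
    system_unit_ids: List[str] = field(default_factory=list)    # e.g. ["1.1", "10.4"]
    coverage: str = "none"                                      # one of COVERAGE_LEVELS

# The example from Figure 1: system units 1.1 and 10.4 share "some" content with model unit 2.2.
example = ContentJudgment(model_unit_id="2.2", system_unit_ids=["1.1", "10.4"], coverage="some")
assert example.coverage in COVERAGE_LEVELS
-->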
<Section position="4" start_page="3" end_page="3" type="metho">
<SectionTitle> 3 Evaluation Metrics </SectionTitle>
<Paragraph position="0"> One goal of DUC-2001 was to debug the evaluation procedures and identify stable metrics that could serve as common reference points. NIST did not define any official performance metric in DUC-2001. It released the raw evaluation results to DUC-2001 participants and encouraged them to propose metrics that would help progress the field.</Paragraph>
<Section position="1" start_page="3" end_page="3" type="sub_section">
<SectionTitle> 3.1 Recall, Coverage, Retention and Weighted Retention </SectionTitle>
<Paragraph position="0"> Recall at different compression ratios has been used in summarization research to measure how well an automatic system retains important content of original documents (Mani and Maybury 1999). Assume we have a system summary S_s and a model summary S_m; the compression ratio is the length of a summary (by words or sentences) divided by the length of its original document, and recall is the fraction of the units of S_m that also occur in S_s. Applying this direct all-or-nothing recall in DUC-2001 without modification is not appropriate because:
1. Multiple system units contribute to multiple model units.</Paragraph>
<Paragraph position="1"> 2. Exact overlap between S_s and S_m rarely occurs.</Paragraph>
<Paragraph position="2"> 3. Overlap judgment is not binary. For example, in Figure 1 an assessor judged system units 1.1 and 10.4 as sharing some content with model unit 2.2. Unit 1.1 says &quot;Thousands of people are feared dead&quot; and unit 2.2 says &quot;3,000 and perhaps ... 5,000 people have been killed&quot;. Are &quot;thousands&quot; equivalent to &quot;3,000 to 5,000&quot; or not? Unit 10.4 indicates it was an &quot;earthquake of magnitude 6.9&quot; and unit 2.2 says it was &quot;an earthquake measuring 6.9 on the Richter scale&quot;. Both of them report a &quot;6.9&quot; earthquake. But the second part of system unit 10.4, &quot;in an area so isolated...&quot;, seems to share some content with model unit 4.4, &quot;the quake was centered in a remote mountainous area&quot;. Are these two equivalent? This example highlights the difficulty of judging the content coverage of system summaries against model summaries and the inadequacy of using simple recall as defined.</Paragraph>
<Paragraph position="3"> For this reason, NIST assessors not only marked the segments shared between system units (SU) and model units (MU), but also indicated the degree of match, i.e., all, most, some, hardly any, or none. This enables us to compute weighted recall.</Paragraph>
<Paragraph position="4"> Different versions of weighted recall were proposed by DUC-2001 participants. (McKeown et al. 2001) treated the completeness of coverage as a threshold: 4 for all, 3 for most and above, 2 for some and above, and 1 for hardly any and above. They then proceeded to compare system performances at different threshold levels. They defined recall at threshold t, Recall_t, as follows:
Recall_t = (Number of MUs marked at or above t) / (Total number of MUs in the model summary)
Instead of thresholds, we use here as coverage score the ratio of completeness of coverage C: 1 for all, 3/4 for most, 1/2 for some, 1/4 for hardly any, and 0 for none. To avoid confusion with the recall used in information retrieval, we call our metric weighted retention, Retention_W, and define it as follows:
Retention_W = C * (Number of MUs marked) / (Total number of MUs in the model summary)
If we ignore C (set it to 1), we obtain an unweighted retention. We use both weighted and unweighted retention in our evaluation to illustrate that relative system performance (i.e., system ranking) changes when different evaluation metrics are chosen. Therefore, it is important to have common and agreed-upon metrics to facilitate large-scale evaluation efforts.</Paragraph>
</Section>
</Section>
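<!--
Illustrative note (not part of the original paper): a minimal, self-contained sketch of the two metrics defined in Section 3.1. The threshold encoding (4 = all, 3 = most, 2 = some, 1 = hardly any) and the C mapping (1, 3/4, 1/2, 1/4, 0) follow the text; the per-MU input format, function names, and the worked example are illustrative assumptions.

from typing import List

# Coverage level assigned to each model unit (MU); "none" means the MU was not
# marked as covered by any system unit.
LEVEL_TO_THRESHOLD = {"all": 4, "most": 3, "some": 2, "hardly any": 1, "none": 0}
LEVEL_TO_C = {"all": 1.0, "most": 0.75, "some": 0.5, "hardly any": 0.25, "none": 0.0}

def recall_at_threshold(mu_levels: List[str], t: int) -> float:
    """Recall_t (McKeown et al. 2001): number of MUs marked at or above threshold t,
    divided by the total number of MUs in the model summary."""
    return sum(1 for lv in mu_levels if LEVEL_TO_THRESHOLD[lv] >= t) / len(mu_levels)

def retention(mu_levels: List[str], weighted: bool = True) -> float:
    """Weighted retention: each marked MU contributes its coverage ratio C;
    with weighted=False every marked MU contributes 1 (unweighted retention)."""
    score = sum((LEVEL_TO_C[lv] if weighted else 1.0) for lv in mu_levels if lv != "none")
    return score / len(mu_levels)

# Example: a 4-MU model summary where one MU is fully covered, one partially, two not at all.
levels = ["all", "some", "none", "none"]
print(recall_at_threshold(levels, t=2))  # 0.5: two MUs are marked at "some" or above
print(retention(levels))                 # 0.375: (1 + 0.5) / 4
-->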
<Section position="5" start_page="3" end_page="31" type="metho">
<SectionTitle> 4 Instability of Manual Judgments </SectionTitle>
<Paragraph position="0"> In the human evaluation protocol described in Section 2, nothing prevents an assessor from assigning different coverage scores to the same system units produced by different systems against the same model unit. (Since most systems produce extracts, the same sentence may appear in many summaries, especially for single-document summaries.) Analyzing the DUC-2001 results, we found the following:
* Single-document task
  o A total of 5,921 judgments
  o Among them, 1,076 (18%) contain multiple judgments for the same units
  o 143 (2.4%) of them have three different coverage scores
* Multi-document task
  o A total of 6,963 judgments
  o Among them, 528 (7.6%) contain multiple judgments
  o 27 (0.4%) of them have three different coverage scores
Intuitively this is disturbing; the same phrase compared to the same model unit should always have the same score regardless of which system produced it. The large percentage of multiple judgments found in the single-document evaluation represents test-retest error that needs to be addressed in computing performance metrics. Figure 2 and Figure 3 show the retention scores for systems participating in the single- and multi-document tasks respectively. The error bars are bounded at the top by choosing the maximum coverage score (MAX) assigned by an assessor in the case of multiple judgment scores and at the bottom by taking the minimum assignment (MIN).</Paragraph>
<Paragraph position="2"> We also compute system retentions using the majority (MAJORITY) and average (AVG) of assigned coverage scores.</Paragraph>
<Paragraph position="3"> The original scoring (ORIGINAL) does not consider the instability in the data.</Paragraph>
<Paragraph position="4"> Analyzing all systems' results, we made the following observations.</Paragraph>
<Paragraph position="5"> (1) Inter-human agreement is low in the single-document task (~40%) and even lower in the multi-document task (~29%). This indicates that using a single model as reference summary is not adequate.</Paragraph>
<Paragraph position="6"> (2) Despite the low inter-human agreement, human summaries are still much better than the best performing systems.</Paragraph>
<Paragraph position="7"> (3) The relative performance (rankings) of systems changes when the instability of human judgment is considered. However, the rerankings remain local; systems remain within performance groups. For example, we have the following groups in the multi-document summarization task (Figure 3, considering 0.5% error):
a. {Human1, Human2}
b. {N, T, Y}
c. {Baseline2, L, P}
d. {S}
e. {M, O, R}
f. {Z}
g. {Baseline1, U, W}
The existence of stable performance regions is encouraging. Still, given the large error bars, one can produce 162 different rankings of these 16 systems. Groups are less obvious in the single-document summarization task due to close performance among systems.</Paragraph>
<Paragraph position="8"> Table 1 shows relative performance between systems in the single-document summarization task. A '+' indicates the minimum retention score of x (row) is higher than the maximum retention score of y (column), a '-' indicates the maximum retention score of x is lower than the minimum retention score of y, and a '~' means x and y are indistinguishable. Table 2 shows relative system performance in the multi-document summarization task.</Paragraph>
<Paragraph position="11"> Despite the instability of the manual evaluation, in the next section we discuss automatic summary evaluation in an attempt to approximate the human evaluation results.</Paragraph>
</Section>
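<!--
Illustrative note (not part of the original paper): Section 4 bounds each system's retention by aggregating the multiple coverage scores an assessor may have assigned to the same unit. A minimal sketch of the MIN / MAX / AVG / MAJORITY aggregation used for the error bars follows; the input format (a list of C scores per judged unit), the tie-breaking rule for MAJORITY, and the function names are assumptions.

from collections import Counter
from typing import Dict, List, Tuple

def aggregate_scores(scores: List[float], how: str) -> float:
    """Collapse multiple coverage scores recorded for the same judged unit."""
    if how == "MIN":
        return min(scores)
    if how == "MAX":
        return max(scores)
    if how == "AVG":
        return sum(scores) / len(scores)
    if how == "MAJORITY":
        # Most frequent score; ties broken in favor of the higher score (an assumption).
        counts = Counter(scores)
        return max(counts.items(), key=lambda kv: (kv[1], kv[0]))[0]
    raise ValueError(f"unknown aggregation: {how}")

def retention_bounds(unit_scores: Dict[str, List[float]], total_model_units: int) -> Tuple[float, float]:
    """Lower and upper bounds on retention, as used for the error bars in Figures 2 and 3."""
    lower = sum(aggregate_scores(s, "MIN") for s in unit_scores.values()) / total_model_units
    upper = sum(aggregate_scores(s, "MAX") for s in unit_scores.values()) / total_model_units
    return lower, upper
-->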
<Section position="6" start_page="31" end_page="31" type="metho">
<SectionTitle> 5 Automatic Summary Evaluation </SectionTitle>
<Paragraph position="0"> Inspired by recent progress in the automatic evaluation of machine translation (BLEU; Papineni et al. 2001), we would like to apply the same idea to the evaluation of summaries.</Paragraph>
<Paragraph position="1"> Following BLEU, we used automatically computed accumulative n-gram matching scores (NAMS) between a model unit (MU) and a system summary (S) as a performance indicator, considering multi-document summaries. Only content words were used in forming n-grams, and several weighting configurations were tested to give more credit to longer n-gram matches. To examine the effect of stemmers in helping the n-gram matching, we also tested all configurations with two different stemmers (Lovins and Porter). Figure 4 shows the results with and without using stemmers and their Spearman rank-order correlation coefficients (rho) compared against the original retention ranking from Figure 3. X-nG is configuration n without using any stemmer, L-nG with the Lovins stemmer, and P-nG with the Porter stemmer. The results in Figure 4 indicate that unigram matching provides a good approximation, but the best correlation is achieved using configuration 2 with the Porter stemmer (P-2G). Using stemmers did improve correlation. Notice that rank inversion remains within the performance groups identified in Section 4. For example, the retention ranking of Baseline1, U, and W is 14, 16, and 15 respectively, while the P-2G ranking of these three systems is 15, 14, and 16. The only system crossing performance groups is Y. Y should be grouped with N and T, but the automatic evaluations place it lower, in the group with Baseline2, L, and P. The primary reason for Y's behavior may be that its summaries consist mainly of headlines, whose abbreviated style differs from the language models derived from normal newspaper text.</Paragraph>
<Paragraph position="2"> For comparison, we also ran IBM's BLEU evaluation script over the same model and system summary set. The Spearman rank-order correlation coefficient (rho) for the single-document task is 0.66 using one reference summary and 0.82 using three reference summaries, while Spearman rho for the multi-document task is 0.67 using one reference and 0.70 using three.</Paragraph>
</Section>
</Paper>
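<!--
Illustrative note (not part of the original paper): the core of the automatic evaluation in Section 5 is counting content-word n-gram matches between a model unit and a system summary, optionally after stemming, and then correlating the resulting system ranking with the manual retention ranking via Spearman's rho. The sketch below is a simplified stand-in, not the authors' NAMS code or IBM's BLEU script: the toy stoplist, the equal weighting across n-gram sizes, and the use of NLTK's PorterStemmer are assumptions.

from collections import Counter
from typing import List

from nltk.stem import PorterStemmer  # assumed dependency for stemming

STEMMER = PorterStemmer()
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to", "is", "was"}  # toy stoplist

def content_ngrams(text: str, n: int, stem: bool = True) -> Counter:
    """Counter of content-word n-grams, optionally Porter-stemmed."""
    words = [w.lower() for w in text.split() if w.lower() not in STOPWORDS]
    if stem:
        words = [STEMMER.stem(w) for w in words]
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def ngram_match_score(model_unit: str, system_summary: str, max_n: int = 4) -> float:
    """Accumulate clipped n-gram matches between a model unit and a system summary,
    here weighted equally across n = 1..max_n (the paper tests several weightings)."""
    score = 0.0
    for n in range(1, max_n + 1):
        mu = content_ngrams(model_unit, n)
        peer = content_ngrams(system_summary, n)
        overlap = sum(min(count, peer[gram]) for gram, count in mu.items())
        score += overlap / max(sum(mu.values()), 1)
    return score / max_n

def spearman_rho(rank_a: List[int], rank_b: List[int]) -> float:
    """Spearman rank-order correlation for two rankings without ties."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - (6.0 * d2) / (n * (n * n - 1))
-->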