File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/05/w05-0901_metho.xml
Size: 27,975 bytes
Last Modified: 2025-10-06 14:09:58
<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0901"> <Title>A Methodology for Extrinsic Evaluation of Text Summarization: Does ROUGE Correlate?</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Background </SectionTitle> <Paragraph position="0"> In the past, assessments of usefulness involved a wide range of both intrinsic and extrinsic (task-based) measures (Sparck-Jones and Gallier, 1996). Intrinsic evaluations focus on coherence and informativeness (Jing et al., 1998) and often involve quality comparisons between automatic summaries and reference summaries that are pre-determined to be of high quality. Human intrinsic measures determine quality by assessing document accuracy, fluency, and clarity. Automatic intrinsic measures such as ROUGE use n-gram scoring to produce rankings of summarization methods.</Paragraph> <Paragraph position="1"> Extrinsic evaluations concentrate on the use of summaries in a specific task, e.g., executing instructions, information retrieval, question answering, and relevance assessments (Mani, 2001). In relevance assessments, a user reads a topic or event description and judges the relevance of a document to the topic/event based solely on its summary.3 Relevance assessments have been used in many large-scale extrinsic evaluations, e.g., SUMMAC (Mani et al., 2002) and the Document Understanding Conference (DUC) (Harman and Over, 2004). The task chosen for such evaluations must support a very high degree of interannotator agreement, i.e., consistent relevance decisions across subjects with respect to a predefined gold standard.</Paragraph> <Paragraph position="2"> Unfortunately, a consistent gold standard has not yet been reported. For example, in two previous studies (Mani, 2001; Tombros and Sanderson, 1998), users' judgments were compared to &quot;gold standard judgments&quot; produced by members of the University of Pennsylvania's Linguistic Data Consortium. Although these judgments were supposed to represent the correct relevance judgments for each of the documents associated with an event, both studies reported that annotators' judgments varied greatly and that this was a significant issue for the evaluations. In the SUMMAC experiments, the Kappa score (Carletta, 1996; Eugenio and Glass, 2004) for interannotator agreement was reported to be 0.38 (Mani et al., 2002). In fact, large variations have been found between an individual participant's initial summary scoring and a subsequent scoring made a few weeks later (Mani, 2001; van Halteren and Teufel, 2003).</Paragraph> <Paragraph position="3"> This paper attempts to overcome the problem of interannotator inconsistency by measuring summary effectiveness in an extrinsic task using a much more consistent form of user judgment instead of a gold standard. Using Relevance-Prediction increases the confidence in our results and strengthens the statistical statements we can make about the benefits of summarization.</Paragraph> <Paragraph position="4"> The next section describes an alternative approach to measuring task-based usefulness, where the usage of external judgments as a gold standard is replaced by the user's own decisions on the full text. Following the lead of earlier evaluations (Oka and Ueda, 2000; Mani et al., 2002; Sakai and Sparck-Jones, 2001), we focus on relevance assessment as our extrinsic task. 3A topic is an event or activity, along with all directly related events and activities. An event is something that happens at some specific time and place, and the unavoidable consequences.</Paragraph> </Section>
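For readers who want to see how an interannotator agreement statistic like the Kappa score cited above behaves, here is a minimal sketch of Cohen's Kappa for two annotators making binary relevance judgments. This is an illustration only; the function and the sample judgments are ours and are not taken from the SUMMAC study.

```python
from typing import Sequence

def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Chance-corrected agreement between two annotators' binary relevance judgments."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence, from each annotator's marginal rates.
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    return (observed - expected) / (1 - expected)

if __name__ == "__main__":
    annotator1 = [True, True, False, True, False, False, True, False, True, True]
    annotator2 = [True, False, False, True, False, True, True, False, False, True]
    print(f"kappa = {cohens_kappa(annotator1, annotator2):.2f}")  # 0.40 on this toy data
```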
<Section position="5" start_page="0" end_page="3" type="metho"> <SectionTitle> 3 Evaluation of Usefulness of Summaries </SectionTitle> <Paragraph position="0"> We define a new extrinsic measure of task-based usefulness called Relevance-Prediction, where we compare a summary-based decision to the subject's own full-text decision rather than to a different subject's decision. Our findings differ from those of the SUMMAC evaluation (Mani et al., 2002) in that we use Relevance-Prediction as an alternative to comparison against a gold standard, a more realistic agreement measure for assessing usefulness in a relevance assessment task. For example, users performing browsing tasks must examine document surrogates, but open the full text only if they expect the document to be interesting to them. They are not trying to decide if the document will be interesting to someone else.</Paragraph> <Paragraph position="1"> To determine the usefulness of summarization, we focus on two questions: * Can users make judgments on summaries that are consistent with their full-text judgments? * Can users make judgments on summaries more quickly than on full document text? First we describe the Relevance-Prediction measure for determining whether users can make accurate judgments with a summary. Following this, we describe our experiments and results using this measure, including the timing results of summaries compared to full documents.</Paragraph> <Section position="1" start_page="0" end_page="2" type="sub_section"> <SectionTitle> 3.1 Relevance-Prediction Measure </SectionTitle> <Paragraph position="0"> To answer the first question above, we define a measure called Relevance-Prediction, where subjects build their own &quot;gold standard&quot; based on the full-text documents. Agreement is measured by comparing subjects' surrogate-based judgments against their own judgments on the corresponding texts. A subject's judgment is assigned a value of 1 if his/her surrogate judgment is the same as the corresponding full-text judgment, and 0 otherwise. These values are summed over all judgments for a surrogate type and divided by the total number of judgments for that surrogate type to determine the effectiveness of the associated summary method.</Paragraph> <Paragraph position="1"> Formally, given a summary/document pair (s,d), if subjects make the same judgment on s that they did on d, we say j(s,d) = 1. If subjects change their judgment between s and d, we say j(s,d) = 0. Given a set of summary/document pairs DS_i associated with event i, the Relevance-Prediction score is computed as follows:</Paragraph> <Paragraph position="2"> \[ \text{Relevance-Prediction}(i) = \frac{1}{|DS_i|} \sum_{(s,d) \in DS_i} j(s,d) \] </Paragraph> <Paragraph position="3"> This approach provides a more reliable comparison mechanism than gold standard judgments provided by other individuals. Specifically, Relevance-Prediction is more helpful in illuminating the usefulness of summaries for a real-world scenario, e.g., a browsing environment, where credit is given when an individual subject would choose (or reject) a document under both conditions. To our knowledge, this subject-driven approach to testing usefulness has never before been used.</Paragraph>
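For illustration, here is a minimal Python sketch of the Relevance-Prediction computation defined above. The data layout and function name are assumptions made for this example; they are not taken from the study.

```python
from typing import List, Tuple

# Hypothetical layout: for one event, a list of (surrogate_judgment, fulltext_judgment)
# pairs for a single subject, where each judgment is True (relevant) or False (not).
JudgmentPairs = List[Tuple[bool, bool]]

def relevance_prediction(pairs: JudgmentPairs) -> float:
    """Fraction of surrogate judgments that match the subject's own full-text judgment."""
    if not pairs:
        raise ValueError("no summary/document pairs for this event")
    matches = sum(1 for surrogate, fulltext in pairs if surrogate == fulltext)
    return matches / len(pairs)

if __name__ == "__main__":
    # 20 documents for one event: the subject's surrogate judgment matches his/her
    # own full-text judgment on 16 of them, giving a score of 0.8.
    example: JudgmentPairs = ([(True, True)] * 9 + [(False, False)] * 7 +
                              [(True, False)] * 2 + [(False, True)] * 2)
    print(f"Relevance-Prediction = {relevance_prediction(example):.2f}")
```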
</Section> <Section position="2" start_page="2" end_page="3" type="sub_section"> <SectionTitle> 3.2 Experiment Design </SectionTitle> <Paragraph position="0"> Ten human subjects were recruited to evaluate full-text documents and two summary types.4 The original text documents were taken from the Topic Detection and Tracking 3 (TDT-3) corpus (Allan et al., 1999), which contains news stories and headlines, topic and event descriptions, and a mapping between news stories and their related topics and/or events. Although the TDT-3 collection contains transcribed speech documents, our investigation was restricted to documents that were originally text, i.e., newspaper or newswire, not broadcast news.</Paragraph> <Paragraph position="1"> For our experiment we selected three distinct events and related document sets5 from TDT-3. For each event, the subjects were given a description of the event (written by the LDC) and then asked to judge the relevance of a set of 20 documents associated with that event (using three different presentation types to be discussed below).</Paragraph> <Paragraph position="2"> The events used from the TDT data set were world news events occurring in 1998. It is possible that the subjects had some prior knowledge about the events, yet we believe that this would not affect their ability to complete the task. Subjects' background knowledge of an event can also make this task more similar to real-world browsing tasks, in which subjects are often familiar with the event or topic they are searching for.</Paragraph> <Paragraph position="3"> The 20 documents were retrieved by a search engine.</Paragraph> <Paragraph position="4"> We used a constrained subset where exactly half (10) were judged relevant by the LDC annotators. Because all 20 documents were somewhat similar to the event, this approach ensured that our task would be more difficult than it would be if we had chosen documents from completely unrelated events (where the choice of relevance would be obvious even from a poorly written summary).</Paragraph> <Paragraph position="5"> Each document was pre-annotated with the headline associated with the original newswire source. These headlines were used as the first summary type. We refer to them as HEAD (Headline Surrogate). The average length of the HEAD surrogates was 53 characters. In addition, we commissioned human-generated summaries6 of each document as the second summary type; we refer to this as HUM (Human Surrogate). The average length of the HUM surrogates was 72 characters. 6The summary writers were asked to produce a summary no greater than 75 characters for each specified full text document. The summaries were not compared for writing style or quality.</Paragraph> <Paragraph position="6"> Although neither of these summaries was produced automatically, our experiment allowed us to focus on the question of summary usefulness and to learn about the differences in presentation style as a first step toward experimentation with the output of automatic summarization systems.</Paragraph> <Paragraph position="7"> Two main factors were measured: (1) differences in judgments for the three presentation types (HEAD, HUM, and the full-text document) and (2) judgment time.</Paragraph> <Paragraph position="8"> Each subject made a total of 60 judgments for each presentation type since there were 3 distinct events and 20 documents per event.
To facilitate the analysis of the data, the subjects' judgments were constrained to two possibilities, relevant or not relevant.7 Although the HEAD and HUM surrogates were both produced by humans, they differed in style. The HEAD surrogates were shorter than the HUM surrogates by 26%. Many of these were &quot;eye-catchers&quot; designed to entice the reader to examine the entire document (i.e., purchase the newspaper); that is, the HEAD surrogates were not intended to stand in the place of the full document.</Paragraph> <Paragraph position="9"> By contrast, the writers of the HUM surrogates were instructed to write text that conveyed what happened in the full document. We observed that the HUM surrogates used more words and phrases extracted from the full documents than the HEAD surrogates.</Paragraph> <Paragraph position="10"> Experiments were conducted using a web browser (Internet Explorer) on a PC in the presence of the experimenter. Subjects were given written and verbal instructions for completing their task and were asked to make relevance judgments on a practice event set. The judgments from the practice event set were not included in our experimental results or used in our analyses. The written instructions were given to aid subjects in determining requirements for relevance. For example, in an Election event documents describing new people in office, new public officials, change in governments or parliaments were suggested as evidence for relevance.</Paragraph> <Paragraph position="11"> Each of the ten subjects made judgments on 20 documents for each of three different events. After reading each document or summary, the subjects clicked on a radio button corresponding to their judgment and clicked a submit button to move to the next document description. Subjects were not allowed to move to the next summary/document until a valid selection was made. No backing up was allowed. Judgment time was computed as the number of seconds it took the subject to read the full text document or surrogate, comprehend it, compare it to the event description, and make a judgment (timed up until the subject clicked the submit button).</Paragraph> <Paragraph position="12"> 7If we allowed subjects to make additional judgments such as somewhat relevant, this could possibly encourage subjects to always choose this when they were the least bit unsure. Previous experiments indicate that this additional selection method may increase the level of variability in judgments (Zajic et al., 2004).</Paragraph> </Section> <Section position="3" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 3.3 Order of Document/Surrogate Presentation </SectionTitle> <Paragraph position="0"> One concern with our evaluation methodology was the issue of possible memory effects or priming: if the same subjects saw a summary and a full document about the same event, their answers might be tainted. Thus, prior to the full experiment, we conducted pre-experiments (using 4 participants) with an extreme form of influence: we presented the summary and full text in immediate succession. In these experiments, we compared two document presentation approaches, termed &quot;Drill Down&quot; and &quot;Complete Set.&quot; In the &quot;Drill Down&quot; document presentation approach all three presentation types were shown for each document, in sequence: first a single HEAD surrogate, followed by the corresponding HUM surrogate, followed by the full text document. 
This process was repeated 10 times.</Paragraph> <Paragraph position="1"> In the &quot;Complete Set&quot; document-presentation approach we presented the complete set of documents using one surrogate type, followed by the complete set using another surrogate type, and so on. That is, the 10 HEAD surrogates were displayed all at once, followed by the corresponding 10 HUM surrogates, followed by the corresponding 10 full-text documents.</Paragraph> <Paragraph position="2"> The results indicated that there was almost no effect between the two document-presentation approaches. The performance varied only slightly and neither approach consistently allowed subjects to perform better than the other. Therefore, we determined that the subjects were not associating a given summary with its corresponding full-text documents. This may be due, in part, to the fact that all 20 documents were related to the event--and according to the LDC relevance judgments half of these were actually about the same event.</Paragraph> <Paragraph position="3"> Given that the variations were insignificant in these pre-experiments, we selected only the Complete-Set approach (no Drill-Down) for the full experiment. However, we still needed to vary the ordering for the two surrogate presentation types associated with each full-text document. Thus, each 20-document set was divided in half for each subject. In the first half, the subject saw the first 10 documents as: (1) HEAD surrogates, then HUM surrogates and then the full-text document; or (2) HUM surrogates, then HEAD surrogates, and then the full-text document. In the second half, the subject saw the alternative ordering, e.g., if a subject saw HEAD surrogates before HUM surrogates in the first half, he/she saw the HUM surrogates before HEAD surrogates for the second half. Either way, the full-text document was always shown last so as not to introduce judgment effects associated with reading the entire document before either surrogate type.</Paragraph> <Paragraph position="4"> In addition to varying the ordering for the surrogate type, the ordering of the surrogates and full documents within the events were also varied. The subjects were grouped in pairs, and each pair viewed the surrogates and documents in a different order than the other pairs.</Paragraph> </Section> <Section position="4" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 3.4 Experimental Hypotheses </SectionTitle> <Paragraph position="0"> We hypothesized that the summaries would allow subjects to achieve a Relevance-Prediction rate of 70-90%.</Paragraph> <Paragraph position="1"> Since these summaries were significantly shorter than the original document text, we expected that the rate would not be 100% compared to the judgments made on the full document text. However, we expected higher than a 50% ratio, i.e., higher than that of random judgments on all of the surrogates. We also expected high performance because the meaning of the original document text is best preserved when written by a human (Mani, 2001).</Paragraph> <Paragraph position="2"> A second hypothesis is that the HEAD surrogates would yield a significantly lower agreement rate than that of the HUM surrogates. Our commissioned HUM surrogates were written to stand in place of the full document, whereas the HEAD surrogates were written to catch a reader's interest. 
This suggests that the HEAD surrogates might not provide as informative a description of the original documents as the HUM surrogates.</Paragraph> <Paragraph position="3"> We also tested a third hypothesis: that our Relevance-Prediction measure would be more reliable than that of the LDC-Agreement method used for SUMMAC-style evaluations (thus providing a more stable framework for evaluating summarization techniques). LDC-Agreement compares a subject's judgment on a surrogate or full text against the &quot;correct&quot; judgments as assigned by the TDT corpus annotators (Linguistic Data Consortium 2001).</Paragraph> <Paragraph position="4"> Finally, we tested the hypothesis that using a text summary for judging relevance would take considerably less time than using the corresponding full-text document.</Paragraph> </Section> </Section> <Section position="6" start_page="3" end_page="4" type="metho"> <SectionTitle> 4 Experimental Results </SectionTitle> <Paragraph position="0"> Table 1 shows the subjects' judgments using both Relevance-Prediction and LDC-Agreement for each of three events. Using our Relevance-Prediction measure, the HUM surrogates yield averages between 79% and 86%, with an overall average of 81%, thus confirming our first hypothesis.</Paragraph> <Paragraph position="1"> However, we failed to confirm our second hypothesis. The HEAD Relevance-Prediction rates were between 71% and 82%, with an overall average of 76%, which was lower than the rates for HUM, but the difference was not statistically significant. It appeared that subjects were able to make consistent relevance decisions from the non-extractive HEAD surrogates, even though these were shorter and less informative than the HUM surrogates.</Paragraph> <Paragraph position="2"> A closer look reveals that the HEAD summaries sometimes contained enough information to judge relevance, yielding almost the same number of true positives (and true negatives) as the HUM summaries. For example, a document about the formation of a coalition government to avoid violence in Cambodia has the HEAD surrogate Cambodians hope new government can avoid past mistakes. By contrast, the HUM surrogate for this same event was Rival parties to form a coalition government to avoid violence in Cambodia. Although the HEAD surrogate uses words that do not appear in the original document (hope and mistakes), the subject may infer the relevance of this surrogate by relating hope to the notion of forming a coalition government and mistakes to violence.</Paragraph> <Paragraph position="3"> On the other hand, we found that the lower degree of informativeness of HEAD surrogates gave rise to over 50% more false negatives than the HUM summaries. This statistically significant difference will be discussed further in Section 6.</Paragraph> <Paragraph position="4"> As for our third hypothesis, Table 1 illustrates a substantial difference between the two agreement measures. For each of the three events, the Relevance-Prediction rate is at least five percent higher than that of the LDC-Agreement approach, with an average of 8.8% increase for the HEAD summary and a 13.3% average increase for the HUM summary. The average rates across events show a statistically significant difference between LDC-Agreement and Relevance-Prediction for both HUM summaries with p<0.01 and HEAD summaries with p<0.05. This significance was determined through use of a single factor ANOVA statistical analysis. 
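For readers who want to reproduce this kind of check, here is a minimal sketch of a single-factor (one-way) ANOVA using SciPy. The per-subject rates below are invented for illustration; the study's exact data layout for the test is not specified beyond what is stated above.

```python
from scipy import stats

# Hypothetical per-subject agreement rates for the HUM surrogates under the
# two measures being compared (one value per subject, 10 subjects).
relevance_prediction = [0.85, 0.80, 0.78, 0.86, 0.82, 0.79, 0.83, 0.81, 0.80, 0.84]
ldc_agreement        = [0.70, 0.68, 0.66, 0.72, 0.69, 0.65, 0.71, 0.67, 0.66, 0.70]

# Single-factor ANOVA: does the agreement measure (the factor) explain the
# variation in the observed rates?
f_stat, p_value = stats.f_oneway(relevance_prediction, ldc_agreement)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```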
The higher Relevance-Prediction rate supports our statement that this approach provides a more stable framework for evaluating different summarization techniques.</Paragraph> <Paragraph position="5"> Finally, the average timing results shown in Table 1 confirm our fourth hypothesis. The subjects took 4-5 seconds (on average) to make judgments on both the HEAD and HUM summaries, as compared to about 13.4 seconds to make judgments on full-text documents. That is, subjects took almost 3 times longer to make judgments on full-text documents than on the summaries (HEAD and HUM). This finding is not surprising since text summaries are an order of magnitude shorter than full-text documents.</Paragraph> </Section> <Section position="7" start_page="4" end_page="6" type="metho"> <SectionTitle> 5 Correlation with Intrinsic Evaluation Metric: ROUGE </SectionTitle> <Paragraph position="0"> We now turn to the task of correlating our extrinsic task performance with scores produced by an intrinsic evaluation measure. We used the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric, version 1.2.1. In previous studies (Dorr et al., 2004), ROUGE was shown to have a very low correlation with the LDC-Agreement results of the extrinsic task. This was attributed to low interannotator agreement in the gold standard. Our goal was to test whether our new Relevance-Prediction technique would allow us to obtain higher correlations with ROUGE.</Paragraph> <Section position="1" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 5.1 Extrinsic Agreement Data </SectionTitle> <Paragraph position="0"> To reduce the effect of outliers on the correlation between ROUGE and the human judgments, we averaged over all judgments for each subject (20 judgments x 3 events) to produce 60 data points. These data points were then partitioned into partitions of size 1, 2, or 4. (Partitions of size four have 15 data points, partitions of size two have 30 data points, and partitions of size one have 60 data points per subject--or a total of 600 data points across all 10 subjects.) To ensure that the correlation did not depend on a specific partition, we repeated this same process using 10,000 different (randomly generated) partitions for each of the three partition sizes.</Paragraph> <Paragraph position="1"> Partitioned data points of size four provided a high degree of noise reduction without compromising the size of the data set (15 points). Larger partition sizes would result in too few data points and compromise the statistical significance of our correlation results. To show the variation within a single partition, we used the size-4 partitioning with the smallest mean square error on the human headlines (compared to the other partitionings) as a representative partition. For this representative partitioning, the individual data points P1-P15 of that partition are shown for each of the two agreement measures in Tables 2 and 3. This shows that, across partitions, the maximum and minimum Relevance-Prediction rates for HEAD (93% and 60%) are higher than the corresponding LDC-Agreement rates (85% and 50%). The same trend is seen with the HUM surrogates: a Relevance-Prediction maximum of 98% and minimum of 68%, versus an LDC-Agreement maximum of 88% and minimum of 55%.</Paragraph> </Section>
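To make the partitioning procedure concrete, here is a minimal Python sketch, under the reading that each size-k partition of a subject's data points is averaged into a single data point and that the random partitioning is redrawn many times. Variable names and the toy data are ours, not the study's.

```python
import random
from statistics import mean
from typing import List

def averaged_partition_points(points: List[float], size: int, rng: random.Random) -> List[float]:
    """Randomly split `points` into partitions of `size` and average within each partition."""
    assert len(points) % size == 0
    shuffled = points[:]
    rng.shuffle(shuffled)
    return [mean(shuffled[i:i + size]) for i in range(0, len(shuffled), size)]

if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-in for one subject's 60 per-document agreement values (0 or 1).
    raw = [float(rng.random() < 0.8) for _ in range(60)]
    # The study repeats the random partitioning 10,000 times for each size;
    # fewer draws are used here. Size 4 yields 15 averaged data points per draw.
    for size in (1, 2, 4):
        draws = [averaged_partition_points(raw, size, rng) for _ in range(1000)]
        print(f"partition size {size}: {len(draws[0])} data points per draw")
```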
<Section position="2" start_page="4" end_page="5" type="sub_section"> <SectionTitle> 5.2 Intrinsic ROUGE Score </SectionTitle> <Paragraph position="0"> To correlate the partitioned agreement scores above with our intrinsic measure, we first ran ROUGE on all 120 surrogates in our experiment (i.e., the HUM and HEAD surrogates for each of the 60 event/document pairs) and then averaged the ROUGE scores for all surrogates belonging to the same partitions (for each of the three partition sizes). These partitioned ROUGE values were then used for detecting correlations with the corresponding partitioned agreement scores described above.</Paragraph> <Paragraph position="1"> Table 4 shows the ROUGE scores, based on 3 reference summaries per document, for partitions P1-P15 used in the previous tables.8 For brevity, we include only the ROUGE-1 scores. The pattern of scores is consistent with our statements earlier about the difference between non-extractive &quot;eye-catchers&quot; and informative headlines. Because ROUGE measures whether a particular summary has the same words (or n-grams) as a reference summary, a more constrained choice of words (as found in the extractive HUM surrogates) makes it more likely that the summary would match the reference. 8We commissioned a total of 180 human-generated reference summaries (3 for each of 60 documents), in addition to the human-generated summaries used in the experiment.</Paragraph> <Paragraph position="2"> A summary in which the word choice is less constrained--as in the non-extractive HEAD surrogates--is less likely to share n-grams with the reference. Thus, we may see non-extractive summaries that have almost identical meanings, but very different words. This raises the concern that ROUGE may be sensitive to the style of summarization that is used.</Paragraph> <Paragraph position="3"> Section 6 discusses this point further.</Paragraph> </Section> <Section position="3" start_page="5" end_page="6" type="sub_section"> <SectionTitle> 5.3 Intrinsic and Extrinsic Correlation </SectionTitle> <Paragraph position="0"> To test whether ROUGE correlates more highly with Relevance-Prediction than with LDC-Agreement, we calculated the Pearson correlation for the results of both techniques: \[ \rho = \frac{\sum_i (r_i - \bar{r})(s_i - \bar{s})}{\sqrt{\sum_i (r_i - \bar{r})^2} \sqrt{\sum_i (s_i - \bar{s})^2}} \] where r_i is the ROUGE score of surrogate i, \bar{r} is the average ROUGE score of all data points, s_i is the agreement score of summary i (using Relevance-Prediction or LDC-Agreement), and \bar{s} is the average agreement score. Pearson's statistic is commonly used in summarization and machine translation evaluation; see, e.g., (Lin, 2004; Lin and Och, 2004).</Paragraph> <Paragraph position="1"> As one might expect, there is some variability in the correlation between ROUGE and human judgments for the different partitions. However, the boxplots for both HEAD and HUM indicate that the first and third quartiles were relatively close to the median (see Figure 1).</Paragraph> <Paragraph position="2"> Table 5 shows the Pearson correlations with ROUGE-1 using Relevance-Prediction and LDC-Agreement. For Relevance-Prediction, we observed a positive correlation for both surrogate types, with a slightly higher correlation for HEAD than HUM. For LDC-Agreement, we observed no correlation (or a minimally negative one) with ROUGE-1 scores, for both the HEAD and HUM surrogates. The highest correlation was observed for Relevance-Prediction on HEAD.</Paragraph>
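As an illustration of the two quantities being related here, the sketch below computes a simplified ROUGE-1-style unigram recall against multiple reference summaries and the Pearson correlation between ROUGE scores and agreement scores. This is our own simplified code, not the ROUGE 1.2.1 toolkit or the authors' implementation, and the toy numbers are invented; the example surrogate and references echo the Cambodia example discussed earlier.

```python
import math
from collections import Counter
from typing import List

def rouge1_recall(candidate: str, references: List[str]) -> float:
    """Simplified ROUGE-1: best unigram recall of the candidate over the references."""
    cand = Counter(candidate.lower().split())
    best = 0.0
    for ref in references:
        ref_counts = Counter(ref.lower().split())
        overlap = sum(min(cand[w], ref_counts[w]) for w in ref_counts)
        total = sum(ref_counts.values())
        best = max(best, overlap / total if total else 0.0)
    return best

def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation between paired scores (e.g., ROUGE-1 vs. agreement rates)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

if __name__ == "__main__":
    refs = ["rival parties form coalition government in cambodia",
            "cambodia coalition government formed to avoid violence"]
    print(rouge1_recall("cambodians hope new government can avoid past mistakes", refs))
    rouge_scores = [0.42, 0.35, 0.51, 0.28, 0.44]  # toy partition-level ROUGE-1 values
    agreement    = [0.80, 0.73, 0.87, 0.67, 0.80]  # toy partition-level agreement rates
    print(f"Pearson r = {pearson(rouge_scores, agreement):.2f}")
```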
<Paragraph position="3"> We conclude that ROUGE correlates more highly with the Relevance-Prediction measurement than with the LDC-Agreement measurement, although we should add that none of the correlations in Table 5 were statistically significant at p < 0.05. The low LDC-Agreement scores are consistent with previous studies, where poor correlations were attributed to low interannotator agreement rates.</Paragraph> </Section> </Section> </Paper>