<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1048">
  <Title>Evaluation challenges in large-scale document summarization</Title>
  <Section position="5" start_page="2" end_page="2" type="evalu">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> This section reports results for the summarizers and baselines described above. We relied directly on the relevance judgements to create &amp;quot;manual extracts&amp;quot; to use as gold standards for evaluating the English systems. To evaluate Chinese, we made use of a table of automatically produced alignments. While the accuracy of the alignments is quite high, we have not thoroughly measured the errors produced when mapping target English summaries into Chinese. This will be done in future work.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.1 Co-selection results
</SectionTitle>
      <Paragraph position="0"> Co-selection agreement (Section 3.1) is reported in Figures 4, and 5). The tables assume human performance is the upper bound, the next rows compare the different summarizers.</Paragraph>
      <Paragraph position="1"> Figure 4 shows results for precision and recall.</Paragraph>
      <Paragraph position="2"> We observe the effect of a dependence of the numerical results on the length of the summary, which is a well-known fact from information retrieval evaluations. null Websumm has an advantage over MEAD for longer summaries but not for 20% or less. Lead summaries perform better than all the automatic summarizers, and better than the human judges.</Paragraph>
      <Paragraph position="3"> This result usually occurs when the judges choose different, but early sentences. Human judgements overtake the lead baseline for summaries of length  20 clusters).</Paragraph>
      <Paragraph position="4"> Figure 5 shows results using Kappa. Random agreement is 0 by definition between a random process and a non-random process.</Paragraph>
      <Paragraph position="5"> While the results are overall rather low, the numbers still show the following trends: MEAD outperforms Websumm for all but the 5% target length.</Paragraph>
      <Paragraph position="6"> Lead summaries perform best below 20%, whereas human agreement is higher after that. There is a rather large difference between the two summarizers and the humans (except for the 5% case for Websumm). This numerical difference is relatively higher than for any other co-selection measure treated here.</Paragraph>
      <Paragraph position="7"> Random is overall the worst performer.</Paragraph>
      <Paragraph position="8"> Agreement improves with summary length.</Paragraph>
      <Paragraph position="9"> Figures 6 and 7 summarize the results obtained through Relative Utility. As the figures indicate, random performance is quite high although all non-random methods outperform it significantly. Further, and in contrast with other co-selection evaluation criteria, in both the single- and multi-document</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.2 Content-based results
</SectionTitle>
      <Paragraph position="0"> The results obtained for a subset of target lengths using content-based evaluation can be seen in Figures 8 and 9. In all our experiments with tf idf-weighted cosine, the lead-based summarizer obtained results close to the judges in most of the target lengths while MEAD is ranked in second position.</Paragraph>
      <Paragraph position="1"> In all our experiments using longest common subsequence, no system obtained better results in the majority of the cases.</Paragraph>
      <Paragraph position="2">  over 10 clusters.</Paragraph>
      <Paragraph position="3"> The numbers obtained in the evaluation of Chinese summaries for cosine and longest common sub-sequence can be seen in Figures 10 and 11. Both measures identify MEAD as the summarizer that produced results closer to the ideal summaries (these results also were observed across measures and text representations).</Paragraph>
      <Paragraph position="4">  Subsequence. Average over 10 clusters. Chinese Words as Text Representation.</Paragraph>
      <Paragraph position="5"> We have based this evaluation on target summaries produced by LDC assessors, although other alternatives exist. Content-based similarity measures do not require the target summary to be a sub-set of sentences from the source document, thus, content evaluation based on similarity measures can be done using summaries published with the source documents which are in many cases available (Teufel and Moens, 1997; Saggion, 2000).</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.3 Relevance Correlation results
</SectionTitle>
      <Paragraph position="0"> We present several results using Relevance Correlation. Figures 12 and 13 show how RC changes depending on the summarizer and the language used.</Paragraph>
      <Paragraph position="1"> RC is as high as 1.0 when full documents (FD) are compared to themselves. One can notice that even random extracts get a relatively high RC score. It is also worth observing that Chinese summaries score lower than their corresponding English summaries.</Paragraph>
      <Paragraph position="2"> Figure 14 shows the effects of summary length and summarizers on RC. As one might expect, longer summaries carry more of the content of the full document than shorter ones. At the same time, the relative performance of the different summarizers remains the same across compression rates.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML