<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0408"> <Title>A Comparison of Rankings Produced by Summarization Evaluation Measures</Title> <Section position="7" start_page="75" end_page="77" type="concl"> <SectionTitle> 5 Conclusions and Future Work </SectionTitle> <Paragraph position="0"> There is wide variation in the rankings produced by recall scores from non-identical ground truths. This difference in scores is reflected in averages computed across documents. The low inter-assessor correlation of ranks based on recall measures is distressing, and indicates that these measures cannot be effectively used to compare the performance of summarization systems. Measures which gauge content similarity produce more highly correlated rankings whenever ground truths do not disagree in focus.</Paragraph> <Paragraph position="1"> Content-based measures assign different rankings when ground truths do disagree in focus. In addition, these measures provide a finer-grained score with which to compare summaries.</Paragraph> <Paragraph position="2"> Moreover, the content-based measures which rely on a ground truth are only slightly more correlated to each other than they are to the measures which perform summary-document comparisons. This suggests that the effectiveness of summarization algorithms could be measured without the use of human judges. Since the cosine measure is easy to calculate, feedback on summary quality can be almost instantaneous. The properties of these content-based measures need to be further investigated. For example, it is not clear that content-based measures satisfy properties (i) and (ii), discussed in Section 2. Also, while they do produce far fewer ties than either recall or tau, such a fine distinction in summary quality is probably not justified.
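As a minimal sketch of the summary-document cosine comparison discussed above (the function name and whitespace tokenization here are illustrative assumptions, not the authors' implementation):

```python
from collections import Counter
import math

def cosine(text_a, text_b):
    """Cosine similarity between simple term-frequency vectors.

    Illustrative sketch: tokenizes on whitespace and lowercases;
    the paper's actual preprocessing may differ.
    """
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    # Dot product over the terms of one vector (zero counts contribute nothing).
    dot = sum(va[t] * vb[t] for t in va)
    # Product of the two vector norms.
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0
```

Because the score needs only term counts from the summary and the source document, no human judgments are required at evaluation time, which is what makes near-instantaneous feedback plausible.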
When human-generated ground truths are available, perhaps some combination of recall and the content-based measures could be used.</Paragraph> <Paragraph position="3"> For instance, whenever recall is not perfect, the content of the non-overlapping sentences could be compared with the missed ground truth sentences. Also, the effects of compression rate, summary length, and document style are not known.</Paragraph> <Paragraph position="4"> The authors are currently performing further experiments to see if users prefer summaries that rank highly with content-based measures over other summaries. Also, the outcomes of extrinsic evaluation techniques will be compared with each of these scoring methods. In other words, do the high-ranking summaries help users to perform various tasks better than lower-ranking summaries do?</Paragraph> </Section> </Paper>