<?xml version="1.0" standalone="yes"?>
  <Title>Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 49-56, Ann Arbor, June 2005. c(c)2005 Association for Computational Linguistics Evaluating DUC 2004 Tasks with the QARLA Framework</Title>
  <Section position="7" start_page="54" end_page="55" type="concl">
    <SectionTitle>
6 Conclusions
</SectionTitle>
    <Paragraph position="0"> The application of the QARLA evaluation framework to the DUC testbed provides some useful insights into the problem of evaluating text summarisation systems: * The results show that a combination of similarity metrics behaves better than any metric in isolation. Thebestmetricsetis{Rpre-W,TVM.512}, a combination of content-oriented metrics. Un- null surprisingly, stylistic similarity is less useful for evaluation purposes.</Paragraph>
    <Paragraph position="1"> * The evaluation provided by QARLA correlates well with the rankings provided by DUC human judges. For both tasks, metric sets with higher KING values slightly outperforms the best ROUGE evaluation measure.</Paragraph>
    <Paragraph position="2"> * QARLA measures show that DUC tasks 2 and 5are quitedifferentin nature. In Task5, human summaries are more similar, and the automatic summarisation strategies evaluated are less diverse. null</Paragraph>
  </Section>
</Paper>