<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0907">
  <Title>Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 49-56, Ann Arbor, June 2005. (c) 2005 Association for Computational Linguistics. Evaluating DUC 2004 Tasks with the QARLA Framework</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> QARLA (Amigó et al., 2005) is a framework that uses similarity to models as a building block for the evaluation of automatic summarisation systems.</Paragraph>
    <Paragraph position="1"> The input of QARLA is a summarisation task, a set of test cases, a set of similarity metrics, and, for each test case, a set of models and a set of automatic summaries (peers). Given such a testbed, QARLA provides: * A measure, QUEEN, which combines assorted similarity metrics to estimate the quality of automatic summarisers.</Paragraph>
    <Paragraph position="2"> * A measure, KING, to select the best combination of similarity metrics.</Paragraph>
    <Paragraph position="3"> * An estimation, JACK, of the reliability of the testbed for evaluation purposes.</Paragraph>
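The QUEEN measure above can be illustrated with a minimal sketch. It assumes the probabilistic definition given by Amigó et al. (2005): QUEEN is the probability, over triples of model summaries, that the peer is at least as similar to a model as two models are to each other, simultaneously under every metric. The function names and the use of numeric stand-ins for summaries are illustrative assumptions, not part of the original framework implementation.

```python
import itertools

def queen(peer, models, metrics):
    """Sketch of QUEEN (assumed definition): the fraction of triples
    (m, m1, m2) of models for which the peer is at least as similar
    to m as m1 is to m2, under every similarity metric at once."""
    triples = [(m, m1, m2)
               for m in models
               for m1, m2 in itertools.combinations(models, 2)]
    if not triples:
        return 0.0
    hits = sum(
        1 for m, m1, m2 in triples
        if all(x(peer, m) >= x(m1, m2) for x in metrics)
    )
    return hits / len(triples)

# Toy usage: summaries stand in as numbers, similarity as closeness.
sim = lambda a, b: 1.0 - abs(a - b)
score = queen(0.5, [0.4, 0.5, 0.6], [sim])
```

A peer that sits among the models under every metric scores near 1; a peer far from all models scores near 0.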
    <Paragraph position="4"> The QARLA framework does not rely on human judges. It is interesting, however, to find out how well an evaluation using QARLA correlates with human judges, and whether QARLA can provide additional insights into an evaluation based on human assessments.</Paragraph>
    <Paragraph position="5"> In this paper, we apply the QARLA framework (QUEEN, KING and JACK measures) to the output of two different evaluation exercises: DUC 2004 tasks 2 and 5 (Over and Yen, 2004). Task 2 requires short (one-hundred word) summaries for assorted document sets; Task 5 consists of generating a short summary in response to a &quot;Who is&quot; question. In Section 2, we summarise the QARLA evaluation framework; in Section 3, we describe the similarity metrics used in the experiments. Section 4 discusses the results of the QARLA framework using such metrics on the DUC testbeds. Finally, Section 5 draws some conclusions.</Paragraph>
  </Section>
</Paper>