<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1035">
  <Title>QARLA: A Framework for the Evaluation of Text Summarization Systems</Title>
  <Section position="2" start_page="0" end_page="280" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The quality of an automatic summary can be established mainly with two approaches: Human assessments: The output of a number of summarisation systems is compared by human judges, using some set of evaluation guidelines.</Paragraph>
    <Paragraph position="1"> Proximity to a gold standard: The best automatic summary is the one that is closest to some reference summary made by humans.</Paragraph>
    <Paragraph position="2"> Using human assessments has some clear advantages: the results of the evaluation are interpretable, and we can trace what a system is doing well, and what is doing poorly. But it also has a couple of serious drawbacks: i) different human assessors reach different conclusions, and ii) the outcome of a comparative evaluation exercise is not directly reusable for new techniques, i.e., a summarisation strategy developed after the comparative exercise cannot be evaluated without additional human assessments made from scratch.</Paragraph>
    <Paragraph position="3"> Proximity to a gold standard, on the other hand, is a criterion that can be automated (see Section 6), with the advantages of i) being objective, and ii) once gold standard summaries are built for a comparative evaluation of systems, the resulting test-bed can iteratively be used to refine text summarisation techniques and re-evaluate them automatically. null This second approach, however, requires solving a number of non-trivial issues. For instance, (i) How can we know whether an evaluation metric is good enough for automatic evaluation?, (ii) different users produce different summaries, all of them equally good as gold standards, (iii) if we have several metrics which test different features of a summary, how can we combine them into an optimal test?, (iv) how do we know if our test bed  is reliable, or the evaluation outcome may change by adding, for instance, additional gold standards? In this paper, we introduce a probabilistic framework, QARLA, that addresses such issues.</Paragraph>
    <Paragraph position="4"> Given a set of manual summaries and another set of baseline summaries per task, together with a set of similarity metrics, QARLA provides quantitative measures to (i) select and combine the best (independent) metrics (KING measure), (ii) apply the best set of metrics to evaluate automatic summaries (QUEEN measure), and (iii) test whether evaluating with that test-bed is reliable (JACK measure).</Paragraph>
  </Section>
class="xml-element"></Paper>