<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0907">
  <Title>Evaluating DUC 2004 Tasks with the QARLA Framework</Title>
  <Section position="3" start_page="0" end_page="50" type="metho">
    <SectionTitle>
2 The QARLA evaluation framework
</SectionTitle>
    <Paragraph position="0"> QARLA uses similarity to models for the evaluation of automatic summarisation systems. Here we summarise its main features; the reader may refer to (Amig'o et al., 2005) for details.</Paragraph>
    <Paragraph position="1"> The input of the framework is: * A summarisation task (e.g. topic oriented, informative multi-document summarisation on a given domain/corpus).</Paragraph>
    <Paragraph position="2"> * A set T of test cases (e.g. topic/document set pairs for the example above) * A set of summaries M produced by humans (models), and a set of automatic summaries A (peers), for every test case.</Paragraph>
    <Paragraph position="3"> * A set X of similarity metrics to compare summaries. null With this input, QARLA provides three main measures that we describe below.</Paragraph>
    <Section position="1" start_page="49" end_page="49" type="sub_section">
      <SectionTitle>
2.1 QUEEN: Estimating the quality of an automatic summary
</SectionTitle>
      <Paragraph position="0"> automatic summary QUEEN operates under the assumption that a summary is better if it is closer to the model summaries according to all metrics; it is defined as the probability, measured onM xM xM, that for every metric in X the automatic summary a is closer to a model than two models to each other:</Paragraph>
      <Paragraph position="2"> where a is the automatic summary being evaluated, &lt;m,mprime,mprimeprime&gt; are three models in M, and x(a,m) stands for the similarity ofmtoa. QUEEN is stated as a probability, and therefore its range of values is [0,1].</Paragraph>
      <Paragraph position="3"> We can think of the QUEEN measure as using a set of tests (every similarity metric in X) to falsify the hypothesis that a given summary a is a model.</Paragraph>
      <Paragraph position="4"> Given&lt;a,m,mprime,mprimeprime&gt; ,wetestx(a,m) [?] x(mprime,mprimeprime) for each metric x. a is accepted as a model only if it passes the test for every metric. QUEEN(a) is, then, the probability of acceptance for a in the sample space M xM xM.</Paragraph>
      <Paragraph position="5"> This measure has some interesting properties: (i) it is able to combine different similarity metrics into a single evaluation measure; (ii) it is not affected by the scale properties of individual metrics, i.e. it does not require metric normalisation and it is not affected by metric weighting. (iii) Peers which are very far from the set of models all receive QUEEN=0. Inotherwords, QUEENdoesnotdistinguish between very poor summarisation strategies. (iv) The value of QUEEN is maximised for peers that &amp;quot;merge&amp;quot; with the models under all metrics inX. (v) The universal quantifier on the metric parameter x implies that adding redundant metrics do not bias the result of QUEEN.</Paragraph>
      <Paragraph position="6"> Now the question is: which similarity metrics are adequate to evaluate summaries? Imagine that we use a similarity metric based on sentence coselection; it might happen that humans do not agree on which sentences to select, and therefore emulating their sentence selection behaviour is both easy (nobody agrees with each other) and useless. We need to take into account which are the features that human summaries do share, and evaluate according to them. This is provided by the KING measure.</Paragraph>
    </Section>
    <Section position="2" start_page="49" end_page="49" type="sub_section">
      <SectionTitle>
2.2 KING: Estimating the quality of similarity metrics
</SectionTitle>
      <Paragraph position="0"> metrics The measure KINGM,A(X) estimates the quality of a set of similarity metrics X using a set of models M and a set of peers A. KING is defined as the probability that a model has higher QUEEN value than any peer in a test sample. Formally:</Paragraph>
      <Paragraph position="2"> For example, an ideal metric -that puts all models together-would give QUEEN(m) = 1 for all models, and QUEEN(a) = 0 for all peers which are not put together with the models, obtaining KING = 1.</Paragraph>
      <Paragraph position="3"> KING satisfies several interesting properties: (i) KING does not depend on the scale properties of the metric; (ii) Adding repeated or very similar peers do not alter the KING measure, which avoids one way of biasing the measure. (iii) the KING value of random and constant metrics is zero or close to zero.</Paragraph>
    </Section>
    <Section position="3" start_page="49" end_page="50" type="sub_section">
      <SectionTitle>
2.3 JACK: reliability of the peer set
</SectionTitle>
      <Paragraph position="0"> Once we detect a difference in quality between two summarisation systems, the question is now whether this result is reliable. Would we get the same results using a different test set (different examples, different human summarisers (models) or different base-line systems)? The first step is obviously to apply statistical significance tests to the results. But even if they give a positive result, it might be insufficient. The problem is that the estimation of the probabilities in KING assumes that the sample sets M,A are not biased.</Paragraph>
      <Paragraph position="1"> If M,A are biased, the results can be statistically significant and yet unreliable. The set of examples and the behaviour of human summarisers (models) should be somehow controlled either for homogeneity (if the intended profile of examples and/or users is narrow) or representativity (if it is wide). But how to know whether the set of automatic summaries is representative and therefore is not penalising certain automatic summarisation strategies? This is addressed by the JACK measure:</Paragraph>
      <Paragraph position="3"> i.e. theprobabilityoverallmodelsummariesmof finding a couple of automatic summariesa,aprime which are closer to m than to each other according to all metrics. This measure satisfies three desirable properties: (i) it can be enlarged by increasing the similarity of the peers to the models (the x(m,a) factor in the inequalities), i.e. enhancing the quality of the peer set; (ii) it can also be enlarged by decreasing the similarity between automatic summaries (the x(a,aprime) factor in the inequality), i.e. augmenting the diversity of (independent) automatic summarisation strategies represented in the test bed; (iii) adding elements to A cannot diminish the JACK value, because of the existential quantifier on a,aprime.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="50" end_page="52" type="metho">
    <SectionTitle>
3 Selection of similarity metrics
</SectionTitle>
    <Paragraph position="0"> Each different similarity metric characterises different features of a summary. Our first objective is to select the best set of metrics, that is, the metrics which bestcharacterise thehuman summaries(models) as opposed to automatic summaries. The second objective is to obtain as much information as possible about the behaviour of automatic summaries.</Paragraph>
    <Paragraph position="1"> In this Section, we begin by describing a set of 59 metrics used as a starting point. Some of them provide overlapping information; the second step is then to select a subset of metrics that minimises redundancy and, at the same time, maximises quality (KING values). Finally, we analyse the characteristics of the selected metrics.</Paragraph>
    <Section position="1" start_page="50" end_page="51" type="sub_section">
      <SectionTitle>
3.1 Similarity metrics
</SectionTitle>
      <Paragraph position="0"> For this work, we have considered the following similarity metrics: ROUGE based metrics (R): ROUGE (Lin and Hovy, 2003) estimates the quality of an automatic summary on the basis of the n-gram coverage related to a set of human summaries (models). Although ROUGE is an evaluation metric, we can adapt it to behave as a similarity metric between pairs of summaries if we consider only one model in the computation. There are different kinds of ROUGE metrics such as ROUGE-W, ROUGE-L, ROUGE1, ROUGE-2, ROUGE-3, ROUGE-4, etc. (Lin, 2004b). Each of these metrics has been applied over summaries with three preprocessing options: with stemming and stopword removal (type c); only with stopwords removal (type b); or without any kind of preprocessing (type a).</Paragraph>
      <Paragraph position="1"> All these combinations give 24 similarity metrics based on ROUGE.</Paragraph>
      <Paragraph position="2"> Inverted ROUGE based metrics (Rpre): ROUGE metrics are recall oriented. If we reverse the directionofthesimilaritycomputation, weobtain precision oriented metrics (i.e. Rpre(a,b) = R(b,a)). In this way, we generate another 24 metrics based on inverted ROUGE.</Paragraph>
      <Paragraph position="3"> TruncatedVectModel (TVMn): This family of metrics compares the distribution of the n most relevant terms from original documents in the summaries. The process is the following: (1) obtaining thenmost frequent lemmas ignoring stopwords; (2) generating a vector with the relative frequency of each term in the summary; (3) calculating the similarity between two vectors as the inverse of the Euclidean distance. We have used 9 variants of this measure with n = 1,4,8,16,32,64,128,256,512.</Paragraph>
      <Paragraph position="4"> AveragedSentencelengthSim (AVLS): This is a very simple metric that compares the average length of the sentences in two summaries. It can be useful to compare the degree of abstraction of the summaries.</Paragraph>
      <Paragraph position="5"> GRAMSIM: This similarity metric compares the distribution of the part-of-speech tags in the two summaries. The processing is the following: (1) part-of-speech tagging of summaries using TreeTagger ; (2) generation of a vector with the tags frequency for each summary; (3) calculation of the similarity between two vectors as the inverse of the Euclidean distance. This similarity metric is not content oriented, but syntax-oriented.</Paragraph>
    </Section>
    <Section position="2" start_page="51" end_page="51" type="sub_section">
      <SectionTitle>
3.2 Clustering similarity metrics
</SectionTitle>
      <Paragraph position="0"> From the set of metrics described above we have 57 (24+24+9) content oriented metrics, plus two metrics based on stylistic features (AVLS and GRAM-SIM). However, the 57 metrics characterising summary contents are highly redundant. Thus, clustering similar metrics seems desirable.</Paragraph>
      <Paragraph position="1"> We perform an automatic clustering process using the following notion of proximity between two</Paragraph>
      <Paragraph position="3"> where H(X) [?] [?]x [?] X.x(a,m) [?] x(mprime,mprimeprime) Two metrics sets are similar, according to the formula, if they behave similarly with respect to the QUEEN condition (H predicate in the formula), i.e. the probability that the two sets of metrics discriminate the same automatic summaries when they are compared to the same pair of models.</Paragraph>
      <Paragraph position="4"> Figure 1 shows the clustering of similarity metrics for the DUC 2004 Task 2. The number of clusters was fixed in 10. After the clustering process, the 48 ROUGE metrics are grouped in 7 sets, and the 9 TVM metrics are grouped in 3 sets. In each cluster, the metric with highest KING has been marked in boldface. Note that the ROUGE-c metrics (with  stemming)withhighestKINGarethosebasedonrecall whereas the ROUGE-a/b metrics (without stemming)arethosebasedonprecision. RegardingTVM clusters, themetricswithhighestKINGineachcluster are those based on a higher number of terms. Finally, we select the metric with highest KING in each group, obtaining the 10 most representative metrics.</Paragraph>
    </Section>
    <Section position="3" start_page="51" end_page="52" type="sub_section">
      <SectionTitle>
3.3 Best evaluation metric: KING values
</SectionTitle>
      <Paragraph position="0"> task 2, and 23% better for task 5). This is an interesting result confirming that we can improve our ability to characterise human summaries just by combining standard similarity metrics in the QARLA framework. Note also that both metrics in the best set are content-oriented.</Paragraph>
      <Paragraph position="1"> * Rpre-W.1.2.b (inverted ROUGE measure, using non-contiguous word sequences, removing stopwords, without stemming) obtains the highest individual KING for task 2, and is one of the best in task 5, confirming that ROUGE-based metrics are a robust way of evaluating summaries, and indicating that non-contiguous word sequences can be more useful for evaluation purposes than n-grams.</Paragraph>
      <Paragraph position="2">  * TVM metrics get higher values when considering more terms (TVM.512), confirming that comparing with just a few terms (e.g. TVM.4) is not informative enough.</Paragraph>
      <Paragraph position="3"> * Overall, KING values are higher for task 5, suggesting that there is more agreement between human summaries in topic-oriented tasks.</Paragraph>
    </Section>
    <Section position="4" start_page="52" end_page="52" type="sub_section">
      <SectionTitle>
3.4 Reliability of the results
</SectionTitle>
      <Paragraph position="0"> The JACK measure estimates the reliability of QARLA results, and is correlated with the diversity of automatic summarisation strategies included in thetestbed. Inprinciple, thelargerthenumberofautomatic summaries, the higher the JACK values we should obtain. The important point is to determine when JACK values tend to stabilise; at this point, it is not useful to add more automatic summaries without introducing new summarisation strategies.</Paragraph>
      <Paragraph position="1"> Figure 3 shows how JACKRpre-W,TVM.512 values grow when adding automatic summaries. For more than 10 systems, JACK values grow slower in both tasks. Absolute JACK values are higher in Task 2 than in task 5, indicating that systems tend to produce more similar summaries in Task 5 (perhaps becauseitisatopic-orientedtask). Thisresultsuggests that we should incorporate more diverse summarisation strategies in Task 5 to enhance the reliability of the testbed for evaluation purposes with QARLA.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="52" end_page="54" type="metho">
    <SectionTitle>
4 Evaluation of automatic summarisers:
QUEEN values
</SectionTitle>
    <Paragraph position="0"> The QUEEN measure provides two kinds of information to compare automatic summarisation systems: which are the best systems -according to the best metric set-, and which are the individual features of every automatic summariser -according to individual similarity metrics-.</Paragraph>
    <Section position="1" start_page="52" end_page="52" type="sub_section">
      <SectionTitle>
4.1 System ranking
</SectionTitle>
      <Paragraph position="0"> The best metric combination for both tasks was {Rpre-W,TVM.512}; therefore, our global system evaluation uses this combination of content-oriented metrics. Figure 4 shows the QUEEN{Rpre-W,TVM.512} values for each participating system in DUC 2004, also including the model summaries. As expected, model summaries obtain the highest QUEEN values in both DUC tasks, with a significant distance with respect to the automatic summaries.</Paragraph>
    </Section>
    <Section position="2" start_page="52" end_page="54" type="sub_section">
      <SectionTitle>
4.2 Correlation with human judgements
</SectionTitle>
      <Paragraph position="0"> The manual ranking generated in DUC is based on a set of human-produced evaluation criteria, whereas the QARLA framework gives more weight to the aspects that characterise model summaries as opposed to automatic summaries. It is interesting, however, to find out whether both evaluation methodologies are correlated. Indeed, this is the case: the Pearson correlation between manual and QUEEN rankings is 0.92 for the Task 2 and 0.96 for the Task 5.</Paragraph>
      <Paragraph position="1"> Of course, QUEEN values depend on the chosen metric set X; it is also interesting to check whether  metrics with higher KING values lead to QUEEN rankings more similar to human judgements. Figure 5 shows the Pearson correlation between manual and QUEEN rankings for 1024 metric combinations with different KING values. The figure confirms that higher KING values are associated with rankings closer to human judgements.</Paragraph>
    </Section>
    <Section position="3" start_page="54" end_page="54" type="sub_section">
      <SectionTitle>
4.3 Stylistic features
</SectionTitle>
      <Paragraph position="0"> The best metric combination leaves out similarity metrics based on stylistic features. It is interesting, however, to see how automatic summaries behave with respect to this kind of features. Perhaps the most remarkable fact about stylistic similarities is that, in the case of the GRAMSIM metric, task 2 and task 5 exhibit a rather different behaviour (see Figure 6). In task 2, systems merge with the models, while in task 5 the QUEEN values of the systems are inferior to the models. This suggests that there is some stylistic component in models that systems are not capturing in the topic-oriented task.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="54" end_page="54" type="metho">
    <SectionTitle>
5 Related work
</SectionTitle>
    <Paragraph position="0"> The methodology which is closest to our framework is ORANGE (Lin, 2004a), which evaluates a similarity metric using the average ranks obtained by reference items within a baseline set. As in our framework, ORANGE performs an automatic meta-evaluation, there is no need for human assessments, and it does not depend on the scale properties of the metric being evaluated (because changes of scale preserve rankings). The ORANGE approach is, indeed, intimately related to the original QARLA measure introduced in (Amigo et al., 2004).</Paragraph>
    <Paragraph position="1"> There are several approaches to the automatic evaluation of summarisation and Machine Translation systems (Culy and Riehemann, 2003; Coughlin, 2003). Probably the most significant improvement over ORANGE is the ability to combine automatically the information of different metrics. Our impression is that a comprehensive automatic evaluation of a summary must necessarily capture different aspects of the problem with different metrics, and that the results of every individual checking (metric) should not be combined in any prescribed algebraic way (such as a linear weighted combination). Our framework satisfies this condition.</Paragraph>
    <Paragraph position="2"> ORANGE, however, has also an advantage over the QARLA framework, namely that it can be used for evaluation metrics which are not based on similarity between model/peer pairs. For instance, ROUGE can be applied directly in the ORANGE framework without any reformulation.</Paragraph>
  </Section>
class="xml-element"></Paper>