<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0408">
  <Title>A Comparison of Rankings Produced by Summarization Evaluation Measures</Title>
  <Section position="6" start_page="73" end_page="75" type="evalu">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"> One way to compare the different rankings produced by two different evaluation measures is to compute the Spearman rank correlation coefficient.</Paragraph>
    <Paragraph position="2"> [Flattened table data from the original layout; column structure not recoverable.] When two evaluation measures produce nearly the same ranking of the summary set, the rank correlation will be near 1 and a scatterplot of the two rankings will show points nearly lying on a line with slope 1. When there is little correlation between two rankings, the statistic will be near 0 and the scatterplot will appear to have randomly distributed points. A negative correlation indicates that one ranking often reverses the rankings of the other; in this case a rank scatterplot will show points nearly lying on a line with negative slope. Table 3 compares the Spearman correlations of the rankings produced by specific pairs of ground truths. The first row contains the correlations of two highly similar ground truth extracts of document 14. Both of these extracts consisted of three sentences; two of the sentences were common to both extracts. Not surprisingly, the correlation is high regardless of which measure produced the rankings. The second row demonstrates an increase (across the row) in correlation between rankings produced by two different ground truth summaries of document 8. These two ground truths did not disagree in focus, but did disagree due to synonymy -- they contain just one sentence in common. In general, the correlation among the rankings produced by synonymous ground truths was increased most by using the SVD content-based comparison. Figure 1 illustrates the correlation increase graphically for this pair of ground truths. By contrast, the third row of Table 3 displays a decrease (across the row) in correlation between rankings produced by two different ground truths. In this case, the two ground truths disagreed in focus: they are Extracts 2 and 3 contrasted in Section 2.1. 
Again, the correlation among the rankings produced by the four ground truths was decreased most by using a weighted content-based comparison such as tf-idf or SVD. These patterns were typical for rankings produced by ground truths which differed in focus, allaying the fear that applying the SVD weighting would produce correlated rankings based on any two ground truths. Of course, the lack of correlation among recall-based rankings whenever ground truths did not contain exactly the same sentences implies that a different collection of extracts would rank highly if one ground truth were replaced with the other. This effect would surely carry through to system averages across a set of documents. To exemplify the size of this effect, for each document, the summaries which scored highest using one ground truth were scored (using recall) against a second ground truth. With the first ground truths, these high-scoring summaries averaged over 75% recall; using the second ground truths, the same summaries averaged just over 25% recall. Thus, by simply changing judges, an automatic system which produced these summaries would appear to have a very different success rate. This disparity is lessened when content-based measures are used, but the outcomes are still disparate.</Paragraph>
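The Spearman statistic used throughout this comparison can be sketched in a few lines of Python. This is an illustrative implementation (Pearson correlation of rank vectors, with average ranks assigned to ties), not the authors' code:

```python
from math import sqrt

def ranks(scores):
    """Convert raw scores to ranks (1 = highest), averaging tied groups."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # extend j to cover the whole group tied with position i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied group
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sqrt(sum((x - ma) ** 2 for x in ra))
    sb = sqrt(sum((y - mb) ** 2 for y in rb))
    return cov / (sa * sb)
```

Identical rankings give 1, fully reversed rankings give -1, matching the interpretation of the scatterplots described above.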
    <Paragraph position="3"> Evidence suggests that the content-based measures which do not rely on a ground truth may be an acceptable substitute for those which do. Over the set of 15 documents, the average within-document inter-assessor correlation is 0.61 using term frequency, 0.72 using tf-idf, and 0.67 using SVD. The average correlation of the ground-truth-dependent measures with those that perform summary-document comparisons is 0.48 using term frequency, 0.70 using tf-idf, and 0.56 using SVD. This means that, on average, the rankings based on single ground truths are only slightly more correlated with each other than they are with the rankings that do not depend on any ground truth.</Paragraph>
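A summary-document comparison of the kind described here scores a summary by the cosine similarity of its weighted term vector against the document's vector, so no ground truth is needed. The following is a minimal sketch using a smoothed idf weight, log(1 + N/df); the paper's exact term-weighting and SVD variants differ, and this is only an assumed illustration:

```python
from collections import Counter
from math import log, sqrt

def tfidf_vectors(docs):
    """Sparse tf-idf vectors for a small collection, using a smoothed
    idf = log(1 + N/df) so shared terms keep a positive weight."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # document frequency: one count per document
    return [{t: tf * log(1 + n / df[t]) for t, tf in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse term vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Ranking candidate summaries by `cosine(summary_vec, document_vec)` yields a ground-truth-free ordering of the summary set, which is what the inter-measure correlations above evaluate.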
    <Paragraph position="4"> As noted in Section 2.1, the recall-based measures exhibit unfavorable scoring properties. Figure 2 shows the histogram of scores assigned to the exhaustive summary set for document 14 by five different measures. Each of these measures was based on the same ground truth summary of this document, which contained four sentences. Clearly, the measures based on a more sophisticated method have a much greater ability to discriminate between summaries. By contrast, the recall metric can assign one of only four scores to a length-3 summary, determined by how many of the four ground truth sentences it contains. Elementary combinatorics shows that 4 extracts will receive the highest possible score (and thus will rank first), 126 summaries will rank second, 840 summaries will rank third, and 1330 summaries will rank last (with a score of 0). This accounts for all 2300 of the three-sentence extracts that are possible. It seems very unlikely that all of the second-ranking summaries are equally effective. The histogram depicting this distribution is shown at the top of Figure 2. It is followed by the histograms for the Kendall metric and the content-based metrics using term frequency, tf-idf, and SVD-weighted vectors, respectively. The tf-idf and SVD weighted measures produced a very fine distribution of scores, particularly near the top of the range. That is, these metrics are able to distinguish between different high-scoring summaries. These patterns in the score histograms were typical across the 15 documents.</Paragraph>
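The combinatorial counts above can be checked directly. The stated total of 2300 possible three-sentence extracts implies document 14 has 25 sentences, since C(25, 3) = 2300 (the sentence count itself is inferred, not stated); the number of length-3 extracts containing exactly k of the 4 ground truth sentences is then C(4, k) * C(21, 3 - k):

```python
from math import comb

def recall_rank_sizes(doc_sents, gt_sents, summ_sents):
    """For k = summ_sents down to 0, count the summ_sents-sentence extracts
    of a doc_sents-sentence document that contain exactly k of the gt_sents
    ground truth sentences: C(gt, k) * C(doc - gt, summ - k)."""
    return [comb(gt_sents, k) * comb(doc_sents - gt_sents, summ_sents - k)
            for k in range(summ_sents, -1, -1)]
```

Evaluating `recall_rank_sizes(25, 4, 3)` reproduces the four rank-group sizes 4, 126, 840, and 1330, confirming the coarseness of the recall metric's score distribution.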
  </Section>
</Paper>