<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0408"> <Title>A Comparison of Rankings Produced by Summarization Evaluation Measures</Title> <Section position="4" start_page="0" end_page="73" type="metho"> <SectionTitle> EXTRACT 1: Both Ms. Streisand's film </SectionTitle> <Paragraph position="0"> husband, played by Jeroen Krabbe, and her film son, played by her real son Jason Gould, are, for the purposes of the screenplay, violinists. The actual sound - what might be called a fiddle-over - was produced off camera by Pinchas Zuckerman. The violin program in &quot;Prince of Tides&quot; eliminates the critic's usual edge and makes everyone fall back on his basic pair of ears.</Paragraph>
<Paragraph position="3"> EXTRACT 2: Journalistic ethics forbid me from saying if I think &quot;Prince of Tides&quot; is as good as &quot;Citizen Kane,&quot; but I don't think it's wrong to reveal that the film has some very fine violin playing. But moviegoers will hear Mr. Zuckerman cast off the languor that too often makes him seem like the most bored of great violinists. With each of these pieces, Mr. Zuckerman takes over the movie and shows what it means to play his instrument with supreme dash.</Paragraph>
<Paragraph position="4"> Another source of disagreement can arise from judges' differing opinions about the true focus of the original document. In other words, judges disagree on what the document is about. We call this second source 'disagreement due to focus.' Here is a human-generated extract of the same article which seems to differ in focus: EXTRACT 3: Columbia Pictures has delayed the New York City and Los Angeles openings of &quot;Prince Of Tides&quot; for a week. So Gothamites and Angelenos, along with the rest of the country, will have to wait until Christmas Day to see this film version of the Pat Conroy novel about a Southern football coach (Nick Nolte) dallying with a Jewish female psychotherapist (Barbra Streisand) in the Big Apple. Perhaps the postponement is a sign that the studio is looking askance at this expensive product directed and co-produced by its female lead.</Paragraph>
<Paragraph position="5"> Whatever the source, disagreements at the sentence level are prevalent. This has serious implications for measures based on a single opinion, when a slightly different opinion would result in a significantly different score (and rank) for many summaries.</Paragraph>
<Paragraph position="6"> For example, consider the following three-sentence ground truth extract of a 37-sentence 1994 Los Angeles Times article from the TREC collection. It contains sentences 1, 2 and 13. (1) Clinton Trade Initiative Sinks Under G-7 Criticism. (2) President Clinton came to the high-profile Group of Seven summit to demonstrate new strength in foreign policy but instead watched his premier initiative sink Saturday under a wave of sharp criticism.
(13) The negative reaction to the president's trade proposal came as a jolt after administration officials had built it up under the forward-looking name of &quot;Markets 2000&quot; and had portrayed it as evidence of his interest in leading the other nations to more open trade practices.</Paragraph>
<Paragraph position="7"> An extract that replaces sentence 13 with sentence 5: (5) In its most elementary form, it would have set up a one-year examination of impediments to world trade, but it would have also set an agenda for liberalizing trade rules in entirely new areas, such as financial services, telecommunications and investment.</Paragraph>
<Paragraph position="8"> will receive the same recall score as one which replaces sentence 13 with sentence 32: (32) Most nations have yet to go through this process, which they hope to complete by January.</Paragraph>
<Paragraph position="9"> These two alternative summaries both have the same recall rank, but are obviously of very different quality.</Paragraph>
<Paragraph position="10"> Considered quantitatively, the only important component of either precision or recall is the 'sentence agreement' J, the number of sentences a summary has in common with the ground truth summary. Following Goldstein et al. (1999), let M be the number of sentences in a ground truth extract summary and let K be the number of sentences in a summary to be evaluated. With precision P = J/K and recall R = J/M as usual, and F1 = 2PR/(P + R), elementary algebra shows that F1 = 2J/(M + K). Often, a fixed summary length K is used. (In terms of word count, this represents varying compression rates.) When a particular ground truth of a given document is chosen, then precision, recall and F1 are all constant multiples of J. As such, these measures produce different scores, but the same ranking of all the K-sentence extracts from the document. Since only this ranking is of interest, it is not necessary to examine more than one of P, R and F1.</Paragraph>
<Paragraph position="11"> The sentence agreement J can only take on integer values between 0 and M, so J, P, R, and F1 are all discrete variables. Therefore, although there may be thousands of possible extract summaries of a document, only M + 1 different scores are possible. This will obviously create a large number of ties in rankings produced by the P, R, and F1 scores, and will greatly increase the probability that radically different summaries will be given the same score and rank. On the other hand, two summaries which express the same ideas using different sentences will be given very different scores. Both of these problems are illustrated by the example above. Furthermore, if a particular ground truth includes a large proportion of the document's sentences (perhaps it is a very concise document), shorter summaries will likely include only sentences which appear in the ground truth. Consequently, even a randomly selected collection of sentences might obtain the largest possible score. Thus, recall-based measures are likely to violate both properties (i) and (ii), discussed at the beginning of Section 2. These inherent weaknesses in recall-based measures will be further explored in Section 4.</Paragraph>
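To make the algebra above concrete, here is a minimal sketch of the sentence-agreement computation, assuming summaries are represented as Python sets of sentence indices; the function name and representation are ours, not the paper's.

```python
def agreement_scores(summary, ground_truth):
    """Compute sentence agreement J and the derived P, R, F1 scores.

    summary and ground_truth are sets of sentence indices (extracts).
    """
    J = len(summary & ground_truth)          # sentences in common
    K = len(summary)                         # candidate summary length
    M = len(ground_truth)                    # ground truth length
    P = J / K                                # precision
    R = J / M                                # recall
    F1 = 2 * P * R / (P + R) if J > 0 else 0.0
    # F1 simplifies to 2J/(M + K); for fixed K and a fixed ground truth,
    # P, R and F1 are all constant multiples of J and induce the same ranking.
    return J, P, R, F1

# The TREC example from the text: ground truth {1, 2, 13}; replacing
# sentence 13 with sentence 5 or with sentence 32 gives identical scores.
truth = {1, 2, 13}
print(agreement_scores({1, 2, 5}, truth))
print(agreement_scores({1, 2, 32}, truth))
```

As the two calls show, both replacements yield J = 2 and therefore the same precision, recall and F1, even though the resulting summaries differ greatly in quality.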
<Section position="1" start_page="71" end_page="71" type="sub_section"> <SectionTitle> 2.2 A Sentence-Rank-Based Measure </SectionTitle> <Paragraph position="0"> One way to produce ground truth summaries is to ask judges to rank the sentences of a document in order of their importance in a generic, indicative summary. This is often a difficult task for which it is nearly impossible to obtain consistent results. However, sentences which appear early in a document are often more indicative of the content of the document than are other sentences. This is particularly true in newspaper articles, whose authors frequently try to give the main points in the first paragraph (Brandow et al., 1995). Similarly, adjacent sentences are more likely to be related to each other than to those which appear further away in the text. Thus, sentence position alone may be an effective way to rank the importance of sentences.</Paragraph>
<Paragraph position="1"> To account for sentence importance within a ground truth, a summary comparison measure was developed which treats an extract as a ranking of the sentences of the document. For example, a document with five sentences can be expressed as (1, 2, 3, 4, 5). A particular extract may include sentences 2 and 3. Then if sentence 2 is more important than sentence 3, the sentence ranks are given by (4, 1, 2, 4, 4). Sentences 1, 4, and 5 all rank fourth, since 4 is the midrank of ranks 3, 4 and 5. Such rank vectors can be compared using Kendall's tau statistic (Sheskin, 1997), thus quantifying a summary's agreement with a particular ground truth (a brief sketch of this construction is given at the end of this subsection). As will be shown in Section 4, sentence rank measures result in a smaller number of ties than do recall-based evaluation measures.</Paragraph>
<Paragraph position="2"> Although it is also essentially recall-based, the sentence rank measure has another slight advantage over recall. Suppose a ground truth summary of a 20-sentence document consists of sentences {2, 3, 5}. The machine-generated summaries consisting of sentences {2, 3, 4} and {2, 3, 9} would receive the same recall score, but {2, 3, 4} would receive a higher tau score (5 is closer to 4 than to 9). Of course, this higher score may not be warranted if the content of sentence 9 is more similar to that of sentence 5.</Paragraph>
<Paragraph position="3"> The use of the tau statistic may be more appropriate for ground truths produced by classifying all of the sentences of the original document in terms of their importance to an indicative summary. Perhaps four different categories could be used, ranging from 'very important' to 'not important.' This would allow comparison of a ranking with four equivalence classes (representing the document) to one with just two equivalence classes (representing inclusion in and exclusion from the summary to be evaluated).</Paragraph> </Section>
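The following minimal sketch shows one way the rank vectors described above could be built and compared. It is not the authors' implementation: scipy.stats.kendalltau is used as a stand-in for the tau statistic, and its tie handling (tau-b) is an assumption, since the paper does not specify which variant was used.

```python
from scipy.stats import kendalltau

def rank_vector(extract_order, n_sentences):
    """Turn an ordered extract (most important sentence first) into a rank
    vector over all sentences of the document.  Sentences not selected all
    receive the midrank of the remaining positions, as in the example from
    the text: extract (2, 3) of a 5-sentence document -> (4, 1, 2, 4, 4).
    """
    ranks = [0.0] * n_sentences
    for position, sent in enumerate(extract_order, start=1):
        ranks[sent - 1] = float(position)
    # Midrank of the leftover positions k+1, ..., n is their average.
    midrank = (len(extract_order) + 1 + n_sentences) / 2.0
    return [r if r else midrank for r in ranks]

# Quantify a candidate summary's agreement with a ground truth extract
# of the same 37-sentence document used in the earlier recall example.
truth = rank_vector((1, 2, 13), 37)
candidate = rank_vector((1, 2, 5), 37)
tau, p_value = kendalltau(truth, candidate)
print(rank_vector((2, 3), 5))   # [4.0, 1.0, 2.0, 4.0, 4.0]
print(tau)
```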
<Section position="2" start_page="71" end_page="73" type="sub_section"> <SectionTitle> 2.3 Content-Based Measures </SectionTitle> <Paragraph position="0"> Since indicative summaries alert users to document content, any measure that evaluates the quality of an indicative summary ought to consider the similarity of the content of the summary to the content of the full document. This consideration should be independent of exactly which sentences are chosen for the summary.</Paragraph>
<Paragraph position="1"> The content of the summary need only capture the general ideas of the original document. If human-generated extracts are available, machine-generated extracts may alternatively be evaluated by comparing their contents to these ground truths. This section defines content-based measures by comparing the term frequency (tf) vectors of extracts to tf vectors of the full text or to tf vectors of a ground truth extract. When the texts and summaries are tokenized and token aliases are determined by a thesaurus, summaries that disagree due to synonymy are likely to have similarly distributed term frequencies. Also, summaries which happen to use synonyms appearing infrequently in the text will not be penalized in a summary-to-full-document comparison. Note that term frequencies can always be used to compare an extract with its full text, since the two will always have terms in common, but without a thesaurus or some form of term aliasing, term frequencies cannot be used to compare abstracts with extracts.</Paragraph>
<Paragraph position="4"> The vector space model of information retrieval as described by Salton (1989) uses the inner product of document vectors to measure the content similarity sim(d1, d2) of two documents d1 and d2. Geometrically, this similarity measure gives the cosine of the angle between the two document vectors. Since cos(0) = 1, documents with high cosine similarity are deemed similar. We apply this concept to summary evaluation by computing document-summary content similarity sim(d, s) or ground truth-summary content similarity sim(g, s).</Paragraph>
<Paragraph position="5"> Note that when comparing a summary with its document, a prior human assessment is not necessary. This may serve to eliminate the ambiguity of a human assessor's bias towards certain types of summaries or sentences. However, the scores produced by such evaluation measures cannot be used reliably to compare summaries of drastically different lengths, since a much longer summary is more likely than a short summary to produce a term frequency vector which is similar to the full document's tf vector, despite the normalization of the two vectors. (This contrasts with the bias of recall towards short summaries.) This similarity measure can be enhanced in a number of ways. For example, using term frequency counts for a large corpus of documents, term weighting (such as log-entropy (Dumais, 1991) or tf-idf (Salton, 1989)) can be used to weight the terms in the document and summary vectors. This may improve the performance of the similarity measure by increasing the weights of content-indicative terms and decreasing the weights of those terms that are not indicative of content. It is demonstrated in Section 4 that term weighting caused a significant increase in the correlation of the rankings produced by different ground truths; however, it is not clear that this weighting increases the scores of high quality summaries.</Paragraph>
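For illustration only, here is a minimal sketch of the document-summary cosine similarity described above; plain whitespace tokenization and the toy strings stand in for the paper's token-aliasing step and real documents, and no term weighting is applied.

```python
import math
from collections import Counter

def tf_vector(text):
    """Relative term frequencies; whitespace tokenization stands in for
    the paper's token-aliasing / thesaurus step."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}

def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(w * v2.get(term, 0.0) for term, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# sim(d, s) needs no ground truth; sim(g, s) compares against a judge's extract.
document = "the two banks agreed to merge and the merger was approved"
summary = "the banks agreed to merge"
print(cosine_similarity(tf_vector(document), tf_vector(summary)))
```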
<Paragraph position="6"> There are two potential problems with using the cosine measure to evaluate the performance of a summarization algorithm. First of all, it is likely that the summary vector will be very sparse compared to the document vector, since the summary will probably contain many fewer terms than the document. Second, it is possible that the summary will use key terms that are not used often in the document. For example, a document about the merger of two banks may use the term &quot;bank&quot; frequently, and use the related (yet not exactly synonymous) term &quot;financial institution&quot; only a few times. It is possible that a high quality extract would have a low cosine similarity with the full document if it contained only those few sentences that use the term &quot;financial institution&quot; instead of &quot;bank.&quot; Both of these problems can be addressed with another common tool in information retrieval: latent semantic indexing, or LSI (Deerwester et al., 1990).</Paragraph>
<Paragraph position="7"> LSI is a method of reducing the dimension of the vector space model using the singular value decomposition. Given a corpus of documents, create a term-by-document matrix A where each row corresponds to a term in the document set and each column corresponds to a document. Thus, the columns of A represent all the documents from the corpus, expressed in a particular term-weighting scheme. (In our testing, the document vectors' entries are the relative frequencies of the terms.) Compute the singular value decomposition (SVD) of this matrix (for details see Golub and van Loan (1989)). Retain some number of the largest singular values of A and discard the rest. In general, removing singular values serves as a dimension reduction technique. While the SVD computation may be time-consuming when the corpus is large, it needs to be performed only once to produce a new term-document matrix and a projection matrix. To calculate the similarity of a summary and a document, the summary vector s must also be mapped to this low-dimensional space. This involves computing a vector-matrix product, which can be done quickly.</Paragraph>
<Paragraph position="8"> The effect of using the scaled, reduced-dimension document and summary vectors is two-fold. First, each coordinate of both the document and summary vector will contribute to the overall similarity of the summary to the document (unlike the original vector space model, where only terms that occur in the summary contribute to the cosine similarity score). Second, the purpose of using LSI is to reduce the effect of near-synonymy on the similarity score.</Paragraph>
<Paragraph position="9"> If a term occurs infrequently in the document but is highly indicative of the content of the document, as in the case where the infrequent term is synonymous with a frequent term, the summary will be penalized less in the reduced-dimension model for using the infrequent term than it would be in the original vector space model. This reduction in penalty occurs because LSI essentially averages the weights of terms that co-occur frequently with other terms (both &quot;bank&quot; and &quot;financial institution&quot; often occur with the term &quot;account&quot;). This should improve the accuracy of the cosine similarity measure for determining the quality of a summary of a document.</Paragraph> </Section> </Section>
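As an illustrative sketch only (not the authors' code), the LSI projection and the reduced-dimension cosine could be computed roughly as follows with numpy. The matrix shapes, the retained rank k = 100, and the random stand-in data are assumptions for illustration; in practice the term-by-document matrix would hold the relative term frequencies of the 103-document corpus.

```python
import numpy as np

def lsi_projection(A, k):
    """Build a rank-k LSI projection from a term-by-document matrix A
    (rows = terms, columns = documents)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk = U[:, :k]                    # top-k left singular vectors
    Sk_inv = np.diag(1.0 / s[:k])    # inverse of the retained singular values
    # Standard "folding-in": map any term-space vector x into the
    # k-dimensional latent space via a single vector-matrix product.
    return lambda x: Sk_inv @ (Uk.T @ x)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical corpus: 5000 terms, 103 documents, random stand-in entries.
A = np.random.rand(5000, 103)
project = lsi_projection(A, k=100)      # retain 100 singular values
doc_vec = A[:, 0]                       # tf vector of one document
summary_vec = np.random.rand(5000)      # stand-in for a summary's tf vector
print(cosine(project(doc_vec), project(summary_vec)))
```

The one-time SVD produces the projection; scoring each summary afterwards costs only the cheap vector-matrix product, which matches the paper's observation about where the computational effort lies.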
<Section position="5" start_page="73" end_page="73" type="metho"> <SectionTitle> 3 Experimental Design </SectionTitle> <Paragraph position="0"> This section describes the experiment that tests how well these summary evaluation metrics perform. Fifteen documents from the Text Retrieval Conference (TREC) collection were used in the experiment. These documents are part of a corpus of 103 newspaper articles. Each of the documents was tokenized by a language processing algorithm, which performed token aliasing. In our experiments, the term set was comprised of all the aliases appearing in the full corpus of 103 documents. This corpus was used for the purposes of term weighting. Four expert judges created extract summaries (ground truths) for each of the documents. A list of the 15 documents, along with some of their numerical features, is found in Table 1. The judges were instructed to select as many sentences as were necessary to make an &quot;ideal&quot; indicative extract summary of the document. In terms of the count of sentences in the ground truth, the lengths of the summaries varied from document to document. Ground truth compression rates were generally between 10 and 20 percent. The inter-assessor agreement also varied, but was often quite high. We measured this by calculating the average pairwise recall in the collection of four ground truths.</Paragraph>
<Paragraph position="1"> A suite of summary evaluation measures {Ek} which produce a score for a summary was developed. These measures may depend on none, one, or all of the collection of ground truth summaries {gj}. Measures which do not depend on ground truth compute the summary-document similarity sim(s, d). Content-based measures which depend on a single ground truth gi compute the summary-ground truth similarity sim(s, gi). A measure which depends on all of the ground truths g1, ..., g4 computes a summary's similarity with each ground truth separately and averages these values. Table 2 enumerates the 28 different evaluation measures that were compared in this experiment. Note that the Recall and Kendall measures require a ground truth.</Paragraph>
<Paragraph position="2"> In this study, the measures will be used to evaluate extract summaries of a fixed sentence length K. In all of our tests, K = 3 for reasons of scale which will become clear. A summary length of three sentences represents varying proportions of the number of sentences in the full text document, but this length was usually comparable to the lengths of the human-generated ground truths. For each document, the collection {Sj} was generated. This is the set of all possible K-sentence extracts from the document. If the document has N sentences in total, there will be N choose K = N!/(K!(N - K)!) extracts in the exhaustive collection {Sj}. The focus now is only on the set of all possible summaries and the evaluation measures, and not on any particular summarization algorithm. For each document, each of the measures in {Ek} was used to rank the set {Sj}. (Note that the measures which do not depend on ground truths could, in fact, be used to generate summaries if it were possible to produce and rank the exhaustive set of fixed-length summaries in real time. Despite the authors' access to impressive computing power, the process took several hours for each document!) The next section compares these different rankings of the exhaustive set of extracts for each document.</Paragraph> </Section> </Paper>
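For concreteness, a minimal sketch of the exhaustive generation and ranking of K-sentence extracts described in Section 3; the score callback is a hypothetical stand-in for any one of the 28 measures, and the simple sorted ranking ignores how ties are broken.

```python
from itertools import combinations
from math import comb

def rank_all_extracts(n_sentences, K, score):
    """Enumerate every K-sentence extract of an N-sentence document and
    rank them by a scoring measure (higher score = better rank).

    `score` maps a set of sentence indices to a number; it stands in for
    any evaluation measure E_k in the suite."""
    extracts = [set(c) for c in combinations(range(1, n_sentences + 1), K)]
    assert len(extracts) == comb(n_sentences, K)   # N! / (K! (N - K)!)
    return sorted(extracts, key=score, reverse=True)

# Example with the sentence-agreement measure of Section 2.1 and the
# 37-sentence TREC article: comb(37, 3) = 7770 candidate extracts.
truth = {1, 2, 13}
ranking = rank_all_extracts(37, 3, lambda s: len(s & truth))
print(len(ranking), ranking[0])
```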