<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0408">
<Title>A Comparison of Rankings Produced by Summarization Evaluation Measures</Title>
<Section position="3" start_page="0" end_page="0" type="intro">
<SectionTitle> 2 Evaluation Measures </SectionTitle>
<Paragraph position="0"> An evaluation measure produces a numerical score which can be used to compare different summaries of the same document. The scores are used to assess summary quality across a collection of test documents in order to produce an average for an algorithm or system. However, it must be emphasized that the scores are most significant when considered per document. For example, two different summaries of a document may have been produced by two different summarization algorithms. Presumably, the summary with the higher score was produced by the better-performing system. Obviously, if one system consistently produces higher scores than another, its average score will be higher, and one has reason to believe that it is the better system. Thus, the important feature of any summary evaluation measure is not the value of its score, but rather the ranking its score imposes on a set of extracts of a document.</Paragraph>
<Paragraph position="1"> To compare two evaluation measures, whose scores may have very different ranges and distributions, one must compare the order in which the measures rank various summaries of a document. For instance, suppose a summary scoring function Y is completely dependent upon the output of another scoring function X, such as Y = 2^X. Since Y is an increasing function of X, both X and Y will produce the same ranking of any set of summaries. However, the scores produced by Y will have a very different distribution than those of X, and the two sets of scores will not be perfectly correlated, since the dependence of Y on X is non-linear. Therefore, in order to compare the scores two different measures assign to a set of summaries, one must compare the ranks they assign, not the actual scores.</Paragraph>
<Paragraph position="2"> The ranks assigned by an evaluation measure produce equivalence classes of extract summaries; each rank equivalence class contains the summaries which received the same score.</Paragraph>
<Paragraph position="3"> When a measure produces the same score for two different summaries of a document, there is a tie, and the equivalence class contains more than one summary. All summaries in an equivalence class must share the same rank; let this rank be the midrank of the range of ranks that would have been assigned if each score were distinct. An evaluation measure should possess the following properties: (i) higher-ranking summaries are more effective, or are of higher quality, than lower-ranking summaries, and (ii) all of the summaries in a rank equivalence class are more or less equally effective.</Paragraph>
<Paragraph position="4"> The following sections contrast the ranking properties of three types of evaluation measures: recall-based measures, a sentence-rank-based measure, and content-based measures. Each type of measure is defined, its properties are described, and its use is explained.</Paragraph>
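As an illustration of the rank-based comparison and the midrank convention described above, the following sketch (plain Python; the function name and the example scores are inventions for this illustration, not taken from the paper) assigns midranks to a set of scores containing ties and shows that the monotone transform Y = 2^X yields exactly the same ranking as X even though the raw scores differ:

```python
# Minimal sketch (not from the paper): midranks for tied scores and the
# effect of the monotone transform Y = 2**X on the resulting ranking.

def midranks(scores):
    """Rank scores from best (1) to worst; tied scores share their midrank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # Extend `end` over the block of summaries that received the same score.
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        # Ranks pos+1 .. end+1 would have been assigned had the scores been
        # distinct; every summary in the tie receives their average (the midrank).
        mid = ((pos + 1) + (end + 1)) / 2.0
        for i in order[pos:end + 1]:
            ranks[i] = mid
        pos = end + 1
    return ranks

x_scores = [0.30, 0.50, 0.50, 0.10, 0.80]   # scores assigned by measure X
y_scores = [2 ** s for s in x_scores]       # Y = 2**X: increasing but non-linear

print(midranks(x_scores))   # [4.0, 2.5, 2.5, 5.0, 1.0]
print(midranks(y_scores))   # [4.0, 2.5, 2.5, 5.0, 1.0]: the same ranking
```

Any rank-correlation statistic computed on these midranks (for example, Spearman's coefficient) would therefore judge X and Y to be in perfect agreement, even though their raw scores are not.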
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.1 Recall-Based Evaluation Measures </SectionTitle>
<Paragraph position="0"> Recall-based evaluation measures are intrinsic. They compare machine-generated summaries with sentences previously extracted by human assessors or judges. From each document, the judges extract the sentences that they believe make up the best extract summary of the document. A summary of a document generated by a summarization algorithm is typically compared to one of these "ground truth" summaries by counting the number of sentences the ground-truth summary and the algorithm's summary have in common. Thus, the more sentences a summary has recalled from the ground truth, the higher its score will be. See work by Goldstein et al. (1999) and Jing et al. (1998) for examples of the use of this measure.</Paragraph>
<Paragraph position="1"> The recall-based measures introduce a bias, since they are based on the opinions of a small number of assessors. It is widely acknowledged (Jing et al., 1998; Kupiec et al., 1995; Voorhees, 1998) that assessor agreement is typically quite low. There are at least two sources of this disagreement. First, it is possible that one human assessor will pick a particular sentence for inclusion in their summary when the content of another sentence or set of sentences is approximately equivalent. Jing et al. (1998) agree: "...precision and recall are not the best measures for computing document quality. This is due to the fact that a small change in the summary output (e.g., replacing one sentence with an equally good equivalent which happens not to match majority opinion [of the assessors]) can dramatically affect a system's score." We call this source of summary disagreement 'disagreement due to synonymy.' Here is an example of two human-generated extracts from the same 1991 Wall Street Journal article which contain different sentences, but still seem to describe the same article about violin playing in a film:</Paragraph>
</Section>
</Section>
</Paper>
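As a supplementary illustration of the recall-based scoring described in Section 2.1, here is a minimal sketch that treats each extract summary as a set of sentence identifiers and applies the usual definition of recall; the representation and the function name are assumptions made for this example and are not taken from the cited papers:

```python
# Minimal sketch (assumed representation): an extract summary is a set of
# sentence identifiers, and recall is measured against one ground-truth extract.

def sentence_recall(system_sentences, ground_truth_sentences):
    """Fraction of ground-truth sentences that also appear in the system extract."""
    if not ground_truth_sentences:
        raise ValueError("the ground-truth extract must contain at least one sentence")
    shared = system_sentences & ground_truth_sentences   # sentences in common
    return len(shared) / len(ground_truth_sentences)

# A judge extracted sentences 2, 5 and 9; the system extracted sentences 2, 5 and 7.
print(sentence_recall({2, 5, 7}, {2, 5, 9}))   # 0.666...: two of the three recalled
```

Representing extracts as sets makes the overlap count independent of sentence order, which matches the sentence-counting description above; it also makes clear why replacing one selected sentence with an equally good equivalent can lower the score.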