File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/n04-1019_intro.xml
Size: 2,639 bytes
Last Modified: 2025-10-06 14:02:18
<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1019"> <Title>Evaluating Content Selection in Summarization: The Pyramid Method</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Evaluating content selection in summarization has proven to be a difficult problem. Our approach acknowledges the fact that no single best model summary exists, and takes this as a foundation rather than an obstacle. In machine translation, the rankings from the automatic BLEU method (Papineni et al., 2002) have been shown to correlate well with human evaluation; the method has since been widely used and has even been adapted for summarization (Lin and Hovy, 2003). To show that an automatic method is a reasonable approximation of human judgments, one needs to demonstrate that such judgments can be reliably elicited.</Paragraph> <Paragraph position="1"> However, in contrast to translation, where the evaluation criterion can be defined fairly precisely, it is difficult to elicit stable human judgments for summarization (Rath et al., 1961; Lin and Hovy, 2002).</Paragraph> <Paragraph position="2"> Our approach tailors the evaluation to observed distributions of content over a pool of human summaries, rather than to human judgments of summaries. Our method involves semantic matching of content units, to which differential weights are assigned based on their frequency in a corpus of summaries. This can lead to more stable, more informative scores, and hence to a meaningful content evaluation. We create a weighted inventory of Summary Content Units (a pyramid) that is reliable, predictive and diagnostic, and which constitutes a resource for investigating alternate realizations of the same meaning. 
No other evaluation method predicts sets of equally informative summaries, identifies semantic differences between more and less highly ranked summaries, or constitutes a tool that can be applied directly to further analysis of content selection.</Paragraph> <Paragraph position="3"> In Section 2, we describe the DUC method. In Section 3, we present an overview of our method, contrast our scores with other methods, and describe the distribution of scores as pyramids grow in size. We compare our approach with previous work in Section 4. In Section 5, we present our conclusions and point to our next step, the feasibility of automating our method. A more detailed account of the work described here, but not including the study of distributional properties of pyramid scores, can be found in Passonneau and Nenkova (2003).</Paragraph> </Section> </Paper>