<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1020"> <Title>Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics</Title> <Section position="2" start_page="0" end_page="1" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Automated text summarization has drawn a lot of interest in the natural language processing and information retrieval communities in recent years. A series of workshops on automatic text summarization (WAS 2000, 2001, 2002), special topic sessions in ACL, COLING, and SIGIR, and government-sponsored evaluation efforts in the United States (DUC 2002) and Japan (Fukusima and Okumura 2001) have advanced the technology and produced a couple of experimental online systems (Radev et al. 2001, McKeown et al. 2002).</Paragraph> <Paragraph position="1"> Despite these efforts, however, there are no common, convenient, and repeatable evaluation methods that can be easily applied to support system development and just-in-time comparison among different summarization methods.</Paragraph> <Paragraph position="2"> The Document Understanding Conference (DUC 2002), run by the National Institute of Standards and Technology (NIST), sets out to address this problem by providing annual large-scale common evaluations in text summarization. However, these evaluations involve human judges and hence are subject to variability (Rath et al. 1961). For example, Lin and Hovy (2002) pointed out that 18% of the data contained multiple judgments in the DUC 2001 single document evaluation.</Paragraph> <Paragraph position="3"> To further progress in automatic summarization, in this paper we conduct an in-depth study of automatic evaluation methods based on n-gram co-occurrence in the context of DUC. Due to the setup in DUC, the evaluations we discuss here are intrinsic evaluations (Spärck Jones and Galliers 1996).
Section 2 gives an overview of the evaluation procedure used in DUC.</Paragraph> <Paragraph position="4"> Section 3 discusses the IBM BLEU (Papineni et al. 2001) and NIST (2002) n-gram co-occurrence scoring procedures and the application of a similar idea to evaluating summaries.</Paragraph> <Paragraph position="5"> Section 4 compares n-gram co-occurrence scoring procedures in terms of their correlation with human results and the recall and precision of statistical significance prediction. Section 5 concludes this paper and discusses future directions.</Paragraph> </Section></Paper>