<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1078">
<Title>A Unified Framework for Automatic Evaluation using N-gram Co-Occurrence Statistics</Title>
<Section position="3" start_page="0" end_page="0" type="intro">
<SectionTitle>1 Introduction</SectionTitle>
<Paragraph position="0">With the introduction of the BLEU metric for machine translation evaluation (Papineni et al., 2002), the advantages of automatic evaluation for various NLP applications have become increasingly appreciated: automatic metrics allow for faster implement-evaluate cycles (by bypassing the human evaluation bottleneck), less variation in evaluation results due to errors in human assessor judgment, and, not least, the possibility of hill-climbing on such metrics in order to improve system performance (Och, 2003). Recently, a second proposal for automatic evaluation has come from the Automatic Summarization community (Lin and Hovy, 2003), with an automatic evaluation metric called ROUGE, inspired by BLEU but adapted to the specifics of the summarization task.</Paragraph>
<Paragraph position="1">An automatic evaluation metric is considered successful if it is shown to have high agreement with human-performed evaluations. Human evaluations, however, follow specific guidelines given to the human assessors for the evaluation task; the variation in human judgment is therefore strongly influenced by these guidelines. It follows that, for an automatic evaluation to agree with a human-performed evaluation, the metric used by the automatic method must account, at least to some degree, for the bias induced by the human evaluation guidelines. None of the automatic evaluation methods proposed to date, however, explicitly accounts for the different criteria followed by the human assessors, as these metrics are defined independently of the guidelines used in the human evaluations.</Paragraph>
<Paragraph position="2">In this paper, we propose a framework for the automatic evaluation of NLP applications that accounts for the variation in human evaluation guidelines. We define a family of metrics based on N-gram co-occurrence statistics, of which the automatic evaluation metrics proposed to date for Machine Translation and Automatic Summarization are particular instances. We show that different members of this family best explain the variation observed in human evaluations, depending on the application being evaluated (Machine Translation, Automatic Summarization, and Question Answering) and the guidelines used by humans when evaluating these applications.</Paragraph>
</Section>
</Paper>
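
For readers unfamiliar with the underlying statistic, the following minimal Python sketch shows the kind of clipped n-gram co-occurrence score that BLEU- and ROUGE-style metrics build on. It is an illustration of the general idea only, not the parameterized metric family defined in this paper; the function names and the uniform arithmetic average over n-gram orders are simplifying assumptions made here (BLEU itself combines orders with a geometric mean and adds a brevity penalty).

from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token sequence, with their counts.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_cooccurrence_score(candidate, references, max_n=4):
    # Clipped n-gram precision of a candidate against one or more references,
    # averaged (arithmetically, for simplicity) over n-gram orders 1..max_n.
    scores = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        if not cand_counts:
            continue
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram])
                      for gram, count in cand_counts.items())
        scores.append(clipped / sum(cand_counts.values()))
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical usage with made-up candidate and reference sentences:
cand = "the cat sat on the mat".split()
refs = ["the cat is on the mat".split(), "a cat sat on a mat".split()]
print(ngram_cooccurrence_score(cand, refs))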