<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1078">
<Title>A Unified Framework for Automatic Evaluation using N-gram Co-Occurrence Statistics</Title>
<Section position="8" start_page="2" end_page="2" type="concl">
<SectionTitle>5 Conclusions</SectionTitle>
<Paragraph position="0">In this paper, we propose a unified framework for automatic evaluation based on N-gram co-occurrence statistics, for NLP applications for which the set of correct answers is usually unfeasibly large (e.g., Machine Translation, Paraphrasing, Question Answering, Summarization). The success of BLEU in the automatic evaluation of machine translation output has often led researchers to blindly use this metric for evaluation tasks for which it was more or less appropriate (see, e.g., the paper of Lin and Hovy (2003), in which the authors start with the assumption that BLEU might work for summarization evaluation, and discover a better candidate after several trials).</Paragraph>
<Paragraph position="1">Our unifying framework facilitates the understanding of when various automatic evaluation metrics are able to closely approximate human evaluations for various applications. Given an application app and an evaluation guideline package eval, the faithfulness/compactness ratio of the application and the precision/recall ratio of the evaluation guidelines determine a restricted area in the evaluation plane in Figure 1 that best characterizes the (app, eval) pair. We have empirically demonstrated that the metrics from the AEv(α,N) family that best approximate human judgment are those whose α and N parameters lie in this restricted area. To our knowledge, this is the first proposal for automatic evaluation in which the metrics are able to account for the variation in human judgment due to specific evaluation guidelines.</Paragraph>
</Section>
</Paper>
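<!-- Editorial sketch (Python), kept as an XML comment to preserve well-formedness.
This is a hypothetical instantiation of an AEv(α,N)-style metric, not the paper's
exact definition: it assumes α interpolates between n-gram precision (α = 1,
BLEU-like) and n-gram recall (α = 0, ROUGE-like), with N bounding the n-gram length.

from collections import Counter

def ngrams(tokens, n):
    # Multiset of the n-grams of length n in a token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ae_v(candidate, reference, alpha=0.5, N=4):
    # Geometric mean over n = 1..N of clipped n-gram matches divided by an
    # alpha-weighted mix of the candidate n-gram count (precision side) and
    # the reference n-gram count (recall side).
    score = 1.0
    for n in range(1, N + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        matches = sum((cand & ref).values())  # clipped co-occurrences
        denom = alpha * sum(cand.values()) + (1 - alpha) * sum(ref.values())
        if denom == 0 or matches == 0:
            return 0.0
        score *= matches / denom
    return score ** (1.0 / N)

# Example: scoring a candidate against a single reference.
candidate = "the cat sat on the mat".split()
reference = "the cat was sitting on the mat".split()
print(ae_v(candidate, reference, alpha=0.5, N=2))

Sweeping alpha and N over a grid and correlating the resulting scores with human
judgments is one way to locate the restricted area of the evaluation plane that
the conclusion refers to.
-->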