<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1072"> <Title>ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation</Title> <Section position="3" start_page="0" end_page="1" type="intro"> <SectionTitle> 2 ORANGE </SectionTitle> <Paragraph position="0"> Intuitively a good evaluation metric should give higher score to a good translation than a bad one.</Paragraph> <Paragraph position="1"> Therefore, a good translation should be ranked higher than a bad translation based their scores.</Paragraph> <Paragraph position="2"> One basic assumption of all automatic evaluation metrics for machine translation is that reference translations are good translations and the more a machine translation is similar to its reference translations the better. We adopt this assumption and add one more assumption that automatic translations are usually worst than their reference translations. Therefore, reference translations should be ranked higher than machine translations on average if a good automatic evaluation metric is used. Based on these assumptions, we propose a new automatic evaluation method for evaluation of automatic machine translation metrics as follows: Given a source sentence, its machine translations, and its reference translations, we compute the average rank of the reference translations within the combined machine and reference translation list. For example, a statistical machine translation system such as ISI's AlTemp SMT system (Och 2003) can generate a list of n-best alternative translations given a source sentence. We compute the automatic scores for the n-best translations and their reference translations. We then rank these translations, calculate the average rank of the references in the n-best list, and compute the ratio of the average reference rank to the length of the n-best list. We call this ratio &quot;ORANGE&quot; (Oracle</Paragraph> <Section position="1" start_page="1" end_page="1" type="sub_section"> <SectionTitle> Ranking for </SectionTitle> <Paragraph position="0"> Gisting Evaluation) and the smaller the ratio is, the better the automatic metric is.</Paragraph> <Paragraph position="1"> There are several advantages of the proposed ORANGE evaluation method: * No extra human involvement - ORANGE uses the existing human references but not human evaluations.</Paragraph> <Paragraph position="2"> * Applicable on sentence-level - Diagnostic error analysis on sentence-level is naturally provided. This is a feature that many machine translation researchers look for.</Paragraph> <Paragraph position="3"> * Many existing data points - Every sentence is a data point instead of every system (corpus-level). For example, there are 919 sentences vs. 8 systems in the 2003 NIST Chinese-English machine translation evaluation.</Paragraph> <Paragraph position="4"> * Only one objective function to optimize Minimize a single ORANGE score instead of maximize Pearson's correlation coefficients between automatic scores and human judgments in adequacy, fluency, or other quality metrics.</Paragraph> <Paragraph position="5"> * A natural fit to the existing statistical machine translation framework - A metric that ranks a good translation high in an n-best list could be easily integrated in a minimal error rate statistical machine translation training framework (Och 2003). 
<Paragraph position="9"> Before we demonstrate how to use ORANGE to evaluate automatic metrics, we briefly introduce three new metrics in the next section.</Paragraph> </Section> </Paper>