<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1032">
<Title>Re-evaluating the Role of BLEU in Machine Translation Research</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Over the past five years, progress in machine translation, and to a lesser extent in natural language generation tasks such as summarization, has been driven by optimizing against n-gram-based evaluation metrics such as Bleu (Papineni et al., 2002). The statistical machine translation community relies on the Bleu metric to evaluate incremental system changes and to optimize systems through minimum error rate training (Och, 2003). Conference papers routinely claim improvements in translation quality by reporting improved Bleu scores, while neglecting to show any actual example translations. Workshops commonly compare systems using Bleu scores, often without confirming these rankings through manual evaluation. All these uses of Bleu are predicated on the assumption that it correlates with human judgments of translation quality, an assumption that has been shown to hold in many cases (Doddington, 2002; Coughlin, 2003).</Paragraph>
<Paragraph position="1"> However, it remains an open question whether minimizing error rate with respect to Bleu actually guarantees genuine improvements in translation quality.</Paragraph>
<Paragraph position="2"> If Bleu's correlation with human judgments has been overestimated, then the field needs to ask itself whether it should continue to be driven by Bleu to the extent that it currently is. In this paper we give a number of counterexamples to Bleu's correlation with human judgments. We show that under some circumstances an improvement in Bleu is not sufficient to reflect a genuine improvement in translation quality, and that in other circumstances it is not necessary to improve Bleu in order to achieve a noticeable improvement in translation quality.</Paragraph>
<Paragraph position="3"> We argue that Bleu is insufficient by showing that Bleu admits a huge amount of variation for identically scored hypotheses. Typically there are millions of variations on a hypothesis translation that receive the same Bleu score. Because not all of these variations are equally plausible grammatically or semantically, there are translations that share the same Bleu score but receive worse human evaluations (an illustrative sketch of this point follows this section). We further illustrate that in practice a higher Bleu score is not necessarily indicative of better translation quality by giving two substantial examples of Bleu vastly underestimating the translation quality of systems. Finally, we discuss appropriate uses for Bleu and suggest that for some research projects it may be preferable to use a focused, manual evaluation instead.</Paragraph>
</Section>
</Paper>
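
The sketch below is not part of the paper; it is a minimal, self-contained Python illustration of the identical-score point above, using a toy sentence-level Bleu (clipped modified n-gram precisions for n = 1..4, geometric mean, brevity penalty). The reference and hypothesis strings are invented for illustration, and the paper's own experiments rely on the standard corpus-level Bleu implementation rather than this toy version.

import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(hypothesis, references, max_n=4):
    """Toy sentence-level Bleu: clipped n-gram precisions, geometric mean, brevity penalty."""
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        # Clip each hypothesis n-gram count by its maximum count in any single reference.
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty against the reference length closest to the hypothesis length.
    ref_len = min((len(r) for r in refs), key=lambda length: (abs(length - len(hyp)), length))
    brevity = 1.0 if len(hyp) > ref_len else math.exp(1 - ref_len / len(hyp))
    return brevity * math.exp(sum(log_precisions) / max_n)

# Two hypotheses built from the same two chunks ("the old cat sat down" and
# "on the red mat") separated by an unmatched word. Swapping the chunks changes
# the word order, and plausibly the fluency, but every matched n-gram lies
# inside a chunk, so both orderings receive exactly the same score.
reference = ["the old cat sat down on the red mat quietly"]
hyp_a = "the old cat sat down xyz on the red mat"
hyp_b = "on the red mat xyz the old cat sat down"
print(sentence_bleu(hyp_a, reference))  # approximately 0.66
print(sentence_bleu(hyp_b, reference))  # identical score despite the reordering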