<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1032">
  <Title>Re-evaluating the Role of BLEU in Machine Translation Research</Title>
  <Section position="6" start_page="254" end_page="254" type="evalu">
    <SectionTitle>
5 Related Work
</SectionTitle>
    <Paragraph position="0"> A number of projects in the past have looked into ways of extending and improving the Bleu metric. Doddington(2002)suggestedchangingBleu's weighted geometric average of n-gram matches to an arithmetic average, and calculating the brevity penalty in a slightly different manner. Hovy and Ravichandra (2003) suggested increasing Bleu's sensitivity to inappropriate phrase movement by  matchingpart-of-speechtagsequencesagainstreference translations in addition to Bleu's n-gram matches. Babych and Hartley (2004) extend Bleu by adding frequency weighting to lexical items through TF/IDF as a way of placing greater emphasis on content-bearing words and phrases.</Paragraph>
    <Paragraph position="1"> Twoalternativeautomatictranslationevaluation metrics do a much better job at incorporating recall than Bleu does. Melamed et al. (2003) formulate a metric which measures translation accuracy in terms of precision and recall directly rather than precision and a brevity penalty. Banerjee and Lavie (2005) introduce the Meteor metric, which also incorporates recall on the unigram level and further provides facilities incorporating stemming, and WordNet synonyms as a more flexible match.</Paragraph>
    <Paragraph position="2"> LinandHovy(2003)aswellasSoricutandBrill (2004) present ways of extending the notion of n-gram co-occurrence statistics over multiple references, such as those used in Bleu, to other natural language generation tasks such as summarization.</Paragraph>
    <Paragraph position="3"> Both these approaches potentially suffer from the same weaknesses that Bleu has in machine translation evaluation.</Paragraph>
    <Paragraph position="4"> Coughlin (2003) performs a large-scale investigation of Bleu's correlation with human judgments, and finds one example that fails to correlate. Her future work section suggests that she has preliminary evidence that statistical machine translation systems receive a higher Bleu score than their non-n-gram-based counterparts.</Paragraph>
  </Section>
class="xml-element"></Paper>