<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0909">
  <Title>METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments</Title>
  <Section position="3" start_page="65" end_page="68" type="metho">
    <SectionTitle>
2 The METEOR Metric
2.1 Weaknesses in BLEU Addressed in
METEOR
</SectionTitle>
    <Paragraph position="0"> The main principle behind IBM's BLEU metric (Papineni et al, 2002) is the measurement of the  overlap in unigrams (single words) and higher order n-grams of words, between a translation being evaluated and a set of one or more reference translations. The main component of BLEU is n-gram precision: the proportion of the matched n-grams out of the total number of n-grams in the evaluated translation. Precision is calculated separately for each n-gram order, and the precisions are combined via a geometric averaging. BLEU does not take recall into account directly. Recall - the proportion of the matched n-grams out of the total number of n-grams in the reference translation, is extremely important for assessing the quality of MT output, as it reflects to what degree the translation covers the entire content of the translated sentence. BLEU does not use recall because the notion of recall is unclear when matching simultaneously against a set of reference translations (rather than a single reference). To compensate for recall, BLEU uses a Brevity Penalty, which penalizes translations for being &amp;quot;too short&amp;quot;. The NIST metric is conceptually similar to BLEU in most aspects, including the weaknesses discussed below.</Paragraph>
    <Paragraph position="1"> BLEU and NIST suffer from several weaknesses, which we attempt to address explicitly in our proposed METEOR metric: The Lack of Recall: We believe that the fixed brevity penalty in BLEU does not adequately compensate for the lack of recall. Our experimental results strongly support this claim.</Paragraph>
    <Paragraph position="2"> Use of Higher Order N-grams: Higher order N-grams are used in BLEU as an indirect measure of a translation's level of grammatical wellformedness. We believe an explicit measure for the level of grammaticality (or word order) can better account for the importance of grammaticality as a factor in the MT metric, and result in better correlation with human judgments of translation quality.</Paragraph>
    <Paragraph position="3"> Lack of Explicit Word-matching Between Translation and Reference: N-gram counts don't require an explicit word-to-word matching, but this can result in counting incorrect &amp;quot;matches&amp;quot;, particularly for common function words.</Paragraph>
    <Paragraph position="4"> Use of Geometric Averaging of N-grams: Geometric averaging results in a score of &amp;quot;zero&amp;quot; whenever one of the component n-gram scores is zero. Consequently, BLEU scores at the sentence (or segment) level can be meaningless. Although BLEU was intended to be used only for aggregate counts over an entire test-set (and not at the sentence level), scores at the sentence level can be useful indicators of the quality of the metric. In experiments we conducted, a modified version of BLEU that uses equal-weight arithmetic averaging of n-gram scores was found to have better correlation with human judgments.</Paragraph>
    <Section position="1" start_page="66" end_page="68" type="sub_section">
      <SectionTitle>
2.2 The METEOR Metric
</SectionTitle>
      <Paragraph position="0"> METEOR was designed to explicitly address the weaknesses in BLEU identified above. It evaluates a translation by computing a score based on explicit word-to-word matches between the translation and a reference translation. If more than one reference translation is available, the given translation is scored against each reference independently, and the best score is reported. This is discussed in more detail later in this section.</Paragraph>
      <Paragraph position="1"> Given a pair of translations to be compared (a system translation and a reference translation), METEOR creates an alignment between the two strings. We define an alignment as a mapping between unigrams, such that every unigram in each string maps to zero or one unigram in the other string, and to no unigrams in the same string. Thus in a given alignment, a single unigram in one string cannot map to more than one unigram in the other string. This alignment is incrementally produced through a series of stages, each stage consisting of two distinct phases.</Paragraph>
      <Paragraph position="2"> In the first phase an external module lists all the possible unigram mappings between the two strings. Thus, for example, if the word &amp;quot;computer&amp;quot; occurs once in the system translation and twice in the reference translation, the external module lists two possible unigram mappings, one mapping the occurrence of &amp;quot;computer&amp;quot; in the system translation to the first occurrence of &amp;quot;computer&amp;quot; in the reference translation, and another mapping it to the second occurrence. Different modules map unigrams based on different criteria. The &amp;quot;exact&amp;quot; module maps two unigrams if they are exactly the same (e.g. &amp;quot;computers&amp;quot; maps to &amp;quot;computers&amp;quot; but not &amp;quot;computer&amp;quot;). The &amp;quot;porter stem&amp;quot; module maps two unigrams if they are the same after they are stemmed using the Porter stemmer (e.g.: &amp;quot;computers&amp;quot; maps to both &amp;quot;computers&amp;quot; and to &amp;quot;computer&amp;quot;). The &amp;quot;WN synonymy&amp;quot; module maps two unigrams if they are synonyms of each other.</Paragraph>
      <Paragraph position="3"> In the second phase of each stage, the largest subset of these unigram mappings is selected such  that the resulting set constitutes an alignment as defined above (that is, each unigram must map to at most one unigram in the other string). If more than one subset constitutes an alignment, and also has the same cardinality as the largest set, METEOR selects that set that has the least number of unigram mapping crosses. Intuitively, if the two strings are typed out on two rows one above the other, and lines are drawn connecting unigrams that are mapped to each other, each line crossing is counted as a &amp;quot;unigram mapping cross&amp;quot;. Formally, two unigram mappings (ti, rj) and (tk, rl) (where ti and tk are unigrams in the system translation mapped to unigrams rj and rl in the reference translation respectively) are said to cross if and only if the following formula evaluates to a negative number:</Paragraph>
      <Paragraph position="5"> where pos(tx) is the numeric position of the uni-gram tx in the system translation string, and pos(ry) is the numeric position of the unigram ry in the reference string. For a given alignment, every pair of unigram mappings is evaluated as a cross or not, and the alignment with the least total crosses is selected in this second phase. Note that these two phases together constitute a variation of the algorithm presented in (Turian et al, 2003).</Paragraph>
      <Paragraph position="6"> Each stage only maps unigrams that have not been mapped to any unigram in any of the preceding stages. Thus the order in which the stages are run imposes different priorities on the mapping modules employed by the different stages. That is, if the first stage employs the &amp;quot;exact&amp;quot; mapping module and the second stage employs the &amp;quot;porter stem&amp;quot; module, METEOR is effectively preferring to first map two unigrams based on their surface forms, and performing the stemming only if the surface forms do not match (or if the mapping based on surface forms was too &amp;quot;costly&amp;quot; in terms of the total number of crosses). Note that METEOR is flexible in terms of the number of stages, the actual external mapping module used for each stage, and the order in which the stages are run. By default the first stage uses the &amp;quot;exact&amp;quot; mapping module, the second the &amp;quot;porter stem&amp;quot; module and the third the &amp;quot;WN synonymy&amp;quot; module. In section 4 we evaluate each of these configurations of METEOR.</Paragraph>
      <Paragraph position="7"> Once all the stages have been run and a final alignment has been produced between the system translation and the reference translation, the METEOR score for this pair of translations is computed as follows. First unigram precision (P) is computed as the ratio of the number of unigrams in the system translation that are mapped (to unigrams in the reference translation) to the total number of unigrams in the system translation.</Paragraph>
      <Paragraph position="8"> Similarly, unigram recall (R) is computed as the ratio of the number of unigrams in the system translation that are mapped (to unigrams in the reference translation) to the total number of unigrams in the reference translation. Next we compute Fmean by combining the precision and recall via a harmonic-mean (van Rijsbergen, 1979) that places most of the weight on recall. We use a harmonic mean of P and 9R. The resulting formula used is:</Paragraph>
      <Paragraph position="10"> Precision, recall and Fmean are based on uni-gram matches. To take into account longer matches, METEOR computes a penalty for a given alignment as follows. First, all the unigrams in the system translation that are mapped to unigrams in the reference translation are grouped into the fewest possible number of chunks such that the unigrams in each chunk are in adjacent positions in the system translation, and are also mapped to unigrams that are in adjacent positions in the reference translation. Thus, the longer the n-grams, the fewer the chunks, and in the extreme case where the entire system translation string matches the reference translation there is only one chunk. In the other extreme, if there are no bigram or longer matches, there are as many chunks as there are unigram matches. The penalty is then computed through the following formula:</Paragraph>
      <Paragraph position="12"> For example, if the system translation was &amp;quot;the president spoke to the audience&amp;quot; and the reference translation was &amp;quot;the president then spoke to the audience&amp;quot;, there are two chunks: &amp;quot;the president&amp;quot; and &amp;quot;spoke to the audience&amp;quot;. Observe that the penalty increases as the number of chunks increases to a maximum of 0.5. As the number of chunks goes to 1, penalty decreases, and its lower bound is decided by the number of unigrams matched. The parameters if this penalty function were determined based on some experimentation with de- null veopment data, but have not yet been trained to be optimal.</Paragraph>
      <Paragraph position="13"> Finally, the METEOR Score for the given alignment is computed as follows:</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="68" end_page="71" type="metho">
    <SectionTitle>
Score = Fmean * (1 - Penalty)
</SectionTitle>
    <Paragraph position="0"> This has the effect of reducing the Fmean by the maximum of 50% if there are no bigram or longer matches.</Paragraph>
    <Paragraph position="1"> For a single system translation, METEOR computes the above score for each reference translation, and then reports the best score as the score for the translation. The overall METEOR score for a system is calculated based on aggregate statistics accumulated over the entire test set, similarly to the way this is done in BLEU. We calculate aggregate precision, aggregate recall, an aggregate penalty, and then combine them using the same formula used for scoring individual segments.</Paragraph>
    <Paragraph position="2">  3 Evaluation of the METEOR Metric 3.1. Data We evaluated the METEOR metric and compared its performance with BLEU and NIST on the DARPA/TIDES 2003 Arabic-to-English and Chinese-to-English MT evaluation data released through the LDC as a part of the workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, at the Annual Meeting of the Association of Computational Linguistics  (2005). The Chinese data set consists of 920 sentences, while the Arabic data set consists of 664 sentences. Each sentence has four reference translations. Furthermore, for 7 systems on the Chinese data and 6 on the Arabic data, every sentence translation has been assessed by two separate human judges and assigned an Adequacy and a Fluency Score. Each such score ranges from one to five (with one being the poorest grade and five the highest). For this paper, we computed a Combined Score for each translation by averaging the adequacy and fluency scores of the two judges for that translation. We also computed an average System Score for each translation system by averaging the Combined Score for all the translations produced by that system. (Note that although we refer to these data sets as the &amp;quot;Chinese&amp;quot; and the &amp;quot;Arabic&amp;quot; data sets, the MT evaluation systems analyzed in this paper only evaluate English sentences produced by translation systems by comparing them to English reference sentences).</Paragraph>
    <Section position="1" start_page="68" end_page="69" type="sub_section">
      <SectionTitle>
3.2 Comparison with BLEU and NIST MT
Evaluation Algorithms
</SectionTitle>
      <Paragraph position="0"> In this paper, we are interested in evaluating METEOR as a metric that can evaluate translations on a sentence-by-sentence basis, rather than on a coarse grained system-by-system basis. The standard metrics - BLEU and NIST - were however designed for system level scoring, hence computing sentence level scores using BLEU or the NIST evaluation mechanism is unfair to those algorithms. To provide a point of comparison however, table 1 shows the system level correlation between human judgments and various MT evaluation algorithms and sub components of METEOR over the Chinese portion of the Tides 2003 dataset. Specifically, these correlation figures were obtained as follows: Using each algorithm we computed one score per Chinese system by calculating the aggregate scores produced by that algorithm for that system. We also obtained the overall human judgment for each system by averaging all the human scores for that system's translations. We then computed the Pearson correlation between these system level human judgments and the system level scores for each algorithm; these numbers are presented in  with BLEU and NIST/human correlations Observe that simply using Recall as the MT evaluation metric results in a significant improvement in correlation with human judgment over both the BLEU and the NIST algorithms. These correlations further improve slightly when precision is taken into account (in the F1 measure),  when the recall is weighed more heavily than precision (in the Fmean measure) and when a penalty is levied for fragmented matches (in the main METEOR measure).</Paragraph>
    </Section>
    <Section position="2" start_page="69" end_page="69" type="sub_section">
      <SectionTitle>
3.3 Evaluation Methodology
</SectionTitle>
      <Paragraph position="0"> As mentioned in the previous section, our main goal in this paper is to evaluate METEOR and its components on their translation-by-translation level correlation with human judgment. Towards this end, in the rest of this paper, our evaluation methodology is as follows: For each system, we compute the METEOR Score for every translation produced by the system, and then compute the correlation between these individual scores and the human assessments (average of the adequacy and fluency scores) for the same translations. Thus we get a single Pearson R value for each system for which we have human assessments. Finally we average the R values of all the systems for each of the two language data sets to arrive at the overall average correlation for the Chinese dataset and the Arabic dataset. This number ranges between -1.0 (completely negatively correlated) to +1.0 (completely positively correlated).</Paragraph>
      <Paragraph position="1"> We compare the correlation between human assessments and METEOR Scores produced above with that between human assessments and precision, recall and Fmean scores to show the advantage of the various components in the METEOR scoring function. Finally we run METEOR using different mapping modules, and compute the correlation as described above for each configuration to show the effect of each unigram mapping mechanism. null</Paragraph>
    </Section>
    <Section position="3" start_page="69" end_page="69" type="sub_section">
      <SectionTitle>
3.4 Correlation between METEOR Scores and Human Assessments
</SectionTitle>
      <Paragraph position="0"> Human Assessments for the Arabic Dataset We computed sentence by sentence correlation between METEOR Scores and human assessments (average of adequacy and fluency scores) for each translation for every system. Tables 2 and 3 show the Pearson R correlation values for each system, as well as the average correlation value per language dataset.</Paragraph>
    </Section>
    <Section position="4" start_page="69" end_page="70" type="sub_section">
      <SectionTitle>
3.5 Comparison with Other Metrics
</SectionTitle>
      <Paragraph position="0"> We computed translation by translation correlations between human assessments and other metrics besides the METEOR score, namely precision, recall and Fmean. Tables 4 and 5 show the correlations for the various scores.</Paragraph>
      <Paragraph position="1">  precision, recall, Fmean and METEOR Scores, averaged over systems in the Chinese dataset We observe that recall by itself correlates with human assessment much better than precision, and that combining the two using the Fmean formula  described above results in further improvement. By penalizing the Fmean score using the chunk count we get some further marginal improvement in correlation. null</Paragraph>
    </Section>
    <Section position="5" start_page="70" end_page="70" type="sub_section">
      <SectionTitle>
3.6 Comparison between Different Mapping Modules
</SectionTitle>
      <Paragraph position="0"> ping Modules To observe the effect of various unigram mapping modules on the correlation between the METEOR score and human assessments, we ran METEOR with different sequences of stages with different mapping modules in them. In the first experiment we ran METEOR with only one stage that used the &amp;quot;exact&amp;quot; mapping module. This module matches unigrams only if their surface forms match. (This module does not match unigrams that belong to a list of &amp;quot;stop words&amp;quot; that consist mainly of function words). In the second experiment we ran METEOR with two stages, the first using the &amp;quot;exact&amp;quot; mapping module, and the second the &amp;quot;Porter&amp;quot; mapping module. The Porter mapping module matches two unigrams to each other if they are identical after being passed through the Porter stemmer. In the third experiment we replaced the Porter mapping module with the WN-Stem mapping module. This module maps two unigrams to each other if they share the same base form in WordNet. This can be thought of as a different kind of stemmer - the difference from the Porter stemmer is that the word stems are actual words when stemmed through WordNet in this manner.</Paragraph>
      <Paragraph position="1"> In the last experiment we ran METEOR with three stages, the first two using the exact and the Porter modules, and the third the WN-Synonymy mapping module. This module maps two unigrams together if at least one sense of each word belongs to the same synset in WordNet. Intuitively, this implies that at least one sense of each of the two words represent the same concept. This can be thought of as a poor-man's synonymy detection algorithm that does not disambiguate the words being tested for synonymy. Note that the METEOR scores used to compute correlations in the other tables (1 through 4) used exactly this sequence of stages.</Paragraph>
      <Paragraph position="2"> Tables 6 and 7 show the correlations between METEOR scores produced in each of these experiments and human assessments for both the Arabic and the Chinese datasets. On both data sets, adding either stemming modules to simply using the exact matching improves correlations. Some further improvement in correlation is produced by adding the synonymy module.</Paragraph>
    </Section>
    <Section position="6" start_page="70" end_page="71" type="sub_section">
      <SectionTitle>
3.7 Correlation using Normalized Human
Assessment Scores
</SectionTitle>
      <Paragraph position="0"> One problem with conducting correlation experiments with human assessment scores at the sentence level is that the human scores are noisy that is, the levels of agreement between human judges on the actual sentence level assessment scores is not extremely high. To partially address this issue, the human assessment scores were normalized by a group at the MITRE Corporation. To see the effect of this noise on the correlation, we computed the correlation between the METEOR Score (computed using the stages used in the 4th experiment in section 7 above) and both the raw human assessments as well as the normalized human assessments.</Paragraph>
      <Paragraph position="1">  Table 8 shows that indeed METEOR Scores correlate better with normalized human assessments. In other words, the noise in the human assessments hurts the correlations between automatic scores and human assessments.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>