<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0908">
  <Title>On Some Pitfalls in Automatic Evaluation and Significance Testing for MT</Title>
  <Section position="3" start_page="57" end_page="58" type="metho">
    <SectionTitle>
2 The Experimental Setup: Discriminative Reranking for Phrase-Based SMT
</SectionTitle>
    <Paragraph position="0"> Reranking for Phrase-Based SMT The experimental setup we employed to compare evaluation measures and significance tests is a discriminative reranking experiment on 1000-best lists of a phrase-based SMT system. Our system is a re-implementation of the phrase-based system described in Koehn (2003), and uses publicly available components for word alignment (Och and Ney, 2003)1, decoding (Koehn, 2004a)2, language modeling (Stolcke, 2002)3 and finite-state processing (Knight and Al-Onaizan, 1999)4. Training and test data are taken from the Europarl parallel corpus (Koehn, 2002)5.</Paragraph>
    <Paragraph position="1"> Phrase-extraction follows Och et al. (1999) and was implemented by the authors: First, the word aligner is applied in both translation directions, and the intersection of the alignment matrices is built. Then, the alignment is extended by adding immediately adjacent alignment points and alignment points that align previously unaligned words. From this many-to-many alignment matrix, phrases are extracted according to a contiguity requirement that states that words in the source phrase are aligned only with words in the target phrase, and vice versa. Discriminative reranking on a 1000-best list of translations of the SMT system uses an lscript1 regularized log-linear model that combines a standard maximum-entropy estimator with an efficient, incremental feature selection technique for lscript1 regularization (Riezler and Vasserman, 2004). Training data are defined as pairs {(sj,tj)}mj=1 of source sentences sj and gold-standard translations tj that are determined as the translations in the 1000-best list that best match a given reference translation. The objective function to be minimized is the conditional log-likelihood L(l) subject to a regularization term R(l), where T(s) is the set of 1000-best translations for sentence s, l is a vector or log-parameters, and</Paragraph>
    <Paragraph position="3"> The features employed in our experiments consist of 8 features corresponding to system components (distortion model, language model, phrasetranslations, lexical weights, phrase penalty, word penalty) as provided by PHARAOH, together with a multitude of overlapping phrase features. For example, for a phrase-table of phrases consisting of maximally 3 words, we allow all 3-word phrases and 2word phrases as features. Since bigram features can overlap, information about trigrams can be gathered by composing bigram features even if the actual tri-gram is not seen in the training data.</Paragraph>
    <Paragraph position="4"> Feature selection makes it possible to employ and evaluate a large number of features, without concerns about redundant or irrelevant features hampering generalization performance. The lscript1 regularizer is defined by the weighted lscript1-norm of the parameters</Paragraph>
    <Paragraph position="6"> where g is a regularization coefficient, and n is number of parameters. This regularizer penalizes overly large parameter values in their absolute values, and tends to force a subset of the parameters to be exactly zero at the optimum. This fact leads to a natural integration of regularization into incremental feature selection as follows: Assuming a tendency of the lscript1 regularizer to produce a large number of zero-valued parameters at the function's optimum, we start with all-zero weights, and incrementally add features to the model only if adjusting their parameters away from zero sufficiently decreases the optimization criterion. Since every non-zero weight added to the model incurs a regularizer penalty of g|li|, it only makes sense to add a feature to the model if this penalty is outweighed by the reduction in negative log-likelihood. Thus features considered for selection have to pass the following test:</Paragraph>
    <Paragraph position="8"> This gradient test is applied to each feature and at each step the features that pass the test with maximum magnitude are added to the model. This provides both efficient and accurate estimation with large feature sets.</Paragraph>
    <Paragraph position="9"> Work on discriminative reranking has been reported before by Och and Ney (2002), Och et al.</Paragraph>
    <Paragraph position="10"> (2004), and Shen et al. (2004). The main purpose of our reranking experiments is to have a system that can easily be adjusted to yield system variants that differ at controllable amounts. For quick experimental turnaround we selected the training and test data from sentences with 5 to 15 words, resulting in a training set of 160,000 sentences, and a development set of 2,000 sentences. The phrase-table employed was restricted to phrases of maximally 3 words, resulting in 200,000 phrases.</Paragraph>
  </Section>
  <Section position="4" start_page="58" end_page="59" type="metho">
    <SectionTitle>
3 Detecting Small Result Differences by
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="58" end_page="59" type="sub_section">
      <SectionTitle>
Intrinsic Evaluation Metrics
</SectionTitle>
      <Paragraph position="0"> The intrinsic evaluation measures used in our experiments are the well-known BLEU (Papineni et al., 2001) and NIST (Doddington, 2002) metrics, and an F-score measure that adapts evaluation techniques from dependency-based parsing (Crouch et al., 2002) and sentence-condensation (Riezler et al., 2003) to machine translation. All of these measures  sures differ in their focus on different entities in matching, corresponding to a focus on different aspects of translation quality.</Paragraph>
      <Paragraph position="1"> BLEU and NIST both consider n-grams in source and reference strings as matching entities. BLEU weighs all n-grams equally whereas NIST puts more weight on n-grams that are more informative, i.e., occur less frequently. This results in BLEU favoring matches in larger n-grams, corresponding to giving more credit to correct word order. NIST weighs lower n-grams more highly, thus it gives more credit to correct lexical choice than to word order.</Paragraph>
      <Paragraph position="2"> F-score is computed by parsing reference sentences and SMT outputs, and matching grammatical dependency relations. The reported value is the harmonic mean of precision and recall, which is defined as (2x precision x recall )/( precision + recall ).</Paragraph>
      <Paragraph position="3"> Precision is the ratio of matching dependency relations to the total number of dependency relations in the parse for the system translation, and recall is the ratio of matches to the total number of dependency relations in the parse for the reference translation. The goal of this measure is to focus on aspects of meaning in measuring similarity of system translations to reference translations, and to allow for meaning-preserving word order variation.</Paragraph>
      <Paragraph position="4"> Evaluation results for a comparison of reranking against a baseline model that only includes features corresponding to the 8 system components are shown in Table 1. Since the task is a comparison of system variants for development, all results are reported on the development set of 2,000 examples of length 5-15. The reranking model achieves an increase in NIST score of .15 units, whereas BLEU and F-score decrease by .3% and .2% respectively. However, as measured by the statistical significance tests described below, the differences in BLEU and F-scores are not statistically significant with p-values exceeding the standard rejection level of .05. In contrast, the differences in NIST score are highly significant. These findings correspond to results reported in Zhang et al. (2004) showing a higher sensitivity of NIST versus BLEU to small result differences. Taking also the results from F-score matching in account, we can conclude that similarity measures that are based on matching more complex entities (such as BLEU's higher n-grams or F's grammatical relations) are not as sensitive to small result differences as scoring techniques that are able to distinguish models by matching simpler entities (such as NIST's focus on lexical choice). Furthermore, we get an indication that differences of .3% in BLEU score or .2% in F-score might not be large enough to conclude statistical significance of result differences. This leads to questions of power and accuracy of the employed statistical significance tests which will be addressed in the next section.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="59" end_page="62" type="metho">
    <SectionTitle>
4 Assessing Statistical Significance of
Small Result Differences
</SectionTitle>
    <Paragraph position="0"> The bootstrap method is an example for a computerintensive statistical hypothesis test (see, e.g., Noreen (1989)). Such tests are designed to assess result differences with respect to a test statistic in cases where the sampling distribution of the test statistic  is unknown. Comparative evaluations of outputs of SMT systems according to test statistics such as differences in BLEU, NIST, or F-score are examples of this situation. The attractiveness of computerintensive significance tests such as the bootstrap or the approximate randomization method lies in their power and simplicity. As noted in standard textbooks such as Cohen (1995) or Noreen (1989) such tests are as powerful as parametric tests when parametric assumptions are met and they outperform them when parametric assumptions are violated. Because of their generality and simplicity they are also attractive alternatives to conventional non-parametric tests (see, e.g., Siegel (1988)). The power of these tests lies in the fact that they answer only a very simple question without making too many assumptions that may not be met in the experimental situation. In case of the approximate randomization test, only the question whether two samples are related to each other is answered, without assuming that the samples are representative of the populations from which they were drawn. The bootstrap method makes exactly this one assumption. This makes it formally possible to draw inferences about population parameters for the bootstrap, but not for approximate randomization. However, if the goal is to assess statistical significance of a result difference between two systems the approximate randomization test provides the desired power and accuracy whereas the bootstrap's advantage to draw inferences about population parameters comes at the price of reduced accuracy. Noreen summarizes this shortcoming of the bootstrap technique as follows: &amp;quot;The principal disadvantage of [the bootstrap] method is that the null hypothesis may be rejected because the shape of the sampling distribution is not well-approximated by the shape of the bootstrap sampling distribution rather than because the expected value of the test statistic differs from the value that is hypothesized.&amp;quot;(Noreen (1989), p. 89).</Paragraph>
    <Paragraph position="1"> Below we describe these two test procedures in more detail, and compare them in our experimental setup.</Paragraph>
    <Section position="1" start_page="59" end_page="61" type="sub_section">
      <SectionTitle>
4.1 Approximate Randomization
</SectionTitle>
      <Paragraph position="0"> An excellent introduction to the approximate randomization test is Noreen (1989). Applications of this test to natural language processing problems can be found in Chinchor et al. (1993).</Paragraph>
      <Paragraph position="1"> In our case of assessing statistical significance of result differences between SMT systems, the test statistic of interest is the absolute value of the difference in BLEU, NIST, or F-scores produced by two systems on the same test set. These test statistics are computed by accumulating certain count variables over the sentences in the test set. For example, in case of BLEU and NIST, variables for the length of reference translations and system translations, and for n-gram matches and n-gram counts are accumulated over the test corpus. In case of F-score, variable tuples consisting of the number of dependency-relations in the parse for the system translation, the number of dependency-relations in the parse for the reference translation, and the number of matching dependency-relations between system and reference parse, are accumulated over the test set.</Paragraph>
      <Paragraph position="2"> Under the null hypothesis, the compared systems are not different, thus any variable tuple produced by one of the systems could have been produced just as  likely by the other system. So shuffling the variable tuples between the two systems with equal probability, and recomputing the test statistic, creates an approximate distribution of the test statistic under the null hypothesis. For a test set of S sentences there are 2S different ways to shuffle the variable tuples between the two systems. Approximate randomization produce shuffles by random assignments instead of evaluating all 2S possible assignments. Significance levels are computed as the percentage of trials where the pseudo statistic, i.e., the test statistic computed on the shuffled data, is greater than or equal to the actual statistic, i.e., the test statistic computed on the test data. A sketch of an algorithm for approximate randomization testing is given in Fig. 1.</Paragraph>
    </Section>
    <Section position="2" start_page="61" end_page="62" type="sub_section">
      <SectionTitle>
4.2 The Bootstrap
</SectionTitle>
      <Paragraph position="0"> An excellent introduction to the technique is the textbook by Efron and Tibshirani (1993). In contrast to approximate randomization, the bootstrap method makes the assumption that the sample is a representative &amp;quot;proxy&amp;quot; for the population. The shape of the sampling distribution is estimated by repeatedly sampling (with replacement) from the sample itself.</Paragraph>
      <Paragraph position="1"> A sketch of a procedure for bootstrap testing is given in Fig. 2. First, the test statistic is computed on the test data. Then, the sample mean of the pseudo statistic is computed on the bootstrapped data, i.e., the test statistic is computed on bootstrap samples of equal size and averaged over bootstrap samples.</Paragraph>
      <Paragraph position="2"> In order to compute significance levels based on the bootstrap sampling distribution, we employ the &amp;quot;shift&amp;quot; method described in Noreen (1989). Here it is assumed that the sampling distribution of the null hypothesis and the bootstrap sampling distribution have the same shape but a different location. The location of the bootstrap sampling distribution is shifted so that it is centered over the location where the null hypothesis sampling distribution should be centered. This is achieved by subtracting from each value of the pseudo-statistic its expected value tB and then adding back the expected value t of the test statistic under the null hypothesis. tB can be estimated by the sample mean of the bootstrap samples; t is 0 under the null hypothesis. Then, similar to the approximate randomization test, significance levels are computed as the percentage of trials where the (shifted) pseudo statistic is greater than or equal to the actual statistic.</Paragraph>
      <Paragraph position="3"> 4.3 Power vs. Type I Errors In order to evaluate accuracy of the bootstrap and the approximate randomization test, we conduct an experimental evaluation of type-I errors of both bootstrap and approximate randomization on real data.</Paragraph>
      <Paragraph position="4"> Type-I errors indicate failures to reject the null hypothesis when it is true. We construct SMT system variants that are essentially equal but produce superficially different results. This can be achieved by constructing reranking variants that differ in the redundant features that are included in the models, but are similar in the number and kind of selected features. The results of this experiment are shown in Table 2. System 1 does not include irrelevant features, whereas systems 2-6 were constructed by adding a slightly different number of features in each step, but resulted in the same number of selected features.</Paragraph>
      <Paragraph position="5"> Thus competing features bearing the same information are exchanged in different models, yet overall the same information is conveyed by slightly different feature sets. The results of Table 2 show that the bootstrap method yields p-values &lt; .015 in 3 out of 5 pairwise comparisons whereas the approximate randomization test yields p-values [?] .025 in all cases. Even if the true p-value is unknown, we can say that the approximate randomization test estimates p-values more conservatively than the bootstrap, thus increasing the likelihood of type-I error for the bootstrap test. For a restrictive significance level of 0.15, which is motivated below for multiple  comparisons, the bootstrap would assess statistical significance in 3 out of 5 cases whereas statistical significance would not be assessed under approximate randomization. Assuming equivalence of the compared system variants, these assessments would count as type-I errors.</Paragraph>
    </Section>
    <Section position="3" start_page="62" end_page="62" type="sub_section">
      <SectionTitle>
4.4 The Multiplicity Problem
</SectionTitle>
      <Paragraph position="0"> In the experiment on type-I error described above, a more stringent rejection level than the usual .05 was assumed. This was necessary to circumvent a common pitfall in significance testing for k-fold pairwise comparisons. Following the argumentation given in Cohen (1995), the probability of randomly assessing statistical significance for result differences in k-fold pairwise comparisons grows exponentially in k. Recall that for a pairwise comparison of systems, specifying that p &lt; .05 means that the probability of incorrectly rejecting the null hypothesis that the systems are not different be less than .05. Caution has to be exercised in k-fold pairwise comparisons: For a probability pc of incorrectly rejecting the null hypothesis in a specific pairwise comparison, the probability pe of at least once incorrectly rejecting this null hypothesis in an experiment involving k pair-wise comparisons is pe [?] 1[?](1[?]pc)k For large values of k, the probability of concluding result differences incorrectly at least once is undesirably high. For example, in benchmark testing of 15 systems, 15(15 [?] 1)/2 = 105 pairwise comparisons will have to be conducted. At a per-comparison rejection level pc = .05 this results in an experimentwise error pe = .9954, i.e., the probability of at least one spurious assessment of significance is 1[?](1[?].05)105 = .9954. One possibility to reduce the likelihood that one ore more of differences assessed in pairwise comparisons is spurious is to run the comparisons at a more stringent per-comparison rejection level. Reducing the per-comparison rejection level pc until an experimentwise error rate pe of a standard value, e.g., .05, is achieved, will favor pe over pc. In the example of 5 pairwise comparisons described above, a per-comparison error rate pc = .015 was sufficient to achieve an experimentwise error rate pe [?] .07. In many cases this technique would require to reduce pc to the point where a result difference has to be unrealistically large to be significant. Here conventional tests for post-hoc comparisons such as the Scheff'e or Tukey test have to be employed (see Cohen (1995), p. 185ff.).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML