<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1136"> <Title>Significance tests for the evaluation of ranking methods</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Many tools in the area of natural-language processing involve the application of ranking methods to sets of candidates, in order to select the most useful items from an all too often overwhelming list.</Paragraph> <Paragraph position="1"> Examples of such tools range from syntactic parsers (where alternative analyses are ranked by their plausibility) to the extraction of collocations from text corpora (where a ranking according to the scores assigned by a lexical association measure is the essential component of an extraction "pipeline"). To this end, a scoring function g is applied to the candidate set, which assigns a real number g(x) ∈ ℝ to every candidate x.1 Conventionally, higher scores are assigned to candidates that the scoring function considers more "useful". Candidates can then be selected in one of two ways: (i) by comparison with a pre-defined threshold γ ∈ ℝ (i.e. x is accepted iff g(x) ≥ γ), resulting in a γ-acceptance set; (ii) by ranking the entire candidate set according to the scores g(x) and selecting the n highest-scoring candidates, resulting in an n-best list (where n is either determined by practical constraints or interactively by manual inspection). Note that an n-best list can also be interpreted as a γ-acceptance set with a suitably chosen cutoff threshold γ_g(n) (determined from the scores of all candidates). [Footnote 1: Some systems may directly produce a sorted candidate list without assigning explicit scores. However, unless this operation is (implicitly) based on an underlying scoring function, the result will in most cases be a partial ordering (where some pairs of candidates are incomparable) or lead to inconsistencies.]</Paragraph> <Paragraph position="2"> Ranking methods usually involve various heuristics and statistical guesses, so that an empirical evaluation of their performance is necessary. Even when there is a solid theoretical foundation, its predictions may not be borne out in practice. Often, the main goal of an evaluation experiment is the comparison of different ranking methods (i.e. scoring functions) in order to determine the most useful one.</Paragraph> <Paragraph position="3"> A widely-used evaluation strategy classifies the candidates accepted by a ranking method into "good" ones (true positives, TP) and "bad" ones (false positives, FP). This is sometimes achieved by comparison of the relevant γ-acceptance sets or n-best lists with a gold standard, but for certain applications (such as collocation extraction), manual inspection of the candidates leads to more clear-cut and meaningful results. When TPs and FPs have been identified, the precision P of a γ-acceptance set or an n-best list can be computed as the proportion of TPs among the accepted candidates. The most useful ranking method is the one that achieves the highest precision, usually comparing n-best lists of a given size n. If the full candidate set has been annotated, it is also possible to determine the recall R as the number of accepted TPs divided by the total number of TPs in the candidate set.</Paragraph>
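For illustration only, here is a small R sketch (R is also the software used later in the paper) of the two selection modes and of precision and recall; the scores, TP labels and the value of n are invented and do not come from the paper's data.

```r
# Invented scores g(x) and TP annotations for ten candidates.
scores <- c(3.2, 7.1, 0.4, 5.9, 2.8, 9.3, 4.4, 6.0, 1.7, 8.5)
is_tp  <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)
n <- 4

# (ii) n-best list: the n highest-scoring candidates.
n_best <- order(scores, decreasing = TRUE)[1:n]

# (i) the same selection as a gamma-acceptance set, with cutoff
#     gamma_g(n) = the n-th highest score (assuming no ties).
gamma_n <- sort(scores, decreasing = TRUE)[n]
setequal(n_best, which(scores >= gamma_n))   # TRUE

# Precision and recall of the n-best list.
c(precision = sum(is_tp[n_best]) / n,
  recall    = sum(is_tp[n_best]) / sum(is_tp))
```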
<Paragraph position="4"> While the evaluation of extraction tools (e.g. in information retrieval) usually requires that both precision and recall are high, ranking methods often put greater weight on high precision, possibly at the price of missing a considerable number of TPs. Moreover, when n-best lists of the same size are compared, precision and recall are fully equivalent.2 For these reasons, I will concentrate on the precision P here. [Footnote 2: Namely, P = n_TP · R / n, where n_TP stands for the total number of TPs in the candidate set.]</Paragraph> <Paragraph position="5"> As an example, consider the identification of collocations from text corpora. Following the methodology described by Evert and Krenn (2001), German PP-verb combinations were extracted from a chunk-parsed version of the Frankfurter Rundschau Corpus.3 A cooccurrence frequency threshold of f ≥ 30 was applied, resulting in a candidate set of 5102 PP-verb pairs. The candidates were then ranked according to the scores assigned by four association measures: the log-likelihood ratio G2 (Dunning, 1993), Pearson's chi-squared statistic X2 (Manning and Schütze, 1999, 169-172), the t-score statistic t (Church et al., 1991), and mere cooccurrence frequency f.4 TPs were identified according to the definition of Krenn (2000). The graphs in Figure 1 show the precision achieved by these measures, for n ranging from 100 to 2000 (lists with n < 100 were omitted because the graphs become highly unstable for small n). The baseline precision of 11.09% corresponds to a random selection of n candidates. [Footnote 3: ... newspaper corpus, comprising ca. 40 million words of text. It is part of the ECI Multilingual Corpus 1 distributed by ELSNET. For this experiment, the corpus was annotated with the partial parser YAC (Kermes, 2003).] [Footnote 4: See Evert (2004) for detailed information about these association measures, as well as many further alternatives.]</Paragraph> <Paragraph position="6"> [Figure 1: precision graphs when German PP-verb collocations are ranked by four different association measures.]</Paragraph> <Paragraph position="7"> From Figure 1, we can see that G2 and t are the most useful ranking methods, t being marginally better for n ≤ 800 and G2 for n ≥ 1500. Both measures are by far superior to frequency-based ranking. The evaluation results also confirm the argument of Dunning (1993), who suggested G2 as a more robust alternative to X2. Such results cannot be taken at face value, though, as they may simply be due to chance. When two equally useful ranking methods are compared, method A might just happen to perform better in a particular experiment, with B taking the lead in a repetition of the experiment under similar conditions. The causes of such random variation include the source material from which the candidates are extracted (what if a slightly different source had been used?), noise introduced by automatic pre-processing and extraction tools, and the uncertainty of human annotators manifested in varying degrees of inter-annotator agreement.</Paragraph>
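As a purely illustrative sketch of how such n-best precision graphs might be computed, the following R code ranks a simulated stand-in candidate set (5102 candidates with roughly the 11% TP rate mentioned above, not the actual PP-verb data) and evaluates every n-best list.

```r
# Simulated stand-in for the annotated candidate set: one association score
# per candidate plus a TP flag (roughly matching the 11.09% baseline).
set.seed(42)
candidates <- data.frame(score = rnorm(5102),
                         tp    = rbinom(5102, 1, 0.1109))

# Rank by decreasing score; precision of the n-best list for every n.
ranked    <- candidates[order(candidates$score, decreasing = TRUE), ]
precision <- cumsum(ranked$tp) / seq_along(ranked$tp)

precision[c(100, 500, 1000, 2000)]   # precision of selected n-best lists
mean(candidates$tp)                  # baseline: random selection
```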
<Paragraph position="8"> Most researchers understand the necessity of testing whether their results are statistically significant, but it is fairly unclear which tests are appropriate. For instance, Krenn (2000) applies the standard chi-squared test to her comparative evaluation of collocation extraction methods. She is aware, though, that this test assumes independent samples and is hardly suitable for different ranking methods applied to the same candidate set: Krenn and Evert (2001) suggest several alternative tests for related samples. A wide range of exact and asymptotic tests as well as computationally intensive randomisation tests (Yeh, 2000) are available and add to the confusion about an appropriate choice.</Paragraph> <Paragraph position="9"> The aim of this paper is to formulate a statistical model that interprets the evaluation of ranking methods as a random experiment. This model defines the degree to which evaluation results are affected by random variation, allowing us to derive appropriate significance tests. After formalising the evaluation procedure in Section 2, I recast the procedure as a random experiment and make the underlying assumptions explicit (Section 3.1). On the basis of this model, I develop significance tests for the precision of a single ranking method (Section 3.2) and for the comparison of two ranking methods (Section 3.3). The paper concludes with an empirical validation of the statistical model in Section 4.</Paragraph> <Paragraph position="10"> 2 A formal account of ranking methods and their evaluation </Paragraph> <Paragraph position="11"> In this section I present a formalisation of rankings and their evaluation, giving γ-acceptance sets a geometrical interpretation that is essential for the formulation of a statistical model in Section 3.</Paragraph> <Paragraph position="12"> The scores computed by a ranking method are based on certain features of the candidates. Each candidate can therefore be represented by its feature vector x ∈ Ω, where Ω is an abstract feature space. For all practical purposes, Ω can be equated with a subset of the (possibly high-dimensional) real Euclidean space ℝ^m. The complete set of candidates corresponds to a discrete subset C ⊆ Ω of the feature space.5 A ranking method is represented by a real-valued function g : Ω → ℝ on the feature space, called a scoring function (SF). In the following, I assume that there are no candidates with equal scores, and hence no ties in the rankings.6 [Footnote 5: More precisely, C is a multi-set because there may be multiple candidates with identical feature vectors. In order to simplify notation I assume that C is a proper subset of Ω, which can be enforced by adding a small amount of random jitter to the feature vectors of candidates.]</Paragraph> <Paragraph position="13"> The γ-acceptance set for a SF g contains all candidates x ∈ C with g(x) ≥ γ. In a geometrical interpretation, this condition is equivalent to x ∈ A_g(γ), where the region A_g(γ) := {x ∈ Ω : g(x) ≥ γ} is called the γ-acceptance region of g. The γ-acceptance set of g is then given by the intersection A_g(γ) ∩ C =: C_g(γ). The selection of an n-best list is based on the γ-acceptance region A_g(γ_g(n)) for a suitably chosen n-best threshold γ_g(n).7</Paragraph> <Paragraph position="14"> As an example, consider the collocation extraction task introduced in Section 1. The feature vector x associated with a collocation candidate represents the cooccurrence frequency information for this candidate: x = (O11, O12, O21, O22), where the Oij are the cell counts of a 2 x 2 contingency table (Evert, 2004). Therefore, we have a four-dimensional feature space Ω ⊆ ℝ^4, and each association measure defines a SF g : Ω → ℝ. The selection of collocation candidates is usually made in the form of an n-best list, but may also be based on a pre-defined threshold γ.8 For an evaluation in terms of precision and recall, the candidates in the set C are classified into true positives C_+ and false positives C_−.</Paragraph>
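To make the notion of a scoring function on the four cell counts concrete, the following R function implements one common textbook formulation of the log-likelihood ratio G2 (expected frequencies derived from the marginals); it is a sketch and not necessarily the exact implementation used for the experiments reported here.

```r
# One common textbook formulation of the log-likelihood ratio G2 as a
# scoring function on the four cell counts (O11, O12, O21, O22) of a
# 2 x 2 contingency table; expected frequencies come from the marginals.
g2_score <- function(O11, O12, O21, O22) {
  O  <- c(O11, O12, O21, O22)
  N  <- sum(O)
  rs <- c(O11 + O12, O21 + O22)                    # row sums
  cs <- c(O11 + O21, O12 + O22)                    # column sums
  E  <- c(rs[1] * cs[1], rs[1] * cs[2],
          rs[2] * cs[1], rs[2] * cs[2]) / N        # expected cell counts
  terms <- ifelse(O > 0, O * log(O / E), 0)        # convention: 0 * log 0 = 0
  2 * sum(terms)
}

# hypothetical candidate with cooccurrence frequency O11 = 30
g2_score(30, 1500, 2000, 40000)
```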
<Paragraph position="15"> The precision corresponding to an acceptance region A is then given by P_A := |C_+ ∩ A| / |C ∩ A| (1), i.e. the proportion of TPs among the accepted candidates. The precision achieved by a SF g with threshold γ is P_{C_g(γ)}. Note that the denominator in Eq. (1) reduces to n for an n-best list (i.e. γ = γ_g(n)), yielding the n-best precision P_{g,n}. Figure 1 shows graphs of P_{g,n} for 100 ≤ n ≤ 2000, for the SFs G2, X2, t, and f.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Evaluation as a random experiment </SectionTitle> <Paragraph position="0"> When an evaluation experiment is repeated, the results will not be exactly the same. There are many causes for such variation, including different source material used by the second experiment, changes in the tool settings, changes in the evaluation criteria, or the different intuitions of human annotators. Statistical significance tests are designed to account for a small fraction of this variation that is due to random effects, assuming that all parameters that may have a systematic influence on the evaluation results are kept constant. Thus, they provide a lower limit for the variation that has to be expected in an actual repetition of the experiment. Only when results are significant can we expect them to be reproducible, but even then a second experiment may draw a different picture.</Paragraph> <Paragraph position="1"> In particular, the influence of qualitatively different source material or different evaluation criteria can never be predicted by statistical means alone.</Paragraph> <Paragraph position="2"> In the example of the collocation extraction task, randomness is mainly introduced by the selection of a source corpus, e.g. the choice of one particular newspaper rather than another. Disagreement between human annotators and uncertainty about the interpretation of annotation guidelines may also lead to an element of randomness in the evaluation.</Paragraph> <Paragraph position="3"> However, even significant results cannot be generalised to a different type of collocation (such as adjective-noun instead of PP-verb), different evaluation criteria, a different domain or text type, or even a source corpus of different size, as the results of Krenn and Evert (2001) show.</Paragraph> <Paragraph position="4"> A first step in the search for an appropriate significance test is to formulate a (plausible) model for random variation in the evaluation results. Because of the inherent randomness, every repetition of an evaluation experiment under similar conditions will lead to different candidate sets C_+ and C_−. Some elements will be entirely new candidates, sometimes the same candidate appears with a different feature vector (and thus represented by a different point x ∈ Ω), and sometimes a candidate that was annotated as a TP in one experiment may be annotated as a FP in the next. In order to encapsulate all three kinds of variation, let us assume that C_+ and C_− are randomly selected from a large set of hypothetical possibilities (where each candidate corresponds to many different possibilities with different feature vectors, some of which may be TPs and some FPs).</Paragraph> <Paragraph position="5"> For any acceptance region A, both the number of TPs in A, T_A := |C_+ ∩ A|, and the number of FPs in A, F_A := |C_− ∩ A|, are thus random variables.</Paragraph>
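A toy simulation may help to make this random-experiment view tangible; every quantity below (pool size, inclusion probabilities, average precision) is invented, and each run stands for one hypothetical repetition of the evaluation.

```r
# Toy simulation of repeated evaluation experiments: each run draws a fresh
# candidate set from a large hypothetical pool and recounts TPs and FPs in a
# fixed acceptance region A (all sizes and probabilities are invented).
set.seed(1)
simulate_run <- function(pool_size = 50000, p_candidate = 0.1,
                         p_in_A = 0.05, avg_precision = 0.4) {
  n_cand <- rbinom(1, pool_size, p_candidate)  # size of the candidate set C
  N_A    <- rbinom(1, n_cand, p_in_A)          # candidates falling into A
  T_A    <- rbinom(1, N_A, avg_precision)      # TPs among them
  c(N_A = N_A, T_A = T_A, F_A = N_A - T_A, P_A = T_A / N_A)
}

runs <- t(replicate(1000, simulate_run()))
summary(runs[, "P_A"])   # the observed precision P_A varies from run to run
```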
<Paragraph position="6"> We do not know their precise distributions, but it is reasonable to assume that (i) T_A and F_A are always independent and (ii) T_A and T_B (as well as F_A and F_B) are independent for any two disjoint regions A and B. Note that T_A and T_B cannot be independent for A ∩ B ≠ ∅ because they include the same number of TPs from the region A ∩ B. The total number of candidates in the region A is also a random variable N_A := T_A + F_A, and the same holds for the precision P_A, which can now be written as P_A = T_A / N_A. [Footnote 9: In the definition of the n-best precision P_{g,n}, i.e. for A = C_g(γ_g(n)), the number of candidates in A is constant: N_A = n. At first sight, this may seem to be inconsistent with the interpretation of N_A as a random variable. However, one has to keep in mind that γ_g(n), which is determined from the candidate set C, is itself a random variable. Consequently, A is not a fixed acceptance region and its variation counter-balances that of N_A.]</Paragraph> <Paragraph position="7"> Following the standard approach, we may now assume that P_A approximately follows a normal distribution with mean π_A and variance σ²_A, i.e. P_A ~ N(π_A, σ²_A). The mean π_A can be interpreted as the average precision of the acceptance region A (obtained by averaging over many repetitions of the evaluation experiment). However, there are two problems with this assumption. First, while P_A is an unbiased estimator for π_A, the variance σ²_A cannot be estimated from a single experiment.10 Second, P_A is a discrete variable because both T_A and N_A are non-negative integers. When the number of candidates N_A is small (as in Section 3.3), approximating the distribution of P_A by a continuous normal distribution will not be valid. [Footnote 10: Sometimes, cross-validation is used to estimate the variability of evaluation results. While this method is appropriate e.g. for machine learning and classification tasks, it is not useful for the evaluation of ranking methods. Since the cross-validation would have to be based on random samples from a single candidate set, it would not be able to tell us anything about random variation between different candidate sets.]</Paragraph> <Paragraph position="8"> It is reasonable to assume that the distribution of N_A does not depend on the average precision π_A. In this case, N_A is called an ancillary statistic and can be eliminated without loss of information by conditioning on its observed value (see Lehmann (1991, 542ff) for a formal definition of ancillary statistics and the merits of conditional inference). Instead of probabilities P(P_A) we will now consider the conditional probabilities P(P_A | N_A). Because N_A is fixed to the observed value, P_A is proportional to T_A and the conditional probabilities are equivalent to P(T_A | N_A). When we choose one of the N_A candidates at random, the probability that it is a TP (averaged over many repetitions of the experiment) should be equal to the average precision π_A. Consequently, P(T_A | N_A) should follow a binomial distribution with success probability π_A, i.e. P(T_A = k | N_A) = (N_A choose k) · π_A^k · (1 − π_A)^(N_A − k) (2) for k = 0, ..., N_A.</Paragraph>
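A brief numerical sketch of what Eq. (2) asserts, with invented values for N_A and π_A:

```r
# Conditional distribution of T_A given N_A = 500 under Eq. (2), for an
# assumed average precision pi_A = 0.4 (both numbers are invented).
N_A  <- 500
pi_A <- 0.4
k    <- 0:N_A
p_k  <- dbinom(k, size = N_A, prob = pi_A)   # P(T_A = k | N_A)

sum(p_k)             # the probabilities sum to 1 over k = 0, ..., N_A
sum(k * p_k) / N_A   # expected precision E[T_A | N_A] / N_A equals pi_A
```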
<Paragraph position="9"> We can now make inferences about the average precision π_A based on this binomial distribution.11 As a second step in our search for an appropriate significance test, it is essential to understand exactly what question this test should address: What does it mean for an evaluation result (or result difference) to be significant? In fact, two different questions can be asked: [Footnote 11: Note that some of the assumptions leading to Eq. (2) are far from self-evident. As an example, (2) tacitly assumes that the success probability is equal to π_A regardless of the particular value of N_A on which the distribution is conditioned, which need not be the case. Therefore, an empirical validation is necessary (see Section 4).]</Paragraph> <Paragraph position="10"> A: If we repeat an evaluation experiment under the same conditions, to what extent will the observed precision values vary? This question is addressed in Section 3.2.</Paragraph> <Paragraph position="11"> B: If we repeat an evaluation experiment under the same conditions, will method A again perform better than method B? This question is addressed in Section 3.3.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 The stability of evaluation results </SectionTitle> <Paragraph position="0"> Question A can be rephrased in the following way: How much does the observed precision value for an acceptance region A differ from the true average precision π_A? In other words, our goal here is to make inferences about π_A, for a given SF g and threshold γ. From Eq. (2), we obtain a binomial confidence interval for the true value π_A, given the observed values of T_A and N_A (Lehmann, 1991, 89ff). Using the customary 95% confidence level, π_A should be contained in the estimated interval in all but one out of twenty repetitions of the experiment. Binomial confidence intervals can easily be computed with standard software packages such as R (R Development Core Team, 2003). As an example, assume that an observed precision of P_A = 40% is based on T_A = 200 TPs out of N_A = 500 accepted candidates. Precision graphs such as those in Figure 1 display P_A as a maximum-likelihood estimate for π_A, but its true value may range from 35.7% to 44.4% (with 95% confidence).12 [Footnote 12: This confidence interval was computed with the R command binom.test(200,500).]</Paragraph> <Paragraph position="1"> Figure 2 shows binomial confidence intervals for the association measures G2 and X2 as shaded regions around the precision graphs. It is obvious that a repetition of the evaluation experiment may lead to quite different precision values, especially for n < 1000. In other words, there is a considerable amount of uncertainty in the evaluation results for each individual measure. However, we can be confident that both ranking methods offer a substantial improvement over the baseline.</Paragraph> <Paragraph position="2"> For an evaluation based on n-best lists (as in the collocation extraction example), it has to be noted that the confidence intervals are estimates for the average precision π_A of a fixed γ-acceptance region (with γ = γ_g(n) computed from the observed candidate set). While this region contains exactly N_A = n candidates in the current evaluation, N_A may be different from n when the experiment is repeated. Consequently, π_A is not necessarily identical to the average precision of n-best lists.</Paragraph> </Section>
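The interval quoted above can be reproduced with the base R command mentioned in footnote 12; the snippet below merely unpacks its result.

```r
# Binomial confidence interval for pi_A, based on T_A = 200 TPs out of
# N_A = 500 accepted candidates (the example discussed above).
ci <- binom.test(200, 500, conf.level = 0.95)
ci$estimate   # observed precision P_A = 0.4
ci$conf.int   # approx. 0.357 to 0.444, i.e. 35.7% to 44.4%
```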
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 The comparison of ranking methods </SectionTitle> <Paragraph position="0"> Question B can be rephrased in the following way: Does the SF g1 on average achieve higher precision than the SF g2? (This question is normally asked when g1 performed better than g2 in the evaluation.) In other words, our goal is to test whether π_A > π_B for given acceptance regions A of g1 and B of g2.</Paragraph> <Paragraph position="1"> The confidence intervals obtained for two SFs g1 and g2 will often overlap (cf. Figure 2, where the confidence intervals of G2 and X2 overlap for all list sizes n), suggesting that there is no significant difference between the two ranking methods. Both observed precision values are consistent with an average precision π_A = π_B in the region of overlap, so that the observed differences may be due to random variation in opposite directions. However, this conclusion is premature because the two rankings are not independent. Therefore, the observed precision values of g1 and g2 will tend to vary in the same direction, the degree of correlation being determined by the amount of overlap between the two rankings. Given acceptance regions A := A_{g1}(γ1) and B := A_{g2}(γ2), both SFs make the same decision for any candidate in the intersection A ∩ B (both SFs accept) and in the "complement" Ω \ (A ∪ B) (both SFs reject). Therefore, the performance of g1 and g2 can only differ in the regions D1 := A \ B (g1 accepts, but g2 rejects) and D2 := B \ A (vice versa).</Paragraph> <Paragraph position="2"> Correspondingly, the counts T_A and T_B are correlated because they include the same number of TPs from the region A ∩ B (namely, the set C_+ ∩ A ∩ B).</Paragraph> <Paragraph position="3"> Indisputably, g1 is a better ranking method than g2 iff π_D1 > π_D2, and vice versa.13 Our goal is thus to test the null hypothesis H0 : π_D1 = π_D2 on the basis of the binomial distributions P(T_D1 | N_D1) and P(T_D2 | N_D2). I assume that these distributions are independent because D1 ∩ D2 = ∅ (cf. Section 3.1). The number of candidates in the difference regions, N_D1 and N_D2, may be small, especially for acceptance regions with large overlap (this was one of the reasons for using conditional inference rather than a normal approximation in Section 3.1). Therefore, it is advisable to use Fisher's exact test (Agresti, 1990, 60-66) instead of an asymptotic test that relies on large-sample approximations. The data for Fisher's test consist of a 2 x 2 contingency table with columns (T_D1, F_D1) and (T_D2, F_D2). Note that a two-sided test is called for because there is no a priori reason to assume that g1 is better than g2 (or vice versa). Although the implementation of a two-sided Fisher's test is not trivial, it is available in software packages such as R. [Footnote 13: Note that π_D1 > π_D2 does not necessarily entail π_A > π_B if N_A and N_B are vastly different and π_{A∩B} ≫ π_{D_i}. In this case, the winner will always be the SF that accepts the smaller number of candidates (because the additional candidates only serve to lower the precision achieved in A ∩ B). This example shows that it is "unfair" to compare acceptance sets of (substantially) different sizes just in terms of their over-all precision. Evaluation should therefore either be based on n-best lists or needs to take recall into account.]</Paragraph>
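To sketch how the data for this test might be assembled, the following R code simulates two scoring functions over the same annotated candidates (all scores and TP labels are invented), forms their n-best acceptance sets, and applies Fisher's exact test to the TP/FP counts in the difference regions.

```r
# Simulated comparison of two scoring functions g1, g2 on the same
# annotated candidate set (all scores and TP labels are invented).
set.seed(7)
n_cand <- 5000; n <- 1000
tp <- rbinom(n_cand, 1, 0.11) == 1           # hypothetical TP annotation
g1 <- rnorm(n_cand) + 0.5 * tp               # invented scores for g1
g2 <- rnorm(n_cand) + 0.4 * tp               # invented scores for g2

A  <- order(g1, decreasing = TRUE)[1:n]      # n-best acceptance set of g1
B  <- order(g2, decreasing = TRUE)[1:n]      # n-best acceptance set of g2
D1 <- setdiff(A, B)                          # accepted by g1 only
D2 <- setdiff(B, A)                          # accepted by g2 only

# 2 x 2 table with columns (T_D1, F_D1) and (T_D2, F_D2)
tab <- cbind(D1 = c(TP = sum(tp[D1]), FP = sum(!tp[D1])),
             D2 = c(TP = sum(tp[D2]), FP = sum(!tp[D2])))
fisher.test(tab)                             # two-sided by default
```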
<Paragraph position="4"> Figure 3 shows the same precision graphs as Figure 2. Significant differences between the G2 and X2 measures according to Fisher's test (at a 95% confidence level) are marked by grey triangles. Contrary to what the confidence intervals in Figure 2 suggested, the observed differences turn out to be significant for all n-best lists up to n = 1250 (marked by a thin vertical line).</Paragraph> </Section> </Section> </Paper>