<?xml version="1.0" standalone="yes"?> <Paper uid="J93-1003"> <Title>Accurate Methods for the Statistics of Surprise and Coincidence</Title> <Section position="3" start_page="0" end_page="62" type="metho"> <SectionTitle> 2. The Assumption of Normality </SectionTitle> <Paragraph position="0"> The assumption that simple functions of the random variables being sampled are distributed normally or approximately normally underlies many common statistical tests, notably Pearson's χ² test and z-score tests. In many cases this assumption is valid, and the simplification it affords makes its use justifiable even in marginal cases.</Paragraph> <Paragraph position="1"> When the rates of occurrence of rare events are compared, however, the assumptions on which these tests are based break down, because texts are composed largely of such rare events. For example, simple word counts made on a moderate-sized corpus show that words having a frequency of less than one in 50,000 words make up about 20-30% of typical English-language news-wire reports. This 'rare' quarter of English includes many of the content-bearing words and nearly all the technical jargon. As an illustration, the following is a random selection of approximately 0.2% of the words found at least once but fewer than five times in a sample of a half million words of Reuters' reports.</Paragraph> <Paragraph position="2"> identities query villas inappropriate redoubtable watchful inflamed remark winter instilling resignations intruded ruin unction scant poi The only word in this list that is in the least obscure is poi (a native Hawaiian dish made from taro root). If we were to sample 50,000 words instead of the half million used to create the list above, then the expected number of occurrences of any of the words in this list would be less than one half, well below the point where commonly used tests are applicable.</Paragraph> <Paragraph position="3"> If such ordinary words are 'rare,' any statistical work with texts must deal with the reality of rare events. It is interesting that while most of the words in running text are common ones, most of the words in the total vocabulary are rare. Unfortunately, the foundational assumption of most common statistical analyses used in computational linguistics is that the events being analyzed are relatively common. For a sample of 50,000 words from the Reuters' corpus mentioned previously, none of the words in the list above is common enough to expect such analyses to work well.</Paragraph>
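Such counts are straightforward to reproduce. The following minimal sketch measures the fraction of running text made up of words rarer than one in 50,000; the file name corpus.txt and the naive whitespace tokenization are illustrative assumptions only.

    # Estimate what fraction of running text consists of 'rare' words,
    # i.e. words whose overall frequency is below one per 50,000 tokens.
    # 'corpus.txt' is a placeholder name; tokenization here is crude.
    from collections import Counter

    with open("corpus.txt", encoding="utf-8") as f:
        tokens = f.read().lower().split()

    counts = Counter(tokens)
    threshold = len(tokens) / 50000.0   # count corresponding to 1 per 50,000
    rare = sum(c for c in counts.values() if c < threshold)
    print(f"rare fraction of running text: {rare / len(tokens):.1%}")

A count along these lines, run on news-wire text, is the kind of measurement behind the 20-30% figure quoted above.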
</Section> <Section position="4" start_page="62" end_page="64" type="metho"> <SectionTitle> 3. The Tradition of Chi-Squared Tests </SectionTitle> <Paragraph position="0"> The statistically based measures that have been used in text analysis have usually been based on test statistics that are useful because, given certain assumptions, they have a known distribution, most commonly the normal or χ² distribution. These measures are very useful and can be used to assess significance accurately in a number of different settings. They are based, however, on several assumptions that do not hold for most textual analyses.</Paragraph> <Paragraph position="1"> The details of how and why the assumptions behind these measures fail to hold are of interest primarily to the statistician, but the result is of interest to the statistical consumer (in our case, somebody interested in counting words). More applicable techniques are important in textual analysis. The next section describes one such technique; implementation of this technique is described in later sections.</Paragraph> <SectionTitle> 4. Binomial Distributions </SectionTitle> <Paragraph position="2"> Binomial distributions arise commonly in statistical analysis when the data to be analyzed are derived by counting the number of positive outcomes of repeated identical and independent experiments. Flipping a coin is the prototypical experiment of this sort.</Paragraph> <Paragraph position="3"> The task of counting words can be cast into the form of a repeated sequence of such binary trials by comparing each word in a text with the word being counted. These comparisons can be viewed as a sequence of binary experiments similar to coin flipping. In text, each comparison is clearly not independent of all the others, but the dependency falls off rapidly with distance. Another assumption that works relatively well in practice is that the probability of seeing a particular word does not vary. Of course, this is not really true, since changes in topic may cause this frequency to vary. Indeed, it is the mild failure of this assumption that makes shallow information retrieval techniques possible.</Paragraph> <Paragraph position="4"> To the extent that these assumptions of independence and stationarity are valid, we can switch to an abstract discourse concerning Bernoulli trials instead of words in text, and a number of standard results can be used. A Bernoulli trial is the statistical idealization of a coin flip in which there is a fixed probability of a successful outcome that does not vary from flip to flip.</Paragraph> <Paragraph position="5"> In particular, if the actual probability that the next word matches a prototype is p, then the number of matches generated in the next n words is a random variable K with binomial distribution</Paragraph> <Paragraph position="6"> $$P(K = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$ </Paragraph> <Paragraph position="7"> whose mean is np and whose variance is np(1 - p). If np(1 - p) > 5, then the distribution of this variable will be approximately normal, and as np(1 - p) increases beyond that point, the distribution becomes more and more like a normal distribution. This can be seen in Figure 1, where the binomial distribution (dashed lines) is plotted along with the approximating normal distributions (solid lines) for np set to 5, 10, and 20, with n fixed at 100. Larger values of n with np held constant give curves that are not visibly different from those shown. For these cases, np ≈ np(1 - p).</Paragraph> <Paragraph position="8"> This agreement between the binomial and normal distributions is exactly what makes test statistics based on assumptions of normality so useful in the analysis of experiments based on counting. In the case of the binomial distribution, normality assumptions are generally considered to hold well enough when np(1 - p) > 5.</Paragraph> <Paragraph position="9"> The situation is different when np(1 - p) is less than 5, and is dramatically different when np(1 - p) is less than 1. First, it makes much less sense to approximate a discrete distribution such as the binomial with a continuous distribution such as the normal. Second, the probabilities computed using the normal approximation become less and less accurate.</Paragraph> <Paragraph position="10"> Table 1 shows the probability that one or more matches are found in 100 words of text as computed using the binomial and normal distributions for np = 0.001, np = 0.01, np = 0.1, and np = 1, where n = 100. Most words are sufficiently rare that even for samples of text where n is as large as several thousand, np will be at the bottom of this range. Short phrases are so numerous that np << 1 for almost all phrases even when n is as large as several million.</Paragraph>
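The entries of Table 1 are easy to check numerically. The following sketch (assuming scipy for the distribution functions) compares the exact binomial tail P(K >= 1) with its normal approximation for the values of np just listed:

    # P(one or more matches in n = 100 words), exactly and under the
    # normal approximation, for np = 0.001, 0.01, 0.1, and 1.
    from math import sqrt
    from scipy.stats import binom, norm

    n = 100
    for np_ in (0.001, 0.01, 0.1, 1.0):
        p = np_ / n
        exact = binom.sf(0, n, p)                   # exact binomial P(K >= 1)
        mu, sigma = n * p, sqrt(n * p * (1 - p))
        approx = norm.sf(1, loc=mu, scale=sigma)    # normal P(K >= 1)
        print(f"np = {np_:g}: binomial {exact:.3e}, normal {approx:.3e}")

For np = 1 the two values are reasonably close (about 0.63 versus 0.5), but for np = 0.1 the normal tail is smaller than the exact tail by a factor of about 40, and for np = 0.01 by a factor of about 4 × 10^20, which is the source of the overestimates of significance discussed next.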
<Paragraph position="11"> Table 1 shows that for rare events, the normal distribution does not even approximate the binomial distribution. In fact, for np = 0.1 and n = 100, using the normal distribution overestimates the significance of one or more occurrences by a factor of 40, while for np = 0.01, using the normal distribution overestimates the significance by about 4 × 10^20. When n increases beyond 100, the numbers in the table do not change significantly.</Paragraph> <Paragraph position="12"> If this overestimation were constant, then the estimates using normal distributions could be corrected and would still be useful, but the fact that the errors are not constant means that methods dependent on the normal approximation should not be used to analyze Bernoulli trials where the probability of positive outcome is very small. Yet, in many real analyses of text, comparing cases where np = 0.001 with cases where np > 1 is a common problem.</Paragraph> </Section> <Section position="5" start_page="64" end_page="70" type="metho"> <SectionTitle> 5. Likelihood Ratio Tests </SectionTitle> <Paragraph position="0"> There is another class of tests that do not depend so critically on assumptions of normality. Instead they use the asymptotic distribution of the generalized likelihood ratio. For text analysis and similar problems, the use of likelihood ratios leads to very much improved statistical results. The practical effect of this improvement is that statistical textual analysis can be done effectively with very much smaller volumes of text than is necessary for conventional tests based on assumed normal distributions, and it allows comparisons to be made between the significance of the occurrences of both rare and common phenomena.</Paragraph> <Section position="1" start_page="65" end_page="65" type="sub_section"> <SectionTitle> 5.1 Parameter Spaces and Likelihood Functions </SectionTitle> <Paragraph position="0"> Likelihood ratio tests are based on the idea that statistical hypotheses can be said to specify subspaces of the space described by the unknown parameters of the statistical model being used. These tests assume that the model is known, but that the parameters of the model are unknown. Such a test is called parametric. Other tests are available that make no assumptions about the underlying model at all; they are called distribution-free. Only one particular parametric test is described here. More information on parametric and distribution-free tests is available in Bradley (1968) and Mood, Graybill, and Boes (1974).</Paragraph> <Paragraph position="1"> The probability that a given experimental outcome described by $k_1, \ldots, k_m$ will be observed for a given model described by the parameters $p_1, p_2, \ldots$ is called the likelihood function for the model and is written as $H(p_1, p_2, \ldots; k_1, \ldots, k_m)$, where all arguments of H left of the semicolon are model parameters, and all arguments right of the semicolon are observed values. In the continuous case, the probability is replaced by a probability density. With binomials and multinomials, we deal only with the discrete case.</Paragraph>
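To make the likelihood function concrete, here is a small sketch that evaluates the binomial log likelihood on a grid of parameter values and locates its maximum, which falls at p = k/n; the particular values of k and n are arbitrary examples:

    # Binomial log likelihood log H(p; n, k), dropping the additive
    # term log C(n, k), which does not depend on p.
    from math import log

    def log_likelihood(p, k, n):
        return k * log(p) + (n - k) * log(1 - p)

    k, n = 3, 100
    grid = [i / 1000.0 for i in range(1, 1000)]
    best = max(grid, key=lambda p: log_likelihood(p, k, n))
    print(best)   # approximately k / n = 0.03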
<Paragraph position="2"> For repeated Bernoulli trials, m = 2 because we observe both the number of trials and the number of positive outcomes, and there is only one parameter p. The explicit form for the likelihood function is</Paragraph> <Paragraph position="3"> $$H(p; n, k) = \binom{n}{k} p^k (1 - p)^{n - k}$$ </Paragraph> <Paragraph position="4"> The parameter space is the set of all values for p, and the hypothesis that p = p_0 is a single point. For notational brevity the model parameters can be collected into a single parameter ω, as can the observed values. Then the likelihood function is written as H(ω; k), where ω is considered to be a point in the parameter space Ω, and k a point in the space of observations K. Particular hypotheses or observations are represented by subscripting Ω or K respectively.</Paragraph> <Paragraph position="5"> More information about likelihood ratio tests can be found in texts on theoretical statistics (Mood et al. 1974).</Paragraph> </Section> <Section position="2" start_page="65" end_page="66" type="sub_section"> <SectionTitle> 5.2 The Likelihood Ratio </SectionTitle> <Paragraph position="0"> The likelihood ratio for a hypothesis is the ratio of the maximum value of the likelihood function over the subspace represented by the hypothesis to the maximum value of the likelihood function over the entire parameter space. That is, $$\lambda = \frac{\max_{\omega \in \Omega_0} H(\omega; k)}{\max_{\omega \in \Omega} H(\omega; k)}$$ where Ω is the entire parameter space and Ω_0 is the particular hypothesis being tested. The particularly important feature of likelihood ratios is that the quantity -2 log λ is asymptotically χ² distributed with degrees of freedom equal to the difference in dimension between Ω and Ω_0. Importantly, this asymptote is approached very quickly in the case of binomial and multinomial distributions.</Paragraph> </Section> <Section position="3" start_page="66" end_page="68" type="sub_section"> <SectionTitle> 5.3 Likelihood Ratio for Binomial and Multinomial Distributions </SectionTitle> <Paragraph position="0"> The comparison of two binomial or multinomial processes can be done rather easily using likelihood ratios. In the case of two binomial distributions, $$H(p_1, p_2; k_1, n_1, k_2, n_2) = p_1^{k_1} (1 - p_1)^{n_1 - k_1} \binom{n_1}{k_1} p_2^{k_2} (1 - p_2)^{n_2 - k_2} \binom{n_2}{k_2}$$ The hypothesis that the two distributions have the same underlying parameter is represented by the set $\{(p_1, p_2) \mid p_1 = p_2\}$.</Paragraph> <Paragraph position="1"> The likelihood ratio for this test is $$\lambda = \frac{\max_p H(p, p; k_1, n_1, k_2, n_2)}{\max_{p_1, p_2} H(p_1, p_2; k_1, n_1, k_2, n_2)}$$ These maxima are achieved with $p_1 = k_1/n_1$ and $p_2 = k_2/n_2$ for the denominator, and $p = (k_1 + k_2)/(n_1 + n_2)$ for the numerator. This reduces the ratio to $$\lambda = \frac{L(p, k_1, n_1) L(p, k_2, n_2)}{L(p_1, k_1, n_1) L(p_2, k_2, n_2)}$$ where $L(p, k, n) = p^k (1 - p)^{n - k}$.</Paragraph> <Paragraph position="2"> Taking the logarithm of the likelihood ratio gives $$-2 \log \lambda = 2 \left[ \log L(p_1, k_1, n_1) + \log L(p_2, k_2, n_2) - \log L(p, k_1, n_1) - \log L(p, k_2, n_2) \right]$$ </Paragraph> <Paragraph position="3"> For the multinomial case, it is convenient to use double subscripts, writing $k_{ij}$ for the count of outcome j in process i, with the abbreviations $n_i = \sum_j k_{ij}$, so that we can write $$H(p_{11}, p_{12}, \ldots; k_{11}, k_{12}, \ldots) \propto \prod_i \prod_j p_{ij}^{k_{ij}}$$ The likelihood ratio is $$\lambda = \frac{\max_{p_j} \prod_{ij} p_j^{k_{ij}}}{\max_{p_{ij}} \prod_{ij} p_{ij}^{k_{ij}}}$$ This expression implicitly involves the $n_i$ because $\sum_j k_{ij} = n_i$. Maximizing and taking the logarithm, $$-2 \log \lambda = 2 \sum_{ij} k_{ij} \left( \log \frac{k_{ij}}{n_i} - \log \frac{k_{1j} + k_{2j}}{n_1 + n_2} \right)$$ If the null hypothesis holds, then the log-likelihood ratio is asymptotically χ² distributed with degrees of freedom equal to one less than the number of outcome types j. When j is 2 (the binomial), -2 log λ will be χ² distributed with one degree of freedom.</Paragraph>
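This expression translates directly into a few lines of code. The following sketch computes -2 log λ for two binomial samples; the function names are our own, and the convention 0 log 0 = 0 is used so that counts of zero are handled:

    # -2 log lambda for comparing two binomial proportions k1/n1 and k2/n2.
    from math import log

    def log_L(p, k, n):
        # log of p^k (1 - p)^(n - k), using the convention 0 * log 0 = 0.
        ll = 0.0
        if k > 0:
            ll += k * log(p)
        if n - k > 0:
            ll += (n - k) * log(1 - p)
        return ll

    def llr(k1, n1, k2, n2):
        p1, p2 = k1 / n1, k2 / n2
        p = (k1 + k2) / (n1 + n2)
        return 2 * (log_L(p1, k1, n1) + log_L(p2, k2, n2)
                    - log_L(p, k1, n1) - log_L(p, k2, n2))

    print(llr(110, 1000, 90, 1000))   # two samples with pooled p = 0.1

Under the null hypothesis the resulting statistic can be referred to a χ² distribution with one degree of freedom.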
<Paragraph position="4"> If we had initially approximated the binomial distribution with a normal distribution with mean np and variance np(1 - p), then we would have arrived at another form that is a good approximation of -2 log λ when np(1 - p) is more than roughly 5. This form is $$-2 \log \lambda \approx \sum_i \frac{(k_i - n_i p)^2}{n_i p (1 - p)}$$ where $p = (k_1 + k_2)/(n_1 + n_2)$ as in the multinomial case above and $p_i = k_i/n_i$. Interestingly, this expression is exactly the test statistic for Pearson's χ² test, although the form shown is not quite the customary one. Figure 2 shows the reasonably good agreement between this expression and the exact binomial log-likelihood ratio derived earlier, where p = 0.1 and n_1 = n_2 = 1000, for various values of k_1 and k_2.</Paragraph> <Paragraph position="5"> Figure 3, on the other hand, shows the divergence between Pearson's statistic and the log-likelihood ratio when p = 0.01, n_1 = 100, and n_2 = 10000. Note the large change of scale on the vertical axis. The pronounced disparity occurs when k_1 is larger than the value expected based on the observed value of k_2. The case where n_1 < n_2 and k_1/n_1 > k_2/n_2 is exactly the case of most interest in many text analyses.</Paragraph> <Paragraph position="6"> The convergence of the log of the likelihood ratio to the asymptotic distribution is demonstrated dramatically in Figure 4. In this figure, the straighter line was computed using a symbolic algebra package and represents the idealized one-degree-of-freedom cumulative χ² distribution. The rougher curve was computed by a numerical experiment in which p = 0.01, n_1 = 100, and n_2 = 10000, which corresponds to the situation in Figure 3. The close agreement shows that the likelihood ratio measure produces accurate results over six decades of significance, even in the range where the normal χ² measure diverges radically from the ideal.</Paragraph> </Section> <Section position="4" start_page="68" end_page="70" type="sub_section"> <SectionTitle> 6.1 Bigram Analysis of a Small Text </SectionTitle> <Paragraph position="0"> To test the efficacy of the likelihood methods, an analysis was made of a 30,000-word sample of text obtained from the Union Bank of Switzerland, with the intention of finding pairs of words that occurred next to each other with a significantly higher frequency than would be expected based on the word frequencies alone. The text was 31,777 words of financial prose largely describing market conditions for 1986 and 1987. The results of such a bigram analysis should highlight collocations common in English as well as collocations peculiar to the financial nature of the analyzed text. As will be seen, the ranking based on likelihood ratio tests does exactly this. Similar comparisons made between a large corpus of general text and a domain-specific text can be used to produce lists consisting only of words and bigrams characteristic of the domain-specific texts.</Paragraph> <Paragraph position="1"> This comparison was done by creating a contingency table that contained the following counts for each bigram A B that appeared in the text:</Paragraph> <Paragraph position="2"> $$\begin{array}{cc} k(A\,B) & k(\neg A\,B) \\ k(A\,\neg B) & k(\neg A\,\neg B) \end{array}$$ </Paragraph> <Paragraph position="3"> where ¬A B represents a bigram in which the first word is not word A and the second is word B.</Paragraph> <Paragraph position="4"> If the words A and B occur independently, then we would expect p(AB) = p(A)p(B), where p(AB) is the probability of A and B occurring in sequence, p(A) is the probability of A appearing in the first position, and p(B) is the probability of B appearing in the second position. We can cast this into the mold of our earlier binomial analysis by phrasing the null hypothesis that A and B are independent as p(A | B) = p(A | ¬B) = p(A). This means that testing for the independence of A and B can be done by testing to see if the distribution of A given that B is present (the first row of the table) is the same as the distribution of A given that B is not present (the second row of the table). In fact, of course, we are not really doing a statistical test to see if A and B are independent; we know that they are generally not independent in text. Instead we just want to use the test statistic as a measure that will help highlight particular As and Bs that are highly associated in text.</Paragraph>
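In code, the whole analysis can be sketched as follows, reusing the llr function from Section 5.3 above; the tokenization is again a deliberately crude placeholder:

    # Rank adjacent word pairs by -2 log lambda, using the contingency
    # counts described above. Assumes the llr() function sketched earlier.
    from collections import Counter

    def rank_bigrams(words):
        pairs = list(zip(words, words[1:]))
        pair_counts = Counter(pairs)
        first = Counter(a for a, b in pairs)    # counts of words in first position
        second = Counter(b for a, b in pairs)   # counts of words in second position
        n = len(pairs)
        ranked = []
        for (a, b), k_ab in pair_counts.items():
            k_nab = second[b] - k_ab                    # not-A B
            k_anb = first[a] - k_ab                     # A not-B
            k_nanb = n - first[a] - second[b] + k_ab    # not-A not-B
            score = llr(k_ab, k_ab + k_nab, k_anb, k_anb + k_nanb)
            ranked.append((score, k_ab, k_nab, k_anb, k_nanb, a, b))
        return sorted(ranked, reverse=True)

The two rows of the contingency table become the two binomial samples: A's rate of occurrence in bigrams where B is present is compared with A's rate in bigrams where B is absent.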
<Paragraph position="5"> These counts were analyzed using the test for binomials described earlier, and the most significant are tabulated in Table 2. This table contains the most significant 200 bigrams and is reverse sorted by the first column, which contains the quantity -2 log λ. Other columns contain the four counts from the contingency table described above, and the bigram itself.</Paragraph> <Paragraph position="6"> Examination of the table shows that there is good correlation with intuitive feelings about how natural the bigrams in the table actually are. This is in distinct contrast with Table 3, which contains the same data except that the first column is computed using Pearson's χ² test statistic. The overestimation of the significance of items that occur only a few times is dramatic: the entire first portion of the table is dominated by bigrams rare enough to occur only once in the current sample of text. The misspelling in the bigram 'sees posibilities' is in the original text.</Paragraph> <Paragraph position="7"> Out of 2693 bigrams analyzed, 2682 of them fall outside the scope of applicability of the normal χ² test. The 11 bigrams that were suitable for analysis with the χ² test are listed in Table 4. It is notable that all of these bigrams contain the word the, which is the most common word in English.</Paragraph> </Section> </Section> </Paper>