<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2151"> <Title>Handling Sparse Data by Successive Abstraction</Title> <Section position="4" start_page="0" end_page="896" type="metho"> <SectionTitle> 2 Linear Successive Abstraction </SectionTitle> <Paragraph position="0"> Assume that we want to estimate the conditional probability P(x | C) of the outcome x given a context C from the number of times Nx it occurs in N = |C| trials, but that this data is sparse.</Paragraph> <Paragraph position="1"> Assume further that there is abundant data in a more general context C' ⊃ C that we want to use to get a better estimate of P(x | C). The idea is to let the probability estimate P̂(x | C) in context C be a function g of the relative frequency f(x | C) of the outcome x in context C and the probability estimate P̂(x | C') in context C': P̂(x | C) = g(f(x | C), P̂(x | C')).</Paragraph> <Paragraph position="3"> Let us generalize this scenario slightly to the situation where we have a sequence of increasingly more general contexts Cm ⊂ Cm-1 ⊂ ... ⊂ C1, i.e., where there is a linear order of the various contexts Ck. We can then build the estimate of P(x | Ck) on the relative frequency f(x | Ck) in context Ck and the previously established estimate of P(x | Ck-1). We call this method linear successive abstraction. A simple example is estimating the probability P(x | ln-j+1, ..., ln) of word class x given ln-j+1, ..., ln, the last j letters of a word l1, ..., ln. In this case, the estimate will be based on the relative frequencies f(x | ln-j+1, ..., ln), ..., f(x | ln), f(x).</Paragraph> <Paragraph position="4"> We will here consider the special case when the function g is a weighted sum of the relative frequency and the previous estimate, appropriately renormalized: P̂(x | C) = (f(x | C) + θ · P̂(x | C')) / (1 + θ).</Paragraph> <Paragraph position="6"> We want the weight θ to depend on the context Ck, and in particular to be proportional to some measure of how spread out the relative frequencies of the various outcomes in context Ck are from the statistical mean.
The variance is the quadratic moment w.r.t. the mean, and is thus such a measure. However, we want the weight to have the same dimension as the statistical mean, and the dimension of the variance is obviously the square of the dimension of the mean. The square root of the variance, which is the standard deviation, should thus be a suitable quantity. For this reason we will use the standard deviation in Ck as the weight, i.e., θ = σ(Ck). One could of course multiply this quantity by any reasonable real constant, but we will arbitrarily set this constant to one, i.e., use σ(Ck) itself.</Paragraph> <Paragraph position="7"> In linguistic applications, the outcomes are usually not real numbers, but pieces of linguistic structure such as words, part-of-speech tags, grammar rules, bits of semantic tissue, etc. This means that it is not quite obvious what the standard deviation, or the statistical mean for that matter, actually should be. To put it a bit more abstractly, we need to calculate the standard deviation of a non-numerical random variable.</Paragraph> <Section position="1" start_page="895" end_page="895" type="sub_section"> <SectionTitle> 2.1 Deriving the Standard Deviation </SectionTitle> <Paragraph position="0"> So how do we find the standard deviation of a non-numerical random variable? One way is to construct an equivalent numerical random variable and use the standard deviation of the latter. This can be done in several different ways. The one we will use is to construct a numerical random variable with a uniform distribution that has the same entropy as the non-numerical one. Whether we use a discrete or a continuous random variable is, as we shall see, of no importance.</Paragraph> <Paragraph position="1"> We will first factor out the dependence on the context size.
Quite in general, if ξ̄N is the sample mean of N independent observations of any numerical random variable ξ with variance σ0², i.e., ξ̄N = (1/N) Σi ξi, then the variance of ξ̄N is σ0²/N.</Paragraph> <Paragraph position="3"> In our case, the number of observations N is simply the size of the context Ck, by which we mean the number of times Ck occurred in the training data, i.e., the frequency count of Ck, which we will denote |Ck|. Since the standard deviation is the square root of the variance, we have σ(Ck) = σ0(Ck) / √|Ck|.</Paragraph> <Paragraph position="5"> Here σ0 does not depend on the number of observations in context Ck, only on the underlying probability distribution conditional on context Ck.</Paragraph> <Paragraph position="6"> To estimate σ0(Ck), we assume that we have either a discrete uniform distribution on {1, ..., M} or a continuous uniform distribution on [0, M] that is as hard to predict as the one in Ck in the sense that the entropy is the same. The entropy H[ξ] of a random variable ξ is the expectation value of −ln P(ξ). In the discrete case we thus have H[ξ] = −Σi P(xi) ln P(xi).</Paragraph> <Paragraph position="8"> Here P(xi) is the probability of the random variable ξ taking the value xi, which is 1/M for all possible outcomes xi and zero otherwise. Thus, the entropy is ln M: H[ξ] = −Σi (1/M) ln (1/M) = ln M.</Paragraph> <Paragraph position="10"> The continuous case is similar. We thus have that ln M = H[Ck], or M = e^H[Ck]. The variance of these uniform distributions is M²/12 in the continuous case and (M² − 1)/12 in the discrete case. We thus have σ0(Ck) ≈ e^H[Ck] / √12.</Paragraph> <Paragraph position="12"> Unfortunately, the entropy H[Ck] depends on the probability distribution of context Ck and thus on σ0(Ck).
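As a concrete illustration of this derivation, the following Python sketch (function names are ours, not the paper's) computes σ0 as the standard deviation of the entropy-equivalent continuous uniform distribution and scales it by the context size:

```python
import math

def sigma0(probs):
    """Standard deviation of the entropy-equivalent continuous uniform
    distribution: H = -sum p ln p, M = e^H, variance M^2 / 12."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    m = math.exp(h)                # width of the equivalent uniform on [0, M]
    return m / math.sqrt(12.0)     # sqrt(M^2 / 12)

def sigma(probs, count):
    """sigma(Ck) = sigma0(Ck) / sqrt(|Ck|), factoring out the context size."""
    return sigma0(probs) / math.sqrt(count)
```

For a uniform distribution over four outcomes, H = ln 4, so M = 4 and σ0 = 4/√12 ≈ 1.155; a context observed 16 times then gets a quarter of that weight.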
Since we want to avoid trying to solve highly nonlinear equations, and since we have access to an estimate of the probability distribution of context Ck-1, we will make the following approximation: σ(Ck) ≈ σ0(Ck-1) / √|Ck|. It is starting to look sensible to specify σ⁻¹ instead of σ, i.e., instead of σ/(1 + σ) we will write 1/(σ⁻¹ + 1).</Paragraph> </Section> <Section position="2" start_page="895" end_page="896" type="sub_section"> <SectionTitle> 2.2 The Final Recurrence Formula </SectionTitle> <Paragraph position="0"> We have thus established a recurrence formula for the estimate of the probability distribution in context Ck given the estimate of the probability distribution in context Ck-1 and the relative frequencies in context Ck: P̂(x | Ck) = (f(x | Ck) + σ(Ck) · P̂(x | Ck-1)) / (1 + σ(Ck)). (1)</Paragraph> <Paragraph position="2"> We will start by estimating the probability distribution in the most general context C1, if necessary directly from the relative frequencies. Since this is the most general context, this will be the context with the most training data. Thus it stands the best chance of the relative frequencies being acceptably accurate estimates. This will allow us to calculate an estimate of the probability distribution in context C2, which in turn will allow us to calculate an estimate of the probability distribution in context C3, etc. We can thus calculate estimates of the probability distributions in all contexts C1, ..., Cm.</Paragraph> <Paragraph position="3"> We will next consider some examples from part-of-speech tagging.</Paragraph> </Section> </Section> <Section position="5" start_page="896" end_page="897" type="metho"> <SectionTitle> 3 Examples from PoS Tagging </SectionTitle> <Paragraph position="0"> Part-of-speech (PoS) tagging consists in assigning to each word of an input text a (set of) tag(s) from a finite set of possible tags, a tag palette or a tag set. The reason that this is a research issue is that a word can in general be assigned different tags depending on context.
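Before turning to the tagging examples, the recurrence of Section 2.2 can be sketched as a single blending step in Python; the names are illustrative and the guard for an empty context is our addition:

```python
import math

def abstraction_step(freqs, prior, sigma_k):
    """One step of linear successive abstraction:
    P(x|Ck) = (f(x|Ck) + sigma_k * P(x|Ck-1)) / (1 + sigma_k).

    freqs:   outcome -> count in context Ck
    prior:   outcome -> estimated probability in the more general Ck-1
    sigma_k: the weight sigma(Ck)
    """
    n = sum(freqs.values())  # |Ck|
    if n == 0:  # no observations at all in Ck: fall back to the prior
        return dict(prior)
    return {x: (freqs.get(x, 0) / n + sigma_k * prior[x]) / (1.0 + sigma_k)
            for x in prior}
```

With a uniform prior over M = 4 outcomes, one single observation, and σ = 4/√12, this reproduces the probability (√12 + 1)/(√12 + M) for the observed outcome discussed in the comparison with ELE below.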
In statistical tagging, the relevant information is extracted from a training text and fitted into a statistical language model, which is then used to assign the most likely tag to each word in the input text.</Paragraph> <Paragraph position="1"> The statistical language model usually consists of lexical probabilities, which determine the probability of a particular tag conditional on the particular word, and contextual probabilities, which determine the probability of a particular tag conditional on the surrounding tags. The latter conditioning is usually on the tags of the neighbouring words, and very often on the N - 1 previous tags, so-called (tag) N-gram statistics. These probabilities can be estimated either from a pretagged training corpus or from untagged text, a lexicon and an initial bias. We will here consider the former case.</Paragraph> <Paragraph position="2"> Statistical taggers usually work as follows: First, each word in the input word string W1, ..., Wn is assigned all possible tags according to the lexicon, thereby creating a lattice. A dynamic programming technique is then used to find the tag sequence T1, ..., Tn that maximizes P(T1, ..., Tn | W1, ..., Wn).</Paragraph> <Paragraph position="4"> Since the maximum does not depend on the factors P(Wk), these can be omitted, yielding the standard statistical PoS tagging task: max over T1, ..., Tn of ∏ k=1..n P(Tk | Tk-N+1, ..., Tk-1) · P(Wk | Tk). This is well described in for example (DeRose 1988).</Paragraph> <Paragraph position="5"> We thus have to estimate the two following sets of probabilities: * Lexical probabilities: the probability of each tag T^i conditional on the word W that is to be tagged, P(T^i | W). Often the converse probabilities P(W | T^i) are given instead, but we will for reasons soon to become apparent use the former formulation.
* Tag N-grams: the probability of tag T^i at position k in the input string, denoted Tk^i, given that tags Tk-N+1, ..., Tk-1 have been assigned to the previous N - 1 words. Often N is set to two or three, and thus bigrams or trigrams are employed. When using trigram statistics, this quantity is P(Tk^i | Tk-2, Tk-1).</Paragraph> <Section position="1" start_page="896" end_page="896" type="sub_section"> <SectionTitle> 3.1 N-gram Back-off Smoothing </SectionTitle> <Paragraph position="0"> We will first consider estimating the N-gram probabilities P(Tk^i | Tk-N+1, ..., Tk-1). Here, there is an obvious sequence of generalizations of the context Tk-N+1, ..., Tk-1 with a linear order, namely Tk-N+2, ..., Tk-1; then Tk-N+3, ..., Tk-1; and so on down to the empty context,</Paragraph> <Paragraph position="2"> the last corresponding to the unigram probabilities. Thus we will repeatedly strip off the tag furthest from the current word and use the estimate of the probability distribution in this generalized context to improve the estimate in the current context. This means that when estimating the (j + 1)-gram probabilities, we back off to the estimate of the j-gram probabilities.</Paragraph> <Paragraph position="3"> So when estimating P(Tk^i | Tk-j, ..., Tk-1), we simply strip off the tag Tk-j and apply Eq. (1):</Paragraph> <Paragraph position="5"/> </Section> <Section position="2" start_page="896" end_page="897" type="sub_section"> <SectionTitle> 3.2 Handling Unknown Words </SectionTitle> <Paragraph position="0"> We will next consider improving the probability estimates for unknown words, i.e., words that do not occur in the training corpus, and for which we therefore have no lexical probabilities. The same technique could actually be used for improving the estimates of the lexical probabilities of words that do occur in the training corpus. The basic idea is that there is a substantial amount of information in the word suffixes, especially for languages with a richer morphological structure than English.
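The back-off chain of Section 3.1 might be implemented along the following lines. This is a sketch, not the paper's code: all names are illustrative, and σ0 is taken from the more general context's estimated distribution, following the approximation of Section 2.1:

```python
import math
from collections import Counter

def ngram_estimates(tag_seqs, n):
    """Successive-abstraction estimates of P(t | t_{k-j}, ..., t_{k-1}), j < n.

    tag_seqs: iterable of tag sequences (the training material).
    Returns a dict mapping each context tuple to a tag -> probability dict.
    """
    counts = Counter()
    for seq in tag_seqs:
        for k, t in enumerate(seq):
            # Count t after every context suffix of length 0 .. n-1.
            for j in range(min(k, n - 1) + 1):
                counts[(tuple(seq[k - j:k]), t)] += 1
    contexts = {}
    for (ctx, t), c in counts.items():
        contexts.setdefault(ctx, Counter())[t] += c
    est = {}
    for length in range(n):  # most general contexts first
        for ctx, freqs in contexts.items():
            if len(ctx) != length:
                continue
            total = sum(freqs.values())
            if length == 0:
                # Unigram level: plain relative frequencies.
                est[ctx] = {t: c / total for t, c in freqs.items()}
                continue
            prior = est[ctx[1:]]  # strip the tag furthest from the current word
            # sigma(Ck) ~ sigma0(Ck-1) / sqrt(|Ck|), sigma0 = e^H / sqrt(12)
            h = -sum(p * math.log(p) for p in prior.values() if p > 0.0)
            s = (math.exp(h) / math.sqrt(12.0)) / math.sqrt(total)
            est[ctx] = {t: (freqs.get(t, 0) / total + s * prior.get(t, 0.0))
                           / (1.0 + s)
                        for t in prior}
    return est
```

Processing the shorter contexts first guarantees that the prior needed at each level has already been computed, exactly as the ordering argument in Section 4 requires.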
For this reason, we will estimate the probability distribution conditional on an unknown word from the statistical data available for words that end with the same sequence of letters. Assume that the word consists of the letters l1, ..., ln. We want to know the probabilities P(T^i | l1, ..., ln) for the various tags T^i.1 Since the word is unknown, this data is not available. However, if we look at the sequence of generalizations of &quot;ending with the same last j letters&quot;, here denoted ln-j+1, ..., ln, we realize that sooner or later, there will be observations available, in the worst case looking at the last zero letters, i.e., at the unigram probabilities.</Paragraph> <Paragraph position="1"> So when estimating P(T^i | ln-j+1, ..., ln), we simply omit the jth last letter ln-j+1 and apply Eq. (1):</Paragraph> <Paragraph position="3"> This data can be collected from the words in the training corpus with frequencies below some threshold, e.g., words that occur less than say ten times, and can be indexed in a tree on reversed suffixes for quick access.</Paragraph> </Section> </Section> <Section position="6" start_page="897" end_page="897" type="metho"> <SectionTitle> 4 Partial Successive Abstraction </SectionTitle> <Paragraph position="0"> If there is only a partial order of the various generalizations, the scheme is still viable. For example, consider generalizing symmetric trigram statistics, i.e., statistics of the form P(T | Tl, Tr). Here, both Tl, the tag of the word to the left, and Tr, the tag of the word to the right, are one-step generalizations of the context Tl, Tr, and both have in turn the common generalization &quot;no information&quot;.</Paragraph> <Paragraph position="1"> We modify Eq. (1) accordingly:</Paragraph> <Paragraph position="3"> 1 [...] symbol indicating the beginning of the word.</Paragraph> <Paragraph position="4"> We call this partial successive abstraction.
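A sketch of how partial successive abstraction might combine several one-step generalizations in Python. Since the paper's general equation is not reproduced in this extract, the combination rule below, in which each generalization contributes a σi-weighted term and the normalizer is 1 + Σσi, is our assumption, as are all names:

```python
def partial_abstraction(freqs, priors_and_sigmas):
    """Blend the counts in context C with the estimates from all of its M
    one-step generalizations C'_1, ..., C'_M.

    freqs: outcome -> count in context C.
    priors_and_sigmas: list of (prior_estimate, sigma_i) pairs, one per C'_i.
    Assumed rule: P(x|C) = (f(x|C) + sum_i s_i P(x|C'_i)) / (1 + sum_i s_i).
    """
    n = sum(freqs.values())
    s_total = sum(s for _, s in priors_and_sigmas)
    outcomes = set().union(*(p.keys() for p, _ in priors_and_sigmas))
    return {x: (freqs.get(x, 0) / n
                + sum(s * p.get(x, 0.0) for p, s in priors_and_sigmas))
               / (1.0 + s_total)
            for x in outcomes}
```

For the symmetric trigram case, the two priors would be the estimates conditional on Tl alone and on Tr alone; as each prior sums to one, the blend again sums to one.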
Since we really want to estimate σ in the more specific context, and since the standard deviation (with the dependence on context size factored out) will most likely not increase when we specialize the context, we will use:</Paragraph> <Paragraph position="6"> In the general case, where we have M one-step generalizations C'i of C, we arrive at the equation</Paragraph> <Paragraph position="8"> By calculating the estimates of the probability distributions in such an order that, whenever estimating the probability distribution in some particular context, the probability distributions in all more general contexts have already been estimated, we can guarantee that all quantities necessary for the calculations are available.</Paragraph> </Section> <Section position="7" start_page="897" end_page="898" type="metho"> <SectionTitle> 5 Relationship to Other Methods </SectionTitle> <Paragraph position="0"> We will next compare the proposed method to, in turn, deleted interpolation, expected likelihood estimation and Katz's back-off scheme.</Paragraph> <Section position="1" start_page="897" end_page="898" type="sub_section"> <SectionTitle> 5.1 Deleted Interpolation </SectionTitle> <Paragraph position="0"> Interpolation requires that the training corpus is divided into one part used to estimate the relative frequencies, and a separate held-back part used to cope with sparse data through back-off smoothing. For example, tag trigram probabilities can be estimated as follows: P̂(Tk | Tk-2, Tk-1) = λ1 f(Tk) + λ2 f(Tk | Tk-1) + λ3 f(Tk | Tk-2, Tk-1).</Paragraph> <Paragraph position="2"> Since the probability estimate is a linear combination of the various observed relative frequencies, this is called linear interpolation. The weights λj may depend on the conditionings, but are required to be nonnegative and to sum to one over j.
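For comparison, the linear interpolation of Section 5.1 can be sketched as follows; the fixed weights and the toy relative frequencies in the usage below are purely illustrative:

```python
def interpolate_trigram(f_uni, f_bi, f_tri, lambdas=(0.1, 0.3, 0.6)):
    """P(t3 | t1, t2) = l1*f(t3) + l2*f(t3|t2) + l3*f(t3|t1,t2).

    f_uni: t -> f(t);  f_bi: (t2, t3) -> f(t3|t2);
    f_tri: (t1, t2, t3) -> f(t3|t1,t2);  lambdas: nonnegative, sum to one.
    """
    l1, l2, l3 = lambdas
    def p(t3, t1, t2):
        return (l1 * f_uni.get(t3, 0.0)
                + l2 * f_bi.get((t2, t3), 0.0)
                + l3 * f_tri.get((t1, t2, t3), 0.0))
    return p
```

Unlike successive abstraction, the weights here are constants to be tuned on held-out data rather than quantities derived from the training counts themselves.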
An enhancement is to partition the training set into n parts and in turn perform linear interpolation with each of the n parts held out to determine the back-off weights, using the remaining n - 1 parts for parameter estimation. The various back-off weights are combined in the process. This is usually referred to as deleted interpolation.</Paragraph> <Paragraph position="4"> The weights λj are determined by maximizing the probability of the held-out part of the training data, see (Jelinek & Mercer 1980). A locally optimal weight setting can be found using Baum-Welch reestimation, see (Baum 1972). Baum-Welch reestimation is however prohibitively time-consuming for complex contexts if the weights are allowed to depend on the contexts, while successive abstraction is clearly tractable; the latter effectively determines these weights directly from the same data as the relative frequencies.</Paragraph> </Section> <Section position="2" start_page="898" end_page="898" type="sub_section"> <SectionTitle> 5.2 Expected Likelihood Estimation </SectionTitle> <Paragraph position="0"> Expected likelihood estimation (ELE) consists in assigning an extra half count to all outcomes.</Paragraph> <Paragraph position="1"> Thus, an outcome that didn't occur in the training data receives half a count, and an outcome that occurred once receives three half counts. This is equivalent to assigning a count of one to the occurring, and one third to the non-occurring, outcomes. To give an indication of how successive abstraction is related to ELE, consider the following special case: If we indeed have a uniform distribution with M outcomes of probability 1/M in context Ck-1 and there is but one observation of one single outcome in context Ck, then Eq. (1) will assign to this outcome the probability (√12 + 1)/(√12 + M), and 1/(√12 + M) to each of the other, non-occurring, outcomes. So if we had used 2 instead of √12 in Eq.
(1), this would have been equivalent to assigning a count of one to the outcome that occurred, and a count of one third to the ones that didn't. As it is, the latter outcomes are assigned a count of 1/(√12 + 1).</Paragraph> </Section> <Section position="3" start_page="898" end_page="898" type="sub_section"> <SectionTitle> 5.3 Katz's Back-Off Scheme </SectionTitle> <Paragraph position="0"> The proposed method is identical to Katz's back-off method (Katz 1987) up to the point of suggesting a, in the general case non-linear, retreat to more general contexts: f(x | C) is used when it is non-zero, and otherwise the estimate backs off to P̂(x | C').</Paragraph> <Paragraph position="2"> Blending the involved distributions f(x | C) and P̂(x | C'), rather than only backing off to C' if f(x | C) is zero, and in particular instantiating the function g(f, P̂) as a weighted sum, distinguishes the two approaches.</Paragraph> </Section> </Section> </Paper>