<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1075"> <Title>A Nonparametric Method for Extraction of Candidate Phrasal Terms</Title> <Section position="3" start_page="605" end_page="607" type="metho"> <SectionTitle> 2 Statistical considerations </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="605" end_page="605" type="sub_section"> <SectionTitle> 2.1 Highly skewed distributions </SectionTitle> <Paragraph position="0"> As first observed by Zipf (1935, 1949), the frequencies of words and other linguistic units tend to follow highly skewed distributions in which there are a large number of rare events. Zipf's formulation of this relationship for single-word frequency distributions (Zipf's first law) postulates that the frequency of a word is inversely proportional to its rank in the frequency distribution. More generally, if we rank words by frequency and assign rank z, where the function fz(z, N) gives the frequency of rank z for a sample of size N, Zipf's first law states that:</Paragraph> <Paragraph position="1"> fz(z, N) = C / z^α </Paragraph> <Paragraph position="2"> where C is a normalizing constant and α is a free parameter that determines the exact degree of skew; typically, with single-word frequency data, α approximates 1 (Baayen 2001: 14). Ideally, an association metric would be designed to maximize its statistical validity with respect to the distribution which underlies natural language text -- which is, if not a pure Zipfian distribution, at least an LNRE (large number of rare events, cf. Baayen 2001) distribution with a very long tail, containing events which differ in probability by many orders of magnitude. Unfortunately, research on LNRE distributions focuses primarily on unigram distributions, and generalizations to bigram and n-gram distributions on large corpora are not as yet clearly feasible (Baayen 2001: 221). Yet many of the best-performing lexical association measures, such as the t-test, assume normal distributions (cf. Dunning 1993), or else (as with mutual information) eschew significance testing in favor of a generic information-theoretic approach.</Paragraph> <Paragraph position="3"> Various strategies could be adopted in this situation: finding a better model of the distribution, or adopting a nonparametric method.</Paragraph>
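As a rough illustration of this relationship (not part of the original study), the skew parameter α can be estimated from any observed frequency list by fitting log frequency against log rank. The Python sketch below assumes only a list of raw counts; the function name and the use of a plain least-squares fit are choices made here for exposition.

    # Minimal sketch: estimate the Zipfian skew parameter (alpha) from a word
    # frequency list by fitting log f(z) = log C - alpha * log z.
    # Assumes `frequencies` holds at least two positive counts; names are illustrative.
    import math

    def estimate_zipf_alpha(frequencies):
        # Rank frequencies from most to least frequent (rank 1 = most frequent).
        freqs = sorted(frequencies, reverse=True)
        xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
        ys = [math.log(f) for f in freqs]
        # Ordinary least-squares slope of log-frequency against log-rank.
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        var = sum((x - mean_x) ** 2 for x in xs)
        return -(cov / var)  # alpha is the negated slope; near 1 for typical unigram data

    # Example with a toy frequency list:
    # print(estimate_zipf_alpha([1000, 500, 333, 250, 200, 100, 50, 10, 5, 1]))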
</Section> <Section position="2" start_page="605" end_page="606" type="sub_section"> <SectionTitle> 2.2 The independence assumption </SectionTitle> <Paragraph position="0"> Even more importantly, many of the standard lexical association measures assess significance (or information content) against the default assumption that word choices are statistically independent events. This assumption is built into the highest-performing measures, as observed in Evert & Krenn (2001), Krenn & Evert (2001) and Schone & Jurafsky (2001).</Paragraph> <Paragraph position="1"> This is of course untrue, and justifiable only as a simplifying idealization in the absence of a better model. The actual probability of any sequence of words is strongly influenced by the base grammatical and semantic structure of language, particularly since phrasal terms usually conform to the normal rules of linguistic structure. What makes a compound noun, or a verb-particle construction, into a phrasal term is not deviation from the base grammatical pattern for noun-noun or verb-particle structures, but rather a further pattern (of meaning and usage and thus heightened frequency) superimposed on the normal linguistic base. There are, of course, entirely aberrant phrasal terms, but they constitute the exception rather than the rule.</Paragraph> <Paragraph position="2"> This state of affairs poses something of a chicken-and-egg problem: statistical parsing models have to estimate probabilities from the same base data as the lexical association measures, so the usual heuristic solution, as noted above, is to impose a linguistic filter on the data, with the association measures being applied only to the subset thus selected. The result is in effect a constrained statistical model in which the independence assumption is much more accurate.</Paragraph> <Paragraph position="3"> For instance, if the universe of statistical possibilities is restricted to the set of sequences in which an adjective is followed by a noun, the null hypothesis that word choice is independent -- i.e., that any adjective may precede any noun -- is a reasonable idealization. Without filtering, the independence assumption yields the much less plausible null hypothesis that any word may appear in any order.</Paragraph> <Paragraph position="4"> It is thus worth considering whether there are any ways to bring additional information to bear on the problem of recognizing phrasal terms without presupposing statistical independence.</Paragraph> </Section> <Section position="3" start_page="606" end_page="606" type="sub_section"> <SectionTitle> 2.3 Variable length; alternative/overlapping phrases </SectionTitle> <Paragraph position="0"> Phrasal terms vary in length. Typically they range from about two to six words, but critically we cannot judge whether a phrase is lexical without considering both shorter and longer sequences.</Paragraph> <Paragraph position="1"> That is, the statistical comparison that needs to be made must apply in principle to the entire set of word sequences that must be distinguished from phrasal terms, including longer sequences, subsequences, and overlapping sequences, despite the fact that these are not statistically independent events. Of the association metrics mentioned thus far, only the C-Value method attempts to take direct notice of such word-sequence information, and then only as a modification to the basic information provided by frequency.</Paragraph> <Paragraph position="2"> Any solution to the problem of variable length must enable normalization allowing direct comparison of phrases of different lengths. Ideally, the solution would also address the other issues -- the independence assumption and the skewed distributions typical of natural language data.</Paragraph> </Section> <Section position="4" start_page="606" end_page="607" type="sub_section"> <SectionTitle> 2.4 Mutual expectation </SectionTitle> <Paragraph position="0"> An interesting proposal which seeks to overcome the variable-length issue is the mutual expectation metric presented in Dias, Guillore, and Lopes (1999) and implemented in the SENTA system (Gil and Dias 2003a). In their approach, the frequency of a phrase is normalized by taking into account the relative probability of each word compared to the phrase.</Paragraph> <Paragraph position="1"> Dias, Guillore, and Lopes take as the foundation of their approach the idea that the cohesiveness of a text unit can be measured by measuring how strongly it resists the loss of any component term. This is implemented by considering, for any n-gram, the set of [continuous or discontinuous] (n-1)-grams which can be formed by deleting one word from the n-gram. A normalized expectation for the n-gram is then calculated as follows:</Paragraph> <Paragraph position="2"> NE(w1 ... wn) = p(w1 ... wn) / [ (1/n) * Σ_{i=1..n} p(w1 ... wi-1 __ wi+1 ... wn) ] </Paragraph> <Paragraph position="3"> where wi is the term omitted from the n-gram.</Paragraph> <Paragraph position="4"> They then calculate mutual expectation as the product of the probability of the n-gram and its normalized expectation.</Paragraph>
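As an illustration of the calculation just described, the following sketch computes normalized and mutual expectation from a table of n-gram counts. The count dictionary keyed by token tuples, the single normalizing total, and the use of "_" to mark the omitted position are simplifying assumptions made here for exposition; they are not details of the SENTA implementation.

    # Sketch of normalized expectation (NE) and mutual expectation (ME), following
    # the description above. `counts` is assumed to map token tuples (with "_"
    # marking an omitted position) to corpus frequencies, and `total` is a single
    # normalizing count; both are illustrative assumptions.

    def normalized_expectation(ngram, counts, total):
        n = len(ngram)
        p_ngram = counts.get(tuple(ngram), 0) / total
        # Average probability of the (n-1)-grams formed by omitting each word in turn.
        omitted = []
        for i in range(n):
            context = tuple(ngram[:i]) + ("_",) + tuple(ngram[i + 1:])
            omitted.append(counts.get(context, 0) / total)
        denom = sum(omitted) / n
        return p_ngram / denom if denom > 0 else 0.0

    def mutual_expectation(ngram, counts, total):
        # ME = probability of the n-gram times its normalized expectation.
        p_ngram = counts.get(tuple(ngram), 0) / total
        return p_ngram * normalized_expectation(ngram, counts, total)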
<Paragraph position="5"> This statistic is of interest for two reasons: first, it provides a single statistic that can be applied to n-grams of any length; second, it is not based upon the independence assumption. The core statistic, normalized expectation, is essentially frequency with a penalty if a phrase contains component parts significantly more frequent than the phrase itself.</Paragraph> <Paragraph position="6"> It is of course an empirical question how well mutual expectation performs (and we shall examine this below), but mutual expectation is not in any sense a significance test. That is, if we are examining a phrase like the east end, the conditional probability of east given [__ end] or of end given [east __] may be relatively low (since other words can appear in those contexts), and yet the phrase might still be highly lexicalized if the association of both words with this context were significantly stronger than their association with other contexts. That is, to the extent that phrasal terms follow the regular patterns of the language, a phrase might have a relatively low conditional probability (given the wide range of alternative phrases following the same basic linguistic patterns) and thus have a low mutual expectation, yet still occur far more often than one would expect by chance.</Paragraph> <Paragraph position="7"> In short, the fundamental insight -- assessing how tightly each word is bound to a phrase -- is worth adopting. There is, however, good reason to suspect that one could improve on this method by assessing relative statistical significance for each component word without making the independence assumption. In the heuristic to be outlined below, a nonparametric method is proposed. This method is novel: not a modification of mutual expectation, but a new technique based on ranks in a Zipfian frequency distribution.</Paragraph> </Section> <Section position="5" start_page="607" end_page="607" type="sub_section"> <SectionTitle> 2.5 Rank ratios and mutual rank ratios </SectionTitle> <Paragraph position="0"> This technique can be justified as follows. For each component word in the n-gram, we want to know whether the n-gram is more probable for that word than we would expect given its behavior with other words. Since we do not know what the expected shape of this distribution is going to be, a nonparametric method using ranks is in order, and there is some reason to think that frequency rank regardless of n-gram size will be useful. In particular, Ha, Sicilia-Garcia, Ming and Smith (2002) show that Zipf's law can be extended to the combined frequency distribution of n-grams of varying length up to rank 6, which entails that the relative rank of words in such a combined distribution provides a useful estimate of relative probability. The availability of new techniques for handling large sets of n-gram data (e.g. Gil & Dias 2003b) makes this a relatively feasible task.</Paragraph> <Paragraph position="1"> Thus, given a phrase like east end, we can rank how often __ end appears with east in comparison to how often other phrases appear with east. That is, if {__ end, __ side, the __, toward the __, etc.} is the set of (variable-length) n-gram contexts associated with east (up to a length cutoff), then the actual rank of __ end is the rank we calculate by ordering all contexts by the frequency with which the actual word appears in the context.</Paragraph> <Paragraph position="2"> We also rank the set of contexts associated with east by their overall corpus frequency. The resulting ranking is the expected rank of __ end based upon how often the competing contexts appear regardless of which word fills the context.</Paragraph> <Paragraph position="3"> The rank ratio (RR) for the word given the context can then be defined as:</Paragraph> <Paragraph position="4"> RR(w, c) = ER(w, c) / AR(w, c) </Paragraph> <Paragraph position="5"> where ER is the expected rank and AR is the actual rank. A normalized, or mutual, rank ratio for the n-gram can then be defined as MRR(w1 ... wn) = [ RR(w1, [__ w2 ... wn]) * RR(w2, [w1 __ ... wn]) * ... * RR(wn, [w1 w2 ... __]) ]^(1/n), that is, the nth root of the product of the rank ratios of each component word given its context within the n-gram. The motivation for this method is that it attempts to address each of the major issues outlined above by providing a nonparametric metric which does not make the independence assumption and allows scores to be compared across n-grams of different lengths.</Paragraph>
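A minimal sketch of this computation is given below, following the ranking conventions spelled out in the next paragraph (median ranks for ties, unconditional context frequencies for expected ranks). The data structures (a mapping from each word to the frequencies of its contexts, and a mapping from each context to its overall frequency) are illustrative assumptions, not the authors' implementation.

    # Sketch of rank ratios and mutual rank ratios as described above.
    # `word_context_freq[w][c]` is assumed to give the frequency of word w in
    # context c (a token tuple with "_" marking w's slot), and `context_freq[c]`
    # the unconditional frequency of context c with any filler word.

    def ranks_with_median_ties(scores):
        """Map each item to its rank (1 = highest score); tied items share the
        median of the ranks they occupy (e.g., a tie over ranks 2-3 gives 2.5)."""
        ordered = sorted(scores.items(), key=lambda kv: -kv[1])
        ranks, i = {}, 0
        while i < len(ordered):
            j = i
            while j < len(ordered) and ordered[j][1] == ordered[i][1]:
                j += 1
            median_rank = (i + 1 + j) / 2.0  # median of ranks i+1 .. j
            for k in range(i, j):
                ranks[ordered[k][0]] = median_rank
            i = j
        return ranks

    def rank_ratio(word, context, word_context_freq, context_freq):
        contexts = word_context_freq[word]                     # contexts seen with `word`
        actual = ranks_with_median_ties(contexts)              # ranked by frequency with `word`
        expected = ranks_with_median_ties(
            {c: context_freq[c] for c in contexts})            # ranked by overall frequency
        return expected[context] / actual[context]

    def mutual_rank_ratio(ngram, word_context_freq, context_freq):
        # Geometric mean (nth root of the product) of the per-word rank ratios.
        product, n = 1.0, len(ngram)
        for i, w in enumerate(ngram):
            context = tuple(ngram[:i]) + ("_",) + tuple(ngram[i + 1:])
            product *= rank_ratio(w, context, word_context_freq, context_freq)
        return product ** (1.0 / n)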
<Paragraph position="6"> A few notes about the details of the method are in order. Actual ranks are assigned by listing all the contexts associated with each word in the corpus, and then ranking contexts by word, assigning the most frequent context for a given word rank 1, the next most frequent rank 2, and so on. Tied ranks are given the median value for the ranks occupied by the tie; e.g., if two contexts with the same frequency would occupy ranks 2 and 3, they are both assigned rank 2.5. Expected ranks are calculated for the same set of contexts using the same algorithm, but substituting the unconditional frequency of the (n-1)-gram context for its frequency with the target word. [Footnote 1: In this study the rank-ratio method was tested for bigrams and trigrams only, due to the small number of WordNet gold-standard items greater than two words in length. Work in progress will assess the metrics' performance on n-grams of orders four through six.]</Paragraph> </Section> </Section> <Section position="4" start_page="607" end_page="609" type="metho"> <SectionTitle> 3 Data sources and methodology </SectionTitle> <Paragraph position="0"> The Lexile Corpus is a collection of documents covering a wide range of reading materials such as a child might encounter at school, more or less evenly divided by Lexile (reading level) rating to cover all levels of textual complexity from kindergarten to college. It contains in excess of 400 million words of running text, and has been made available to the Educational Testing Service under a research license by Metametrics Corporation.</Paragraph> <Paragraph position="1"> This corpus was tokenized using an in-house tokenization program, toksent, which treats most punctuation marks as separate tokens but makes single tokens out of common abbreviations, numbers like 1,500, and words like o'clock. It should be noted that some of the association measures are known to perform poorly if punctuation marks and common stopwords are included; therefore, n-gram sequences containing punctuation marks and the 160 most frequent word forms were excluded from the analysis so as not to bias the results against them. [Footnote 2: Excluding the 160 most frequent words prevented evaluation of a subset of phrasal terms such as verbal idioms like act up or go on. Experiments with smaller corpora during preliminary work indicated that this exclusion did not appear to bias the results.] Separate lists of bigrams and trigrams were extracted and ranked according to several standard word association metrics. Rank ratios were calculated from a comparison set consisting of all contexts derived by this method from bigrams and trigrams, e.g., contexts of the form word1 __, __ word2, and the analogous trigram contexts.</Paragraph>
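A sketch of how such a comparison set might be accumulated from a tokenized corpus appears below; its output feeds directly into the rank-ratio sketch given in section 2.5. The sentence iterator, the stopword set standing in for the 160 most frequent word forms, and the crude isalnum() test standing in for toksent's punctuation handling are all assumptions made here for illustration.

    # Sketch: collect bigram and trigram contexts for the rank-ratio comparison set.
    # `sentences` is assumed to be an iterable of token lists; `stopwords` a set of
    # high-frequency forms to exclude. Data structures are illustrative.
    from collections import defaultdict

    def collect_contexts(sentences, stopwords, max_n=3):
        word_context_freq = defaultdict(lambda: defaultdict(int))
        context_freq = defaultdict(int)
        for tokens in sentences:
            for n in range(2, max_n + 1):
                for start in range(len(tokens) - n + 1):
                    gram = tuple(tokens[start:start + n])
                    # Skip n-grams containing stopwords or (crudely detected) punctuation.
                    if any(w in stopwords or not w.isalnum() for w in gram):
                        continue
                    for i, w in enumerate(gram):
                        context = gram[:i] + ("_",) + gram[i + 1:]
                        word_context_freq[w][context] += 1
                        context_freq[context] += 1
        return word_context_freq, context_freq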
<Paragraph position="3"> Table 1 lists the standard lexical association measures tested in section four. [Footnote 3: ... (Dias, Guillore, and Lopes, 1999) and the Z-Score (Smadja 1993). Thus it was not judged necessary to replicate results for all methods covered in Schone & Jurafsky (2001).] [Footnote 4: Due to the computational cost of calculating C-Values over a very large corpus, C-Values were calculated over bigrams and trigrams only. More sophisticated versions of the C-Value method such as NC-values were not included, as these incorporate linguistic knowledge and thus fall outside the scope of the study.]</Paragraph> <Paragraph position="4"> The logical evaluation method for phrasal term identification is to rank n-grams using each metric and then compare the results against a gold standard containing known phrasal terms. Since Schone and Jurafsky (2001) demonstrated similar results whether WordNet or online dictionaries were used as a gold standard, WordNet was selected. Two separate lists were derived, containing two- and three-word phrases. The choice of WordNet as a gold standard tests the ability to predict general dictionary headwords rather than technical terms, which is appropriate since the source corpus consists of nontechnical text.</Paragraph> <Paragraph position="5"> Following Schone & Jurafsky (2001), the bigram and trigram lists were ranked by each statistic and then scored against the gold standard, with results evaluated using a figure of merit (FOM) roughly characterizable as the area under the precision-recall curve. The formula is:</Paragraph> <Paragraph position="6"> FOM = (1/K) * (P1 + P2 + ... + PK) </Paragraph> <Paragraph position="7"> where Pi (precision at i) equals i/Hi, and Hi is the number of n-grams into the ranked n-gram list required to find the ith correct phrasal term.</Paragraph>
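The sketch below computes this figure of merit for a ranked candidate list scored against a gold-standard set. The variable names are illustrative, and it normalizes by the number of gold-standard terms actually found in the list; normalizing by the full size of the gold standard would additionally penalize terms that are never retrieved.

    # Sketch of the figure of merit: average of the precision values P_i = i / H_i
    # observed at the rank positions H_i where gold-standard phrases are found.
    # `ranked` is a best-first list of candidate n-grams, `gold` a set of known terms.

    def figure_of_merit(ranked, gold):
        precisions = []
        hits = 0
        for position, candidate in enumerate(ranked, start=1):
            if candidate in gold:
                hits += 1                          # hits = i, position = H_i
                precisions.append(hits / position) # P_i = i / H_i
        return sum(precisions) / len(precisions) if precisions else 0.0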
<Paragraph position="8"> It should be noted, however, that one of the most pressing issues with respect to phrasal terms is that they display the same skewed, long-tail distribution as ordinary words, with a large proportion of the total displaying very low frequencies. This can be measured by considering the overlap between WordNet and the Lexile corpus. 53,764 two-word phrases and 7,613 three-word phrases were extracted from WordNet. Even though the Lexile corpus is quite large -- in excess of 400 million words of running text -- only 19,939 of the two-word phrases and 1,700 of the three-word phrases are attested in the Lexile corpus. 14,045 of the 19,939 attested two-word phrases occur at least 5 times, 11,384 occur at least 10 times, and only 5,366 occur at least 50 times; in short, the strategy of cutting off the data at a frequency threshold sacrifices a large percentage of total recall. Thus one of the issues that needs to be addressed is the accuracy with which lexical association measures can be extended to deal with relatively sparse data, e.g., phrases that appear fewer than ten times in the source corpus.</Paragraph> <Paragraph position="13"> A second question of interest is the effect of filtering for particular linguistic patterns. This is another method of prescreening the source data which can improve precision but damage recall. In the evaluation, bigrams were classified as N-N and A-N sequences using a dictionary template, with the expected effect. For instance, if the WordNet two-word phrase list is limited to those phrases which could be interpreted as noun-noun or adjective-noun sequences occurring at least five times (N >= 5), the total set of WordNet terms that can be retrieved is reduced to 9,757.</Paragraph> </Section> </Paper>