<?xml version="1.0" standalone="yes"?> <Paper uid="W01-0513"> <Title>Is Knowledge-Free Induction of Multiword Unit Dictionary Headwords a Solved Problem?</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> Notation </SectionTitle> <Paragraph position="0"> f_X denotes the frequency of a word X. A variable XY indicates a word bigram, and f̂_XY indicates its expected frequency at random. An overbar signifies a variable's complement. For more details, one can consult the original sources as well as Ferreira and Pereira (1999) and Manning and Schutze (1999).</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Lexical Access </SectionTitle> <Paragraph position="0"> Prior to applying the algorithms, we lemmatize using a weakly-informed tokenizer that knows only that whitespace and punctuation separate words. Punctuation can either be discarded or treated as words. Since we are equally interested in finding units like &quot;Dr.&quot; and &quot;U. S.,&quot; we opt to treat punctuation as words.</Paragraph> <Paragraph position="1"> Once we tokenize, we use Church's (1995) suffix array approach to identify word n-grams that occur at least T times (T=10). We then rank-order the n-gram list in accordance with each probabilistic algorithm. This task is non-trivial since most algorithms were originally suited for finding two-word collocations. We must therefore decide how to expand the algorithms to identify general n-grams (say, C = w_1 w_2 ... w_n). We can either generalize or approximate. Since generalizing requires exponential compute time and memory for several of the algorithms, approximation is an attractive alternative.</Paragraph> <Paragraph position="2"> One approximation redefines X and Y to be, respectively, the word sequences w_1 w_2 ... w_i and w_{i+1} w_{i+2} ... w_n, where i is chosen to maximize P_X P_Y. This has a natural interpretation as the expected probability of concatenating the two most probable substrings in order to form the larger unit. Since it can be computed rapidly with low memory costs, we use this approximation.</Paragraph> <Paragraph position="3"> Two additional issues need addressing before evaluation. The first regards document sourcing. If an n-gram appears in multiple sources (e.g., Congressional Record versus Associated Press), its likelihood of accuracy should increase. This is particularly true if we are looking for MWU headwords for a general versus a specialized dictionary. Phrases that appear in one source may in fact be general MWUs, but frequently they are text-specific units. Hence, precision gained by excluding single-source n-grams may be worth losses in recall. We will measure this trade-off.</Paragraph> <Paragraph position="4"> Second, evaluating with punctuation as words and applying no filtering mechanism may unfairly bias against some algorithms. Pre- or post-processing of n-grams with a linguistic filter has been shown to improve some induction algorithms' performance, but in a knowledge-free setting we cannot apply part-of-speech rules as in Section 2.2. Yet we can filter by pruning n-grams whose beginning or ending word is among the top N most frequent words. This unfortunately eliminates acronyms like &quot;U. S.&quot; and phrasal verbs like &quot;throw up.&quot; However, discarding some words may be worthwhile if the final list of n-grams is richer in terms of MRD headwords. We therefore evaluate with such an automatic filter, arbitrarily (and without optimization) choosing N=75.</Paragraph>
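<Paragraph position="5"> To make the split-point approximation described above concrete, the following sketch (illustrative Python; the probability table and helper names are assumptions, not part of the original work) chooses the split index i that maximizes P_X P_Y and then hands the two halves to a two-word association measure such as pointwise mutual information:
from math import log

def best_split(ngram, prob):
    """Split C = w_1 ... w_n into X = w_1..w_i and Y = w_{i+1}..w_n,
    choosing i to maximize P(X) * P(Y)."""
    best_i, best_p = 1, -1.0
    for i in range(1, len(ngram)):
        p = prob.get(ngram[:i], 0.0) * prob.get(ngram[i:], 0.0)
        if p > best_p:
            best_i, best_p = i, p
    return ngram[:best_i], ngram[best_i:]

def approx_pmi(ngram, prob):
    """Apply a bigram measure (here pointwise mutual information)
    to an arbitrary n-gram via its most probable split."""
    x, y = best_split(ngram, prob)
    return log(prob[ngram] / (prob[x] * prob[y]))
</Paragraph>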
</Section> <Section position="8" start_page="0" end_page="2" type="metho"> <SectionTitle> 4 Evaluating Performance </SectionTitle> <Paragraph position="0"> A natural scoring standard is to select a language and evaluate against headwords from existing dictionaries in that language. Others have used similar standards (Daille, 1996), but to our knowledge, none to the extent described here. We evaluate thousands of hypothesized units from an unconstrained corpus. Furthermore, we use two separate evaluation gold standards: (1) WordNet (Miller et al., 1990) and (2) a collection of Internet MRDs. Using two gold standards helps validate their coverage of valid MWUs. It also provides evaluation using both static and dynamic resources. We choose to evaluate in English due to the wealth of linguistic resources.</Paragraph> <Paragraph position="1"> In particular, we use a randomly-selected corpus consisting of a 6.7 million word subset of the TREC databases (DARPA, 1993-1997). Table 2 illustrates a sample of rank-ordered output from each of the different algorithms (following the cross-source, filtered paradigm described in Section 3); the &quot;* *&quot; and &quot;* * *&quot; entries are actual units. Note that algorithms in the first four columns produce results that are similar to each other, as do those in the last four columns. Although the mutual information results seem to be almost in a class of their own, they actually are similar overall to the first four sets of results; therefore, we will refer to the first five columns as &quot;information-like.&quot; Similarly, since the last four columns share properties of the frequency approach, we will refer to them as &quot;frequency-like.&quot;</Paragraph> <Paragraph position="2"> One's application may dictate which set of algorithms to use. Our gold standard selection reflects our interest in general word dictionaries, so results we obtain may differ from results we might have obtained using terminology lexicons.</Paragraph> <Paragraph position="3"> If our gold standard contains K MWUs with corpus frequencies satisfying the threshold (T=10), our figure of merit (FOM) is given by FOM = (1/K) * sum_{i=1}^{K} P_i, where P_i (precision at i) equals i/H_i, and H_i is the number of hypothesized MWUs required to find the i-th correct MWU. This FOM corresponds to the area under a precision-recall curve.</Paragraph>
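<Paragraph position="4"> A minimal sketch of this figure of merit (illustrative Python, not from the original paper), assuming a ranked list of hypothesized MWUs and a gold-standard set:
def figure_of_merit(ranked_hypotheses, gold):
    """FOM = (1/K) * sum_i P_i, with P_i = i/H_i and H_i the number of
    hypotheses needed to reach the i-th correct MWU; gold MWUs never
    hypothesized contribute zero precision."""
    k = len(gold)
    found, total_precision = 0, 0.0
    for h, hypothesis in enumerate(ranked_hypotheses, start=1):
        if hypothesis in gold:
            found += 1
            total_precision += found / h   # P_i = i / H_i
    return total_precision / k if k else 0.0
</Paragraph>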
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 WordNet-based Evaluation </SectionTitle> <Paragraph position="0"> WordNet has definite advantages as an evaluation resource. It has in excess of 50,000 MWUs, is freely accessible, widely used, and is in electronic form.</Paragraph> <Paragraph position="1"> Yet, it obviously cannot contain every MWU. For instance, our corpus contains 177,331 n-grams (for 2 ≤ n ≤ 10) satisfying T ≥ 10, but WordNet contains only 2610 of these. It is unclear, therefore, whether algorithms are wrong when they propose MWUs that are not in WordNet. We will assume they are wrong, but with a special caveat for proper nouns.</Paragraph> <Paragraph position="2"> WordNet includes few proper noun MWUs. Yet several algorithms produce large numbers of proper nouns. This biases against them. One could contend that all proper noun MWUs are valid, but we disagree. Although such may be MWUs, they are not necessarily MRD headwords; one would not include every proper noun in a dictionary, but rather those needing definitions. To overcome this, we will have two scoring modes. The first, &quot;S&quot; mode (standing for some), discards any proposed capitalized n-gram whose uncapitalized version is not in WordNet. The second mode, &quot;N&quot; (for none), disregards all capitalized n-grams.</Paragraph> <Paragraph position="3"> Table 3 illustrates algorithmic performance as compared to the 2610 MWUs from WordNet. The first double column illustrates &quot;out-of-the-box&quot; performance on all 177,331 possible n-grams. The second double column shows cross-sourcing: only hypothesizing MWUs that appear in at least two separate datasets (124,952 in all), but being evaluated against all of the 2610 valid units. Double columns 3 and 4 show effects from high-frequency filtering of the n-grams of the first and second columns (reporting only 29,716 and 17,720 n-grams, respectively).</Paragraph> <Paragraph position="4"> As Table 3 suggests, for every condition, the information-like algorithms seem to perform best at identifying valid, general MWU headwords. Moreover, they are enhanced when cross-sourcing is considered; but since much of their strength comes from identifying proper nouns, filtering has little or even negative impact. On the other hand, the frequency-like approaches are independent of data source. They also improve significantly with filtering. Overall, though, after the algorithms are judged, even the best score of 0.265 is far short of the maximum possible, namely 1.0.</Paragraph> </Section> <Section position="2" start_page="0" end_page="2" type="sub_section"> <SectionTitle> 4.2 Web-based Evaluation </SectionTitle> <Paragraph position="0"> Since WordNet is static and cannot report on all of a corpus' n-grams, one may expect different performance by using a more all-encompassing, dynamic resource. The Internet houses dynamic resources which can judge practically every induced n-gram. With permission and sufficient time, one can repeatedly query websites that host large collections of MRDs and evaluate each n-gram.</Paragraph> <Paragraph position="1"> Having approval, we queried: (1) onelook.com, (2) acronymfinder.com, and (3) infoplease.com. The first website interfaces with over 600 electronic dictionaries. The second is devoted to identifying proper acronyms. The third focuses on world facts such as historical figures and organization names.</Paragraph> <Paragraph position="2"> To minimize disruption to websites by reducing the total number of queries needed for evaluation, we use an evaluation approach from the information retrieval community (Sparck-Jones and van Rijsbergen, 1975). Each algorithm reports its top 5000 MWU choices, and the union of these choices (45,192 possible n-grams) is looked up on the Internet. Valid MWUs identified at any website are assumed to be the only valid units in the data. Algorithms are then evaluated based on this collection. Although this strategy for evaluation is not flawless, it is reasonable and makes dynamic evaluation tractable.</Paragraph> <Paragraph position="3"> Table 4 shows the algorithms' performance (including proper nouns). Though Internet dictionaries and WordNet are completely separate &quot;gold standards,&quot; the results are surprisingly consistent. One can conclude that WordNet may safely be used as a gold standard in future MWU headword evaluations. Also, one can see that Z-scores, χ2, and SCP have virtually identical results and seem to best identify MWU headwords (particularly if proper nouns are desired). Yet there is still significant room for improvement.</Paragraph>
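<Paragraph position="4"> The pooling strategy described above can be sketched as follows (illustrative Python; the lookup_in_mrds function standing in for the website queries is an assumption, not an interface the paper defines):
def pooled_evaluation(rankings, lookup_in_mrds, depth=5000):
    """Judge only the union of each algorithm's top-`depth` hypotheses,
    then treat every n-gram judged valid as the complete set of valid
    units against which all algorithms are scored."""
    pool = set()
    for ranked in rankings.values():       # one ranked list per algorithm
        pool.update(ranked[:depth])
    valid = {ngram for ngram in pool if lookup_in_mrds(ngram)}
    return valid                           # e.g., feed to figure_of_merit
</Paragraph>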
</Section> </Section> <Section position="9" start_page="2" end_page="2" type="metho"> <SectionTitle> 5 Improvement strategies </SectionTitle> <Paragraph position="0"> Can performance be improved? Numerous strategies could be explored. An idea we discuss here tries using induced semantics to rescore the output of the best algorithm (the filtered, cross-sourced Z-score) and eliminate semantically compositional or modifiable MWU hypotheses.</Paragraph> <Paragraph position="1"> Deerwester et al. (1990) introduced Latent Semantic Analysis (LSA) as a computational technique for inducing semantic relationships between words and documents. It forms high-dimensional vectors using word counts and uses singular value decomposition to project those vectors into an optimal k-dimensional, &quot;semantic&quot; subspace (see Landauer et al., 1998).</Paragraph> <Paragraph position="2"> Following an approach from Schutze (1993), we showed how one could compute latent semantic vectors for any word in a corpus (Schone and Jurafsky, 2000). Using the same approach, we compute semantic vectors for every proposed word n-gram C = X_1 X_2 ... X_n. Since LSA involves word counts, we can also compute semantic vectors for C's subcomponents; these can either include (vec(X)) or exclude (vec(X)*) C's counts. We seek to see if induced semantics can help eliminate incorrectly-chosen MWUs. As will be shown, the effort of using semantics in this manner has a very small payoff for the expended cost.</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 5.1 Non-compositionality </SectionTitle> <Paragraph position="0"> Non-compositionality is a key component of valid MWUs, so we may desire to emphasize n-grams that are semantically non-compositional. Suppose we wanted to determine if C (defined above) were non-compositional. Then, given some meaning function M, C should satisfy an equation like: g(M(C), h(M(X_1), M(X_2), ..., M(X_n))) is large, (1) where h combines the semantics of C's subcomponents and g measures semantic differences. If C were a bigram, then if g(a,b) is defined to be |a-b|, if h(c,d) is the sum of c and d, and if M(e) is set to -log P_e, then equation (1) would become the pointwise mutual information of the bigram. If g(a,b) were defined to be (a-b)/√b, and if h(a,b)=ab/N and M(X)=f_X, we essentially get Z-scores. These formulations suggest that several of the probabilistic algorithms we have seen include non-compositionality measures already. However, since the probabilistic algorithms rely only on distributional information obtained by considering juxtaposed words, they tend to incorporate a significant amount of non-semantic information such as syntax. Can semantic-only rescoring help?</Paragraph> <Paragraph position="1"> To find out, we must select g, h, and M. Since we want to eliminate MWUs that are compositional, we want h's output to correlate well with C when there is compositionality and correlate poorly otherwise. Frequently, LSA vectors are correlated using the cosine between them, cos(a,b) = (a · b) / (|a| |b|). A large cosine indicates strong correlation, so large values for g(a,b) = 1 - |cos(a,b)| should signal weak correlation or non-compositionality.</Paragraph>
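<Paragraph position="2"> A sketch of this semantic rescoring step (illustrative Python using numpy; the uniform weighting is only one of the settings discussed below, and all vectors are assumed to be precomputed LSA vectors):
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def non_compositionality(ngram_vec, component_vecs, weights=None):
    """g(a, b) = 1 - |cos(a, b)|, with h taken to be a (weighted) sum of
    the components' semantic vectors; larger values suggest the n-gram's
    meaning is not composed from the meanings of its parts."""
    if weights is None:
        weights = [1.0] * len(component_vecs)   # or 1/frequency per word
    combined = np.sum([w * v for w, v in zip(weights, component_vecs)], axis=0)
    return 1.0 - abs(cosine(ngram_vec, combined))
</Paragraph>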
<Paragraph position="3"> h could represent a weighted vector sum of the components' semantic vectors, with weights (w_i) set either to 1.0 or to the reciprocal of the words' frequencies. Table 5 indicates several results using these settings. As the first four rows indicate, and as desired, non-compositionality is more apparent for vec(X)* (i.e., the vectors derived from excluding C's counts) than for vec(X). Yet performance overall is horrible, particularly considering we are rescoring Z-score output whose score was 0.269. Rescoring caused a five-fold degradation! What happens if we instead emphasize compositionality? Rows 5-8 illustrate the effect: there is a significant recovery in performance. The most reasonable explanation is that if MWUs and their components are strongly correlated, the components may rarely occur except in context with the MWU. It takes about 20 hours to compute vec(X)* for each possible n-gram combination. Since the probabilistic algorithms already identify n-grams that share strong distributional properties with their components, it seems imprudent to exhaust resources on this LSA-based strategy for non-compositionality.</Paragraph> <Paragraph position="4"> These findings warrant some discussion. Why did non-compositionality fail? Certainly there is the possibility that better choices for g, h, and M could yield improvements. We actually spent months trying to find an optimal combination, as well as a strategy for coupling LSA-based scores with the Z-scores, but to no avail. Another possibility: although LSA can find semantic relationships, it may not make semantic decisions at the level required for this task. This seems to be a significant factor. Yet there is still another: maybe semantic compositionality is not always bad. Interestingly, this is often the case. Consider vice_president, organized_crime, and Marine_Corps. Although these are MWUs, one would still expect that the first is related to president, the second relates to crime, and the last relates to Marine. Similarly, tokens such as Johns_Hopkins and Elvis are anaphors for Johns_Hopkins_University and Elvis_Presley, so they should have similar meanings. This begs the question: can induced semantics help at all? The answer is &quot;yes.&quot; The key is using LSA where it does best: finding things that are similar -- or substitutable.</Paragraph> </Section> <Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 5.2 Non-substitutivity </SectionTitle> <Paragraph position="0"> For every collocation C = X_1 X_2 ... X_{i-1} X_i X_{i+1} ... X_n, we can consider replacing the word X_i with some other word Y; if X_i and Y are semantically related, chances are that C is substitutable. Since LSA excels at finding semantic correlations, we can compare vec(X_i) and vec(Y) to see if C is substitutable. We use our earlier approach (Schone and Jurafsky, 2000) for performing the comparison; namely, for every word W, we compute cos(vec(W), vec(R)) for 200 randomly chosen words R. This allows for computation of a correlation mean (μ_W) and standard deviation (σ_W) between W and other words. As before, we then compute a normalized cosine score between words of interest, defined by ncos(A, B) = (cos(vec(A), vec(B)) - μ_A) / σ_A. With this set-up, we now look for substitutivity. Note that phrases may be substitutable and still be headwords if their substitute phrases are themselves MWUs. For example, dioxide in carbon_dioxide is semantically similar to monoxide in carbon_monoxide.</Paragraph>
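<Paragraph position="1"> The normalized cosine comparison can be sketched as follows (illustrative Python; the dictionary of LSA vectors and the fixed random seed are assumptions made for the sake of a self-contained example):
import random
import numpy as np

def normalized_cosine(word_a, word_b, vectors, sample_size=200, seed=0):
    """Normalize cos(A, B) by the mean and standard deviation of A's
    cosines against randomly chosen words, so that a score of 2.0 lies
    two standard deviations above A's typical correlation."""
    cos = lambda a, b: float(np.dot(a, b) /
                             (np.linalg.norm(a) * np.linalg.norm(b)))
    rng = random.Random(seed)
    sample = rng.sample([w for w in vectors if w != word_a], sample_size)
    cosines = [cos(vectors[word_a], vectors[r]) for r in sample]
    mu, sigma = float(np.mean(cosines)), float(np.std(cosines))
    return (cos(vectors[word_a], vectors[word_b]) - mu) / sigma
</Paragraph>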
<Paragraph position="2"> Moreover, there are other important instances of valid substitutivity. However, substitutivity is not always desirable: guilty and innocent are semantically related, but pleaded_guilty and pleaded_innocent are not MWUs. We would like to emphasize only n-grams whose substitutes are valid MWUs. To show how we do this using LSA, suppose we want to rescore a list L whose entries are potential MWUs. For every entry X in L, we seek out all other entries whose sorted order is less than some maximum value (such as 5000) and that have all but one word in common. For example, suppose X is &quot;bachelor_'_s_degree.&quot; The only other entry that matches in all but one word is &quot;master_'_s_degree.&quot; If the semantic vectors for &quot;bachelor&quot; and &quot;master&quot; have a normalized cosine score greater than a threshold of 2.0, we then say that the two MWUs are in each other's substitution set.</Paragraph> <Paragraph position="3"> To rescore, we assign a new score to each entry in the substitution set. Each element in the substitution set gets the same score. The score is derived using a combination of the previous Z-scores for each element in the substitution set. The combining function may be an average, the median, the maximum, or something else. The maximum outperforms the average and the median on our data. By applying it to our data, we observe a small but visible improvement of 1.3% absolute, to .282 (see Fig. 1). It is also possible that other improvements could be gained using other combining strategies.</Paragraph> </Section> </Section> </Paper>