Unsupervised methods for developing taxonomies by combining syntactic and statistical information

2 Finding semantic neighbors: Combining latent semantic analysis with part-of-speech information

There are many empirical techniques for recognizing when words are similar in meaning, rooted in the idea that "you shall know a word by the company it keeps" (Firth, 1957). It is certainly the case that words which repeatedly occur with similar companions often have related meanings, and common features used for determining this similarity include shared collocations (Lin, 1999), co-occurrence in lists of objects (Widdows and Dorow, 2002) and latent semantic analysis (Landauer and Dumais, 1997; Hearst and Schütze, 1993).

The method used to obtain semantic neighbors in our experiments was a version of latent semantic analysis, descended from that used by Hearst and Schütze (1993, §4). First, 1000 frequent words were chosen as column labels (after removing stopwords (Baeza-Yates and Ribeiro-Neto, 1999, p. 167)). Other words were assigned coordinates determined by the number of times they occurred within the same context window (15 words) as one of the 1000 column-label words in a large corpus. This gave a matrix in which every word is represented by a row vector determined by its co-occurrence with frequently occurring, meaningful words. Since this matrix was very sparse, singular value decomposition (known in this context as latent semantic analysis (Landauer and Dumais, 1997)) was used to reduce the number of dimensions from 1000 to 100. The reduced vector space is called WordSpace (Hearst and Schütze, 1993, §4). Similarity between words was then computed using the cosine similarity measure (Baeza-Yates and Ribeiro-Neto, 1999, p. 28). Such techniques for measuring similarity between words have been shown to capture semantic properties: for example, they have been used successfully for recognizing synonymy (Landauer and Dumais, 1997) and for finding correct translations of individual terms (Widdows et al., 2002).

The corpus used for these experiments was the British National Corpus, which is tagged for parts of speech. This enabled us to build syntactic distinctions into WordSpace: instead of just giving a vector for the string test, we were able to build separate vectors for test as a noun, as a verb and as an adjective. An example of the contribution of part-of-speech information to extracting semantic neighbors of the word fire is shown in Table 2. As can be seen, the noun fire (as in the substance/element) and the verb fire (mainly used to mean firing some sort of weapon) are related to quite different areas of meaning. Building a single vector for the string fire confuses this distinction: the neighbors of fire treated just as a string include words related to both the meaning of fire as a noun (more frequent in the BNC) and as a verb.
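The WordSpace construction just described can be summarized in code. The sketch below is illustrative only: it assumes a POS-tagged corpus supplied as an iterable of (token, tag) pairs and uses NumPy/SciPy for the singular value decomposition; the function names build_wordspace and cosine_neighbors, and all parameter defaults, are hypothetical rather than taken from the paper.

```python
# Minimal sketch of the WordSpace construction (illustrative, not the paper's code).
from collections import Counter
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import svds

def build_wordspace(tagged_corpus, stopwords, n_columns=1000, window=15, dims=100):
    # Keep part-of-speech distinctions by representing each token as "string_tag",
    # so that e.g. fire_nn1 and fire_vvi receive separate row vectors.
    tokens = [f"{w.lower()}_{t.lower()}" for w, t in tagged_corpus
              if w.lower() not in stopwords]

    # The most frequent words (after stopword removal) serve as column labels.
    freqs = Counter(tokens)
    columns = [w for w, _ in freqs.most_common(n_columns)]
    col_index = {w: i for i, w in enumerate(columns)}
    row_index = {w: i for i, w in enumerate(freqs)}      # every word gets a row

    # Count co-occurrences within a +/- `window`-token context window.
    counts = lil_matrix((len(row_index), len(columns)))
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in col_index:
                counts[row_index[w], col_index[tokens[j]]] += 1

    # Latent semantic analysis: reduce the sparse matrix to (at most) `dims` dimensions.
    k = min(dims, min(counts.shape) - 1)
    u, s, _ = svds(counts.tocsr(), k=k)
    return row_index, u * s                              # reduced word vectors

def cosine_neighbors(word, row_index, vectors, k=10):
    # Rank all words by cosine similarity to `word` in the reduced space.
    v = vectors[row_index[word]]
    sims = vectors @ v / np.maximum(
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(v), 1e-12)
    words = list(row_index)
    return [(words[i], float(sims[i])) for i in np.argsort(-sims)[:k]]
```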
Part of the goal of our experiments was to investigate the contribution that this part-of-speech information made for mapping words into taxonomies. As far as we are aware, these experiments are the first to investigate the combination of latent semantic indexing with part-of-speech information.

3 Finding class-labels: Mapping collections of words into a taxonomy

Given a collection of words or multiword expressions which are semantically related, it is often important to know what these words have in common. All adults with normal language competence and world knowledge are adept at this task: we know that plant, animal and fungus are all living things, and that plant, factory and works are all kinds of buildings. This ability to classify objects, and to work out which of the possible classifications of a given object is appropriate in a particular context, is essential for understanding and reasoning about linguistic meaning. We will refer to this process as class-labelling.

The approach demonstrated here uses a hand-built taxonomy to assign class-labels to a collection of similar nouns. As with much work of this nature, the taxonomy used is WordNet (version 1.6), a freely available broad-coverage lexical database for English (Fellbaum, 1998). Our algorithm finds the hypernyms which subsume as many as possible of the original nouns, as closely as possible.[1] The concept v is said to be a hypernym of w if w is a kind of v; for this reason this sort of taxonomy is sometimes referred to as an 'IS A hierarchy'. For example, the possible hypernyms given for the word oak in WordNet 1.6 are

oak ⇒ wood ⇒ plant material ⇒ material, stuff ⇒ substance, matter ⇒ object, physical object ⇒ entity, something

oak, oak tree ⇒ tree ⇒ woody plant, ligneous plant ⇒ vascular plant, tracheophyte ⇒ plant, flora, plant life ⇒ life form, organism, being, living thing ⇒ entity, something

[1] Another method which could be used for class-labelling is given by the conceptual density algorithm of Agirre and Rigau (1996), which those authors applied to word-sense disambiguation. A different but related idea is presented by Li and Abe (1998), who use a principle from information theory to model selectional preferences for verbs using different classes from a taxonomy. Their algorithm and goals are different from ours: we are looking for a single class-label for semantically related words, whereas for modelling selectional preferences several classes may be appropriate.

Let S be a set of nouns or verbs. If the word w ∈ S is recognized by WordNet, the WordNet taxonomy assigns to w an ordered set of hypernyms H(w). Consider the union

H = \bigcup_{w \in S} H(w).

This is the set of all hypernyms of any member of S. Our intuition is that the most appropriate class-label for the set S is the hypernym h ∈ H which subsumes as many as possible of the members of S as closely as possible in the hierarchy. There is a trade-off here between subsuming 'as many as possible' of the members of S and subsuming them 'as closely as possible'. This line of reasoning can be used to define a whole collection of 'class-labelling algorithms'.
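As a concrete illustration of the hypernym sets H(w) and their union H, the sketch below gathers candidate class-labels from WordNet through NLTK's interface (a more recent WordNet release than the 1.6 version used in the paper); the helper names hypernym_distances and candidate_labels are hypothetical. For simplicity it counts taxonomic levels inclusively, so a sense is at distance 1 from itself.

```python
# Sketch of collecting candidate class-labels from WordNet (illustrative only).
from nltk.corpus import wordnet as wn

def hypernym_distances(word):
    # For each noun sense c of `word`, map every hypernym synset of c to the
    # number of taxonomic levels between them, counted inclusively (so the
    # sense itself is at distance 1, its direct hypernym at distance 2, ...).
    per_sense = []
    for sense in wn.synsets(word, pos=wn.NOUN):
        dist = {}
        for path in sense.hypernym_paths():        # each path runs root -> sense
            for levels, node in enumerate(reversed(path), start=1):
                dist[node] = min(dist.get(node, levels), levels)
        per_sense.append(dist)
    return per_sense

def candidate_labels(words):
    # H = the union over w in S of H(w): every hypernym of any sense of any
    # member of the word set S.
    H = set()
    for w in words:
        for dist in hypernym_distances(w):
            H |= set(dist)
    return H
```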
For each w ∈ S and for each h ∈ H, define the affinity score function a(w,h) between w and h to be

a(w,h) = \begin{cases} f(\mathrm{dist}(w,h)) & \text{if } h \in H(w) \\ -g(w,h) & \text{if } h \notin H(w), \end{cases} \qquad (1)

where dist(w,h) is a measure of the distance between w and h, f is some positive, monotonically decreasing function, and g is some positive (possibly constant) function.

The function f accords 'positive points' to h if h subsumes w, and the condition that f be monotonically decreasing ensures that h gets more positive points the closer it is to w. The function g subtracts 'penalty points' if h does not subsume w. This function could depend in many ways on w and h; for example, there could be a smaller penalty if h is a very specific concept than if h is a very general concept.

The distance measure dist(w,h) could take many forms, and there are already a number of distance measures available to use with WordNet (Budanitsky and Hirst, 2001). The easiest method for assigning a distance between words and their hypernyms is to count the number of intervening levels in the taxonomy. This assumes that the distance in specificity between ontological levels is constant, which is of course not the case, a problem addressed by Resnik (1999).

Given an appropriate affinity score, it is a simple matter to define the best class-label for a collection of objects.

Definition 1. Let S be a set of nouns, let H = ⋃_{w ∈ S} H(w) be the set of hypernyms of S, and let a(w,h) be an affinity score function as defined in equation (1). The best class-label h_max(S) for S is the node h_max ∈ H with the highest total affinity score summed over all the members of S, so h_max is the node which gives the maximum score

\max_{h \in H} \sum_{w \in S} a(w,h).

Since H is determined by S, h_max is solely determined by the set S and the affinity score a. In the event that h_max is not unique, it is customary to take the most specific class-label available.

Example

A particularly simple example of this kind of algorithm is used by Hearst and Schütze (1993). First they partition the WordNet taxonomy into a number of disjoint sets which are used as class-labels. Thus each concept has a single 'hypernym', and the 'affinity score' between a word w and a class h is simply the set membership function: a(w,h) = 1 if w ∈ h and 0 otherwise. A collection of words is assigned a class-label by majority voting.

3.1 Ambiguity

In theory, rather than a class-label for related strings, we would like one for related meanings, namely the concepts to which the strings refer. To implement this for a set of words, we alter our affinity score function a as follows. Let C(w) be the set of concepts to which the word w could refer (so each c ∈ C(w) is a possible sense of w), and define

a(w,h) = \max_{c \in C(w)} a(c,h). \qquad (2)

This implies that the 'preferred sense' of w with respect to the possible subsumer h is the sense closest to h. In practice, our class-labelling algorithm implements this preference by computing the affinity score a(c,h) for all c ∈ C(w) and only using the best match. This selective approach is much less noisy than simply averaging the probability mass of the word over each possible sense (the technique used in (Li and Abe, 1998), for example).
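A minimal sketch of how the affinity score of equations (1) and (2) and the class-label of Definition 1 fit together is given below. It reuses the hypothetical hypernym_distances and candidate_labels helpers from the previous sketch, takes f as a caller-supplied function of the distance, and treats g as a constant penalty (anticipating the choice made in section 3.2); it is not the paper's original implementation.

```python
# Sketch of the class-labelling algorithm (equations (1)-(2) and Definition 1);
# assumes hypernym_distances and candidate_labels from the sketch above.

def affinity(word, h, f, g):
    # Equation (1), per sense: f(dist) if the candidate h subsumes the sense,
    # otherwise the penalty -g. Equation (2): keep only the best-matching sense.
    scores = [f(dist[h]) if h in dist else -g
              for dist in hypernym_distances(word)]
    return max(scores) if scores else -g

def best_class_label(words, f, g):
    # Definition 1: the candidate hypernym with the highest total affinity,
    # summed over all members of the word set S.
    return max(candidate_labels(words),
               key=lambda h: sum(affinity(w, h, f, g) for w in words))
```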
3.2 Choice of scoring functions for the class-labelling algorithm

The precise choice of class-labelling algorithm depends on the functions f and g in the affinity score function a of equation (2). There is some tension here between being correct and being informative: 'correct' but uninformative class-labels (such as entity, something) can be obtained easily by preferring nodes high up in the hierarchy, but since our goal in this work was to classify unknown words in an informative and accurate fashion, the functions f and g had to be chosen to give an appropriate balance. After a variety of heuristic tests, the function f was chosen to be

f(\mathrm{dist}(w,h)) = \frac{1}{\mathrm{dist}(w,h)^2},

where for the distance function dist(w,h) we chose the computationally simple method of counting the number of taxonomic levels between w and h (counted inclusively, to avoid dividing by zero). For the penalty function g we chose the constant g = 0.25.

The net effect of choosing the reciprocal of the squared distance and a small constant penalty was that hypernyms close to the concept in question received magnified credit, but possible class-labels were not penalized too harshly for missing out a node. This made the algorithm simple and robust to noise, but with a strong preference for detailed, information-bearing class-labels. This configuration of the class-labelling algorithm was used in all the experiments described below.
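To see why this configuration favours informative labels, consider a hypothetical worked example (the distances and counts below are illustrative, not taken from the paper): a fairly specific candidate that subsumes three of four words at two levels each but misses the fourth beats a top-level label such as entity, something that subsumes all four at eight levels.

```python
# Hypothetical worked example of the chosen configuration: f = 1/dist**2, g = 0.25.
f = lambda d: 1.0 / d ** 2
g = 0.25

specific = 3 * f(2) - g    # subsumes 3 of 4 words at 2 levels, misses 1: 0.5
general = 4 * f(8)         # subsumes all 4 words, but at 8 levels each: 0.0625
assert specific > general  # the closer, more informative label wins

# With the helpers sketched earlier, this configuration would be applied as e.g.
# best_class_label({"oak", "elm", "birch"}, f=f, g=g)
```

The small constant penalty keeps a candidate from being rejected outright for missing a word, while the squared reciprocal of the distance sharply rewards proximity, matching the stated preference for detailed class-labels.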