<?xml version="1.0" standalone="yes"?> <Paper uid="N03-1036"> <Title>Unsupervised methods for developing taxonomies by combining syntactic and statistical information</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Experiments and Evaluation </SectionTitle> <Paragraph position="0"> To test the success of our approach to placing unknown words into the WordNet taxonomy on a large and significant sample, we designed the following experiment. If the algorithm is successful at placing unknown words in the correct new place in a taxonomy, we would expect it to place already known words in their current position.</Paragraph> <Paragraph position="1"> The experiment to test this worked as follows.</Paragraph> <Paragraph position="2"> * For a word w, find the neighbors N(w) of w in WordSpace. Remove w itself from this set.</Paragraph> <Paragraph position="3"> * Find the best class-label hmax(N(w)) for this set (using Definition 1).</Paragraph> <Paragraph position="4"> * Test to see if, according to WordNet, hmax is a hypernym of the original word w, and if so check how closely hmax subsumes w in the taxonomy.</Paragraph> <Paragraph position="5"> Since our class-labelling algorithm gives a ranked list of possible hypernyms, credit was given for correct classifications in the top 4 places. This algorithm was tested on singular common nouns (PoS-tag nn1), proper nouns (PoS-tag np0) and finite present-tense verbs (PoS-tag vvb). For each of these classes, a random sample of words was selected with corpus frequencies ranging from 1000 to 250. For the noun categories, 600 words were sampled, and for the finite verbs, 420. For each word w, we found semantic neighbors with and without using part-of-speech information. 
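The evaluation procedure above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the neighbour function, the ranked class-labelling of Definition 1, and the hypernym-height lookup are all hypothetical stand-ins for the paper's WordSpace and WordNet components.

```python
def evaluate_word(w, neighbors, ranked_labels, hypernym_heights, top_k=4):
    """Evaluate one known word w: find its WordSpace neighbours
    (excluding w itself), take the top-k candidate class-labels,
    and return the smallest number of levels at which a candidate
    is a true WordNet hypernym of w, or None for a 'Wrong' outcome.
    All three callables are hypothetical stand-ins."""
    n_w = [n for n in neighbors(w) if n != w]    # remove w from N(w)
    candidates = ranked_labels(n_w)[:top_k]      # credit given in top 4 places
    heights = hypernym_heights(w)                # hypernym -> levels above w
    hits = [heights[h] for h in candidates if h in heights]
    return min(hits) if hits else None
```

Under this sketch, the smallest returned height corresponds to the most exact classification, and None counts towards the 'Wrong' column.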
The same experiments were carried out using 3, 6 and 12 neighbors: we will focus on the results for 3 and 12 neighbors since those for 6 neighbors turned out to be reliably 'somewhere in between' these two.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Results for Common Nouns </SectionTitle> <Paragraph position="0"> The best results for reproducing WordNet classifications were obtained for common nouns, and are summarized in Table 2, which shows the percentage of test words w that were given a class-label h which was a correct hypernym according to WordNet (so for which h ∈ H(w)). For those words for which a correct classification was found, the 'Height' columns refer to the number of levels in the hierarchy between the target word w and the class-label h. If the algorithm failed to find a class-label h which is a hypernym of w, the result was counted as 'Wrong'. The 'Missing' column records the number of words in the sample which are not in WordNet at all.</Paragraph> <Paragraph position="1"> The following trends are apparent. For finding any correct class-label, the best results were obtained by taking 12 neighbors and using part-of-speech information, which found a correct classification for 485/591 = 82% of the common nouns that were included in WordNet. This compares favorably with previous experiments, though as stated earlier it is difficult to be sure we are comparing like with like. Finding the hypernym which immediately subsumes w (with no intervening nodes) exactly reproduces a classification given by WordNet, and as such was taken to be a complete success. Taking fewer neighbors and using PoS-information both improved this success rate, the best accuracy obtained being 86/591 = 15%. 
However, this configuration actually gave the worst results at obtaining a correct classification overall.</Paragraph> <Paragraph position="2"> In conclusion, taking more neighbors improves the chances of obtaining some correct classification for a word w, but taking fewer neighbors increases the chances of 'hitting the nail on the head'. The use of part-of-speech information reliably increases the chances of correctly obtaining both exact and broadly correct classifications, though careful tuning is still necessary to obtain optimal results for either.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Results for Proper Nouns and Verbs </SectionTitle> <Paragraph position="0"> The results for proper nouns and verbs (also in Table 2) demonstrate some interesting problems. On the whole, the mapping is less reliable than for common nouns, at least when it comes to reconstructing WordNet as it currently stands.</Paragraph> <Paragraph position="1"> Proper nouns are rightly recognized as one of the categories where automatic methods for lexical acquisition are most important (Hearst and Schütze, 1993, §4). It is impossible for a single knowledge base to keep up to date with all possible meanings of proper names, and attempting to do so would be undesirable without considerable filtering abilities, because proper names are often domain-specific.</Paragraph> <Paragraph position="2"> In our experiments, the best results for proper nouns were those obtained using 12 neighbors, where a correct classification was found for 206/266 = 77% of the proper nouns that were included in WordNet, using no part-of-speech information. 
Part-of-speech information still helps for mapping proper nouns into exactly the right place, but in general degrades performance.</Paragraph> <Paragraph position="3"> Several of the proper names tested are geographical, and in the BNC they often refer to regions of the British Isles which are not in WordNet. For example, hampshire is labelled as a territorial division, which as an English county it certainly is, but in WordNet hampshire is instead a hyponym of domestic sheep. For many of the proper names which our evaluation labelled as 'wrongly classified', the classification was in fact correct but a different meaning from those given in WordNet. The challenge for these situations is how to recognize when corpus methods give a correct meaning which is different from the meaning already listed in a knowledge base.</Paragraph> <Paragraph position="4"> Many of these meanings will be systematically related (such as the way a region is used to name an item or product from that region, as with the hampshire example above) by generative processes which are becoming well understood by theoretical linguists (Pustejovsky, 1995), and linguistic theory may help our statistical algorithms considerably by predicting what sort of new meanings we might expect a known word to assume through metonymy and systematic polysemy.</Paragraph> <Paragraph position="5"> Typical first names of people such as lisa and ralph almost always have neighbors which are also first names (usually of the same gender), but these words are not represented in WordNet. This lexical category is ripe for automatic discovery: preliminary experiments using the two names above as 'seed-words' (Roark and Charniak, 1998; Widdows and Dorow, 2002) show that by taking a few known examples, finding neighbors and removing words which are already in WordNet, we can collect first names of the same gender with at least 90% accuracy.</Paragraph> <Paragraph position="6"> Verbs pose special problems for knowledge bases. 
The usefulness of an IS-A hierarchy for pinpointing information and enabling inference is much less clear-cut than for nouns. For example, sleeping does entail breathing and arriving does imply moving, but the aspectual properties, argument structure and case roles may all be different. The more restrictive definition of troponymy is used in WordNet to describe those properties of verbs that are inherited through the taxonomy (Fellbaum, 1998, Ch. 3). In practice, the taxonomy of verbs in WordNet tends to have fewer levels and many more branches than the noun taxonomy. This led to problems for our class-labelling algorithm -- class-labels obtained for the verb play included exhaust, deploy, move and behave, all of which are 'correct' hypernyms according to WordNet, while possible class-labels obtained for the verb appeal included keep, defend, reassert and examine, all of which were marked 'wrong'. For our methods, the WordNet taxonomy as it stands appears to give much less reliable evaluation criteria for verbs than for common nouns. It is also plausible that similarity measures based upon simple co-occurrence are better for modelling similarity between nominals than between verbs, an observation which is compatible with psychological experiments on word-association (Fellbaum, 1998, p. 90).</Paragraph> <Paragraph position="7"> In our experiments, the best results for verbs were clearly those obtained using 12 neighbors and no part-of-speech information, for which some correct classification was found for 273/406 = 59% of the verbs that were included in WordNet, and which achieved better results than those using part-of-speech information even for finding exact classifications. 
The shallowness of the taxonomy for verbs means that most classifications which were successful at all were quite close to the word in question, which should be taken into account when interpreting the results in Table 2.</Paragraph> <Paragraph position="8"> As we have seen, part-of-speech information degraded performance overall for proper nouns and verbs. This may be because combining all uses of a particular word-form into a single vector is less prone to problems of data sparseness, especially if these word-forms are semantically related in spite of part-of-speech differences. (This issue is reminiscent of the question of whether stemming improves or harms information retrieval (Baeza-Yates and Ribeiro-Neto, 1999): the received wisdom is that stemming (at best) improves recall at the expense of precision, and our findings for proper nouns are consistent with this.)</Paragraph> <Paragraph position="9"> It is also plausible that discarding part-of-speech information should improve the classification of verbs for the following reason. Classification using corpus-derived neighbors is markedly better for common nouns than for verbs, and most of the verbs in our sample (57%) also occur as common nouns in WordSpace. (In contrast, only 13% of our common nouns also occur as verbs, a reliable asymmetry for English.) Most of these noun senses are semantically related in some way to the corresponding verbs. 
Since using neighboring words for classification is demonstrably more reliable for nouns than for verbs, putting these parts-of-speech together in a single vector in WordSpace might be expected to improve performance for verbs but degrade it for nouns.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Filtering using Affinity scores </SectionTitle> <Paragraph position="0"> One of the benefits of the class-labelling algorithm (Definition 1) presented in this paper is that it returns not just class-labels but an affinity score measuring how well each class-label describes the class of objects in question.</Paragraph> <Paragraph position="1"> The affinity score turns out to be significantly correlated with the likelihood of obtaining a successful classification. This can be seen very clearly in Table 3, which shows the average affinity score for correct class-labels of different heights above the target word, and for incorrect class-labels -- as a rule, correct and informative class-labels have significantly higher affinity scores than incorrect class-labels. It follows that the affinity score can be used as an indicator of success, and so filtering out class-labels with poor scores can be used as a technique for improving accuracy.</Paragraph> <Paragraph position="2"> To test this, we repeated our experiments using 3 neighbors, this time keeping only class-labels with an affinity score greater than 0.75, the rest being marked 'unknown'. Without filtering, there were 1143 successful and 1380 unsuccessful outcomes; with filtering, these numbers changed to 660 and 184 respectively. Filtering discarded some 87% of the incorrect labels and kept more than half of the correct ones, which amounts to at least a fourfold improvement in accuracy. 
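The effect of the 0.75 affinity threshold can be checked directly from the counts reported above. The following sketch uses only those published figures; reading the 'fourfold improvement' as a ratio of correct-to-incorrect odds is our interpretation of the claim.

```python
def precision(correct, incorrect):
    """Fraction of emitted class-labels that were correct."""
    return correct / (correct + incorrect)

# Counts reported in the text: unfiltered vs. affinity score > 0.75.
p_before = precision(1143, 1380)           # roughly 0.45
p_after = precision(660, 184)              # roughly 0.78
kept_correct = 660 / 1143                  # more than half retained
discarded_incorrect = 1 - 184 / 1380       # some 87% discarded
odds_gain = (660 / 184) / (1143 / 1380)    # correct:incorrect odds improve > 4x
```

The odds of a kept label being correct rise from about 0.83:1 to about 3.6:1, consistent with the fourfold figure.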
The improvement was particularly dramatic for proper nouns, where filtering removed 270 out of 283 incorrect results and still retained half of the correct ones.</Paragraph> <Paragraph position="3"> Conclusions. For common nouns, where WordNet is most reliable, our mapping algorithm performs comparatively well, accurately classifying several words and finding some correct information about most others. The optimum number of neighbors is smaller if we want to try for an exact classification and larger if we want information that is broadly reliable. Part-of-speech information noticeably improves the process of both broad and narrow classification. For proper names, many classifications are correct, and many which are absent or incorrect according to WordNet are in fact correct meanings which should be added to the knowledge base for (at least) the domain in question. Results for verbs are more difficult to interpret: reasons for this might include the shallowness and breadth of the WordNet verb hierarchy, the suitability of our WordSpace similarity measure, and many theoretical issues which should be taken into account for a successful approach to the classification of verbs.</Paragraph> <Paragraph position="4"> Filtering using the affinity score from the class-labelling algorithm can be used to dramatically increase performance.</Paragraph> </Section> </Section> </Paper>