<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0103"> <Title>Reducing Lexical Semantic Complexity with Systematic Polysemous Classes and Underspecification</Title> <Section position="4" start_page="15" end_page="16" type="metho"> <SectionTitle> 3 CoreLex-II </SectionTitle> <Paragraph position="0"> Thereby following Apresjan's definition of systematic polysemy discussed above.</Paragraph> <Section position="1" start_page="15" end_page="15" type="sub_section"> <SectionTitle> 3.1 A More Flexible Approach </SectionTitle> <Paragraph position="0"> The CoreLex database has been used and/or evaluated in a number of projects, leading to some criticisms of the approach in (Krymolowski and Roth 1998), (Peters, Peters and Vossen 1998), (Tomuro 1998) and in personal communication. Primarily, it was argued that the choice of basic types is arbitrary and set at too high a level. Systematic class discovery in the original approach depends on this set of basic types, which means that classes on lower levels are not captured at all. Further criticism concerned the arbitrariness (and inefficiency) of human intervention in grouping resulting classes into more comprehensive ones based on the similarity of their members.</Paragraph> <Paragraph position="1"> In response, a new approach was formulated and implemented that addresses both of these points. Comparison of sense distributions is now performed over synsets on all levels, not just over a small set at the top levels. In addition, the similarity of sense distributions between words need no longer be complete (100%), as in the former approach. Instead, a threshold on similarity can be set that constrains a clustering algorithm for automatically grouping words into systematic polysemous classes. (No human intervention to further group resulting classes is required.) This approach took inspiration from the pioneering work of (Dolan 1994), but it is also fundamentally different: instead of grouping similar senses together, the CoreLex approach groups together words according to all of their senses.</Paragraph> </Section> <Section position="2" start_page="15" end_page="16" type="sub_section"> <SectionTitle> 3.2 The Algorithm </SectionTitle> <Paragraph position="0"> The algorithm works as follows (for example, for nouns):
1. foreach noun
2.   get all level1 synsets (senses)
3.   if number of level1 synsets > 1 then put noun in list
4. foreach level1 synset
5.   get all higher level synsets (hypernyms)
6. foreach noun1 in list
7.   foreach noun2 in list
8.     compute similarity of noun1 and noun2
9.     if similarity > threshold then put noun1 and noun2 in matrix
10. foreach noun1 in matrix
11.   if noun1 not assigned to a cluster then construct a new cluster Ci and assign noun1 to it
12.   foreach noun2 similar to noun1
13.     if noun2 not assigned to a cluster then assign noun2 to cluster Ci</Paragraph> <Paragraph position="1"> For every noun in the WordNet or GermaNet index, get all of its senses (which are in fact level1 synsets). If a noun has more than one sense, put it in a separate list that will be used for further processing. Nouns with only one sense are not used further, because we are only interested in systematic distributions of more than one sense over several nouns. In order to compare nouns not only on the sense level but over the whole of the WordNet hierarchy, all higher level synsets (hypernyms) of each sense are also stored.</Paragraph> <Paragraph position="2"> Then, for each noun, we compare its &quot;sense distribution&quot; (the complete set of synsets derived in the previous steps) with that of each other noun. Similarity is computed using the Jaccard score, which compares objects according to the attributes they share and their unique attributes. If the similarity is over a certain threshold, the noun pair is stored in a matrix, which is subsequently used in a final clustering step.</Paragraph> <Paragraph position="3"> Finally, the clustering itself is a simple single-link algorithm that groups objects uniquely into discrete clusters.</Paragraph>
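<Paragraph position="4"> To make these steps concrete, the following is a minimal Python sketch of steps 1-13, assuming NLTK's WordNet interface as a stand-in for the WordNet/GermaNet index files used in the paper; all function and variable names are illustrative and not part of the original implementation.

from collections import defaultdict
from itertools import combinations
from nltk.corpus import wordnet as wn

def sense_distribution(noun):
    # Steps 1-5: all level1 synsets (senses) of a noun, plus all of
    # their hypernyms, so nouns are compared over the whole hierarchy.
    dist = set()
    for synset in wn.synsets(noun, pos=wn.NOUN):
        dist.add(synset)
        for path in synset.hypernym_paths():
            dist.update(path)
    return dist

def jaccard(a, b):
    # Jaccard score: shared attributes over shared plus unique attributes.
    return len(a.intersection(b)) / len(a.union(b))

def polysemous_classes(nouns, threshold=0.75):
    # Step 3: keep only ambiguous nouns (more than one level1 synset).
    ambiguous = [n for n in nouns if len(wn.synsets(n, pos=wn.NOUN)) > 1]
    dists = {n: sense_distribution(n) for n in ambiguous}
    # Steps 6-9: thresholded pairwise similarity "matrix", stored here
    # as an adjacency map between similar nouns.
    similar = defaultdict(set)
    for n1, n2 in combinations(ambiguous, 2):
        if jaccard(dists[n1], dists[n2]) > threshold:
            similar[n1].add(n2)
            similar[n2].add(n1)
    # Steps 10-13: single-link clustering; every noun is assigned to
    # exactly one cluster by following similarity links.
    assigned = set()
    clusters = []
    for noun in similar:
        if noun in assigned:
            continue
        cluster, stack = set(), [noun]
        while stack:
            n = stack.pop()
            if n not in assigned:
                assigned.add(n)
                cluster.add(n)
                stack.extend(similar[n])
        clusters.append(cluster)
    return clusters

Calling polysemous_classes over the noun index with the default threshold of 0.75 corresponds to the setting whose results are reported in Section 3.3 below.</Paragraph>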
</Section> <Section position="3" start_page="16" end_page="16" type="sub_section"> <SectionTitle> 3.3 Quantitative and Qualitative Analysis </SectionTitle> <Paragraph position="0"> Depending on the threshold on similarity, the algorithm generates a number of clusters of ambiguous words that share similar sense distributions and that can be seen as systematic polysemous classes. The following table gives an overview of results with different thresholds. The number of nouns in WordNet that were processed is 46,995, of which 10,772 have more than one sense.</Paragraph> <Paragraph position="1"> A qualitative analysis of the clusters shows that the best results are obtained with a threshold of 0.75. Some of the resulting clusters with this threshold are:
* ball / game: baseball, basketball, handball, volleyball, football
* fish / food: albacore, blowfish, bluefin, bluefish, bonito, bream, butterfish, crappie, croaker, dolphinfish, flatfish, flounder, grouper, halibut, lingcod, mackerel, mahimahi, mullet, muskellunge, pickerel, pompano, porgy, puffer, rockfish, sailfish, scup, striper, swordfish, tuna, tunny, weakfish
* plant / nut: almond, butternut, candlenut, cashew, chinquapin, chokecherry, cobnut, filbert, hazelnut, pistachio
* plant / berry: bilberry, blueberry, checkerberry, cowberry, cranberry, currant, feijoa, gooseberry, huckleberry, juneberry, lingonberry, serviceberry, spiceberry, teaberry, whortleberry
* vessel / measure: bottle, bucket, cask, flask, jug, keg, pail, tub
* cord / fabric: chenille, lace, laniard, lanyard, ripcord, whipcord, worsted
* taste_property / sensation: acridity, aroma, odor, odour, pungency
* communication / noise: clamor, hiss, howl, roar, roaring, screaming, screech, screeching, shriek, sigh, splutter, sputter, whisper</Paragraph> </Section> </Section> <Section position="5" start_page="16" end_page="17" type="metho"> <SectionTitle> 4 Application </SectionTitle> <Paragraph position="0"> Systematic polysemous classes obtained in this way can be used as filters on sense disambiguation in a variety of applications in which a coarse grained sense assignment will suffice in many cases, but where an option of further specification exists. For instance, in information retrieval it will not always be necessary to distinguish between the two interpretations of &quot;baseball, basketball, football, ...&quot;.2 Users looking for information on a baseball game may also be interested in baseballs. On the other hand, a user may be interested specifically in buying a new baseball (the ball) and does not wish to be flooded with irrelevant information on baseball games. In this case, the underspecified ball / game sense needs to be further specified into the ball sense only. Similarly, it will not always be necessary to distinguish exactly between the vessel interpretation of &quot;bottle, bucket, cask, ...&quot; and the measure interpretation, or between the communication interpretation of &quot;clamor, hiss, roar, ...&quot; and the noise interpretation.</Paragraph>
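<Paragraph> As an illustration of how such classes can act as filters, here is a small hypothetical Python sketch of query expansion over underspecified senses; the class data and function names are invented for illustration and do not describe the actual MIETTA or OLIVE modules.

# Hypothetical CoreLex-style classes from Section 3.3; each label names
# a systematic polysemous class whose members share a sense distribution.
CLASSES = {
    ("ball", "game"): {"baseball", "basketball", "football", "handball", "volleyball"},
    ("vessel", "measure"): {"bottle", "bucket", "cask", "flask", "jug", "keg", "pail", "tub"},
    ("communication", "noise"): {"clamor", "hiss", "howl", "roar", "shriek", "whisper"},
}

def expand_query(term, specify=None):
    """Expand a query term to its systematic polysemous class.

    With specify=None the underspecified sense is kept (both readings
    stay relevant); passing one of the class's basic types restricts
    the query to that reading on demand.
    """
    for types, members in CLASSES.items():
        if term in members:
            senses = types if specify is None else (specify,)
            return {"terms": members, "senses": senses}
    return {"terms": {term}, "senses": ()}

# Underspecified: a search for "baseball" covers both ball and game.
print(expand_query("baseball"))
# Further specified: only the ball reading, e.g. for a shopping query.
print(expand_query("baseball", specify="ball"))

In a retrieval setting the returned sense labels could then be matched against document annotations, so that full disambiguation happens only when the user actually asks for it.</Paragraph>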
<Paragraph position="1"> Currently, a query expansion module based on the approach described here is under development as part of the prototype systems of two EU funded projects: MIETTA (a cross-lingual search engine in the tourism domain - Buitelaar et al. 1998) and OLIVE (a cross-lingual video retrieval system).</Paragraph> <Paragraph position="2"> In shallow processing applications such as semantic pre-processing for document categorization, it will likewise be sufficient to use an underspecified sense instead of needlessly disambiguating between senses that are roughly equal in their relevance to a certain document category. Similarly, in shallow syntactic processing tasks, like statistical disambiguation of PP-attachment, the use of underspecified senses may be preferable, as shown in experiments by (Krymolowski and Roth 1998).</Paragraph> <Paragraph position="3"> 2 Compare (Schütze 1997) for a similar but purely statistical approach to underspecification in lexical semantic processing and its use in machine learning and information retrieval.</Paragraph> <Paragraph position="4"> In order to train systems to accurately perform syntactic analysis on the basis of semantic classes, semantically annotated corpora are needed. This is another area of application of the research described here.</Paragraph> <Paragraph position="5"> CoreLex clusters can be considered by annotators as alternatives to WordNet or GermaNet synsets if they are not able to choose between the senses given and instead prefer an underspecified sense. This approach is currently being tested, in cooperation with the GermaNet group of the University of Tübingen, in a preliminary project on semantic annotation of German newspaper text.</Paragraph> </Section> <Section position="6" start_page="17" end_page="17" type="concl"> <SectionTitle> Conclusion </SectionTitle> <Paragraph position="0"> We presented a new algorithm for generating systematic polysemous classes from existing resources like WordNet and similar semantic databases. Results were discussed for classes of English nouns as generated from WordNet. With a threshold of 75% similarity between nouns, 1341 classes could be found, covering 3336 nouns. Not discussed were similar experiments for verbs and adjectives, both in English and German. The resulting classes can be used as filters on incremental sense disambiguation in various applications in which coarse grained (underspecified) senses are preferred, but from which more fine grained senses can be derived on demand.</Paragraph> </Section> </Paper>