<?xml version="1.0" standalone="yes"?>
<Paper uid="J98-1006">
<Title>Using Corpus Statistics and WordNet Relations for Sense Identification</Title>
<Section position="2" start_page="0" end_page="0" type="abstr">
<SectionTitle> 1. Introduction </SectionTitle>
<Paragraph position="0"> An impressive array of statistical methods has been developed for word sense identification. They range from dictionary-based approaches that rely on definitions (Véronis and Ide 1990; Wilks et al. 1993) to corpus-based approaches that use only word co-occurrence frequencies extracted from large textual corpora (Schütze 1995; Dagan and Itai 1994). We have drawn on these two traditions, using corpus-based co-occurrence and the lexical knowledge base that is embodied in the WordNet lexicon.</Paragraph>
<Paragraph position="1"> The two traditions complement each other. Corpus-based approaches have the advantage of being generally applicable to new texts, domains, and corpora without needing costly and perhaps error-prone parsing or semantic analysis. They require only training corpora in which the sense distinctions have been marked, but therein lies their weakness: obtaining training materials for statistical methods is costly and time-consuming--it is a &quot;knowledge acquisition bottleneck&quot; (Gale, Church, and Yarowsky 1992a). To open this bottleneck, we use WordNet's lexical relations to locate unsupervised training examples.</Paragraph>
<Paragraph position="2"> Section 2 describes a statistical classifier, TLC (Topical/Local Classifier), that uses topical context (the open-class words that co-occur with a particular sense), local context (the open- and closed-class items that occur within a small window around a word), or a combination of the two. The results of combining the two types of context to disambiguate a noun (line), a verb (serve), and an adjective (hard) are presented.
The following questions are discussed: When is topical context superior to local context (and vice versa)? Is their combination superior to either type alone? Do the answers to these questions depend on the size of the training set? Do they depend on the syntactic category of the target?
* Division of Cognitive and Instructional Science, Princeton, NJ 08541; e-mail: cleacock@ets.org. The work reported here was done while the author was at Princeton University.</Paragraph>
<Paragraph position="3"> † Department of Psychology, 695 Park Avenue, New York, NY 10021; e-mail: mschc@cunyvm.cuny.edu
Cognitive Science Laboratory, 221 Nassau Street, Princeton, NJ 08542; e-mail: geo@clarity.princeton.edu
© 1998 Association for Computational Linguistics (Computational Linguistics, Volume 24, Number 1)
Manually tagged training materials were used in the development of TLC and in the experiments in Section 2. The Cognitive Science Laboratory at Princeton University, with support from NSF-ARPA, is producing textual corpora that can be used in developing and evaluating automatic methods for disambiguation. Examples of the different meanings of one thousand common, polysemous, open-class English words are being manually tagged. The results of this effort will be a useful resource for training statistical classifiers, but what about the next thousand polysemous words, and the next? To identify senses of these words, it will be necessary to learn how to harvest training examples automatically.</Paragraph>
<Paragraph position="4"> Section 3 describes WordNet's lexical relations and the role that monosemous &quot;relatives&quot; of polysemous words can play in creating unsupervised training materials. TLC is trained with automatically extracted examples, and its performance is compared with that obtained from manually tagged training materials.</Paragraph>
</Section>
</Paper>
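The topical/local distinction drawn in Section 2 can be made concrete with a minimal sketch. This is not TLC itself: it is an illustrative naive Bayes sense classifier over the two feature types, with a hypothetical stopword list and toy training sentences for the noun *line* (all data below is invented for illustration):

```python
from collections import Counter
import math

# Hypothetical closed-class word list (a stand-in for a real stopword inventory).
STOPWORDS = {"the", "a", "of", "in", "on", "to", "and"}

def topical_features(tokens, target_index):
    # Topical context: open-class words co-occurring anywhere in the passage.
    return {("topic", w) for i, w in enumerate(tokens)
            if i != target_index and w not in STOPWORDS}

def local_features(tokens, target_index, window=2):
    # Local context: open- AND closed-class items in a small window
    # around the target, keyed by their position relative to it.
    feats = set()
    for offset in range(-window, window + 1):
        i = target_index + offset
        if offset != 0 and 0 <= i < len(tokens):
            feats.add(("local", offset, tokens[i]))
    return feats

class NaiveBayesWSD:
    """Toy naive Bayes sense classifier over topical + local features."""

    def __init__(self):
        self.sense_counts = Counter()
        self.feat_counts = {}

    def train(self, examples):
        # examples: list of (tokens, target_index, sense) triples.
        for tokens, idx, sense in examples:
            self.sense_counts[sense] += 1
            fc = self.feat_counts.setdefault(sense, Counter())
            for f in topical_features(tokens, idx) | local_features(tokens, idx):
                fc[f] += 1

    def classify(self, tokens, idx):
        feats = topical_features(tokens, idx) | local_features(tokens, idx)
        total = sum(self.sense_counts.values())
        best, best_score = None, float("-inf")
        for sense, n in self.sense_counts.items():
            score = math.log(n / total)  # log prior
            fc = self.feat_counts[sense]
            for f in feats:
                # Add-one smoothing over feature counts.
                score += math.log((fc[f] + 1) / (n + 2))
            if score > best_score:
                best, best_score = sense, score
        return best

# Toy training set: two senses of "line" (cord vs. queue).
clf = NaiveBayesWSD()
clf.train([
    (["she", "tied", "the", "line", "to", "the", "hook"], 3, "cord"),
    (["fishing", "line", "snapped"], 1, "cord"),
    (["we", "waited", "in", "line", "for", "tickets"], 3, "queue"),
    (["a", "long", "line", "at", "the", "bank"], 2, "queue"),
])
```

Restricting the feature extractors to only `topical_features` or only `local_features` gives the two single-context conditions the questions above compare.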
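The monosemous-relative idea of Section 3 can likewise be sketched in a few lines. The synset inventory below is a hypothetical toy stand-in for WordNet (in practice one would query WordNet itself): a relative of a polysemous target is useful as a source of unsupervised training examples for one of the target's senses when the relative belongs to only one synset, i.e., is monosemous.

```python
# Hypothetical mini-lexicon standing in for WordNet: each synset maps to
# the lemmas that can express that sense. "suit" is polysemous here,
# while "lawsuit" and "business_suit" are monosemous.
SYNSETS = {
    "suit.n.01": ["suit", "lawsuit"],        # a legal proceeding
    "suit.n.02": ["suit", "business_suit"],  # a set of garments
    "suit.n.03": ["suit"],                   # a playing-card suit
}

def senses_of(word):
    """All synsets in which the word appears as a lemma."""
    return [sid for sid, lemmas in SYNSETS.items() if word in lemmas]

def is_monosemous(word):
    return len(senses_of(word)) == 1

def monosemous_relatives(target):
    """For each sense of a polysemous target, collect synset co-members
    that are themselves monosemous. Corpus occurrences of those relatives
    can then serve as unsupervised training examples for that sense."""
    relatives = {}
    for sid in senses_of(target):
        mono = [w for w in SYNSETS[sid] if w != target and is_monosemous(w)]
        if mono:
            relatives[sid] = mono
    return relatives
```

With this toy inventory, `monosemous_relatives("suit")` pairs the legal sense with `lawsuit` and the garment sense with `business_suit`; sentences containing those unambiguous relatives can be harvested as training data for the corresponding senses of *suit* without manual tagging.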