File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/97/w97-0803_intro.xml
Size: 3,770 bytes
Last Modified: 2025-10-06 14:06:28
<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0803"> <Title>Extending a thesaurus by classifying words</Title> <Section position="4" start_page="0" end_page="16" type="intro"> <SectionTitle>Introduction</SectionTitle>
<Paragraph position="0"> Roget's International Thesaurus \[Chapman, 1984\] and WordNet \[Miller et al., 1993\] are typical English thesauri which have been widely used in past NLP research \[Resnik, 1992; Yarowsky, 1992\]. They are handcrafted, machine-readable and have fairly broad coverage. However, since these thesauri were originally compiled for human use, they are not always suitable for computer-based natural language processing. The limitations of handcrafted thesauri can be summarized as follows \[Hatzivassiloglou and McKeown, 1993; Uramoto, 1996\]. The vocabulary size of typical handcrafted thesauri ranges from 50,000 to 100,000 words, including general words in broad domains. From the viewpoint of NLP systems dealing with a particular domain, however, these thesauri include many unnecessary (general) words and do not include necessary domain-specific words.</Paragraph>
<Paragraph position="1"> The second problem with handcrafted thesauri is that their classification is based on the intuition of lexicographers, and their classification criteria are not always clear. For the purposes of NLP systems, their classification of words is sometimes too coarse and does not provide sufficient distinction between words, or is sometimes unnecessarily detailed.</Paragraph>
<Paragraph position="2"> Lastly, building thesauri by hand requires significant amounts of time and effort even for restricted domains.</Paragraph>
<Paragraph position="3"> Furthermore, this effort is repeated when a system is ported to another domain.</Paragraph>
<Paragraph position="4"> This criticism leads us to automatic approaches for building thesauri from large corpora \[Hirschman et al., 1975; Hindle, 1990; Hatzivassiloglou and McKeown, 1993; Pereira et al., 1993; Tokunaga et al., 1995; Ushioda, 1996\]. Past attempts have basically taken the following steps \[Charniak, 1993\].</Paragraph>
<Paragraph position="5">
(1) extract word co-occurrences
(2) define similarities (distances) between words on the basis of co-occurrences
(3) cluster words on the basis of similarities
(An illustrative sketch of these steps is given below, after this section.) The most crucial part of this approach is gathering word co-occurrence data. Co-occurrences are usually gathered on the basis of certain relations such as predicate-argument, modifier-modified, adjacency, or a mixture of these. However, it is very difficult to gather sufficient co-occurrences to calculate similarities reliably \[Resnik, 1992; Basili et al., 1992\]. It is sometimes impractical to build a large thesaurus from scratch based only on co-occurrence data.</Paragraph>
<Paragraph position="6"> Based on this observation, a third approach has been proposed, namely, combining linguistic knowledge and co-occurrence data \[Resnik, 1992; Uramoto, 1996\]. This approach aims at compensating for the sparseness of co-occurrence data by using existing linguistic knowledge, such as WordNet. This paper follows this line of research and proposes a method to extend an existing thesaurus by classifying new words in terms of that thesaurus. In other words, the proposed method identifies appropriate word classes of the thesaurus for a new word which is not included in the thesaurus. This search is guided by the probability that a word belongs to a given word class. The probability is calculated from word co-occurrences.
As such, this method could also suffer from the data sparseness problem. As Resnik pointed out, however, using the thesaurus structure (classes) can remedy this problem \[Resnik, 1992\].</Paragraph> </Section> </Paper>
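The three-step procedure listed above can be made concrete with a small, self-contained sketch. The following Python fragment is only an illustration of the general approach and not the method of any of the cited works: the fixed-window co-occurrence extraction, cosine similarity, and threshold-based single-link clustering are assumptions chosen for brevity.

```python
from collections import Counter, defaultdict
from math import sqrt

def extract_cooccurrences(sentences, window=2):
    """Step (1): count co-occurrences of each word within a fixed window."""
    cooc = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            context = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
            for c in context:
                cooc[word][c] += 1
    return cooc

def similarity(cooc, w1, w2):
    """Step (2): cosine similarity between two co-occurrence vectors."""
    v1, v2 = cooc[w1], cooc[w2]
    dot = sum(v1[c] * v2[c] for c in v1)
    norm = sqrt(sum(x * x for x in v1.values())) * sqrt(sum(x * x for x in v2.values()))
    return dot / norm if norm else 0.0

def cluster(words, cooc, threshold=0.2):
    """Step (3): naive single-link agglomeration using the similarity above."""
    clusters = [[w] for w in words]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(similarity(cooc, a, b) >= threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters
```

For example, cluster(vocabulary, extract_cooccurrences(corpus)) groups words whose co-occurrence profiles overlap sufficiently; real systems would typically gather co-occurrences over syntactic relations (predicate-argument, modifier-modified) rather than a surface window, as the text notes.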
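The classification step of the proposed method, assigning a new word to the thesaurus class with the highest estimated probability, can be sketched in the same spirit. The naive-Bayes-style estimator and the add-alpha smoothing below are illustrative assumptions and not necessarily the paper's actual model; cooc is the co-occurrence table from the previous sketch, and thesaurus_classes is assumed to map a class label to its (non-empty) set of member words.

```python
from collections import Counter
from math import log

def class_log_prob(word_cooc, class_members, cooc, alpha=1.0):
    """Score log P(class | word) up to a constant: a crude size-based prior
    plus a smoothed likelihood of the new word's co-occurrence profile under
    the aggregated profile of the class members (illustrative assumption)."""
    class_profile = Counter()
    for member in class_members:
        class_profile.update(cooc.get(member, {}))
    total = sum(class_profile.values())
    vocab = len(set(class_profile) | set(word_cooc)) or 1
    score = log(len(class_members))  # prior proportional to class size
    for context, freq in word_cooc.items():
        p = (class_profile[context] + alpha) / (total + alpha * vocab)
        score += freq * log(p)
    return score

def classify(new_word, cooc, thesaurus_classes):
    """Return the thesaurus class that maximizes the estimated probability."""
    word_cooc = cooc.get(new_word, {})
    return max(thesaurus_classes,
               key=lambda c: class_log_prob(word_cooc, thesaurus_classes[c], cooc))
```

This also makes concrete why the thesaurus structure helps with data sparseness: the new word is compared against co-occurrences pooled over an entire class rather than against any single word.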