<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2127"> <Title>Automatic Retrieval and Clustering of Similar Words</Title> <Section position="2" start_page="0" end_page="768" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> The meaning of an unknown word can often be inferred from its context. Consider the following (slightly modified) example in (Nida, 1975, p.167): (1) A bottle of tezgüino is on the table. Everyone likes tezgüino.</Paragraph> <Paragraph position="1"> Tezgüino makes you drunk.</Paragraph> <Paragraph position="2"> We make tezgüino out of corn.</Paragraph> <Paragraph position="3"> The contexts in which the word tezgüino is used suggest that tezgüino may be a kind of alcoholic beverage made from corn mash.</Paragraph> <Paragraph position="4"> Bootstrapping semantics from text is one of the greatest challenges in natural language learning. It has been argued that similarity plays an important role in word acquisition (Gentner, 1982). Identifying similar words is an initial step in learning the definition of a word. This paper presents a method for taking this first step. For example, given a corpus that includes the sentences in (1), our goal is to be able to infer that tezgüino is similar to &quot;beer&quot;, &quot;wine&quot;, &quot;vodka&quot;, etc.</Paragraph> <Paragraph position="5"> In addition to the long-term goal of bootstrapping semantics from text, automatic identification of similar words has many immediate applications. The most obvious one is thesaurus construction. An automatically created thesaurus offers many advantages over manually constructed thesauri. Firstly, the terms can be corpus- or genre-specific. Manually constructed general-purpose dictionaries and thesauri include many usages that are very infrequent in a particular corpus or genre of documents. 
For example, one of the 8 senses of &quot;company&quot; in WordNet 1.5 is a &quot;visitor/visitant&quot;, which is a hyponym of &quot;person&quot;. This sense of the word practically never occurs in newspaper articles. However, its existence may prevent a co-reference recognizer from ruling out the possibility that personal pronouns refer to &quot;company&quot;. Secondly, certain word usages may be particular to a period of time and are unlikely to be captured by manually compiled lexicons. For example, among 274 occurrences of the word &quot;westerner&quot; in a 45 million word San Jose Mercury corpus, 55% of them refer to hostages. If one needs to search for hostage-related articles, &quot;westerner&quot; may well be a good search term.</Paragraph> <Paragraph position="6"> Another application of automatically extracted similar words is to help solve the problem of data sparseness in statistical natural language processing (Dagan et al., 1994; Essen and Steinbiss, 1992). When the frequency of a word does not warrant reliable maximum likelihood estimation, its probability can be computed as a weighted sum of the probabilities of words that are similar to it. It was shown in (Dagan et al., 1997) that a similarity-based smoothing method achieved much better results than back-off smoothing methods in word sense disambiguation. The remainder of the paper is organized as follows. The next section is concerned with similarities between words based on their distributional patterns. The similarity measure can then be used to create a thesaurus. In Section 3, we evaluate the constructed thesauri by computing the similarity between their entries and entries in manually created thesauri. Section 4 briefly discusses future work in clustering similar words. Finally, Section 5 reviews related work and summarizes our contributions.</Paragraph> </Section> </Paper>