<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1808"> <Title>Discovering Synonyms and Other Related Words</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Finding related words among the words in a document collection can be seen as a clustering problem, where we expect the words in a cluster to be closely related to the same sense or to be distributional substitutes or proxies for one another. A number of language-technology tasks can benefit from such word clusters, e.g. document classification, language modelling, prepositional phrase attachment resolution, conjunction scope identification, word sense disambiguation, word sense separation, automatic thesaurus generation, information retrieval, anaphor resolution, text simplification, topic identification and spelling correction (Weeds, 2003).</Paragraph> <Paragraph position="1"> At present, synonyms and other related words are available in manually constructed ontologies, such as synonym dictionaries, thesauri, translation dictionaries and terminologies. Constructing such ontologies manually is time-consuming even for a single domain. The World Wide Web contains documents on many topics and in many languages that could benefit from an ontology, and for many of them some degree of automation will eventually be needed.</Paragraph> <Paragraph position="2"> Humans often infer the meaning of an unknown word from its context. Let's look at a less well-known word such as blopping and look it up on the Web. Some of the hits are: "Blopping through some of my faves", i.e. leafing through favourite web links; "A blop module emits strange electronic blopping noises", i.e. an electronic sound; and "The volcano looked like something off the cover of a Tolkien novel - perfectly conical, billowing smoke and blopping out chunks of bright orange lava", i.e. spluttering liquid. 
At first, all of these uses seem different and perhaps equally important. When looking at further links, we get the intuition that the first instance is perhaps a spurious creative metonym, whereas the other two can be regarded as more or less conventional and represent two distinct senses of blopping. However, the meaning of all three seems to be related to a sound, which is either clicking or spluttering in nature.</Paragraph> <Paragraph position="3"> The intuition is that words occurring in the same or similar contexts tend to convey similar meaning. This is known as the Distributional Hypothesis (Harris, 1968). There are many approaches to computing semantic similarity between words based on their distribution in a corpus. For a general overview of similarity measures, see (Manning and Schütze, 1999); for some recent and extensive overviews and evaluations of similarity measures for, inter alia, automatic thesaurus construction, see (Weeds, 2003; Curran, 2003; Lee, 2001; Dagan et al., 1999). These studies show that the information radius and the α-skew distance are among the best measures for finding distributional proxies for words.</Paragraph> <Paragraph position="4"> If we assume that a word w is represented as a sum of its contexts and that we can calculate the similarities between such word representations, we get a list Lw of words with quantifications of how similar they are to w. Each similarity list Lw contains a mix of words related to the senses of the word w.</Paragraph> <Paragraph position="5"> If we wish to identify groups of synonyms and other related words in a list of similarity-rated words, we need to find clusters of similar words that are more similar to one another than they are to other words. For a review of general clustering algorithms, see (Jain et al., 1999), and for a recent evaluation of clustering algorithms for finding word categories, see (Pantel, 2003). 
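As a rough illustration of the two measures singled out above, the information radius (the Jensen-Shannon divergence) and the α-skew distance can be computed over simple bag-of-contexts distributions. The following is a minimal Python sketch under our own simplifying assumptions (plain context counts rather than the syntactic features used later; helper names are ours), not the paper's actual implementation:

```python
import math
from collections import Counter

def _kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits over a shared support."""
    return sum(p[w] * math.log(p[w] / q[w], 2) for w in p if p[w] > 0)

def information_radius(p, q):
    """Information radius (Jensen-Shannon divergence): the mean KL
    divergence of p and q to their midpoint distribution."""
    support = set(p) | set(q)
    p = {w: p.get(w, 0.0) for w in support}
    q = {w: q.get(w, 0.0) for w in support}
    m = {w: 0.5 * (p[w] + q[w]) for w in support}
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def alpha_skew(p, q, alpha=0.99):
    """α-skew distance (Lee, 2001): KL divergence of p from q smoothed
    with a small amount of p, which avoids division by zero."""
    support = set(p) | set(q)
    pd = {w: p.get(w, 0.0) for w in support}
    mix = {w: alpha * q.get(w, 0.0) + (1 - alpha) * pd[w] for w in support}
    return _kl(pd, mix)

def context_distribution(contexts):
    """Represent a word by the normalized counts of its context features."""
    counts = Counter(contexts)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}
```

The information radius is bounded (at most 1 bit) and symmetric, whereas the α-skew distance is asymmetric; both compare words by their context distributions rather than by their surface forms.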
Pantel (2003) shows that, among the standard algorithms, average-link and k-means clustering perform best at discovering meaningful word groups.</Paragraph> <Paragraph position="6"> The quality of the discovered clusters can be evaluated with three methods: measuring the internal coherence of the clusters, embedding the clusters in an application, or evaluating them against a manually generated answer key. The first method is generally used by the clustering algorithms themselves. The second method is especially relevant for applications that can deal with noisy clusters, and it avoids the need to generate answer keys specific to the word-clustering task. The third method requires a gold standard such as WordNet or some other ontological resource. For an overview of evaluation methodologies for word clustering, see (Weeds, 2003; Curran, 2003; Pantel, 2003).</Paragraph> <Paragraph position="8"> The contribution of this article is four-fold.</Paragraph> <Paragraph position="9"> The first contribution is to apply the information radius in a full dependency-syntactic feature space when calculating the similarities between words; previously, only a restricted set of dependency relations has been applied. The second contribution is a similarity recalculation during clustering, which we introduce as a fast approximation of clustering in the high-dimensional feature space, and whose effect on some standard clustering algorithms we study. The third contribution is a simple but efficient way to evaluate the synonym content of clusters by using translation dictionaries for several languages. Finally, we show that 69-79% of the words in the discovered clusters are useful for thesaurus construction.</Paragraph> <Paragraph position="10"> The rest of this article is organized as follows. Section 2 presents the corpus data and the feature extraction. Section 3 introduces the discovery methodology. 
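The average-link clustering that (Pantel, 2003) finds effective can be sketched as a greedy agglomerative procedure over a precomputed word-similarity matrix. The following toy Python version is illustrative only; the stopping threshold and the similarity matrix format are our assumptions, not the paper's configuration:

```python
def average_link_clusters(words, sim, threshold=0.2):
    """Greedy average-link (UPGMA) agglomerative clustering: repeatedly
    merge the two clusters with the highest average pairwise similarity,
    stopping when no pair exceeds the threshold.
    `sim` is a symmetric dict-of-dicts of word-word similarities."""
    clusters = [[w] for w in words]

    def avg_sim(a, b):
        # Average similarity over all cross-cluster word pairs.
        return sum(sim[x][y] for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = avg_sim(clusters[i], clusters[j])
                if best is None or s > best:
                    best, pair = s, (i, j)
        if best < threshold:
            break
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Averaging over all cross-cluster pairs makes the method less sensitive to single outlier similarities than single-link or complete-link clustering.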
Section 4 presents the evaluation methodology. In Section 5 we present the experiments and evaluate the results and their significance. Sections 6 and 7 contain the discussion and conclusion, respectively.</Paragraph> </Section> </Paper>