<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-1032">
  <Title>Frequency Estimates for Statistical Word Similarity Measures</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Many different statistical tests have been proposed to measure the strength of word similarity or word association in natural language texts (Dunning, 1993; Church and Hanks, 1990; Dagan et al., 1999). These tests attempt to measure dependence between words by using statistics taken from a large corpus. In this context, a key assumption is that similarity between words is a consequence of word co-occurrence, or that the closeness of the words in text is indicative of some kind of relationship between them, such as synonymy or antonymy.</Paragraph>
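One widely used association score from this family is pointwise mutual information (Church and Hanks, 1990), which compares the observed joint probability of two words with the probability expected under independence. A minimal sketch, with invented counts for illustration only:

```python
import math

def pmi(count_xy, count_x, count_y, n):
    """Pointwise mutual information (Church and Hanks, 1990):
    log2( P(x, y) / (P(x) * P(y)) ), with probabilities estimated
    as maximum-likelihood ratios of raw corpus counts."""
    p_xy = count_xy / n
    p_x = count_x / n
    p_y = count_y / n
    return math.log2(p_xy / (p_x * p_y))

# Illustrative (made-up) counts: in a corpus of 1,000,000 windows,
# word x appears 1,600 times, word y 900 times, and they co-occur
# in 40 windows. A positive score indicates positive association.
print(round(pmi(40, 1600, 900, 1_000_000), 2))
```

A score near zero would suggest the pair co-occurs about as often as chance predicts; the independence assumption rarely holds in real text, which is exactly why such scores are informative for comparing word pairs.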
    <Paragraph position="1"> Although word sequences in natural language are unlikely to be independent, these statistical tests provide quantitative information that can be used to compare pairs of co-occurring words. Also, despite the fact that word co-occurrence is a simple idea, there are a variety of ways to estimate word co-occurrence frequencies from text. Two words can appear close to each other in the same document, passage, paragraph, sentence or fixed-size window. The boundaries for determining co-occurrence will affect the estimates and as a consequence the word similarity measures.</Paragraph>
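The effect of the boundary choice can be seen in a small sketch: the same token stream yields different co-occurrence counts under different window sizes. The windowing scheme and toy sentence below are illustrative, not taken from the paper:

```python
from collections import Counter

def cooccurrence_counts(tokens, window):
    """Count unordered word pairs that occur at most `window` tokens
    apart. Using sentence, paragraph, or document boundaries instead
    of a fixed-size window would yield different counts."""
    counts = Counter()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            counts[tuple(sorted((w, tokens[j])))] += 1
    return counts

tokens = "the cat sat on the mat".split()
small = cooccurrence_counts(tokens, window=1)   # adjacent words only
large = cooccurrence_counts(tokens, window=4)   # wider window
# ("mat", "sat") co-occur under the wide window but not the narrow one.
print(small[("cat", "sat")], large[("mat", "sat")])
```

Since the similarity measures are computed directly from such counts, every choice made here propagates into the final scores.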
    <Paragraph position="2"> Statistical word similarity measures play an important role in information retrieval and in many other natural language applications, such as the automatic creation of thesauri (Grefenstette, 1993; Li and Abe, 1998; Lin, 1998) and word sense disambiguation (Yarowsky, 1992; Li and Abe, 1998). Pantel and Lin (2002) use word similarity to create groups of related words, in order to discover word senses directly from text. More recently, Tan et al. (2002) provided an analysis of different measures of independence in the context of association rules.</Paragraph>
    <Paragraph position="3"> Word similarity is also used in language modeling applications. Rosenfeld (1996) uses word similarity as a constraint in a maximum entropy model, reducing perplexity on a test set by 23%. Brown et al. (1992) use a word similarity measure for language modeling in an interpolated model, grouping similar words into classes. Dagan et al. (1999) use word similarity to assign probabilities to unseen bigrams from similar bigrams, reducing perplexity by up to 20% on held-out data.</Paragraph>
    <Paragraph position="4"> In information retrieval, word similarity can be used to identify terms for pseudo-relevance feedback (Harman, 1992; Buckley et al., 1995; Xu and Croft, 2000; Vechtomova and Robertson, 2000). Xu and Croft (2000) expand queries under a pseudo-relevance feedback model using similar words from the retrieved documents, improving effectiveness by more than 20% in 11-point average precision.</Paragraph>
    <Paragraph position="5"> Landauer and Dumais (1997) applied word similarity measures to answer TOEFL (Test Of English as a Foreign Language) synonym questions using Latent Semantic Analysis. Turney (2001) performed an evaluation of a specific word similarity measure using the same TOEFL questions and compared the results with those obtained by Landauer and Dumais.</Paragraph>
    <Paragraph position="9"> In our investigation of frequency estimates for word similarity measures, we compare the results of several different measures and frequency estimates on human-oriented language tests. Our investigation is based in part on the questions used by Landauer and Dumais, and by Turney. An example of such a test is the determination of the best synonym in a set of alternatives A = {a_1, a_2, ..., a_n} for a specific target word w appearing in a context C = {c_1, c_2, ..., c_k}, as shown in figure 1. Ideally, the context can provide support for choosing the best alternative for each question. We also investigate questions where no context is available, as shown in figure 2. These questions provide an easy way to assess the performance of the measures and of the co-occurrence frequency estimation methods used to compute them.</Paragraph>
    <Paragraph position="12"> Although word similarity has been used in many different applications, to the best of our knowledge, ours is the first comparative investigation of the impact of co-occurrence frequency estimation on the performance of word similarity measures. In this paper, we provide a comprehensive study of some of the most widely used similarity measures with frequency estimates taken from a terabyte-sized corpus of Web data, both in the presence and in the absence of context. In addition, we investigate frequency estimates for co-occurrence that are based both on documents and on a variety of different window sizes, and examine the impact of the corpus size on the frequency estimates. For questions where context is available, we also investigate the effect of adding more words from the context.</Paragraph>
    <Paragraph position="13"> The remainder of this paper is organized as follows: In section 2 we briefly introduce some of the most commonly used methods for measuring word similarity.</Paragraph>
    <Paragraph position="14"> In section 3 we present methods to assess word co-occurrence frequencies. Section 4 presents our experimental evaluation, which is followed by a discussion of the results in section 5.</Paragraph>
  </Section>
</Paper>