<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-1032">
  <Title>Frequency Estimates for Statistical Word Similarity Measures</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Measuring Word Similarity
</SectionTitle>
    <Paragraph position="0"> The notion for co-occurrence of two words can depicted by a contingency table, as shown in table 1. Each dimension represents a random discrete variable a2a33a32 with range</Paragraph>
    <Paragraph position="2"> text window or document). Each cell in the table represent the joint frequency a40a15a41a43a42a45a44a41a47a46 a7a8a48a50a49a52a51a54a53a28a55a57a56a59a58 a39 a11a61a60a63a62 , where a48a50a49a52a51a54a53 is the maximum number of co-occurrences. Under an independence assumption, the values of the cells in the contingency table are calculated using the probabilities in table 2. The methods described below perform different measures of how distant observed values are from expected values under an independence assumption. Tan et al. (2002) indicate that the difference between the methods arise from non-uniform marginals and how the methods react to this non-uniformity.</Paragraph>
    <Paragraph position="4"/>
    <Paragraph position="6"> Occasionally, a context a0 is available and can provide support for the co-occurrence and alternative methods can be used to exploit this context. The procedures to estimate a56a59a58 a21 a9a12a11 a21 a14a75a62 , as well a56a59a58 a21 a32 a62 , will be described in section 3.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Similarity between two words
</SectionTitle>
      <Paragraph position="0"> We first present methods to measure the similarity between two words a21 a9 and a21 a14 when no context is available.</Paragraph>
      <Paragraph position="1">  This measure for word similarity was first used in this context by Church and Hanks (1990). The measure is given by equation 1 and is called Pointwise Mutual Information. It is a straightforward transformation of the independence assumption (on a specific point), a56a59a58 a21 a9 a11 a21 a14 a62a76a7</Paragraph>
      <Paragraph position="3"> a14 a62 , into a ratio. Positive values indicate that words occur together more than would be expected under an independence assumption. Negative values indicate that one word tends to appear only when the other does not. Values close to zero indicate independence.</Paragraph>
      <Paragraph position="5"> This test is directly derived from observed and expected values in the contingency tables.</Paragraph>
      <Paragraph position="6">  statistic determines a specific way to calculate the difference between values expected under independence and observed ones, as depicted in equation 2. The values a40 a53 a44a117 correspond to the observed frequency estimates. null  The likelihood ratio test provides an alternative to check two simple hypotheses based on parameters of a distribution. Dunning (1993) used a likelihood ratio to test word similarity under the assumption that the words in text have a binomial distribution.</Paragraph>
      <Paragraph position="7"> Two hypotheses used are: H1:a56a59a58 a21 a14a119a118a21 a9a22a62 a7</Paragraph>
      <Paragraph position="9"> H2: a56a59a58 a21 a14 a118a21 a9 a62a121a120a7a122a56a59a58 a21 a14 a118a37 a21 a9 a62 (i.e. not independent).</Paragraph>
      <Paragraph position="10"> These two conditionals are used as sample in the likelihood function a123 a58a84a56a59a58 a21 a14 a118a21 a9 a62a54a11a18a56a59a58 a21 a14 a118a37 a21 a9 a62a100a124a92a125a119a62 , where a125 in this particular case represents the parameter of the binomial distribution a126 a58a128a127a95a11a18a129a65a124a92a125a119a62 . Under hypothesis H1, a56a59a58 a21 a14a130a118a21 a9a25a62a73a7a131a56a59a58 a21 a14a130a118a37 a21 a9a25a62a73a7a133a132 , and for H2,</Paragraph>
      <Paragraph position="12"> Equation 3 represents the likelihood ratio. Asymptotically, a143a28a144a17a145a84a146a36a147a31a148 is a104  distributed.</Paragraph>
      <Paragraph position="13">  This measure corresponds to the expected value of two random variables using the same equation as PMI. Average mutual information was used as a word similarity measure by Rosenfeld (1996) and is given by equation 4.</Paragraph>
      <Paragraph position="15"/>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Context supported similarity
</SectionTitle>
      <Paragraph position="0"> Similarity between two words can also be inferred from a context (if given). Given a context</Paragraph>
      <Paragraph position="2"> co-occurrence with words in context are similar.</Paragraph>
      <Paragraph position="3">  The PMI between each context word a21a10a23 and a21 a32 form a vector. The elements in the vector represents the similarity weights of a21a28a23 and a21 a32 . The cosine value between the two vectors corresponding to a21</Paragraph>
      <Paragraph position="5"> similarity between the two words in the specified context, as depicted in equation 5.</Paragraph>
      <Paragraph position="7"> Values closer to one indicate more similarity whereas values close to zero represent less similarity. Lesk (1969) was one of the first to apply the cosine measure to word similarity, but did not use pointwise mutual information to compute the weights. Pantel (2002) used the cosine of pointwise mutual information to uncover word sense from text.</Paragraph>
      <Paragraph position="8">  In this method the conditional probability of each word</Paragraph>
      <Paragraph position="10"> lated distance between the conditionals for all words in context represents the similarity between the two words, as shown in equation 6. This method was proposed as an alternative word similarity measure in language modeling to overcome zero-frequency problems of bigrams (Dagan et al., 1999).</Paragraph>
      <Paragraph position="12"> In this measure, a smaller value indicates a greater similarity. null  The conditional probabilities between each word in the context and the two words a21 a9 and a21 a14 are used to calculate the mutual information of the conditionals (equation 7). This method was also used in Dagan et.</Paragraph>
      <Paragraph position="14"> This is an alternative to the Mutual Information formula (equation 8). It helps to avoid zero frequency problem by averaging the two distributions and also provides a symmetric measure (AMIC is not symmetric). This method was also used in Dagan et. al. (1999).</Paragraph>
      <Paragraph position="15">  Turney (2001) proposes a different formula for Point-wise Mutual Information when context is available, as depicted in equation 9. The context is represented by a0 a23 , which is any subset of the context a0 . In fact, Turney argued that bigger a0 a23 sets are worse because they narrow the estimate and as consequence can be affected by noise. As a consequence, Turney used only one word a179 a32 from the context, discarding the remaining words. The chosen word was the one that has biggest pointwise information with a21 a9 . Moreover, a21 a9 (a1a3a2 ) is fixed when the method is used to find the best a4 a32 for a1a3a2 , so a56a59a58 a21 a9 a11 a0 a23 a62 is also fixed and can be ignored, which transforms the equation into the conditional a56a59a58 a21 a9 a118a21 a14 a11 a0 a62 . It is interesting to note that the equation a56a59a58 a21 a9a17a118a21 a14a17a11 a0 a62 is not the traditional n-gram model since no ordering is imposed on the words and also due to the fact that the words in this formula can be separated from one another by other words.</Paragraph>
      <Paragraph position="17"> Many other measures for word similarities exists. Tan et al. (2002) present a comparative study with 21 different measures. Lillian (2001) proposes a new word similarity measure in the context of language modeling, performing an comparative evaluation with other 7 similarity measures. null</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Co-occurrence Estimates
</SectionTitle>
    <Paragraph position="0"> We now discuss some alternatives to estimate word co-occurrence frequencies from an available corpus. All probabilities mentioned in previous section can be estimated from these frequencies. We describe two different approaches: a window-oriented approach and a document-oriented approach.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Window-oriented approach
</SectionTitle>
      <Paragraph position="0"> Let a40 a41 a42 be the frequency of a21 a32 and the co-occurrence frequency of a21 a9 and a21 a14 be denoted by a40 a41a180a64a38a44a41a47a66 . Let a48 be the size of the corpus in words. In the window-oriented approach, individual word frequencies are the corpus frequencies. The maximum likelihood estimate (MLE) for</Paragraph>
      <Paragraph position="2"> The joint frequency a40 a41a65a64a18a44a41a43a66 is estimated by the number of windows where the two words co-occur. The window size may vary, Church and Hanks (1990) used windows of size 2 and 5. Brown et al. (1992) used windows containing 1001 words. Dunning (1993) also used windows of size 2, which corresponds to word bigrams. Let the number of windows of size a182 in the corpus be a48 a41a180a183 . Recall that a48a184a49a90a51a100a53 is the maximum number of co-occurrences, i.e. a48a50a49a52a51a54a53a185a7a107a48 a41a180a183 in the windows-oriented approach. The MLE of the co-occurrence probability is given by</Paragraph>
      <Paragraph position="4"> In most common case, windows are overlapping, and in this case a48 a41a180a183 a7a69a48 a143a186a182a12a187a86a188 . The total frequency of windows for co-occurrence should be adjusted to reflect the multiple counts of the same co-occurrence. One method to account for overlap is to divide the total count of windows by a21a20a39 a127a74a189 a146a75a21 a190a22a39a45a191a130a192a20a143a193a188 . This method also reinforces closer co-occurrences by assigning them a larger weight.</Paragraph>
      <Paragraph position="5"> Smoothing techniques can be applied to address the zero-frequency problem, or alternatively, the window size can be increased, which also increases the chance of cooccurrence. To avoid inconsistency, windows do not to cross document boundaries.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Document-oriented approach
</SectionTitle>
      <Paragraph position="0"> In information retrieval, one commonly uses document statistics rather than individual word statistics. In an document-oriented approach, the frequency of a word a21 a32 is denoted by a189 a40 a41 a42 and corresponds to the number of documents in which the word appears, regardless of how frequently it occurs in each document. The number of documents is denoted by a194 . The MLE for an individual word in document oriented approach is a56a59a58 a21 a32 a62a72a7a8a189 a40 a41 a42a92a181a12a194 . The co-occurrence frequency of two words a21 a9 and  a14 , denoted by a189 a40a15a41 a64 a44a41 a66 , is the number of documents where the words co-occur. If we require only that the words co-occur in the same document, no distinction is made between distantly occurring words and adjacent words. This distortion can be reduced by imposing a maximal distance for co-occurrence, (i.e. a fixed-sized window), but the frequency will still be the number of documents where the two words co-occur within this distance. The MLE for the co-occurrence in this approach is a56a59a58 a21 a9a12a11 a21 a14a22a62a33a7a195a189 a40 a41a65a64a54a44a41a47a66 a181a15a194 , since a48 a49a52a51a54a53 a7 a194 in the document-oriented approach.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Syntax based approach
</SectionTitle>
      <Paragraph position="0"> An alternative to the Window and Document-oriented approach is to use syntactical information (Grefenstette, 1993). For this purpose, a Parser or Part-Of-Speech tagger must be applied to the text and only the interesting pairs of words in correct syntactical categories used. In this case, the fixed window can be superseded by the result of the syntax analysis or tagging process and the frequency of the pairs can be used directly. Alternatively, the number of documents that contain the pair can also be used. However, the nature of the language tests in this work make it impractical to be applied. First, the alternatives are not in a context, and as such can have more than one part-of-speech tag. Occasionally, it is possible to infer that the syntactic category of the alternatives from context of the target word a1a3a2 , if there is such a context . When the alternatives, or the target word a1a3a2 , are multiwords then the problem is harder, as depicted in the first example of figure 7. Also, both parsers and POS tagger make mistakes, thus introducing error. Finally, the size of the corpus used and its nature intensify the parser/POS</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> We evaluate the methods and frequency estimates using 3 test sets. The first test set is a set of TOEFL questions first used by Landauer and Dumais (1997) and also by Turney (2001). This test set contains 80 synonym questions and for each question one a1a3a2 and four alternative options (a118a4a184a118a31a7a121a196 ) are given. The other two test sets, which we will refer to as TS1 and TS2, are practice questions for the TOEFL. These two test sets also contain four alternatives options, a118a4a186a118a43a7a197a196 , and a1a10a2 is given in context a0 (within a sentence). TS1 has 50 questions and was also used by Turney (2001). TS2 has 60 questions extracted from a TOEFL practice guide (King and Stanley, 1989).</Paragraph>
    <Paragraph position="1"> For all test sets the answer to each question is known and unique. For comparison purposes, we also use TS1 and TS2 with no context.</Paragraph>
    <Paragraph position="2"> For the three test sets, TOEFL, TS1 and TS2 without context, we applied the word and document-oriented frequency estimates presented. We investigated a variety of window sizes, varying the window size from 2 to 256 by powers of 2.</Paragraph>
    <Paragraph position="3"> The labels used in figures 3, 5, 6, 8, 9, 10, 12 are composed from a keyword indicating the frequency estimate used (W-window oriented; and DR-document retrieval oriented) and a keyword indicating the word similarity measure. For no-context measures the keywords are: PMI-Pointwise Mutual Information; CHI-Chi-Squared; MI-Average mutual information; and LL-Log-likelihood.</Paragraph>
    <Paragraph position="4"> For the measures with context: CP-Cosine pointwise mutual information; L1-L1 norm; AMIC-Average Mutual Information in the presence of context; IRAD-Jensen-Shannon Divergence; and PMIC-a127 - Pointwise Mutual Information with a127 words of context.</Paragraph>
    <Paragraph position="5"> For TS1 and TS2 with context, we also investigate Turney's hypothesis that the outcome of adding more words from a0 is negative, using DR-PMIC. The result of this experiment is shown in figures 10 and 12 for TS1 and TS2 respectively.</Paragraph>
    <Paragraph position="6"> It is important to note that in some of the questions, a1a3a2 or one or more of the a4 a32 's are multi-word strings. For these questions, we assume that the strings may be treated as collocations and use them &amp;quot;as is&amp;quot;, adjusting the size of the windows by the collocation size when applicable. null The corpus used for the experiments is a terabyte of Web data crawled from the general web in 2001. In order to balance the contents of the corpus, a breadth-first order search was used from a initial seed set of URLs representing the home page of 2392 universities and other educational organizations (Clarke et al., 2002). No duplicate pages are included in the collection and the crawler also did not allow a large number of pages from the same site to be downloaded simultaneously. Overall, the collection contains 53 billion words and 77 million documents.</Paragraph>
    <Paragraph position="7"> A key characteristic of this corpus is that it consists of HTML files. These files have a focus on the presentation, and not necessarily on the style of writing. Parsing or tagging these files can be a hard process and prone to introduction of error in rates bigger than traditional corpora used in NLP or Information Retrieval.</Paragraph>
    <Paragraph position="8"> We also investigate the impact of the collection size on</Paragraph>
    <Paragraph position="10"/>
  </Section>
class="xml-element"></Paper>