<?xml version="1.0" standalone="yes"?> <Paper uid="J05-4002"> <Title>Co-occurrence Retrieval: A Flexible Framework for Lexical Distributional Similarity</Title> <Section position="4" start_page="442" end_page="448" type="intro"> <SectionTitle> 2. Co-occurrence Retrieval </SectionTitle> <Paragraph position="0"> In this section, we present a flexible framework for distributional similarity. This framework directly defines a similarity function, does not require smoothing of the base language model, and allows us to systematically explore the questions about similarity raised in Section 1. In our approach, similarity between words is viewed as a measure of how appropriate it is to use one word (or its distribution) in place of the other. Like relative entropy (Cover and Thomas 1991), it is inherently asymmetric, since we can measure how appropriate it is to use word A instead of word B separately from how appropriate it is to use word B instead of word A.</Paragraph> <Paragraph position="1"> The framework presented here is general to the extent that it can be used to compute similarities for any set of objects where each object has an associated set of features or co-occurrence types, and these co-occurrence types have associated frequencies that may be used to form probability estimates. Throughout our discussion, the word for which we are finding neighbors will be referred to as the target word. If we are computing the similarity between the target word and another word, then the second word is a potential neighbor of the target word. A target word's nearest neighbors are the potential neighbors that have the highest similarity with the target word.</Paragraph> <Section position="1" start_page="443" end_page="443" type="sub_section"> <SectionTitle> 2.1 Basic Concepts </SectionTitle> <Paragraph position="0"> Let us imagine that we have formed descriptions of each word in terms of the other words with which it co-occurs in various specified grammatical relations in some corpus. For example, the noun cat might have the co-occurrence types <dobj-of, feed> and <ncmod-by, hungry>. Now let us imagine that we have lost (or accidentally deleted) the description for word w1, but before this happened we had noticed that the description of word w1 was very similar to that of word w2. For example, the noun dog might also have the co-occurrence types <dobj-of, feed> and <ncmod-by, hungry>. Hence, we decide that we can use the description of word w2 in place of that of word w1.</Paragraph> <Paragraph position="1"> The task we have set ourselves can be seen as co-occurrence retrieval (CR). By analogy with information retrieval, where there is a set of documents that we would like to retrieve and a set of documents that we do retrieve, we have a scenario where there is a set of co-occurrences that we would like to retrieve, the co-occurrences of w1, and a set of co-occurrences that we have retrieved, the co-occurrences of w2. Continuing the analogy, we can measure how well we have done in terms of precision and recall, where precision tells us how much of what was retrieved was correct and recall tells us how much of what we wanted to retrieve was retrieved.</Paragraph> <Paragraph position="2"> Our flexible framework for distributional similarity is based on this notion of co-occurrence retrieval.</Paragraph>
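<Paragraph position="3"> To make the retrieval analogy concrete, the following minimal Python sketch is purely illustrative: the word descriptions are invented toy data, not counts from our corpus, and the code is not the system used in our experiments. It treats the target word's co-occurrence types as the set we want to retrieve and a potential neighbor's co-occurrence types as the set we have retrieved.

# Toy descriptions: each word maps to a set of co-occurrence types,
# written here as (grammatical relation, word) pairs.
descriptions = {
    "cat":    {("dobj-of", "feed"), ("ncmod-by", "hungry"), ("subj-of", "purr")},
    "dog":    {("dobj-of", "feed"), ("ncmod-by", "hungry"), ("subj-of", "bark")},
    "animal": {("dobj-of", "feed"), ("ncmod-by", "hungry"), ("subj-of", "purr"),
               ("subj-of", "bark"), ("dobj-of", "hunt")},
}

def cr_precision_recall(target, neighbour):
    """Unweighted co-occurrence retrieval for a target word and a potential neighbour."""
    wanted = descriptions[target]              # co-occurrences we would like to retrieve
    retrieved = descriptions[neighbour]        # co-occurrences we have retrieved
    shared = wanted.intersection(retrieved)
    precision = len(shared) / len(retrieved)   # how much of what was retrieved was correct
    recall = len(shared) / len(wanted)         # how much of what we wanted was retrieved
    return precision, recall

print(cr_precision_recall("dog", "animal"))    # (0.6, 1.0)
print(cr_precision_recall("animal", "dog"))    # (1.0, 0.6)

Swapping the roles of target and neighbor changes the scores, which is exactly the asymmetry discussed next. </Paragraph>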
<Paragraph position="4"> As the distribution of word B moves away from being identical to that of word A, its "similarity" with A can decrease along one or both of two dimensions. When B occurs in contexts that word A does not, the result is a loss of precision, but B may remain a high-recall neighbor. For example, we might expect the noun animal to be a high-recall neighbor of the noun dog. When B does not occur in contexts that A does occur in, the result is a loss of recall, but B may remain a high-precision neighbor. For example, we might expect the noun dog to be a high-precision neighbor of the noun animal. We can explore the merits of symmetry and asymmetry in a similarity measure by varying the relative importance attached to precision and recall. This was the first question posed about distributional similarity in Section 1.</Paragraph> <Paragraph position="5"> The remainder of this section is devoted to defining two types of co-occurrence retrieval model (CRM). Additive models are based on the Boolean concept of two objects either sharing or not sharing a particular feature (where objects are words and features are co-occurrence types). Difference-weighted models incorporate the difference in the extent to which each word has each feature. Exploring the two types of models, both defined on the same concepts of precision and recall, allows us to investigate the third question posed in Section 1: Is a shared context worth the same, regardless of the difference in the extent to which each word appears in that context?</Paragraph> <Paragraph position="6"> We also use the CR framework to investigate the second question posed about distributional similarity, "Should all contexts be treated equally?", by using different weight functions within each type of model. Weight functions decide which co-occurrence types are features of a word and determine the relative importance of features. In previous work (Weeds and Weir 2003b), we experimented with weight functions based on combinatorial, probabilistic, and mutual information (MI) measures. These allow us to define type-based, token-based, and MI-based CRMs, respectively. This work extends the previous work by also considering weighted mutual information (WMI) (Fung and McKeown 1997), the t-test (Manning and Schütze 1999), the z-test (Fontenelle et al. 1994), and an approximation to the log-likelihood ratio (Manning and Schütze 1999) as weight functions.</Paragraph> </Section> <Section position="4" start_page="444" end_page="448" type="sub_section"> <SectionTitle> 2.2 Additive Models </SectionTitle> <Paragraph position="0"> Having considered the intuition behind calculating precision and recall for co-occurrence retrieval, we now formalize it in terms of an additive model.</Paragraph> <Paragraph position="1"> We first need to consider, for each word w, which co-occurrence types will be retrieved, or predicted, by it and, conversely, required in a description of it. We will refer to these co-occurrence types as the features of w, F(w): F(w) = {c : D(w, c) > 0}, where D(w, c) is the weight associated with word w and co-occurrence type c. Possible weight functions will be described in Section 2.3.</Paragraph>
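<Paragraph position="2"> As a minimal illustration of how a weight function induces a feature set, the sketch below uses invented counts, and the two weight functions shown are standard textbook formulations included only as examples; the full set of weight functions we consider is summarized in Table 1.

import math

# Invented co-occurrence counts f(w, c), for illustration only.
counts = {
    ("dog", ("dobj-of", "feed")): 12,
    ("dog", ("ncmod-by", "hungry")): 5,
    ("dog", ("subj-of", "bark")): 20,
    ("cat", ("dobj-of", "feed")): 9,
    ("cat", ("subj-of", "purr")): 14,
}
all_types = {c for (_, c) in counts}

def p_c_given_w(w, c):
    # maximum-likelihood estimate of P(c|w)
    total = sum(n for (w2, c2), n in counts.items() if w2 == w)
    return counts.get((w, c), 0) / total

def p_c(c):
    # maximum-likelihood estimate of P(c)
    return sum(n for (w2, c2), n in counts.items() if c2 == c) / sum(counts.values())

def d_token(w, c):
    # token-style weight: the probability of the co-occurrence given the word
    return p_c_given_w(w, c)

def d_mi(w, c):
    # MI-style weight: pointwise mutual information, log P(c|w)/P(c)
    pcw = p_c_given_w(w, c)
    return math.log(pcw / p_c(c)) if pcw > 0 else 0.0

def features(w, d):
    # F(w): the co-occurrence types to which the weight function assigns positive weight
    return {c for c in all_types if d(w, c) > 0}

print(features("dog", d_token))   # every type observed with dog
print(features("dog", d_mi))      # MI drops types that occur no more often than expected
</Paragraph>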
<Paragraph position="3"> The shared features of words w1 and w2 are the features common to both, F(w1) ∩ F(w2). The precision of w2's retrieval of w1's co-occurrences is the proportion of w2's features that are shared by both words, where each feature is weighted by its relative importance according to w2:

P_add(w1, w2) = [Σ_{c ∈ F(w1) ∩ F(w2)} D(w2, c)] / [Σ_{c ∈ F(w2)} D(w2, c)]

The recall of w2's retrieval of w1's co-occurrences is the proportion of w1's features that are shared by both words, where each feature is weighted by its relative importance according to w1:

R_add(w1, w2) = [Σ_{c ∈ F(w1) ∩ F(w2)} D(w1, c)] / [Σ_{c ∈ F(w1)} D(w1, c)]</Paragraph> <Paragraph position="4"> Precision and recall both lie in the range [0,1] and are both equal to one when each word has exactly the same features. It should also be noted that the recall of w2's retrieval of w1's co-occurrences is equal to the precision of w1's retrieval of w2's co-occurrences: R_add(w1, w2) = P_add(w2, w1).</Paragraph> <Paragraph position="5"> The weight function D(w, c) determines which co-occurrence types are important enough to be considered part of a word's description or, by analogy with document retrieval, which co-occurrences we want to retrieve for w1 and which co-occurrences we have retrieved using the description of w2; the weight can be seen as the actual relevance of co-occurrence type c. The weight functions we have considered so far are summarized in Table 1. Each weight function can be used to define its own CRM, which we will now discuss in more detail.</Paragraph> <Paragraph position="6"> Table 1: Weight functions.</Paragraph> <Paragraph position="7"> Additive type-based CRM. Here, the precision of the retrieval is the proportion of verb co-occurrence types (or distinct verbs) occurring with w2 that also occur with w1. In this case, every value of D that is summed is 1, and hence the expressions for precision and recall can be simplified:

P_add(w1, w2) = |F(w1) ∩ F(w2)| / |F(w2)|        R_add(w1, w2) = |F(w1) ∩ F(w2)| / |F(w1)|</Paragraph> <Paragraph position="8"> Additive token-based CRM. Here, the precision of the retrieval is the proportion of co-occurrence tokens occurring with w2 that also occur with w1. Words therefore have the same features as in the type-based CRM, but each feature is given a weight based on its probability of occurrence, D(w, c) = P(c|w). Since a word's feature weights sum to one, the expressions for precision and recall can be simplified:

P_add(w1, w2) = Σ_{c ∈ F(w1) ∩ F(w2)} P(c|w2)        R_add(w1, w2) = Σ_{c ∈ F(w1) ∩ F(w2)} P(c|w1)</Paragraph> <Paragraph position="9"> Additive MI-based CRM. Using pointwise mutual information (MI) (Church and Hanks 1989) as the weight function means that a co-occurrence type c is considered a feature of word w if the probability of their co-occurrence is greater than would be expected if words occurred independently. In addition, more informative co-occurrences contribute more to the sums in the calculation of precision and recall and hence have more weight.</Paragraph> <Paragraph position="10"> Additive WMI-based CRM. Weighted mutual information (WMI) (Fung and McKeown 1997) has been proposed as an alternative to MI, particularly when MI might lead to the over-association of low-frequency events. In this function, the pointwise MI is multiplied by the probability of the co-occurrence, hence reducing the weight assigned to low-probability events.</Paragraph> <Paragraph position="11"> Additive t-test-based CRM (D_t). The t-test (Manning and Schütze 1999) is a standard statistical test that has been proposed for collocation analysis. It measures the (signed) difference between the observed probability of co-occurrence and the expected probability of co-occurrence, as would be observed if words occurred independently. The difference is divided by the standard deviation of the observed distribution. Similarly to MI, this score gives more weight to co-occurrences that occur more often than would be expected, and its use as the weight function means that any co-occurrences occurring less often than would be expected are ignored.</Paragraph>
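<Paragraph position="12"> The additive definitions above can be mirrored in a compact sketch; the weights below are invented toy values, and the code illustrates the definitions rather than the implementation used in our experiments.

# An additive CRM in miniature: d[w] maps each feature c in F(w) to its weight D(w, c).
weights = {
    "cat": {("dobj-of", "feed"): 0.4, ("ncmod-by", "hungry"): 0.2, ("subj-of", "purr"): 0.4},
    "dog": {("dobj-of", "feed"): 0.3, ("ncmod-by", "hungry"): 0.1, ("subj-of", "bark"): 0.6},
}

def additive_precision_recall(w1, w2, d):
    """Precision and recall of retrieving w1's co-occurrences with w2's description."""
    f1, f2 = d[w1], d[w2]                      # F(w1) and F(w2), with their weights
    shared = set(f1).intersection(f2)          # co-occurrence types the two words share
    precision = sum(f2[c] for c in shared) / sum(f2.values())
    recall = sum(f1[c] for c in shared) / sum(f1.values())
    return precision, recall

# With type-based weights every summed value is 1, so the ratios reduce to
# proportions of shared co-occurrence types; with token-based weights P(c|w)
# the denominators are 1 and only the shared probability mass is summed.
print(additive_precision_recall("cat", "dog", weights))   # roughly (0.4, 0.6)
</Paragraph>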
<Paragraph position="13"> Additive z-test-based CRM (D_z). The z-test (Fontenelle et al. 1994) is almost identical to the t-test. However, using the z-test, the (signed) difference between the observed probability of co-occurrence and the expected probability of co-occurrence is divided by the standard deviation of the expected distribution.</Paragraph> <Paragraph position="14"> Additive log-likelihood-ratio-based CRM (D_allr). The log-likelihood ratio (Manning and Schütze 1999) considers the difference (as a log ratio) in probability of the observed frequencies of co-occurrences and individual words occurring under the null hypothesis, that words occur independently, and under the alternative hypothesis, that they do not.</Paragraph> <Paragraph position="15"> If f(w, c) is the frequency of w and c occurring together, f(w) is the total frequency of w occurring in any context, f(c) is the total frequency of c occurring with any word, and N is the grand total of co-occurrences, then the log-likelihood ratio can be written:

LLR(w, c) = 2 [ log L(f(w, c), f(w), p1) + log L(f(c) - f(w, c), N - f(w), p2) - log L(f(w, c), f(w), p) - log L(f(c) - f(w, c), N - f(w), p) ]

where L(k, n, x) = x^k (1 - x)^(n - k), p = f(c)/N, p1 = f(w, c)/f(w), and p2 = (f(c) - f(w, c))/(N - f(w)).</Paragraph> <Paragraph position="16"> In our implementation (see Table 1), an approximation to this formula is used, which we term the ALLR weight function. We use an approximation because the terms that represent the probabilities of the other contexts (i.e., seeing f(c) - f(w, c) under each hypothesis) tend towards -∞ as N increases (since the probabilities tend towards zero). Since N is very large in our experiments (approximately 2,000,000), we found that using the full formula led to many weights being undefined. Further, since in this case the probability of seeing the other contexts will be approximately equal under each hypothesis, it is a reasonable approximation to make.</Paragraph> <Paragraph position="17"> Another potential problem with using the log-likelihood ratio as the weight function is that it is always positive, since the observed distribution is always more probable than the hypothesized distribution. All of the other weight functions assign a zero or negative weight to co-occurrence types that do not occur with a given word, and thus these zero-frequency co-occurrence types are never selected as features.</Paragraph> <Paragraph position="18"> This is advantageous in the computation of similarity, since computing the sums over all co-occurrence types, rather than just those co-occurring with at least one of the words, is (1) very computationally expensive and (2) liable to let the effect of these zero-frequency co-occurrence types, due to their vast number, outweigh the effect of those co-occurrence types that have actually occurred. Giving such weight to these shared non-occurrences seems unintuitive and has been shown by Lee (1999) to be undesirable in the calculation of distributional similarity. Hence, when using ALLR as the weight function, we use the additional restriction that P(c, w) > 0 when selecting features.</Paragraph> </Section> </Section> </Paper>
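The feature-selection restriction just described can be illustrated with a short sketch. The weight function below is an invented, always-positive stand-in used only for illustration; it is not the ALLR formula itself (the actual formula appears in Table 1), and the observed pairs are toy data.

# Toy data: the co-occurrence pairs that were actually observed, and the
# candidate co-occurrence types under consideration.
observed = {("dog", ("dobj-of", "feed")), ("dog", ("subj-of", "bark"))}
candidates = [("dobj-of", "feed"), ("subj-of", "bark"), ("subj-of", "purr")]

def always_positive_weight(w, c):
    # Stand-in score that, like ALLR, is positive for every candidate type,
    # whether or not the pair (w, c) was ever observed in the corpus.
    return 2.0 if (w, c) in observed else 0.5

def select_features(w, weight, require_observed):
    """F(w) under a given weight function.  Weight functions that assign zero
    or negative weight to unseen co-occurrence types need no extra check; an
    always-positive weight needs the restriction P(c, w) > 0, i.e. the
    co-occurrence must actually have been observed."""
    feats = set()
    for c in candidates:
        if require_observed and (w, c) not in observed:
            continue                            # enforce P(c, w) > 0
        if weight(w, c) > 0:
            feats.add(c)
    return feats

print(select_features("dog", always_positive_weight, require_observed=False))  # unseen types leak in
print(select_features("dog", always_positive_weight, require_observed=True))   # only observed types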