<?xml version="1.0" standalone="yes"?>
<Paper uid="J05-4002">
  <Title>Co-occurrence Retrieval: A Flexible Framework for Lexical Distributional Similarity</Title>
  <Section position="5" start_page="448" end_page="452" type="metho">
    <SectionTitle>
2.4 Difference-Weighted Models
</SectionTitle>
    <Paragraph position="0"> In additive models, no distinction is made between features that have occurred to the same extent with each word and features that have occurred to different extents with each word. For example, if two words have the same features, they are considered identical, regardless of whether the feature occurs with the same probability with each word or not. Here, we define a type of model that allows us to capture the difference in the extent to which each word has each feature.</Paragraph>
    <Paragraph position="1"> We do this by defining the similarity of two words with respect to an individual feature, using the same principles that we use to define the similarity of two words with respect to all their features. First, we define an extent function, E(w, c), which is the extent to which w  goes with c and which may be, but is not necessarily, the same as the weight function D(n, w). Possible extent functions will be discussed in Section 2.5. Having defined this function, we can measure the precision and recall of individual features. The precision of an individual feature c retrieved by w  is the extent to which both words go with c divided by the extent to which w  goes with c. The recall of the retrieval of c by w  Using precision and recall of individual features as weights in the definitions of precision and recall of a distribution captures the intuition that retrieval of a co-occurrence type is not a black-and-white matter. Features that are shared to a similar extent are considered more important in the calculation of distributional similarity.</Paragraph>
    <Paragraph position="3"/>
    <Section position="1" start_page="449" end_page="449" type="sub_section">
      <SectionTitle>
2.5 Extent Functions
</SectionTitle>
      <Paragraph position="0"> The extent functions we have considered so far are summarized in Table 2. Note that in general, the extent function is the same as the weight function, which leads to a standard simplification of the expressions for precision and recall in the difference-weighted CRMs. For example, in the difference-weighted MI-based model we get the expressions:  Similar expressions can be derived for the WMI-based CRM,thet-test based CRM,the z-test based CRM,andtheALLR-based CRM. An interesting special case is the difference-weighted token-based CRM. In this case, since</Paragraph>
      <Paragraph position="2"> expressions for precision and recall:</Paragraph>
      <Paragraph position="4"> Note that although we have defined separate precision and recall functions, we have arrived at the same expression for both in this model. As a result, this model is symmetric.</Paragraph>
      <Paragraph position="5"> The only CRM in which we use a different extent and weight function is the difference-weighted type-based CRM. This is because there is no difference between types and tokens for an individual feature; i.e., their retrieval is equivalent. In this case, the following expressions for precision and recall are derived:  Note that this is different from the additive token-based model because, although every token is effectively considered in this model, tokens are not weighted equally. In this model, tokens are treated differently according to which type they belong. The importance of the retrieval (or non-retrieval) of a single token depends on the proportion of the tokens for its particular type that it constitutes.</Paragraph>
    </Section>
    <Section position="2" start_page="449" end_page="451" type="sub_section">
      <SectionTitle>
2.6 Combining Precision and Recall
</SectionTitle>
      <Paragraph position="0"> We have, so far, been concerned with defining a pair of numbers that represents the similarity between two words. However, in applications, it is normally necessary to compute a single number in order to determine neighborhood or cluster membership.</Paragraph>
      <Paragraph position="1"> The classic way to combine precision and recall in IR is to compute the F-score; that is, the harmonic mean of precision and recall:</Paragraph>
      <Paragraph position="3"> However, we do not wish to assume that a good substitute requires both high precision and high recall of the target distribution. It may be that, in some situations, the best word to use in place of another word is one that only retrieves correct co-occurrences (i.e., it is a high-precision neighbor) or it may be one that retrieves all of the required co-occurrences (i.e., it is a high-recall neighbor). The other factor in each case may play only a secondary role or no role at all.</Paragraph>
      <Paragraph position="4"> We can retain generality and investigate whether high precision or high recall or high precision and high recall are required for high similarity by computing a weighted  lie in the range [0,1] where 0 is low and 1 is high. This formula can be used in combination with any of the models for precision and recall outlined earlier. Precision and recall can be computed once for every pair of words (and every model) whereas similarity depends on the values of b and g. The flexibility allows us to investigate empirically the relative significance of the different terms and thus whether one (or more) might be omitted in future work. Table 3 summarizes some special parameter settings.</Paragraph>
    </Section>
    <Section position="3" start_page="451" end_page="452" type="sub_section">
      <SectionTitle>
2.7 Discussion
</SectionTitle>
      <Paragraph position="0"> We have developed a framework based on the concept of co-occurrence retrieval (CR). Within this framework we have defined a number of models (CRMs) that allow us to systematically explore three questions about similarity. First, is similarity between words necessarily a symmetric relationship, or can we gain an advantage by considering it as an asymmetric relationship? Second, are some features inherently more salient than others? Third, does the difference in extent to which each word takes each feature matter? The CRMs and the parameter settings therein correspond to alternative possibilities. First, a high-precision neighbor is not necessarily a high-recall neighbor (and, conversely, a high-recall neighbor is not necessarily a high-precision neighbor) and therefore we are not constrained to a symmetric relationship of similarity between</Paragraph>
    </Section>
    <Section position="4" start_page="452" end_page="452" type="sub_section">
      <SectionTitle>
Weeds and Weir Co-occurrence Retrieval
</SectionTitle>
      <Paragraph position="0"> words. Second, the use of different weight functions varies the relative importance attached to features. Finally, difference-weighted models contrast with additive models in considering the difference in extent to which each word takes each feature.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="452" end_page="452" type="metho">
    <SectionTitle>
3. Data and Experimental Techniques
</SectionTitle>
    <Paragraph position="0"> The rest of this paper is concerned with evaluation of the proposed framework; first, by comparing it to existing distributional similarity measures, and second, by evaluating performance on two tasks. Throughout our empirical work, we use one data-set and one neighbor set comparison technique, which we now discuss in advance of presenting any of our actual experiments.</Paragraph>
    <Section position="1" start_page="452" end_page="452" type="sub_section">
      <SectionTitle>
3.1 Data
</SectionTitle>
      <Paragraph position="0"> The data used for all our experimental work was noun-verb direct-object data extracted from the BNC by a Robust Accurate Statistical Parser (RASP) (Briscoe and Carroll 1995; Carroll and Briscoe 1996). We constructed a list of nouns that occur in both our data set and WordNet ordered by their frequency in our corpus data. Since we are interested in the effects of word frequency on word similarity, we selected 1,000 high-frequency nouns and 1,000 low-frequency nouns. The 1,000 high-frequency nouns were selected as the nouns with frequency ranks of 1-1,000; this corresponds to a frequency range of [586,20871]. The low-frequency nouns were selected as the nouns with frequency ranks of 3,001-4,000; this corresponds to a frequency range of [72,121].</Paragraph>
      <Paragraph position="1"> For each target noun, 80% of the available data was randomly selected as training data and the other 20% was set aside as test data.</Paragraph>
      <Paragraph position="2">  The training data was used to compute similarity scores between all possible pairwise combinations of the 2,000 nouns and to provide (MLE) estimates of noun-verb co-occurrence probabilities in the pseudo-disambiguation task. The test data provides unseen co-occurrences for the pseudo-disambiguation task.</Paragraph>
      <Paragraph position="3"> Although we only consider similarity between nouns based on co-occurrences with verbs in the direct-object position, the generality of the techniques proposed is not so restricted. Any of the techniques can be applied to other parts of speech, other grammatical relations, and other types of context. We restricted the scope of our experimental work solely for computational and evaluation reasons. However, we could have chosen to look at the similarity between verbs or between adjectives.</Paragraph>
      <Paragraph position="4">  We chose nouns as a starting point since nouns tend to allow less sense extensions than verbs and adjectives (Pustejovsky 1995). Further, the noun hyponymy hierarchy in WordNet, which will be used as a pseudo-gold standard for comparison, is widely recognized in this area of research.</Paragraph>
      <Paragraph position="5"> Some previous work on distributional similarity between nouns has used only a single grammatical relation (e.g., Lee 1999), whereas other work has considered multiple grammatical relations (e.g., Lin 1998a). We consider only a single grammatical relation because we believe that it is important to evaluate the usefulness of each grammatical relation in calculating similarity before deciding how to combine information from 5 This results in a single 80:20 split of the complete data set, in which we are guaranteed that the original relative frequencies of the target nouns are maintained.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="452" end_page="453" type="metho">
    <SectionTitle>
3.1 Data (continued)
</SectionTitle>
    <Paragraph position="0"> different parts of speech. Since we are looking at similarity in terms of substitutability, we would not expect to find a word of one part of speech substitutable for a word of another part of speech.</Paragraph>
    <Paragraph position="1">  Computational Linguistics Volume 31, Number 4 different relations. In previous work (Weeds 2003), we found that considering the subject relation as well as the direct-object relation did not improve performance on a pseudo-disambiguation task.</Paragraph>
    <Paragraph position="2"> Our last restriction was to only consider 2,000 of the approximately 35,000 nouns occurring in the corpus. This restriction was for computational efficiency and to avoid computing similarities based on the potentially unreliable descriptions of very low-frequency words. However, since our evaluation is comparative, we do not expect our results to be affected by this or any of the other restrictions.</Paragraph>
    <Section position="1" start_page="453" end_page="453" type="sub_section">
      <SectionTitle>
3.2 Neighbor Set Comparison Technique
</SectionTitle>
      <Paragraph position="0"> In several of our experiments, we measure the overlap between two different similarity measures. We use a neighbor set comparison technique adapted from Lin (1997).</Paragraph>
      <Paragraph position="1"> In order to compare two neighbor sets of size k, we transform each neighbor set so that each neighbor is given a rank score of k [?] rank. Potential neighbors not within a given rank distance k of the noun score zero. This transformation is required since scores computed on different scales are to be compared and because we wish to only consider neighbors up to a certain rank distance. The similarity between two neighbor sets S and S prime is computed as the cosine of the rank score vectors:</Paragraph>
      <Paragraph position="3"> (w) are the rank scores of the words within each neighbor set S and S prime respectively.</Paragraph>
      <Paragraph position="4"> In previous work (Weeds and Weir 2003b), having computed the similarity between neighbor sets for each noun according to each pair of measures under consideration, we computed the mean similarity across all high-frequency nouns and all low-frequency nouns. However, since the use of the CR framework requires parameter optimization, here, we randomly select 60% of the nouns to form a development set and use the remaining 40% as a test set. Thus, any parameters are optimized over the development set nouns and performance measured at these settings over the test set.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="453" end_page="460" type="metho">
    <SectionTitle>
4. Alternative Distributional Similarity Measures
</SectionTitle>
    <Paragraph position="0"> In this section, we consider related work on distributional similarity measures and the extent to which some of these measures can be simulated within the CR framework.</Paragraph>
    <Paragraph position="1"> However, there is a large body of work on distributional similarity measures; for a more extensive review, see Weeds (2003). Here, we concentrate on a number of more popular measures: the Dice Coefficient, Jaccard's Coefficient, the L  Norm, the a-skew divergence measure, Hindle's measure, and Lin's MI-based measure.</Paragraph>
    <Section position="1" start_page="453" end_page="454" type="sub_section">
      <SectionTitle>
4.1 The Dice Coefficient
</SectionTitle>
      <Paragraph position="0"> The Dice Coefficient (Frakes and Baeza-Yates 1992) is a popular combinatorial similarity measure adopted from the field of Information Retrieval for use as a measure of lexical</Paragraph>
    </Section>
    <Section position="2" start_page="454" end_page="454" type="sub_section">
      <SectionTitle>
Weeds and Weir Co-occurrence Retrieval
</SectionTitle>
      <Paragraph position="0"> distributional similarity. It is computed as twice the ratio between the size of the intersection of the two feature sets and the sum of the sizes of the individual feature sets:</Paragraph>
      <Paragraph position="2"> According to this measure, the similarity between words with no shared features is zero and the similarity between words with identical feature sets is 1. However, as shown below, this formula is equivalent to a special case in the CR framework: the harmonic mean of precision and recall (or F-score) using the additive type-based CRM.</Paragraph>
      <Paragraph position="4"> Thus, when g is set to 1 in the additive type-based CRM, the Dice Coefficient is exactly replicated.</Paragraph>
    </Section>
    <Section position="3" start_page="454" end_page="455" type="sub_section">
      <SectionTitle>
4.2 Jaccard's Coefficient
</SectionTitle>
      <Paragraph position="0"> Jaccard's Coefficient (Salton and McGill 1983), also known as the Tanimoto Coefficient (Resnik 1993), is another popular combinatorial similarity measure. It can be defined as the proportion of features belonging to either word that are shared by both words; that is, the ratio between the size of the intersection of the feature sets and the size of the union of the feature sets. As with the Dice Coefficient, the similarity between words with no shared co-occurrences is zero and the similarity between words with identical features is 1. Further, as shown by van Rijsbergen (1979), the Dice Coefficient and Jaccard's Coefficient are monotonic in one another. Thus, although in general the scores computed by each will be different, the orderings or rankings of objects will be the same. In other words, for all k and w, the k nearest neighbors of word w according to Jaccard's Coefficient will be identical to the k nearest neighbors of word w according to the Dice Coefficient and the harmonic mean of precision and recall in the additive type-based CRM.</Paragraph>
      <Paragraph position="1">  between two points in space. The L  Norm is also known as the Manhattan Distance, the taxi-cab distance, the city-block distance, and the absolute value distance, since it represents the distance traveled between the two points if you can only travel in orthogonal directions. When used to calculate lexical distributional similarity, the dimensions of the vector space are co-occurrence types and the values of the vector components are the probabilities of the co-occurrence types given the word. Thus the L</Paragraph>
      <Paragraph position="3"> In other words, the L  Norm is directly related to the difference-weighted token-based CRM. The constant and multiplying factors are required, since the CRM defines a similarity in the range [0,1], whereas the L  Norm defines a distance in the range [0,2] (where 0 distance is equivalent to 1 on the similarity scale).</Paragraph>
    </Section>
    <Section position="4" start_page="455" end_page="456" type="sub_section">
      <SectionTitle>
4.4 The a-skew Divergence Measure
</SectionTitle>
      <Paragraph position="0"> The a-skew divergence measure (Lee 1999, 2001) is a popular approximation to the Kullback-Leibler divergence measure</Paragraph>
    </Section>
    <Section position="5" start_page="456" end_page="456" type="sub_section">
      <SectionTitle>
Weeds and Weir Co-occurrence Retrieval
</SectionTitle>
      <Paragraph position="0"> would result in the actual Kullback-Leibler divergence measure being equal to [?].Itis defined (Lee 1999) as:</Paragraph>
      <Paragraph position="2"> In effect, the q distribution is smoothed with the r distribution, which results in it always being non-zero when the r distribution is non-zero. The parameter a controls the extent to which the measure approximates the Kullback-Leibler divergence measure.</Paragraph>
      <Paragraph position="3"> When a is close to 1, the approximation is close while avoiding the problem with zero probabilities associated with using the Kullback-Leibler divergence measure. This theoretical justification for using a very high value of a (e.g., 0.99) is also borne out by empirical evidence (Lee 2001).</Paragraph>
      <Paragraph position="4"> The a-skew divergence measure retains the asymmetry of the Kullback-Leibler divergence, and Weeds (2003) discusses the significance in the direction in which it is calculated. For the purposes of this paper, we will find the neighbors of w</Paragraph>
      <Paragraph position="6"> Due to the form of the a-skew divergence measure, we do not expect any of the CRMs to exactly simulate it. However, this measure does take into account the differences between the probabilities of co-occurrences in each distribution (as a log ratio) and therefore we might expect that it will be fairly closely simulated by the difference-weighted token-based CRM. Further, the a-skew divergence measure is asymmetric.</Paragraph>
      <Paragraph position="7">  and different parameter settings within the CR framework for 1,000 high-frequency nouns and for 1,000 low-frequency nouns, using the data and the neighbor set comparison technique described in Section 3. Table 4 shows the optimal parameters in each CRM for simulating dist a , computed over the development set, and the mean similarity at these settings over both the development set and the test set. From these results, we can make the following observations. First, the differences in mean similarities over the development set and the test set are minimal. Thus, performance of the models with respect to different parameter settings appears stable across different words.</Paragraph>
      <Paragraph position="8"> Second, the differences between the models are fairly small. The difference-weighted token-based CRM achieves a fairly close approximation to dist  overall best approximation is achieved by the additive t-test based CRM. Although none of the CRMs are able to simulate dist a exactly, the closeness of approximation achieved in the best cases (greater than 0.7) is substantially higher than the degree of overlap observed between other measures of distributional similarity. Weeds, Weir, and McCarthy (2004) report an average overlap of 0.4 between neighbor sets produced using dist a and Jaccard's Measure and an average overlap of 0.48 between neighbor sets produced using dist a and Lin's similarity measure.</Paragraph>
      <Paragraph position="9"> A third observation is that all of the asymmetric models get closest at high levels of recall for both high- and low-frequency nouns. For example, Figure 1 illustrates the variation in mean similarity between neighbor sets with the parameters b and g for the additive t-test based model. As can be seen, similarity between neighbor sets is significantly higher at high recall settings (low b) within the model than at high-precision settings (high b), which suggests that dist a has high-recall CR characteristics.</Paragraph>
    </Section>
    <Section position="6" start_page="456" end_page="458" type="sub_section">
      <SectionTitle>
4.5 Hindle's Measure
</SectionTitle>
      <Paragraph position="0"> Hindle (1990) proposed an MI-based measure, which he used to show that nouns could be reliably clustered based on their verb co-occurrences. We consider the variant of  ). However, we also note that the denominator in the expression for recall depends only on w  will be the value of sim hind divided by a constant. Hence, neighbor sets derived using sim hind are identical to those obtained using recall (g= 0,b = 0) in the difference-weighted MI-based CRM.</Paragraph>
    </Section>
    <Section position="7" start_page="458" end_page="460" type="sub_section">
      <SectionTitle>
4.6 Lin's Measure
</SectionTitle>
      <Paragraph position="0"> Lin (1998a) proposed a measure of lexical distributional similarity based on his information-theoretic similarity theorem (Lin 1997, 1998b): The similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are.</Paragraph>
      <Paragraph position="1">  where T(w) ={c : I(w, c) &gt; 0}. There are parallels between sim lin and sim dice in that both measures compute a ratio between what is shared by the descriptions of both nouns and the sum of the descriptions of each noun. The major difference appears to be the use of MI, and hence we predicted that there would be a close relationship between sim lin and the harmonic mean in the additive MI-based CRM. This relationship is shown below:  , c) holds, the CR framework reduces to sim lin . However, this last necessary condition for equivalence is not one we can expect to hold for many (if any) pairs of words. In order to investigate how good an approximation the harmonic mean is to sim lin in practice, we compared neighbor sets according to each measure using the neighbor set comparison technique outlined earlier. Figure 2 illustrates the variation in mean similarity between neighbor sets with the parameters b and g.Atg= 1, the average similarity between neighbor rankings was 0.967 for high-frequency nouns and 0.923 for low-frequency nouns. This is significantly higher than similarities between other standard similarity measures. However, the optimal approximation of sim lin was found using g= 0.75 and b= 0.5 in the additive MI-based CRM. With these settings, the development set similarity was 0.987 for high-</Paragraph>
    </Section>
    <Section position="8" start_page="460" end_page="460" type="sub_section">
      <SectionTitle>
Weeds and Weir Co-occurrence Retrieval
</SectionTitle>
      <Paragraph position="0"> frequency nouns and 0.977 for low-frequency nouns. This suggests that sim lin allows more compensation for lack of recall by precision and vice versa than the harmonic mean.</Paragraph>
    </Section>
    <Section position="9" start_page="460" end_page="460" type="sub_section">
      <SectionTitle>
4.7 Discussion
</SectionTitle>
      <Paragraph position="0"> We have seen that five of the existing lexical distributional similarity measures are (approximately) equivalent to settings within the CR framework and for one other, a weak approximation can be made. The CR framework, however, more than simulates existing measures of distributional similarity. It defines a space of distributional similarity measures that is already populated with a few named measures. By exploring the space, we can discover the desirable characteristics of distributional similarity measures. It may be that the most useful measure within this space has already been discovered, or it may be that a new optimal combination of characteristics is discovered. The primary goal, however, is to understand how different characteristics relate to high performance in different applications and thus explain why one measure performs better than another.</Paragraph>
      <Paragraph position="1"> With this goal in mind, we now turn to the applications of distributional similarity.</Paragraph>
      <Paragraph position="2"> In the next section, we consider what characteristics of distributional similarity measures are desirable in two different application areas: (1) automatic thesaurus generation and (2) language modeling.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="460" end_page="468" type="metho">
    <SectionTitle>
5. Application-Based Evaluation
</SectionTitle>
    <Paragraph position="0"> As discussed by Weeds (2003), evaluation is a major problem in this area of research.</Paragraph>
    <Paragraph position="1"> In some areas of natural language research, evaluation can be performed against a gold standard or against human plausibility judgments. The first of these approaches is taken by Curran and Moens (2002), who evaluate a number of different distributional similarity measures and weight functions against a gold standard thesaurus compiled from Roget's,theMacquarie thesaurus, and the Moby thesaurus. However, we argue that this approach can only be considered when distributional similarity is required as an approximation to semantic similarity and that, in any case, it is not ideal since it is not Figure 2 Variation (with parameters b and g) in development set mean similarity between neighbor sets of the additive MI-based CRM and of sim lin .</Paragraph>
    <Paragraph position="2">  Computational Linguistics Volume 31, Number 4 clear that there is a single &amp;quot;right answer&amp;quot; as to which words are most distributionally similar. The best measure of distributional similarity will be the one that returns the most useful neighbors in the context of a particular application and thus leads to the best performance in that application. This section investigates whether the desirable characteristics of a lexical distributional similarity measure in an automatic thesaurus generation task (WordNet prediction) are the same as those in a language modeling task (pseudo-disambiguation).</Paragraph>
    <Section position="1" start_page="461" end_page="464" type="sub_section">
      <SectionTitle>
5.1 WordNet Prediction Task
</SectionTitle>
      <Paragraph position="0"> In this section, we evaluate the ability of distributional similarity measures to predict semantic similarity by making comparisons with WordNet. An underlying assumption of this approach is that WordNet is a gold standard for semantic similarity, which, as is discussed by Weeds (2003), is unrealistic. However, it seems reasonable to suppose that a distributional similarity measure that more closely predicts a semantic measure based on WordNet is more likely to be a good predictor of semantic similarity. We chose WordNet as our gold standard for semantic similarity since, as discussed by Kilgarriff and Yallop (2000), distributional similarity scores calculated over grammatical relation level context tend to be more similar to tighter thesauri, such as WordNet, than looser thesauri such as Roget's.</Paragraph>
      <Paragraph position="1"> 5.1.1 Experimental Set-Up. There are a number of ways to measure the distance between two nouns in the WordNet noun hierarchy (see Budanitsky [1999] for a review).</Paragraph>
      <Paragraph position="2"> In previous work (Weeds and Weir 2003b), we used the WordNet-based similarity measure first proposed in Lin (1997) and used in Lin (1998a):  where S(w) is the set of senses of the word w in WordNet, sup(c) is the set of possibly indirect super-classes of concept c in WordNet, and P(c) is the probability that a randomly selected word refers to an instance of concept c (estimated over some corpus such as SemCor [Miller et al. 1994]).</Paragraph>
      <Paragraph position="3"> However, in other research (Budanitsky and Hirst 2001; Patwardhan, Banerjee, and Pedersen 2003; McCarthy, Koeling, and Weeds 2004), it has been shown that the distance measure of Jiang and Conrath (1997) (referred to herein as the &amp;quot;JC measure&amp;quot;) is a  In our work, we make an empirical comparison of neighbors derived using a WordNet-based measure and each of the distributional similarity measures using the technique discussed in Section 3. We have carried out the same experiments using both the Lin measure and the JC measure. Correlation between distributional similarity measures and the WordNet measure tends to be slightly higher when using the JC measure  (percentage increase in similarity of approximately 10%), but the relative differences between distributional similarity measures remain approximately the same. Here, for brevity, we present results just using the JC measure.</Paragraph>
      <Paragraph position="4"> 5.1.2 Results. As before, we present the results separately for the 1,000 high-frequency target nouns and for the 1,000 low-frequency target nouns. Table 5 shows the optimal parameter settings for each CRM (computed over the development set) and the mean similarities with the JC measure at these settings in both the development set and the test set. It also shows the mean similarities over the development set and the test set for each of the existing similarity measures discussed in Section 4. For reference, we also present the mean similarity for the WordNet-based measure wn sim lin . For ease  Computational Linguistics Volume 31, Number 4 Figure 3 Bar chart illustrating test set similarity with WordNet for each distributional similarity measure. of comparison, the test set correlation values for each distributional measure are also illustrated in Figure 3.</Paragraph>
      <Paragraph position="5"> We would expect a mean overlap score of 0.08 by chance. Standard deviations in the observed test set mean similarities were all less than 0.1, and thus any difference between mean scores of greater than 0.016 is significant at the 99% level, and differences greater than 0.007 are significant at the 90% level. Thus, from the results in Table 5 we can make the following observations.</Paragraph>
      <Paragraph position="6"> First, the best-performing distributional similarity measures, in terms of WordNet prediction, for both high- and low-frequency nouns, are the MI-based and the t-test based CRMs. The additive MI-based CRM performs the best for high-frequency nouns and the additive t-test based CRM performs the best for low-frequency nouns. However, the differences between these models are not statistically significant. These CRMs perform substantially better than all of the unparameterized distributional similarity measures, of which the best performing are sim hind and sim lin for high-frequency nouns and dist  for low-frequency nouns. Second, the difference-weighted versions of each model generally perform slightly worse than their additive counterparts. Thus, the difference in extent to which each word occurs in each context does not appear to be a factor in determining semantic similarity. Third, all of the measures perform significantly better for high-frequency nouns than for low-frequency nouns. However, some of the measures (sim lin , sim jacc and sim dice ) perform considerably worse for low-frequency nouns.</Paragraph>
      <Paragraph position="7"> We now consider the effects of b and g in the CRMs on performance. The pattern of variation across the CRMs was very similar. This pattern is illustrated using one of the best-performing CRMs (sim add mi ) in Figure 4. With reference to this figure and to the results for the other models (not shown), we make the following observations.  Variation in similarity with WordNet with respect to b and g for the additive MI-based CRM. First, for high- and low-frequency nouns, similarity with WordNet is higher for low values of b than for high values of b. In other words, neighbors according to the WordNet based measure tend to have high-recall retrieval of the target noun's co-occurrences. Second, a high value of g leads to high performance for high-frequency nouns but poor performance for low-frequency nouns. This suggests that WordNet-derived neighbors of high-frequency target nouns also have high-precision retrieval of the target noun's co-occurrences, whereas the WordNet-derived neighbors of low-frequency target nouns do not. This also explains why particular existing measures (Jaccard's / the Dice Coefficient and Lin's Measure), which are very similar to a g= 1 setting in the CR framework, perform well for high-frequency nouns but poorly for low-frequency nouns.</Paragraph>
      <Paragraph position="8"> 5.1.3 Discussion. Our results in this section are comparable to those of Curran and Moens (2002), who showed that combining the t-test with Jaccard's coefficient outperformed combining MI with Jaccard's coefficient by approximately 10% in a comparison against a gold-standard thesaurus. However, we do not find a significant difference between using the t-test and MI in similarity calculation. Further, we found that using a combination of precision and recall weighted towards recall performs substantially better than using the harmonic mean (which is equivalent to Jaccard's measure). In our experiments, the development-set similarity using the harmonic mean in the additive MI-based CRM was 0.312 for high-frequency nouns and 0.153 for low-frequency nouns, and the development-set similarity using the harmonic mean in the additive t-test based CRM was 0.294 for high-frequency nouns and 0.129 for low-frequency nouns.</Paragraph>
    </Section>
    <Section position="2" start_page="464" end_page="468" type="sub_section">
      <SectionTitle>
5.2 Pseudo-Disambiguation Task
</SectionTitle>
      <Paragraph position="0"> Pseudo-disambiguation tasks have become a standard evaluation technique (Gale, Church, and Yarowsky 1992; Sch &amp;quot;utze 1992; Pereira, Tishby, and Lee 1993; Sch &amp;quot;utze 1998; Lee 1999; Dagan, Lee, and Pereira 1999; Golding and Roth 1999; Rooth et al. 1999; Even-Zohar and Roth 2000; Lee 2001; Clark and Weir 2002) and, in the current setting, we may use a noun's neighbors to decide which of two co-occurrences is the most likely.</Paragraph>
      <Paragraph position="1"> Although pseudo-disambiguation is an artificial task, it has relevance in at least two application areas. First, by replacing occurrences of a particular word in a test suite with  Computational Linguistics Volume 31, Number 4 a pair of words from which a technique must choose, we recreate a simplified version of the word sense disambiguation task; that is, we choose between a fixed number of homonyms based on local context. The second is in language modeling where we wish to estimate the probabilities of co-occurrences of events but, due to the sparse data problem, it is often the case that a possible co-occurrence has not been seen in the training data.</Paragraph>
      <Paragraph position="2"> 5.2.1 Experimental Set-up. A typical approach to performing pseudo-disambiguation is as follows. A large set of noun-verb direct-object pairs is extracted from a corpus, of which a portion is used as test data and another portion is used as training data. The training data can be used to construct a language model and/or determine the distributionally nearest neighbors of each noun. Noun-verb pairs (n, v  ) and the task is to decide which of the two verbs is the most likely to take the noun as its direct object. Performance is usually measured as error rate. We will now discuss the details of our own experimental set-up. As already discussed (Section 3), 80% of the noun-verb direct-object data extracted from the BNC for each of 2,000 nouns was used to compute the similarity between nouns and is also used as the language model in the pseudo-disambiguation task, and 20% of the data was set aside as test data, providing unseen co-occurrences for this pseudo-disambiguation task.</Paragraph>
      <Paragraph position="3"> In order to construct the test set from the test data, we took all  of the test data set aside for each target noun and modified it as follows. We converted each noun-verb pair  were then selected for each target noun in a two-step process of (1) while more than ten triples remained, discarding duplicate triples and (2) randomly selecting ten triples from those remaining after step 1. At this point, we have 10,000 test instances pertaining to high-frequency nouns and 10,000 test instances pertaining to low-frequency nouns, and there are no biases towards the higher-frequency or lower-frequency nouns within these sets. Each of these sets was split into five disjoint subsets, each containing two instances for each target noun. We use these five subsets in two ways. First, we perform five-fold cross validation. In five-fold cross validation, we compute the optimal parameter settings in four of the subsets and the error rate at this optimal parameter setting in the remaining subset. This is repeated five times with a different subset held out each time. We then compute an average optimal error rate. We cannot, however, compute an average optimal parameter setting, since this would assume a convex relationship between parameter settings and error rate. In order to study the relationship between parameter settings and error rate, we combine three of the sets to form a development set and two of the sets to form a test set. The development set is used to optimize parameters and the test set  Weeds and Weir Co-occurrence Retrieval to determine error rates at the optimal settings. In graphs showing the relationship between error rate and parameter settings, it is the error rate in this development set that is shown. In the case of the CRMs, the parameters that are optimized are b, g,and k (the number of nearest neighbors).</Paragraph>
      <Paragraph position="4">  For the existing measures, the only parameter to be optimized is k.</Paragraph>
      <Paragraph position="5"> Having constructed the test sets, the task is to take each test instance (n, v  ) and that it casts to the verb with which it occurs most frequently. Thus, we distinguish between cases where a neighbor occurs with each verb approximately the same number of times and where a neighbor occurs with one verb significantly more often than the other. The votes for each verb are summed over all of the k nearest neighbors of n, and the verb with the most votes wins. Performance is measured as error rate.</Paragraph>
      <Paragraph position="6">  where T is the number of test instances and a tie results when the neighbors cannot decide between the two alternatives.</Paragraph>
      <Paragraph position="7">  all of the CRMs described in Section 2. We also compare the results with the six existing distributional similarity measures (Section 4) and the two WordNet-based measures (Section 5.1).</Paragraph>
      <Paragraph position="8"> A baseline for these experiments is the performance obtained by a technique that backs-off to the unigram probabilities of the verbs being disambiguated. By construction of the test set, this should be approximately 0.5. The actual empirical figures are 0.553 for the high-frequency noun test set and 0.586 for the low-frequency noun test set. The deviation from 0.5 is due to the unigram probabilities of the verbs not being exactly equal and to their being calculated over a larger data set than just the training data for the 2,000 target nouns. These baseline error-rates are also different from what is observed when all 1,999 potential neighbors are considered. In this case, we obtain an error rate of 0.6885 for the high-frequency noun test set and 0.6178 for the low-frequency noun test set. These differences are due to the fact that the correct choice verb, but not the incorrect choice verb, has occurred, possibly many times, with the target noun in the training data, but a noun is not considered as a potential neighbor of itself.</Paragraph>
      <Paragraph position="9"> The results are summarized in Table 6. The table gives the average optimal error rates for each measure, and for high- and low-frequency nouns, calculated using five-fold cross validation. For ease of comparison, the cross-validated average optimal error rates are illustrated in Figure 5. Standard deviation in the mean optimal error rate across the five folds was always less than 0.15 and thus differences greater than 0.028 are significant at the 99% level and differences greater than 0.012 are significant at the 90% level. From the results, we make the following observations.</Paragraph>
      <Paragraph position="10"> 12 We also experimented with optimizing a similarity threshold t, but found that optimizing k gave better results (Weeds 2003).</Paragraph>
      <Paragraph position="11">  First, the best measure appears to be the additive t-test based CRM. This significantly outperforms all but one (the z-test based CRM) of the other measures for high-frequency nouns. For low-frequency nouns, slightly higher performance is obtained using the additive MI-based CRM. This difference, however, is not statistically significant. Second, all of the distributional similarity measures perform considerably better than the WordNet-based measures  at this task for high- and low-frequency nouns.</Paragraph>
      <Paragraph position="12"> Third, for many measures, performance over high-frequency nouns is not significantly higher (and is in some cases lower) than over low-frequency nouns. This suggests that distributional similarity can be used in language modeling even when there is relatively little corpus data over which to calculate distributional similarity.</Paragraph>
      <Paragraph position="13"> We now consider the effects of the different parameters on performance. Since we use the development set to determine the optimal parameters, we consider performance on the development set as each parameter is varied. Table 7 shows the optimized parameter settings in the development set, error rate at these settings in the development set, and error rate at these settings in the test set. For the CRMs, we considered how the performance varies with each parameter when the other parameters are held constant at their optimum values. Figure 6 shows how performance varies with b, and Figure 7 shows how performance varies with g for the additive and difference-weighted t-test based and MI-based CRMs. For reference, the optimal error rates for the best performing existing distributional similarity measure (sim lin ) is also shown as a straight line on each graph.</Paragraph>
      <Paragraph position="14"> We do not show the variation with respect to k for any of the measures, but this was fairly similar for all measures and is as would be expected. To begin with, considering 13 However, for this task, in contrast to earlier work, wn sim lin gives slightly, although insignificantly, better performance than wn dist  Weeds and Weir Co-occurrence Retrieval Figure 5 Bar chart illustrating cross-validated optimal error rates for each measure when k is optimised. more neighbors increases performance, since more neighbors allow decisions to be made in a greater number of cases. However, when k increases beyond an optimal value, a greater number of these decisions will be in the wrong direction, since these words are not very similar to the target word, leading to a decrease in performance. In a small number of cases (when using the ALLR-based CRMs or the WMI-based CRMs for high frequency nouns), performance peaks at k = 1. This suggests that these measures may be very good at finding a few very close neighbors.</Paragraph>
      <Paragraph position="15"> The majority of models, including the additive t-test based and additive MI-based CRMs, perform significantly better at low values of g (0.25-0.5) and high values of b (around 0.8). This indicates that a potential neighbor with high-precision retrieval of informative features is more useful than one with high-recall retrieval. In other words, it seems that it is better to sacrifice being able to make decisions on every test instance with a small number of neighbors in favor of not having neighbors that predict incorrect verb co-occurrences. This also suggests why we saw fairly low performance by the a-skew divergence measure on this task, since it is closest to a high-recall setting in the additive t-test based model. The low values of g indicate that a combination of precision and recall that is closer to a weighted arithmetic mean is generally better than one that is closer to an unweighted harmonic mean. However, this does not hold for the t-test based CRMs for low-frequency nouns. Here a higher value of g is optimal, indicating that, in this case, requiring both recall and precision results in high performance.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>