File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/99/p99-1004_metho.xml
Size: 13,406 bytes
Last Modified: 2025-10-06 14:15:20
<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1004"> <Title>Measures of Distributional Similarity</Title> <Section position="3" start_page="25" end_page="27" type="metho"> <SectionTitle> 2 Distributional Similarity Functions </SectionTitle>
<Paragraph position="0"> In this section, we describe the seven distributional similarity functions we initially evaluated.2 For concreteness, we choose N and V to be the set of nouns and the set of transitive verbs, respectively; a cooccurrence pair (n, v) results when n appears as the head noun of the direct object of v. We use P to denote probabilities assigned by a base language model (in our experiments, we simply used unsmoothed relative frequencies derived from training corpus counts).</Paragraph>
<Paragraph position="1"> Let n and m be two nouns whose distributional similarity is to be determined; for notational simplicity, we write q(v) for P(v|n) and r(v) for P(v|m), their respective conditional verb cooccurrence probabilities.</Paragraph>
<Paragraph position="2"> Figure 1 lists several familiar functions. The cosine metric and Jaccard's coefficient are commonly used in information retrieval as measures of association (Salton and McGill, 1983). Note that Jaccard's coefficient differs from all the other measures we consider in that it is essentially combinatorial, being based only on the sizes of the supports of q, r, and q · r rather than the actual values of the distributions.</Paragraph>
<Paragraph position="3"> Previously, we found the Jensen-Shannon divergence (Rao, 1982; J. Lin, 1991) to be a useful measure of the distance between distributions: JS(q, r) = (1/2) [ D(q || avg_{q,r}) + D(r || avg_{q,r}) ]. The function D is the KL divergence, which measures the (always nonnegative) average inefficiency in using one distribution to code for another (Cover and Thomas, 1991):</Paragraph>
<Paragraph position="4"> D(q || r) = Σ_v q(v) log ( q(v) / r(v) ).</Paragraph>
<Paragraph position="5"> The function avg_{q,r} denotes the average distribution avg_{q,r}(v) = (q(v) + r(v))/2; observe that its use ensures that the Jensen-Shannon divergence is always defined. In contrast, D(q || r) is undefined if q is not absolutely continuous with respect to r (i.e., the support of q is not a subset of the support of r).</Paragraph>
<Paragraph position="6"> 2 Strictly speaking, some of these functions are dissimilarity measures, but each such function f can be recast as a similarity function via the simple transformation C - f, where C is an appropriate constant. Whether we mean f or C - f should be clear from context.</Paragraph>
<Paragraph position="8"> The confusion probability has been used by several authors to smooth word cooccurrence probabilities (Sugawara et al., 1985; Essen and Steinbiss, 1992; Grishman and Sterling, 1993); it measures the degree to which word m can be substituted into the contexts in which n appears. If the base language model probabilities obey certain Bayesian consistency conditions (Dagan et al., 1999), as is the case for relative frequencies, then we may write the confusion probability as follows:</Paragraph>
<Paragraph position="9"> P_conf(m|n) = Σ_v q(v) r(v) P(m) / P(v).</Paragraph>
<Paragraph position="10"> Note that it incorporates unigram probabilities as well as the two distributions q and r.</Paragraph>
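Figure 1 itself is not reproduced in this extraction, so the following Python sketch is purely illustrative rather than the authors' code: it computes the measures discussed above over conditional distributions stored as dicts (q[v] = P(v|n), r[v] = P(v|m)), with p_v an assumed unigram verb distribution P(v) and p_m the unigram probability P(m). The confusion-probability form follows the Bayesian-consistency derivation sketched above and may differ cosmetically from the paper's Figure 1.

```python
import math

# Illustrative implementations of the measures above; q and r are dicts mapping
# verbs to conditional probabilities, p_v maps verbs to unigram probabilities P(v).

def l1(q, r):
    return sum(abs(q.get(v, 0.0) - r.get(v, 0.0)) for v in set(q) | set(r))

def l2(q, r):
    return math.sqrt(sum((q.get(v, 0.0) - r.get(v, 0.0)) ** 2 for v in set(q) | set(r)))

def cosine(q, r):
    dot = sum(q[v] * r[v] for v in set(q) & set(r))
    return dot / (math.sqrt(sum(x * x for x in q.values())) *
                  math.sqrt(sum(x * x for x in r.values())))

def jaccard(q, r):
    # Combinatorial: depends only on the supports of q and r, not on their values.
    return len(set(q) & set(r)) / len(set(q) | set(r))

def kl(q, r):
    # D(q || r); only defined when the support of q is a subset of the support of r.
    return sum(qv * math.log(qv / r[v]) for v, qv in q.items() if qv > 0.0)

def jensen_shannon(q, r):
    avg = {v: 0.5 * (q.get(v, 0.0) + r.get(v, 0.0)) for v in set(q) | set(r)}
    return 0.5 * (kl(q, avg) + kl(r, avg))

def confusion(q, r, p_m, p_v):
    # P_conf(m|n) = sum_v q(v) r(v) P(m) / P(v), per the substitutability reading above.
    return sum(q[v] * r[v] * p_m / p_v[v] for v in set(q) & set(r))
```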
<Paragraph position="11"> Finally, Kendall's τ, which appears in work on clustering similar adjectives (Hatzivassiloglou and McKeown, 1993; Hatzivassiloglou, 1996), is a nonparametric measure of the association between random variables (Gibbons, 1993). In our context, it looks for correlation between the behavior of q and r on pairs of verbs. Three versions exist; we use the simplest, τ_a, here:</Paragraph>
<Paragraph position="12"> τ_a(q, r) = ( Σ_{v1 ≠ v2} sign[ (q(v1) - q(v2)) (r(v1) - r(v2)) ] ) / ( |V| (|V| - 1) ),</Paragraph>
<Paragraph position="13"> where sign(x) is 1 for positive arguments, -1 for negative arguments, and 0 at 0. The intuition behind Kendall's τ is as follows. Assume all verbs have distinct conditional probabilities.</Paragraph>
<Paragraph position="14"> If sorting the verbs by the likelihoods assigned by q yields exactly the same ordering as that which results from ranking them according to r, then τ(q, r) = 1; if it yields exactly the opposite ordering, then τ(q, r) = -1. We treat a value of -1 as indicating extreme dissimilarity.3 It is worth noting at this point that there are several well-known measures from the NLP literature that we have omitted from our experiments. Arguably the most widely used is the mutual information (Hindle, 1990; Church and Hanks, 1990; Dagan et al., 1995; Luk, 1995; D. Lin, 1998a). It does not apply in the present setting because it does not measure the similarity between two arbitrary probability distributions (in our case, P(V|n) and P(V|m)), but rather the similarity between a joint distribution P(X1, X2) and the corresponding product distribution P(X1)P(X2).</Paragraph>
<Paragraph position="15"> Hamming-type metrics (Cardie, 1993; Zavrel and Daelemans, 1997) are intended for data with symbolic features, since they count feature label mismatches, whereas we are dealing with feature values that are probabilities. Variations of the value difference metric (Stanfill and Waltz, 1986) have been employed for supervised disambiguation (Ng and H.B. Lee, 1996; Ng, 1997); but it is not reasonable in language modeling to expect training data tagged with correct probabilities. The Dice coefficient (Smadja et al., 1996; D. Lin, 1998a, 1998b) is monotonic in Jaccard's coefficient (van Rijsbergen, 1979), so its inclusion in our experiments would be redundant. Finally, we did not use the KL divergence because it requires a smoothed base language model.</Paragraph>
<Paragraph position="16"> 3 Zero would also be a reasonable choice, since it indicates zero correlation between q and r. However, it would then not be clear how to average in the estimates of negatively correlated words in equation (1).</Paragraph>
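A minimal sketch (not from the paper) of the τ_a statistic above, in the mathematically equivalent unordered-pair form. It assumes the distributions are dicts and that `verbs` is the full verb vocabulary V; verbs cooccurring with neither noun simply contribute sign 0 terms.

```python
from itertools import combinations

def sign(x):
    # sign(x): 1 for positive arguments, -1 for negative arguments, 0 at 0.
    return (x > 0) - (x < 0)

def kendall_tau_a(q, r, verbs):
    # tau_a over all unordered pairs of distinct verbs from the full vocabulary,
    # i.e. (concordant - discordant) / (|V| choose 2); tied pairs contribute 0.
    num_pairs = 0
    concordance = 0
    for v1, v2 in combinations(verbs, 2):
        concordance += sign((q.get(v1, 0.0) - q.get(v2, 0.0)) *
                            (r.get(v1, 0.0) - r.get(v2, 0.0)))
        num_pairs += 1
    return concordance / num_pairs if num_pairs else 0.0
```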
</Section> <Section position="4" start_page="27" end_page="29" type="metho"> <SectionTitle> 3 Empirical Comparison </SectionTitle>
<Paragraph position="0"> We evaluated the similarity functions introduced in the previous section on a binary decision task, using the same experimental framework as in our previous preliminary comparison (Dagan et al., 1999). That is, the data consisted of the verb-object cooccurrence pairs in the 1988 Associated Press newswire involving the 1000 most frequent nouns, extracted via Church's (1988) and Yarowsky's processing tools. 587,833 (80%) of the pairs served as a training set from which to calculate base probabilities. From the other 20%, we prepared test sets as follows: after discarding pairs occurring in the training data (after all, the point of similarity-based estimation is to deal with unseen pairs), we split the remaining pairs into five partitions, and replaced each noun-verb pair (n, v1) with a noun-verb-verb triple (n, v1, v2) such that P(v2) ≈ P(v1). The task for the language model under evaluation was to reconstruct which of (n, v1) and (n, v2) was the original cooccurrence. Note that by construction, (n, v1) was always the correct answer, and furthermore, methods relying solely on unigram frequencies would perform no better than chance. Test-set performance was measured by the error rate, defined as (1/T) (# of incorrect choices + (# of ties)/2), where T is the number of test triple tokens in the set, and a tie results when both alternatives are deemed equally likely by the language model in question.</Paragraph>
<Paragraph position="1"> To perform the evaluation, we incorporated each similarity function into a decision rule as follows. For a given similarity measure f and neighborhood size k, let S_{f,k}(n) denote the k most similar words to n according to f. We define the evidence according to f for the cooccurrence (n, v_i) as E_{f,k}(n, v_i) = |{ m ∈ S_{f,k}(n) : P(v_i|m) > 0 }|. Then, the decision rule was to choose the alternative with the greatest evidence.</Paragraph>
<Paragraph position="2"> The reason we used a restricted version of the distance-weighted averaging model was that we sought to discover fundamental differences in behavior. Because we have a binary decision task, E_{f,k}(n, v1) simply counts the number of the k nearest neighbors to n that make the right decision. If we have two functions f and g such that E_{f,k}(n, v1) > E_{g,k}(n, v1), then the k most similar words according to f are on the whole better predictors than the k most similar words according to g; hence, f induces an inherently better similarity ranking for distance-weighted averaging. The difficulty with using the full model (Equation (1)) for comparison purposes is that fundamental differences can be obscured by issues of weighting. For example, suppose the probability estimate Σ_r (2 - L1(q, r)) · r(v) (suitably normalized) performed poorly. We would not be able to tell whether the cause was an inherent deficiency in the L1 norm or just a poor choice of weight function; perhaps (2 - L1(q, r))^2 would have yielded better estimates.</Paragraph>
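An illustrative sketch (not the authors' code) of this decision rule and of the error-rate computation from the preceding paragraphs. The helpers `neighbors(f, n, k)` (returning S_{f,k}(n)) and `cond_prob(v, m)` (returning P(v|m) from training counts) are assumed to exist; they are not defined in the paper.

```python
def evidence(f, n, v, k, neighbors, cond_prob):
    # E_{f,k}(n, v): how many of the k nearest neighbors m of n have P(v|m) > 0.
    return sum(1 for m in neighbors(f, n, k) if cond_prob(v, m) > 0)

def error_rate(test_triples, f, k, neighbors, cond_prob):
    # Error rate = (1/T) (# incorrect choices + (# ties)/2) over T test triples
    # (n, v1, v2), where v1 is always the original (correct) cooccurrence.
    incorrect = 0
    ties = 0
    for n, v1, v2 in test_triples:
        e1 = evidence(f, n, v1, k, neighbors, cond_prob)
        e2 = evidence(f, n, v2, k, neighbors, cond_prob)
        if e1 == e2:
            ties += 1        # both alternatives deemed equally likely
        elif e2 > e1:
            incorrect += 1   # the rule picks v2, which is wrong by construction
    return (incorrect + ties / 2) / len(test_triples)
```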
<Paragraph position="3"> Figure 2 shows how the average error rate varies with k for the seven similarity metrics introduced above. As previously mentioned, a steeper slope indicates a better similarity ranking. All the curves have a generally upward trend but always lie far below backoff (51% error rate). They meet at k = 1000 because S_{f,1000}(n) is always the set of all nouns. We see that the functions fall into four groups: (1) the L2 norm; (2) Kendall's τ; (3) the confusion probability and the cosine metric; and (4) the L1 norm, Jensen-Shannon divergence, and Jaccard's coefficient.</Paragraph>
<Paragraph position="4"> We can account for the similar performance of various metrics by analyzing how they incorporate information from the intersection of the supports of q and r. (Recall that we are using q and r for the conditional verb cooccurrence probabilities of two nouns n and m.) Consider the following supports (illustrated in Figure 3): V_q = {v : q(v) > 0}, V_r = {v : r(v) > 0}, and V_qr = V_q ∩ V_r.</Paragraph>
<Paragraph position="5"> We can rewrite the similarity functions from Section 2 in terms of these sets, making use of the identities Σ_{v ∈ V_q\V_qr} q(v) + Σ_{v ∈ V_qr} q(v) = Σ_{v ∈ V_r\V_qr} r(v) + Σ_{v ∈ V_qr} r(v) = 1. (The figure listing these alternative forms in order of performance uses \ for set difference and Δ for symmetric set difference.)</Paragraph>
<Paragraph position="10"> We see that for the non-combinatorial functions, the groups correspond to the degree to which the measures rely on the verbs in V_qr. The Jensen-Shannon divergence and the L1 norm can be computed simply by knowing the values of q and r on V_qr. For the cosine and the confusion probability, the distribution values on V_qr are key, but other information is also incorporated. The statistic τ_a takes into account all verbs, including those that occur with neither n nor m. Finally, the Euclidean distance is quadratic in verbs outside V_qr; indeed, Kaufman and Rousseeuw (1990) note that it is "extremely sensitive to the effect of one or more outliers" (pg. 117).</Paragraph>
<Paragraph position="11"> The superior performance of Jac(q, r) seems to underscore the importance of the set V_qr.</Paragraph>
<Paragraph position="12"> Jaccard's coefficient ignores the values of q and r on V_qr; but we see that simply knowing the size of V_qr relative to the supports of q and r leads to good rankings.</Paragraph>
</Section> <Section position="5" start_page="29" end_page="30" type="metho"> <SectionTitle> 4 The Skew Divergence </SectionTitle>
<Paragraph position="0"> Based on the results just described, it appears desirable to have a similarity function that focuses on the verbs that cooccur with both of the nouns being compared. However, we can make a further observation: with the exception of the confusion probability, all the functions we compared are symmetric, that is, f(q, r) = f(r, q). But the substitutability of one word for another need not be symmetric. For instance, "fruit" may be the best possible approximation to "apple", but the distribution of "apple" may not be a suitable proxy for the distribution of "fruit".4 In accordance with this insight, we developed a novel asymmetric generalization of the KL divergence, the α-skew divergence: s_α(q, r) = D(r || α·q + (1 - α)·r) for 0 ≤ α ≤ 1. It can easily be shown that s_α depends only on the verbs in V_qr. Note that at α = 1, the skew divergence is exactly the KL divergence, and s_{1/2} is twice one of the summands of JS (note that it is still asymmetric).</Paragraph>
<Paragraph position="1"> 4 On a related note, an anonymous reviewer cited the following example from the psychology literature: we can say Smith's lecture is like a sleeping pill, but "not the other way round".</Paragraph>
<Paragraph position="2"> We can think of α as a degree of confidence in the empirical distribution q; or, equivalently, (1 - α) can be thought of as controlling the amount by which one smooths q by r. Thus, we can view the skew divergence as an approximation to the KL divergence to be used when sparse data problems would cause the latter measure to be undefined.</Paragraph>
<Paragraph position="3"> Figure 4 shows the performance of s_α for α = .99. It performs better than all the other functions; the difference with respect to Jaccard's coefficient is statistically significant, according to the paired t-test, at all k (except k = 1000), with significance level .01 at all k except 100, 400, and 1000.</Paragraph>
</Section> </Paper>
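For reference, a minimal sketch (not the authors' code) of the α-skew divergence s_α(q, r) = D(r || α·q + (1 - α)·r) defined in Section 4, again assuming dict-valued distributions with absent verbs treated as zeros.

```python
import math

def skew_divergence(q, r, alpha=0.99):
    # s_alpha(q, r) = D(r || alpha * q + (1 - alpha) * r).
    # At alpha = 1 this reduces to the KL divergence D(r || q); for alpha < 1 the
    # mixture keeps every log term defined even when q is unsmoothed (sparse data).
    total = 0.0
    for v, rv in r.items():
        if rv > 0.0:
            mix = alpha * q.get(v, 0.0) + (1.0 - alpha) * rv
            total += rv * math.log(rv / mix)
    return total
```

With alpha set near 1 (0.99 in the experiments), the measure stays close to the KL divergence while remaining finite for verbs unseen with the second noun.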