<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1004"> <Title>Measures of Distributional Similarity</Title> <Section position="2" start_page="0" end_page="25" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> An inherent problem for statistical methods in natural language processing is that of sparse data -- the inaccurate representation in any training corpus of the probability of low frequency events. In particular, reasonable events that happen not to occur in the training set may mistakenly be assigned a probability of zero.</Paragraph> <Paragraph position="1"> These unseen events generally make up a substantial portion of novel data; for example, Essen and Steinbiss (1992) report that 12% of the test-set bigrams in a 75%-25% split of one million words did not occur in the training partition. We consider here the question of how to estimate the conditional cooccurrence probability P(v|n) of an unseen word pair (n, v) drawn from some finite set N × V. Two state-of-the-art technologies are Katz's (1987) backoff method and Jelinek and Mercer's (1980) interpolation method. Both use P(v) to estimate P(v|n) when (n, v) is unseen, essentially ignoring the identity of n.</Paragraph> <Paragraph position="2"> An alternative approach is distance-weighted averaging, which arrives at an estimate for unseen cooccurrences by combining estimates for cooccurrences involving similar words:1</Paragraph> <Paragraph position="3"> \hat{P}(v \mid n) = \sum_{m \in S(n)} \frac{\mathrm{sim}(n, m)}{\sum_{m' \in S(n)} \mathrm{sim}(n, m')} \, P(v \mid m) \quad (1) </Paragraph> <Paragraph position="4"> where S(n) is a set of candidate similar words and sim(n, m) is a function of the similarity between n and m. 
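The distance-weighted averaging scheme just described can be sketched in a few lines of Python; the function and argument names below are illustrative, not taken from the paper's own implementation:

```python
# Illustrative sketch of distance-weighted averaging (model (1)):
# estimate P(v|n) for an unseen pair (n, v) from words m similar to n.
# `similar` plays the role of S(n); `sim(n, m)` is the similarity score;
# `cond_prob(v, m)` returns the (seen) estimate of P(v|m).
def distance_weighted_estimate(v, n, similar, sim, cond_prob):
    total = sum(sim(n, m) for m in similar)          # normalizer over S(n)
    weighted = sum(sim(n, m) * cond_prob(v, m) for m in similar)
    return weighted / total
```

The estimate is thus a convex combination of the conditional distributions of the similar words, with weights proportional to their similarity to n.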
We focus on distributional rather than semantic similarity (e.g., Resnik (1995)) because the goal of distance-weighted averaging is to smooth probability distributions -- although the words &quot;chance&quot; and &quot;probability&quot; are synonyms, the former may not be a good model for predicting what cooccurrences the latter is likely to participate in.</Paragraph> <Paragraph position="5"> There are many plausible measures of distributional similarity. In previous work (Dagan et al., 1999), we compared the performance of three different functions: the Jensen-Shannon divergence (total divergence to the average), the L1 norm, and the confusion probability. Our experiments on a frequency-controlled pseudoword disambiguation task showed that using any of the three in a distance-weighted averaging scheme yielded large improvements over Katz's backoff smoothing method in predicting unseen cooccurrences. Furthermore, by using a restricted version of model (1) that stripped away incomparable parameters, we were able to demonstrate empirically that the confusion probability is fundamentally worse at selecting useful similar words. D. Lin also found that the choice of similarity function can affect the quality of automatically-constructed thesauri to a statistically significant degree (1998a), and the ability to determine common morphological roots by as much as 49% in precision (1998b).</Paragraph> <Paragraph position="6"> 1The term &quot;similarity-based&quot;, which we have used previously, has been applied to describe other models as well (L. Lee, 1997; Karov and Edelman, 1998).</Paragraph> <Paragraph position="7"> These empirical results indicate that investigating different similarity measures can lead to improved natural language processing. 
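For concreteness, here are standard textbook forms of two of the compared measures -- the L1 norm and the Jensen-Shannon divergence -- over distributions represented as dictionaries mapping events to probabilities. This is an illustrative sketch, not the experimental code:

```python
import math

def l1_distance(p, q):
    # L1 norm: sum of absolute differences over the union of supports.
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def jensen_shannon(p, q):
    # Total divergence to the average: D(p||m)/2 + D(q||m)/2,
    # where m is the midpoint distribution (p + q)/2. Always finite,
    # since m is nonzero wherever p or q is.
    keys = set(p) | set(q)
    js = 0.0
    for k in keys:
        pk, qk = p.get(k, 0.0), q.get(k, 0.0)
        mk = 0.5 * (pk + qk)
        if pk > 0.0:
            js += 0.5 * pk * math.log(pk / mk)
        if qk > 0.0:
            js += 0.5 * qk * math.log(qk / mk)
    return js
```

For completely disjoint distributions, the L1 distance attains its maximum of 2 and the Jensen-Shannon divergence its maximum of log 2 (in nats).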
On the other hand, while many similarity measures have been proposed and analyzed in the information retrieval literature (Jones and Furnas, 1987), some doubt has been expressed in that community that the choice of similarity metric has any practical impact: &quot;Several authors have pointed out that the difference in retrieval performance achieved by different measures of association is insignificant, providing that these are appropriately normalised&quot; (van Rijsbergen, 1979, pg. 38).</Paragraph> <Paragraph position="8"> But no contradiction arises because, as van Rijsbergen continues, &quot;one would expect this since most measures incorporate the same information&quot;. In the language-modeling domain, there is currently no agreed-upon best similarity metric because there is no agreement on what the &quot;same information&quot; -- the key data that a similarity function should incorporate -- is.</Paragraph> <Paragraph position="9"> The overall goal of the work described here was to discover these key characteristics. To this end, we first compared a number of common similarity measures, evaluating them in a parameter-free way on a decision task. When grouped by average performance, they fell into several coherent classes, which corresponded to the extent to which the functions focused on the intersection of the supports (regions of positive probability) of the distributions. Using this insight, we developed an information-theoretic metric, the skew divergence, which incorporates the support-intersection data in an asymmetric fashion. 
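A minimal sketch of such an asymmetric, support-sensitive measure follows, assuming the skew divergence takes the form s_alpha(q, r) = D(r || alpha*q + (1 - alpha)*r), i.e., one argument is smoothed toward the other before a KL divergence is taken; treat this form, the name, and the choice alpha = 0.99 as our reading of the description here rather than the paper's verbatim definition:

```python
import math

def skew_divergence(q, r, alpha=0.99):
    # Assumed form: s_alpha(q, r) = D( r || alpha*q + (1 - alpha)*r ).
    # Mixing a little of r into q's role keeps the second argument of the
    # KL divergence nonzero wherever r is, so the result stays finite,
    # while the measure remains asymmetric in q and r.
    keys = set(q) | set(r)
    d = 0.0
    for k in keys:
        rk = r.get(k, 0.0)
        if rk > 0.0:
            mixed = alpha * q.get(k, 0.0) + (1.0 - alpha) * rk
            d += rk * math.log(rk / mixed)
    return d
```

Note the asymmetry: events in the support of r but not of q are penalized only up to log(1/(1 - alpha)), rather than driving the divergence to infinity as in the unsmoothed KL divergence.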
This function yielded the best performance overall: an average error rate reduction of 4% (significant at the .01 level) with respect to the Jensen-Shannon divergence, the best predictor of unseen events in our earlier experiments (Dagan et al., 1999).</Paragraph> <Paragraph position="10"> Our contributions are thus three-fold: an empirical comparison of a broad range of similarity metrics using an evaluation methodology that factors out inessential degrees of freedom; a proposal, building on this comparison, of a characteristic for classifying similarity functions; and the introduction of a new similarity metric incorporating this characteristic that is superior at evaluating potential proxy distributions.</Paragraph> </Section> </Paper>