<?xml version="1.0" standalone="yes"?>
<Paper uid="P93-1022">
  <Title>CONTEXTUAL WORD SIMILARITY AND ESTIMATION FROM SPARSE DATA</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Statistical data on word cooccurrence relations play a major role in many corpus based approaches for natural language processing. Different types of cooccurrence relations are in use, such as cooccurrence within a consecutive sequence of words (n-grams), within syntactic relations (verb-object, adjective-noun, etc.) or the cooccurrence of two words within a limited distance in the context. Statistical data about these various cooccurrence relations are employed for a variety of applications, such as speech recognition (Jelinek, 1990), language generation (Smadja and McKeown, 1990), lexicography (Church and Hanks, 1990), machine translation (Brown et al.; Sadler, 1989), information retrieval (Maarek and Smadja, 1989) and various disambiguation tasks (Dagan et al., 1991; Hindle and Rooth, 1991; Grishman et al., 1986; Dagan and Itai, 1990).</Paragraph>
    <Paragraph position="1"> A major problem for the above applications is how to estimate the probability of cooccurrences that were not observed in the training corpus. Due to data sparseness in unrestricted language, the aggregate probability of such cooccurrences is large and can easily get to 25% or more, even for a very large training corpus (Church and Mercer, 1992).</Paragraph>
    <Paragraph position="2"> Since applications often have to compare alternative hypothesized cooccurrences, it is important to distinguish between those unobserved cooccurrences that are likely to occur in a new piece of text and those that are not. These distinctions ought to be made using the data that do occur in the corpus. Thus, beyond its own practical importance, the sparse data problem provides an informative touchstone for theories on generalization and analogy in linguistic data.</Paragraph>
    <Paragraph position="3"> The literature suggests two major approaches for solving the sparse data problem: smoothing and class based methods. Smoothing methods estimate the probability of unobserved cooccurrences using frequency information (Good, 1953; Katz, 1987; Jelinek and Mercer, 1985; Church and Gale, 1991). Church and Gale (1991) show that, for unobserved bigrams, the estimates of several smoothing methods closely agree with the probability that is expected using the frequencies of the two words and assuming that their occurrence is independent ((Church and Gale, 1991), figure 5).</Paragraph>
    <Paragraph position="4"> Furthermore, using held out data they show that this is the probability that should be estimated by a smoothing method that takes into account the frequencies of the individual words. Relying on this result, we will use frequency based estimation (using word frequencies) as representative for smoothing estimates of unobserved cooccurrences, for comparison purposes. As will be shown later, the problem with smoothing estimates is that they ignore the expected degree of association between the specific words of the cooccurrence. For example, we would not like to estimate the same probability for two cooccurrences like 'eat bread' and 'eat cars', despite the fact that both 'bread' and 'cars' may have the same frequency.</Paragraph>
    <Paragraph position="5"> Class based models (Brown et al.; Pereira et al., 1993; Hirschman, 1986; Resnik, 1992) distinguish between unobserved cooccurrences using classes of "similar" words. The probability of a specific cooccurrence is determined using generalized parameters about the probability of class cooccurrence. This approach, which follows long traditions in semantic classification, is very appealing, as it attempts to capture "typical" properties of classes of words. However, it is not clear at all that unrestricted language is indeed structured the way it is assumed by class based models. In particular, it is not clear that word cooccurrence patterns can be structured and generalized to class cooccurrence parameters without losing too much information.</Paragraph>
    <Paragraph position="6"> This paper suggests an alternative approach which assumes that class based generalizations should be avoided, and therefore eliminates the intermediate level of word classes. Like some of the class based models, we use a similarity metric to measure the similarity between cooccurrence patterns of words. But then, rather than using this metric to construct a set of word classes, we use it to identify the most specific analogies that can be drawn for each specific estimation. Thus, to estimate the probability of an unobserved cooccurrence of words, we use data about other cooccurrences that were observed in the corpus, and contain words that are similar to the given ones. For example, to estimate the probability of the unobserved cooccurrence 'negative results', we use cooccurrences such as 'positive results' and 'negative numbers', that do occur in our corpus.</Paragraph>
    <Paragraph position="7"> The analogies we make are based on the assumption that similar word cooccurrences have similar values of mutual information. Accordingly, our similarity metric was developed to capture similarities between vectors of mutual information values. In addition, we use an efficient search heuristic to identify the most similar words for a given word, thus making the method computationally affordable. Figure 1 illustrates a portion of the similarity network induced by the similarity metric (only some of the edges, with relatively high values, are shown). This network may be found useful for other purposes, independently of the estimation method.</Paragraph>
    <Paragraph position="8"> The estimation method was implemented using the relation of cooccurrence of two words within a limited distance in a sentence. The proposed method, however, is general and is applicable for any type of lexical cooccurrence. The method was evaluated in two experiments. In the first one we achieved a complete scenario of the use of the estimation method, by implementing a variant of the disambiguation method in (Dagan et al., 1991), for sense selection in machine translation. The estimation method was then successfully used to increase the coverage of the disambiguation method by 15%, with an increase of the overall precision compared to a naive, frequency based, method. In the second experiment we evaluated the estimation method on a data recovery task. The task simulates a typical scenario in disambiguation, and also relates to theoretical questions about redundancy and idiosyncrasy in cooccurrence data. In this evaluation, which involved 300 examples, the performance of the estimation method was 27% better than frequency based estimation.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="165" type="metho">
    <SectionTitle>
2 Definitions
</SectionTitle>
    <Paragraph position="0"> We use the term cooccurrence pair, written as (x, y), to denote a cooccurrence of two words in a sentence within a distance of no more than d words.</Paragraph>
    <Paragraph position="1"> When computing the distance d, we ignore function words such as prepositions and determiners. In the experiments reported here d = 3.</Paragraph>
    <Paragraph position="2"> A cooccurrence pair can be viewed as a generalization of a bigram, where a bigram is a cooccurrence pair with d = 1 (without ignoring function words). As with bigrams, a cooccurrence pair is directional, i.e. (x, y) ≠ (y, x). This captures some information about the asymmetry in the linear order of linguistic relations, such as the fact that verbs tend to precede their objects and follow their subjects.</Paragraph>
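As an illustration of this definition, the following minimal sketch collects directional cooccurrence pairs within a window of d content words. The tokenization and the stop list of function words are assumptions of the sketch; the paper does not specify them.

from collections import Counter

# Illustrative stop list; the paper does not enumerate its function words.
FUNCTION_WORDS = {"a", "an", "the", "of", "to", "in", "on", "for", "and"}

def cooccurrence_pairs(tokens, d=3):
    """Yield directional cooccurrence pairs (x, y), where y follows x within
    a distance of at most d words, function words being ignored."""
    content = [w for w in tokens if w.lower() not in FUNCTION_WORDS]
    for i, x in enumerate(content):
        for y in content[i + 1 : i + 1 + d]:
            yield (x, y)  # directional: (x, y) is distinct from (y, x)

counts = Counter(cooccurrence_pairs(["to", "sign", "a", "peace", "treaty"]))
print(counts)  # pairs such as ('sign', 'treaty') and ('peace', 'treaty')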
    <Paragraph position="3"> The mutual information of a cooccurrence pair, which measures the degree of association between the two words (Church and Hanks, 1990), is defined as (Fano, 1961):</Paragraph>
    <Paragraph position="4"> I(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]   (1)</Paragraph>
    <Paragraph position="5"> where P(x) and P(y) are the probabilities of the events x and y (occurrences of words, in our case) and P(x, y) is the probability of the joint event (a cooccurrence pair).</Paragraph>
    <Paragraph position="6"> We estimate mutual information values using the Maximum Likelihood Estimator (MLE):</Paragraph>
    <Paragraph position="7"> Î(x, y) = log2 [ (f(x, y)/N) / ((f(x)/N) (f(y)/N)) ] = log2 [ N · f(x, y) / (f(x) · f(y)) ]   (2)</Paragraph>
    <Paragraph position="8"> where f denotes the frequency of an event and N is the length of the corpus. While better estimates for small probabilities are available (Good, 1953; Church and Gale, 1991), MLE is the simplest to implement and was adequate for the purpose of this study. Due to the unreliability of measuring negative mutual information values in corpora that are not extremely large, we have considered in this work any negative value to be 0. We also set Î(x, y) to 0 if f(x, y) = 0. Thus, we assume in both cases that the association between the two words is as expected by chance.</Paragraph>
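As a concrete illustration, here is a minimal sketch of this estimator, assuming raw counts f(x), f(y), f(x, y) and the corpus length N are available; the normalization of the joint count is the plain MLE written above, and the clipping of unobserved and negative values follows the text. The numbers in the usage line are hypothetical.

import math

def mi_estimate(f_xy, f_x, f_y, N):
    """MLE of the mutual information of a cooccurrence pair, with the
    clipping described in the text: unobserved pairs and negative values
    are both treated as zero association."""
    if f_xy == 0:
        return 0.0
    return max(0.0, math.log2(N * f_xy / (f_x * f_y)))

print(mi_estimate(12, 1000, 1000, 9_000_000))  # hypothetical counts -> about 6.75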
    <Paragraph position="9"> [Figure 1: A portion of the similarity network induced by the similarity metric (only some of the edges, with relatively high values, are shown); the surviving fragment includes the words 'paper', 'papers', 'articles' and 'conference' with numeric edge labels.]</Paragraph>
  </Section>
  <Section position="5" start_page="165" end_page="166" type="metho">
    <SectionTitle>
3 Estimation for an Unobserved Cooccurrence
</SectionTitle>
    <Paragraph position="0"> Assume that we have at our disposal a method for determining similarity between cooccurrence patterns of two words (as described in the next section). We say that two cooccurrence pairs, (w1, w2) and (w1', w2'), are similar if w1' is similar to w1 and w2' is similar to w2. A special (and stronger) case of similarity is when the two pairs differ only in one of their words (e.g. (w1, w2') and (w1, w2)).</Paragraph>
    <Paragraph position="1"> This special case is less susceptible to noise than unrestricted similarity, as we replace only one of the words in the pair. In our experiments, which involved rather noisy data, we have used only this restricted type of similarity. The mathematical formulations, though, are presented in terms of the general case.</Paragraph>
    <Paragraph position="2"> The question that arises now is what analogies can be drawn between two similar cooccurrence pairs, (w1, w2) and (w1', w2'). Their probabilities cannot be expected to be similar, since the probabilities of the words in each pair can be different. However, since we assume that w1 and w1' have similar cooccurrence patterns, and so do w2 and w2', it is reasonable to assume that the mutual information of the two pairs will be similar (recall that mutual information measures the degree of association between the words of the pair).</Paragraph>
    <Paragraph position="3"> Consider for example the pair (chapter, describes), which does not occur in our corpus (we used a corpus of about 9 million words of texts in the computer domain, taken from articles posted to the USENET news system). This pair was found to be similar to the pairs (introduction, describes), (book, describes) and (section, describes), that do occur in the corpus.</Paragraph>
    <Paragraph position="4"> Since these pairs occur in the corpus, we estimate their mutual information values using equation 2, as shown in Table 1. We then take the average of these mutual information values as the similarity based estimate for I(chapter, describes), denoted as Ĩ(chapter, describes) (we use Ĩ for similarity based estimates, and reserve Î for the traditional maximum likelihood estimate; the similarity based estimate will be used for cooccurrence pairs that do not occur in the corpus). This represents the assumption that the word 'describes' is associated with the word 'chapter' to a similar extent as it is associated with the words 'introduction', 'book' and 'section'. Table 2 demonstrates how the analogy is carried out also for a pair of unassociated words, such as (chapter, knows).</Paragraph>
    <Paragraph position="5"> In our current implementation, we compute Ĩ(w1, w2) using up to the 6 most similar words to each of w1 and w2, and averaging the mutual information values of similar pairs that occur in the corpus (6 is a parameter, tuned for our corpus; in some cases the similarity method identifies fewer than 6 similar words).</Paragraph>
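A minimal sketch of this averaging step is given below. The mapping from a word to its most similar words, the table of observed mutual information values, and the numeric values themselves are hypothetical placeholders; only the one-word-replacement averaging scheme follows the text.

def similarity_based_mi(w1, w2, most_similar, observed_mi, k=6):
    """Estimate the mutual information of an unobserved pair (w1, w2) by
    averaging the mutual information of observed pairs in which w1 (or w2)
    is replaced by one of its most similar words."""
    values = []
    for s in most_similar.get(w1, [])[:k]:
        if (s, w2) in observed_mi:
            values.append(observed_mi[(s, w2)])
    for s in most_similar.get(w2, [])[:k]:
        if (w1, s) in observed_mi:
            values.append(observed_mi[(w1, s)])
    return sum(values) / len(values) if values else 0.0

# Hypothetical numbers, in the spirit of Table 1:
mi = {("introduction", "describes"): 6.9, ("book", "describes"): 6.3,
      ("section", "describes"): 6.1}
sim_words = {"chapter": ["introduction", "book", "section"]}
print(similarity_based_mi("chapter", "describes", sim_words, mi))  # average of the three values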
    <Paragraph position="6"> Having an estimate for the mutual information of a pair, we can estimate its expected frequency in a corpus of the given size using a variation of equation 2:</Paragraph>
    <Paragraph position="7"> f̃(w1, w2) = (f(w1) · f(w2) / N) · 2^Ĩ(w1, w2)   (3)</Paragraph>
    <Paragraph position="8"> For our corpus, in which d = 3, this gives a similarity based estimate of f̃(chapter, describes) = 3.15. This value is much higher than the frequency based estimate (0.037), reflecting the plausibility of the specific combination of words (the expected frequency of a cooccurrence pair, assuming independent occurrence of the two words and thus using only their individual frequencies, is f(w1) · f(w2) / N; as mentioned earlier, we use this estimate as representative for smoothing estimates of unobserved cooccurrences). On the other hand, the similarity based estimate for f̃(chapter, knows) is 0.124, which is identical to the frequency based estimate, reflecting the fact that there is no expected association between the two words (notice that the frequency based estimate is higher for the second pair, due to the higher frequency of 'knows').</Paragraph>
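For completeness, a one-line sketch of how an expected frequency is obtained from an estimated mutual information value by inverting the MLE formula above; the paper's equation 3 may differ in constant factors related to the window size d, and the values below are hypothetical.

def expected_frequency(f_w1, f_w2, N, mi):
    """Expected corpus frequency of a pair given an estimate of its
    mutual information (inversion of the MLE formula)."""
    return (f_w1 * f_w2 / N) * 2 ** mi

print(expected_frequency(400, 280, 9_000_000, 6.43))  # hypothetical values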
  </Section>
  <Section position="6" start_page="166" end_page="167" type="metho">
    <SectionTitle>
4 The Similarity Metric
</SectionTitle>
    <Paragraph position="0"> Assume that we need to determine the degree of similarity between two words, w1 and w2. Recall that if we decide that the two words are similar, then we may infer that they have similar mutual information with some other word, w. This inference would be reasonable if we find that on average w1 and w2 indeed have similar mutual information values with other words in the lexicon. The similarity metric therefore measures the degree of similarity between these mutual information values.</Paragraph>
    <Paragraph position="1"> We first define the similarity between the mutual information values of w1 and w2 relative to a single other word, w. Since cooccurrence pairs are directional, we get two measures, defined by the position of w in the pair. The left context similarity of w1 and w2 relative to w, termed simL(w1, w2, w), is defined as the ratio between the two mutual information values, having the larger value in the denominator:</Paragraph>
    <Paragraph position="2"> simL(w1, w2, w) = min(I(w, w1), I(w, w2)) / max(I(w, w1), I(w, w2))   (4)</Paragraph>
    <Paragraph position="4"> This way we get a uniform scale between 0 and 1, in which higher values reflect higher similarity. If both mutual information values are 0, then simL(w1, w2, w) is defined to be 0. The right context similarity, simR(w1, w2, w), is defined equivalently, for I(w1, w) and I(w2, w). (In the case of cooccurrence pairs, a word may be involved in two types of relations, being the left or right argument of the pair. The definitions can be easily adapted to cases in which there are more types of relations, such as those provided by syntactic parsing.)</Paragraph>
    <Paragraph position="5"> Using definition 4 for each word w in the lexicon, we get 2 * l similarity values for w1 and w2, where l is the size of the lexicon. The general similarity between w1 and w2, termed sim(w1, w2), is defined as a weighted average of these 2 * l values.</Paragraph>
    <Paragraph position="6"> It is necessary to use some weighting mechanism, since small values of mutual information tend to be less significant and more vulnerable to noisy data.</Paragraph>
    <Paragraph position="7"> We found that the maximal value involved in computing the similarity relative to a specific word provides a useful weight for this word in computing the average. Thus, the weight for a specific left context similarity value, WL(w1, w2, w), is defined as: WL(w1, w2, w) = max(I(w, w1), I(w, w2))   (5) (notice that this is the same as the denominator in definition 4). This definition provides intuitively appropriate weights, since we would like to give more weight to context words that have a large mutual information value with at least one of w1 and w2. The mutual information value with the other word may then be large, providing a strong "vote" for similarity, or may be small, providing a strong "vote" against similarity. The weight for a specific right context similarity value is defined equivalently. Using these weights, we get the weighted average in Figure 2 as the general definition of sim(w1, w2).</Paragraph>
    <Paragraph position="9"> [Figure 2: The general definition of sim(w1, w2) as the weighted average, over all words w in the lexicon, of the left and right context similarity values, weighted by WL and WR respectively.]</Paragraph>
    <Paragraph position="10"> [Table 3: The most similar words of 'aspects'; the exhaustive search and the heuristic search produce nearly the same similarity values.]</Paragraph>
    <Paragraph position="11"> The values produced by our metric have an intuitive interpretation, as denoting a "typical" ratio between the mutual information values of each of the two words with another third word. The metric is reflexive (sim(w, w) = 1) and symmetric (sim(w1, w2) = sim(w2, w1)), but is not transitive (the values of sim(w1, w2) and sim(w2, w3) do not imply anything about the value of sim(w1, w3)). (The numerator in our metric resembles the similarity metric in (Hindle, 1990); we found, however, that the difference between the two metrics is important, because the denominator serves as a normalization factor.) The left column of Table 3 lists the six most similar words to the word 'aspects' according to this metric, based on our corpus. More examples of similarity were shown in Figure 1.</Paragraph>
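The following sketch puts definitions 4 and 5 together as a weighted average over the lexicon. It follows the textual description only; the exact formula appears in the paper's Figure 2, which is not reproduced in this extraction, so details such as the handling of zero weights are assumptions.

def sim(w1, w2, I, lexicon):
    """Weighted-average similarity between w1 and w2; I maps a directional
    pair (x, y) to its (clipped) mutual information, 0.0 when unobserved."""
    num = den = 0.0
    for w in lexicon:
        # left context (w, wi) and right context (wi, w) contributions
        for a, b in ((I.get((w, w1), 0.0), I.get((w, w2), 0.0)),
                     (I.get((w1, w), 0.0), I.get((w2, w), 0.0))):
            weight = max(a, b)                          # definition 5
            if weight > 0.0:
                num += (min(a, b) / weight) * weight    # ratio of definition 4, weighted
                den += weight
    return num / den if den > 0.0 else 0.0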
    <Section position="1" start_page="167" end_page="167" type="sub_section">
      <SectionTitle>
4.1 An efficient search heuristic
</SectionTitle>
      <Paragraph position="0"> The estimation method of section 3 requires that we identify the most similar words of a given word w. Doing this by computing the similarity between w and each word in the lexicon is computationally very expensive (O(l²), where l is the size of the lexicon, and O(l³) to do this in advance for all the words in the lexicon). To account for this problem we developed a simple heuristic that searches for words that are potentially similar to w, using thresholds on mutual information values and frequencies of cooccurrence pairs. The search is based on the property that when computing sim(w1, w2), words that have high mutual information values with both w1 and w2 make the largest contributions to the value of the similarity measure. Also, high and reliable mutual information values are typically associated with relatively high frequencies of the involved cooccurrence pairs. We therefore search first for all the "strong neighbors" of w, which are defined as words whose cooccurrence with w has high mutual information and high frequency, and then search for all their "strong neighbors". The words found this way ("the strong neighbors of the strong neighbors of w") are considered as candidates for being similar words of w, and the similarity value with w is then computed only for these words. We thus get an approximation for the set of words that are most similar to w. For the example given in Table 3, the exhaustive method required 17 minutes of CPU time on a Sun 4 workstation, while the approximation required only 7 seconds. This was done using a data base of 1,377,653 cooccurrence pairs that were extracted from the corpus, along with their counts.</Paragraph>
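A minimal sketch of this heuristic is given below. The concrete thresholds and the data structure holding the cooccurrence pairs are assumptions; the text only states that "strong neighbors" are words whose cooccurrence with w has high mutual information and high frequency.

def candidate_similar_words(w, pairs, mi_threshold=3.0, freq_threshold=5):
    """Approximate the set of words potentially similar to w by collecting
    the 'strong neighbors of the strong neighbors of w'.
    `pairs` maps a directional pair (x, y) to (frequency, mutual_information)."""
    def strong_neighbors(word):
        neighbors = set()
        for (x, y), (f, mi) in pairs.items():
            if word in (x, y) and f >= freq_threshold and mi >= mi_threshold:
                neighbors.add(y if x == word else x)
        return neighbors

    candidates = set()
    for n in strong_neighbors(w):
        candidates |= strong_neighbors(n)
    candidates.discard(w)
    return candidates  # compute sim(w, c) only for these candidate words c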
    </Section>
  </Section>
  <Section position="7" start_page="167" end_page="169" type="metho">
    <SectionTitle>
5 Evaluations
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="167" end_page="168" type="sub_section">
      <SectionTitle>
5.1 Word sense disambiguation in machine translation
</SectionTitle>
      <Paragraph position="0"> The purpose of the first evaluation was to test whether the similarity based estimation method can enhance the performance of a disambiguation technique. Typically in a disambiguation task, different cooccurrences correspond to alternative interpretations of the ambiguous construct. It is therefore necessary that the probability estimates for the alternative cooccurrences reflect the relative order between their true probabilities. However, a consistent bias in the estimate is usually not harmful, as it still preserves the correct relative order between the alternatives.</Paragraph>
      <Paragraph position="1"> To carry out the evaluation, we implemented a variant of the disambiguation method of (Dagan et al., 1991), for sense disambiguation in machine translation. We term this method TWS, for Target Word Selection. Consider for example the Hebrew phrase 'laxtom xoze shalom', which translates as 'to sign a peace treaty'. The word 'laxtom', however, is ambiguous, and can be translated as either 'sign' or 'seal'. To resolve the ambiguity, the</Paragraph>
      <Paragraph position="0"> TWS method first generates the alternative lexical cooccurrence patterns in the target language, that correspond to alternative selections of target words. Then, it prefers those target words that generate more frequent patterns. In our example, the word 'sign' is preferred over the word 'seal', since the pattern 'to sign a treaty' is much more frequent than the pattern 'to seal a treaty'. Similarly, the word 'xoze' is translated to 'treaty' rather than 'contract', due to the high frequency of the pattern 'peace treaty'. (It should be emphasized that the TWS method uses only a monolingual target corpus, and not a bilingual corpus as in other methods (Brown et al., 1991; Gale et al., 1992). The alternative cooccurrence patterns in the target language, which correspond to the alternative translations of the ambiguous source words, are constructed using a bilingual lexicon.) In our implementation, cooccurrence pairs were used instead of lexical cooccurrence within syntactic relations (as in the original work), to avoid the need to parse the corpus.</Paragraph>
      <Paragraph position="1"> We randomly selected from a software manual a set of 269 examples of ambiguous Hebrew words in translating Hebrew sentences to English. The expected success rate of random selection for these examples was 23%. The similarity based estimation method was used to estimate the expected frequency of unobserved cooccurrence pairs, in cases where none of the alternative pairs occurred in the corpus (each pair corresponds to an alternative target word). Using this method, which we term Augmented TWS, 41 additional cases were disambiguated, relative to the original method. We thus achieved an increase of about 15% in the applicability (coverage) of the TWS method, with a small decrease in the overall precision. The performance of the Augmented TWS method on these 41 examples was about 15% higher than that of a naive, Word Frequency method, which always selects the most frequent translation. It should be noted that the Word Frequency method is equivalent to using the frequency based estimate, in which higher word frequencies entail a higher estimate for the corresponding cooccurrence. The results of the experiment are summarized in Table 4.</Paragraph>
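The following sketch illustrates the selection logic of the Augmented TWS idea as described above: prefer the target word whose cooccurrence pattern is most frequent, and fall back on the similarity based frequency estimate when none of the alternatives was observed. Function names and counts are illustrative, not taken from the paper.

def choose_translation(alternatives, observed_freq, estimated_freq):
    """Prefer the alternative target pattern with the highest observed
    frequency; if none of the alternative pairs occurs in the corpus,
    fall back on the similarity based frequency estimate."""
    scored = [(observed_freq.get(pair, 0), pair) for pair in alternatives]
    if all(score == 0 for score, _ in scored):
        scored = [(estimated_freq(pair), pair) for pair in alternatives]
    return max(scored)[1]

# Hypothetical example for the Hebrew word 'laxtom' ('sign' vs. 'seal'):
alternatives = [("sign", "treaty"), ("seal", "treaty")]
counts = {("sign", "treaty"): 28, ("seal", "treaty"): 1}        # made-up counts
print(choose_translation(alternatives, counts, lambda p: 0.0))  # ('sign', 'treaty')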
    </Section>
    <Section position="3" start_page="168" end_page="169" type="sub_section">
      <SectionTitle>
5.2 A data recovery task
</SectionTitle>
      <Paragraph position="0"> In the second evaluation, the estimation method had to distinguish between members of two sets of cooccurrence pairs, one of them containing pairs with relatively high probability and the other pairs with low probability. To a large extent, this task simulates a typical scenario in disambiguation, as demonstrated in the first evaluation.</Paragraph>
      <Paragraph position="2"> Ideally, this evaluation should be carried out using a large set of held out data, which would provide good estimates for the true probabilities of the pairs in the test sets. The estimation method should then use a much smaller training corpus, in which none of the example pairs occur, and try to recover the probabilities that are known to us from the held out data. However, such a setting requires a held out corpus several times larger than the training corpus, while the latter should itself be large enough for robust application of the estimation method. This was not feasible with the size of our corpus, and the rather noisy data we had.</Paragraph>
      <Paragraph position="3"> To avoid this problem, we obtained the set of pairs with high probability from the training corpus, selecting pairs that occur at least 5 times.</Paragraph>
      <Paragraph position="4"> We then deleted these pairs from the data base that is used by the estimation method, forcing the method to recover their probabilities using the other pairs of the corpus. The second set, of pairs with low probability, was obtained by constructing pairs that do not occur in the corpus. The two sets, each containing 150 pairs, were constructed randomly and were restricted to words with individual frequencies between 500 and 2500. We term these two sets the occurring and non-occurring sets.</Paragraph>
      <Paragraph position="5"> The task of distinguishing between members of the two sets, without access to the deleted frequency information, is by no means trivial. Trying to use the individual word frequencies will result in performance close to that of using random selection. This is because the individual frequencies of all participating words are within the same range of values.</Paragraph>
      <Paragraph position="6"> To address the task, we used the following procedure: The frequency of each cooccurrence pair was estimated using the similarity-based estimation method. If the estimated frequency was above 2.5 (which was set arbitrarily as the average of 5 and 0), the pair was recovered as a member of the occurring set. Otherwise, it was recovered as a member of the non-occurring set.</Paragraph>
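A minimal sketch of this recovery procedure, assuming an estimate_frequency function such as the one sketched in section 3; the threshold of 2.5 is the one given in the text, and the pairs in the usage line are placeholders.

def evaluate_recovery(occurring, non_occurring, estimate_frequency, threshold=2.5):
    """A pair is recovered as a member of the occurring set if its estimated
    frequency exceeds the threshold, and as a member of the non-occurring
    set otherwise; returns the overall accuracy."""
    correct = sum(estimate_frequency(p) > threshold for p in occurring)
    correct += sum(estimate_frequency(p) <= threshold for p in non_occurring)
    return correct / (len(occurring) + len(non_occurring))

# Degenerate usage example with a constant estimator (illustration only):
print(evaluate_recovery([("a", "b")], [("c", "d")], lambda p: 3.0))  # 0.5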
      <Paragraph position="7"> Out of the 150 pairs of the occurring set, our method correctly identified 119 (79%). For the non-occurring set, it correctly identified 126 pairs (84%). Thus, the method achieved an overall accuracy of 81.6%. Optimal tuning of the threshold, to a value of 2, improves the overall accuracy to 85%, where about 90% of the members of the occurring set and 80% of those in the non-occurring set are identified correctly. This is contrasted with the optimal discrimination that could be achieved by frequency based estimation, which is 58%.</Paragraph>
      <Paragraph position="8"> Figures 3 and 4 illustrate the results of the experiment. Figure 3 shows the distributions of the expected frequency of the pairs in the two sets, using similarity based and frequency based estimation. It clearly indicates that the similarity based method gives high estimates mainly to members of the occurring set and low estimates mainly to members of the non-occurring set. Frequency based estimation, on the other hand, makes a much poorer distinction between the two sets. Figure 4 plots the two types of estimation for pairs in the occurring set as a function of their true frequency in the corpus. It can be seen that while the frequency based estimates are always low (by construction) the similarity based estimates are in most cases closer to the true value.</Paragraph>
    </Section>
  </Section>
</Paper>