<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2127">
  <Title>Automatic Retrieval and Clustering of Similar Words</Title>
  <Section position="3" start_page="768" end_page="770" type="metho">
    <SectionTitle>
2 Word Similarity
</SectionTitle>
    <Paragraph position="0"> Our similarity measure is based on a proposal in (Lin, 1997), where the similarity between two objects is defined to be the amount of information contained in the commonality between the objects divided by the amount of information in the descriptions of the objects.</Paragraph>
    <Paragraph position="1"> We use a broad-coverage parser (Lin, 1993; Lin, 1994) to extract dependency triples from the text corpus. A dependency triple consists of two words and the grammatical relationship between them in the input sentence. For example, the triples extracted from the sentence &amp;quot;I have a brown dog&amp;quot; are: (2) (have subj I), (I subj-of have), (dog obj-of have), (dog adj-mod brown), (brown adj-mod-of dog), (dog det a), (a det-of dog) We use the notation IIw, r, w'll to denote the frequency count of the dependency triple (w, r, w ~) in the parsed corpus. When w, r, or w ~ is the wild card (*), the frequency counts of all the dependency triples that matches the rest of the pattern are summed up. For example, Ilcook, obj, *11 is the total occurrences of cook-object relationships in the parsed corpus, and I1., *, *11 is the total number of dependency triples extracted from the parsed corpus. null The description of a word w consists of the frequency counts of all the dependency triples that matches the pattern (w,., .). The commonality between two words consists of the dependency triples that appear in the descriptions of both words. For example, (3) is the the description of the word &amp;quot;cell&amp;quot;.</Paragraph>
    <Paragraph position="2">  (3) Ilcell, subj-of, absorbll=l  Ilcell, nmod, blood vesselH=l IIcell, nmod, bodYll=2 Ilcell, nmod, bone marrowll=2 Ilcell, nmod, burialH=l Ilcell, nmod, chameleonll=l Assuming that the frequency counts of the dependency triples are independent of each other, the information contained in the description of a word is the sum of the information contained in each individual frequency count.</Paragraph>
    <Paragraph position="3"> To measure the information contained in the statement IIw, r, w' H=c, we first measure the amount of information in the statement that a randomly selected dependency triple is (w, r, w') when we do not know the value of IIw, r,w'll. We then measure the amount of information in the same statement when we do know the value of II w, r, w' II. The difference between these two amounts is taken to be the information contained in Hw, r, w' \[l=c.</Paragraph>
    <Paragraph position="4"> An occurrence of a dependency triple (w, r, w') can be regarded as the co-occurrence of three events: A: a randomly selected word is w; B: a randomly selected dependency type is r; C: a randomly selected word is w ~.</Paragraph>
    <Paragraph position="5"> When the value of Ilw, r,w'll is unknown, we assume that A and C are conditionally independent given B. The probability of A, B and C co-occurring is estimated by PMLE( B ) PMLE( A\[B ) PMLE( C\[B ), where PMLE is the maximum likelihood estimation of a probability distribution and</Paragraph>
    <Paragraph position="7"> When the value of Hw, r, w~H is known, we can obtain PMLE(A, B, C) directly: PMLE(A, B, C) = \[\[w, r, wll/\[\[*, *, *H Let I(w,r,w ~) denote the amount information contained in Hw, r,w~\]\]=c. Its value can be corn-</Paragraph>
    <Paragraph position="9"> puted as follows:</Paragraph>
    <Paragraph position="11"> It is worth noting that I(w,r,w') is equal to the mutual information between w and w' (Hindle, 1990).</Paragraph>
    <Paragraph position="12"> Let T(w) be the set of pairs (r, w') such that log Iw'r'w'lrxll*'r'*ll is positive. We define the sim- wlr~* X *~r~w ! ilarity sim(wl, w2) between two words wl and w2 as follows: )&amp;quot;~(r,w)eT(w, )NT(w~)(I(Wl, r, w) + I(w2, r, w) ) ~-,(r,w)eT(wl) I(Wl, r, w) q- ~(r,w)eT(w2) I(w2, r, w) We parsed a 64-million-word corpus consisting of the Wall Street Journal (24 million words), San Jose Mercury (21 million words) and AP Newswire (19 million words). From the parsed corpus, we extracted 56.5 million dependency triples (8.7 million unique). In the parsed corpus, there are 5469 nouns, 2173 verbs, and 2632 adjectives/adverbs that occurred at least 100 times. We computed the pair-wise similarity between all the nouns, all the verbs and all the adjectives/adverbs, using the above similarity measure. For each word, we created a thesaurus entry which contains the top-N ! words that are most similar to it. 2 The thesaurus entry for word w has the following format:</Paragraph>
    <Paragraph position="14"> where pos is a part of speech, wi is a word, si=sim(w, wi) and si's are ordered in descending  order. For example, the top-10 words in the noun, verb, and adjective entries for the word &amp;quot;brief&amp;quot; are shown below: brief (noun): affidavit 0.13, petition 0.05, memorandum 0.05, motion 0.05, lawsuit 0.05, deposition 0.05, slight 0.05, prospectus 0.04, document 0.04 paper 0.04 ....</Paragraph>
    <Paragraph position="15"> brief(verb): tell 0.09, urge 0.07, ask 0.07, meet 0.06, appoint 0.06, elect 0.05, name 0.05, empower 0.05, summon 0.05, overrule 0.04 ....</Paragraph>
    <Paragraph position="16"> brief (adjective): lengthy 0.13, short 0.12, recent 0.09, prolonged 0.09, long 0.09, extended 0.09, daylong 0.08, scheduled 0.08, stormy 0.07, planned 0.06 ....</Paragraph>
    <Paragraph position="17"> Two words are a pair of respective nearest neighbors (RNNs) if each is the other's most similar word. Our program found 543 pairs of RNN nouns, 212 pairs of RNN verbs and 382 pairs of RNN adjectives/adverbs in the automatically created thesaurus. Appendix A lists every 10th of the RNNs. The result looks very strong. Few pairs of RNNs in Appendix A have clearly better alternatives.</Paragraph>
    <Paragraph position="18"> We also constructed several other thesauri using the same corpus, but with the similarity measures in Figure 1. The measure simHinate is the same as the similarity measure proposed in (Hindle, 1990), except that it does not use dependency triples with negative mutual information. The measure simHindle,, is the same as simHindle except that all types of dependency relationships are used, instead of just subject and object relationships. The measures simcosine, simdice and simdacard are versions of similarity measures commonly used in information retrieval (Frakes and Baeza-Yates, 1992). Unlike sim, simninale and simHinater, they only</Paragraph>
    <Paragraph position="20"> where S(w) is the set of senses of w in the WordNet, super(c) is the set of (possibly indirect) superclasses of concept c in the WordNet, R(w) is the set of words that belong to a same Roget category as w.</Paragraph>
  </Section>
  <Section position="4" start_page="770" end_page="771" type="metho">
    <SectionTitle>
3 Evaluation
</SectionTitle>
    <Paragraph position="0"> In this section, we present an evaluation of automatically constructed thesauri with two manually compiled thesauri, namely, WordNetl.5 (Miller et al., 1990) and Roget Thesaurus. We first define two word similarity measures that are based on the structures of WordNet and Roget (Figure 2). The similarity measure simwN is based on the proposal in (Lin, 1997). The similarity measure simRoget treats all the words in Roget as features. A word w possesses the feature f if f and w belong to a same Roget category. The similarity between two words is then defined as the cosine coefficient of the two feature vectors.</Paragraph>
    <Paragraph position="1"> With simwN and simRoget, we transform Word-Net and Roget into the same format as the automatically constructed thesauri in the previous section. We now discuss how to measure the similarity between two thesaurus entries. Suppose two thesaurus entries for the same word are as follows: 'tO : '//31~ 81~'//12~ 82~... ~I)N~S N Their similarity is defined as:</Paragraph>
    <Paragraph position="3"> For example, (5) is the entry for &amp;quot;brief (noun)&amp;quot; in our automatically generated thesaurus and (6) and  (7) are corresponding entries in WordNet thesaurus and Roget thesaurus.</Paragraph>
    <Paragraph position="4"> (5) brief (noun): affidavit 0.13, petition 0.05, memorandum 0.05, motion 0.05, lawsuit 0.05, deposition 0.05, slight 0.05, prospectus 0.04, document 0.04 paper 0.04.</Paragraph>
    <Paragraph position="5"> (6) brief (noun): outline 0.96, instrument 0.84, summary 0.84, affidavit 0.80, deposition 0.80, law 0.77, survey 0.74, sketch 0.74, resume 0.74, argument 0.74.</Paragraph>
    <Paragraph position="6"> (7) brief (noun): recital 0.77, saga 0.77, autobiography 0.77, anecdote 0.77, novel 0.77, novelist 0.77, tradition 0.70, historian 0.70, tale 0.64.</Paragraph>
    <Paragraph position="7"> According to (4), the similarity between (5) and (6) is 0.297, whereas the similarities between (5) and (7) and between (6) and (7) are 0.</Paragraph>
    <Paragraph position="8">  Our evaluation was conducted with 4294 nouns that occurred at least 100 times in the parsed corpus and are found in both WordNetl.5 and the Roget Thesaurus. Table 1 shows the average similarity between corresponding entries in different thesauri and the standard deviation of the average, which is the standard deviation of the data items divided by the square root of the number of data items. Since the differences among simcosine, simdice and simJacard are very small, we only included the results for simcosine in Table 1 for the sake of brevity. It can be seen that sire, Hindler and cosine are significantly more similar to WordNet than Roget is, but are significantly less similar to Roget than WordNet is. The differences between Hindle and Hindler clearly demonstrate that the use of other types of dependencies in addition to subject and object relationships is very beneficial.</Paragraph>
    <Paragraph position="9"> The performance of sim, Hindler and cosine are quite close. To determine whether or not the differences are statistically significant, we computed their differences in similarities to WordNet and Roget thesaurus for each individual entry. Table 2 shows the average and standard deviation of the average difference. Since the 95% confidence inter- null vals of all the differences in Table 2 are on the positive side, one can draw the statistical conclusion that simis better than simnindle ~, which is better than simcosine.</Paragraph>
  </Section>
  <Section position="5" start_page="771" end_page="771" type="metho">
    <SectionTitle>
4 Future Work
</SectionTitle>
    <Paragraph position="0"> Reliable extraction of similar words from text corpus opens up many possibilities for future work. For example, one can go a step further by constructing a tree structure among the most similar words so that different senses of a given word can be identified with different subtrees. Let wl,..., Wn be a list of words in descending order of their similarity to a given word w. The similarity tree for w is created as follows: * Initialize the similarity tree to consist of a single node w.</Paragraph>
    <Paragraph position="1"> * For i=l, 2 ..... n, insert wi as a child of wj such that wj is the most similar one to wi among {w, Wl ..... wi-1}.</Paragraph>
    <Paragraph position="2"> For example, Figure 3 shows the similarity tree for the top-40 most similar words to duty. The first number behind a word is the similarity of the word to its parent. The second number is the similarity of the word to the root node of the tree.</Paragraph>
    <Paragraph position="3">  Inspection of sample outputs shows that this algorithm works well. However, formal evaluation of its accuracy remains to be future work.</Paragraph>
  </Section>
  <Section position="6" start_page="771" end_page="772" type="metho">
    <SectionTitle>
5 Related Work and Conclusion
</SectionTitle>
    <Paragraph position="0"> There have been many approaches to automatic detection of similar words from text corpora. Ours is  similar to (Grefenstette, 1994; Hindle, 1990; Ruge, 1992) in the use of dependency relationship as the word features, based on which word similarities are computed.</Paragraph>
    <Paragraph position="1"> Evaluation of automatically generated lexical resources is a difficult problem. In (Hindle, 1990), a small set of sample results are presented. In (Smadja, 1993), automatically extracted collocations are judged by a lexicographer. In (Dagan et al., 1993) and (Pereira et al., ! 993), clusters of similar words are evaluated by how well they are able to recover data items that are removed from the input corpus one at a time. In (Alshawi and Carter, 1994), the collocations and their associated scores were evaluated indirectly by their use in parse tree selection. The merits of different measures for association strength are judged by the differences they make in the precision and the recall of the parser outputs.</Paragraph>
    <Paragraph position="2"> The main contribution of this paper is a new evaluation methodology for automatically constructed thesaurus. While previous methods rely on indirect tasks or subjective judgments, our method allows direct and objective comparison between automatically and manually constructed thesauri. The results show that our automatically created thesaurus is significantly closer to WordNet than Roget Thesaurus is. Our experiments also surpasses previous experiments on automatic thesaurus construction in scale and (possibly) accuracy.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML