<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1605"> <Title>Distributional Measures of Concept-Distance: A Task-oriented Evaluation</Title> <Section position="7" start_page="40" end_page="42" type="concl"> <SectionTitle> 6 Related Work </SectionTitle> <Paragraph position="0"> Patwardhan and Pedersen (2006) create aggregate co-occurrence vectors for a WordNet sense by adding the co-occurrence vectors of the words in its WordNet gloss. The distance between two senses is then determined by the cosine of the angle between their aggregate vectors. However, as we pointed out in Mohammad and Hirst (2005), such aggregate co-occurrence vectors are expected to be noisy because they are created from data that is not sense-annotated. Therefore, we employed simple word sense disambiguation and bootstrapping techniques on our base WCCM to create more-accurate co-occurrence vectors, which gave markedly higher accuracies in the task of determining word sense dominance. In the experiments described in this paper, we used these bootstrapped co-occurrence vectors to determine concept-distance.</Paragraph> <Paragraph position="1"> Pantel (2005) also provides a way to create co-occurrence vectors for WordNet senses. The lexical co-occurrence vectors of words in a leaf node are propagated up the WordNet hierarchy.</Paragraph> <Paragraph position="2"> A parent node inherits those co-occurrences that are shared by its children. Lastly, co-occurrences not pertaining to a leaf node are removed from its vector. Even though the methodology attempts to associate a WordNet node or sense with only those co-occurrences that pertain to it, no attempt is made to correct the frequency counts. After all, the word1-word2 co-occurrence frequency (or association) is likely not the same as the SENSE1-word2 co-occurrence frequency (or association), simply because word1 may have senses other than SENSE1. 
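As a rough illustration of the gloss-aggregation idea (a sketch only, not the authors' implementation; all words, glosses, and counts below are made up), a sense's vector is the sum of the co-occurrence vectors of its gloss words, and two senses are compared by the cosine of the angle between their aggregate vectors:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse co-occurrence vectors."""
    dot = sum(u.get(w, 0) * v.get(w, 0) for w in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def aggregate_vector(gloss_words, cooc):
    """Sum the word-level co-occurrence vectors of a sense's gloss words."""
    agg = {}
    for gw in gloss_words:
        for w, c in cooc.get(gw, {}).items():
            agg[w] = agg.get(w, 0) + c
    return agg

# Hypothetical word-level co-occurrence vectors (toy counts).
cooc = {
    "financial":   {"money": 9, "loan": 4},
    "institution": {"money": 3, "building": 2},
    "river":       {"water": 8, "shore": 5},
    "slope":       {"water": 2, "shore": 3},
}
bank_money = aggregate_vector(["financial", "institution"], cooc)
bank_river = aggregate_vector(["river", "slope"], cooc)
```

Because the vectors are built from counts that are not sense-annotated, any gloss word that is itself ambiguous contributes counts from all of its senses, which is exactly the source of noise noted above.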
The co-occurrence frequency of a parent is the weighted sum of the co-occurrence frequencies of its children, with the frequencies of the child nodes used as weights. Sense-ambiguity issues aside, this is still problematic because a parent concept (say, BIRD) may co-occur much more frequently (or infrequently) with a word than its children (such as hen, archaeopteryx, aquatic bird, and trogon). In contrast, the bootstrapped WCCM we use not only identifies which words co-occur with which concepts, but also provides more sophisticated estimates of the co-occurrence frequencies.</Paragraph> <Paragraph position="3"> 7 Conclusion We have proposed a framework that allows distributional measures to estimate concept-distance using a published thesaurus and raw text. We evaluated them in comparison with traditional distributional word-distance measures and WordNet-based measures through their ability to rank word pairs in order of their human-judged linguistic distance, and to correct real-word spelling errors. We showed that distributional concept-distance measures outperformed word-distance measures in both tasks. They do not perform as well as the best WordNet-based measures in ranking a small set of word pairs, but in the task of correcting real-word spelling errors, they beat all WordNet-based measures except Jiang-Conrath (which is markedly better) and Leacock-Chodorow (which is slightly better if we consider correction performance the bottom-line statistic, but slightly worse if we rely on correction ratio). It should be noted that the Rubenstein and Goodenough word pairs used in the ranking task, as well as all the real-word spelling errors in the correction task, are nouns. We expect that the WordNet-based measures will perform poorly when other parts of speech are involved, as those hierarchies of WordNet are not as extensively developed. 
On the other hand, our DPC-based measures do not rely on any hierarchies (even if they exist in a thesaurus) but rather on sets of words that unambiguously represent each sense. Further, because our measures are tied closely to the corpus from which co-occurrence counts are made, we expect the use of domain-specific corpora to yield even better results.</Paragraph> <Paragraph position="4"> All the distributional measures that we have considered in this paper are lexical--that is, the distributional profiles of a target word or concept are based on its co-occurrence with words in a text. By contrast, semantic DPs would be based on information such as which concepts usually co-occur with the target word or concept. Semantic profiles of words can be obtained from the WCCM itself (using the row entry for the word). It would be interesting to see how distributional measures of word-distance that use these semantic DPs of words perform. We also intend to explore the use of semantic DPs of concepts acquired from a concept-concept co-occurrence matrix (CCCM). A CCCM can be created from the WCCM by setting the row entry for a concept or category to be the average of the WCCM row values for all the words pertaining to it.</Paragraph> <Paragraph position="5"> Both DPW- and WordNet-based measures have large space and time requirements for precomputing and storing all possible distance values for a language. However, by using the categories of a thesaurus as very coarse concepts, precomputing and storing all possible distance values for our DPC-based measures requires a matrix of size only about 800×800. This level of concept-coarseness might seem drastic at first glance, but we have shown that distributional measures of distance between these coarse concepts are quite useful. 
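The CCCM construction mentioned earlier can be sketched as follows (a toy example with hypothetical words and counts, not the actual matrix): a concept's row is the average of the WCCM rows of the words listed under that thesaurus category.

```python
def cccm_row(category_words, wccm, vocab):
    """Build one CCCM row: average the WCCM rows of a category's words."""
    row = []
    for v in vocab:
        vals = [wccm[w].get(v, 0) for w in category_words if w in wccm]
        row.append(sum(vals) / len(vals) if vals else 0.0)
    return row

# Toy WCCM: word -> {co-occurring word: count} (made-up data).
wccm = {
    "hen":    {"egg": 6, "farm": 2},
    "trogon": {"forest": 4, "egg": 2},
}
vocab = ["egg", "farm", "forest"]
bird_row = cccm_row(["hen", "trogon"], wccm, vocab)  # row for category BIRD
```

With roughly 800 thesaurus categories, the full matrix of precomputed concept distances stays around 800×800 entries, which is what keeps the storage requirement modest compared with word-level measures.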
Part of our future work will be to try an intermediate degree of coarseness (still much coarser than WordNet) by using the paragraph subdivisions of the thesaurus instead of its categories, to see whether this gives even better results.</Paragraph> </Section> </Paper>