<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1605"> <Title>Distributional Measures of Concept-Distance: A Task-oriented Evaluation</Title> <Section position="4" start_page="36" end_page="36" type="metho"> <SectionTitle> 3 Conceptual grain size and storage </SectionTitle> <Paragraph position="0"> requirements As applications for linguistic distance become more sophisticated and demanding, it becomes attractive to pre-compute and store the distance values between all possible pairs of words or senses. But both kinds of measures have large space requirements to do this, requiring matrices of size N A2N,whereN is the size of the vocabulary (perhaps 100,000 for most languages) in the case of distributional measures and the number of senses (75,000 just for nouns in WordNet) in the case of semantic measures.</Paragraph> <Paragraph position="1"> It is generally accepted, however, that WordNet senses are far too fine-grained (Agirre and Lopez de Lacalle Lekuona (2003) and citations therein). On the other hand, published thesauri, such as Roget's and Macquarie, group near-synonymous and semantically related words into a relatively small number of categories--typically between 800 and 1100--that roughly correspond to very coarse concepts or senses (Yarowsky, 1992). Words with more than one sense are listed in more than one category. A published thesaurus thus provides us with a very coarse human-developed set or inventory of word senses or concepts that are more intuitive and discernible than the &quot;concepts&quot; generated by dimensionality-reduction methods such as latent semantic analysis. Using coarse senses from a known inventory means that the senses can be represented unambiguously by a large number of possibly ambiguous words (conveniently available in the thesaurus)--a feature that we exploited in our earlier work (Mohammad and Hirst, 2006) to determine useful estimates of the strength of association between a concept and co-occurring words. In this paper, we go one step further and use the idea of a very coarse sense inventory to develop a framework for distributional measures of concepts that can more naturally and more accurately be used in place of semantic measures of word senses. We use the Macquarie Thesaurus (Bernard, 1986) as a sense inventory and repository of words pertaining to each sense. It has 812 categories with around 176,000 word tokens and 98,000 word types. This allows us to have much smaller concept-concept distance matrices of size just 812A2812 (roughly .01% the size We use the terms senses and concepts interchangeably. This is in contrast to studies, such as that of Cooper (2005), that attempt to make a principled distinction between them. of matrices required by existing measures). We evaluate our distributional concept-distance measures on two tasks: ranking word pairs in order of their semantic distance, and correcting real-word spelling errors. 
<Section position="5" start_page="36" end_page="38" type="metho">
<SectionTitle> 4 Distributional measures of concept-distance </SectionTitle>
<Section position="1" start_page="36" end_page="37" type="sub_section">
<SectionTitle> 4.1 Capturing distributional profiles of concepts </SectionTitle>
<Paragraph position="0"> We use relation-free lexical DPs--both DPWs and DPCs--in our experiments, as they allow determination of semantic properties of the target from just its co-occurring words.</Paragraph>
<Paragraph position="1"> Determining lexical DPWs simply involves making word-word co-occurrence counts in a corpus. A direct method to determine lexical DPCs, on the other hand, requires information about which words occur with which concepts. This means that the text from which counts are made has to be sense-annotated. Since existing labeled data is minimal and manual annotation is far too expensive, indirect means must be used. In an earlier paper (Mohammad and Hirst, 2006), we showed how this can be done with simple word sense disambiguation and bootstrapping techniques. Here, we summarize the method.</Paragraph>
<Paragraph position="2"> First, we create a word-category co-occurrence matrix (WCCM) using the British National Corpus (BNC) and the Macquarie Thesaurus. The WCCM has the following form:

$$\begin{array}{c|ccccc} & c_1 & c_2 & \cdots & c_j & \cdots \\ \hline w_1 & m_{11} & m_{12} & \cdots & m_{1j} & \cdots \\ w_2 & m_{21} & m_{22} & \cdots & m_{2j} & \cdots \\ \vdots & \vdots & \vdots & & \vdots & \\ w_i & m_{i1} & m_{i2} & \cdots & m_{ij} & \cdots \\ \vdots & \vdots & \vdots & & \vdots & \end{array}$$

A cell $m_{ij}$, corresponding to word $w_i$ and category $c_j$, contains the number of times $w_i$ co-occurs (in a window of ±5 words in the corpus) with any of the words listed under category $c_j$ in the thesaurus. Intuitively, the cell $m_{ij}$ captures the number of times $c_j$ and $w_i$ co-occur. A contingency table for a single word and a single category can be created by simply collapsing all other rows and columns into one and summing their frequencies. Applying a suitable statistic, such as the odds ratio, to the contingency table gives the strength of association between a concept (category) and a co-occurring word. Therefore, the WCCM can be used to create the lexical DP for any concept.</Paragraph>
<Paragraph position="3"> The matrix created after one pass of the corpus, which we call the base WCCM, is noisy, as it is created from raw text rather than sense-annotated data, but it nonetheless captures strong associations between categories and co-occurring words. The intended sense (thesaurus category) of a word in the corpus can therefore be determined using the frequencies of co-occurring words and its various senses as evidence. A new, bootstrapped WCCM is then created in a second pass of the corpus, in which the cell $m_{ij}$ contains the number of times any word used in sense $c_j$ co-occurs with $w_i$. We have shown (Mohammad and Hirst, 2006) that the bootstrapped WCCM captures word-category co-occurrences much more accurately than the base WCCM, using the task of determining word sense dominance as a test bed. (Near-upper-bound results were achieved in the task of determining predominant senses of 27 words in 11 target texts with a wide range of sense distributions over their two most dominant senses.)</Paragraph>
</Section>
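The following is a minimal sketch of the base-WCCM construction and of deriving a concept's lexical DP from it. The corpus and thesaurus interfaces (a flat token list and a word-to-categories dict) are simplifying assumptions of ours, and the bootstrapping pass, which re-counts after disambiguating each word occurrence to its most strongly associated category, is omitted for brevity:

```python
# Sketch of the base word-category co-occurrence matrix (WCCM)
# described above. Corpus reading, tokenization, and the thesaurus
# format are simplified assumptions, not the authors' code.
from collections import defaultdict

WINDOW = 5  # +/- 5 words, as in the text

def build_base_wccm(corpus_tokens, word_to_categories):
    """corpus_tokens: list of word tokens.
    word_to_categories: dict mapping a word to the set of thesaurus
    category IDs it is listed under (ambiguous words map to several).
    Returns wccm[word][category] = co-occurrence count."""
    wccm = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(corpus_tokens):
        lo = max(0, i - WINDOW)
        hi = min(len(corpus_tokens), i + WINDOW + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            # Credit every category the neighbouring word is listed under.
            for cat in word_to_categories.get(corpus_tokens[j], ()):
                wccm[w][cat] += 1
    return wccm

def concept_dp(wccm, category):
    """Lexical DP of a concept: P(word | category), estimated from the
    category's column of the WCCM. (The text also allows other
    association statistics, such as the odds ratio.)"""
    col = {w: cats[category] for w, cats in wccm.items() if category in cats}
    total = sum(col.values())
    return {w: n / total for w, n in col.items()} if total else {}
```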
<Section position="2" start_page="37" end_page="38" type="sub_section">
<SectionTitle> 4.2 Applying distributional measures to DPCs </SectionTitle>
<Paragraph position="0"> Recall that in computing distributional word-distance, we consider two target words to be distributionally similar (less distant) if they occur in similar contexts. The contexts are represented by the DPs of the target words, where a DP gives the strength of association between the target and the co-occurring units. A distributional measure uses a measure of DP distance to determine the distance between two DPs and thereby between the two target words (see Figure 1). The various measures differ in the statistic they use to calculate the strength of association and in the measure of DP distance they use (see Mohammad and Hirst (2005) for details). For example, following is the cosine formula for distance between words $w_1$ and $w_2$, using relation-free lexical DPWs, with conditional probability of the co-occurring word given the target as the strength of association:

$$\mathit{Cos}(w_1, w_2) = \frac{\sum_{w \in C(w_1) \cup C(w_2)} P(w|w_1) \times P(w|w_2)}{\sqrt{\sum_{w \in C(w_1)} P(w|w_1)^2} \times \sqrt{\sum_{w \in C(w_2)} P(w|w_2)^2}}$$

Here, $C(x)$ is the set of words that co-occur with word $x$ within a pre-determined window.</Paragraph>
<Paragraph position="1"> In order to calculate distributional concept-distance, consider the same scenario, except that the targets are now senses or concepts. Two concepts are closer if their DPs are similar, and these DPCs require the strength of association between the target concepts and their co-occurring words. The associations can be estimated from the bootstrapped WCCM, described in Section 4.1 above. Any of the distributional measures used for DPWs can now be used to estimate concept-distance with DPCs. Figure 2 illustrates our methodology. Below is the formula for cosine with conditional probabilities when applied to concepts:

$$\mathit{Cos}(c_1, c_2) = \frac{\sum_{w \in C(c_1) \cup C(c_2)} P(w|c_1) \times P(w|c_2)}{\sqrt{\sum_{w \in C(c_1)} P(w|c_1)^2} \times \sqrt{\sum_{w \in C(c_2)} P(w|c_2)^2}}$$

Now, $C(x)$ is the set of words that co-occur with concept $x$ within a pre-determined window.</Paragraph>
<Paragraph position="2"> We will refer to such measures as distributional measures of concept-distance ($\mathrm{Distrib}_{\mathrm{concept}}$), in contrast to the earlier-described distributional measures of word-distance ($\mathrm{Distrib}_{\mathrm{word}}$) and WordNet-based (or semantic) measures of concept-distance ($\mathrm{WNet}_{\mathrm{concept}}$). We shall refer to these three kinds of distance measures as measure-types. Individual measures of each kind will be referred to simply as measures.</Paragraph>
<Paragraph position="3"> A distributional measure of concept-distance can be used to populate a small 812 × 812 concept-concept distance matrix, where a cell $m_{ij}$ contains the distance between concepts $c_i$ and $c_j$. In contrast, a word-word distance matrix for a conservative vocabulary of 100,000 word types will have a size of 100,000 × 100,000, and a WordNet-based concept-concept distance matrix will have a size of 75,000 × 75,000 just for nouns. Our concept-concept distance matrix is roughly 0.01% the size of these matrices.</Paragraph>
<Paragraph position="4"> Note that the DPs we are using are relation-free because (1) we use all co-occurring words (not just those that are related to the target by certain syntactic or semantic relations) and (2) the WCCM, as described in Section 4.1, does not maintain separate counts for the different relations between the target and co-occurring words. Creating a larger matrix with separate counts for the different relations would lead to relation-constrained DPs.</Paragraph>
</Section>
</Section>
</Paper>
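As a companion to the two cosine formulas above, here is a small sketch of the measure over DPs represented as dictionaries mapping co-occurring words to conditional probabilities; because DPWs and DPCs share this representation, the same function serves both. The function and the toy profiles are illustrative assumptions, not from the paper:

```python
# Cosine over distributional profiles (DPs). A DP maps each
# co-occurring word to its strength of association with the target;
# here, the conditional probability P(word | target). The same code
# applies to word targets (DPWs) and concept targets (DPCs).
from math import sqrt

def cosine(dp1: dict, dp2: dict) -> float:
    """Cosine between two DPs; higher values mean the targets are
    distributionally closer (less distant)."""
    union = set(dp1) | set(dp2)  # C(x1) union C(x2)
    num = sum(dp1.get(w, 0.0) * dp2.get(w, 0.0) for w in union)
    den = (sqrt(sum(v * v for v in dp1.values()))
           * sqrt(sum(v * v for v in dp2.values())))
    return num / den if den else 0.0

# Toy DPCs (invented numbers) for two hypothetical thesaurus categories:
dp_celestial_body = {"star": 0.40, "space": 0.35, "bright": 0.25}
dp_celebrity      = {"star": 0.45, "movie": 0.35, "bright": 0.20}
print(f"cosine = {cosine(dp_celestial_body, dp_celebrity):.3f}")
```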