<?xml version="1.0" standalone="yes"?>
<Paper uid="P95-1025">
<Title>Statistical Sense Disambiguation with Relatively Small Corpora Using Dictionary Definitions</Title>
<Section position="4" start_page="182" end_page="183" type="metho">
<Paragraph position="0">⁴ Church and Hanks (1989) use Mutual Information to measure word association norms.</Paragraph>
<Paragraph position="1">N is taken to be the total number of pairs of words processed, given by \( \sum_{DC} f(DC)/2 \), since for each pair of surface words processed, \( \sum_{DC} f(DC) \) is increased by 2.</Paragraph>
<Paragraph position="2">Our scoring method is based on a probabilistic model at the conceptual level. In a standard model, the logarithm of the probability of occurrence of a conceptual set \( \{x_1, x_2, \ldots, x_m\} \) in the context of the conceptual set \( \{y_1, y_2, \ldots, y_n\} \) is given by
\[ \log_2 P(x_1, \ldots, x_m \mid y_1, \ldots, y_n) = \sum_{i=1}^{m} \Big( \log_2 P(x_i) + \sum_{j=1}^{n} MI(x_i, y_j) \Big), \]
assuming that the \( P(x_i) \) are independent of each other given \( y_1, y_2, \ldots, y_n \), and that the \( P(y_j) \) are independent of each other given \( x_i \), for all \( x_i \).⁵</Paragraph>
<Paragraph position="3">Our scoring method deviates from the standard model in a number of aspects (a sketch of the resulting scoring procedure is given at the end of this section):
1. \( \log_2 P(x_i) \), the occurrence-probability term of each of the defining concepts in the sense, is excluded from our scoring method. Since the training data is not sense-tagged, the occurrence probability is highly unreliable. Moreover, the magnitude of the mutual information is decreased by the noise of the spurious senses, while the average magnitude of the occurrence probability is unaffected.⁶ Including the occurrence-probability term would therefore let it dominate the mutual-information term, with the result that the system would favour the sense with the more frequently occurring defining concepts most of the time.
2. The score of a sense with respect to the current context is normalised by subtracting from it the score of the sense calculated with respect to the GlobalCS, which contains all defining concepts (see formula [1]). In effect, we are comparing the score between the sense and the current context against the score between the sense and an artificially constructed "average" context. This is needed to rectify the bias towards the sense(s) with defining concepts of higher average mutual information (over the set of all defining concepts), a bias which is intensified by the ambiguity of the context words.
3. A negative mutual information score is taken to be 0 ([6]). Negative mutual information is unreliable due to the smaller number of data points.
4. The evidence (mutual information score) from multiple defining concepts/words is averaged rather than summed ([2], [4] & [5]). This compensates for the different lengths of the definitions of different senses and the different lengths of the contexts. The evidence from a polysemous context word is taken to be the evidence from its sense with the highest mutual information score ([3]), since only one of the senses is used in the given sentence.</Paragraph>
<Paragraph position="4">⁵ The occurrence probabilities of some defining concepts will not be independent in some contexts. However, modelling the dependency between different concepts in different contexts would lead to an explosion in the complexity of the model.</Paragraph>
<Paragraph position="5">⁶ The noise only leads to an incorrect distribution of the occurrence probability.</Paragraph>
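To make the four deviations concrete, here is a minimal Python sketch of how such a scorer could be wired together. This is our own illustrative reconstruction, not the paper's code: the function names, the dictionary-shaped count tables, and the exact normalisation of the probabilities are all assumptions; the counts themselves would come from the conceptual co-occurrence data described above, and conceptual sets are assumed non-empty.

```python
import math
from statistics import mean

def mutual_information(x, y, pair_freq, concept_freq, n_pairs):
    """Clipped pointwise mutual information between defining concepts x and y.
    Negative MI is taken to be 0 (deviation 3). Note that no log2 P(x)
    occurrence-probability term appears anywhere in the scorer (deviation 1)."""
    f_xy = pair_freq.get((x, y), 0.0)
    if f_xy == 0.0 or concept_freq.get(x, 0.0) == 0.0 or concept_freq.get(y, 0.0) == 0.0:
        return 0.0
    p_xy = f_xy / n_pairs
    p_x = concept_freq[x] / n_pairs
    p_y = concept_freq[y] / n_pairs
    return max(math.log2(p_xy / (p_x * p_y)), 0.0)

def sense_score(sense_cs, context, mi):
    """Average evidence between one sense's conceptual set and the context.
    `context` has one entry per context word; each entry is the list of that
    word's senses, each sense given as its conceptual set. Averaging rather
    than summing (deviation 4) normalises for definition and context lengths;
    a polysemous context word contributes only its best-scoring sense."""
    word_scores = []
    for senses_of_word in context:
        best = max(mean(mi(x, y) for x in sense_cs for y in word_cs)
                   for word_cs in senses_of_word)
        word_scores.append(best)
    return mean(word_scores)

def disambiguate(candidate_senses, context, global_cs, mi):
    """Pick the sense whose score, normalised by subtracting its score
    against the artificial 'average' context GlobalCS (deviation 2),
    is highest. `candidate_senses` maps sense labels to conceptual sets."""
    def normalised(cs):
        return (sense_score(cs, context, mi)
                - sense_score(cs, [[global_cs]], mi))
    return max(candidate_senses, key=lambda label: normalised(candidate_senses[label]))

# Example wiring with hypothetical count tables:
#   from functools import partial
#   mi = partial(mutual_information, pair_freq=pairs, concept_freq=freqs, n_pairs=n)
#   best = disambiguate({"money": cs1, "river": cs2}, context, GLOBAL_CS, mi)
```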
</Section>
<Section position="5" start_page="183" end_page="184" type="metho">
<SectionTitle>3 Evaluation</SectionTitle>
<Paragraph position="0">Our system is tested on the twelve words discussed in Yarowsky (1992) and previous publications on sense disambiguation. Results are shown in Table 1.</Paragraph>
<Paragraph position="1">Our system achieves an average accuracy of 77% on a mean 3-way sense distinction over the twelve words. Numerically, the result is not as good as the 92% reported in Yarowsky (1992). However, direct comparison between the numerical results can be misleading, since the experiments are carried out on two corpora that differ greatly in both size and genre.</Paragraph>
<Paragraph position="2">Firstly, Yarowsky's system is trained on the 10-million-word Grolier's Encyclopedia, which is an order of magnitude larger than the Brown corpus used by our system. Secondly, and more importantly, the two corpora, which are also the test corpora, are very different in genre. Semantic coherence of text, on which both systems rely, is generally stronger in technical writing than in most other kinds of text. Statistical disambiguation systems which rely on semantic coherence will therefore generally perform better on technical writing, of which encyclopedia entries can be regarded as one kind, than on most other kinds of text. The Brown corpus, on the other hand, is a collection of texts of many different genres.</Paragraph>
<Paragraph position="3">People make use of syntactic, semantic and pragmatic knowledge in sense disambiguation. It is not very realistic to expect any system which only possesses semantic coherence knowledge (including ours as well as Yarowsky's) to achieve a very high level of accuracy for all words in general text. To provide a better evaluation of our approach, we have conducted an informal experiment aimed at establishing a more reasonable upper bound on the performance of such systems. In the experiment, a human subject is asked to perform the same disambiguation task as our system, given the same contextual information.⁷ Since our system only uses semantic coherence information and has no deeper understanding of the meaning of the text, the human subject is asked to disambiguate the target word given a list of all the content words in the context (the sentence) of the target word, in random order. The words are put in random order because the system does not make use of syntactic information about the sentence either (this presentation is sketched below). The human subject is also allowed access to a copy of LDOCE, which the system also uses. The results are listed in Table 1. The actual upper bound on the performance of statistical methods using semantic coherence information only should be slightly higher than the performance of the human, since the human is disadvantaged by a number of factors, including but not limited to: 1. it is unnatural for a human to disambiguate in the described manner; 2. the semantic coherence knowledge used by the human is neither complete nor specific to the current corpus;⁸ 3. human error.</Paragraph>
<Paragraph position="4">However, the results provide a rough approximation of the upper bound on the performance of such systems. The human subject achieves an average accuracy of 71% over the twelve words, which is 6% lower than that of our system. More interestingly, the results of the human subject exhibit a pattern similar to the results of our system: the human subject performs better on the words and senses for which our system achieves higher accuracy, and less well on the words and senses for which our system has lower accuracy.</Paragraph>
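The following small sketch shows one way the experimental presentation just described could be generated; the function name and the stoplist are our own hypothetical choices, not details given in the paper.

```python
import random

def human_trial_context(sentence_words, target, function_words):
    """Reduce a test sentence to what the system itself sees: the content
    words of the context, in random order, with all syntax discarded.
    `function_words` is a hypothetical stoplist of non-content words;
    the subject additionally has access to LDOCE, as described above."""
    context = [w for w in sentence_words
               if w != target and w.lower() not in function_words]
    random.shuffle(context)
    return context
```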
<Paragraph position="5">⁷ The result is less than conclusive since only one human subject has been tested. In order to acquire more reliable results, we are currently seeking a few more subjects to repeat the experiment.</Paragraph>
<Paragraph position="6">⁸ The subject has not read through the whole corpus.</Paragraph>
</Section>
<Section position="6" start_page="184" end_page="186" type="metho">
<SectionTitle>4 The Use of Sentence as Local Context</SectionTitle>
<Paragraph position="0">Another significant point our experiments have shown is that the sentence alone can provide enough contextual information for semantic-coherence-based approaches in a large proportion of cases.⁹ The average sentence length in the Brown corpus is 19.4 words,¹⁰ about five times smaller than the 100-word window used in Gale et al. (1992) and Yarowsky (1992). Our approach works well even with such a small "window" because it is based on the identification of salient concepts rather than salient words. In salient-word-based approaches, owing to the problem of data sparseness, many less frequently occurring words which are intuitively salient to a particular word sense will not be identified in practice unless an extremely large corpus is used. The sentence therefore usually does not contain enough identified salient words to provide sufficient contextual information. Using conceptual co-occurrence data, contextual information from the salient but less frequently used words in the sentence is also utilised, through the salient concepts in the conceptual expansions of these words (see the sketch at the end of this section). Obviously, there are still cases where the sentence does not provide enough contextual information even with conceptual co-occurrence data, such as when the sentence is too short, and contextual information from a larger context has to be used. However, the ability to make use of information in a smaller context is very important, because the smaller context always overrules the larger context when their sense preferences differ. For example, in a legal-trial context, the correct sense of sentence in the clause she was asked to repeat the last word of her previous sentence is its word sense rather than the legal sense, which would have been selected if a larger context were used instead.</Paragraph>
<Paragraph position="1">⁹ Analysis of the test samples which our system fails to disambiguate correctly also shows that increasing the window size would benefit the disambiguation process in only a very small proportion of these samples. The main cause of errors is the polysemous words in dictionary definitions, which we discuss in Section 6.</Paragraph>
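The reason a single sentence suffices can be seen in how the co-occurrence statistics are accumulated: counts are pooled over defining concepts rather than surface words, so a rare word both contributes to and later draws evidence from the statistics of the frequent concepts in its expansion. Below is a minimal sketch under our own assumptions about the bookkeeping, which the paper does not spell out; in particular, the weighting scheme (each word spreads a total weight of 1 over its expansion, so each processed word pair adds 2 to the total concept frequency, matching \( N = \sum_{DC} f(DC)/2 \)) is our assumption.

```python
from collections import Counter, defaultdict
from itertools import combinations

def collect_cc_counts(sentences, expansions):
    """Collect conceptual co-occurrence counts from a (small) corpus.
    `sentences` is an iterable of lists of content words; `expansions`
    maps a surface word to the list of defining concepts occurring in its
    LDOCE definitions (a hypothetical pre-computed table)."""
    pair_freq = defaultdict(float)
    concept_freq = Counter()
    n_pairs = 0
    for words in sentences:                      # the sentence is the window
        for w1, w2 in combinations(words, 2):    # each pair of surface words
            cs1, cs2 = expansions.get(w1), expansions.get(w2)
            if not cs1 or not cs2:
                continue
            n_pairs += 1
            u, v = 1.0 / len(cs1), 1.0 / len(cs2)
            for c in cs1:                        # f(DC): concept frequencies
                concept_freq[c] += u
            for c in cs2:
                concept_freq[c] += v
            for c1 in cs1:                       # pair counts, kept symmetric
                for c2 in cs2:
                    pair_freq[(c1, c2)] += u * v
                    pair_freq[(c2, c1)] += u * v
    return pair_freq, concept_freq, n_pairs
```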
<Paragraph position="2">[Table 1: disambiguation results for the twelve test words, with columns N, DBCC, Human and Thes.; the tabular data is not reproduced here.] Notes:
1. N marks the column with the number of test samples for each sense. DBCC (Definition-Based Conceptual Co-occurrence) and Human mark the columns with the results of our system and of the human subject, respectively, in disambiguating the occurrences of the twelve words in the Brown corpus. Thes. (thesaurus) marks the column with the results of Yarowsky (1992), tested on Grolier's Encyclopedia.
2. The "correct" sense of each test sample is chosen by hand disambiguation carried out by the author, using the sentence as the context. A small proportion of the test samples cannot be disambiguated within the given context and are excluded from the experiment.</Paragraph>
</Section>
<Section position="7" start_page="186" end_page="186" type="metho">
<SectionTitle>5 Related Work</SectionTitle>
<Paragraph position="0">Previous attempts to tackle the data sparseness problem in general corpus-based work include class-based approaches and similarity-based approaches. In these approaches, relationships between a given pair of words are modelled by analogy with other words that resemble the given pair in some way. The class-based approaches (Brown et al., 1992; Resnik, 1992; Pereira et al., 1993) calculate co-occurrence data of words belonging to different classes,¹¹ rather than of individual words, to enhance the co-occurrence data collected and to cover words which have low occurrence frequencies. Dagan et al. (1993) argue that using a relatively small number of classes to model the similarity between words may lead to a substantial loss of information. In the similarity-based approaches (Dagan et al., 1993 & 1994; Grishman et al., 1993), each word is modelled by its own set of similar words, derived from statistical data collected from corpora, rather than by a class. However, deriving these sets of similar words requires a substantial amount of statistical data, and thus these approaches require relatively large corpora to start with.¹²</Paragraph>
<Paragraph position="1">Our definition-based approach to statistical sense disambiguation is similar in spirit to the similarity-based approaches with respect to the "specificity" of modelling individual words. However, using definitions from existing dictionaries rather than derived sets of similar words allows our method to work on corpora of much smaller sizes. In our approach, each word is modelled by its own set of defining concepts. Although only 1792 defining concepts are used, the set of all possible combinations (the power set of the defining concepts) is so huge that it is very unlikely that two word senses will have the same combination of defining concepts unless they are almost identical in meaning (see the note at the end of this section). On the other hand, the thesaurus-based method of Yarowsky (1992) may suffer from loss of information (since it is semi-class-based) as well as from data sparseness (since it is based on salient words), and may not perform as well on general text as our approach.</Paragraph>
<Paragraph position="2">¹¹ Classes used in Resnik (1992) are based on the WordNet taxonomy, while the classes of Brown et al. (1992) and Pereira et al. (1993) are derived from statistical data collected from corpora.</Paragraph>
<Paragraph position="3">¹² The corpus used in Dagan et al. (1994) contains 40.5 million words.</Paragraph>
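As a rough note on the size of that combination space (our arithmetic, not a figure from the paper): with 1792 binary membership choices,
\[ |\mathcal{P}(\mathit{DC})| = 2^{1792} = 10^{\,1792 \log_{10} 2} \approx 10^{539}, \]
so even across millions of word senses, a collision between the conceptual sets of two non-synonymous senses is vanishingly unlikely.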
</Section>
<Section position="8" start_page="186" end_page="187" type="metho">
<SectionTitle>6 Limitation and Further Work</SectionTitle>
<Paragraph position="0">Being a dictionary-based method, our approach is naturally limited by the dictionary. The most serious problem is that many of the words in the controlled vocabulary of LDOCE are themselves polysemous. As a result, many of our 1792 defining concepts actually stand for a number of distinct concepts. For example, the defining concept point is used in its place sense, its idea sense and its sharp end sense in different definitions. This affects the accuracy of disambiguating senses whose definitions contain these polysemous words, and it is found to be the main cause of errors for most of the senses with below-average results.</Paragraph>
<Paragraph position="1">We are currently working on ways to disambiguate the words in the dictionary definitions. One possible way is to apply the current method of disambiguation to the defining text of the dictionary itself. The LDOCE defining text has roughly half a million words in its 41,000 entries, which is half the size of the Brown corpus used in the current experiment. Although the result on the dictionary cannot be expected to be as good as the result on the Brown corpus, owing to the smaller size of the dictionary, the reliability of the co-occurrence data subsequently collected, and thus the performance of the disambiguation system, can be improved significantly as long as the disambiguation of the dictionary is considerably more accurate than chance.</Paragraph>
<Paragraph position="2">Our success in using definitions of word senses to overcome the data sparseness problem may also lead to further improvements in sense disambiguation technology. In many cases, semantic coherence information is not adequate to select the correct sense, and knowledge about local constraints is needed.¹³ For the disambiguation of polysemous nouns, these constraints include the modifiers of the nouns and the verbs which take the nouns as objects, among others. This knowledge has been successfully acquired from corpora by manual or semi-automatic approaches such as that described in Hearst (1991). However, fully automatic lexically based approaches such as that described in Yarowsky (1992) are very unlikely to be capable of acquiring this finer-grained knowledge, because the problem of data sparseness becomes even more serious with the introduction of syntactic constraints. Our approach has overcome the data sparseness problem by using the defining concepts of words, and it has been found effective in acquiring semantic coherence knowledge from a relatively small corpus. It is possible that a similar approach based on dictionary definitions will be successful in acquiring knowledge of local constraints from a reasonably sized corpus.</Paragraph>
<Paragraph position="3">¹³ Hatzivassiloglou (1994) shows that the introduction of linguistic cues improves the performance of a statistical semantic knowledge acquisition system in the context of word grouping.</Paragraph>
</Section>
</Paper>