<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0201"> <Title>Getting Serious about Word Sense Disambiguation</Title> <Section position="4" start_page="0" end_page="2" type="metho"> <SectionTitle> 2 The Utility of Word Sense Disambiguation </SectionTitle> <Paragraph position="0"> Although there is general agreement within the NLP community about the utility of WSD, I will briefly address some objections to WSD in this section. To justify the investment of manpower and time needed to gather a large sense-tagged corpus, it is important to examine the benefits brought about by WSD.</Paragraph> <Paragraph position="1"> Information retrieval (IR) is a practical NLP task where WSD has brought about improvement in accuracy. When tested on a standard IR test collection, the use of WSD improves precision by about 4.3% (from 29.9% to 34.2%) (Schütze and Pedersen, 1995). The work of (Dagan and Itai, 1994) has also successfully used WSD to improve the accuracy of machine translation. These examples clearly demonstrate the utility of WSD in practical NLP applications. In this paper, by word sense disambiguation I mean identifying the correct sense of a word in context, where the sense distinctions are at the level of a good desk-top dictionary like WORDNET (Miller, 1990). I focus only on content word disambiguation (i.e., words in the parts of speech noun, verb, adjective, and adverb; among nouns, I consider only common nouns and ignore proper nouns). This is also the task addressed by other WSD research such as (Bruce and Wiebe, 1994; Miller et al., 1994). When the task is to resolve word senses to the fine-grained distinctions of WORDNET senses, the accuracy figures achieved are generally not very high (Miller et al., 1994; Ng and Lee, 1996). This indicates that WSD is a challenging task and much improvement is still needed.</Paragraph> <Paragraph position="2"> However, if one were to resolve word senses only to the level of homographs, or coarse sense distinctions, then quite high accuracy can be achieved (in excess of 90%), as reported in (Wilks and Stevenson, 1996).</Paragraph> <Paragraph position="3"> Similarly, if the task is to distinguish between binary, coarse sense distinctions, then current WSD techniques can achieve very high accuracy (in excess of 96% when tested on a dozen words in (Yarowsky, 1995)). This is to be expected, since homograph contexts are quite distinct, and hence it is a much simpler task to disambiguate among a small number of coarse sense classes. This is in contrast to disambiguating word senses to the refined senses of WORDNET, where, for instance, the average number of senses per noun is 7.8 and the average number of senses per verb is 12.0 for the set of 191 most ambiguous words investigated in (Ng and Lee, 1996).</Paragraph> <Paragraph position="4"> We can readily collapse the refined senses of WORDNET into a smaller set if only a coarse (homographic) sense distinction is needed, say for some NLP applications. Indeed, the WORDNET software has an option for grouping noun senses into a smaller number of sense classes. WSD techniques that work well for refined sense distinctions will apply equally to homograph disambiguation. That is, if we succeed in working on the harder WSD task of resolution into refined senses, the same techniques will also work on the simpler task of homograph disambiguation.</Paragraph>
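To make this sense granularity concrete, the sketch below counts fine-grained senses per part of speech. It assumes NLTK's WordNet interface (a later WordNet release than the WORDNET 1.x used in this paper, so exact counts will differ from the 7.8 and 12.0 figures above), and the sample words are illustrative only.

```python
# Minimal sketch: inspecting WordNet sense granularity via NLTK.
# Assumes `pip install nltk` and `nltk.download('wordnet')` have been run;
# sense counts in modern WordNet releases differ from WordNet 1.x.
from nltk.corpus import wordnet as wn

def sense_counts(word):
    """Number of fine-grained WordNet senses of `word`, per part of speech."""
    return {pos: len(wn.synsets(word, pos=pos))
            for pos in (wn.NOUN, wn.VERB, wn.ADJ, wn.ADV)}

for w in ("line", "interest", "serve"):  # illustrative words only
    print(w, sense_counts(w))
```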
<Paragraph position="5"> A related objection to WSD research is that the sense distinctions made by a good desk-top dictionary like WORDNET are simply too refined, to the point that two humans cannot genuinely agree on the most appropriate sense to assign to some word occurrences (Kilgarriff, 1996). This objection has some merit. However, the remedy is not to throw out word senses completely, but rather to work at a level of sense distinction somewhere in between homograph distinctions and the refined WORDNET sense distinctions. The existing grouping of noun senses in WORDNET into coarser sense classes is perhaps a good compromise.</Paragraph> <Paragraph position="6"> However, in the absence of well-accepted guidelines for making an appropriate level of sense distinction, using the sense classification given in WORDNET, an on-line, publicly available dictionary, seems a natural choice. Hence, I believe that using the current WORDNET sense distinctions to build a sense-tagged corpus is a reasonable way forward. In any case, if some aggregation of senses into coarser groupings is done in the future, it can be readily incorporated into my proposed sense-tagged corpus, which uses the refined sense distinctions of WORDNET.</Paragraph> <Paragraph position="7"> In the rest of this paper, I will assume that broad coverage, high accuracy WSD is indeed useful in practical NLP tasks, and that resolving senses to the refined level of WORDNET is a worthwhile task to pursue.</Paragraph> </Section> <Section position="5" start_page="2" end_page="2" type="metho"> <SectionTitle> 3 The Effect of Training Corpus Size </SectionTitle> <Paragraph position="0"> Much past research on WSD, such as (Leacock et al., 1993; Bruce and Wiebe, 1994; Mooney, 1996), was tested on a small number of words like "line" and "interest". Similarly, (Yarowsky, 1995) tested his WSD algorithm on a dozen words. The sense-tagged corpus SEMCOR, prepared by (Miller et al., 1994), contains a substantial subset of the Brown corpus tagged with the refined senses of WORDNET. However, as reported in (Miller et al., 1994), there are not enough training examples per word in SEMCOR to yield a broad coverage, high accuracy WSD program, because sense tagging in SEMCOR is done on every word of a running text.</Paragraph> <Paragraph position="1"> To overcome this data sparseness problem of WSD, I initiated a mini-project in sense tagging and collected a corpus in which 192,800 occurrences of 191 words have been manually tagged with senses of WORDNET (Ng and Lee, 1996). These 192,800 word occurrences cover 121 nouns and 70 verbs, which are among the most frequently occurring and most ambiguous words of English. To investigate the effect of the number of training examples on WSD accuracy, I ran the exemplar-based WSD algorithm LEXAS on varying numbers of training examples to obtain learning curves for the 191 words (details of LEXAS are described in (Ng and Lee, 1996)). For each word, 10 random trials were conducted and the accuracy figures were averaged over the 10 trials. In each trial, 100 examples were randomly selected to form the test set, while the remaining examples (randomly shuffled) were used for training. LEXAS was then given training examples in multiples of 100 (100, 200, 300, ...), up to the maximum number of training examples (in a multiple of 100) available in the corpus.</Paragraph>
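For readers who want to reproduce this kind of evaluation on their own data, here is a minimal sketch of the protocol just described. The `train` and `evaluate` callables are hypothetical stand-ins for the learner under study (LEXAS itself is described in (Ng and Lee, 1996) and is not reproduced here).

```python
# Minimal sketch of the learning-curve protocol: 10 random trials per word,
# 100 held-out test examples per trial, training-set sizes in multiples of 100.
import random

def learning_curve(examples, train, evaluate, trials=10, test_size=100, step=100):
    """Mean accuracy at each training-set size, averaged over random trials.

    examples: list of (features, sense) pairs for one word.
    train(list) -> model and evaluate(model, list) -> accuracy are
    hypothetical stand-ins for the learner being tested.
    """
    max_train = ((len(examples) - test_size) // step) * step
    sizes = range(step, max_train + 1, step)
    curve = {n: 0.0 for n in sizes}
    for _ in range(trials):
        shuffled = random.sample(examples, len(examples))  # shuffled copy
        test, pool = shuffled[:test_size], shuffled[test_size:]
        for n in sizes:
            model = train(pool[:n])
            curve[n] += evaluate(model, test) / trials
    return curve
```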
<Paragraph position="2"> Note that each word w (of the 191 words) can have a different number of sense-tagged occurrences in our corpus. From the combination of the Brown corpus (1 million words) and the Wall Street Journal corpus (2.5 million words), up to 1,500 sentences, each containing a sense-tagged occurrence of w, are extracted. When the combined corpus has fewer than 1,500 occurrences of w, the maximum number of available occurrences of w is used.</Paragraph> <Paragraph position="3"> For instance, while 137 words have at least 600 occurrences in the combined corpus, only a subset of 43 words has at least 1,400 occurrences. Figures 1 and 2 show the learning curves averaged over these 43 words and 137 words with at least 1,300 and 500 training examples, respectively (100 examples having been set aside for testing in each case). Each figure shows the accuracy of LEXAS versus the baseline most-frequent-sense classifier.</Paragraph> <Paragraph position="4"> Both figures indicate that WSD accuracy continues to climb as the number of training examples increases. They confirm that all the training examples collected in our corpus are effectively utilized by LEXAS to improve its WSD performance. In fact, it appears that for this set of most ambiguous words of English, even more training data may be beneficial. LEXAS was also evaluated on two subsets of test sentences of our sense-tagged corpus, as shown in Table 1.</Paragraph> <Paragraph position="5"> The two test sets, BC50 and WSJ6, are the same as those reported in (Ng and Lee, 1996). BC50 consists of 7,119 occurrences of the 191 words that occur in 50 text files of the Brown corpus. The second test set, WSJ6, consists of 14,139 occurrences of these 191 words that occur in 6 text files of the Wall Street Journal corpus.</Paragraph> <Paragraph position="6"> The performance figures of LEXAS in Table 1 are higher than those reported in (Ng and Lee, 1996). The classification accuracy of the nearest neighbor algorithm used by LEXAS (Cost and Salzberg, 1993) is quite sensitive to the number of nearest neighbors used to select the best matching example. By using 10-fold cross-validation (Kohavi and John, 1995) to automatically pick the best number of nearest neighbors, the performance of LEXAS has improved.</Paragraph>
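This model-selection step can be sketched as follows, with scikit-learn's k-NN classifier standing in for the Cost and Salzberg (1993) exemplar-based learner actually used by LEXAS; the candidate values of k are illustrative only.

```python
# Sketch: choosing the number of nearest neighbors by 10-fold
# cross-validation. KNeighborsClassifier is a stand-in for the
# Cost & Salzberg (1993) exemplar-based learner; only the
# model-selection step is illustrated here.
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def best_k(X_train, y_train, candidates=(1, 3, 5, 7, 9, 15, 25)):
    """Return the k with the highest mean 10-fold cross-validation accuracy."""
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 X_train, y_train, cv=10).mean()
              for k in candidates}
    return max(scores, key=scores.get)
```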
</Section> <Section position="6" start_page="2" end_page="5" type="metho"> <SectionTitle> 4 Word Sense Disambiguation in the Large </SectionTitle> <Paragraph position="0"> In (Gale et al., 1992), it was argued that any wide coverage WSD program must be able to perform significantly better than the most-frequent-sense classifier to be worthy of serious consideration. The performance of LEXAS as indicated in Table 1 is significantly better than the most-frequent-sense classifier for the set of 191 words collected in our corpus. Figures 1 and 2 also confirm that all the training examples collected in our corpus are effectively utilized by LEXAS to improve its WSD performance. This is encouraging, as it demonstrates the feasibility of building a wide coverage WSD program using a supervised learning approach.</Paragraph> <Paragraph position="1"> Unfortunately, our corpus only contains tagged senses for 191 words, and this set of words does not constitute a sufficiently large fraction of all occurrences of content words in an arbitrarily chosen unrestricted text. As such, our sense-tagged corpus is still not large enough to enable the building of a wide coverage, high accuracy WSD program that can significantly outperform the most-frequent-sense classifier over all content words encountered in an arbitrarily chosen unrestricted text.</Paragraph> <Paragraph position="2"> This brings us to the question: how much data do we need to achieve wide coverage, high accuracy WSD?</Paragraph> <Paragraph position="3"> (Table 2: number of polysemous words in each part of speech making up the top 80%, ..., 99% of word occurrences in the Brown corpus.)</Paragraph> <Paragraph position="4"> (Table 3: number of polysemous words in each part of speech making up the top 80%, ..., 99% of word occurrences in the Wall Street Journal corpus.)</Paragraph> <Paragraph position="5"> To shed light on this question, it is instructive to examine the distribution of words and their occurrence frequencies in a large corpus. Table 2 lists the number of polysemous words in each part of speech making up the top 80%, ..., top 99% of word occurrences in the Brown corpus, where the polysemous words are ordered by occurrence frequency, from the most frequently occurring word to the least frequently occurring word. For example, Table 2 indicates that when the polysemous nouns are ordered from the most frequently occurring noun to the least frequently occurring noun, the top 975 polysemous nouns constitute 80% of all noun occurrences in the Brown corpus. This 80% of all noun occurrences includes all nouns in the Brown corpus that are monosemous (about 15.4%) and all rare nouns in the Brown corpus that do not appear in WORDNET and hence have no valid sense definition (about 3.3%) (i.e., the remaining 20% of noun occurrences are all polysemous). Table 3 lists the analogous statistics for the Wall Street Journal corpus.</Paragraph> <Paragraph position="6"> It is also the case that the last 5%-10% of polysemous words in a corpus have only a small number of distinct senses on average. Table 4 lists the average number of senses per polysemous word in the Brown corpus for the top 80%, ..., top 99%, and the bottom 20%, ..., bottom 1% of word occurrences, where the words are again ordered from the most frequently occurring word to the least frequently occurring word. For example, the average number of senses per polysemous noun is 5.14 for the nouns which account for the top 80% of noun occurrences in the Brown corpus. Similarly, the average number of senses per polysemous noun is 2.86 for the polysemous nouns which account for the bottom 20% of noun occurrences in the Brown corpus. Table 5 lists the analogous statistics for the Wall Street Journal corpus.</Paragraph> <Paragraph position="7"> Tables 2 and 3 indicate that a sense-tagged corpus collected for 3,200 words will cover at least 90% of all (content) word occurrences in the Brown corpus, and at least 95% of all (content) word occurrences in the Wall Street Journal corpus. From Table 4, the average number of senses per polysemous word in the Brown corpus for the remaining 10% of word occurrences is only 3.15 or less. Similarly, from Table 5, the average number of senses per polysemous word in the Wall Street Journal corpus for the remaining 5% of word occurrences is only 3.10 or less. For these remaining polysemous words, which account for the last 5%-10% of word occurrences with an average of about 3 senses per word, we can always assign the most frequent sense as a first approximation in building our wide coverage WSD program.</Paragraph>
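The coverage computation behind Tables 2 and 3 can be sketched as follows, assuming only a frequency table mapping each polysemous word of a given part of speech to its corpus count (the actual corpus counts are not reproduced here).

```python
# Sketch of the cumulative-coverage computation behind Tables 2 and 3:
# how many of the most frequent polysemous words are needed to cover
# 80%, ..., 99% of all occurrences? `freqs` maps word -> corpus count.
def words_needed_for_coverage(freqs, targets=(0.80, 0.85, 0.90, 0.95, 0.99)):
    counts = sorted(freqs.values(), reverse=True)  # most frequent first
    total = sum(counts)
    result, running, i = {}, 0, 0
    for target in sorted(targets):
        # Accumulate words until the running total reaches the target share.
        while i < len(counts) and running < target * total:
            running += counts[i]
            i += 1
        result[target] = i  # number of top words needed
    return result
```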
<Paragraph position="8"> Based on these figures, I estimate that a sense-tagged corpus of 3,200 words is sufficient to build a broad coverage, high accuracy WSD program capable of significantly outperforming the most-frequent-sense classifier on average over all content words appearing in an arbitrary, unrestricted English text. Assuming an average of 1,000 sense-tagged occurrences per word, this means a corpus of 3.2 million sense-tagged word occurrences. Assuming a human sense-tagging throughput of 200 words (i.e., 200,000 word occurrences) per man-year, which is the approximate human tagging throughput of my completed sense-tagging mini-project, such a corpus will require about 16 man-years to construct.</Paragraph> <Paragraph position="9"> Given the benefits of a wide coverage, high accuracy, and domain-independent WSD program, I believe it is justifiable to spend the 16 man-years of human annotation effort needed to construct such a sense-tagged corpus.</Paragraph> </Section> </Paper>