<?xml version="1.0" standalone="yes"?> <Paper uid="P96-1006"> <Title>Integrating Multiple Knowledge Sources to Disambiguate Word Sense: An Exemplar-Based Approach</Title> <Section position="6" start_page="42" end_page="499" type="intro"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"> To evaluate the performance of LEXAS, we conducted two tests, one on a common data set used in (Bruce and Wiebe, 1994), and another on a larger data set that we separately collected.</Paragraph> <Section position="1" start_page="42" end_page="499" type="sub_section"> <SectionTitle> 4.1 Evaluation on a Common Data Set </SectionTitle> <Paragraph position="0"> To our knowledge, very little of the existing work on WSD has been tested and compared on a common data set. This is in contrast to established practice in the machine learning community, and is partly because there are not many common data sets publicly available for testing WSD programs.</Paragraph> <Paragraph position="1"> One exception is the sense-tagged data set used in (Bruce and Wiebe, 1994), which Bruce and Wiebe have made available in the public domain.</Paragraph> <Paragraph position="2"> This data set consists of 2369 sentences, each containing an occurrence of the noun &quot;interest&quot; (or its plural form &quot;interests&quot;) with its correct sense manually tagged. The noun &quot;interest&quot; occurs in six different senses in this data set. Table 2 shows the distribution of sense tags in the data set that we obtained. Note that the sense definitions used in this data set are those from the Longman Dictionary of Contemporary English (LDOCE) (Procter, 1978). This does not pose any problem for LEXAS, since LEXAS only requires that the senses be divided into different classes, regardless of how the sense classes are defined or numbered.</Paragraph> <Paragraph position="3"> The POS of words and the bracketings of noun groups are given in the data set. 
These are used to determine the POS of neighboring words and the verb-object syntactic relation that form the features of examples.</Paragraph> <Paragraph position="4"> Bruce and Wiebe (1994) reported results on a test set of 600 sentences randomly selected from the 2369 sentences. Unfortunately, the data set made available in the public domain gives no indication of which sentences were used as test sentences. As such, we conducted 100 random trials; in each trial, 600 sentences were randomly selected to form the test set, and LEXAS was trained on the remaining 1769 sentences and then tested on the 600 held-out sentences.</Paragraph> <Paragraph position="5"> Note that in Bruce and Wiebe's test run, the proportion of sentences in each sense in the test set is approximately equal to its proportion in the whole data set. Since we select test sentences randomly, the proportion of each sense in our test sets is also approximately equal to its proportion in the whole data set.</Paragraph> <Paragraph position="6"> The average accuracy of LEXAS over the 100 random trials is 87.4%, with a standard deviation of 1.37%.</Paragraph> <Paragraph position="7"> In every one of our 100 random trials, the accuracy of LEXAS is higher than the 78% accuracy reported in (Bruce and Wiebe, 1994).</Paragraph> <Paragraph position="8"> Bruce and Wiebe also performed a separate test using a subset of the &quot;interest&quot; data set with only 4 senses (senses 1, 4, 5, and 6), so as to compare their results with previous work on WSD (Black, 1988; Zernik, 1990; Yarowsky, 1992), which was tested on 4 senses of the noun &quot;interest&quot;. However, the work of (Black, 1988; Zernik, 1990; Yarowsky, 1992) was not based on the present set of sentences, so the comparison is only suggestive. 
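The random-trial protocol just described can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the exemplar-based classifier is passed in as a black box (the names `random_trial_accuracy` and `train_fn` are assumptions), and only the splitting and averaging logic is shown.

```python
import random
import statistics

def random_trial_accuracy(examples, train_fn, num_trials=100,
                          test_size=600, seed=0):
    """Mean and standard deviation of accuracy over random train/test splits.

    examples: list of (features, sense) pairs.
    train_fn: given a training list, returns a classifier features -> sense.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(num_trials):
        # Randomly select `test_size` examples as the test set;
        # the remainder forms the training set for this trial.
        shuffled = examples[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:test_size], shuffled[test_size:]
        classify = train_fn(train)
        correct = sum(1 for feats, sense in test if classify(feats) == sense)
        accuracies.append(correct / len(test))
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

With 2369 examples and test_size=600, each trial trains on the remaining 1769 sentences, matching the setup described here.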
We reproduced in Table 3 the results of past work as well as the classification accuracy of LEXAS, which is 89.9% with a standard deviation of 1.09% over 100 random trials.</Paragraph> <Paragraph position="9"> In summary, when tested on the noun &quot;interest&quot;, LEXAS gives higher classification accuracy than previous work on WSD.</Paragraph> <Paragraph position="10"> In order to evaluate the relative contribution of the knowledge sources, namely (1) POS and morphological form; (2) the unordered set of surrounding words; (3) local collocations; and (4) the verb to the left (verb-object syntactic relation), we conducted 4 separate runs of 100 random trials each.</Paragraph> </Section> <Section position="2" start_page="499" end_page="499" type="sub_section"> <SectionTitle> Sources </SectionTitle> <Paragraph position="0"> In each run, we utilized only one knowledge source and computed the average classification accuracy and standard deviation. The results are given in Table 4.</Paragraph> <Paragraph position="1"> Local collocation knowledge yields the highest accuracy, followed by POS and morphological form.</Paragraph> <Paragraph position="2"> Surrounding words give lower accuracy, perhaps because in our work only the current sentence, which averages about 20 words, forms the surrounding context. 
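The per-source runs can be sketched as a simple ablation loop. This is a hypothetical illustration (the name `single_source_runs` and the per-source feature dictionaries are assumptions, not the authors' implementation): each example's features are grouped by knowledge source, and one source is kept at a time.

```python
def single_source_runs(examples, sources, evaluate):
    """Re-run the evaluation once per knowledge source, keeping only
    that source's features in every example.

    examples: list of (features, sense), where features maps a source
    name (e.g. 'collocations', 'pos') to that source's feature values.
    evaluate: callable returning an accuracy for a restricted data set.
    """
    results = {}
    for source in sources:
        # Strip every example down to the features of a single source.
        restricted = [({source: feats[source]}, sense)
                      for feats, sense in examples]
        results[source] = evaluate(restricted)
    return results
```

Each restricted data set would then be fed through the same 100-random-trial evaluation as before.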
Previous work using the unordered set of surrounding words has used a much larger window, such as the 100-word window of (Yarowsky, 1992) and the 2-sentence context of (Leacock et al., 1993).</Paragraph> <Paragraph position="3"> The verb-object syntactic relation is the weakest knowledge source.</Paragraph> <Paragraph position="4"> Our experimental finding that local collocations are the most predictive agrees with the past observation that humans need a narrow window of only a few words to perform WSD (Choueka and Lusignan, 1985).</Paragraph> <Paragraph position="5"> The processing speed of LEXAS is satisfactory.</Paragraph> <Paragraph position="6"> Running on an SGI Unix workstation, LEXAS can process about 15 examples per second when tested on the &quot;interest&quot; data set.</Paragraph> </Section> <Section position="3" start_page="499" end_page="499" type="sub_section"> <SectionTitle> 4.2 Evaluation on a Large Data Set </SectionTitle> <Paragraph position="0"> Previous research on WSD has tended to be tested on only a dozen or so words, each frequently having only two or a few senses. To test the scalability of LEXAS, we have gathered a corpus in which 192,800 word occurrences have been manually tagged with senses from WORDNET 1.5. This data set is almost two orders of magnitude larger than the above &quot;interest&quot; data set. Manual tagging was done by university undergraduates majoring in Linguistics, and approximately one man-year of effort was expended in tagging our data set.</Paragraph> <Paragraph position="1"> These 192,800 word occurrences consist of 121 nouns and 70 verbs, which are among the most frequently occurring and most ambiguous words of English. 
The 121 nouns are: action activity age air area art board body book business car case center century change child church city class college community company condition cost country course day death development difference door effect effort end example experience face fact family field figure foot force form girl government ground head history home hour house information interest job land law level life light line man material matter member mind moment money month name nation need number order part party picture place plan point policy position power pressure problem process program public purpose question reason result right room school section sense service side society stage state step student study surface system table term thing time town type use value voice water way word work world The 70 verbs are: add appear ask become believe bring build call carry change come consider continue determine develop draw expect fall give go grow happen help hold indicate involve keep know lead leave lie like live look lose mean meet move need open pay raise read receive remember require return rise run see seem send set show sit speak stand start stop strike take talk tell think turn wait walk want work write For this set of nouns and verbs, the average number of senses per noun is 7.8, while the average number of senses per verb is 12.0. We draw our sentences containing the occurrences of the 191 words listed above from the combined corpus of the 1 million word Brown corpus and the 2.5 million word Wall Street Journal (WSJ) corpus. For every word in the two lists, up to 1,500 sentences each containing an occurrence of the word are extracted from the combined corpus. In all, there are about 113,000 noun occurrences and about 79,800 verb occurrences. This set of 121 nouns accounts for about 20% of all occurrences of nouns that one expects to encounter in any unrestricted English text. 
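The extraction step, capped at up to 1,500 sentences per word, can be sketched as follows. This is a minimal illustration over a tokenized corpus (the name `extract_sentences` is an assumption, and matching of inflected forms is omitted for brevity):

```python
def extract_sentences(corpus_sentences, target_words, cap=1500):
    """Collect up to `cap` sentences per target word from a tokenized corpus.

    corpus_sentences: iterable of sentences, each a list of tokens.
    target_words: the words of interest (e.g. the 121 nouns and 70 verbs).
    """
    hits = {w: [] for w in target_words}
    for sent in corpus_sentences:
        tokens = set(sent)
        for w in target_words:
            # Keep the sentence if it contains the word and the cap
            # for that word has not yet been reached.
            if w in tokens and len(hits[w]) < cap:
                hits[w].append(sent)
    return hits
```

A sentence containing two target words is counted once for each, which is consistent with counting occurrences per word.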
Similarly, about 20% of all verb occurrences in any unrestricted text come from the set of 70 verbs chosen. We estimate that 10-20% of the sense tags in our data set are erroneous. To get an idea of how the sense assignments in our data set compare with those provided by the WORDNET linguists in SEMCOR, the sense-tagged subset of the Brown corpus prepared by Miller et al. (Miller et al., 1994), we compared the subset of occurrences that overlap. Out of 5,317 overlapping occurrences, about 57% of the sense assignments in our data set agree with those in SEMCOR. This should not be too surprising, as it is widely believed that sense tagging using the full set of refined senses found in a large dictionary like WORDNET involves making subtle human judgments (Wilks et al., 1990; Bruce and Wiebe, 1994), such that there are many genuine cases where two humans will not agree fully on the best sense assignments. We evaluated LEXAS on this larger set of noisy, sense-tagged data. We first set aside two subsets for testing. The first test set, named BC50, consists of 7,119 occurrences of the 191 content words that occur in 50 text files of the Brown corpus. The second test set, named WSJ6, consists of 14,139 occurrences of the 191 content words that occur in 6 text files of the WSJ corpus.</Paragraph> <Paragraph position="2"> We compared the classification accuracy of LEXAS against the default strategy of picking the most frequent sense. This default strategy has been advocated as the baseline performance level for comparison with WSD programs (Gale et al., 1992). There are two instantiations of this strategy in our current evaluation. Since WORDNET orders its senses such that sense 1 is the most frequent sense, one possibility is to always pick sense 1 as the best sense assignment. This assignment method does not even need to look at the training sentences. We call this method &quot;Sense 1&quot; in Table 5. 
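The two instantiations of the default strategy can be sketched as follows (a minimal illustration with hypothetical function names, not the authors' code; the first assumes senses are numbered so that 1 is WORDNET's most frequent sense, the second estimates the most frequent sense from the training sentences):

```python
from collections import Counter

def sense1_baseline(test_senses):
    """Always predict sense 1 (WORDNET lists the most frequent sense first)."""
    return sum(1 for s in test_senses if s == 1) / len(test_senses)

def most_frequent_baseline(train_senses, test_senses):
    """Predict the single sense seen most often in the training sentences."""
    best = Counter(train_senses).most_common(1)[0][0]
    return sum(1 for s in test_senses if s == best) / len(test_senses)
```

Unlike the second baseline, the first never looks at the training data at all, which is why it serves as the weaker of the two reference points.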
Another assignment method is to determine the most frequently occurring sense in the training sentences, and to assign this sense to all test sentences. We call this method &quot;Most Frequent&quot; in Table 5. The accuracy of LEXAS on these two test sets is given in Table 5.</Paragraph> <Paragraph position="3"> Our results indicate that exemplar-based classification of word senses scales up quite well when tested on a large set of words. The classification accuracy of LEXAS is always better than the default strategy of picking the most frequent sense. We believe that our result is significant, especially since the training data is noisy and the words are highly ambiguous, with a large number of refined sense distinctions per word.</Paragraph> <Paragraph position="4"> The accuracy on the Brown corpus test files is lower than that achieved on the Wall Street Journal test files, primarily because the Brown corpus consists of texts from a wide variety of genres, including newspaper reports, newspaper editorials, biblical passages, science and mathematics articles, general fiction, romance stories, humor, etc. It is harder to disambiguate words coming from such a wide variety of texts.</Paragraph> </Section> </Section> </Paper>