<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1005"> <Title>Learning Semantic Classes for Word Sense Disambiguation</Title> <Section position="5" start_page="38" end_page="39" type="evalu"> <SectionTitle> 4 Results </SectionTitle> <Paragraph position="0"> In what follows, we present the results of our experiments in various test cases.3 We combined the three classifiers and the WORDNET first-sense classifier through simple majority voting. For evaluating the systems with SENSEVAL data sets, we mapped the outputs of our classifiers to WORDNET senses by picking the most-frequent sense (the one with the lowest sense number) within each of the class. This mapping was used in all tests. For all evaluations, we used SENSEVAL official scorer.</Paragraph> <Paragraph position="1"> We could use the setting only for nouns and verbs, because the similarity measures we used were not defined for adjectives or adverbs, due to the fact that hypernyms are not defined for these two parts of speech. So we list the initial results only for nouns and verbs.</Paragraph> <Section position="1" start_page="38" end_page="38" type="sub_section"> <SectionTitle> 4.1 Individual classifiers vs. combination </SectionTitle> <Paragraph position="0"> We evaluated the results of the individual classifiers before combination. Only local context classifier could outperform the baseline in general, although there is a slight improvement with the syntactic pattern classifier on SENSEVAL-2 data.</Paragraph> <Paragraph position="1"> The results are given in the table 2, together with the results of voted combination, and baseline WORDNET first sense. Classifier shown as 'concatenated' is a single classifier trained from all of these feature vectors concatenated to make a single vector. Concatenating features this way does not seem to improve performance. Although exact reasons for this are not clear, this is consistent with pre3Note that the experiments and results are reported for SENSEVAL data for comparison purposes, and were not involved in parameter optimization, which was done with the development sample.</Paragraph> <Paragraph position="2"> recall, combined results for nouns and verbs weighting. Combined results for nouns and verbs with voting schemes Simple Majority (SM), Global classifier weights (GW) and local weights (LW).</Paragraph> <Paragraph position="3"> vious observations (Hoste et al., 2001; Decadt et al., 2004) that combining classifiers, each using different features, can yield good performance.</Paragraph> </Section> <Section position="2" start_page="38" end_page="38" type="sub_section"> <SectionTitle> 4.2 Effect of similarity measure </SectionTitle> <Paragraph position="0"> Table 3 shows the effect of JCn and Resnik similarity measures, along with no similarity weighting, for the combined classifier. It is clear that proper similarity measure has a major impact on the performance, with Resnik measure performing worse than the baseline.</Paragraph> </Section> <Section position="3" start_page="38" end_page="39" type="sub_section"> <SectionTitle> 4.3 Optimizing the voting process </SectionTitle> <Paragraph position="0"> Several voting schemes were tried for combining classifiers. Simple majority voting improves performance over baseline. However, previously reported results such as (Hoste et al., 2001) and (Decadt et al., 2004) have shown that optimizing the voting process helps improve the results. We used a variation of Weighted Majority Algorithm (Littlestone and Warmuth, 1994). 
<Paragraph position="3"> Coarse-grained (semantic-class level) results for the same system are shown in Table 5. The baseline figures reported are for the most frequent class.</Paragraph>
</Section>
<Section position="4" start_page="39" end_page="39" type="sub_section"> <SectionTitle> 4.4 Final results on SENSEVAL data </SectionTitle>
<Paragraph position="0"> Here, we list the performance of the system with adjectives and adverbs added, for ease of comparison. Due to the facts mentioned at the beginning of this section, our system was not applicable to these parts of speech, and we classified all instances of these two POS types with their most frequent sense. We also identified the multi-word phrases in the test documents. These phrases generally have a unique sense in WORDNET; we marked all of them with their first sense without classifying them. All the multiple-class instances of nouns and verbs were classified and converted to WORDNET senses by the method described above, with locally optimized classifier voting.</Paragraph>
<Paragraph position="1"> The results of the systems are shown in Tables 7 and 8. Our system's results in both cases are listed as Simil-Prime, along with the baseline WORDNET first sense (including multi-word phrases and 'U' answers) and the reported results of the two best performers.5 These results compare favorably with the official results reported in both tasks.</Paragraph>
<Paragraph position="2"> 4 Words for which there were no samples in SEMCOR were classified using a weight of 1 for all classifiers.</Paragraph>
<Paragraph position="3"> 5 The differences of the baseline figures from previously reported figures are clearly due to the different handling of multi-word phrases, hyphenated words, and unknown words in each system. We observed by analyzing the answer keys that even better baseline figures are technically possible, with better techniques for identifying these special cases.</Paragraph>
<Paragraph position="4"> [Table caption residue: "... data for all parts of speech and fine grained scoring."]</Paragraph>
<Paragraph position="5"> Significance of results. To verify the significance of these results, we used a one-tailed paired t-test, with the results of the baseline WORDNET first sense and of our system as pairs. Tests were done at both the micro-average and the macro-average level (considering the test data set as a whole, and considering per-word averages, respectively). The null hypothesis was that there is no significant improvement over the baseline. Both settings yield good significance levels, as shown in Table 6.</Paragraph>
</Section> </Section> </Paper>