<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1081">
<Title>A Kernel PCA Method for Superior Word Sense Disambiguation Dekai WU Weifeng SU Marine CARPUAT dekai@cs.ust.hk weifeng@cs.ust.hk marine@cs.ust.hk</Title>
<Section position="5" start_page="0" end_page="0" type="evalu">
<SectionTitle>4 Experiments</SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle>4.1 KPCA versus naïve Bayes and maximum entropy models</SectionTitle>
<Paragraph position="0">We established two baseline models to represent the state of the art for individual WSD models: (1) naïve Bayes, and (2) maximum entropy models. The naïve Bayes model was found to be the most accurate classifier in a comparative study by Yarowsky and Florian (2002) using a subset of the Senseval-2 English lexical sample data. However, the maximum entropy model (Jaynes, 1978) was found to yield higher accuracy than naïve Bayes in a subsequent comparison by Klein and Manning (2002), who used a different subset of either Senseval-1 or Senseval-2 English lexical sample data. To control for data variation, we built and tuned models of both kinds. Note that our objective in these experiments is to understand the performance and characteristics of KPCA relative to other individual methods. It is not our objective here to compare against voting or other ensemble methods, which, though known to be useful in practice (e.g., Yarowsky et al. (2001)), would not add to our understanding.</Paragraph>
<Paragraph position="1">To compare as evenly as possible, we employed features approximating those of the &quot;feature-enhanced naïve Bayes model&quot; of Yarowsky and Florian (2002), which included position-sensitive, syntactic, and local collocational features. The models in the comparative study by Klein and Manning (2002) did not include such features, so, again for consistency of comparison, we experimentally verified that our maximum entropy model (a) consistently yielded higher scores than when the features were not used, and (b) consistently yielded higher scores than naïve Bayes using the same features, in agreement with Klein and Manning (2002). We also verified the maximum entropy results against several different implementations, using various smoothing criteria, to ensure that the comparison was even.</Paragraph>
<Paragraph position="2">Evaluation was done on the Senseval-2 English lexical sample task. It includes 73 target words, among which are nouns, adjectives, adverbs, and verbs.</Paragraph>
<Paragraph position="3">For each word, training and test instances tagged with WordNet senses are provided. There is an average of 7.8 senses per target word type, and on average 109 training instances per target word are available.</Paragraph>
<Paragraph position="4">Note that we used the set of sense classes from Senseval's &quot;fine-grained&quot; rather than &quot;coarse-grained&quot; classification task.</Paragraph>
<Paragraph position="5">The KPCA-based model achieves the highest accuracy, as shown in Table 5, followed by the maximum entropy model, with naïve Bayes doing the poorest. Bear in mind that all of these models are significantly more accurate than any of the other reported models on Senseval. &quot;Accuracy&quot; here refers to both precision and recall, since disambiguation of all target words in the test set is attempted. Results are statistically significant at the 0.10 level, using bootstrap resampling (Efron and Tibshirani, 1993); moreover, we consistently witnessed the same level of accuracy gains from the KPCA-based model over many variations of the experiments.</Paragraph>
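As a concrete illustration of the significance-testing procedure, the following minimal Python sketch computes a bootstrap confidence interval over per-instance correctness flags in the style of Efron and Tibshirani (1993); the function name, the number of replicates, and the 90% confidence level (the complement of significance at the 0.10 level) are illustrative assumptions rather than details reported here.

import random

def bootstrap_accuracy_interval(correct, replicates=1000, conf=0.90):
    """correct: a list of 0/1 flags, one per disambiguated test instance."""
    n = len(correct)
    accuracies = []
    for _ in range(replicates):
        # Resample the n test instances with replacement and rescore.
        sample = [correct[random.randrange(n)] for _ in range(n)]
        accuracies.append(sum(sample) / float(n))
    accuracies.sort()
    lower = accuracies[int((1.0 - conf) / 2.0 * replicates)]       # ~5th percentile
    upper = accuracies[int((1.0 + conf) / 2.0 * replicates) - 1]   # ~95th percentile
    return sum(correct) / float(n), lower, upper

Applying the same resampling to the per-instance difference in correctness between two models would yield a paired comparison of the kind needed to contrast the KPCA-based and SVM models.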
<Paragraph position="6">
Table 6: KPCA-based model versus the SVM model.
WSD Model           Accuracy   Sig. Int.
SVM-based model     65.2%      +/-1.00%
KPCA-based model    65.8%      +/-0.79%
</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle>4.2 KPCA versus SVM models</SectionTitle>
<Paragraph position="0">Support vector machines (e.g., Vapnik (1995), Joachims (1998)) are a different kind of kernel method that, unlike KPCA methods, have already gained high popularity for NLP applications (e.g., Takamura and Matsumoto (2001), Isozaki and Kazawa (2002), Mayfield et al. (2003)), including the word sense disambiguation task (e.g., Cabezas et al. (2001)). Given that SVM and KPCA are both kernel methods, we are frequently asked whether SVM-based WSD could achieve similar results.</Paragraph>
<Paragraph position="1">To explore this question, we trained and tuned an SVM model, providing the same rich set of features and also varying the feature representations to optimize for SVM biases. As shown in Table 6, the highest-achieving SVM model is also able to obtain higher accuracies than the naïve Bayes and maximum entropy models. However, in all our experiments the KPCA-based model consistently outperforms the SVM model (though the margin falls within the statistical significance interval as computed by bootstrap resampling for this single experiment). The difference in KPCA and SVM performance is not surprising given that, aside from the use of kernels, the two models share little structural resemblance.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle>4.3 Running times</SectionTitle>
<Paragraph position="0">Training and testing times for the various model implementations are given in Table 7, as reported by the Unix time command. Implementations of all models are in C++, but the level of optimization is not controlled. For example, no attempt was made to reduce the training time for naïve Bayes, or to reduce the testing time for the KPCA-based model.</Paragraph>
<Paragraph position="1">Nevertheless, we can note that in the operating range of the Senseval lexical sample task, the running times of the KPCA-based model are roughly within the same order of magnitude as those for naïve Bayes or maximum entropy. On the other hand, training is much faster than for the alternative kernel method based on SVMs. However, the KPCA-based model's times could be expected to suffer in situations where significantly larger amounts of training data are available.</Paragraph>
</Section>
</Section>
</Paper>
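To make the final scaling caveat concrete: exact kernel PCA forms an n-by-n kernel matrix over the n training instances and eigendecomposes it, so memory grows roughly as O(n^2) and time up to O(n^3), whereas the roughly 109 training instances per target word in the Senseval lexical sample task keep n small. The sketch below shows how one could observe this growth empirically, using scikit-learn's KernelPCA purely as a modern stand-in (the paper's implementation was in C++, and the kernel, parameters, and synthetic data here are assumptions).

import time
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
for n in (500, 1000, 2000):
    # n synthetic training vectors with 100 features each; real WSD vectors
    # would be sparse bag-of-features representations.
    X = rng.standard_normal((n, 100))
    model = KernelPCA(n_components=50, kernel="poly", degree=2)
    start = time.perf_counter()
    model.fit(X)  # builds and eigendecomposes an n-by-n kernel matrix
    print(n, "instances:", round(time.perf_counter() - start, 2), "s")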