<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0814">
  <Title>Evaluating the results of a memory-based word-expert approach to unrestricted word sense disambiguation.</Title>
  <Section position="4" start_page="0" end_page="1991" type="metho">
    <SectionTitle>
2 Memory-based word-experts
</SectionTitle>
    <Paragraph position="0"> Our approach in the SENSEVAL-2 experiments was to train so-called word-experts per word-POS combination. These word-experts consist of several learning modules, each of them taking different information as input, which are furthermore combined in a voting scheme.</Paragraph>
    <Paragraph position="1"> In the experiments, the Semcor corpus included in WordNet1.6 was used as training set. In this corpus, every word is linked to its appropriate sense in the WordNet lexicon. The training corpus consists of 409,990 word forms, of which 190,481 are sense-tagged. The test data in the SENSEVAL-2 English all-words task consist of three articles on different topics, with a total of 2,473 words to be sense-tagged. WordNet1.7 was used for the annotation of these test data. No mapping was performed between the two versions of WordNet. For both the training and the test corpus, only the word forms were used; tokenization, lemmatization and POS-tagging were done with our own software. For the part-of-speech tagging, the memory-based tagger MBT (Daelemans et al., 1996), trained on the Wall Street Journal corpus, was used. Lemmatization (van den Bosch and Daelemans, 1999) was performed on the basis of word and POS information.</Paragraph>
    <Paragraph position="2"> After this preprocessing stage, all word-experts were built. This process was guided by WordNet1.7: for every combination of a word form and a POS, WordNet1.7 was consulted to determine whether this combination had one or more possible senses.</Paragraph>
    <Paragraph position="3"> In case of only one possible sense (about 20% of the test words), the appropriate sense was assigned.</Paragraph>
    <Paragraph position="4"> In case of more possible senses, a minimal threshold of ten occurrences in the Semcor training data was set, since 10-fold cross-validation was used for testing in all experiments.</Paragraph>
    [Figure 2: accuracy of the voting techniques in relation to a threshold varying between 10 and 100. This accuracy is calculated on the words with more than one sense which qualify for the construction of a word-expert.]
    <Paragraph position="5"> This threshold was then varied between 10 and 100 training items in order to determine the optimal number of training instances. For all words whose frequency was lower than the threshold (also about 20% of the test words), the most frequent sense according to WordNet1.7 was predicted. The cross-validation results in Figure 2 clearly show that accuracy drops as the contribution of the baseline classifier increases. The WordNet baseline classifier by itself yields a 61.7% accuracy. The "best" graph displays the accuracy when applying the optimal classifier for each single word-expert: with a threshold of 10, a 73.8% classification accuracy is obtained. On the basis of these results, we set the threshold for the construction of a word-expert to 10 training items. For all words below this threshold, the most frequent sense according to WordNet1.7 was assigned as sense-tag.</Paragraph>
    <Paragraph position="6"> For the other words in the test set (1,404 out of 2,473), word-experts were built for each word form-POS combination, leading to 596 word-experts for the SENSEVAL-2 test data.</Paragraph>
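As a minimal sketch of this routing logic, assuming hypothetical lookup functions wordnet_senses() and semcor_count() for the WordNet 1.7 lexicon and the Semcor training counts (senses assumed ordered by frequency, so the first one is the most frequent):

# Minimal sketch of the word-expert routing described above.
# wordnet_senses() and semcor_count() are hypothetical stand-ins for the
# WordNet 1.7 and Semcor lookups; senses are assumed to be ordered by
# frequency, so senses[0] is the most frequent sense.

THRESHOLD = 10  # minimum number of Semcor training items for a word-expert

def route(word, pos, wordnet_senses, semcor_count):
    senses = wordnet_senses(word, pos)
    if len(senses) == 1:
        return ("monosemous", senses[0])   # ~20% of test words: assign directly
    if semcor_count(word, pos) < THRESHOLD:
        return ("baseline", senses[0])     # too rare: most frequent WordNet sense
    return ("word-expert", None)           # enough data: train a word-expert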
    <Paragraph position="7"> The word-experts consist of different trained subcomponents which make use of different knowledge sources: (i) a classifier trained on the local context of the ambiguous focus word, (ii) a learner trained on keywords, (iii) a classifier trained on both of the previous information sources, (iv) a baseline classifier always providing the most frequent sense in the sense lexicon and (v) four voting strategies which vote on the outputs of the previously mentioned classifiers. For the experiments with the single classifiers, we used the MBL algorithms implemented in TIMBL. In this memory-based learning approach to WSD, all instances are stored in memory during training; during testing (i.e. sense-tagging), the stored instance most similar (by Hamming distance) to the focus word's local context and/or keyword information is selected, and the associated class is returned as sense-tag. For an overview of the algorithms and metrics, we refer to Daelemans et al. (2001).</Paragraph>
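The memory-based core can be pictured as a nearest-neighbour lookup under Hamming (overlap) distance. The sketch below is a deliberate simplification of TIMBL, which additionally offers feature weighting, k > 1 neighbours and other distance metrics:

# Simplified memory-based classifier: training stores all instances;
# testing returns the sense of the stored instance closest in Hamming
# distance. TIMBL adds feature weighting and k-NN refinements on top.

def hamming(a, b):
    # Number of feature positions on which two instances disagree.
    return sum(1 for x, y in zip(a, b) if x != y)

class MemoryBasedClassifier:
    def __init__(self):
        self.memory = []  # list of (feature_vector, sense) pairs

    def train(self, instances):
        self.memory.extend(instances)  # "training" is just storage

    def classify(self, features):
        _, sense = min(self.memory, key=lambda m: hamming(m[0], features))
        return sense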
    <Paragraph position="8"> - The first classifier in a word-expert takes as input a vector representing the local context of the focus word in a window of three words to the left and three to the right. For the focus word, both the lemma and POS are provided; for the context words, POS information is given (see the first sketch after this list). E.g., the following is a training instance: American JJ history NN and CC most most JJS American JJ literature NN is VBZ most%3:00:01::.</Paragraph>
    <Paragraph position="9"> - The second classifier in a word-expert is trained with information about possible disambiguating content keywords in a context of three sentences (the focus sentence and one sentence to the left and to the right). The method used to extract these keywords for each sense is based on the work of Ng and Lee (1996) (see the second sketch after this list). In addition to the keyword information extracted from the local context of the focus word, possible disambiguating content words were also extracted from the examples in the sense definitions for a given focus word in WordNet.</Paragraph>
    <Paragraph position="10"> - The third subcomponent is a learner combining both of the previous information sources.</Paragraph>
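First, a sketch of how the local-context instances could be built from a POS-tagged, lemmatized sentence; the field order (word/POS pairs for the context, word, lemma and POS for the focus) is our reading of the printed example, and the padding symbol is an assumption:

# Builds a local-context training instance for the first classifier:
# three word/POS pairs to the left, word + lemma + POS for the focus,
# three word/POS pairs to the right, plus the sense tag as class label.
# The exact field order is inferred from the example instance above.

PAD = "_"  # assumed padding symbol at sentence boundaries

def local_context_instance(tokens, i, sense):
    """tokens: list of (word, lemma, pos) triples; i: focus word index."""
    def tok(j):
        return tokens[j] if 0 <= j < len(tokens) else (PAD, PAD, PAD)

    features = []
    for j in range(i - 3, i):          # left context: word and POS
        word, _, pos = tok(j)
        features += [word, pos]
    word, lemma, pos = tokens[i]       # focus: word, lemma and POS
    features += [word, lemma, pos]
    for j in range(i + 1, i + 4):      # right context: word and POS
        word, _, pos = tok(j)
        features += [word, pos]
    return features, sense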
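Second, a hedged sketch of keyword selection in the spirit of Ng and Lee (1996): a content word from the three-sentence context is kept as a keyword for a sense when it co-occurs with that sense often enough, both relatively and absolutely. The thresholds below are illustrative, not the values used in the paper:

# Keyword selection sketch following the idea of Ng and Lee (1996).
# MIN_PROB and MIN_COUNT are illustrative thresholds, not the paper's.

from collections import Counter, defaultdict

MIN_PROB, MIN_COUNT = 0.8, 3

def select_keywords(tagged_contexts):
    """tagged_contexts: list of (sense, content_words) for one focus word."""
    occurrences = Counter()                 # contexts containing each word
    cooccurrences = defaultdict(Counter)    # per word, counts per sense
    for sense, words in tagged_contexts:
        for w in set(words):
            occurrences[w] += 1
            cooccurrences[w][sense] += 1
    keywords = defaultdict(set)
    for w, per_sense in cooccurrences.items():
        for sense, n in per_sense.items():
            if n >= MIN_COUNT and n / occurrences[w] >= MIN_PROB:
                keywords[sense].add(w)
    return keywords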
    <Paragraph position="11"> In order to improve the predictions of the different learning algorithms, algorithm parameter optimization was performed where possible. Furthermore, the possible gain in accuracy of different voting strategies was explored. On the output of these three (optimized) classifiers and the WordNet1.7 most frequent sense, both majority voting and weighted voting were performed. In majority voting, each sense-tagger is given one vote and the tag with most votes is selected. In weighted voting, the accuracies of the taggers on the validation set are used as weights, so that more weight is given to the taggers with a higher accuracy. In case of ties when voting over the output of the 4 classifiers, the first decision (TIMBL) was taken as output class. Voting was also performed on the output of the three classifiers without taking into account the WordNet class.</Paragraph>
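A sketch of the two voting schemes over the component outputs; weighted voting uses each component's validation accuracy as its weight, and a majority-voting tie falls back on the first decision (the TIMBL context classifier), as described above:

# Majority and weighted voting over the component predictions.

from collections import Counter

def majority_vote(predictions):
    counts = Counter(predictions)
    best, n = counts.most_common(1)[0]
    if sum(1 for c in counts.values() if c == n) > 1:
        return predictions[0]   # tie: fall back on the first classifier
    return best

def weighted_vote(predictions, validation_accuracies):
    scores = Counter()
    for sense, weight in zip(predictions, validation_accuracies):
        scores[sense] += weight
    return scores.most_common(1)[0][0]

# e.g. weighted_vote(["s1", "s2", "s2", "s1"], [0.74, 0.69, 0.71, 0.62])
# -> "s2" (0.69 + 0.71 outweighs 0.74 + 0.62)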
    <Paragraph position="12"> For a more complete description of this word-expert approach, we refer to Hoste et al. (2001) and Hoste et al. (2002).</Paragraph>
  </Section>
  <Section position="5" start_page="1991" end_page="1991" type="metho">
    <SectionTitle>
3 Evaluation of the results
</SectionTitle>
    <Paragraph position="0"> For the evaluation of our word sense disambiguation system, we concentrated on the words for which a word-expert was built. We first evaluated our approach using cross-validation on the training data, giving us the possibility to evaluate over a large set (2,401) of word-experts. The results on the test set (596 word-experts) are discussed in Section 4.</Paragraph>
    <Section position="1" start_page="1991" end_page="1991" type="sub_section">
      <SectionTitle>
3.1 Parts-of-speech vs. information sources
</SectionTitle>
      <Paragraph position="0"> In a first evaluation step, we investigated the interaction between the use of different information sources and the part-of-speech category of the ambiguous words. Table 1 shows the results of the different component classifiers and voting mechanisms per part-of-speech category. This table shows the same tendencies among all classifiers and voters: the best scores are obtained for the adverbs, nouns and adjectives. Their average scores range between 64.2% (score of the baseline classifier on the nouns) and 76.6% (score of the context classifier on the adverbs). For the verbs, accuracies drop by nearly 10% and range between 56.9% (baseline classifier) and 64.6% (weighted voters). A similar observation was made by Kilgarriff and Rosenzweig (2000) in the SENSEVAL-1 competition, in which a restricted set of words had to be disambiguated. They also showed that in English the verbs were the hardest category to predict.</Paragraph>
      <Paragraph position="1"> Each row in Table 1 shows the results of the different word-expert components per part-of-speech category. This comparison reveals that there is no optimal classifier/voter per part-of-speech, nor an overall optimal classifier. However, making use of different classifiers/voters which take different information sources as input does make sense, provided the selection of the classifier/voter is done at the word level. We already showed this gain in accuracy in Figure 2: selecting the optimal classifier/voter for each single word-expert leads to an overall accuracy of 73.8% on the training set, whereas the second best method (weighted voting without taking into account the baseline classifier) yields a 70.3% accuracy.</Paragraph>
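The word-level selection behind the "best" curve can be sketched as picking, per word-expert, the component with the highest cross-validation accuracy and applying only that component at test time; the component names below are illustrative:

# Per-word component selection: keep the classifier or voter with the
# highest cross-validation accuracy for each word-expert.

def select_best_component(validation_accuracies):
    """validation_accuracies: dict mapping component name to accuracy."""
    return max(validation_accuracies, key=validation_accuracies.get)

# e.g. select_best_component({"context": 0.71, "keywords": 0.66,
#                             "combined": 0.73, "weighted-vote": 0.70})
# -> "combined"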
    </Section>
    <Section position="2" start_page="1991" end_page="1991" type="sub_section">
      <SectionTitle>
3.2 Number of training items
</SectionTitle>
      <Paragraph position="0"> We also investigated whether the words with the same part-of-speech have certain characteristics which make them harder/easier to disambiguate. In other words, why are verbs harder to disambiguate than adverbs? For this evaluation, the results of the context classifier were taken as a test case and evaluated in terms of (i) the number of training items, (ii) the number of senses in the training corpus and (iii) the sense distribution within the word-experts.</Paragraph>
      <Paragraph position="1"> With respect to the number of training items, we observed that their frequency distribution is Zipf-like (Zipf, 1935): many training items occur only a limited number of times, whereas few training items occur frequently. In order to analyze the effect of the number of training items on accuracy, all word-experts were sorted according to their performance and then divided into equally-sized groups of 50. Figure 2 displays the accuracy of the word-experts in relation to the averages of these bags of 50. The figure shows that the accuracy fluctuations for these bags are higher for the experts with a limited number of training items, and that the fluctuations decrease as the number of training items increases. The average accuracy level of 70% lies roughly in the middle of this fluctuating line.</Paragraph>
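A sketch of this bag analysis; the text says the experts were sorted by performance, but since the discussion relates the bags to training-set size, the sketch assumes ordering by number of training items before grouping into consecutive bags of 50:

# Bag analysis sketch: order word-experts (assumed: by number of
# training items), group into bags of 50, average accuracy per bag.

def bag_accuracies(experts, bag_size=50):
    """experts: list of (n_training_items, accuracy) per word-expert."""
    ordered = [acc for _, acc in sorted(experts)]
    return [sum(bag) / len(bag)
            for bag in (ordered[i:i + bag_size]
                        for i in range(0, len(ordered), bag_size))
            if bag]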
      <Paragraph position="2"> This tendency of performance being largely independent of the number of training items is also confirmed when averaging the number of training items per part-of-speech category. The adjectives have on average 49.0 training items and the nouns an average of 52.9; the highest average numbers of training items are found for the verbs (86.7) and adverbs (82.1). Comparing these figures with the scores in Table 1, which show that the verbs are hardest to predict while the accuracy levels for the adverbs, nouns and adjectives are close, we can conclude that the mere number of training items is not an accurate predictor of accuracy. This again confirms the usefulness of training classifiers even on very small data sets, as also shown in Figure 1.</Paragraph>
    </Section>
    <Section position="3" start_page="1991" end_page="1991" type="sub_section">
      <SectionTitle>
3.3 Polysemy and sense distribution
</SectionTitle>
      <Paragraph position="0"> For the English lexical sample task in SENSEVAL-1, Kilgarriff and Rosenzweig (2000) investigated the effect of polysemy and entropy on accuracy. [Figure 3: the number of senses and the exponential trendline per POS in relation to the accuracy of the context classifier.] Polysemy can be described as the number of senses of a word-POS combination; entropy is an estimate of the information chaos in the frequency distribution of the senses. If the corpus instances are evenly spread across the lexicon senses, entropy will be high. The sense distribution of ambiguous words can also be highly skewed, giving rise to low entropy scores. Kilgarriff and Rosenzweig (2000) found that the nouns on average had higher polysemy than the verbs, and the verbs had higher entropy. Since verbs were harder to predict than nouns, they concluded that entropy was a better measure of task difficulty than polysemy. Since we were interested in whether the same would hold for the English all-words task, we investigated the effect of polysemy and entropy in relation to the accuracy of one classifier in our word-expert, namely the context classifier.</Paragraph>
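The entropy measure compared with polysemy here is the standard Shannon entropy of a word's sense distribution, H = -sum over senses of p(s) * log2 p(s); a small worked sketch:

# Entropy of a word's sense distribution. An even spread over the senses
# gives high entropy; a dominant majority sense gives entropy near zero.

import math

def sense_entropy(sense_counts):
    """sense_counts: dict mapping sense tags to training frequencies."""
    total = sum(sense_counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in sense_counts.values() if n > 0)

# e.g. sense_entropy({"s1": 50, "s2": 50}) -> 1.0   (even spread)
#      sense_entropy({"s1": 99, "s2": 1})  -> ~0.08 (highly skewed)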
      <Paragraph position="1"> Figure 3 shows the number of senses (polysemy) over all word-experts with the same part-of-speech in relation to the scores from the context classifier, whereas Figure 4 displays the sense distributions (entropy) over all word-experts with the same part-of-speech. Although it is not very clear from the scatter plot in Figure 3, the exponential trendlines show that accuracy increases as the number of senses decreases. For the sense distributions, the same tendency, but much stronger, can be observed: low entropy values mostly coincide with high accuracies, whereas high entropies lead to low accuracy scores. [Figure 4: the sense distributions and the exponential trendline per POS in relation to the accuracy of the context classifier.] This tendency is also confirmed when averaging these scores over all word-experts with the same part-of-speech (see Table 2): the verbs, which are hardest to predict, are the most polysemous and also show the highest entropy. The adverbs, which are easiest to predict, have on average the lowest number of senses and the lowest entropy. We can conclude that both polysemy and, in particular, entropy are good measures for determining task difficulty.</Paragraph>
      <Paragraph position="2"> These results indicate that it would be interesting to work towards a more coarse-grained distinction between word senses. We believe that this would increase the performance of WSD systems and make them possible candidates for integration in practical applications such as machine translation systems. This is also shown by Stevenson and Wilks (2001), who used the Longman Dictionary of Contemporary English (LDOCE) as sense inventory. In LDOCE, the senses for each word type are grouped into sets of senses with related meanings (homographs); senses which are far enough apart are placed in separate homographs.</Paragraph>
      <Paragraph position="3"> The vast majority of homographs in LDOCE are marked with a single part-of-speech. This makes the task of WSD partly a part-of-speech tagging task, which is generally held to be an easier task than word sense disambiguation: on a corpus of 5 articles in the Wall Street Journal, their system already correctly classifies 87.4% of the words when only using POS information (baseline: 78%).</Paragraph>
      <Paragraph position="4"> [Table 2: polysemy and entropy averaged per part-of-speech category.]</Paragraph>
      <Paragraph position="5"> As illustrated in Figure 4, the context classifier performs best on word-POS combinations with low entropy values. However, since low entropy scores are caused by a skewed distribution in which, at one end, many instances share the same sense and, at the other, only a very few instances carry different senses, simply choosing the majority class for all instances already leads to high accuracies. In order to determine performance on those low-entropy words, we selected the 100 words with the lowest entropy values.</Paragraph>
      <Paragraph position="6"> The local context classifier has an average accuracy of 96.8% on these words, whereas the baseline classifier, which always predicts the majority class, has an average accuracy of 90.2%. These scores show that even in the case of highly skewed sense distributions, where the large majority of the training instances carries the majority sense, our memory-based learning approach performs well.</Paragraph>
    </Section>
  </Section>
</Paper>