<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0814">
  <Title>Evaluating the results of a memory-based word-expert approach to unrestricted word sense disambiguation.</Title>
  <Section position="6" start_page="1991" end_page="1991" type="evalu">
    <SectionTitle>
4 Results on the Senseval test data
</SectionTitle>
    <Paragraph position="0"> In order to evaluate our word-expert approach on the SENSEVAL-2 test data, we divided the data into three groups as illustrated in Table 3. The onesense group (90.5% accuracy) contains the words with one sense according to WordNet1.7. Besides the errors made for the &amp;quot;U&amp;quot; words, the errors in this group were all due to incorrect POS tags and lemmata. The more-sense a6 threshold group (63.3% accuracy) contains the words with more senses but for which no word-expert was built due to an insufficient number (less than 10) of training instances. These words all receive the majority sense according to WordNet1.7. The more-sense a7 threshold group (55.3% accuracy) contains the words for which a word-expert is built. In all three groups, top performance is for the nouns and adverbs; the verbs are hardest to classify. The last row of Table 3 shows the accuracy of our system on the English all words test set. Since all 2,473 word forms were covered, no distinction is made between precision and recall.</Paragraph>
    <Paragraph position="1"> On the complete test set, an accuracy of 64.4% is obtained according to the fine-grained SENSEVAL-2 scoring.</Paragraph>
    <Paragraph position="2"> This result is slightly different from the score obtained during the competition (63.6%), since for these new experiments complete optimization was performed over all parameter settings. Moreover, in the competition experiments, Ripper (Cohen, 1995) was used as the keyword classifier, whereas in the new experiments TIMBL was used for training all classifiers. Just as in the SENSEVAL-1 task for English (Kilgarriff and Rosenzweig, 2000), overall top performance is for the nouns and adverbs. For the verbs, the overall accuracy is lowest: 48.6%. This was also the case in the train set (see Table 1). All 86 &amp;quot;unknown&amp;quot; word forms, for which the annotators decided that no WordNet1.7 sense-tag was applicable, were mis-classified.</Paragraph>
    <Paragraph position="3"> Although our WSD system performed second best on the SENSEVAL-2 test data, this 64.4% accuracy is rather low. When only taking into account the words for which a word-expert is built, a 55.3% classification accuracy is obtained. This score is nearly 20% below the result on the train set (see Figure 1): 73.8%. A possible explanation for the accuracy differences between the word-expert classifiers on the test and train data, is that the instances in the Semcor training corpus do not cover all possible WordNet senses: in the training corpus, the words we used for the construction of word-experts had on average 4.8a8 3.2 senses, whereas those same words had on average 7.4a8 5.8 senses in WordNet. This implies that for many sense distinctions in the test material no training material was provided: for 603 out of 2,473 test instances (24%), the assigned sense tag (or in case of multiple possible sense tags, one of those senses) was not provided in the train set.</Paragraph>
  </Section>
class="xml-element"></Paper>