<?xml version="1.0" standalone="yes"?>
<Paper uid="J01-3001">
  <Title>The Interaction of Knowledge Sources in Word Sense Disambiguation</Title>
  <Section position="8" start_page="341" end_page="344" type="evalu">
    <SectionTitle>
6. Performance
</SectionTitle>
    <Paragraph position="0"> Using the evaluation procedure described in the previous section, it was found that the system correctly disambiguated 90% of the ambiguous instances to the fine-grained sense level, and in excess of 94% to the homograph level.</Paragraph>
    <Paragraph position="1">  System results, baselines, and corpus characteristics. Sense level results are calculated over all polysemous words in the evaluation corpus while those reported for the homograph level are calculated only over polyhomographic ones.</Paragraph>
    <Paragraph position="2">  In order to analyze the effectiveness of our tagger in more detail, we split the main corpus into sub-corpora by grammatical category. In other words, we created four individual sub-corpora containing the ambiguous words which had been part-of-speech tagged as nouns, verbs, adjectives, and adverbs. The figures characterizing each of these corpora are shown in Table 7. The majority of the ambiguous words were nouns, with far fewer verbs and adjectives, and less than one thousand adverbs. The average polysemy for nouns, at both sense and homograph levels, is roughly the same as the overall corpus average although it is noticably higher for verbs at the sense level. At the sense level the average polysemy figures are much lower for adjectives and adverbs. This is because it is common for English words to act as either a noun or a verb and, since these are the most polysemous grammatical categories, the average polysemy count becomes large due to the cumulative effect of polysemy across grammatical categories. However, words that can act as adjectives or adverbs are unlikely to be nouns or verbs. This, plus the fact that adjectives and adverbs are generally less polysemous in LDOCE, means that their average polysemy in text is far lower than it is for nouns or verbs.</Paragraph>
    <Paragraph position="3"> Table 7 shows the accuracy of our system over the four subcorpora. We can see that the tagger achieves higher results at the homograph level than the sense level on each of the four subcorpora, which is consistent with the result over the whole corpus.</Paragraph>
    <Paragraph position="4"> There is quite a difference in the tagger's results across the different subcorpora-91% for nouns and 70% for adverbs. Perhaps the learning algorithm does not perform as well on adverbs because that corpus is significantly smaller than the other three. This hypothesis was checked by testing our system on portions of each of the three subcorpora that were roughly equal in size to the adverb subcorpus. We found that the reduced data caused a slight loss of accuracy on each of the three subcorpora; however, there was still a marked difference between the results for the adverb subcorpus and the other three. Further analysis showed that the differences in performance over different subcorpora seem linked to the behavior of different partial taggers when used in combination. In the following section we describe this behavior in more detail. null</Paragraph>
    <Section position="1" start_page="343" end_page="344" type="sub_section">
      <SectionTitle>
6.1 Interaction of Knowledge Sources
</SectionTitle>
      <Paragraph position="0"> In order to gauge the contribution of each knowledge source separately, we implemented a set of simple disambiguation algorithms, each of which uses the output from a single partial tagger. Each algorithm takes the result of its partial tagger and checks it against the disambiguated text to see if it is correct. If the partial tagger returns more than one sense, as do the simulated annealing, subject code and selectional preference taggers, the first sense is taken to break the tie. For the partial tagger based on Yarowsky's subject-code algorithm, we choose the sense with the highest saliency value. If more than one sense has been assigned the maximum value, the tie is again broken by choosing the first sense. Therefore, each partial tagger returns a single sense and the exact match metric is used to determine the proportion of tokens for which that tagger returns the correct sense. The part-of-speech filter is run before the partial taggers make their decision and so they only consider the set of senses it did not remove. The results of each tagger, computed at both sense and homograph levels over the evaluation corpus and four subcorpora, are shown in Table 7.</Paragraph>
      <Paragraph position="1"> We can see that the partial taggers that are most effective are those based on the simulated annealing algorithm and Yarowsky's subject code approach. The success of these modules supports our decision to use existing disambiguation algorithms that have already been developed rather than creating new ones.</Paragraph>
      <Paragraph position="2"> The most successful of the partial taggers is the one based on Yarowsky's algorithm for modelling thesaural categories by wide contexts. This consistently achieves over 70% correct disambiguation and seems particularly successful when disambiguating adverbs (over 85% correct). It is quite surprising that this algorithm is so successful for adverbs, since it would seem quite reasonable to expect an algorithm based on subject codes to be more successful on nouns and less so on modifiers such as adjectives and adverbs.</Paragraph>
      <Paragraph position="3"> Yarowsky (1992) reports that his algorithm achieves 92% correct disambiguation, which is nearly 13% higher than achieved in our implementation. However, Yarowsky tested his implementation on a restricted vocabulary of 12 words, the majority of which were nouns, and used Roget large categories as senses. The baseline performance for this corpus is 66.5%, considerably higher than the 30.9% computed for the corpus used in our experiments. Another possible reason for the difference in results is the fact that Yarowsky used smoothing algorithms to avoid problems with the probability estimates caused by data sparseness. We did not employ these procedures and used simple corpus frequency counts when calculating the probabilities (see Section 4.5). It is not possible to say for sure that the differences between implementations did not lead to the differences in results, but it seems likely that the difference in the semantic granularity of LDOCE subject codes and Roget categories was an important factor.</Paragraph>
      <Paragraph position="4"> The second partial tagger based on an existing approach is the one which uses simulated annealing to optimize the overlap of words shared by the dictionary definitions for a set of senses. In Section 4.3 we noted that Cowie et al. (1992) reported 47% correct disambiguation to the sense level using this technique, while in our adaptation over 17% more words are correctly disambiguated. Our application filtered out senses with the incorrect part of speech in addition to using a different method to calculate overlap that takes account of short definitions. It seems likely that these changes are the source of the improved results.</Paragraph>
      <Paragraph position="5"> Our least successful partial tagger is the one based on selectional preferences.</Paragraph>
      <Paragraph position="6"> Although its overall result is slightly below the overall corpus baseline, it is very successful at disambiguating verbs. This is consistent with the work of Resnik (1997), who reported that many words do not have strong enough selectional restrictions to carry out WSD. We expected preferences to be successful for adjectives as well, although  this is not the case in our evaluation. This is because the sense discrimination of adjectives is carried out after that for nouns in our algorithm (see Section 4.4), and the former is hindered by the low results of the latter. Adverbs cannot be disambiguated by preference methods against LDOCE because it does not contain the appropriate information.</Paragraph>
      <Paragraph position="7"> Our analysis of the behavior of the individual partial taggers provides some clues to the behavior of the overall system, consisting of all taggers, on the different subcorpora, as shown in Table 7. The system performs to roughly the same level over the noun, verb, and adjective sub-corpora with only a 3% difference between the best and worst performance. The system's worst performance is on the abverb sub-corpus, where it disambiguates only slightly more than 70% of tokens successfully. This may be due to the fact that only two partial taggers provide evidence for this grammatical category. However, the system still manages to disambiguate most of the adverbs to the homograph level successfully, and this is probably because the part-of-speech filter has ruled out the incorrect homographs, not because the partial taggers performed well.</Paragraph>
      <Paragraph position="8"> One can legitimately wonder whether in fact the different knowledge sources for WSD are all ways of encoding the same semantic information, in a similar way that one might suspect transformation rules and statistics encode the same information about part-of-speech tag sequences in different formats. However, the fact that an optimized combination of our partial taggers yields a significantly higher figure than any one tagger operating independently, shows that they must be orthogonal information sources.</Paragraph>
    </Section>
    <Section position="2" start_page="344" end_page="344" type="sub_section">
      <SectionTitle>
6.2 The Overall Value of the Part-of-Speech Filter
</SectionTitle>
      <Paragraph position="0"> We have already examined the usefulness of part-of-speech tags for semantic disambiguation in Section 3. However, we now want to know the effect it has within a system consisting of several disambiguation modules. It was found that accuracy at the sense level reduced to 87.87% and to 93.36% at the homograph level when the filter was removed. Although the system's performance did not decrease by a large amount, the part-of-speech filter brings the additional benefit of reducing the search space for the three partial taggers. In addition, the fact that these results are not affected much by the removal of the part-of-speech filter, shows that the WSD modules alone do a reasonable job of resolving part-of-speech ambiguity as a side-effect of semantic disambiguation.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>