<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1013">
  <Title>Ensemble Methods for Unsupervised WSD</Title>
  <Section position="7" start_page="101" end_page="103" type="evalu">
    <SectionTitle>
5 Experiment 2: Ensembles for Unsupervised WSD
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="101" end_page="101" type="sub_section">
      <SectionTitle>
5.1 Method and Parameter Settings
</SectionTitle>
      <Paragraph position="0"> We assess the performance of the different ensemble systems on the same set of SemCor nouns on which the individual methods were tested. For the best ensemble, we also report results on disambiguating all nouns in the Senseval-3 data set.</Paragraph>
      <Paragraph position="1"> We focus exclusively on nouns to allow comparisons with the results obtained from SemCor.</Paragraph>
      <Paragraph position="2"> We used the same parameters as in Experiment 1 for constructing the ensembles. As discussed earlier, token-based methods can disambiguate target words either in context or using the predominant sense. SSI was employed in the predominant sense setting in our arbiter experiment.</Paragraph>
    </Section>
    <Section position="2" start_page="101" end_page="103" type="sub_section">
      <SectionTitle>
5.2 Results
</SectionTitle>
      <Paragraph position="0"> Our results are summarized in Table 4. As can be seen, all ensemble methods perform signi cantly  moval of each method from the rank-based ensemble. null better than the best individual methods, i.e., Similarity and SSI. On the WSD task, the voting, probability mixture, and rank-based ensembles significantly outperform the arbiter-based one. The performances of the probability mixture, and rank-based combinations do not differ signi cantly but both ensembles are signi cantly better than voting. One of the factors contributing to the arbiter's worse performance (compared to the other ensembles) is the fact that in many cases (almost 30%), none of the senses suggested by the disagreeing methods is correct. In these cases, there is no way for the arbiter to select the correct sense. We also examined the relative contribution of each component to overall performance. Table 5 displays the drop in performance by eliminating any particular component from the rank-based ensemble (indicated by ). The system that contributes the most to the ensemble is SSI. Interestingly, Overlap and Similarity yield similar improvements in WSD accuracy (0.6 and 0.9, respectively) when added to the ensemble.</Paragraph>
      <Paragraph position="1"> Figure 1 shows the WSD accuracy of the best single methods and the ensembles as a function of the noun frequency in SemCor. We can see that there is at least one ensemble outperforming any single method in every frequency band and that the rank-based ensemble consistently outperforms Similarity and SSI in all bands. Although Similarity has an advantage over SSI for low and medium frequency words, it delivers worse performance for high frequency words. This is possibly due to the quality of neighbors obtained for very frequent words, which are not semantically distinct enough to reliably discriminate between different senses.</Paragraph>
      <Paragraph position="2"> Table 6 lists the performance of the rank-based ensemble on the Senseval-3 (noun) corpus. We also report results for the best individual method, namely SSI, and compare our results with the best unsupervised system that participated in Senseval3. The latter was developed by Strapparava et al. (2004) and performs domain driven disambiguation (IRST-DDD). Speci cally, the approach com- null pares the domain of the context surrounding the target word with the domains of its senses and uses a version of WordNet augmented with domain labels (e.g., economy, geography). Our baseline selects the rst sense randomly and uses it to disambiguate all instances of a target word. Our upper bound defaults to the rst sense from SemCor. We report precision, recall and Fscore. In cases where precision and recall gures coincide, the algorithm has 100% coverage.</Paragraph>
      <Paragraph position="3"> As can be seen the rank-based, ensemble out-performs both SSI and the IRST-DDD system.</Paragraph>
      <Paragraph position="4"> This is an encouraging result, suggesting that there may be advantages in developing diverse classes of unsupervised WSD algorithms for system combination. The results in Table 6 are higher than those reported for SemCor (see Table 4). This is expected since the Senseval-3 data set contains monosemous nouns as well. Taking solely polysemous nouns into account, SSI's Fscore is 53.39% and the ranked-based ensemble's 55.0%. We further note that not all of the components in our ensemble are optimal. Predominant senses for Lesk and LexChains were estimated from the Senseval-3 data, however a larger corpus would probably yield more reliable estimates.</Paragraph>
    </Section>
  </Section>
</Paper>