
<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1004">
  <Title>Modeling Consensus: Classifier Combination for Word Sense Disambiguation</Title>
  <Section position="10" start_page="9" end_page="9" type="concl">
    <SectionTitle>
7 Conclusion
</SectionTitle>
    <Paragraph position="0"> In conclusion, we have presented a comparative evaluation study of combining six structurally and procedurally different classifiers utilizing a rich common feature space. Various classifier combination methods, including count-based, rank-based and probability-based combinations are described and evaluated. The experiments encompass supervised lexical sample tasks in four diverse languages: English, Spanish, Swedish, and Basque.</Paragraph>
    <Paragraph position="1">  To evaluate systems on the full disambiguation task, it is appropriate to compare them on their accuracy at 100% test-data coverage, which is equivalent to system recall in the official SENSEVAL scores. However, it can also be useful to consider performance on only the subset of data for which a system is confident enough to answer, measured by the secondary measure precision. One useful byproduct of the CBV method is the confidence it assigns to each sample, which we measured by the number of classifiers that voted for the sample. If one restricts system output to only those test instances where all participating classifiers agree, consensus system performance is 83.4% precision at a recall of 43%, for an F-measure of 56.7 on the SENSEVAL2 English lexical sample task. This outperforms the two supervised SENSEVAL2 systems that only had partial coverage, which exhibited 82.9% precision at a recall of 28% (F=41.9) and 66.5% precision at 34.4% recall (F=47.9).</Paragraph>
    <Paragraph position="2">  The experiments show substantial variation in single classifier performance across different languages and data sizes. They also show that this variation can be successfully exploited by 10 different classifier combination methods (and their metavoting consensus), each of which outperforms both the single best classifier system and standard classifier combination models on each of the 4 focus languages. Furthermore, when the stacking consensus systems were frozen and applied once to the otherwise untouched test sets, they substantially outperformed all previously known SENSEVAL1 and SENSEVAL2 results on 4 languages, obtaining the best published results on these data sets.</Paragraph>
  </Section>
class="xml-element"></Paper>