<?xml version="1.0" standalone="yes"?>
<Paper uid="P92-1032">
  <Title>Estimating Upper and Lower Bounds on the Performance of Word-Sense Disambiguation Programs</Title>
  <Section position="5" start_page="255" end_page="255" type="concl">
    <SectionTitle>
7. Conclusions
</SectionTitle>
    <Paragraph position="0"> We began this discussion with a review of our recent work on word-sense disambiguation, which extends the approach of using massive lexicographic resources (e.g., parallel corpora, dictionaries, thesauruses and encyclopedia) in order to attack the knowledgeacquisition bottleneck that Bar-Hillel identified over thirty years ago. After using both the monolingual and bilingual classifiers for a few months, we have convinced ourselves that the performance is remarkably good. Nevertheless, we would really like to be able to make a stronger statement, and therefore, we decided to try to develop some more objective evaluation measures.</Paragraph>
    <Paragraph position="1"> A survey of the literature on evaluation failed to identify an attractive role model. In addition, we found it particularly difficult to obtain a clear estimate of the state-of-the-art.</Paragraph>
    <Paragraph position="2"> In order to address this state of affairs, we decided to try to establish upper and lower bounds on the level of performance that we could expect to obtain. We estimated the lower bound by positing a simple baseline system which ignored context and simply assigned the most likely sense in all cases. Hopefully, most reasonable systems would outperform this system. The upper bound was approximated by trying to estimate the limit of our ability to measure performance. We assumed that this limit was largely dominated by the ability for the human judges to agree with one another.</Paragraph>
    <Paragraph position="3"> The estimate depends very much, not surprisingly, on the particular experimental design. Jorgensen, who was interested in highlighting differences among informants, found a very low estimate (68%), well below the baseline (75%), and also well below the level that Bar-Hillel asserted as not-good-enough. In our own work, we have attempted to highlight agreements, so that there would more dynamic range between the baseline and the limit of our ability to measure performance. In so doing, we were able to obtain a much more usable estimate of (96.8%) by redefining the task from a classification task tO a discrimination task. In addition, we also made use of the constraint that multiple instances of a polysemous word in the same discourse have a very strong tendency to take on the same sense. This constraint will probably prove useful for improving the performance of future word-sense disambiguation algorithms.</Paragraph>
    <Paragraph position="4"> Similar attempts to establish upper and lower bounds on performance have been made in other areas of computational linguistics, specifically part of speech tagging. For that application, it is generally accepted that the baseline part-of-speech tagging performance is about 90% (as estimated by a similar baseline system that ignores context and simply assigns the most likely part of speech to all instances of a word) and that the upper bound (imposed by the limit for judges to agree with one another) is about 95%. Incidentally, most part of speech algorithms are currently performing at or near the limit of our ability to measure performance, indicating that there may be room for refining the experimental conditions along similar lines to what we have done here, in order to improve the dynamic range of the evaluation.</Paragraph>
  </Section>
</Paper>