<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0804">
  <Title>SENSEVAL-3 TASK: Word-Sense Disambiguation of WordNet Glosses</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Submissions
</SectionTitle>
    <Paragraph position="0"> Seven teams participated in the task with one team submitting two runs and one team submitting three runs. A submission contained an identifier (the part of speech of the gloss and its synset number) and a WordNet sense for each content word or phrase identified by the system. The answer key contains part of speech/synset identifier, the XWN quality assignment, the lemma and the word form from the XWN data, and the WordNet sense. The scoring program (a Perl script) stored the answers in three hashes according to quality (&amp;quot;gold&amp;quot;, &amp;quot;silver&amp;quot;, and &amp;quot;normal&amp;quot;) and then also stored the system's answers in a hash. The program then proceeded through the &amp;quot;gold&amp;quot; answers and determined if a system's answers included a match for that answer, equaling either the (lemma, sense) or (word form, sense). No system submitted more than one sense for each of its word forms. An exact match received a score of 1.0. If a 2The answer key contains all assignments, so it is possible that runs can be analyzed with these other sense assignments with a voting system. However, such an analysis has not yet been performed.</Paragraph>
    <Paragraph position="1"> system returned either the lemma or the word form, but had assigned an incorrect sense, the item was counted as attempted.</Paragraph>
    <Paragraph position="2"> Precision was computed as the number correct divided by the number attempted. Recall was computed as the number correct divided by the total number of &amp;quot;gold&amp;quot; items. The percent attempted was computed as the number attempted divided by the total number of &amp;quot;gold&amp;quot; items. Results for all runs are shown in Table 2.</Paragraph>
    <Paragraph position="3">  Systems 04a and 06b used the part of speech tags available in the XWN files, while the other runs did not.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Discussion
</SectionTitle>
    <Paragraph position="0"> During discussions on the SENSEVAL-3 mailing list and in interchanges assessing the scoring of the systems, several issues of some importance arose.</Paragraph>
    <Paragraph position="1"> Most of these concerned the nature of the XWN annotation process and the &amp;quot;correctness&amp;quot; of the &amp;quot;gold&amp;quot; quality assignments.</Paragraph>
    <Paragraph position="2"> Since glosses (or definitions) are only &amp;quot;sentence&amp;quot; fragments, parsing them poses some inherent difficulties. In theory, a proper lexicographically-based definition is one that contains a genus term (hypernym or superordinate) and differentiae. A gloss' hypernym is somewhat easily identified as the head of the first phrase, particularly in noun and verb definitions. Since most WordNet synsets have a hypernym, a heuristic for disambiguating the head of the first phrase would be to use the hypernym as the proper disambiguated sense. And, indeed, the instructions for the task encouraged participants to make use of WordNet relations in their disambiguation.</Paragraph>
    <Paragraph position="3"> However, the XWN annotators were not given this heuristic, but rather were presented with the set of WordNet senses without awareness of the WordNet relations. As a result, many glosses had &amp;quot;gold&amp;quot; assignments that seemed incorrect when considering WordNet's own hierarchy. For example, naught is defined as &amp;quot;complete failure&amp;quot;; in WordNet, its hypernym failure is sense 1 (&amp;quot;an act that fails&amp;quot;), but the XWN annotators tagged it with sense 2 (&amp;quot;an event that does not accomplish its intended purpose&amp;quot;).</Paragraph>
    <Paragraph position="4"> To investigate the use of WordNet relations heuristics, we considered a set of 313 glosses containing 867 &amp;quot;gold&amp;quot; assignments which team 06 submitted as highly reliant on these relations. As shown in Table 3 (scored on 8944 glosses with 14312 &amp;quot;gold&amp;quot; assignments), precision scores changed most for 03 (0.020), 06b (0.017), and 04a (0.016); these runs had correspondingly much lower scores for the 313 glosses in this set (results not shown).</Paragraph>
    <Paragraph position="5"> These differences do not appear to be significant. A more complete assessment of the significance of WordNet relations in disambiguation would require a more complete identification of glosses where systems relied on such information.</Paragraph>
    <Paragraph position="6">  Further discussion with members of the XWN project about the annotation process revealed some factors that should be taken into account when assessing the various systems' performances. Firstly, the annotations of the 9257 glosses with &amp;quot;gold&amp;quot; assignments were annotated using three different methods. The first group of 1032 glosses were fully hand-tagged by two graduate students, with 80 percent agreement and with the project leader choosing a sense when there was disagreement.</Paragraph>
    <Paragraph position="7"> For the remaining glosses in WordNet, two automated disambiguation programs were run. When both programs agreed on a sense, they were given a &amp;quot;silver&amp;quot; quality. In those glosses for which all but one or two words had been assigned a &amp;quot;silver&amp;quot; quality, the one or two words were hand-tagged by a graduate student, without any interannotator check or review.</Paragraph>
    <Paragraph position="8"> There are 4077 noun glosses in this second set.</Paragraph>
    <Paragraph position="9"> A third set, the remaining 4738 among the test set, were glosses for which all the words but one had been assigned a &amp;quot;silver&amp;quot; quality. The single word was then hand-tagged by a graduate student, and in some cases by the project leader (particularly when a word had been mistagged by the Brill tagger).</Paragraph>
    <Paragraph position="10"> To assess the effect of these three different styles of annotation, we ran the scoring program, restricting the items scored to those in each of the three annotation sets. The scores were changed much more significantly for the various teams for the different sets. For the first set, precision was down approximately 0.07 for three runs, with much lower changes for the other runs. For the second set, precision was up approximately 0.075 for two runs, down approximately 0.08 for two runs, and relatively unchanged for the remaining runs. For the third set, there was relatively little changes in the precision for all runs (with a maximum change of 0.03).</Paragraph>
  </Section>
class="xml-element"></Paper>