<?xml version="1.0" standalone="yes"?>
<Paper uid="W93-0101">
  <Title>Word Sense Disambiguation by Human Subjects: Computational and Psycholinguistic Applications</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
3 Preliminary experiments
</SectionTitle>
    <Paragraph position="0"> The project described in this paper began when one of us (Ahlswede) wrote disambiguation programs based on those of Lesk \[1986\] and Ide and Veronis \[1990\] for application in dictionary and corpus research. Lesk claimed 50-70% accuracy on short samples of literary and journalistic input. Ide and Veronis claimed a 90% accuracy rate for their program, although they explained that they had tested it against strongly distinct definitions mainly homographs rather than senses.</Paragraph>
    <Paragraph position="1"> After running the programs on test data containing ambiguities at both homograph and sense level, and evaluating the results, Ahlswede doubted whether, given this subtler mix of ambiguities, even a single human judge would achieve 90% consistency on successive evaluations of the same output; moreover, the consistency among multiple judges might well be much lower. Ahlswede recruited seven colleagues and friends to evaluate the test data, then compared their disambiguations of the test data against each other. The level of agreement averaged only 66% among the various human informants, ranging from 31% to 88% between pairs of informants \[Ahlswede, forthcoming\].</Paragraph>
    <Paragraph position="2"> This figure was based on a simple pairwise comparison strategy. The informants rated each sense definition of a test word with a &amp;quot;1&amp;quot; indicating that it correctly represented the meaning of the word as used in the test text; &amp;quot;-1&amp;quot; if the definition did not correctly represent the meaning; and &amp;quot;0&amp;quot; if for any reason the informant could not decide one way or the other.</Paragraph>
    <Paragraph position="3"> Pairs of informants were then compared by matching their ratings of the sense definitions of each word. The pair were considered to agree on a test word if at least one sense received a &amp;quot;1&amp;quot; from both informants and if no sense receiving a &amp;quot;1&amp;quot; from either informant was given a &amp;quot;-1&amp;quot; by the other.</Paragraph>
    <Paragraph position="4"> This scoring method had the advantage of simplicity, but it did not reflect the agreement implicit in the rejection as well as the selection of senses by both informants. But the relative weight of common rejections and common selections among the senses of a given test word depends on the total number of senses, which varies widely. No discrete-valued scoring mechanism seems able to solve this problem.</Paragraph>
    <Paragraph position="5"> A pairwise scoring procedure that gives much more plausible results is the coefficient of correlation, applied to the parallel evaluations by the informants being compared. It clearly distinguishes the relatively high agreement expected from human subjects from the relatively low agreement predicted for primitive automatic disambiguation systems, and from the more or less random behavior of a control series of random &amp;quot;disambiguations.&amp;quot;  1. hl through h7 are human informants; h6 took the test twice.</Paragraph>
    <Paragraph position="6"> . ml and mla are implementations of Lesk's algorithm. In mla, the test texts were previously disambiguated for part of speech; senses of inappropriate parts of speech were assumed incorrect, and left out of the test data.</Paragraph>
    <Paragraph position="7"> 3. m2 is a spreading activation algorithm related to the Ide-Veronis algorithm.</Paragraph>
    <Paragraph position="8"> 4. al is a control in which all senses of all test words received a &amp;quot;1&amp;quot;. In our first scoring  strategy, al achieved absurdly high scores.</Paragraph>
    <Paragraph position="9"> 5. rand is a control created by randomly scrambling the sequence of answers in one of the human samples.</Paragraph>
    <Paragraph position="10"> These results suggested that a very high accuracy rate is not so much unrealistic as meaningless: which of the human informants should the computer agree with, if the humans cannot agree among themselves? For this reason, the informal experiment has led to the development of a larger and more formal test of human disambiguation performance. The main areas of innovation are (1) a much more systematically designed questionnaire, to be administered to hundreds of subjects rather than only seven, and (2) a user interface to facilitate both the completion of the questionnaire by this large number of human subjects, and our analysis of their performance. The biggest advantage of a computerized interface is that we can study the timing of subjects' responses: valuable information that could not be recorded in the original written test.</Paragraph>
    <Paragraph position="11"> Combined with the user interface, the questionnaire is adapted for administration to human informants, but it can be adapted with little effort for use with dictionary-based disambiguation programs, as was done with its written (but also machine-readable) predecessor.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML