<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-3007">
  <Title>Word Sense Disambiguation for Cross-Language Information Retrieval</Title>
  <Section position="7" start_page="37" end_page="38" type="evalu">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> As suggested by the WSD literature, evaluation of word sense disambiguation systems is not yet standardized (Resnik and Yarowsky, 1997). Some WSD evaluations have been done using the Brown Corpus as training and testing resources and comparing the results against SemCor 3, the sense-tagged version of the Brown Corpus (Agirre and Rigau, 1996; Gonzalo et al., 1998). Others have used common test suites such as the 2094-word line data of Leacock et al.</Paragraph>
    <Paragraph position="1"> (1993). Still others have tended to use their own metrics. We chose an evaluation with a user-based component that allowed a ranked list of sense selection for each target word and enabled a comprehensive comparison between automatic and manual WSD results. In addition we wanted to base the disambiguation matrix on a corpus that we use for retrieval. This approach allows for a much richer evaluation than a simple hit-ormiss test. For vahdation purpose, we will conduct a fully automatic evaluation against SemCor in our future efforts.</Paragraph>
    <Paragraph position="2"> We use in vitro evaluation in this study, i.e.</Paragraph>
    <Paragraph position="3"> the WSD algorithm is tested independent of the retrieval system. The population consists of all the nouns in WordNet, after removal of monoseanous nouns, and after removal of a problematic class of polysemous nouns. 4 We drew a random sample of 87 polysemous nouns 5 from this population.</Paragraph>
    <Paragraph position="4"> In preparation, for each noun in our sample we identified all the documents containing that noun from the Associated Press (AP) newspaper corpus. The testing document set was then formed by randomly selecting 10 documents from the set of identified documents for each of the 87 nouns. In total, there are 867 documents in the 3 SemCor is a semantically sense-tagged corpus comprising approximately 250, 000 words. The reported error rate is around 10% for polysemous words.</Paragraph>
    <Paragraph position="5"> 4 This class of nouns refers to nouns that are in synsets in which they are the sole word, or in synsets whose words were subsets of other synsets for that noun. This situation makes disambiguation extremely problematic. This class of noun will be dealt with in a future version of our algorithm but for now it is beyond the scope of this evaluation.</Paragraph>
    <Paragraph position="6"> 5 A polysemous noun is defined as a noun that belongs to two or more synsets.</Paragraph>
    <Paragraph position="7"> testing set. The training document set consists of all the documents in the AP corpus excluding the above-mentioned 867 documents. For each noun in our sample, we selected all its corresponding WordNet noun synsets and randomly selected 10 sentence occurrences with each from one of the 10 random documents.</Paragraph>
    <Paragraph position="8"> After collecting 87 polysemous nouns with 10 noun sentences each, we had 870 sentences for disambiguation. Four human judges were randomly assigned to two groups with two judges each, and each judge was asked to disambiguate 275 word occurrences out of which 160 were unique and 115 were shared with the other judge in the same group. For each word occurrence, the judge put the target word's possible senses in rank order according to their appropriateness given the context (ties are allowed).</Paragraph>
    <Paragraph position="9"> Our WSD algorithm was also fed with the identical set of 870 word occurrences in the sense prediction phase and produced a ranked hst of senses for each word occurrence.</Paragraph>
    <Paragraph position="10"> Since our study has a matched-group design in which the subjects (word occurrences) receive both the treatments and control, the measurement of variables is on an ordinal scale, and there is no apparently applicable parametric statistical procedure available, two nonparametric procedures -the Friedman two-way analysis of variance and the Spearman rank correlation coefficient -were originally chosen as candidates for the statistical analysis of our results.</Paragraph>
    <Paragraph position="11"> However, the number of ties in our results renders the Spearman coefficient unreliable. We have therefore concentrated on the Friedman analysis of our experimental results. We use the two-alternative test with o~=0.05.</Paragraph>
    <Paragraph position="12"> The first tests of interest were aimed at estabhshing inter-judge reliability across the 115 shared sentenees by each pair of judges. The null hypothesis can be generalized as &amp;quot;There is no difference in judgments on the same word occurrences between two judges in the same group&amp;quot;. Following general steps of conducting a Friedman test as described by Siegel (1956), we cast raw ranks in a two-way table having 2 conditions/columns (K = 2) with each of the human judges in the pair serving as one condition and 365 subjects/rows (N = 365) which are all the senses of the 115 word occurrences that were judged by both human judges. We then ranked  N K Xr 2 df Rejection region Reject H0? First pair of judges 365 2 .003 1 3.84 No Second pair of judges 380 2 2.5289 1 3.84 No  and sense pooling (ct=.05, 2-alt. TesO the scores in each row from 1 to K (in this case K is 2), summed the derived ranks in each column, and calculated X \[ which is .003. For ct=0.05, degrees of freedom df = 1 (df = K -1), the rejection region starts at 3.84. Since .003 is smaller than 3.84, the null hypothesis is not rejected. Similar steps were used for analyzing reliability between the second pair of judges. In both cases, we did not find significant difference between judges (see Figure 3).</Paragraph>
    <Paragraph position="13"> Our second area of interest was the comparison of automatic WSD, manual WSD, and &amp;quot;sense pooling&amp;quot;. Sense pooling equates to no disambiguation, where each sense of a word is considered equally likely (a tie). The null hypothesis (H0) is &amp;quot;There is no difference among manual WSD, automatic WSD, and sense pooling (all the conditions come from the same population)&amp;quot;. The steps for Friedman analysis were similar to what we did for the inter-judge reliability test while the conditions and subjects were changed in each test according to what we would like to compare. Test results are summarized in Figure 4. In the three-way comparison shown in the first row of the table, we rejected H0 so there was at least one condition that was from a different population. By further conducting tests which examined each two of the above three conditions at a time we found that it was sense pooling that came from a different population while manual and automatic WSD were not significantly different. We can therefore conclude that our WSD algorithm is better than no disambiguation.</Paragraph>
  </Section>
class="xml-element"></Paper>