<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0209"> <Title>Selectional Preference and Sense Disambiguation</Title> <Section position="7" start_page="53" end_page="55" type="evalu"> <SectionTitle> 5 Evaluation </SectionTitle> <Paragraph position="0"> Task and materials. Test and training materials were derived from the Brown corpus of American English, all of which has been parsed and manually verified by the Penn T~eebank project (Marcus et al., 1993) and parts of which have been manually sense-tagged by the WordNet group (Miller et al., 1993). A parsed, sense-tagged corpus was obtained by mergingthe WordNet sense-tagged corpus (approximately 200,000 words of source text from the Brown corpus, distributed across genres) with the corresponding Penn Treebank parses, a The rest of the Brown corpus (approximately 800,000 words of source text) remained as a parsed, but not sensetagged, training set.</Paragraph> <Paragraph position="1"> intervention for only 3 of 103 files.</Paragraph> <Paragraph position="2"> The test set for the verb-object relationship was constructed by first training a selectional preference model on the training corpus, using the T~eebank's tgrep utility to extract verb-object pairs from parse trees. The 100 verbs that select most strongly for their objects were identified, excluding verbs appearing only once in the training corpus; test instances of the form (verb, object, correct sense) were then extracted from the merged test corpus, including all triples where verb was one of the 100 test verbs. 4 Evaluation materials were obtained in the same manner for several other surface syntactic reiationships, including verb-subject (John ~ admires), adjective-noun (tall =~ building), modifier-head (river =~ bank), and head-modifier (river ~= bank).</Paragraph> <Paragraph position="3"> Baseline. Following Miller et al. (1994), disambiguation by random choice was used as a baseline: if a noun has one sense, use it; otherwise select at random among its senses.</Paragraph> <Paragraph position="4"> Results. Since both the algorithm and the base-line may involve random choices, evaluation involved multiple runs with different random seeds. Table 2 summarizes the results, taken over I0 runs, considering only ambiguous test cases. All differences between the means for algorithm and baseline were statistically significant.</Paragraph> <Paragraph position="5"> Discussion. The results of the experiment show that disambignation using automatically acquired selectional constraints leads to performance significantly better than random choice. Not surprisingly, though, the results are far from what one might expect to obtain with supervised training. In that respect, the most direct point of comparison is the performance of Miller et al.'s (1994) frequency heuristic always choose the most frequent sense of a word as evaluated using the full sense-tagged corpus, including nouns, verbs, adjectives, and adverbs. For ambiguous words, they report 58.2% correct, as compared to a random baseline of 26.8%.</Paragraph> <Paragraph position="6"> Crucially, however, the frequency heuristic requires sense-tagged training data (Miller et al. evaluated via cross-validation), and this paper starts from the assumption that such data are unavailable. A fairer comparison, therefore, considers al- null ternative unsupervised algorithms- though unfortunately the literature contains more proposed algorithms than quantitative evaluations of those algorithms. 
<Paragraph position="6"> One experiment for which results were reported was conducted by Cowie et al. (1992); their method used a stochastic search procedure to maximize the overlap of dictionary definitions (LDOCE) for alternative senses of words co-occurring in a sentence. They report an accuracy of 72% for disambiguation to the homograph level, and 47% for disambiguation to the sense level. Since the task here involved WordNet sense distinctions, which are rather fine-grained, the latter value is more appropriate for comparison. Their experiment was more general in that they did not restrict themselves to nouns; on the other hand, their test set involved disambiguating words taken from full sentences, so the percentage correct may have been improved by the presence of unambiguous words.</Paragraph>
<Paragraph position="7"> Sussna (1993) has also looked at unsupervised disambiguation of nouns using WordNet. Like Cowie et al.'s method, his algorithm optimizes a measure of semantic coherence over an entire sentence, in this case pairwise semantic distance between nouns in the sentence as measured using the noun taxonomy. Comparison of results is somewhat difficult, however, for two reasons. First, Sussna used an earlier version of WordNet (version 1.2) with a significantly smaller noun taxonomy (35K nodes vs. 49K nodes). Second, and more significantly, in creating the test data, Sussna's human sense-taggers (tagging articles from the Time IR test collection) were permitted to tag a noun with as many senses as they felt were &quot;good,&quot; rather than making a forced choice; Sussna develops a scoring metric based on that fact rather than requiring exact matches to a single best sense. This is quite a reasonable move (see discussion below), but unfortunately not an option in the present experiment. Nonetheless, some comparison is possible, since he reports a &quot;% correct,&quot; apparently treating a sense assignment as correct if any of the &quot;good&quot; senses is chosen -- his experiments have a lower bound (chance) of about 40% correct, with his algorithm performing at 53-55%, considering only ambiguous cases.</Paragraph>
<Paragraph position="8"> The best results reported for an unsupervised sense disambiguation method are those of Yarowsky (1992), who uses evidence from a wider context (a window of 100 surrounding words) to build up a co-occurrence model using classes from Roget's thesaurus. He reports accuracy figures in the 72-99% range (mean 92%) in disambiguating test instances involving twelve &quot;interesting&quot; polysemous words. As in the experiments by Cowie et al., the choice of coarser distinctions presumably accounts in part for the high accuracy. By way of comparison, some words in Yarowsky's test set would require choosing among ten senses in WordNet, as compared to a maximum of six using Roget's thesaurus categories; the mean level of polysemy for the tested words is a six-way distinction in WordNet as compared to a three-way distinction in Roget's thesaurus.
As an aside, a rich taxonomy like WordNet permits a more continuous view of the sense vs. homograph distinction. For example, town has three senses in WordNet, corresponding to an administrative district, a geographical area, and a group of people. Given town as the object of leave, selectional preference will produce a tie between the first two senses, since both inherit their score from a common ancestor, (location).
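The tie described in this example can be made concrete with WordNet's hypernym structure. The following is an illustrative sketch, assuming NLTK's WordNet interface rather than the paper's own tools; the modern sense inventory for town differs from the three senses cited above, and the class name mentioned in the comments (location.n.01) is an assumed stand-in for the paper's (location) class, not a reproduction of its result.

    from nltk.corpus import wordnet as wn

    # For each pair of noun senses of "town", find the lowest common hypernym:
    # the taxonomy node from which both senses would inherit a single
    # selectional-association score.
    senses = wn.synsets('town', pos=wn.NOUN)
    for i, s1 in enumerate(senses):
        for s2 in senses[i + 1:]:
            common = s1.lowest_common_hypernyms(s2)
            print(s1.name(), s2.name(), [c.name() for c in common])

    # If the class receiving the highest selectional-association score for the
    # predicate (for the object of "leave", something like location.n.01)
    # dominates two senses, both senses inherit that score and the
    # disambiguator reports a tie at that coarser class.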
In effect, automatically selecting a class higher in the taxonomy as the one with the highest score provides the same coarse category that might be provided by a homograph/sense distinction in another setting. The choice of coarser category varies dynamically with the context: as the argument in rural town, the same two senses still tie, but with (region) (a subclass of (location)) as the common ancestor that determines the score.</Paragraph>
<Paragraph position="9"> In other work, Yarowsky (1993) has shown that local collocational information, including selectional constraints, can be used to great effect in sense disambiguation, though his algorithm requires supervised training. The present work can be viewed as an attempt to take advantage of the same kind of information, but in an unsupervised setting.</Paragraph>
</Section> </Paper>