Estimating Upper and Lower Bounds on the Performance of Word-Sense Disambiguation Programs

5. Upper and Lower Bounds

5.1 Lower Bounds

We would be in a better position to address the question of the relative difficulty of "interest" if we could establish rough estimates of the upper and lower bounds on the level of performance that can be expected. We will estimate the lower bound by evaluating the performance of a straw-man system, which ignores context and simply assigns the most likely sense in all cases. One might hope that reasonable systems should generally outperform this baseline system,7 though not all such systems actually do. In fact, Yarowsky (1992) falls below the baseline for one of the twelve words (issue), although perhaps we needn't be too concerned about this one deviation.8

7. In fact, Zernik's 70% figure is probably significantly inferior to the 72% reported by Black and Yarowsky, because Zernik reports precision and recall separately, whereas the others report a single figure of merit which combines both Type I (false rejection) and Type II (false acceptance) errors by reporting precision at 100% recall. Gale et al. show that error rates for 70% recall were half of those for 100% recall on their test sample.

"Let me state rather dogmatically that there exists at this moment no method of reducing the polysemy of the, say, twenty words of an average Russian sentence in a scientific article below a remainder of, I would estimate, at least five or six words with multiple English renderings, which would not seriously endanger the quality of the machine output. Many tend to believe that by reducing the number of initially possible renderings of a twenty word Russian sentence from a few tens of thousands (which is the approximate number resulting from the assumption that each of the twenty Russian words has two renderings on the average, while seven or eight of them have only one rendering) to some eighty (which would be the number of renderings on the assumption that sixteen words are uniquely rendered and four have three renderings apiece, forgetting now about all the other aspects such as change of word order, etc.) the main bulk of this kind of work has been achieved, the remainder requiring only some slight additional effort." (Bar-Hillel, 1960, p. 163)

There are, of course, a number of problems with this estimate of the baseline. First, the baseline system is not operational, at least as we have defined it. Ideally, the baseline system ought to try to estimate the most likely sense for each word in the vocabulary and then assign that sense to each instance of the word in the test set. Unfortunately, since it isn't clear just how this estimation should be accomplished, we decided to "cheat" and let the baseline system peek at the test set and "estimate" the most likely sense for each word as the more frequent sense in the test set. Consequently, the performance of the baseline cannot fall below chance (100/k% for a particular word with k senses).9
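To make the construction concrete, the following is a minimal sketch of such a "peeking" baseline. The function and data names are illustrative, not from the original paper.

    from collections import Counter, defaultdict

    def baseline_accuracy(test_tokens):
        # test_tokens: iterable of (word, sense) pairs observed in the test set.
        # For each word, always guess its most frequent sense in that same
        # test set; per-word accuracy is then max sense count / total count,
        # which can never fall below chance (100/k% for k observed senses).
        counts = defaultdict(Counter)
        for word, sense in test_tokens:
            counts[word][sense] += 1
        return {word: senses.most_common(1)[0][1] / sum(senses.values())
                for word, senses in counts.items()}

    # A word whose senses split 90/10 gets a baseline of 90%:
    tokens = [("suit", "lawsuit")] * 90 + [("suit", "clothing")] * 10
    print(baseline_accuracy(tokens))  # {'suit': 0.9}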
In addition, the baseline system assumes that Type I (false rejection) errors are just as bad as Type II (false acceptance) errors. If one desires extremely high recall and is willing to sacrifice precision in order to obtain it, then it might be sensible to tune a system to produce behavior which might appear to fall below the baseline. We have run into such situations when we have attempted to help lexicographers find extremely unusual events. In such a case, a lexicographer might be quite happy to receive a long list of potential candidates, only a small fraction of which are actually the case of interest. One can come up with quite a number of other scenarios where the baseline performance could be somewhat misleading, especially when there is an unusual trade-off between the cost of a Type I error and the cost of a Type II error.

Nevertheless, the proposed baseline does seem to provide a usable rough estimate of the lower bound on performance. Table 2 shows the baseline performance for each of the twelve words in Table 1. Note that performance is generally above the baseline, as we would hope.

8. Many of the systems mentioned in Table 2, including Yarowsky (1992), do not currently take advantage of the prior probabilities of the senses, so they would be at a disadvantage relative to the baseline if one of the senses had a very high prior, as is the case for the test word issue.

9. In addition, the baseline doesn't deal as well as it could with skewed distributions. One could almost certainly improve the model of the baseline by making use of a notion like entropy that could deal more effectively with skewed distributions. Nevertheless, we will stick with our simpler notion of the baseline for expository convenience.

As mentioned previously, the test words in Tables 1 and 2 were selected from the literature on polysemy and therefore tend to focus on the more difficult cases. In another experiment, we selected a random sample of 97 words; 67 of them were unambiguous and therefore had a baseline performance of 100%.10 The remaining thirty words are listed along with the number of senses and baseline performance: virus (2, 98%), device (3, 97%), direction (2, 96%), reader (2, 96%), core (3, 94%), hull (2, 94%), right (5, 94%), proposition (2, 89%), deposit (2, 88%), hour (4, 87%), path (2, 86%), view (3, 86%), pyramid (3, 82%), antenna (2, 81%), trough (3, 77%), tyranny (2, 75%), figure (6, 73%), institution (4, 71%), crown (4, 64%), drum (2, 63%), pipe (4, 60%), processing (2, 59%), coverage (2, 58%), execution (2, 57%), rain (2, 57%), interior (4, 56%), campaign (2, 51%), output (2, 51%), gin (3, 50%), drive (3, 49%). In studying these 97 words, we found that the average baseline performance is much higher than we might have guessed (93% averaged over tokens, 92% averaged over types). In particular, note that this baseline is well above the 75% figure that we associated with Bar-Hillel above.

Of course, the large number of unambiguous words contributes greatly to the baseline. If we exclude the unambiguous words, then the average baseline performance falls to 81% averaged over tokens and 75% averaged over types.

10. The 67 unambiguous words were: acid, annexation, benzene, berry, capacity, cereal, clock, coke, colon, commander, consort, contract, cruise, cultivation, delegate, designation, dialogue, disaster, equation, esophagus, fact, fear, fertility, flesh, fox, gold, interface, interruption, intrigue, journey, knife, label, landscape, laurel, lb, liberty, lily, locomotion, lynx, marine, memorial, menstruation, miracle, monasticism, mountain, nitrate, orthodoxy, pest, planning, possibility, pottery, projector, regiment, relaxation, reunification, shore, sodium, specialty, stretch, summer, testing, tungsten, universe, variant, vigor, wire, worship.
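As a concrete illustration of the two ways of averaging, here is a small sketch; the per-word token counts below are invented, since the paper does not report them.

    def average_baselines(word_stats):
        # word_stats: list of (token_count, baseline_accuracy) pairs, one per
        # word type. Token-weighted: frequent words dominate the average.
        # Type-weighted: every word counts equally, so rare ambiguous words
        # pull the figure down more.
        total_tokens = sum(n for n, _ in word_stats)
        token_avg = sum(n * acc for n, acc in word_stats) / total_tokens
        type_avg = sum(acc for _, acc in word_stats) / len(word_stats)
        return token_avg, type_avg

    # Invented example: one frequent, nearly unambiguous word and one rare,
    # genuinely ambiguous one.
    print(average_baselines([(1000, 0.98), (50, 0.49)]))  # ~(0.957, 0.735)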
5.2 Upper Bounds

We will attempt to estimate an upper bound on performance by estimating the ability of human judges to agree with one another (or with themselves). We will find, not surprisingly, that the estimate varies widely depending on a number of factors, especially the definition of the task. Jorgensen (1990) has collected some interesting data that may be relevant for estimating the agreement among judges. As part of her dissertation under George Miller at Princeton, she was interested in assessing "the extent of psychologically real polysemy in the mental lexicon for nouns." Her experiment was designed to study one of the more commonly employed methods in lexicography for writing dictionary definitions, namely the use of citation indexes. She was concerned that lexicographers and computational linguists have tended to depend too much on the intuitions of a single informant. Not surprisingly, she found considerable variation across judgements, just as she had suspected. This finding could have serious implications for evaluation: how do we measure performance if we can't depend on the judges?

Jorgensen selected twelve high-frequency nouns at random from the Brown Corpus: six were highly polysemous (head, life, world, way, side, hand) and six were less so (fact, group, night, development, something, war). Sentences containing each of these words were drawn from the Brown Corpus and typed on filing cards. Nine subjects were then asked to cluster a packet of these filing cards by sense. A week or two later, the same nine subjects were asked to repeat the experiment, but this time they were given access to the dictionary definitions.

Jorgensen reported performance in terms of the "Agreement-Disagreement" (A-D) ratio (Shipstone, 1960) for each subject and each of the twelve test words. We have found it convenient to transform the A-D ratio into a quantity which we call the percent agreement: the number of observed agreements over the total number of possible agreements. The grand mean percent agreement over all subjects and words is only 68%. In other words, at least under these conditions, there is considerable variation across judgements, perhaps so much so that it would be hard to show that a proposed system was outperforming the baseline system (75%, averaged over ambiguous types). Moreover, if we accept Bar-Hillel's argument that 75% is not good enough, then it would be hard to show that a system was doing well enough.
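One way to compute such a percent-agreement figure from raw card-sortings is sketched below. Since each judge clusters the cards independently (so cluster labels are not comparable across judges), agreement is counted over pairs of cards; this pairwise scheme is one natural reading of "observed agreements over possible agreements," not necessarily Jorgensen's or Shipstone's exact procedure.

    from itertools import combinations

    def percent_agreement(partitions):
        # partitions: one sense-clustering per judge, each a list giving the
        # cluster id of every card (ids need not match across judges).
        # Two judges "agree" on a pair of cards when both place the pair in
        # the same cluster, or both place it in different clusters.
        agree = possible = 0
        for p, q in combinations(partitions, 2):          # every judge pair
            for i, j in combinations(range(len(p)), 2):   # every card pair
                possible += 1
                agree += (p[i] == p[j]) == (q[i] == q[j])
        return agree / possible

    # Three hypothetical judges sorting the same five cards:
    judges = [[0, 0, 1, 1, 0],
              [2, 5, 5, 5, 2],
              [7, 7, 8, 7, 7]]
    print(percent_agreement(judges))  # 16/30 ~ 0.53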