<?xml version="1.0" standalone="yes"?> <Paper uid="P92-1032"> <Title>Estimating Upper and Lower Bounds on the Performance of Word-Sense Disambiguation Programs</Title> <Section position="1" start_page="0" end_page="250" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> We have recently reported on two new word-sense disambiguation systems, one trained on bilingual material (the Canadian Hansards) and the other trained on monolingual material (Roget's Thesaurus and Grolier's Encyclopedia). After using both the monolingual and bilingual classifiers for a few months, we have convinced ourselves that the performance is remarkably good. Nevertheless, we would really like to be able to make a stronger statement, and therefore we decided to try to develop some more objective evaluation measures. Although there is a fair amount of literature on sense disambiguation, it offers little guidance on how we might establish the success or failure of a proposed solution such as the two systems mentioned above. Many papers avoid quantitative evaluation altogether, because it is so difficult to come up with credible estimates of performance.</Paragraph> <Paragraph position="1"> This paper attempts to establish upper and lower bounds on the level of performance that can be expected in an evaluation. An estimate of the lower bound of 75% (averaged over ambiguous types) is obtained by measuring the performance of a baseline system that ignores context and simply assigns the most likely sense in all cases. An estimate of the upper bound is obtained by assuming that our ability to measure performance is largely limited by our ability to obtain reliable judgments from human informants. Not surprisingly, the upper bound is very dependent on the instructions given to the judges. Jorgensen, for example, suspected that lexicographers tend to depend too much on judgments by a single informant, and indeed she found considerable variation over judgments (only 68% agreement). In our own experiments, we set out to find word-sense disambiguation tasks where the judges agree often enough that we can show they outperform the baseline system. Under quite different conditions, we have found 96.8% agreement over judges.</Paragraph>
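To make the two bounds concrete, here is a minimal sketch (our illustration, not code from the paper) of how they can be computed from hand-labelled data: the lower bound is the accuracy of always guessing each ambiguous type's most frequent sense, and the upper bound is approximated by the percent agreement among human judges. The data structures (a dict of hand-assigned senses per ambiguous word, and one label sequence per judge) are assumptions made for the example.

from collections import Counter
from itertools import combinations

def baseline_accuracy(labelled):
    # Lower bound: ignore context and always assign the most likely (most frequent)
    # sense of each ambiguous type; average the resulting accuracy over types.
    per_type = []
    for word, senses in labelled.items():
        top_count = Counter(senses).most_common(1)[0][1]
        per_type.append(top_count / len(senses))
    return sum(per_type) / len(per_type)

def judge_agreement(judgements):
    # Upper bound proxy: percent agreement between judges, counted over all
    # pairs of judges and all instances they labelled in common.
    agree = total = 0
    for a, b in combinations(judgements, 2):
        agree += sum(x == y for x, y in zip(a, b))
        total += min(len(a), len(b))
    return agree / total

For instance, baseline_accuracy({'sentence': ['judicial'] * 3 + ['syntactic']}) is 0.75, the kind of figure reported above as the lower bound.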
<Paragraph position="2"> 1. Introduction: Using Massive Lexicographic Resources Word-sense disambiguation is a long-standing problem in computational linguistics (e.g., Kaplan (1950), Yngve (1955), Bar-Hillel (1960), Masterson (1967)), with important implications for a number of practical applications including text-to-speech (TTS), machine translation (MT), information retrieval (IR), and many others. The recent interest in computational lexicography has fueled a large body of work on this 40-year-old problem, e.g., Black (1988), Brown et al. (1991), Choueka and Lusignan (1985), Clear (1989), Dagan et al. (1991), Gale et al. (to appear), Hearst (1991), Lesk (1986), Smadja and McKeown (1990), Walker (1987), Veronis and Ide (1990), Yarowsky (1992), Zernik (1990, 1991). Much of this work offers the prospect that a disambiguation system might be able to take unrestricted text as input and tag each word with its most likely sense with reasonable accuracy and efficiency, just as part-of-speech taggers (e.g., Church (1988)) can now take unrestricted text as input and assign each word its most likely part of speech with reasonable accuracy and efficiency.</Paragraph> <Paragraph position="3"> The availability of massive lexicographic databases offers a promising route to overcoming the knowledge-acquisition bottleneck. More than thirty years ago, Bar-Hillel (1960) predicted that it would be &quot;futile&quot; to write expert-system-like rules by hand (as they had been doing at Georgetown at the time) because there would be no way to scale up such rules to cope with unrestricted input. Indeed, it is now well known that expert-system-like rules can be notoriously difficult to scale up, as Small and Rieger (1982) and many others have observed: &quot;The expert for THROW is currently six pages long ... but it should be 10 times that size.&quot; Bar-Hillel was very early in realizing the scope of the problem; he observed that people have a large set of facts at their disposal, and it is not obvious how a computer could ever hope to gain access to this wealth of knowledge.</Paragraph> <Paragraph position="4"> &quot;'But why not envisage a system which will put this knowledge at the disposal of the translation machine?' Understandable as this reaction is, it is very easy to show its futility. What such a suggestion amounts to, if taken seriously, is the requirement that a translation machine should not only be supplied with a dictionary but also with a universal encyclopedia. This is surely utterly chimerical and hardly deserves any further discussion. Since, however, the idea of a machine with encyclopedic knowledge has popped up also on other occasions, let me add a few words on this topic. The number of facts we human beings know is, in a certain very pregnant sense, infinite.&quot; (Bar-Hillel, 1960) Ironically, much of the research cited above is taking exactly the approach that Bar-Hillel ridiculed as utterly chimerical and hardly deserving of any further discussion. Back in 1960, it may have been hard to imagine how it would be possible to supply a machine with both a dictionary and an encyclopedia. But much of the recent work cited above goes much further; not only does it supply a machine with a dictionary and an encyclopedia, but with many other extensive reference works as well, including Roget's Thesaurus and numerous large corpora. Of course, we are using these reference works in a very superficial way; we are certainly not suggesting that the machine should attempt to solve the &quot;AI Complete&quot; problem of &quot;understanding&quot; these reference works.</Paragraph> <Paragraph position="5"> 2. A Brief Summary of Our Previous Work Our own work has made use of many of these lexical resources. In particular, Gale et al. (to appear) achieved considerable progress by using well-understood statistical methods and very large datasets of tens of millions of words of parallel English and French text (e.g., the Canadian Hansards). By aligning the text as we have, we were able to collect a large set of examples of a polysemous word (e.g., sentence) in each of its senses (e.g., judicial sentence vs. syntactic sentence), by extracting instances from the corpus that were translated one way or the other (e.g., peine or phrase). These data sets were then analyzed using well-understood Bayesian discrimination methods, which have been used very successfully in many other applications, especially author identification (Mosteller and Wallace, 1964, section 3.1) and information retrieval (IR) (van Rijsbergen, 1979, chapter 6; Salton, 1989, section 10.3), though their application to word-sense disambiguation is novel.</Paragraph>
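The data-collection step described above can be pictured with a small sketch (ours, under simplifying assumptions; it is not the authors' alignment code): for each aligned sentence pair containing the ambiguous English word, the French translation found in the pair is used as the sense tag, following the sentence/peine/phrase example in the text. The corpus representation and the translation-to-sense mapping are assumptions made for the illustration.

# Hypothetical mapping from French translations to senses, as in the example above.
SENSE_OF_TRANSLATION = {"peine": "judicial", "phrase": "syntactic"}

def collect_instances(aligned_corpus, target="sentence", window=100):
    # aligned_corpus: iterable of (english_tokens, french_tokens) pairs from the
    # aligned bilingual text. Returns (sense, context) training examples, where the
    # context is up to `window` words surrounding the target word.
    instances = []
    for english, french in aligned_corpus:
        if target not in english:
            continue
        translation = next((f for f in french if f in SENSE_OF_TRANSLATION), None)
        if translation is None:
            continue  # the target was translated some other way; skip this pair
        i = english.index(target)
        half = window // 2
        context = english[max(0, i - half):i] + english[i + 1:i + 1 + half]
        instances.append((SENSE_OF_TRANSLATION[translation], context))
    return instances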
<Paragraph position="6"> In author identification and information retrieval, it is customary to split the discrimination process into a training phase and a testing phase. During the training phase, we are given two (or more) sets of documents and are asked to construct a discriminator which can distinguish between the two (or more) classes of documents. These discriminators are then applied to new documents during the testing phase. In the author identification task, for example, the training set consists of several documents written by each of the two (or more) authors. The resulting discriminator is then tested on documents whose authorship is disputed. In the information retrieval application, the training set consists of a set of one or more relevant documents and a set of zero or more irrelevant documents. The resulting discriminator is then applied to all documents in the library in order to separate the more relevant ones from the less relevant ones.</Paragraph> <Paragraph position="7"> There is an embarrassing wealth of information in the collection of documents that could be used as the basis for discrimination. It is common practice to treat documents as &quot;merely&quot; a bag of words, and to ignore much of the linguistic structure, especially dependencies on word order and correlations between pairs of words. In other words, one assumes that there are two (or more) sources of word probabilities: rel and irrel in the IR application, and author_1 and author_2 in the author identification application. During the training phase, we attempt to estimate Pr(w | source) for all words w in the vocabulary and all sources. Then during the testing phase, we score all documents as follows and select high-scoring documents as being relatively likely to have been generated by the source of interest.</Paragraph> <Paragraph position="8"> \prod_{w \in doc} Pr(w | rel) / Pr(w | irrel) (Information Retrieval); \prod_{w \in doc} Pr(w | author_1) / Pr(w | author_2) (Author Identification)</Paragraph> <Paragraph position="9"> In the sense disambiguation application, the 100-word context surrounding an instance of a polysemous word (e.g., sentence) is treated very much like a document.[1] \prod_{w \in context} Pr(w | sense_1) / Pr(w | sense_2) (Sense Disambiguation) That is, during the testing phase, we are given a new instance of a polysemous word, e.g., sentence, and asked to assign it to one or more senses. We score the words in the 100-word context using the formula given above, and assign the instance to sense_1 if the score is large. [1] It is common to use very small contexts (e.g., 5 words) based on the observation that people seem to be able to disambiguate word senses based on very little context. We have taken a different approach. Since we have been able to find useful information out to 100 words (and measurable information out to 10,000 words), we feel we might as well make use of the larger contexts. This task is very difficult for the machine; it needs all the help it can get. The conditional probabilities, Pr(w | sense), are determined during the training phase by counting the number of times that each word in the vocabulary was found near each sense of the polysemous word (and then smoothing these estimates in order to deal with sparse-data problems). See Gale et al. (to appear) for further details.</Paragraph>
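The training and testing computation described above amounts to a naive Bayes style classifier over the words in the context. The sketch below is our illustration rather than a reproduction of the Gale et al. system; in particular, add-one smoothing merely stands in for whatever smoothing the authors applied to the sparse counts, and the vocabulary is assumed to be given.

import math
from collections import Counter

def train(instances, vocab):
    # instances: (sense, context_words) pairs collected during the training phase.
    # Estimates Pr(w | sense) by counting words near each sense, with add-one
    # smoothing (a placeholder for the paper's unspecified smoothing).
    counts = {}
    for sense, context in instances:
        counts.setdefault(sense, Counter()).update(w for w in context if w in vocab)
    probs = {}
    for sense, c in counts.items():
        total = sum(c.values()) + len(vocab)
        probs[sense] = {w: (c[w] + 1) / total for w in vocab}
    return probs

def score(context, probs, sense_1, sense_2):
    # Sum of log Pr(w | sense_1) / Pr(w | sense_2) over the (e.g. 100-word) context;
    # the log of the product formula given above.
    return sum(math.log(probs[sense_1][w] / probs[sense_2][w])
               for w in context if w in probs[sense_1])

def disambiguate(context, probs, sense_1, sense_2):
    # Assign the instance to sense_1 if the score is large (here: positive).
    return sense_1 if score(context, probs, sense_1, sense_2) > 0 else sense_2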
<Paragraph position="10"> At first, we thought that the method was completely dependent on the availability of parallel corpora for training. This has been a problem, since parallel text remains somewhat difficult to obtain in large quantity, and what little is available is often fairly unbalanced and unrepresentative of general language. Moreover, the assumption that differences in translation correspond to differences in word sense has always been somewhat suspect. Recently, Yarowsky (1992) has found a way to extend our use of the Bayesian techniques by training on Roget's Thesaurus (Chapman, 1977)[2] and Grolier's Encyclopedia (1991) instead of the Canadian Hansards, thus circumventing many of the objections to our use of the Hansards. Yarowsky (1992) takes as input a 100-word context surrounding a polysemous word and scores each of the 1042 Roget Categories by:</Paragraph> </Section> </Paper>