Word Sense Disambiguation Using a Second Language Monolingual Corpus

5. Evaluation

Two measurements, applicability and precision, are used to evaluate the performance of the algorithm. The applicability (coverage) denotes the proportion of cases for which the model performed a selection, i.e., those cases for which the bound B passed the threshold. The precision denotes the proportion of cases for which the model performed a correct selection, out of all the applicable cases.

We compare the precision of our method, which we term TWS (for Target Word Selection), with that of the Word Frequencies procedure, which always selects the most frequent target word. In other words, the Word Frequencies method prefers the alternative that has the highest a priori probability of appearing in the target language corpus. This naive "straw man" is less sophisticated than other methods suggested in the literature, but it is useful as a common benchmark, since it can be easily implemented. The success rate of the Word Frequencies procedure can serve as a measure of the degree of lexical ambiguity in a given set of examples, and thus different methods can be partly compared by their degree of success relative to this procedure.

Out of the 103 ambiguous Hebrew words, for 33 the bound B did not pass the threshold, yielding an applicability of 68%. The remaining 70 examples were distributed according to Table 2. Thus the precision of the statistical model was 91% (64/70),[10] whereas relying just on Word Frequencies yields 63% (44/70), an improvement of 28%. The table demonstrates that our algorithm corrects 22 erroneous decisions of the Word Frequencies method, while making only 2 errors that the Word Frequencies method translates correctly. This implies that with high confidence our method greatly improves on the Word Frequencies method.

[10] An a posteriori observation showed that in three of the six errors the selection of the model was actually acceptable, and the a priori judgment of the human translator was too restrictive. For example, in one of these cases the statistics selected the expression 'to begin the talks,' whereas the human translator regarded this expression as incorrect and selected 'to start the talks.' If we consider these cases as correct, there are only three selection errors, giving 96% precision.

The number of Hebrew examples is large enough to permit a meaningful analysis of the statistical significance of the results. By computing confidence intervals for the distribution of proportions, we claim that with 95% confidence our method succeeds in at least 86% of the applicable examples. This means that though the figure of 91% might be due to a lucky selection of the random examples, there is only a 5% chance that the real figure is less than 86% (for the given domain and corpus). The confidence interval was computed as follows:

$$\hat{p} - Z_{1-\alpha}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \frac{64}{70} - 1.65\sqrt{\frac{\frac{64}{70}\cdot\frac{6}{70}}{70}} \approx 0.86,$$

where $\alpha = 0.05$ and the variance is estimated by $\hat{p}(1-\hat{p})/n$.

With the same confidence, our method improves the Word Frequencies method by at least 18% (relative to the actual improvement of 28% in the given test set). Let $p_1$ be the proportion of cases for which our method succeeds and the Word Frequencies method fails ($p_1 = 22/70$), and let $p_2$ be the proportion of cases for which the Word Frequencies method succeeds and ours fails ($p_2 = 2/70$). The confidence interval is for the difference of proportions in a multinomial distribution and is computed as follows:

$$(\hat{p}_1 - \hat{p}_2) - Z_{1-\alpha}\sqrt{\frac{\hat{p}_1 + \hat{p}_2 - (\hat{p}_1 - \hat{p}_2)^2}{n}} = \frac{22}{70} - \frac{2}{70} - 1.65\sqrt{\frac{\frac{22}{70} + \frac{2}{70} - \left(\frac{20}{70}\right)^2}{70}} \approx 0.18.$$
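Both interval computations can be reproduced with a short script. This is an illustrative sketch, not the authors' code; it hard-codes the paper's one-sided 95% critical value of 1.65 and its normal-approximation variance estimates.

```python
import math

Z = 1.65  # one-sided 95% critical value used in the paper (alpha = 0.05)

def precision_lower_bound(successes, n, z=Z):
    """Lower confidence bound for a single proportion (normal approximation)."""
    p = successes / n
    return p - z * math.sqrt(p * (1 - p) / n)

def improvement_lower_bound(n1, n2, n, z=Z):
    """Lower bound on p1 - p2 for multinomial counts n1, n2 out of n."""
    p1, p2 = n1 / n, n2 / n
    var = (p1 + p2 - (p1 - p2) ** 2) / n  # Var(p1_hat - p2_hat), multinomial
    return (p1 - p2) - z * math.sqrt(var)

print(round(precision_lower_bound(64, 70), 2))       # 0.86
print(round(improvement_lower_bound(22, 2, 70), 2))  # 0.18
```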
Out of the 54 ambiguous German words, for 27 the bound B did not pass the threshold (an applicability of 50%). The remaining 27 examples were distributed according to Table 3. Thus, the precision of the statistical model was 78% (21/27), whereas relying on Word Frequencies yields only 56% (15/27); the table shows that our method corrects six erroneous decisions of the Word Frequencies method, without causing any new errors. We attribute the lower success rate for the German examples to the fact that they were not restricted to topics that are well represented in the corpus. This poor correspondence between the training and testing texts is also reflected in the low precision of the Word Frequencies method. This means that the a priori probability of the target words, as estimated from the training corpora, provides a very poor prediction of the correct selection in the test examples. Relative to this a priori baseline, the precision of our method is still 22% higher.

5.1 Additional Results

Recently, Dagan, Marcus, and Markovitch have implemented a variant of the disambiguation method of the current paper. This variant was developed for evaluating a method that estimates the probability of word combinations that do not occur in the training corpus (Dagan, Marcus, and Markovitch 1993). In this section we quote their results, which provide additional evidence for the effectiveness of the TWS method.

The major difference between the TWS method, as presented in this paper, and the variant described by Dagan, Marcus, and Markovitch (1993), which we term TWS′, is that the latter does not use any parsing for collecting the statistics from the corpus. Instead, the counts of syntactic tuples are approximated by counting co-occurrences of the given words of the tuple within a short distance in a sentence. The approximation takes into account the relative order between the words of the tuple, such that occurrences of a certain syntactic relation are approximated only by word co-occurrences that preserve the most frequent word order for that relation (e.g., an adjective precedes the noun it modifies).
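The order-preserving window approximation can be sketched as follows. This is a minimal illustration, not the TWS′ implementation: the window size of 5 and the whitespace tokenization are assumptions (the paper says only "a short distance").

```python
from collections import Counter

WINDOW = 5  # assumed window size; the paper specifies only "a short distance"

def count_ordered_pairs(sentences):
    """Approximate syntactic-tuple counts by ordered co-occurrence:
    (w1, w2) is counted when w1 precedes w2 within WINDOW words, so an
    adjective-noun relation is approximated only by adjective-first pairs."""
    counts = Counter()
    for sent in sentences:
        tokens = sent.lower().split()
        for i, w1 in enumerate(tokens):
            for w2 in tokens[i + 1 : i + 1 + WINDOW]:
                counts[(w1, w2)] += 1
    return counts

counts = count_ordered_pairs(["they signed the peace treaty today"])
print(counts[("peace", "treaty")])   # 1
print(counts[("treaty", "peace")])   # 0; the reversed order is not counted
```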
The TWS′ method still assumes that the source sentence to be translated is parsed, in order to identify the words that are syntactically related to an ambiguous word. This model is therefore relevant for translation systems that use a parser for the source language but may not have a robust target language parser available.

The corpus used for evaluating the TWS′ method consists of articles posted to the USENET news system. The articles were collected from newsgroups that discuss computer-related topics. The length of the corpus is 8,871,125 words (tokens), and the lexicon size (distinct types, at the string level) is 95,559. The text in this corpus is quite noisy, including short and incomplete sentences as well as much irrelevant information, such as person and device names.

The test set used for the experiment consists of 78 Hebrew sentences that were taken from a book about computers. These sentences were processed as described in Section 4, yielding a set of 269 ambiguous Hebrew words. The average number of alternative translations per ambiguous word in this set is 5.8, of which on average 1.35 are correct.

Out of the 269 ambiguous Hebrew words, for 96 the bound B did not pass the threshold, yielding an applicability of 64.3%. The remaining 173 examples were distributed according to Table 4. For the words that are covered by the TWS′ method, the Word Frequencies method has a precision of 71.1% (123/173), whereas the TWS′ method has a precision of 85.5% (148/173). As can be seen in the table, the TWS′ method is correct in almost all the cases in which it disagrees with the Word Frequencies method (28 out of 31). The applicability and precision figures in this experiment are somewhat lower than those achieved for the Hebrew set in our original evaluation of the TWS method (Table 2). We attribute this to the fact that the original results were achieved using a parsed corpus, which was about 2.5 times larger and of much higher quality than the one used in the second experiment. Yet, the new results lend additional support to the usefulness of the TWS method, even for the noisy data provided by a low-quality corpus, without any parsing or tagging.[11]

[11] It should be mentioned that the work of Dagan, Marcus, and Markovitch (1993) includes further results, evaluating an enhancement of the TWS method using their similarity-based estimation method. This enhancement is beyond the scope of the current paper and is referred to in the next section.

6. Analysis and Possible Enhancements

In this section we give a detailed analysis of the selections performed by the algorithm and, in particular, analyze the cases in which it failed. The analysis of these failure modes suggests possible improvements of the model and indicates its limitations. As described earlier, the algorithm's failures include both the cases for which the method was not applicable (no selection) and the cases for which it made an incorrect selection. The following paragraphs list various reasons for both types. At the end of the section, we discuss the possibility of adapting our approach to monolingual applications.

6.1 Correct Selection

In the cases that were treated correctly by our method, such as the examples given in the previous sections, the statistics succeeded in capturing two major types of disambiguating data. In preferring 'sign-treaty' over 'seal-treaty' (in Example 1), the statistics reflect the relevant semantic constraint.
In preferring 'peace-treaty' over 'peace-contract,' the statistics reflect the lexical usage of 'treaty' in English, which differs from that of 'contract.'

6.2 Inapplicability

In one of our examples, for instance, neither of the alternative relations, 'an investigator of corruption' (the correct one) or 'a researcher of corruption' (the incorrect one), was observed in the parsed corpus. In this case it would be possible to perform the correct selection if we used only statistics about the co-occurrence of 'corruption' with either 'investigator' or 'researcher' in the same local context, without requiring any syntactic relation. Statistics on the co-occurrence of words in a local context have recently been used for monolingual word sense disambiguation (Gale, Church, and Yarowsky 1992b, 1993; Schütze 1992, 1993); see Section 7 for more details, and Church and Hanks (1990) and Smadja (1993) for other applications of these statistics. It is possible to apply these methods using statistics of the target language and thus incorporate them within the framework proposed here for target word selection. Finding an optimal way of combining the different methods is a subject for further research. Our intuition, though, as well as some of our initial data, suggests that statistics on word co-occurrence in the local context can substantially increase the applicability of the selection method.

Another way to deal with the lack of statistical data for the specific words in question is to use statistics about similar words. This is the basis for Sadler's Analogical Semantics (Sadler 1989), which according to his report has not proved effective. His results may be improved if more sophisticated methods and larger corpora are used to establish similarity between words (as in Hindle 1990). In particular, an enhancement of our disambiguation method, using similarity-based estimation (Dagan, Marcus, and Markovitch 1993), was evaluated recently. In this evaluation the applicability of the disambiguation method was increased by 15%, with only a slight decrease in precision. The increased applicability was achieved by disambiguating additional cases in which statistical data were not available for any of the alternative tuples, whereas data were available for other tuples containing similar words. A back-off combining these ideas is sketched below.
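A minimal sketch of such a back-off, assuming precomputed count tables: try the parsed syntactic tuple first, then unordered local co-occurrence, then tuples with similar words. The function signature, the table layouts, and the order of the fallbacks are illustrative assumptions, not the authors' specification.

```python
def tuple_evidence(relation, w1, w2, syntactic_counts, window_counts, similar):
    """Back off from syntactic-tuple counts to looser sources of evidence:
    1. counts of the parsed relation itself,
    2. unordered co-occurrence of w1 and w2 within a local window,
    3. average counts of the relation with words similar to w1.
    Each table is assumed to be precomputed from the target corpus."""
    c = syntactic_counts.get((relation, w1, w2), 0)
    if c > 0:
        return c
    c = window_counts.get(frozenset((w1, w2)), 0)
    if c > 0:
        return c
    # similarity-based estimation: substitute words similar to w1
    subs = [syntactic_counts.get((relation, s, w2), 0)
            for s in similar.get(w1, [])]
    return sum(subs) / len(subs) if subs else 0
```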
In other cases, both alternatives were supported almost equally by the statistical data, thus preventing a selection. In such cases, both alternatives are valid at the independent level of the syntactic relation, but may be inappropriate for the specific context. For instance, the two alternatives 'to take a job' and 'to take a position' appeared in one of the examples, but since the general context concerned the position of a prime minister, only the latter was appropriate. To resolve such ambiguities, it may be useful to consider also co-occurrences of the ambiguous word with other words in the broader context (e.g., Gale, Church, and Yarowsky 1993; Yarowsky 1992). For instance, the word 'minister' seems to co-occur in the same context more frequently with 'position' than with 'job.'

In another example both alternatives were appropriate for the specific context as well. This happened with the German verb werfen, which may be translated (among other options) as 'throw,' 'cast,' or 'score.' In our example, werfen appeared in the context of 'to throw/cast light,' and these two correct alternatives had equal frequencies in the corpus ('score' was successfully eliminated). In such situations any selection between the alternatives is appropriate, and therefore any algorithm that handles conflicting data would work properly. However, it is difficult to decide automatically when both alternatives are acceptable and when only one of them is.

6.3 Incorrect Selection

6.3.1 Using an Inappropriate Relation. One of the examples contained the Hebrew word matzav. This word has several translations, two of which are 'state' and 'position.' The phrase that contained this word was 'to put an end to the {state|position} of war.' The ambiguous word is involved in two syntactic relations, being a complement of 'put' and also modified by 'war.' Judging by the corresponding corpus frequencies, the bound of the odds ratio (B) was higher for the first relation than for the second, and therefore this relation determined the translation as 'position.' However, the correct translation should be 'state,' as determined by the second relation.

These data suggest that when ordering the relations (or using any other weighting mechanism) it may be necessary to give different weights to different types of syntactic relations. For instance, it seems reasonable that the object of a noun should receive greater weight in selecting the noun's sense than the verb for which this noun serves as a complement. Such a weighted variant of the selection rule is sketched below.

Further examination of the example suggests another refinement of our method: it turns out that most of the 320 instances of the tuple (verb-comp: put position) include the preposition 'in,' as part of the common phrase 'put in a position.' Therefore, these instances should not be considered for the current example, which includes the preposition 'to.' However, the distinction between different prepositions was lost by our program, as a result of using equivalence classes of syntactic tuples (see Section 2.3). This suggests that we should not use an equivalence class when there is enough statistical data for the specific tuples.[12]
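A sketch of the weighting refinement: let the deciding relation be the one with the highest type-weighted bound rather than the highest raw bound. The weight values, the relation labels, and the example bounds are invented for illustration and do not come from the paper.

```python
# Assumed weights: favor noun-modification evidence over verb-complement evidence.
RELATION_WEIGHTS = {"noun-mod": 1.5, "verb-comp": 1.0}

def select_by_weighted_bound(relations):
    """relations: list of (relation_type, bound_B, preferred_translation).
    The unweighted rule picks the relation with the highest bound B; the
    refinement scales each bound by a weight for its relation type."""
    best = max(relations, key=lambda r: RELATION_WEIGHTS.get(r[0], 1.0) * r[1])
    return best[2]

# In the matzav example, weighting the 'war' modifier relation more heavily
# could let it override the 'put' complement relation and select 'state':
print(select_by_weighted_bound(
    [("verb-comp", 2.1, "position"), ("noun-mod", 1.8, "state")]
))  # 'state' (1.5 * 1.8 = 2.7 beats 1.0 * 2.1)
```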
6.3.2 A Borrowed Sense. Another incorrect selection involved the Hebrew adjective qatann, modifying the noun sikkuy, which means 'prospect' or 'chance.' The word qatann has several translations, two of which are 'small' and 'young.' In this Hebrew word combination, the correct sense of qatann is necessarily 'small.' However, the relation that was observed in the corpus was 'young prospect,' relating to the human sense of 'prospect,' which appeared in sports articles (a promising young person). This borrowed sense of 'prospect' is necessarily inappropriate, since in Hebrew it is represented by the equivalent of 'hope' (tiqwa) and not by sikkuy.

The source of this problem is Assumption 3: a target tuple T might be a translation of several source tuples, and while gathering statistics for T, we cannot distinguish between the different sources, since we use only a target language corpus.

A possible solution is to use an aligned bilingual corpus, as suggested by Sadler (1989), Brown et al. (1991), and Gale et al. (1992a). In such a corpus the occurrences of the relation 'young prospect' would be aligned to the corresponding occurrences of the Hebrew word tiqwa and would not be used when the Hebrew word sikkuy is involved (a sketch of this filtering appears below). Yet, it should be borne in mind that an aligned corpus is the result of manual translation, which can be viewed as including a manual tagging of the ambiguous words with their equivalent senses in the target language. This resource is much more expensive and less available than an untagged monolingual corpus, and it seems to be necessary only for relatively rare situations. Therefore, considering the trade-off between applicability and precision, it seems better to rely on a significantly larger monolingual corpus than on a smaller bilingual corpus. An optimal method may exploit both types of corpora, in which the somewhat more accurate, but more expensive, data of a bilingual corpus are augmented by the data of a much larger monolingual corpus.[13]

[13] Even a bilingual corpus of moderate size can be valuable when constructing a bilingual lexicon, thus justifying the effort of maintaining such a corpus.
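The alignment-based filtering just described can be sketched as follows; the occurrence records and field names are assumed simplifications of whatever an aligned corpus would actually provide.

```python
def count_tuple_for_source(target_tuple, occurrences, source_word):
    """Count occurrences of a target tuple, keeping only those aligned to a
    given source word.

    occurrences: list of dicts such as
      {"tuple": ("adj-noun", "young", "prospect"), "aligned_source": "tiqwa"}
    A plain monolingual count would include every occurrence; with alignment,
    'young prospect' aligned to tiqwa is skipped when gathering statistics
    for sikkuy, removing the borrowed sense from the counts."""
    return sum(
        1
        for occ in occurrences
        if occ["tuple"] == target_tuple and occ["aligned_source"] == source_word
    )
```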
6.3.3 Deep Understanding. Statistical sense disambiguation methods rely on large quantities of shallow information. Thus, they are doomed to fail when disambiguation can rely only on a deep understanding of the text and no other surface cues are available. This happened in one of the Hebrew examples, in which the two alternatives were either 'emigration law' or 'immigration law' (the Hebrew word hagira is used for both subsenses). While the context indicated that the first alternative is correct (emigration from the Soviet Union), the statistics (which were extracted from texts related to North America) preferred the second alternative. To translate the above phrase correctly, the program would need deep knowledge, to an extent that seems to far exceed the capabilities of current systems. Fortunately, our results suggest that such cases are quite rare.

6.4 Monolingual Applications

The results of our experiments in the context of machine translation suggest the utility of a similar mechanism even for word sense disambiguation within a single language. To select the right sense of a word in a broad-coverage application, it is useful to identify lexical relations between word senses. However, within corpora of a single language it is possible to identify automatically only relations at the word level, which are, of course, not useful for selecting word senses in that language. This is where other languages can supply the solution, exploiting the fact that the mapping between words and word senses varies significantly between different languages. For instance, the English words 'sign' and 'seal' (from Example 1 in the introduction) correspond to two distinct senses of the Hebrew word lahtom. These senses should be distinguished by most applications of Hebrew understanding programs. To make this distinction, it is possible to perform the same process that is performed for target word selection, by producing all the English alternatives for the lexical relations involving lahtom. The Hebrew sense that corresponds to the most plausible English lexical relations is then preferred. This process requires a bilingual lexicon that maps each Hebrew sense separately to its possible translations, in the manner of a Hebrew-Hebrew-English lexicon (analogous to the Oxford English-English-Hebrew dictionary of Hornby et al. [1986], which lists the senses of an English word along with the possible Hebrew translations for each of them).

In some cases, different senses of a Hebrew word map to the same word in English as well. In these cases, the lexical relations of each sense cannot be identified in an English corpus, and a third language is required to distinguish among these senses. Alternatively, it is possible to combine our method with other disambiguation methods that have been developed in a monolingual context (see the next section). As a long-term vision, one can imagine a multilingual corpus-based environment that exploits the differences between languages to facilitate the acquisition of knowledge about word senses.
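A sketch of the monolingual selection process just described: score each sense of a Hebrew word by the corpus plausibility of the English relations its translations form. The sense lexicon, the count table, the relation label, and all figures are invented for illustration.

```python
# Hypothetical sense lexicon: each sense of Hebrew lahtom maps separately
# to its English alternatives (a Hebrew-Hebrew-English style lexicon).
SENSE_LEXICON = {"lahtom-1": ["sign"], "lahtom-2": ["seal"]}

# Assumed English tuple counts gathered from a monolingual English corpus.
ENGLISH_COUNTS = {("verb-obj", "sign", "treaty"): 70,
                  ("verb-obj", "seal", "treaty"): 3}

def select_sense(senses, relation, related_word, counts):
    """Prefer the sense whose English alternatives form the most plausible
    lexical relation with the (translated) syntactically related word."""
    def score(sense):
        return max(counts.get((relation, t, related_word), 0)
                   for t in SENSE_LEXICON[sense])
    return max(senses, key=score)

print(select_sense(["lahtom-1", "lahtom-2"], "verb-obj", "treaty",
                   ENGLISH_COUNTS))  # lahtom-1, the 'sign' sense
```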
7. Comparative Analysis of Statistical Sense Disambiguation Methods

Until recently, word sense disambiguation seemed to be a problem for which there was no satisfactory solution for broad-coverage applications. Recently, several statistical methods have been developed for solving this problem, suggesting the possibility of robust, yet feasible, disambiguation. In this section we identify and analyze basic aspects of a statistical sense disambiguation method and compare several proposed methods (including ours) along these aspects.[14] This analysis may be useful for future research on sense disambiguation, as well as for the development of sense disambiguation modules in practical systems. The basic aspects that will be reviewed are:

* Information sources used by the disambiguation method.
* Acquisition of the required information from training texts.
* The computational decision model.
* Performance evaluation.

The first three aspects deal with the components of a disambiguation method as it would be implemented for a practical application. The fourth is a methodological issue, which is relevant for developing, testing, and comparing disambiguation methods.

[14] The reader is referred to some of these recent papers for thorough surveys of work on sense disambiguation (Hearst 1991; Gale, Church, and Yarowsky 1992a; Yarowsky 1992).

7.1 Information Sources

We identify three major types of information that have been used in statistical methods for sense disambiguation:

1. Words appearing in the local, syntactically related, context of the ambiguous word.
2. Words appearing in the global context of the ambiguous word.
3. Probabilistic syntactic and morphological characteristics of the ambiguous word.

The first type of information is the one used in the current paper, in which words that are syntactically related to an ambiguous word are used to indicate its most probable sense. Statistical data on the co-occurrence of syntactically related words with each of the alternative senses reflect semantic and lexical preferences and constraints of these senses. In addition, these statistics may provide information about the topics of discourse that are typical for each sense.

Ideally, the syntactic relations between words should be identified using a syntactic parser, in both the training and the disambiguation phases. Since robust syntactic parsers are not widely available, and those that exist are not always accurate, it is possible to use various approximations to identify relevant syntactic relations between words. Hearst (1991) uses a stochastic part-of-speech tagger and a simple scheme for partial parsing of short phrases; the structures produced by this analysis are used to identify approximate syntactic relations between words. Brown et al. (1991) make even weaker approximations, using only a stochastic part-of-speech tagger and defining relations such as "the first verb to the right" or "the first noun to the left." Finally, Dagan et al. (1993) (see Section 5.1) assume full parsing at the disambiguation phase, but no preprocessing at the training phase, in which a higher level of noise can be accommodated.

A second type of information is provided by words that occur in the global context of the ambiguous word (Gale, Church, and Yarowsky 1992b, 1993; Yarowsky 1992; Schütze 1992). Gale et al. and Yarowsky use words that appear within 50 words in each direction of the ambiguous word.[15] Statistical data are stored about the occurrence of words in the context of each sense and are matched against the context in the disambiguated sentence. Co-occurrence in the global context provides information about typical topics associated with each sense, in which a topic is represented by words that commonly occur in it. Schütze (1992, 1993) uses a variant of this type of information, in which context vectors are maintained for character four-grams instead of words. In addition, the context of an occurrence of an ambiguous word is represented by second-order co-occurrence information, as a set of context vectors (instead of a set of context words).

[15] The size of the context was determined experimentally, based on evaluations of different sizes of context. This optimization was performed for the Hansard corpus of the proceedings of the Canadian Parliament. In general, the size of the global context depends on the corpus and typically consists of a homogeneous unit of discourse.

Compared with co-occurrence within syntactic relations, information about the global context is less sensitive to fine semantic and lexical distinctions and is less useful when different senses of a word appear in similar contexts. On the other hand, the global context contains more words and is therefore more likely to provide enough disambiguating information in cases in which the distinction can be based on the topic of discourse.
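To make the two context notions concrete, here is a small sketch of feature extraction for both: a weak local informant in the style of Brown et al.'s "first noun to the left," and a global bag of words within 50 positions on each side. The tokenized, tagged input format is an assumption.

```python
def global_context(tokens, i, width=50):
    """Bag of words within `width` positions of the ambiguous word at index i
    (the window size reported for Gale et al. and Yarowsky)."""
    return set(tokens[max(0, i - width):i] + tokens[i + 1:i + 1 + width])

def first_noun_to_left(tagged, i):
    """A weak local-context informant in the style of Brown et al. (1991).
    tagged: list of (word, pos) pairs; returns the nearest noun before i."""
    for word, pos in reversed(tagged[:i]):
        if pos == "NOUN":
            return word
    return None

tagged = [("the", "DET"), ("peace", "NOUN"), ("treaty", "NOUN"),
          ("was", "VERB"), ("signed", "VERB")]
print(first_noun_to_left(tagged, 4))                      # 'treaty'
print(sorted(global_context([w for w, _ in tagged], 4)))  # context words
```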
From a general perspective, these two types of information represent a common trade-off in statistical language processing: the first type draws on a limited amount of deeper, more precise linguistic information, whereas the second type provides a large amount of shallow information, which can be applied in a more robust manner. The two sources of information seem to complement each other and may both be combined in future disambiguation methods.[16]

[16] See also Gale, Church, and Yarowsky 1992b (pp. 58-59) and Schütze 1992, 1993, for methods of reducing the number of parameters when using global contexts, and Dagan, Marcus, and Markovitch 1993 for increasing the applicability of the use of local context in cases in which there is no direct statistical evidence.

Hearst (1991) incorporates a third type of statistical information to distinguish between different senses of nouns (in addition to the first type discussed above). For each occurrence of a sense, several syntactic and morphological characteristics are recorded, such as whether the noun modifies or is modified by another word, whether it is capitalized, and whether it is related to certain prepositional phrases. Then, in the disambiguation phase, a best match is sought between the information recorded for each sense and the syntactic context of the current occurrence of the noun. This type of information resembles the information defined for lexical items in lexicalist approaches to grammar, such as the possible subcategorization frames of a word. The major difference is that Hearst captures probabilistic preferences of senses for such syntactic constructs. Grammatical formalisms, on the other hand, usually specify only which constructs are possible and at most distinguish between optional and obligatory ones. Therefore the information recorded in such grammars cannot distinguish between different senses of a word that potentially have the same subcategorization frames, even though in practice each sense might have different probabilistic preferences for different syntactic constructs.

It is clear that each of the different types of information provides some information that is not captured by the others. However, as the acquisition and manipulation of each type of information requires different tools and resources, it is important to assess the relative contribution, and the "cost effectiveness," of each of them. Such comparative evaluations are not yet available, not even for systems that incorporate several types of data (e.g., McRoy 1992). Further research is therefore needed to compare the relative importance of different information types and to find optimal ways of combining them.
7.2 Acquisition of Training Information

When training a statistical model for sense disambiguation, it is necessary to associate the acquired statistics with word senses. This seems to require manual tagging of the training corpus with the appropriate sense for each occurrence of an ambiguous word. A similar approach is used for stochastic part-of-speech taggers and probabilistic parsers, relying on the availability of large, manually tagged (or parsed) corpora for training. However, this approach is less feasible for sense disambiguation, for two reasons. First, the size of the corpora required to acquire sufficient statistics on lexical co-occurrence is usually much larger than that needed for acquiring statistics on syntactic constructs or sequences of parts of speech. Second, lexical co-occurrence patterns, as well as the definition of senses, may vary a great deal across different domains of discourse. Consequently, it is usually not sufficient to acquire the statistics from one widely available "balanced" corpus, as is common for syntactic applications. A sense disambiguation model should be trained on the same type of texts to which it will be applied, which further increases the cost of manual tagging.

The need to disambiguate a training corpus before acquiring a statistical model for disambiguation is often termed the circularity problem. In the following paragraphs we discuss different methods that have been proposed to overcome the circularity problem without exhaustive manual tagging of the training corpus. In our opinion, this is the most critical issue in developing feasible sense disambiguation methods.

7.2.1 Bootstrapping. Bootstrapping, a general scheme for reducing the amount of manual tagging, has been proposed also for sense disambiguation (Hearst 1991). The idea is to manually tag an initial set of occurrences for each sense in the lexicon, acquiring initial training statistics from these instances. Then, using these statistics, the system tries to disambiguate additional occurrences of ambiguous words. If such an occurrence can be disambiguated automatically with high confidence, the system acquires additional statistics from it, as if it had been tagged by hand. Hopefully, the system will incrementally acquire all the relevant statistics while demanding only a small amount of manual tagging. The results of Hearst (1991) show that at least 10 occurrences of each sense have to be tagged by hand, and in most cases 20-30 occurrences are required to achieve high precision. These results, which were obtained for a small set of preselected ambiguous words, suggest that the cost of the bootstrapping approach is still very high.
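The bootstrapping loop can be sketched generically as follows. The callbacks and the confidence threshold are placeholders; Hearst's actual model and confidence criterion are not specified here.

```python
def bootstrap(seed_tagged, untagged, train, classify, threshold=0.9):
    """Generic bootstrapping for sense tagging (after Hearst 1991).
    seed_tagged: hand-tagged (context, sense) pairs; untagged: raw contexts.
    train builds a model from tagged pairs; classify returns (sense, conf).
    High-confidence decisions are folded back in as if tagged by hand."""
    tagged = list(seed_tagged)
    pending = list(untagged)
    while pending:
        model = train(tagged)
        confident, rest = [], []
        for ctx in pending:
            sense, conf = classify(model, ctx)
            if conf >= threshold:
                confident.append((ctx, sense))
            else:
                rest.append(ctx)
        if not confident:  # nothing more can be tagged with confidence
            break
        tagged.extend(confident)
        pending = rest
    return train(tagged)
```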
7.2.2 Clustering. Schütze (1992, 1993) uses a method that can be viewed as an efficient form of manual tagging. Instead of presenting all occurrences of an ambiguous word to a human, these occurrences are first clustered using automatic clustering algorithms.[17] A human is then asked to assign one of the senses of the word to each cluster, by observing several members of the cluster. Each sense is thus represented by one or more clusters. At the disambiguation phase, a new occurrence of an ambiguous word is matched against the contexts that were recorded for these clusters, and the sense of the cluster that provides the best match is selected.

[17] Each occurrence is represented as a context vector, and the vectors are then clustered.

It is interesting to note that the number of occurrences that had to be observed by a human in the experiments of Schütze is of the same order as in the bootstrapping approach: 10-20 members of a cluster were observed, with an average of 2.8 clusters per sense. As both approaches were tested only on a small number of preselected words, further evaluation is necessary to predict the actual cost of their application to broad domains. The methods described below, on the other hand, rely on resources that are already available on a large scale, and it is therefore possible to estimate the expected cost of their broad application.

7.2.3 Word Classification. Yarowsky (1992) proposes a method that completely avoids manual tagging of the training corpus. This is achieved by estimating parameters for classes of words rather than for individual word senses. In his work, Yarowsky considered the semantic categories defined in Roget's Thesaurus as classes. He then mapped (manually) each of the senses of an ambiguous word to one or several of the categories under which this word is listed in the thesaurus. The task of sense disambiguation thus becomes the task of selecting the appropriate category for each occurrence of an ambiguous word.[18] When estimating the parameters of a category,[19] any occurrence of a word that belongs to that category is counted as an occurrence of the category. This means that each occurrence of an ambiguous word is counted as an occurrence of all the categories to which the word belongs, not just the category that corresponds to the specific occurrence. A substantial amount of noise is introduced by this training method, which is a consequence of the circularity problem. To avoid the noise, it would be necessary to tag each occurrence of an ambiguous word with the appropriate category. As explained by Yarowsky, however, this noise can usually be tolerated: the "correct" parameters of a certain class are acquired from all its occurrences, whereas the "incorrect" parameters are distributed across occurrences of many different classes and usually do not produce statistically significant patterns. To reduce the noise further, Yarowsky uses a system of weights that assigns lower weights to frequent words, since such words may introduce more noise.[20] The word class method thus overcomes the circularity problem by mapping word senses to classes of words. However, because of this mapping, the method cannot distinguish between senses that belong to the same class, and it also introduces some level of noise. The class-based counting is sketched below.

[18] In some cases, the Roget index was found to be incomplete, and a missing category had to be added to the list of possibilities for a word.

[19] Yarowsky uses statistics on occurrences of specific words in the global context of the category, but the method can be used to collect other types of statistics, such as the co-occurrence of the category with other categories.

[20] The method of acquiring parameters from ambiguous occurrences in a corpus, relying on the "spreading" of noise, can be used in many contexts. For example, it was used for acquiring statistics for disambiguating prepositional phrase attachments, counting ambiguous occurrences of prepositional phrases as representing both noun-pp and verb-pp constructs (Hindle and Rooth 1991).
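A sketch of the class-based counting that avoids sense tagging; the category map, the window size, and the example are invented, and real Roget categories are far richer.

```python
from collections import Counter

# Hypothetical word-to-category map in the spirit of Roget's Thesaurus.
CATEGORIES = {"crane": ["ANIMAL", "MACHINE"], "wings": ["ANIMAL"],
              "cable": ["MACHINE"]}

def train_class_counts(tokens, window=2):
    """Count context words for every category of every token occurrence.
    An ambiguous word contributes to *all* its categories; the resulting
    'incorrect' counts spread across many classes and rarely form
    statistically significant patterns."""
    counts = Counter()
    for i, word in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        for category in CATEGORIES.get(word, []):
            for c in context:
                counts[(category, c)] += 1
    return counts

counts = train_class_counts(["the", "crane", "spread", "its", "wings"])
print(counts[("ANIMAL", "spread")], counts[("MACHINE", "spread")])
# 2 1 -- ANIMAL also gets evidence from the unambiguous 'wings'
```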
7.2.4 A Bilingual Corpus. Brown et al. (1991) were concerned with sense disambiguation for machine translation. Having a large aligned bilingual corpus available, they noticed that the target word corresponding to an occurrence of an ambiguous source word can serve as a tag of the appropriate sense. This kind of tagging provides sense distinctions when different senses of a source word translate to different target words. For the purpose of translation, these are exactly the cases for which sense distinction is required. Conceptually, the use of a bilingual corpus does not eliminate (or reduce) the manual tagging of the training corpus: such a corpus is the result of manual translation, and it is the translator who provides the tagging of senses, as a side effect of the translation process. Practically, whenever a bilingual corpus is available, it provides a useful source of a sense-tagged corpus. Gale, Church, and Yarowsky (1992a) have also exploited this resource for obtaining large amounts of testing and training materials.

7.2.5 A Bilingual Lexicon and a Monolingual Corpus. The method of the current paper also exploits the fact that different senses of a word are usually mapped to different words in another language. However, our work shows that the differences between languages make it possible to avoid any form of manual tagging of the corpus (including translation). This is achieved by a bilingual lexicon that maps a source language word to all its possible equivalents in the target language. This approach has practical advantages for the purpose of machine translation, in which a bilingual lexicon needs to be constructed in any case and very large bilingual corpora are not usually available. From a theoretical point of view, the difference between the two methods can be made clear if we assume that the bilingual lexicon contains exactly all the different translations of a word that occur in a bilingual corpus. For a given set of senses that need to be disambiguated, our method requires a bilingual corpus of size k, in which each sense occurs at least once, in order to establish its mapping to a target word. In addition, a larger monolingual corpus, of size n, is required to provide enough training examples of typical contexts for each sense. Using a bilingual corpus for training the disambiguation model, on the other hand, would require a bilingual corpus of size n, which is significantly larger than k. The savings in resources is achieved because the mapping between the languages is done at the level of single words; the larger amount of information about word combinations is acquired from an untagged monolingual corpus after the mapping has been performed. Our results show that the precision of the selection algorithm is high despite the additional noise introduced by mapping single words independently of their context.
As mentioned in Section 6.3, an optimal method may combine the two approaches.

In some sense, the use of a bilingual lexicon resembles the use of a thesaurus in Yarowsky's approach. Both rely on a manually established mapping of senses to other concepts (classes of words, or words in another language) and collect information about the target concepts from an untagged corpus. In both cases, ambiguous words in the corpus introduce some level of noise: an occurrence of a word is counted as an occurrence of all the classes to which it belongs, or an occurrence of a target word is counted as an occurrence of all the source words to which it may correspond (a smaller amount of noise is introduced in the latter case, as a mapping to target words is much more finely grained than a mapping to Roget's categories). Also, both methods can distinguish only between senses that are distinguished by the mappings they use: either senses that belong to different classes, or senses that correspond to different target words. An interesting difference, though, relates to the feasibility of implementing the two methods for a new domain of texts (in particular, technical domains). The construction of a bilingual lexicon for a new domain is relatively straightforward and is often carried out for translation purposes. The construction of an appropriate classification for the words of a new domain is more complex, and furthermore, it is not clear whether it is possible in every domain to construct a classification that is sufficient for the purpose of sense disambiguation.

7.3 The Computational Decision Model

Sense disambiguation methods require a decision model that evaluates the relevant statistics. Sense disambiguation thus resembles many other decision tasks, and not surprisingly, several common decision algorithms have been employed in different works. These include a Bayesian classifier (Gale, Church, and Yarowsky 1993) and a distance metric between vectors (Schütze 1993), both inspired by methods in information retrieval; the flip-flop algorithm for ordering possible informants about the preferred sense, trying to maximize the mutual information between the informant and the ambiguous word (Brown et al. 1991); and the use of confidence intervals to establish the degree of confidence in a certain preference, combined with a constraint propagation algorithm (the current paper). At the present stage of research on sense disambiguation, it is difficult to judge whether a certain decision algorithm is significantly superior to the others.[21] Yet, these decision models can be characterized by several criteria, which clarify the similarities and differences between them. As will be explained below, many of the differences are correlated with the different information sources employed by these models.

[21] Once the important information sources for sense selection have been identified, it is possible that different decision algorithms would achieve comparable results.
* Combining several informants: The methods described by Brown et al. (1991) and in the current paper combine several informants (i.e., statistics about several context words) by choosing the informant that seems most indicative for the selection; the effect of the other, less significant, informants is then discarded. The Bayesian classifier and the vector distance metric, by contrast, combine all informants simultaneously, in a multiplicative or additive manner, possibly assigning a certain weight to each informant (see the sketch after this list).

* Reducing the number of parameters: Since sense disambiguation relies on statistics about lexical co-occurrence, the number of relevant parameters is very high, especially when co-occurrence in the global context is considered. For this reason, Schütze uses two compaction methods: first, 5,000 "informative" four-grams are used instead of words; second, the 5,000 dimensions are decomposed into 97 dimensions, using singular value decomposition. This reduces the number of parameters significantly, but has the disadvantage that it is impossible to trace the meaning of the entries in the resulting vectors or to associate them directly with the original co-occurrence statistics. Gale, Church, and Yarowsky (1992b, pp. 58-59) propose another approach and reduce the number of parameters by selecting the most informative context words for each sense. The selection of context words is based on a theoretically motivated criterion, borrowed from Mosteller and Wallace (1964, pp. 55-56). Finally, Yarowsky's method further reduces the number of parameters, as it records co-occurrences between individual words and word classes.

* Statistical significance of the selection: In the current paper, we use confidence intervals to test whether the statistical preference for a certain sense is significant. In a simple multiplicative preference score, on the other hand, it is not possible to distinguish whether preferences rely on small or large counts. The method of Gale et al. remedies this problem indirectly (in most cases) by introducing a sophisticated interpolation between the actual counts of the co-occurrence parameters and the frequency counts of the individual words (see Gale, Church, and Yarowsky 1993, for details). In Schütze's method it is not possible to trace the statistical significance of the parameters, since they are the result of extensive processing and compaction of the original statistical data.

* Resolving all ambiguities simultaneously: In the current paper, the selection of a sense for one word affects the selection for another word through a constraint propagation algorithm. This property is absent from most other methods.
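The first contrast can be made concrete with a small sketch: a single-best-informant rule against a naive-Bayes-style combination of all informants. The informant representation, the reliability scores, and the example numbers are invented for illustration.

```python
import math

def best_informant_choice(informants):
    """Single-best-informant decision (the style of Brown et al. and the
    current paper): each informant is (sense_scores, reliability); the most
    reliable informant alone determines the sense."""
    scores, _ = max(informants, key=lambda inf: inf[1])
    return max(scores, key=scores.get)

def bayes_style_choice(informants, senses):
    """Bayesian-classifier style: multiply (sum the logs of) the evidence
    of all informants for each sense instead of discarding the weaker ones."""
    totals = {s: sum(math.log(scores.get(s, 1e-6))
                     for scores, _ in informants) for s in senses}
    return max(totals, key=totals.get)

informants = [({"sign": 0.8, "seal": 0.2}, 2.5),  # strong, reliable informant
              ({"sign": 0.4, "seal": 0.6}, 1.1)]  # weaker, conflicting one
print(best_informant_choice(informants))                 # 'sign'
print(bayes_style_choice(informants, ["sign", "seal"]))  # 'sign'
```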
The differences between the various disambiguation methods correlate with their different information sources, in particular whether they use the local or the global context. When the local context is used, only a few syntactically related informants may provide reliable information about the selection. It is therefore reasonable to base the selection on only one informant, the most informative one, and it is also important to test the statistical significance of that informant. The problem of parameter explosion is less severe, and the number of parameters is comparable to that of a bi-gram language model (or even smaller). When using the global context, on the other hand, the number of potential parameters is significantly larger, but each of them is usually less informative. It is therefore important to take into account as many parameters as possible in each ambiguous case, but it is less important to test for detailed statistical significance or to worry about the mutual effects of sense selections for adjacent words.

7.4 Performance Evaluation

In most of the above-mentioned papers, experimental results are reported for a small set of up to 12 preselected words, usually with two or three senses per word. In the current paper we have evaluated our method using a random set of example sentences, with no a priori selection of the words. This standard evaluation method, which is commonly used for other natural language processing tasks, provides a direct prediction of the expected success rate of the method when employed in a practical application.

To compare results on different test data, it is useful to compare the precision of the disambiguation method with some a priori figure that reflects the degree of ambiguity in the text. Reporting the number of senses per example word corresponds to the expected success rate of random selection. A more informative figure is the success rate of a naive method that always selects the most frequent sense (the Word Frequencies method in our evaluations). The success rate of this naive method is higher than that of random selection and thus provides a tighter lower bound for the desired precision of a proposed disambiguation method.

An important practical issue in evaluation is how to obtain test examples tagged with the correct sense. In most papers (including ours) the tagging of the test data was done by hand, which limits the size of the testing set. Preparing one test set by hand may still be reasonable, though time consuming. However, it is useful to have more than one set, such that results are reported on a new, unseen set, while another set is used for developing and tuning the system. One useful source of tagged examples is an aligned bilingual corpus, which can be used for testing any sense disambiguation method, including methods that do not use bilingual material for training. Gale proposes using "pseudo-words" as another practical source of testing examples (Gale, Church, and Yarowsky 1992b); equivalently, Schütze (1992) uses "artificial ambiguous words." Pseudo-words are constructed artificially as a union of several different words (say, w1, w2, and w3 define three "senses" of the pseudo-word x). The disambiguation method is presented with texts in which all occurrences of w1, w2, and w3 are considered occurrences of x, and it should then select the original word (sense) for each occurrence. Though testing with this method does not provide results for the real ambiguities that occur in the text, it can be very useful while developing and tuning the method (Gale shows a high correlation between the performance of his method on real sense ambiguities and on pseudo-words).
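Pseudo-word test data are easy to construct, which is what makes them attractive for tuning. A minimal sketch, with arbitrary member words chosen for illustration:

```python
def make_pseudoword_testset(tokens, members=("bank", "river", "doctor"),
                            pseudo="PSEUDO"):
    """Build pseudo-word test data in the style of Gale, Church, and
    Yarowsky (1992b): every occurrence of a member word is replaced by the
    pseudo-word, and the original word is kept as the gold 'sense' label."""
    corpus, gold = [], []
    for i, tok in enumerate(tokens):
        if tok in members:
            corpus.append(pseudo)
            gold.append((i, tok))  # position and true underlying word
        else:
            corpus.append(tok)
    return corpus, gold

def pseudoword_accuracy(disambiguate, corpus, gold):
    """Score a disambiguator mapping (corpus, position) -> member word."""
    correct = sum(disambiguate(corpus, i) == w for i, w in gold)
    return correct / len(gold) if gold else 0.0
```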