<?xml version="1.0" standalone="yes"?>
<Paper uid="J94-4003">
  <Title>Word Sense Disambiguation Using a Second Language Monolingual Corpus</Title>
  <Section position="10" start_page="591" end_page="594" type="concl">
    <SectionTitle>
8. Conclusions
</SectionTitle>
    <Paragraph position="0"> The method presented in this paper takes advantage of two linguistic phenomena, both proven to be very useful for sense disambiguation: the different mapping between words and word senses among different languages, and the importance of lexical co-occurrence within syntactic relations. The first phenomenon provides the solution for the circularity problem in acquiring sense disambiguation data. Using a bilingual lexicon and a monolingual corpus of the target language, we can acquire statistics on word senses automatically, without manual tagging. As explained in Section 7, this method has significant practical and theoretical advantages over the use of aligned bilingual corpora. We pay for these advantages by introducing an additional level of noise, in mapping individual words independently to the other language. Our results show, however, that the precision of the selection algorithm is high despite this additional noise.</Paragraph>
    <Paragraph position="1"> This work also emphasizes the importance of lexical co-occurrence within syntactic relations for the resolution of lexical ambiguity. Co-occurrences found in a large corpus reflect a huge amount of semantic knowledge, which was traditionally constructed by hand. Moreover, frequency data for such co-occurrences reflect both linguistic and domain-specific preferences, thus indicating not only what is possible, but also what is probable. It is important to notice that frequency information on lexical co-occurrence was found to be much more predictive than single word frequency. In the three experiments we reported, there were 61 cases in which the two types of information contradicted each other, favoring different target words. In 56 of these cases (92%), it was the most frequent lexical co-occurrence, and not the most frequent word, that predicted the correct translation. This result may raise relevant hypotheses for psycholinguistic research, which has indicated the relevance of word frequencies to human sense disambiguation (e.g., Simpson and Burgess 1988).</Paragraph>
    <Paragraph position="2"> We suggest that the high precision achieved in the experiments relies on two characteristics of the ambiguity phenomena, namely the sparseness and redundancy of the disambiguating data. By sparseness we mean that within the large space of alternative interpretations produced by ambiguous utterances, only a small portion is commonly used. Therefore, the chance that an inappropriate interpretation is observed in the corpus (in other contexts) is low. Redundancy relates to the fact that different informants (such as different lexical relations or deep understanding) tend to support rather than contradict one another, and therefore the chance of picking a &amp;quot;wrong&amp;quot; informant is low.</Paragraph>
    <Paragraph position="3"> It is interesting to compare our method with some aspects of the statistical machine translation system of Brown et al. (1990). As mentioned in the introduction, this system also incorporates target language statistics in the translation process. To translate a French sentence, f, they choose the English sentence, e, that maximizes the term Pr(e) * Pr(f I e). The first factor in this product, which represents the target language model, may thus affect any aspect of the translation, including target word selection.</Paragraph>
    <Paragraph position="4"> It seems, however, that Brown et al. expect that target word selection would be determined mainly by translation probabilities (the second factor in the above term), which should be derived from a bilingual corpus (Brown et al. 1990, p. 79). This view is reflected also in their elaborate method for target word selection (Brown et al.</Paragraph>
    <Paragraph position="5"> 1991), in which better estimates of translation probabilities are achieved as a result of word sense disambiguation. Our method, on the other hand, incorporates only  Ido Dagan and Alon Itai Word Sense Disambiguation target language probabilities and ignores any notion of translation probabilities. It thus demonstrates a possible trade-off between these two types of probabilities: using more informative statistics of the target language may compensate for the lack of translation probabilities. For our system, the more informative statistics are achieved by syntactic analysis of both the source and target languages, instead of the simple tri-gram model used by Brown et al. In a broader sense, this can be viewed as a trade-off between the different components of a translation system: having better analysis and generation models may reduce some burden from the transfer model.</Paragraph>
    <Paragraph position="6"> In our opinion, the method proposed in this paper may have immediate practical value, beyond its theoretical aspects. As we argue below, we believe that the method is feasible for practical machine translation systems and can provide a cost-effective improvement on target word selection methods. The identification of syntactic relations in the source sentence is available in any machine translation system that uses some form of syntactic parsing. Trivially, a bilingual lexicon is available. A parser for the target language becomes common in many systems that offer bidirectional translation capabilities, requiring parsers for several languages (see Miller 1993, for available language pairs in several commercial machine translation systems). If a parser for the target language corpus is not available, it is possible to approximate the statistics using word co-occurrence in a window, as was demonstrated by a variant of our method (Dagan, Marcus, and Markovitch 1993) (see Section 5.1). In both cases, the statistical model was shown to handle successfully the noise produced in automatic acquisition of the data. Substantial effort may be required for collecting a sufficiently large target language corpus. We have not studied the relation between the corpus size and the performance of the algorithm, but it is our impression that a corpus of several hundred thousand words will prove useful for translation in a well-defined domain.</Paragraph>
    <Paragraph position="7"> With current availability of texts in electronic form, = a corpus of this size is feasible in many domains. The effort of assembling this corpus should be compared with the effort of manually coding sense disambiguation information. Finally, our method was evaluated by simulating realistic machine translation lexicons, on randomly selected examples, and yielded high performance in two different broad domains (foreign news articles and a software manual). It is therefore expected that the results reported here will be reproduced in other domains and systems.</Paragraph>
    <Paragraph position="8"> To improve the performance of target word selection further, our method may be combined with other sense disambiguation methods. As discussed in Section 6.2, it is possible to increase the applicability (coverage) of the selection method by considering word co-occurrence in a limited context and/or by using similarity-based methods that reduce the problem of data sparseness. To a lesser extent, the use of a bilingual corpus may further increase the precision of the selection (see Section 6.3). A practical strategy may be to use a bilingual corpus for enriching the bilingual lexicon, while relying mainly on co-occurrence statistics from a larger monolingual corpus for disambiguation.</Paragraph>
    <Paragraph position="9"> In a broader context, this paper promotes the combination of statistical and linguistic models in natural language processing. It provides an example of how a problem can be first defined in detailed linguistic terms, using an implemented linguistic tool (a syntactic parser, in our case). Then, having a well-defined linguistic scenario, we apply a suitable statistical model to highly informative linguistic structures. According to this view, a complex task, such as machine translation, should be first decomposed 22 Optical character recognition can also be used to acquire relevant texts in electronic form. In this case, it may be necessary to approximate the statistics using word co-occurrence in a window, since parsing noisy output from optical character recognition is difficult.</Paragraph>
    <Paragraph position="10">  Computational Linguistics Volume 20, Number 4 on a linguistic basis. Then, appropriate statistical models can be developed for each sub-problem. We believe that this approach provides a beneficial compromise between two extremes in natural language processing: either using linguistic models that ignore quantitative information, or using statistical models that are linguistically ignorant.</Paragraph>
    <Paragraph position="11"> Appendix Approximatingvar\[ln(~)l To approximate var \[In (~)\], we first approximate In (~)by the first order derivatives (the first term of the Taylor series): (\]91) ~__ ln( pI )__ \[~Xl (X1/\]~22 In ~ ~ '}- (Pl -- ,1 ) In pl ,p2 q-(\]92--P2) \[~-~21n(X~22)\]pl,p2 : ln(P~2) q- fil--p~lpl \]92--P2p2 : ln(P~2)q-\]91 --\]92&amp;quot;pl P2 (5) We use the following equations (see Agresti 1990):</Paragraph>
    <Paragraph position="13"> Ido Dagan and Alon Itai Word Sense Disambiguation</Paragraph>
  </Section>
class="xml-element"></Paper>