<?xml version="1.0" standalone="yes"?>
<Paper uid="J94-4003">
  <Title>Word Sense Disambiguation Using a Second Language Monolingual Corpus</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
AT&amp;T Bell Laboratories
</SectionTitle>
    <Paragraph position="0"> This paper presents a new approach for resolving lexical ambiguities in one language using statistical data from a monolingual corpus of another language. This approach exploits the differences between mappings of words to senses in different languages. The paper concentrates on the problem of target word selection in machine translation, for which the approach is directly applicable.</Paragraph>
    <Paragraph position="1"> The presented algorithm identifies syntactic relations between words, using a source language parser, and maps the alternative interpretations of these relations to the target language, using a bilingual lexicon. The preferred senses are then selected according to statistics on lexical relations in the target language. The selection is based on a statistical model and on a constraint propagation algorithm, which simultaneously handles all ambiguities in the sentence. The method was evaluated using three sets of Hebrew and German examples and was found to be very useful for disambiguation. The paper includes a detailed comparative analysis of statistical sense disambiguation methods.</Paragraph>
  </Section>
  <Section position="2" start_page="0" end_page="564" type="metho">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> The resolution of lexical ambiguities in nonrestricted text is one of the most difficult tasks of natural language processing. A related task in machine translation, on which we focus in this paper, is target word selection. This is the task of deciding which target language word is the most appropriate equivalent of a source language word in context. In addition to the alternatives introduced by the different word senses of the source language word, the target language may specify additional alternatives that differ mainly in their usage.</Paragraph>
    <Paragraph position="1"> Traditionally, several linguistic levels were used to deal with this problem: syntactic, semantic, and pragmatic. Computationally, the syntactic methods are the most affordable, but are of no avail in the frequent situation when the different senses of the word show the same syntactic behavior, having the same part of speech and even the same subcategorization frame. Substantial application of semantic or pragmatic knowledge about the word and its context requires compiling huge amounts of knowledge, the usefulness of which for practical applications in broad domains has not yet been proven (e.g., Lenat et al. 1990; Nirenburg et al. 1988; Chodorow, Byrd, and Heidron 1985). Moreover, such methods usually do not reflect word usages.</Paragraph>
    <Paragraph position="2"> Statistical approaches, which were popular several decades ago, have recently reawakened and were found to be useful for computational linguistics. Within this framework, a possible (though partial) alternative to using manually constructed * AT&amp;T Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974, USA. E-mail: dagan@research.att.com. The work reported here was done while the author was at the Technion--Israel Institute of Technology.</Paragraph>
    <Paragraph position="3"> t Department of Computer Science, Technion--Israel Institute of Technology, Haifa 32000, Israel. E-maih itai@cs.technion.ac.il.</Paragraph>
    <Paragraph position="4"> (~) 1994 Association for Computational Linguistics Computational Linguistics Volume 20, Number 4 knowledge can be found in the use of statistical data on the occurrence of lexical relations in large corpora (e.g., Grishman, Hirschman, and Nhan 1986). The use of such relations (mainly relations between verbs or nouns and their arguments and modifiers) for various purposes has received growing attention in recent research (Church and Hanks 1990; Zernik and Jacobs 1990; Hindle 1990; Smadja 1993). More specifically, two recent works have suggested using statistical data on lexical relations for resolving ambiguity of prepositional phrase attachment (Hindle and Rooth 1991) and pronoun references (Dagan and Itai 1990, 1991).</Paragraph>
    <Paragraph position="5"> Clearly, statistics on lexical relations can also be useful for target word selection. Consider, for example, the Hebrew sentence extracted from the foreign news section of the daily Ha-Aretz, September 1990 (transcripted to Latin letters): (1) Nose ze mana&amp;quot; mi-shtei ha-mdinot mi-lahtom &amp;quot;al hoze shalom.</Paragraph>
    <Paragraph position="6"> issue this prevented from-two the-countries from-signing on treaty peace \[ This sentence would translate into English as (2) This issue prevented the two countries from signing a peace treaty.</Paragraph>
    <Paragraph position="7"> The verb lahtom has four senses: 'sign,' 'seal,' 'finish,' and 'close.' The noun hoze means both 'contract' and 'treaty,' where the difference is mainly in usage rather than in the meaning (in Hebrew the word h.oze is used for both sub-senses).</Paragraph>
    <Paragraph position="8"> One possible solution is to consult a Hebrew corpus tagged with word senses, from which we would probably learn that the sense 'sign' of lahtom appears more frequently with hoze as its object than all the other senses. Thus we should prefer that sense. However, the size of corpora required to identify lexical relations in a broad domain is very large, and therefore it is usually not feasible to have such corpora manually tagged with word senses) The problem of choosing between 'treaty' and 'contract' cannot be solved using only information on Hebrew, because Hebrew does not distinguish between them.</Paragraph>
    <Paragraph position="9"> The solution suggested in this paper is to identify the lexical relations in corpora of the target language, instead of the source language. We consider word combinations and count how often they appear in the same syntactic relation as in the ambiguous sentence. For the above example, the noun compound 'peace treaty' appeared 49 times in our corpus (see Section 4.3 for details on our corpus), whereas the compound 'peace contract' did not appear at all; the verb-obj combination 'to sign a treaty' appeared 79 times, whereas none of the other three alternatives appeared more than twice. Thus, we first prefer 'treaty' to 'contract' because of the noun compound 'peace treaty' and then proceed to prefer 'sign' since it appears most frequently having the object 'treaty.' The order of selection is determined by a constraint propagation algorithm. In both cases, the correctly selected word is not the most frequent one: 'close' is more frequent in our corpus than 'sign' and 'contract' is more frequent than 'treaty.' Also, by using a model of statistical confidence, the algorithm avoids a decision in cases in which no alternative is significantly better than the others.</Paragraph>
    <Paragraph position="10"> Our approach can be analyzed from two different points of view. From that of monolingual sense disambiguation, we exploit the fact that the mapping between words and word senses varies significantly among different languages. This enables 1 Hearst (1991) suggests a sense disambiguation scheme along this line. See Section 7 for a comparison of several sense disambiguation methods.</Paragraph>
    <Paragraph position="11">  Ido Dagan and Alon Itai Word Sense Disambiguation US to map an ambiguous construct from one language to another, obtaining representations in which each sense corresponds to a distinct word. Now it is possible to collect co-occurrence statistics automatically from a corpus of the other language, without requiring manual tagging of senses. 2 From the point of view of machine translation, we suggest that some ambiguity problems are easier to solve at the level of the target language than the source language. The source language sentences are considered a noisy source for target language sentences, and our task is to devise a target language model that prefers the most reasonable translation. Machine translation is thus viewed in part as a recognition problem, and the statistical model we use specifically for target word selection may be compared with other language models in recognition tasks (e.g., Katz 1987; Jelinek 1990, for speech recognition). To a limited extent, this view is shared with the statistical machine translation system of Brown et al. (1990), which employs a target language n-gram model (see Section 8 for a comparison with this system). In contrast to this view, previous approaches in machine translation typically resolve examples like (1) by stating various constraints in terms of the source language (Nirenburg 1987). As explained above, such constraints cannot be acquired automatically and therefore are usually limited in their coverage.</Paragraph>
    <Paragraph position="12"> The experiments we conducted clearly show that statistics on lexical relations are very useful for disambiguation. Most notable is the result for the set of examples of Hebrew to English translation, which was picked randomly from foreign news sections in the Israeli press. For this set, the statistical model was applicable for 70% of the ambiguous words, and its selection was then correct for 91% of the cases. We cite also the results of a later experiment (Dagan, Marcus, and Markovitch 1993) that tested a weaker variant of our method on texts in the computer domain, achieving a precision of 85%. Both results significantly improve upon a naive method that uses only a priori word probabilities. These results are comparable to recent reports in the literature (see Section 7). It should be emphasized, though, that our results were achieved for a realistic simulation of a broad coverage machine translation system, on randomly selected examples. We therefore believe that our figures reflect the expected performance of the algorithm in a practical implementation. On the other hand, most other results relate to a small number of words and senses that were determined by the experimenters.</Paragraph>
    <Paragraph position="13"> Section 2 of the paper describes the linguistic model we use, employing a syntactic parser and a bilingual lexicon. Section 3 presents the statistical model, assuming a multinomial model for a single lexical relation and then using a constraint propagation algorithm to account simultaneously for all relations in the sentence. Section 4 describes the experimental Setting. Section 5 presents and analyzes the results of the experiment and cites additional results (Dagan, Marcus, and Markovitch 1993). In Section 6 we analyze the limitations of the algorithm in different cases and suggest enhancements to improve it. We also discuss the possibility of adopting the algorithm for monolingual applications. Finally, in Section 7 we present a comparative analysis of statistical sense disambiguation methods and then conclude in Section 8.</Paragraph>
  </Section>
  <Section position="3" start_page="564" end_page="565" type="metho">
    <SectionTitle>
2 A similar observation underlies the use of parallel bilingual corpora for sense disambiguation (Brown
</SectionTitle>
    <Paragraph position="0"> et al. 1991; Gale, Church, and Yarowsky 1992). As we explain in Section 7, these corpora are a form of a manually tagged corpus and are more difficult to obtain than monolingual corpora. Erroneously, the preliminary publication of our method (Dagan, Itai, and Schwall 1991) was cited several times as requiring a parallel bilingual corpus,</Paragraph>
  </Section>
  <Section position="4" start_page="565" end_page="569" type="metho">
    <SectionTitle>
2. The Linguistic Model
</SectionTitle>
    <Paragraph position="0"> Our approach is first to use a bilingual lexicon to find all possible translations of each lexically ambiguous word in the source sentence and then use statistical information gathered from target language corpora to choose the most appropriate alternative. To carry out this task we need the following linguistic tools, which are discussed in detail in the following sections: Section 2.1: Parsers for both the source language and the target language.</Paragraph>
    <Paragraph position="1"> These parsers should be capable of locating relevant syntactic relations, such as subj-verb, verb-obj, etc.</Paragraph>
    <Paragraph position="2"> Section 2.2: A bilingual lexicon that lists alternative translations for each source language word. If a word belongs to several syntactic categories, there should be a separate list for each one.</Paragraph>
    <Paragraph position="3"> Section 2.3: A procedure for mapping the source language syntactic relations to those of the target language.</Paragraph>
    <Paragraph position="4"> Such tools have been implemented within the framework of many computational linguistic theories. We have used McCord's implementation of Slot Grammars (McCord 1990, 1991). However, our method could have proceeded just as well using other linguistic models.</Paragraph>
    <Paragraph position="5"> The linguistic model will be illustrated by the following Hebrew example, taken from the Ha-Aretz daily newspaper from September, 1990 (transcripted to Latin letters): (3) Diplomatim svurim ki hitztarrfuto shell Hon Sun magdila diplomats believe that the joining of Hon Sun increases et ha-sikkuyim l-hassagat hitqaddmut ba-sihot.</Paragraph>
    <Paragraph position="6"> the-chances for-achieving progress in the-talks Here, the ambiguous words in translation to English are magdila, hitqaddmut, and sihot. To facilitate the reading, we give the translation of the sentence into English, and in each case of an ambiguous selection, all the alternatives are listed within curly brackets, the first alternative being the correct one.</Paragraph>
    <Paragraph position="7"> (4) Diplomats believe that the joining of Hon Sun {increases I enlarges I magnifies} the chances for achieving {progress I advance I advancement} in the {talks I conversations I calls}.</Paragraph>
    <Paragraph position="8"> The following subsections describe in detail the processing steps of the linguistic model. These include locating the ambiguous words and the relevant syntactic relations among them in the source language sentence, mapping these relations to alternative relations in the target language, and finally, counting occurrences of these alternatives in a target language corpus.</Paragraph>
    <Section position="1" start_page="565" end_page="566" type="sub_section">
      <SectionTitle>
2.1 Locating the Ambiguous Words in the Source Language
</SectionTitle>
      <Paragraph position="0"> Our model defines the different &amp;quot;senses&amp;quot; of a source word to be all its possible translations to the target language, as listed in a bilingual lexicon. Some translations can be eliminated by the syntactic environment of the word in the source language. For example, in the following two sentences the word 'consider' should be translated  Ido Dagan and Alon Itai Word Sense Disambiguation differently into Hebrew, owing to the different subcategorization frame in each case: (5) I consider him smart.</Paragraph>
      <Paragraph position="1"> (6) I consider going to Japan.</Paragraph>
      <Paragraph position="2">  In these examples, the different syntactic subcategorization frames determine two different translations to Hebrew (mah.shiv versus shoqel), thus eliminating some of the ambiguity. Such syntactic rules that allow us to resolve some of the ambiguities may be encoded in the lexicon (e.g., Golan, Lappin, and Rimon 1988). However, many ambiguities cannot be resolved on syntactic grounds. The purpose of this work is to resolve the remaining ambiguities using lexical co-occurrence preferences, obtained by statistical methods.</Paragraph>
    </Section>
    <Section position="2" start_page="566" end_page="567" type="sub_section">
      <SectionTitle>
2.2 Locating Syntactic Tuples in Source Language Sentences
</SectionTitle>
      <Paragraph position="0"> Our basic concept is the syntactic tuple, which denotes a syntactic relation between two or more words. It is denoted by the name of the syntactic relation followed by a sequence of words that satisfies the relation, appearing in their base form (without morphological inflections). For example (subj-verb: man walk) is a syntactic tuple, which occurs in the sentence 'The man walked home.' We assume that our parser (or an auxilliary program) can locate the syntactic relation corresponding to a given syntactic tuple in a sentence. The use of the base form of words is justified by the additional assumption that morphological inflections do not affect the probability of syntactic tuples. This assumption is not entirely accurate, but it has proven practically useful and reduces the number of distinct tuples.</Paragraph>
      <Paragraph position="1"> In our experience, the following syntactic relations proved useful for resolving ambiguities: * Relations between a verb and its subject, complements, and adjuncts, including direct and indirect objects, adverbs, and modifying prepositional phrases.</Paragraph>
      <Paragraph position="2"> * Relations between a noun and its complements and adjuncts, including adjectives, modifying nouns in noun compounds, and modifying prepositional phrases.</Paragraph>
      <Paragraph position="3"> * Relations between adjectives or adverbs and their modifiers.</Paragraph>
      <Paragraph position="4">  As mentioned earlier, the full list of syntactic relations depends on the syntactic theory of the parser. Our model is general and does not depend on any particular list. However, we have found some desired properties in defining the relevant syntactic relations. One such property is the use of deep, or canonical, relations, as was already identified by Grishman, Hirschman, and Nhan (1986). This property was directly available from the ESG parser (McCord 1990, 1991), which identifies the underlying syntactic function in constructs such as passives and relative clauses. We have also implemented an additional routine, which modified or filtered some of the relations received from the parser. This postprocessing routine dealt mainly with function words and prepositional phrases to get a set of more informative relations. For example, it combined the subject and complement of the verb 'be' (as in 'the man is happy') into a single relation. Likewise, a verb with its preposition and the head noun of a modifying prepositional phrase (as in sit on the chair) were also combined. The routine was designed to choose relations that impose considerable restrictions on the possible  Computational Linguistics Volume 20, Number 4 (or probable) syntactic tuples. On the other hand, these relations should not be too specific, to allow statistically meaningful samples.</Paragraph>
      <Paragraph position="5"> The first step in resolving an ambiguity is to find all the syntactic tuples containing the ambiguous words. For (3) we get the following syntactic tuples:  (subj-verb: hitztarrfut higdil) (verb-obj: higdil sikkuy) (verb-obj: hissig hitqaddmut) (noun-pp: hitqaddmut b- sih.a) (these tuples translate as joining-increase, increase-chance, achieve-progress, and progressin-talks). In using these tuples, we expect to capture lexical constraints that are imposed by syntactic relations.</Paragraph>
    </Section>
    <Section position="3" start_page="567" end_page="568" type="sub_section">
      <SectionTitle>
2.3 Mapping Syntactic Tuples to the Target Language
</SectionTitle>
      <Paragraph position="0"> The set of syntactic tuples in the source language sentence is reflected in its translation to the target language. As a syntactic tuple is defined by both its syntactic relation and the words that appear in it, we need to map both components to the target language.</Paragraph>
      <Paragraph position="1"> By definition, every ambiguous source language word maps to several target language words. We thus get several alternative target language tuples for each source language tuple that involves an ambiguous word. For example, for tuple 3 in (7) we obtain three alternatives, corresponding to the three different translations of the word hitqaddmut. For tuple 4 we obtain nine alternative target tuples, since each of the words hitqaddmut and siha maps to three different English words. The full mapping of the Hebrew tuples in (7) to English tuples appears in Table 1 (the rightmost column should be ignored for the moment). Each of the tuple sets (a-d) in this table denotes the alternatives for translating the corresponding Hebrew tuple.</Paragraph>
      <Paragraph position="2"> From a theoretical point of view, the mapping of syntactic relations is more problematic. There need not be a one-to-one mapping from source language relations to target language ones. In many cases the mapping depends on the words of the syntactic tuple, as seen in the following example of translating from German to English.</Paragraph>
      <Paragraph position="3"> (8) Der Tisch gefaellt mir.--I like the table.</Paragraph>
      <Paragraph position="4"> In this example the source language subject (Tisch) becomes the direct object (table) in the target, whereas the direct object (mir) in the source language becomes the subject  (I) in the target. Therefore, the German syntactic tuples (9) (subj-verb: Tisch gefaellt) (verb-obj: gefaellt mir) are mapped to the following English syntactic tuples (10) (verb-obj: like table)  (subj-verb: I like) (The Hebrew equivalent is similar to the German structure).</Paragraph>
      <Paragraph position="5"> In practice this is less of a problem. In most cases, the source language relation has a direct equivalent in the target language. In many other cases, transformation rules can be encoded, either in the lexicon (if they are word dependent) or as syntactic transformations. These rules are usually available in machine translation systems that  Ido Dagan and Alon Itai Word Sense Disambiguation Table 1 The alternative target syntactic tuples with their counts in the target language corpus Source Tuples Target Tuples Counts a. (subj-verb: hitztarrfut higdil) (subj-verb: joining increase) 0  (subj-verb: joining enlarge) 0 (subj-verb: joining magnify) 0 b.</Paragraph>
      <Paragraph position="6"> C.</Paragraph>
      <Paragraph position="7"> d.</Paragraph>
      <Paragraph position="8"> (verb-obj: higdil sikkuy) (verb-obj: hissig hitqaddmut) (noun-pp: hitqaddmut b- sih.a) (verb-obj: increase chance) 20 (verb-obj: enlarge chance) 0 (verb-obj: magnify chance) 0  (verb-obj: achieve progress) 29 (verb-obj: achieve advance) 5 (verb-obj: achieve advancement) 1 (noun-pp: progress in talk) 7 (noun-pp: progress in conversation) 0 (noun-pp: progress in call) 0 (noun-pp: advance in talk) 2 (noun-pp: advance in conversation) 0 (noun-pp: advance in call) 2 (noun-pp: advancement in talk) 0 (noun-pp: advancement in conversation) 0 (noun-pp: advancement in call) 0 use the transfer method, as this knowledge is required to generate target language structures.</Paragraph>
      <Paragraph position="9"> To facilitate further the mapping of syntactic relations and to avoid errors due to fine distinctions between them, we grouped related syntactic relations into a single &amp;quot;general class&amp;quot; and mapped this class to the target language. The important classes used were relations between a verb and its arguments and modifiers (counting as one class all objects, indirect objects, complements, and nouns in modifying prepositional phrases) and between a noun and its arguments and modifiers (counting as one class all modifying nouns in compounds and nouns in modifying prepositional phrases). The classification enables us to get more statistical data for each class, as it reduces the number of relations. The success of using this general level of syntactic relations indicates that even a rough mapping of source to target language relations is useful for the statistical model.</Paragraph>
    </Section>
    <Section position="4" start_page="568" end_page="569" type="sub_section">
      <SectionTitle>
2.4 Counting Syntactic Tuples in the Target Language Corpus
</SectionTitle>
      <Paragraph position="0"> We now wish to determine the plausibility of each alternative target word being the translation of an ambiguous source word. In our model, the plausibility of selecting a target word is determined by the plausibility of the tuples that are obtained from it. The plausibility of alternative target tuples is in turn determined by their relative frequency in the corpus.</Paragraph>
      <Paragraph position="1"> Target syntactic tuples are identified in the corpus similarly to source language tuples, i.e., by a target language parser and a companion routine as described in Section 2.1. The right column of Table 1 shows the counts obtained for the syntactic tuples of our example in the corpora we used. The table reveals that the tuples containing the correct target word ('talk,' 'progress,' and 'increase') are indeed more frequent.</Paragraph>
      <Paragraph position="2">  Computational Linguistics Volume 20, Number 4 However, we still need a decision algorithm to analyze the statistical significance of the data and choose the appropriate word accordingly.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="569" end_page="569" type="metho">
    <SectionTitle>
3. The Statistical Model
</SectionTitle>
    <Paragraph position="0"> As seen in the previous section, the linguistic model maps each source language syntactic tuple to several alternative target tuples, in which each alternative corresponds to a different selection of target words. We wish to select the most plausible target language word for each ambiguous source language word, basing our decision on the counts obtained from the target corpus, as illustrated in Table 1. To that end, we should define a selection algorithm whose outcome depends on all the syntactic tuples in the sentence. If the data obtained from the corpus do not substantially support any one of the alternatives, the algorithm should notify the translation system that it cannot reach a statistically meaningful decision.</Paragraph>
    <Paragraph position="1"> Our algorithm is based on a statistical model. However, we wish to point out that we do not see the statistical considerations, as expressed in the model, as fully reflecting the linguistic considerations (syntactic, semantic, or pragmatic) that determine the correct translation. The model reflects only part of the relevant data and in addition makes statistical assumptions that are only partially satisfied. Therefore, a statistically based model need not make the correct linguistic choices. The performance of the model can only be empirically evaluated, the statistical considerations serve only as heuristics. The role of the statistical considerations is therefore to guide us in constructing heuristics that make use of the linguistic data of the sample (the corpus). Our experience shows that the statistical methods are indeed very helpful in establishing and comparing useful decision criteria that reflect various linguistic considerations.</Paragraph>
    <Section position="1" start_page="569" end_page="569" type="sub_section">
      <SectionTitle>
3.1 The Probabilistic Model
</SectionTitle>
      <Paragraph position="0"> First we discuss decisions based on a single syntactic tuple (as when only one syntactic tuple in the sentence contains an ambiguous word). Denote the source language syntactic tuple T and let there be k alternative target tuples for T, denoted by T1,.. *, Tk.</Paragraph>
      <Paragraph position="1"> Let the counts obtained for the target tuples be nl,. *., nk. For notational convenience, we number the tuples by decreasing frequency, i.e., nl ~ y/2 ~ &amp;quot;'&amp;quot; ~ nk-Since our goal is to choose for T one of the target tuples Ti, we can consider T a discrete random variable with multinomial distribution, 3 whose possible values are T1,..., Tk. Let Pi be the probability of obtaining Ti, i.e., the probability that Ti is the correct translation for T. We estimate the probabilities Pi by the counts ni in the obvious way, using the maximum likelihood estimator (Agresti 1990, pp. 40-41). The estimator \]9i</Paragraph>
      <Paragraph position="3"> The precision of the estimator depends, of course, on the size of the counts in the computation. We will incorporate this consideration into the decision algorithm by using confidence intervals. 4</Paragraph>
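The estimator of Equation 1 is straightforward to compute; for instance (using the counts of tuple c in Table 1, with the helper name assumed):

    def mle(counts):
        # Maximum likelihood estimates p_i = n_i / sum_j n_j for one source tuple.
        total = float(sum(counts))
        return [n / total for n in counts]

    print([round(p, 2) for p in mle([29, 5, 1])])   # [0.83, 0.14, 0.03]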
    </Section>
  </Section>
  <Section position="6" start_page="569" end_page="575" type="metho">
    <SectionTitle>
3 A variable that can have one of a finite set of values, each of them having a fixed probability.
4 The maximum likelihood estimator is known to give poor estimates when small counts are involved,
</SectionTitle>
    <Paragraph position="0"> and there are several methods to improve it (see Church and Gale 1991, for a presentation and discussion of several methods). For our needs this is not necessary in most cases, since we are not going to use the estimate itself, but rather a confidence interval for the ratio between two estimations (see below).</Paragraph>
    <Paragraph position="1">  Ido Dagan and Alon Itai Word Sense Disambiguation We now have to establish the criterion for choosing the preferred target language syntactic tuple. The most reasonable assumption is to choose the tuple with the highest estimated probability, that is Tl--the tuple with the largest observed frequency. According to the model, the probability that T1 is the right choice is estimated as Pl. This criterion should be subject to the condition that the difference between the alternative probabilities is significant. For example, if/Yl = 0.51 and/52 = 0.49, the expected success rate in choosing T1 is approximately 0.5. To prevent the system from making a decision in such cases, we need to impose some conditions on the probabilities Pi. One possible such condition is that \]Jl exceeds a prespecified threshold (or, as we shall describe below, that the threshold requirement be applied to a confidence interval). According to the model, this requirement ensures that the success probability of every decision exceeds the threshold. Even though this method satisfies the probabilistic model, it is vulnerable to noise in the data, which often causes some relatively small counts to be larger than their true value in the sample. The noise is introduced in part by inaccuracies in the model and in part because of errors during the automatic collection of the statistical data. Consequently, the estimated value of Pl may be smaller than its true value, because other counts in Equation 1 are too large, thus, preventing Pl from passing the threshold.</Paragraph>
    <Paragraph position="2"> To deal with this problem, we have chosen another criterion for significance--the odds ratio. We choose the alternative T1 only if all the ratios r;2' exceed a prespecified threshold. Note that 15i/lfij -- ni/nj, and since nl _~ n2 _) ... ~_ nk, the ratio tYl/lY2 is less than or equal to all the other ratios. Therefore, it suffices to check the odds ratio only for ill/P2. This criterion is less sensitive to noise of the above-mentioned type than/)1, since it depends only on the two largest counts.</Paragraph>
    <Paragraph position="3"> 3.1.1 Underlying Assumptions. The use of a probabilistic model necessarily introduces several assumptions on the structure of the corresponding linguistic data. It is important to point out these assumptions, in order to be aware of possible inconsistencies between the model and the linguistic phenomena for which it is used.</Paragraph>
    <Paragraph position="4"> The first assumption is introduced by the use of a multinomial model, which presupposes the following: Assumption 1 The events Ti are mutually disjoint.</Paragraph>
    <Paragraph position="5"> This assumption is not entirely valid, since sometimes it is possible to translate a source language word to several target language words, such that all the translations are valid. For example, consider the Hebrew sentence (from the Ha-Aretz daily newspaper, November 27, 1990) whose English translation is (11) The resignation of Thatcher is not {related I connected} to the negotiations with Damascus.</Paragraph>
    <Paragraph position="6"> In this sentence (but not in others), the ambiguous word qshura can equally well be translated to either 'related' or 'connected.' In terms of the probabilistic model, the two corresponding events, i.e., the two alternative English tuples that contain these words, T1 -- (verb-comp: relate to negotiation) and T2 = (verb-comp: connect to negotiation) are  Computational Linguistics Volume 20, Number 4 both correct, thus the events T1 and T2 both occur (they are not disjoint). However, we have to make this assumption, since the counts we have, ni, from which we estimate the probabilities of the Ti values, count actual occurrences of single syntactic tuples. In other words, we count the number of times that each of Zl and T2 actually occur, not the number of times in which each of them could occur.</Paragraph>
    <Paragraph position="7"> Two additional assumptions are introduced by using counts of the occurrences of syntactic tuples of the target language in order to estimate the translation probabilities of source language tuples: Assumption 2 An occurrence of the source language syntactic tuple T can indeed be translated to one of Zl~...~ Tk.</Paragraph>
    <Paragraph position="8"> Assumption 3 Every occurrence of the target tuple Ti can be the translation of only the source tuple T. Assumption 2 is an assumption on the completeness of the linguistic model. It is rather reasonable and depends on the completeness of our bilingual lexicon: if the lexicon gives all possible translations of each ambiguous word, then this assumption will hold, since for each syntactic tuple T we will produce all possible translations3 Assumption 3, which may be viewed as a soundness assumption, does not always hold, since a target language word may be the translation of several source language words. Consider, for example, the Hebrew tuple T = (verb-obj: heh.ziq lul). Lul is ambiguous, meaning either a playpen or a chicken pen. Accordingly, T can be translated to either T1 = (verb-obj: hold playpen) or T2 = (verb-obj: hold pen). In the context of 'hold' the first translation is more likely, and we can therefore expect our model to prefer T1. However, this might not be the case because Assumption 3 is contradicted. 'Pen' can also be the translation of the Hebrew word &amp;quot;et (the writing instrument), and thus T2 can be the translation of another Hebrew tuple, T' = (verb~bj: heh.ziq 'et). This means that when translating T we are counting occurrences of T2 that correspond to both T and T', &amp;quot;misleading&amp;quot; the selection criterion. Section 6.3 illustrates another example in which the assumption is not valid, causing the algorithm to fail to select the correct translation.</Paragraph>
    <Paragraph position="9"> We must make this assumption since we use only a target language corpus, which is not related to any source language information. 6 Therefore, when seeing an occurrence of the target language word w, we do not know which source language word is appropriate in the current context. Consequently, we count its occurrence as a translation of all the source language words for which w is a possible translation. This implies that sometimes we use inaccurate data, which introduce noise into the statistical model (see Section 6.3 for a discussion of an alternative, but expensive, solution, using a bilingual corpus). As we shall see, even though the assumption does not always hold, in most cases this noise does not interfere with the decision algorithm.</Paragraph>
    <Paragraph position="10"> 5 The problem of constructing a bilingual lexicon that is as complete as possible is beyond the scope of this paper. A promising approach may be to use aligned bilingual corpora, especially for augmenting existing lexicons with domain-specific terminology (Brown et al. 1993; Dagan, Church, and Gale 1993). In any case, it seems that any translation system is limited by the completeness of its bilingual lexicon, which makes our assumption a reasonable one.</Paragraph>
    <Paragraph position="11">  6 As explained in the introduction, this is a very important advantage of our method over other methods that use bilingual corpora.</Paragraph>
    <Paragraph position="12">  Ido Dagan and Alon Itai Word Sense Disambiguation</Paragraph>
    <Section position="1" start_page="572" end_page="574" type="sub_section">
      <SectionTitle>
3.2 Statistical Significance of the Decision
</SectionTitle>
      <Paragraph position="0"> Another problem we should address is the statistical significance of the data--what confidence do we have that the data indeed reflect the phenomenon. If the decision is based on small counts, then the difference in the counts might be due to chance. For example, we should have more confidence in the odds ratio 151/152 = 3 when nl = 30 and //2 = 10 than when nl = 3 and n2 = 1. Consequently, we shall use a dynamic threshold for 151/152, which is large when the counts are small and decreases as the counts increase.</Paragraph>
      <Paragraph position="1"> A common method for determining the statistical significance of estimates is the use of confidence intervals. Rather than finding a confidence interval for 151/152, we will bound the log odds ratio, ln(151/152). Since the variance Of the log odds ratio is independent of the mean, it converges to the normal distribution faster than the odds ratio itself (Agresti 1990). We use a one-tailed interval, as we want only to decide whether ln(151/152) is greater than a specific threshold (i.e., we need only a lower bound for ln(151/152)). Using this method, for each desired error probability 0 &lt; ~ &lt; 1, we may determine a value B~ and state that with a probability of at least 1 - c~ the true value, ln(pl/p2), is greater than B~.</Paragraph>
      <Paragraph position="2"> The confidence interval of a random variable X with normal distribution is ZI-~, where ZI-~ is the confidence coefficient, which may be found in statistical tables, and var is the variance. In our case, the size of the confidence interval  be the right-hand side of Equation 2. The meaning of the inequality is that for every given pair nl~ n2 we know with confidence 1 - c~ that</Paragraph>
      <Paragraph position="4"> or in other words, B,~ is a lower bound for ln(pl/P2) with this confidence level.</Paragraph>
      <Paragraph position="5"> To obtain a decision criterion, we choose a threshold 0, for B~, and decide to</Paragraph>
      <Paragraph position="5"> If Equation 4 does not hold, the algorithm makes no decision. The meaning of this criterion is that only if we know with confidence of at least 1 - α that ln(p1/p2) &gt; θ will we select the most frequent tuple T1 as the appropriate one. In terms of statistical decision theory, we say that our null hypothesis is that ln(p1/p2) ≤ θ, and we will make a decision only if we can reject this hypothesis with confidence at least 1 - α.</Paragraph>
      <Paragraph position="6"> Note that we cannot compute B_α when one of the counts is zero. In this case we have used the common correction method of adding 0.5 to each of the counts (Agresti 1990, p. 249).7 We shall now demonstrate the use of the decision criterion. In the experiment we conducted we chose the parameters α = 0.1, for which Z_{1-α} = 1.282, and θ = 0.2.</Paragraph>
      <Paragraph position="7"> Thus, to choose T1 we require that with a confidence level of at least 90% the hypothesis should satisfy ln(p1/p2) &gt; 0.2 (i.e., p1/p2 ≥ e^0.2 = 1.22). For the alternative translations of tuple c in Table 1 we got n1 = 29 and n2 = 5. For these values B_α = 1.137. In this case Equation 4 is satisfied for θ = 0.2, and the algorithm selects the word 'progress' as the translation of the Hebrew word hitqaddmut.</Paragraph>
      <Paragraph position="8"> In another case we had to translate the Hebrew word ro'sh, which can be translated to either 'top' or 'head,' in the sentence whose translation is (12) Sihanuk stood at the {top | head} of a coalition of underground groups.</Paragraph>
      <Paragraph position="9"> The two alternative syntactic tuples were (a) (verb-pp: stand at head) 10 (b) (verb-pp: stand at top) 5. For n1 = 10 and n2 = 5, we get B_α = -0.009 (a negative value means that it is impossible to ensure with a 90% confidence level that p1 &gt; p2). Since B_α &lt; 0.2, the algorithm will refrain from making a decision in this case. This abstention reflects the fact that the difference between the counts is not statistically significant, and choosing the first alternative can be wrong in many of the cases (as seen in the five cases that were observed in the corpus).</Paragraph>
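The bound B_α is easy to reproduce; the following sketch (an added illustration, not the authors' code; the function name is ours) recovers the two values just discussed:

    import math

    def B(n1, n2, z=1.282):
        # Lower confidence bound B_alpha(n1, n2) for ln(p1/p2); z is Z_{1-alpha}
        # (1.282 for the one-tailed 90% confidence level used in the text).
        if n1 == 0 or n2 == 0:                 # 0.5 correction for zero counts
            n1, n2 = n1 + 0.5, n2 + 0.5
        return math.log(n1 / n2) - z * math.sqrt(1.0 / n1 + 1.0 / n2)

    print(round(B(29, 5), 3))   # 1.137  -> 'progress' is selected (tuple c)
    print(round(B(10, 5), 3))   # -0.009 -> no decision for {top | head}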
      <Paragraph position="10"> As mentioned above, our motivation was to find a criterion that depends on a dynamic threshold for p̂1/p̂2 (or, alternatively, n1/n2), so that the threshold will be higher when n1 and n2 are smaller. Our criterion indeed satisfies this requirement. If we substitute B_α in Equation 4, we get the following equivalent criterion:
ln(n1/n2) ≥ θ + Z_{1-α} √(1/n1 + 1/n2).    (5)</Paragraph>
      <Paragraph position="11"> The above inequality clarifies the roles of the two parameters, α and θ: θ specifies a lower bound on ln(n1/n2), which is independent of the sample size; α reflects the statistical significance. If α is decreased (i.e., we require more confidence), Z_{1-α} will increase, and therefore, the component dependent on the sample size will increase.</Paragraph>
      <Paragraph position="12"> Since this component is in inverse relation to n1 and n2, the penalty for decreasing α increases when the sample size decreases. From this analysis we can derive the criterion for choosing the parameters: if we wish to use small counts, then α should be small, and θ depends on the required ratio between n1 and n2. The optimal values of the parameters should be determined empirically and might depend on the corpora and parsers we use.</Paragraph>
      <Paragraph position="16">  Ido Dagan and Alon Itai Word Sense Disambiguation</Paragraph>
    </Section>
    <Section position="2" start_page="574" end_page="575" type="sub_section">
      <SectionTitle>
3.3 Sentences with Several Syntactic Relations
</SectionTitle>
      <Paragraph position="0"> In the previous section, we assumed that the source sentence contains only one ambiguous syntactic tuple. In general there may be several ambiguous words that appear in several tuples. We should take advantage of the occurrence patterns of all of the tuples to reach a decision. Since different relations may favor different translations for an ambiguous word, we should devise a strategy for selecting a consistent translation for all words in the sentence. We have used the following constraint propagation algorithm, which receives as input the list of all source tuples along with their alternative translations to target tuples:</Paragraph>
      <Paragraph position="2"> 1. Compute B_α for each source tuple. If the largest B_α is less than the threshold θ, then stop.</Paragraph>
      <Paragraph position="3"> 2. Let T be the source tuple for which B_α is maximal. Select the translation for the ambiguous words (or word) in T according to T1 (the most frequent target alternative for T). Remove T from the list of source tuples. Propagate the constraint: eliminate target tuples that are inconsistent with this decision. If some source tuples now become unambiguous, remove them from the list of source tuples.</Paragraph>
      <Paragraph position="4"> 3. Repeat this procedure for the remaining list of source tuples, until all ambiguities have been resolved or the maximal B_α is less than θ (a schematic sketch of this loop is given below).</Paragraph>
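The following sketch (a schematic illustration, not the original implementation; the data structures and names are assumptions) mirrors the constraint propagation loop just described: each source tuple carries its alternative target tuples as (word-assignment, count) pairs, the tuple with the largest B_α decides first, and inconsistent alternatives are then pruned.

    import math

    def B(n1, n2, z=1.282):
        # Lower confidence bound for ln(p1/p2), with the 0.5 zero-count correction.
        if n1 == 0 or n2 == 0:
            n1, n2 = n1 + 0.5, n2 + 0.5
        return math.log(n1 / n2) - z * math.sqrt(1.0 / n1 + 1.0 / n2)

    def select(source_tuples, theta=0.2):
        # source_tuples: list of alternative lists; each alternative is a pair
        # (assignment dict {source word: target word}, corpus count).
        chosen = {}
        pending = [list(alts) for alts in source_tuples]

        def bound(alts):
            counts = sorted((c for _, c in alts), reverse=True)
            return B(counts[0], counts[1] if len(counts) > 1 else 0)

        while pending:
            best = max(pending, key=bound)
            if not (bound(best) >= theta):     # no statistically significant choice left
                break
            chosen.update(max(best, key=lambda a: a[1])[0])   # commit the top alternative
            pending.remove(best)
            # Constraint propagation: drop alternatives inconsistent with the
            # decisions made so far, and drop tuples that are no longer ambiguous.
            pending = [[a for a in alts
                        if all(chosen.get(w, t) == t for w, t in a[0].items())]
                       for alts in pending]
            pending = [alts for alts in pending if len(alts) > 1]
        return chosen

    # Toy run with tuples c and d of Table 1 (they share the word hitqaddmut):
    c = [({"hitqaddmut": "progress"}, 29), ({"hitqaddmut": "advance"}, 5),
         ({"hitqaddmut": "advancement"}, 1)]
    d = [({"hitqaddmut": "progress", "siha": "talk"}, 7),
         ({"hitqaddmut": "progress", "siha": "conversation"}, 0),
         ({"hitqaddmut": "progress", "siha": "call"}, 0),
         ({"hitqaddmut": "advance", "siha": "talk"}, 2),
         ({"hitqaddmut": "advance", "siha": "call"}, 2)]
    print(select([c, d]))   # {'hitqaddmut': 'progress', 'siha': 'talk'}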
      <Paragraph position="5"> To illustrate the algorithm, we consider Table 1 using the parameters α = 0.1 and θ = 0.2. The largest value of B_α occurs for the tuple (verb-obj: higdil sikkuy), for which higdil can be translated to 'increase,' 'magnify,' or 'enlarge.' The first alternative appeared n1 = 20 times, and the other alternatives did not appear at all (n2 = n3 = 0). Adding the correction factor and computing B_α yields B_α(n1 + 0.5, n2 + 0.5) = B_α(20.5, 0.5) = 1.879 &gt; 0.2 = θ. Therefore, the word 'increase' was chosen as the translation of higdil. Since this word appears also in the tuple (subj-verb: hitztarrfut higdil), the target tuples that include alternative translations of higdil were deleted. Thus (13) (subj-verb: joining enlarge) (subj-verb: joining magnify) were deleted. This leaves us with only one alternative (subj-verb: joining increase) as a possible translation of this Hebrew tuple, which is therefore removed from the input list.</Paragraph>
      <Paragraph position="6"> We now recompute the values of B_α for the remaining tuples. The maximal value is obtained for the tuple (14) (verb-obj: hissig hitqaddmut), where B_α(29, 5) = 1.137 &gt; θ. We therefore choose the word 'progress' as a translation for hitqaddmut. Since this word, hitqaddmut, also appears in the tuple (noun-pp: hitqaddmut b- sih.a), we delete the six target tuples that are inconsistent with the selection of 'progress' (those containing the words 'advance' and 'advancement'). There now remain only three alternative target tuples for hitqaddmut b- sih.a.</Paragraph>
      <Paragraph position="7"> We now recompute the values of B_α. The maximum value is B_α(7.5, 0.5) = 0.836 &gt; θ (note that because tuples inconsistent with the previous decisions were eliminated, n2 dropped from 2 to 0, thus increasing B_α). Thus, 'talk' is selected as the translation of siha. Now all the ambiguities have been resolved and the procedure stops. In the above example all the ambiguities were resolved, since in each stage the value of B_α exceeded the threshold θ = 0.2. In some cases not all ambiguities are resolved, though the number of ambiguities may decrease.</Paragraph>
      <Paragraph position="8"> It should be noted that other methods may be proposed for combining the statistics of several syntactic relations. For example, it may make sense to multiply estimates of conditional probabilities of tuples in different relations, in a way that is analogous to n-gram language modeling (Jelinek, Mercer, and Roukos 1992). However, such an approach will make it harder to take into account the statistical significance of the estimate (a criterion that is missing in standard n-gram models). In our set of examples, the constraint propagation method proved to be successful and did not seem to introduce any errors. Further experimentation, on much larger data sets, is needed to determine which of the two methods (if any) is substantially superior to the other.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="575" end_page="577" type="metho">
    <SectionTitle>
4. The Experiment
</SectionTitle>
    <Paragraph position="0"> To evaluate the proposed disambiguation method, we implemented and tested the method on a random set of examples. The examples consisted of a set of Hebrew paragraphs and a set of German paragraphs. In both cases the target language was English. The Hebrew examples consisted of ten paragraphs picked at random from foreign news sections of the Israeli press. The paragraphs were selected from several news items and articles that appeared in several daily newspapers. The target language corpus consisted of American newspaper articles, and the Hansard corpus of the proceedings of the Canadian Parliament. The domain of foreign news articles was chosen to correspond to some of the topics that appear in the English corpus, s The German examples were chosen at random from the German press, without restricting the topic. 9 Since we did not have a translation system from Hebrew or German to English, we simulated the steps such a system would perform. Hence, the results we report measure the performance of just the target word selection module and not the performance of a complete translation system. The latter can be expected to be somewhat lower for a real system, depending on the performance of its other components. Note, however, that since the disambiguation module is highly immune to noise, it might be more useful in a real system: in such a system some of the alternatives would be totally erroneous. Since the corresponding syntactic tuples would typically not be found in the corpora, they would be eliminated by our module.</Paragraph>
    <Paragraph position="1"> The experiment is described in detail in the following subsections. It provides an example for a thorough evaluation that is carried out without having a complete system available. We specifically describe the processing of the Hebrew data, which was performed by a professional translator, supervised by the authors. The German examples were processed very similarly.</Paragraph>
    <Section position="1" start_page="575" end_page="576" type="sub_section">
      <SectionTitle>
4.1 Locating Ambiguous Words
</SectionTitle>
      <Paragraph position="0"> To locate ambiguous words, we simulated a bilingual lexicon and syntactic filters of a translation system. For every source language word, the translator searched all possible  Ido Dagan and Alon Itai Word Sense Disambiguation translations using a Hebrew-English dictionary (Alcalay 1990). The list of translations proposed by the dictionary was modified according to the following guidelines, to reflect better the lexicon of a practical translation system:</Paragraph>
      <Paragraph position="2"> Eliminate translations that would be ruled out for syntactic reasons, as explained in Section 2.1.</Paragraph>
      <Paragraph position="3"> Consider only content words, ignoring function words and proper nouns. Assume that multi-word terms, such as 'prime minister,' appear in the lexicon as complete terms. Thus we did not consider each of their constituents separately. Also, we did not consider source language words that should be translated to a multi-word target phrase.</Paragraph>
      <Paragraph position="4"> Eliminate rare and archaic translations that are not expected in the context of foreign affairs in the current press.</Paragraph>
      <Paragraph position="5"> The professional translator added translations that were missing in the dictionary.</Paragraph>
      <Paragraph position="6"> In addition, each of the remaining target alternatives for each source word was evaluated as to whether it is a suitable translation in the current context. This evaluation was later used to judge the selections of the algorithm. If all the alternatives were considered suitable, then the source word was eliminated from the test set, since any decision for it would have been considered successful.</Paragraph>
      <Paragraph position="7"> We ended up with 103 Hebrew and 54 German ambiguous words. For each Hebrew word we had an average of 3.27 alternative translations and an average of 1.44 correct translations. The average number of translations of a German word was 3.26, and there were 1.33 correct translations.</Paragraph>
    </Section>
    <Section position="2" start_page="576" end_page="576" type="sub_section">
      <SectionTitle>
4.2 Determining the Syntactic Tuples and Mapping Them to English
</SectionTitle>
      <Paragraph position="0"> Since we did not have a Hebrew parser, we have simulated the two steps of determining the source syntactic tuples and mapping them to English by reversing the order of these steps, in the following way: First, the sample sentences were translated manually, as literally as possible, into English. Then, the resulting English sentences were analyzed, using the ESG parser and the postprocessing routine (see Section 2.2), to identify the relevant syntactic tuples. The tuples were further classified into &amp;quot;general classes,&amp;quot; as described in Section 2.3. The use of these general classes, which was intended to facilitate the mapping of syntactic relations from one language to another, also facilitated our simulation method and caused it to produce realistic output.</Paragraph>
      <Paragraph position="1"> At the end of the procedure, we had, for each sample sentence, a data structure similar to Table 1 (without the counts).</Paragraph>
    </Section>
    <Section position="3" start_page="576" end_page="577" type="sub_section">
      <SectionTitle>
4.3 Acquiring the Statistical Data
</SectionTitle>
      <Paragraph position="0"> The statistical data were acquired from the following corpora:  Computational Linguistics Volume 20, Number 4 However, the effective size of the corpora was only about 25 million words, owing to two filtering criteria. First, we considered only sentences whose length did not exceed 25 words, since longer sentences required excessive parse time and contained many parsing errors. Second, even 35% of the shorter sentences failed to parse and had to be eliminated. The syntactic tuples were located by the ESG parser and the postprocessing routine mentioned earlier.</Paragraph>
      <Paragraph position="1"> For the purpose of evaluation, we gathered only the data required for the given test examples. Within a practical machine translation system, the disambiguation module would require a database containing all the syntactic tuples of the corpus with their frequency counts. In the current research project we did not have the computing resources necessary for constructing the complete database (the major cost being parsing time). However, such resources are not needed in order to evaluate the proposed method. Since we evaluated the method only on a relatively small number of random sentences, we first constructed the set of all &amp;quot;relevant&amp;quot; target tuples, i.e., tuples that should be considered for the test sentences. Then we scanned the entire corpus and extracted only sentences that contain both words from at least one of the relevant tuples. Only the extracted sentences were parsed, and their counts were recorded in our database. Even though this database is much smaller than the full database, for the ambiguous words of the test sentences, both databases provide the same information. Thus, the success rate for the test sentences is the same for both methods, while requiring a considerably smaller amount of resources at the research phase.</Paragraph>
      <Paragraph position="2"> The problem with this method is that for every set of sample sentences the entire corpus has to be scanned. Thus, a practical system would have to preprocess the corpus to construct a database of the entire corpus. Then, to resolve ambiguities, only this database need be consulted.</Paragraph>
      <Paragraph position="3"> After acquiring all the relevant data, the algorithm of Section 3.3 was executed for each of the test sentences.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>