<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1103"> <Title>Using a Probabilistic Translation Model for Cross-Language Information Retrieval</Title> <Section position="3" start_page="18" end_page="20" type="metho"> <SectionTitle> 2. A Probabilistic Translation Model </SectionTitle> <Paragraph position="0"> Any source language input e can usually be translated in a great many different ways.</Paragraph> <Paragraph position="1"> Machine translation systems are expected to select but one particular translation f for each input. In the current state of the art, unaided MT is generally unable to produce high-quality translations: human translators remain mostly unchallenged. Moreover, it has been shown repeatedly that human translators seldom find it practical to post-edit MT output: the machine has just made too many wrong or questionable decisions.</Paragraph> <Paragraph position="2"> If the goal is to help human translators, it is advisable to stop short of producing a full-blown automatic translation. There is no point in having the machine spontaneously propose a detailed target language syntactic structure unless there is at least a reasonably good chance that the translator will want to use it.</Paragraph> <Paragraph position="3"> Similarly, there is no point in having the machine select target language equivalents for all source language words unless most of these equivalents are likely to be retained by the translator.</Paragraph> <Paragraph position="4"> In recent years it has been shown that existing MT techniques can produce useful results when they are applied to tasks that amount to somewhat less than translation proper. In previous work, we have shown that probabilistic translation models such as those of Brown et al. [2] could be used as the key component of various translation support tools. 
Specifically, our work on the TransTalk project [1, 4] has established that such models could become instrumental in improving the process of automatically transcribing a spoken translation. And our ongoing work on the TransType project [5] indicates that models of the same kind can drive typing aids for translators.</Paragraph> <Paragraph position="5"> A key feature of such applications is that they do not expect the machine to volunteer a full-fledged translation on its own. Rather, the machine is only expected to restrict the range of possible translations so as to make it easier to guess what the intentions of the human translator are.</Paragraph> <Paragraph position="6"> For example, in certain incarnations of the TransTalk system, the translation model is used as a means of answering the following question: given a source language sentence e, what is the likelihood of observing the word $f_j$ in any target language sentence f that constitutes a valid translation of e? If e is an English sentence that contains the word &quot;horses&quot;, then the likelihood that &quot;chevaux&quot; (the most direct equivalent for &quot;horses&quot;) will appear in a French translation f of e is much greater than the a priori likelihood of observing &quot;chevaux&quot; in a random French sentence. In contrast, there is no reason to expect that the likelihood of observing &quot;cheveux&quot; (an acoustically close word that means &quot;hair&quot;) in f will be significantly altered: $p(\text{chevaux} \in f \mid \text{horses} \in e) > p(\text{chevaux} \in f)$, whereas $p(\text{cheveux} \in f \mid \text{horses} \in e) \approx p(\text{cheveux} \in f)$. TransTalk makes use of this fact to help resolve the acoustic ambiguity between the French words &quot;chevaux&quot; and &quot;cheveux&quot;. 
From the point of view of translation support, doing somewhat less than full-blown MT is likely to achieve more.</Paragraph> <Paragraph position="7"> In this paper, we want to argue that CLIR is facing a similar situation in that subjecting the source language query to a process that stops short of producing a full-blown target language query can result in good retrieval performance. For the purpose of CLIR, our goal is to obtain a set of words that are the best translations of an original query. This goal may be achieved by using probabilistic translation models of the kind used in our TransTalk and TransSearch.</Paragraph> <Paragraph position="8"> By translation model, we mean a mechanism which associates to each source language sentence (or query) e a probability distribution $p(f \mid e)$ on the sentences (or queries) f of the target language. A precise description of a family of such models can be found in Brown et al. [2]. The model we will be using for the experiments reported here is basically their &quot;Model 1&quot;. In this model, a source e and its translation f are connected through an alignment a, that is, a mapping of the words of e onto those of f. If $e = e_1, e_2, \ldots, e_l$ and $f = f_1, f_2, \ldots, f_m$, then $a_j$ will be used to refer to the particular position in e that is connected with position j in f (for example, $a_2 = 4$ expresses the fact that $f_2$ is connected with $e_4$) and $e_{a_j}$ will be used to refer to the word in e at position $a_j$. The probability $p(f \mid e)$ is decomposed as a sum over all possible alignments: $p(f \mid e) = \sum_{a} p(f, a \mid e)$</Paragraph> <Paragraph position="10"> The conditional probability of f under alignment a given e can be analysed as follows: $p(f, a \mid e) = p(f \mid a, e)\, p(a \mid e) = K_{e,f}\, p(f \mid a, e)$. The latter equality stems from the fact that in Model 1, all alignments are considered equiprobable (see below). 
Consequently $p(a \mid e)$ is a constant $K_{e,f}$ equal to 1 over the total number of alignments.</Paragraph> <Paragraph position="11"> The core of the model is $t(f_j \mid e_i)$, the lexical probability that some word $e_i$ is translated as word $f_j$. The value of $p(f \mid a, e)$ depends mostly on the product of the lexical probabilities of each word pair connected by the alignment: $p(f \mid a, e) = C_{f,e} \prod_{j=1}^{m} t(f_j \mid e_{a_j})$</Paragraph> <Paragraph position="13"> where $C_{f,e}$ is a constant that accounts for certain dependencies between the respective lengths of sentences e and f (mostly irrelevant here).</Paragraph> <Paragraph position="14"> The probability of observing word $f_j$ in f under a particular alignment a is:</Paragraph> <Paragraph position="16"> Since all alignments are considered equiprobable, we can simply sum up the values obtained by connecting $f_j$ to each word $e_1, \ldots, e_l$ of e. In other words, the probability of observing a particular word in a given position in f is established as the total of the lexical contributions of each word of e.</Paragraph> <Paragraph position="17"> The parameters of our translation model are estimated from a bilingual parallel corpus in which each sentence has been aligned with the corresponding sentence(s) of the other language. Such alignments can be produced using algorithms such as the one described in [10]. Given such alignments we can estimate reasonable values for the parameters $t(f_j \mid e_i)$ using the Expectation Maximization algorithm, as described in [2]. The model used in the experiments reported here has been trained using 8 years of the Canadian Hansard (parliamentary debates), that is, approximately 50 million words in each language.</Paragraph> <Paragraph position="18"> Obviously, a translation model in which all alignments are considered equiprobable, like Model 1, can only be a very coarse model.</Paragraph> <Paragraph position="19"> The lexical translation probabilities $t(f_j \mid e_i)$ are independent of the positions of $f_j$ and $e_i$. 
As a result, for any j, j', the model assigns the same value to $p(f_j \mid e)$ and to $p(f_{j'} \mid e)$. In other words, the model is completely blind to syntax. This means that it is much too weak to generate full-blown translations on its own. At the very least, one would need to use it in tandem with a language model p(f) capable of capturing some constraints on acceptable sequences of words in the target language.</Paragraph> <Paragraph position="20"> Notwithstanding its weaknesses, Model 1 does capture some non-trivial aspects of the translation relationship as we observe it across natural languages. For example, it is indeed a property of that model that a relatively unambiguous source language word (say, the English &quot;chimney&quot;) will reinforce its equivalents in a stronger way than a very ambiguous word. An ambiguous word like &quot;drug&quot; will reinforce each of its equivalents (&quot;médicament&quot; and &quot;drogue&quot;) according to a translation probability estimated from the training corpus. While the model only operates at the level of simple words (as opposed to complex terms), it should be observed that it nonetheless captures some non-trivial contextual effects. For example, if the training corpus contains many occurrences of the expression &quot;drug traffic&quot; translated as &quot;trafic de drogue&quot;, the presence of the English word &quot;traffic&quot; will thereafter tend to reinforce the French word &quot;drogue&quot; (in this instance, more than the French word &quot;médicament&quot;). And given the fact that the intended application is not MT but CLIR, the use of a &quot;weak&quot; translation model turns out to be, in some respects, sufficient. In our IR system queries and documents are represented as vectors of weighted terms. Given any query e, our translation model will calculate a value for $p(f_j \mid e)$, the probability of observing word $f_j$ in the translation of e. 
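The calculation just described can be illustrated with a minimal sketch. It assumes that the probability of a target word is the average, over the words of the query, of the lexical probabilities t(f | e_i); the averaging step, the function name word_translation_probs and the numbers in T_TABLE are illustrative assumptions, not values taken from the trained model.

```python
from collections import defaultdict


def word_translation_probs(query_words, t_table):
    """Average t(f | e_i) over the words e_i of the query (position-blind, as in Model 1)."""
    scores = defaultdict(float)
    for e_word in query_words:
        # Every query word contributes its lexical probabilities independently.
        for f_word, prob in t_table.get(e_word, {}).items():
            scores[f_word] += prob
    length = len(query_words)
    return {f: s / length for f, s in scores.items()} if length else {}


# Hypothetical lexical probabilities t(f | e); invented for illustration only.
T_TABLE = {
    "drug": {"médicament": 0.5, "drogue": 0.3},
    "traffic": {"trafic": 0.6, "drogue": 0.1, "circulation": 0.2},
}

probs = word_translation_probs(["drug", "traffic"], T_TABLE)
```

With these invented numbers, the presence of "traffic" raises the score of "drogue" relative to what "drug" alone would give, which is the kind of contextual reinforcement discussed above.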
It turns out to be straightforward to reinterpret this probability distribution as a vector of weighted terms.</Paragraph> </Section> <Section position="4" start_page="20" end_page="21" type="metho"> <SectionTitle> 3. Cross-Language Information Retrieval </SectionTitle> <Paragraph position="0"> After a brief description of the principal functions of an IR system, we report our experiments on CLIR.</Paragraph> <Paragraph position="1"> representation (for example, a vector) for each document. Before indexing, we apply the following preprocessing: - Morphological analysis: each word is transformed into a canonical, citation form. For example, nouns and (French) adjectives are transformed into their masculine singular form, and verbs are transformed into their infinitive forms. This neutralization of irrelevant differences in form often reduces retrieval silence.</Paragraph> <Paragraph position="2"> - Elimination of grammatical words: words that are more or less semantically empty are useless for IR. Such words are eliminated in order to reduce the size of the index and speed up the search process.</Paragraph> <Paragraph position="3"> For the indexing process, each document is represented as a set or a vector of weighted terms (words in canonical form). Term weight is determined by the following two factors: * tf (term frequency): the relative frequency of the term in the document; and * idf (inverse document frequency): a measure of the non-uniformity of the distribution of the term across the documents of the collection. The terms that rank best within a document d are those that are at the same time frequent within d and distributed unevenly in the collection of documents. The tf*idf weighting scheme combines these two criteria [3, 9]. 
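As an illustration of how the two criteria combine, here is a minimal sketch of a common textbook tf*idf weighting, with tf taken as the relative frequency of the term in the document and idf as log(N/n). This is an assumption made for the example and is not necessarily the exact variant used in the experiments; the function name tfidf_vectors and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenised document to a dict of term -> tf * log(N / n)."""
    n_docs = len(docs)
    # n: number of documents that contain each term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Toy collection of already lemmatised documents (invented for illustration).
DOCS = [
    ["trafic", "drogue", "drogue"],
    ["trafic", "routier"],
    ["trafic", "mariage"],
]
VECTORS = tfidf_vectors(DOCS)
```

In this toy collection "trafic" occurs in every document, so its idf factor is log(1) = 0 and it carries no weight, whereas "drogue" is both frequent in the first document and rare in the collection.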
To determine the weight $w_{t_i}$ of term $t_i$ in document d, we used the following variant of tf*idf:</Paragraph> <Paragraph position="5"> where $f(t_i, d)$ is the frequency of term $t_i$ in document d, N is the total number of documents in the collection, and n is the number of documents including $t_i$.</Paragraph> <Paragraph position="6"> The indexing process maps each document and query onto a vector of weights within the vector space of the indexes of the corpus. For example,</Paragraph> <Paragraph position="8"> where $w_{d_i}$ and $w_{q_i}$ are the weights of $t_i$ in document d and query q.</Paragraph> <Paragraph position="9"> The indexing process for queries is the same. Query matching involves measuring the degree of similarity sim(d, q) between the query vector q and each document vector d. In our case, sim(d, q) is calculated as follows:</Paragraph> <Paragraph position="11"> The IR system then produces a list of documents sorted by order of similarity with the query.</Paragraph> <Section position="1" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 3.2. Experiments </SectionTitle> <Paragraph position="0"> Our experiments are conducted on a French corpus used in TREC-6 (Text Retrieval Conference) [8]. The corpus contains a collection of articles from a Swiss newspaper</Paragraph> </Section> </Section> <Section position="5" start_page="21" end_page="26" type="metho"> <SectionTitle> - SDA (Schweizerische Depeschen Agentur) - </SectionTitle> <Paragraph position="0"> French edition, published between 1988 and 1990. There are 141,656 documents, for a total size of 87 megabytes. TREC-6 data includes 25 queries, each written in English, French and German versions. Manual evaluations for 22 of these have been made available by NIST (National Institute of Standards and Technology). Our evaluations are based on this data: the French documents and the French and English queries.</Paragraph> <Paragraph position="1"> We compared five different approaches: 1. 
Monolingual French query-French documents IR. This is not CLIR, but is used as a reference point with which CLIR performance is compared.</Paragraph> <Paragraph position="2"> In the other approaches, the English query is translated into a French query using various tools. The translated queries are then used to retrieve French documents in the same way as in monolingual IR. We tested the following translation approaches: 2. Using MT systems (two of them: LOGOS and SYSTRAN); 3. Using a bilingual dictionary only; 4. Using a probabilistic translation model; 5. Combining 3 and 4.</Paragraph> <Paragraph position="3"> Each approach is now described in detail.</Paragraph> <Section position="1" start_page="21" end_page="26" type="sub_section"> <SectionTitle> Monolingual IR </SectionTitle> <Paragraph position="0"> The classical vector space model described in Section 3.1 is used. System performance is assessed by a standard IR method: average precision over 11 points of recall. We use the term IR effectiveness to refer to this particular measure.</Paragraph> <Paragraph position="1"> In this monolingual task, our average precision for the 22 queries was 37.31%. At the TREC-6 conference, only 13 of the 25 queries had been evaluated manually by NIST. The best performance for monolingual IR was 45.68% for the 13 queries. For the same set of queries, we obtain a performance of 42.93%, slightly below that of the best system.</Paragraph> <Paragraph position="2"> LOGOS flags the words missing from its dictionary with a question mark. In the case of the first query, the missing word Waldheim will still be considered during indexing because there are French documents that happen to contain it (fortunately, proper names tend to be preserved intact in translations). In other cases, words that the MT system did not know will end up being ignored at indexing. 
For example, one of our queries contained the rare word &quot;reusage&quot; which none of our MT systems knew.</Paragraph> <Paragraph position="3"> As stated earlier, the (sometimes questionable) quality of translations with respect to syntactic structure has little effect on IR effectiveness. What is important is the choice of correct target language equivalents. Both LOGOS and SYSTRAN produced several instances of inappropriate choice. For example, one of our queries contained &quot;drug traffic&quot;; while SYSTRAN correctly translated this term as &quot;trafic de drogue&quot;, LOGOS incorrectly translated it as &quot;circulation de médicament&quot;. The same query contained the word &quot;stem&quot; used as a verb and SYSTRAN mistranslated it as the noun &quot;tige&quot; (&quot;tree stem&quot;). Such errors lead to retrieving irrelevant documents.</Paragraph> <Paragraph position="4"> Because MT systems choose a unique equivalent for each source language term, the resulting query sometimes misses documents containing different but related words. For example, the meaning of &quot;drug&quot; in the sense of Query 3 may be expressed as &quot;drogue&quot; or &quot;stupéfiant&quot; in French. If &quot;drug&quot; is translated only as &quot;drogue&quot;, documents that use &quot;stupéfiant&quot; cannot be retrieved. Despite these problems, the translations produced by LOGOS and SYSTRAN scored relatively high: an average precision of 28.66% with LOGOS and 27.63% with SYSTRAN. These results appear very good in comparison with comparable tests conducted in TREC-6 [6, 7]: typically, the average precision of this method was only about 1/2 to 2/3 as high as monolingual IR. At the TREC-6 conference, the best CLIR system for English-French IR achieved a performance of 24.35% for the 13 evaluated queries. For the same queries, we obtained 31.96% and 28.90% using LOGOS and SYSTRAN respectively. 
These performances are significantly better than those of the other systems presented at TREC-6.</Paragraph> <Paragraph position="5"> CLIR using a bilingual dictionary We obtained from the Ergane project a bilingual dictionary which contains 7898 citation forms in English. Each English word is translated into one or more French words.</Paragraph> <Paragraph position="6"> For example: drug: remède, médicament, drogue, stupéfiant.</Paragraph> <Paragraph position="7"> increase: accroître, agrandir, amplifier, augmenter, étendre, accroissement, grossir, s'accroître, redoubler, accroissement.</Paragraph> <Paragraph position="8"> We tested a very simple approach: each word of an English query was replaced by all the French equivalents listed in the dictionary. For the first 3 queries, this resulted in the following word lists: where ?waldheim and ?worldwide are unknown words. During indexing, the word worldwide will be ignored whereas waldheim will be indexed.</Paragraph> <Paragraph position="9"> From the above examples, we can observe the following facts: In some cases, our dictionary lookup only produces inappropriate translations. For example, the verb &quot;stem&quot; used in the third query is translated as a noun (tige, queue, tronc). In many other cases, inappropriate translations are given along with some correct ones. Thus, &quot;drug&quot; receives the correct equivalents drogue and stupéfiant, but also the inappropriate remède and médicament. 
On the one hand, in failing to choose between distinct meanings of a source language word (drogue vs. médicament) the dictionary method will produce additional retrieval noise; on the other hand, in refraining from arbitrarily selecting between target language synonyms (drogue, stupéfiant) the method performs a natural query expansion which will reduce retrieval silence.</Paragraph> <Paragraph position="10"> We also observe that the dictionary is not well distributed in the sense that less important words (from the IR point of view) may have more translations than more important ones. For example, in query 2, the word &quot;marriage&quot; has only one translation, whereas the word &quot;increase&quot; has 10 translations. As a consequence, documents containing a word meaning &quot;increase&quot; will have a higher chance of being retrieved than a document about &quot;marriage&quot;. Bilingual dictionaries do not seem to reflect the notion of importance that is relevant for IR.</Paragraph> <Paragraph position="11"> Our test queries contained few words that were missing from our dictionary, despite its limited size. No doubt, this is because the queries were mostly about general topics.</Paragraph> <Paragraph position="12"> Our dictionary-translated queries scored an average precision of 18.33%, that is, about 50% of our monolingual score.</Paragraph> <Paragraph position="13"> A variant of this approach consists in using a bilingual terminology database instead of a bilingual dictionary. In contrast with dictionaries, terminology databases tend to contain a lot of complex terms. Moreover, the terms are usually classified into domains.</Paragraph> <Paragraph position="14"> Consequently, one would expect terminology databases to provide a better basis on which to choose accurate indices for IR queries.</Paragraph> <Paragraph position="15"> We tested this approach using the &quot;Banque de Terminologie du Québec&quot; (Terminology database of Quebec - BTQ). 
This database contains over 500,000 terms in English and French, classified into about 160 domains.</Paragraph> <Paragraph position="16"> Most terms are highly specialized. Thus, the database is very rich in domain-specific information. On the other hand, words and expressions of everyday language are often missing. For example, in Query 1 &quot;Reasons for controversy surrounding Waldheim's World War II actions&quot;, only the following words are found in BTQ: surround, if, action. In addition, matched words are assigned very idiosyncratic meanings in different specialized domains. In Query 2 &quot;are marriages increasing worldwide?&quot;, none of the words is found. Replacing the original query with BTQ matches does not result in anything close to a reasonable translation. As a result, our average precision was only about 8%, a performance well below our dictionary approach. We conclude that a highly specialized terminology database such as BTQ is not appropriate for general CLIR.</Paragraph> <Paragraph position="17"> CLIR using a probabilistic translation model Query translation is performed as follows. An English query e is submitted to the probabilistic model as a single sentence so as to calculate $p(f_j \mid e)$, the probability that word $f_j$ will occur in any translation f of e. Since $f_j$ ranges over a very large vocabulary (all the French words observed in our training corpus), we want to retain only the best scoring words. This is because: 1) The longer the word list, the longer the retrieval process takes. So a restriction in length leads to an increase in retrieval speed.</Paragraph> <Paragraph position="18"> 2) As the translation model is not perfect, the list is sometimes noisy. This is especially true when the source language query contains words whose frequency was low in our training corpus: probability estimations are then notoriously unreliable. 
By limiting the resulting list to an appropriate length, the amount of noise may be reduced.</Paragraph> <Paragraph position="19"> Thus, our &quot;translation&quot; of a query e will be simply made up of the n words $f_j$ for which $p(f_j \mid e)$ is highest. We will experiment with several values of n in order to assess how this parameter affects IR effectiveness.</Paragraph> <Paragraph position="20"> The following lists show the first 20 words in the translations of the first 3 queries of our test corpus and their probabilities.</Paragraph> <Paragraph position="21"> Punctuation symbols are treated as ordinary words because we did not remove them from consideration in our training. This has little impact because they are ignored during query indexing. We plan to remove them altogether in our future experiments.</Paragraph> <Paragraph position="22"> Some interesting facts may be observed in these lists: 1) The word translations obtained reflect the peculiarities of our training corpus. For example, the word &quot;drug&quot; is translated by, among others, &quot;médicament&quot; and &quot;drogue&quot;, and a higher probability is attributed to &quot;médicament&quot;. This is because in the Hansard corpus, the English &quot;drug&quot; refers more often to the sense &quot;médicament&quot; than to &quot;drogue&quot;. 2) This dependence on the training corpus sometimes leads to odd translations. For example, the word &quot;bille&quot; is considered as a French translation of &quot;logging&quot; in the English query &quot;effects of logging on desertification&quot;. This translation comes from the fact that in the Hansard corpus &quot;log&quot; in English is often translated as &quot;bille de bois&quot; in French. 3) Some words are rare or even absent in our training corpus, and this leads to unreliable translations. For example, there was only one occurrence of &quot;acupuncture&quot; in the training corpus. 
Because of that, the model fails to assign a higher probability to the French &quot;acuponcture&quot; than to other semantically unrelated words that appeared in the same sentence.</Paragraph> <Paragraph position="23"> 4) The model sometimes fails to distinguish the real translation from noise induced by simple statistical associations. For example, the word &quot;prendre&quot; appears in the translations of queries 1 and 3. It is assigned even higher probabilities than the true translation words of the query such as &quot;second&quot;, &quot;action&quot; and &quot;stupéfiant&quot;. Statistics alone may prove insufficient for tackling this problem correctly.</Paragraph> <Paragraph position="24"> Despite these problems, we observe that real translations and associated words tend to score relatively high and appear at the top of the list. When the probabilities are incorporated into the query vector used to retrieve documents, the documents containing these words will be retrieved in priority.</Paragraph> <Paragraph position="25"> What use should we make of the probabilities that our translation model associates to each word? Should we use them directly as the weights appearing in our query vector? Should we rather combine them with other information? Notice that the probabilities assigned by the translation model are related to the tf (term frequency) criterion of IR: our definition of $p(f_j \mid e)$ is such that each individual occurrence of a word $e_i$ in the query e will reinforce the $f_j$'s that are likely translations for $e_i$.</Paragraph> <Paragraph position="26"> However, our translation model has little to say about the other criterion that is so important in IR: idf (inverse document frequency). 
One possible way to derive a tf*idf-like weighting is to use the following transformed weight in the query vector: $w_q = p(f_j \mid e) \cdot \log(N/n)$, where $p(f_j \mid e)$ is the probability obtained by the probabilistic translation model, and $\log(N/n)$ represents the idf criterion as described in Section 3.1. In our experiments, we tested different lengths of the list of translation words, as well as the two weighting methods in query vectors. The following table shows the IR effectiveness obtained in different cases.</Paragraph> <Paragraph position="27"> We observe that when the length of the translation word list increases from 10 to 50, the retrieval effectiveness increases slightly. However, when the length becomes too high (100), the effectiveness declines. This phenomenon may be explained as follows: the more words we retain in the translation: 1) the more related words get to be included; but 2) the more unrelated words get to be included as well. A good compromise is needed.</Paragraph> <Paragraph position="28"> Comparing lists of length 100 with shorter ones confirms our intuition that ignoring words with low probabilities reduces the risk of incorrect word associations, and thus the risk of retrieving irrelevant documents.</Paragraph> <Paragraph position="29"> It is also evident that the transformed weighting, which takes into account the idf criterion, produces better results than translation probabilities alone. This is just another confirmation of the importance of the idf criterion in IR.</Paragraph> <Paragraph position="30"> To compare with the systems participating in the TREC-6 trial, we evaluated our system using the transformed weight, at the lengths of 20 and 50. We obtain 29.71% and 29.97% in performance respectively.</Paragraph> <Paragraph position="31"> We mentioned above that our probabilistic translation model is sometimes unable to distinguish true translations from accidental statistical associations. 
We thought it might help to incorporate additional evidence of a true translation relationship if any such evidence was available. It is often the case in IR that combining different sources of evidence increases IR effectiveness. This is why we tried combining our probabilistic translation model with the bilingual dictionary mentioned above.</Paragraph> <Paragraph position="32"> Combining the probabilistic translation model with a bilingual dictionary A problem arises in such a combination due to the different nature of each element: one is weighted and the other is not. In other words, the question is the following: if a French word is a translation of an English word in the bilingual dictionary, how much should we increase the weight (probability) of this translation in the probabilistic model? Our goal was not to provide a theoretically well-founded answer to that question but simply to see if a simple-minded solution would prove useful in practice. We tested the following approach: when a French translation is stored in the bilingual dictionary, its probability is increased by a default value, a constant determined manually. The new &quot;probability&quot; is used to obtain the transformed weight for the query vector as before. We tested several default values, ranging from 0.005 to 0.05.</Paragraph> <Paragraph position="33"> The following table shows the IR effectiveness obtained in each case.</Paragraph> <Paragraph position="34"> First and foremost, note that in all cases the combined resources yield better retrieval effectiveness than either the probabilistic model alone or the bilingual dictionary alone. This strongly confirms our intuition that combining two sources of information should produce better results.</Paragraph> <Paragraph position="35"> In many of the tested cases the combined approach outperforms the MT systems. 
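The combination scheme can be sketched as follows: a constant is added to the probability of every candidate word attested in the dictionary, the n best words are kept, and each retained probability is multiplied by log(N/n). The sketch rests on assumptions that the text leaves open: that the boost is applied before the list is cut to n words, and that dictionary entries absent from the model's output are not added. The function name combined_query_vector and all numbers are hypothetical.

```python
import math


def combined_query_vector(probs, dictionary_hits, doc_freq, n_docs, top_n=50, boost=0.02):
    """Boost dictionary-attested words, keep the top_n, then weight by log(N / n)."""
    boosted = {
        f: p + (boost if f in dictionary_hits else 0.0)
        for f, p in probs.items()
    }
    # Keep only the best scoring candidate translations.
    best = sorted(boosted.items(), key=lambda item: item[1], reverse=True)[:top_n]
    # Words that never occur in the collection cannot match any document; skip them.
    return {
        f: p * math.log(n_docs / doc_freq[f])
        for f, p in best
        if doc_freq.get(f)
    }


# Invented values for illustration only.
PROBS = {"médicament": 0.10, "drogue": 0.08, "prendre": 0.09, "trafic": 0.12}
DICT_HITS = {"médicament", "drogue", "trafic"}
DOC_FREQ = {"médicament": 100, "drogue": 50, "prendre": 5000, "trafic": 80}
QUERY = combined_query_vector(PROBS, DICT_HITS, DOC_FREQ, 10000, top_n=3, boost=0.02)
```

In this toy case the spurious association "prendre" scores higher than "drogue" before the boost but falls out of the three retained words afterwards, which is the effect the combination is meant to achieve.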
In the case where the default value is 0.02, and 50 translation words are retained, we obtained the best effectiveness, 29.85% (among all the tested cases). It may be claimed here that there are better tools for CLIR than MT systems. For the 13 queries used in the TREC-6 tests, we obtain 34.26% and 30.49% for the cases where the default value is set at 0.02, and the lengths at 20 and 50. These performances are excellent in comparison with the best system at the TREC-6 conference (24.35%).</Paragraph> <Paragraph position="36"> Although the improvements in effectiveness of the combined approach over MT systems obtained so far are still small, we think that this approach may be further improved by 1) using a better training corpus; 2) using a more complete bilingual dictionary; and 3) using a better method of combination. It is also possible to combine our probabilistic translation model with an MT system. As these two methods are based on different knowledge sources, the results could well prove superior too. We plan to examine this combination in the future.</Paragraph> </Section> </Section> </Paper>