File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/99/p99-1029_metho.xml
Size: 19,209 bytes
Last Modified: 2025-10-06 14:15:26
<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1029"> <Title>Using Mutual Information to Resolve Query Translation Ambiguities and Query Term Weighting</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Dept. of Computer Science, </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="223" type="metho"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> An easy way of translating queries from one language into another for cross-language information retrieval (IR) is to use a simple bilingual dictionary. Because of the general-purpose nature of such dictionaries, however, this simple method yields a severe translation ambiguity problem. This paper describes the degree to which this problem arises in Korean-English cross-language IR and suggests a relatively simple yet effective method for disambiguation using mutual information statistics obtained only from the target document collection. In this method, mutual information is used not only to select the best candidate but also to assign a weight to query terms in the target language. Our experimental results based on the TREC-6 collection show that this method can achieve up to 85% of the effectiveness of monolingual retrieval and 96% of that of manual disambiguation.</Paragraph> <Paragraph position="1"> Introduction Cross-language information retrieval (IR) enables a user to retrieve documents written in diverse languages using queries expressed in his or her own language. For cross-language IR, either queries or documents are translated to overcome the language differences. Although it is possible to apply a high-quality machine translation system to documents as in Oard &amp; Hackett (1997), query translation has emerged as a more popular method because it is much simpler and more economical than document translation. 
Query translation can follow one or more of three approaches: a dictionary-based approach, a thesaurus-based approach, or a corpus-based approach.</Paragraph> <Paragraph position="2"> There are three problems that a cross-language IR system using a query translation method must solve (Grefenstette, 1998). The first problem is to figure out how a term expressed in one language might be written in another. The second problem is to determine which of the possible translations should be retained. The third problem is to determine how to properly weight the importance of translation alternatives when more than one is retained.</Paragraph> <Paragraph position="3"> For cross-language IR between Korean and English, i.e. between Korean queries and English documents, an easy way to handle query translation is to use a Korean-English machine-readable dictionary (MRD) because such bilingual MRDs are more widely available than other resources such as parallel corpora.</Paragraph> <Paragraph position="4"> However, it is known that with a simple use of bilingual dictionaries in other language pairs, retrieval effectiveness can be only 40%-60% of that with monolingual retrieval (Ballesteros &amp; Croft, 1997). Clearly, additional resources need to be used for better performance.</Paragraph> <Paragraph position="5"> This paper focuses on the last two problems: pruning translations and calculating the weights for translation alternatives. We first describe the overall query translation process and the extent to which the ambiguity problem arises in Korean-English cross-language IR. We then propose a relatively simple yet effective method for resolving translation ambiguity using mutual information (MI) (Church and Hanks, 1990) statistics obtained only from the target document collection. 
In this method, mutual information is used not only to select the best candidate but also to assign a weight to query terms in the target language.</Paragraph> </Section> <Section position="3" start_page="223" end_page="224" type="metho"> <SectionTitle> 1 Overall Query Translation Process </SectionTitle> <Paragraph position="0"> Our Korean-to-English query translation scheme works in four stages: keyword selection, dictionary-based query translation, bilingual word sense disambiguation, and query term weighting. Although none of the common resources such as dictionaries, thesauri, and corpora alone is complete enough to produce high quality English queries, we decided to use a bilingual dictionary at the second stage and a target-language corpus for the third and the fourth stages. Our strategy was to try not to depend on scarce resources to make the approach practical. Figure 1 shows the four stages of Korean-to-English query translation.</Paragraph> <Section position="1" start_page="223" end_page="223" type="sub_section"> <SectionTitle> 1.1 Keyword Selection </SectionTitle> <Paragraph position="0"> At the first stage, Korean keywords to be fed into the query translation process are extracted from a quasi-natural language query. This keyword selection is done with a morphological analyzer and a stochastic part-of-speech (POS) tagger for the Korean language (Shin et al., 1996). The role of the tagger is to help select the exact morpheme sequence from the multiple candidate sequences generated by the morphological analysis. This process of employing a morphological analysis and a tagger is crucial for selecting legitimate query words from the topic statements because Korean is an agglutinative language. 
Without the tagger, all the extraneous candidate keywords generated by the morphological analyzer would have to be entered into the translation process, which in and of itself generates extraneous words because of the one-to-many mapping in the bilingual dictionary.</Paragraph> </Section> <Section position="2" start_page="223" end_page="223" type="sub_section"> <SectionTitle> 1.2 Dictionary-Based Query Translation </SectionTitle> <Paragraph position="0"> The second stage does the actual query translation based on a dictionary look-up, by applying both word-by-word translation and phrase-level translation. For the correct identification of phrases in a Korean query, it would help to identify the lexical relations and produce statistical information on pairs of words in a text corpus as in Smadja (1993). Since the bilingual dictionary lacks some words that are essential for a correct interpretation of the Korean query, it is important to identify unknown words such as foreign words and transliterate them into English strings that need to be matched against an English dictionary (Jeong et al., 1997).</Paragraph> </Section> <Section position="3" start_page="223" end_page="223" type="sub_section"> <SectionTitle> 1.3 Selection of the Correct Translations </SectionTitle> <Paragraph position="0"> At the word disambiguation stage, we filter out the extraneous words generated blindly by the dictionary lookup process. In addition to the POS tagger, we employed a bilingual word disambiguation technique using the co-occurrence information extracted from the collection of target documents. More specifically, the mutual information statistics between pairs of words were used to determine whether English words from different sets generated by the translation process are &quot;compatible&quot;. In a sense, we make use of the mutual disambiguation effect among query terms. 
More details are described in Section 3.</Paragraph> </Section> <Section position="4" start_page="223" end_page="224" type="sub_section"> <SectionTitle> 1.4 Query Term Weighting </SectionTitle> <Paragraph position="0"> Finally, we apply our query term weighting technique to produce the final target query. The term weighting scheme basically reflects the degree of association between the translated terms, and we give a high or low term weighting value according to the degree of mutual association between query terms. This is another area where we make use of mutual information obtained from a text corpus. The result from the four stages is a set of query terms to be used in a vector-space retrieval model.</Paragraph> </Section> </Section> <Section position="4" start_page="224" end_page="224" type="metho"> <SectionTitle> 2 Analysis of Translation Ambiguity </SectionTitle> <Paragraph position="0"> Although an easy way to find translations of query terms is to use a bilingual dictionary, this method alone suffers from problems caused by translation ambiguity since there are often one-to-many correspondences in a bilingual dictionary. For example, in a Korean query consisting of three words, &quot;자동차 공기 오염&quot; (ja-dong-cha gong-gi oh-yum), which means air pollution caused by automobiles, each word can be translated into multiple English words when a Korean-English dictionary is used in a straightforward way. The first word &quot;자동차&quot; (ja-dong-cha) of the query can be translated into semantically similar but different English words such as &quot;motorcar&quot;, &quot;automobile&quot;, and &quot;car&quot;. The second word &quot;공기&quot; (gong-gi), a homonymous word, can be translated into English words with different meanings: &quot;air&quot;, &quot;atmosphere&quot;, &quot;empty vessel&quot;, and &quot;bowl&quot;. 
The last word &quot;오염&quot; (oh-yum) can be translated into two English words, &quot;pollution&quot; and &quot;contamination&quot;.</Paragraph> <Paragraph position="1"> Retaining multiple candidate words can be useful in promoting recall in a monolingual IR system, but previous research indicates that failure to disambiguate the meanings of the words can hurt retrieval effectiveness tremendously. For instance, a phrase like empty vessel would obviously change the meaning of the query entirely. Even a word like contamination, a synonym of pollution, may end up retrieving unrelated documents due to the slight differences in meaning.</Paragraph> <Section position="1" start_page="224" end_page="224" type="sub_section"> <SectionTitle> Title Short Long </SectionTitle> <Paragraph position="0"> Table 1 shows the extent to which ambiguity occurs in our query translation when an English-Korean dictionary is used blindly after the morphological analysis and tagging. The three rows, title, short, and long, indicate three different ways of composing queries from the topic statements in the TREC collection. The left half shows the average number of English words per Korean word for each query, whereas the right half shows the average number of word pairs in English that can be formed from a single word pair in Korean. The latter indicates that the disambiguation process will have to select one out of more than 9 possible pairs on average, regardless of which part of the topic statements is used for formal query generation.</Paragraph> </Section> </Section> <Section position="5" start_page="224" end_page="225" type="metho"> <SectionTitle> 3 Query Translation and Mutual Information </SectionTitle> <Paragraph position="0"> Our strategy for cross-language IR aims at practicality in that we try not to depend on scarce resources. 
Along the same line of reasoning, we opted for a disambiguation approach that requires only a collection of documents in the target language, which is always available in any cross-language IR environment. Since the goal of disambiguation is to select the best pair among many alternatives as described above, the mutual information statistic is a natural choice for judging the degree to which two words co-occur within a certain text boundary. It would be reasonable to choose the pair of words that are most strongly associated with each other, thereby eliminating those translations that are not likely to be correct.</Paragraph> <Paragraph position="1"> Mutual information values are calculated from word co-occurrence statistics and used as a measure of correlation between words. The mutual information MI(x,y) is defined by the following formula (Church and Hanks, 1990).</Paragraph> <Paragraph position="2"> $MI(x,y) = \log_2 \frac{p(x,y)}{p(x)\,p(y)}$</Paragraph> <Paragraph position="3"> Here x and y are words occurring within a window of w words.</Paragraph> <Paragraph position="4"> The probabilities p(x) and p(y) are estimated by counting the number of observations of x and y in a corpus, f(x) and f(y), and normalizing each by N, the size of the corpus. Joint probabilities, p(x,y), are estimated by counting the number of times, f_w(x,y), that x is followed by y in a window of w words and normalizing it by N. In our application of query translation, the joint co-occurrence frequency f_w(x,y) is computed with a 6-word window, which seems to capture semantic relations within a query as well as fixed expressions (idioms such as bread and butter). We require that the word x be followed by the word y within the same sentence.</Paragraph> <Paragraph position="5"> In our query translation scheme, MI values are used to select the most likely translations after each Korean query word is translated into one or more English words. 
Our use of MI values is based on the assumption that when two words co-occur in the same query, they are likely to co-occur in the same affinity in documents.</Paragraph> <Paragraph position="6"> Conversely, two words that do not co-occur in the same affinity are not likely to show up in the same query. In a sense, we are conjecturing that mutual information can reveal some degree of semantic association between words.</Paragraph> <Paragraph position="7"> Table 2 gives some examples of MI values for the alternative word pairs for translated queries of the TREC-6 Cross-Language IR Track. These MI values were extracted from the English text corpus consisting of 1988-1990 AP news, strong and produce credible results for disambiguation of translations. However, if MI(x,y) &lt; 0, we can predict that the word x and word y are in complementary distribution.</Paragraph> </Section> <Section position="6" start_page="225" end_page="226" type="metho"> <SectionTitle> 4 Disambiguation and Weight Calculation </SectionTitle> <Paragraph position="0"> We can alleviate the translation ambiguity by discriminating against those word pairs with low MI values. The word pair with the highest MI value is considered to be the correct one among all the candidates in the two sets. Since a query is likely to be targeted at a single concept, regardless of how broad or narrow it is, we conjecture that words describing the concept are likely to have a high degree of association.</Paragraph> <Paragraph position="1"> Although we use the mutual information statistic to measure the association, others such as those used by Ballesteros &amp; Croft (1998) can be considered.</Paragraph> <Paragraph position="2"> In the example of Section 2, each Korean word has multiple English words due to translation ambiguity. Figure 2 shows the MI values calculated for the word pairs comprising the translations of the original query. The words under w1, w2, and w3 are the translations of the three query words, respectively. 
The lines indicate that mutual information values are available for the pairs, and the numbers show some of the significant MI values for the corresponding pairs among all the possible pairs.</Paragraph> <Paragraph position="4"/> <Section position="1" start_page="225" end_page="226" type="sub_section"> <SectionTitle> Values </SectionTitle> <Paragraph position="0"> Our bilingual word disambiguation and weighting schemes rely on both relative and absolute magnitudes of the MI values. The algorithm first looks for the pair with the highest MI value and selects the best candidates before and after the pair by comparing the MI values for the pairs that are connected with the initially chosen pair. This process is applied to the words immediately before or after the chosen pair in order to limit the effect of a choice that may be incorrect.</Paragraph> <Paragraph position="1"> It should be noted that the words not chosen in this process are not used in the translated query unless their MI values are greater than a threshold. As described below, we assume that the candidates not in the first tier may still be useful if they are strongly associated with the adjacent word selected.</Paragraph> <Paragraph position="2"> For example, the word pair &lt;air, pollution&gt;, which has the bold line representing the strongest association in the column, is chosen first. Then the three MI values for the pairs containing air are compared to select the &lt;automobile, air&gt; pair, resulting in &lt;automobile, air, pollution&gt;. If there were additional columns in the example, the same process would be applied to the rest of the network.</Paragraph> <Paragraph position="3"> There are three reasons why query term weighting is of some value in addition to the pruning of conceptually unrelated terms. First, our word selection method is not guaranteed to give the correct translation. 
The method would give a reasonable result only when two consecutive query terms are actually used together in many documents, a hypothesis whose validity is yet to be confirmed. Second, there may be more than one strong association whose degrees differ from each other by a large magnitude. Third, seemingly extraneous terms may serve as a recall-enhancing device with a query expansion effect.</Paragraph> <Paragraph position="4"> The basic idea in our term weighting scheme is to give a large weight to the best candidate and divide the remaining quantity to assign equal weights to the rest of the candidates. In other words, the weight for the best candidate, W_b, is 1 if its MI value is greater than a threshold value; otherwise it is expressed as follows.</Paragraph> <Paragraph position="6"> Here x and θ are an MI value and a threshold, respectively. The numerator, f(x), gives the smallest integer greater than the MI value so that the resulting weight is the same for all the candidates whose MI values are within a certain interval. Once the value for W_b is calculated, the weights for the rest of the candidates are calculated as follows:</Paragraph> <Paragraph position="7"> $W_r = \frac{1 - W_b}{n - 1}$</Paragraph> <Paragraph position="8"> where n is the number of candidates. It should be noted that $W_b + \sum W_r = 1$.</Paragraph> <Paragraph position="9"> Based on our observation of the calculated MI values, we chose to use 3.0 as the cut-off value in choosing the best candidate and assigning it a fairly high weight. The cut-off value was determined purely from the data we obtained; it can vary with the range of MI values when different corpora are used.</Paragraph> <Paragraph position="10"> In the example of Fig. 2, the word pair candidates between w1 and w2 are (motorcar, air), (automobile, air), and (car, air). Because the weight of the word pair (automobile, air) is W_b = 0.83, the word &quot;automobile&quot; has a relatively higher term weight than the other two words &quot;motorcar&quot; and &quot;car&quot;. 
Finally, the optimal English query set with term weights, &lt;(motorcar, 0.085), (automobile, 0.83), (car, 0.085)&gt;, is generated for the translations of w1.</Paragraph> </Section> </Section> </Paper>