File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/00/w00-1313_abstr.xml

Size: 8,229 bytes

Last Modified: 2025-10-06 13:41:53

<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1313">
  <Title>Query Translation in Chinese-English Cross-Language Information Retrieval</Title>
  <Section position="1" start_page="0" end_page="105" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> This paper proposes a new query translation method based on the mutual information matrices of terms in the Chinese and English corpora. Instead of looking up a bilingual phrase dictionary, a compositional phrase in the query (a phrase whose translation can be derived from the translations of its components) can be translated indirectly via a general-purpose Chinese-English dictionary look-up procedure. A novel method for selecting the translations of query terms is also presented in detail. Our query translation method ultimately constructs an English query in which each query term has a weight. The evaluation results show that the retrieval performance achieved by our query translation method is about 73% of that of monolingual information retrieval and about 28% higher than that of simple word-by-word translation.</Paragraph>
    <Paragraph position="1"> Introduction
With the rapid growth of electronic documents and the rapid development of networks in China, more and more people are using the Internet, on which, however, English is the most widely used language. Most people in China find it difficult to use English fluently, so they would like to express their queries in Chinese to retrieve the relevant English documents. This situation motivates research in Cross-Language Information Retrieval (CLIR). There are two approaches to CLIR: one is query translation; the other is translating the original-language documents into destination-language equivalents. (This research was supported by the National Science Fund of China for Distinguished Young Scholars under contract 69983009.)</Paragraph>
    <Paragraph position="2"> Obviously, the latter is a very expensive task, since a collection contains so many documents and there is not yet a reliable machine translation system that can process them automatically. Most researchers are inclined to choose the query translation approach \[Oard (1996)\]. Methods for query translation have focused on three areas: the employment of machine translation techniques; dictionary-based translation \[Hull &amp; Grefenstette (1996); Ballesteros &amp; Croft (1996)\]; and the use of parallel or comparable corpora to generate a translation model \[Davis &amp; Dunning (1995); Sheridan &amp; Ballerini (1996); Nie, Jian-Yun et al. (1999)\]. The machine translation (MT) method faces many obstacles to its use in CLIR, such as the need for deep syntactic and semantic analysis, user queries that consist of only one or two words, and the arduous task of building an MT system. Dictionary-based query translation is the most popular method because it is easy to perform. The main reasons for the large drops in CLIR effectiveness with this method are the ambiguity caused by a query term having more than one translation and the failure to translate phrases during query translation. Previous studies \[Hull &amp; Grefenstette (1996); Ballesteros &amp; Croft (1996)\] have shown that automatic word-by-word (WBW) query translation via a machine-readable dictionary (MRD) results in a 40-60% loss in effectiveness relative to monolingual retrieval. With regard to translation methods based on parallel corpora, the criticism most often raised concerns the availability of reliable parallel text corpora. An alternative is to make use of comparable corpora, because they are easier to obtain and there are more and more bilingual and even multilingual documents on the Internet.
From the analysis of a document collection, an associated word list can be produced, and it is often used to expand the query in monolingual information retrieval \[Qiu</Paragraph>
    <Paragraph position="4"> In this paper, a new query translation method is presented that combines the dictionary-based method with the analysis of comparable corpora.</Paragraph>
    <Paragraph position="5"> The ambiguity problem and the loss of phrase information are addressed in dictionary-based Chinese-English Cross-Language Information Retrieval (CECLIR). The remainder of this paper is organized as follows: Section 1 gives a method for calculating the mutual information matrices of the Chinese-English comparable corpora. Section 2 develops a scheme for selecting the translations of the Chinese query terms and describes how compositional phrases are preserved in our method.</Paragraph>
    <Paragraph position="6"> Section 3 presents a set of preliminary experiments on comparable corpora to evaluate our query translation method and gives some explanations.</Paragraph>
    <Paragraph position="7"> 1. Mutual information matrices calculation
We hypothesize that the words in a sentence, after the stop words have been removed, are associated with each other and work together to express a query requirement. The association between two words can be indicated by their mutual information, which can further be used to discover phrases \[Church &amp; Hanks (1990)\]. If two words are independent of each other, their mutual information is close to zero. If they are strongly related, the mutual information is much greater than zero and they are very likely to form a phrase; if they occur in complementary distribution, the mutual information is negative. In conclusion, the larger the mutual information of a word pair, the more probable it is that the pair is a phrase.</Paragraph>
    <Paragraph position="8"> According to \[Fano (1961)\], we can define the mutual information $MI(t_1, t_2)$ of terms $t_1$ and $t_2$ as
$$MI(t_1, t_2) = \log \frac{P(t_1, t_2)}{P(t_1)\,P(t_2)} \qquad (1)$$</Paragraph>
    <Paragraph position="10"> where $P(t_1, t_2)$ is the co-occurrence probability of terms $t_1$ and $t_2$ in a Chinese sentence. The reason we use a Chinese sentence as the window, rather than a fixed-length window, is that a full Chinese sentence keeps more linguistic information; consequently, it is more reasonable to regard $t_1$ and $t_2$ as a phrase when they co-occur in a sentence. $P(t_1)$ and $P(t_2)$ are the occurrence probabilities of terms $t_1$ and $t_2$ in a sentence. These probabilities can be calculated from the occurrences of terms $t_1$ and $t_2$ in the collection, as in equations (2), (3) and (4).</Paragraph>
    <Paragraph position="12"> Here, $n_{t_1}$ and $n_{t_2}$ are the individual term frequencies of $t_1$ and $t_2$, counted whenever the respective term occurs in a sentence of the collection; $n_{t_1,t_2}$ is the co-occurrence frequency of $t_1$ and $t_2$, counted whenever both occur in the same sentence of the collection; and $N$ is the number of sentences in the collection.</Paragraph>
    <Paragraph position="13"> Substituting equations (2), (3) and (4) into (1), the mutual information of terms $t_1$ and $t_2$ can be expressed by the following formula.</Paragraph>
    <Paragraph position="14"> $$MI(t_1, t_2) = \log \frac{n_{t_1,t_2} \cdot N}{n_{t_1} \cdot n_{t_2}} \qquad (5)$$</Paragraph>
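The sentence-window computation described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes the sentence-level estimates P(t) = n_t / N and P(t1, t2) = n_{t1,t2} / N implied by the definitions of the counts, and the function name `mutual_information`, the base-2 logarithm, and the toy sentences are all assumptions made for the example. Pairs that never co-occur are omitted, because the logarithm is undefined for a zero count.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(sentences, log_base=2.0):
    """Return {(t1, t2): MI} for every term pair that co-occurs in a sentence.

    `sentences` is an iterable of token lists with stop words already removed.
    All counts are per sentence: a term is counted once for each sentence
    that contains it. The log base is an assumption (not stated in the text).
    """
    n_term = Counter()   # n_t: number of sentences containing term t
    n_pair = Counter()   # n_{t1,t2}: number of sentences containing both terms
    n_sentences = 0      # N: number of sentences in the collection

    for tokens in sentences:
        n_sentences += 1
        unique = sorted(set(tokens))          # each term once per sentence
        n_term.update(unique)
        n_pair.update(combinations(unique, 2))

    # MI(t1, t2) = log(n_{t1,t2} * N / (n_{t1} * n_{t2}))
    return {
        (t1, t2): math.log(n12 * n_sentences / (n_term[t1] * n_term[t2]), log_base)
        for (t1, t2), n12 in n_pair.items()
    }


# Toy collection of four "sentences" (hypothetical tokens).
toy = [["a", "b"], ["a", "b"], ["c"], ["d"]]
mi = mutual_information(toy)
print(mi[("a", "b")])  # 1.0: "a" and "b" always occur together
```

Keys are stored as alphabetically ordered pairs, so each unordered pair appears once. With this layout, a pair that co-occurs exactly as often as independence predicts gets a value of zero, and a pair that co-occurs less often gets a negative value, matching the qualitative behaviour described in the text.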
    <Paragraph position="16"> Table 2 and Table 3 show the occurrence frequencies and the mutual information values, calculated by formula (5), for three Chinese compositional phrases and their corresponding English phrases, respectively, found in our comparable corpora.</Paragraph>
    <Paragraph position="18"> [Table caption fragment: ... phrases (N = 184,000)]
Analyzing the Chinese-English comparable corpora in this way, we obtain two mutual information matrices that indicate which pairs of terms (in the Chinese collection, these are almost all Chinese words produced by segmentation) are most likely to form a phrase. A word list associated with each Chinese query term can be obtained by looking up the mutual information matrix of the Chinese corpus with a cutoff of MI = 1.50. As discussed above, the larger the mutual information between two terms, the more likely the two words are to form a phrase. We can infer that the associated word list of a query term contains the terms that are the most likely components of a compositional phrase. In other words, phrase information can be preserved in this way. The Chinese query is translated into English by looking up, in a Chinese-English dictionary, the English senses of each Chinese query term and of the words in its associated word list. The procedures for selecting appropriate translations and constructing the English query are discussed in Section 2.</Paragraph>
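The look-up step described above might be sketched as follows. This is only an illustration under stated assumptions: the function names `associated_words` and `candidate_translations`, the pair-keyed dictionary standing in for the mutual information matrix, the inclusive treatment of the 1.50 cutoff, and the romanized toy terms with their MI values and dictionary entries are all hypothetical. The actual selection among candidate translations is the subject of Section 2 and is not attempted here.

```python
def associated_words(term, mi, cutoff=1.50):
    """Return the terms whose MI with `term` reaches the cutoff, strongest first.

    `mi` maps unordered term pairs (stored as 2-tuples) to MI values.
    Treating the cutoff as inclusive is an assumption.
    """
    scored = []
    for (t1, t2), value in mi.items():
        if value >= cutoff and term in (t1, t2):
            other = t2 if t1 == term else t1
            scored.append((value, other))
    # Sort by descending MI, breaking ties alphabetically.
    return [word for _, word in sorted(scored, key=lambda p: (-p[0], p[1]))]


def candidate_translations(query_terms, mi, dictionary, cutoff=1.50):
    """Collect the dictionary senses of each query term and its associated words."""
    candidates = {}
    for term in query_terms:
        for word in [term] + associated_words(term, mi, cutoff):
            if word in dictionary:
                candidates.setdefault(word, dictionary[word])
    return candidates


# Hypothetical MI values and dictionary entries (romanized Chinese terms).
toy_mi = {("jiansuo", "xinxi"): 2.1, ("xinxi", "zhongguo"): 0.4}
toy_dictionary = {
    "xinxi": ["information", "message"],
    "jiansuo": ["retrieval", "search"],
    "zhongguo": ["China"],
}
print(associated_words("xinxi", toy_mi))
print(candidate_translations(["xinxi"], toy_mi, toy_dictionary))
```

In this toy run only "jiansuo" passes the cutoff for the query term "xinxi", so the candidate set holds the senses of both words, which is how the components of a likely compositional phrase stay available to the later selection step.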
  </Section>
</Paper>