<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0605"> <Title>Cross-Language Information Retrieval for Technical Documents</Title> <Section position="3" start_page="29" end_page="30" type="metho"> <SectionTitle> 2 System Overview </SectionTitle> <Paragraph position="0"> Before explaining our CLIR system, we classify existing CLIR methods into three approaches in terms of the implementation of the translation phase. The first approach translates queries into the document language (Ballesteros and Croft, 1998; Carbonell et al., 1997; Davis and Ogden, 1997; Fujii and Ishikawa, 1999; Hull and Grefenstette, 1996; Kando and Aizawa, 1998; Okumura et al., 1998), while the second approach translates documents into the query language (Gachot et al., 1996; Oard and Hackett, 1997). The third approach transfers both queries and documents into an interlingual representation: bilingual thesaurus classes (Mongar, 1969; Salton, 1970; Sheridan and Ballerini, 1996) or language-independent vector space models (Carbonell et al., 1997; Dumais et al., 1996). We prefer the first approach, &quot;query translation&quot;, to the others because (a) translating all the documents in a given collection is expensive, (b) the use of thesauri requires manual construction or bilingual comparable corpora, (c) interlingual vector space models also need comparable corpora, and (d) query translation can easily be combined with existing IR engines, and thus the implementation cost is low. At the same time, we concede that other CLIR approaches are worth further exploration.</Paragraph> <Paragraph position="1"> Figure 1 depicts the overall design of our CLIR system, where most components are the same as those for monolingual IR, except for the &quot;translator&quot;.</Paragraph> <Paragraph position="2"> First, the &quot;tokenizer&quot; processes &quot;documents&quot; in a given collection to produce an inverted file (&quot;surrogates&quot;). Since our system is bidirectional, tokenization differs depending on the target language. In the case where documents are in English, tokenization involves eliminating stopwords and identifying root forms for inflected words, for which we used &quot;WordNet&quot; (Miller et al., 1993). On the other hand, we segment Japanese documents into lexical units using the &quot;ChaSen&quot; morphological analyzer (Matsumoto et al., 1997) and discard stopwords. In the current implementation, we use word-based uni-gram indexing for both English and Japanese documents. In other words, compound words are decomposed into base words in the surrogates. Note that the indexing and retrieval methods are theoretically independent of the translation method.</Paragraph> <Paragraph position="3"> Thereafter, the &quot;translator&quot; processes a query in the source language (&quot;S-query&quot;) to output the translation (&quot;T-query&quot;). T-query can consist of more than one translation, because multiple translations are often appropriate for a single technical term.</Paragraph> <Paragraph position="4"> Finally, the &quot;IR engine&quot; computes the similarity between T-query and each document in the surrogates based on the vector space model (Salton and McGill, 1983), and sorts documents according to the similarity in descending order. We compute term weights based on the notion of TF.IDF. Note that T-query is decomposed into base words, as performed in the document preprocessing.</Paragraph>
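<Paragraph position="5"> The paper does not spell out its exact weighting formula, so the following minimal sketch illustrates one common variant of TF.IDF weighting and cosine ranking under the vector space model; the function names and the particular weighting variant are our own assumptions, not the authors' implementation.

    import math
    from collections import Counter

    def tfidf_vectors(docs):
        # docs: list of tokenized documents (lists of base words).
        # A common TF.IDF variant: raw term frequency times log(N / df).
        n = len(docs)
        df = Counter(term for doc in docs for term in set(doc))
        idf = {term: math.log(n / df[term]) for term in df}
        return [{term: tf * idf[term] for term, tf in Counter(doc).items()}
                for doc in docs]

    def cosine(query_vec, doc_vec):
        # Cosine similarity between two sparse vectors stored as dicts.
        dot = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
        nq = math.sqrt(sum(w * w for w in query_vec.values()))
        nd = math.sqrt(sum(w * w for w in doc_vec.values()))
        return dot / (nq * nd) if nq and nd else 0.0

Ranking then amounts to computing the cosine between the T-query vector and every document surrogate and sorting the documents in descending order, as described above.</Paragraph>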
<Paragraph position="6"> In Section 3, we will explain the &quot;translator&quot; in Figure 1, which involves the compound word translation and transliteration modules.</Paragraph> </Section> <Section position="4" start_page="30" end_page="32" type="metho"> <SectionTitle> 3 Translation Module </SectionTitle> <Section position="1" start_page="30" end_page="30" type="sub_section"> <SectionTitle> 3.1 Overview </SectionTitle> <Paragraph position="0"> Given a query in the source language, tokenization is first performed as for target documents (see Figure 1). To put it more precisely, we use WordNet and ChaSen for English and Japanese queries, respectively. We then discard stopwords and extract only content words. Here, &quot;content words&quot; refer to both single and compound words. Let us take the following query as an example: &quot;improvement of data mining methods&quot;.</Paragraph> <Paragraph position="1"> For this query, we discard &quot;of&quot; to extract &quot;improvement&quot; and &quot;data mining methods&quot;. Thereafter, we translate each extracted content word individually. Note that we currently do not consider relations (e.g., syntactic relations and collocational information) between content words. If a single word, such as &quot;improvement&quot; in the example above, is listed in our bilingual dictionary (we will explain the way to produce the dictionary in Section 3.2), we use all possible translation candidates as query terms for the subsequent retrieval phase.</Paragraph> <Paragraph position="2"> Otherwise, compound word translation is performed. In the case of Japanese-English translation, we consider all possible segmentations of the input word by consulting the dictionary, and select those segmentations that consist of the minimal number of base words. During the segmentation process, the dictionary derives all possible translations for base words. At the same time, transliteration is performed whenever katakana sequences unlisted in the dictionary are found. On the other hand, in the case of English-Japanese translation, transliteration is applied to any unlisted base word (including the case where the input English word consists of a single base word). Finally, we compute the probability of occurrence of each combination of base words in the target language, and select those with greater probabilities, for both Japanese-English and English-Japanese translations.</Paragraph>
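<Paragraph position="3"> As a rough illustration of the segmentation step just described, the sketch below enumerates all dictionary-based splits of an input compound and keeps only those with the fewest base words; the function name and toy dictionary interface are hypothetical, not the authors' code.

    def minimal_segmentations(word, dictionary):
        # Enumerate every way to split `word` into base words listed in
        # `dictionary`, then keep the splits with the fewest base words.
        results = []

        def split(rest, parts):
            if not rest:
                results.append(parts)
                return
            for i in range(1, len(rest) + 1):
                if rest[:i] in dictionary:
                    split(rest[i:], parts + [rest[:i]])

        split(word, [])
        if not results:
            return []
        fewest = min(len(r) for r in results)
        return [r for r in results if len(r) == fewest]

Each base word in a surviving segmentation then contributes its dictionary translations, and unlisted katakana sequences are passed to the transliteration module described in Section 3.3.</Paragraph> </Section>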
<Section position="2" start_page="30" end_page="31" type="sub_section"> <SectionTitle> 3.2 Compound Word Translation </SectionTitle> <Paragraph position="0"> This section briefly explains the compound word translation method we previously proposed (Fujii and Ishikawa, 1999). This method translates input compound words on a word-by-word basis, maintaining the word order in the source language (most compound technical terms defined in a bilingual dictionary maintain the same word order in both source and target languages). The source compound word S and one translation candidate T are represented as below:

    S = s_1, s_2, ..., s_n        T = t_1, t_2, ..., t_n

Here, s_i and t_i denote the i-th base words in the source and target languages, respectively.</Paragraph> <Paragraph position="1"> Our task, i.e., to select the T which maximizes P(T|S), is transformed into Equation (1) through use of Bayes' theorem:

    arg max_T P(T|S) = arg max_T P(S|T) * P(T)    (1)

P(S|T) and P(T) are approximated as in Equation (2), a decomposition that has commonly been used in recent statistical NLP research (Church and Mercer, 1993):

    P(S|T) * P(T) ≈ Π_i P(s_i|t_i) * P(t_{i+1}|t_i)    (2)</Paragraph> <Paragraph position="2"> We produced our own dictionary, because conventional dictionaries consist primarily of general words and verbose definitions aimed at human readers. We extracted 59,533 English/Japanese translations consisting of two base words from the EDR technical terminology dictionary, which contains about 120,000 translations related to the information processing field (Japan Electronic Dictionary Research Institute, 1995), and segmented the Japanese entries into two parts (English entries are already lexically segmented into words, while Japanese compound words lack lexical segmentation). For this purpose, simple heuristic rules based mainly on Japanese character types (i.e., kanji, katakana, hiragana, alphabet and other characters like numerals) were used. Given the set of compound words whose Japanese entries are segmented, we aligned English and Japanese base words on a word-by-word basis, maintaining the word order between English and Japanese, to produce a Japanese-English/English-Japanese base word dictionary. As a result, we extracted 24,439 Japanese base words and 7,910 English base words from the EDR dictionary. During the dictionary production, we also counted the collocational frequency of each combination of s_i and t_i, in order to estimate P(s_i|t_i). In the case where s_i is transliterated into t_i, we use an arbitrarily predefined value for P(s_i|t_i). For the estimation of P(t_{i+1}|t_i), we use the word-based bi-gram statistics obtained from target language corpora, i.e., the &quot;documents&quot; in the collection (see Figure 1).</Paragraph>
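<Paragraph position="3"> To make Equations (1) and (2) concrete, the sketch below scores every combination of per-word translation candidates by the product of translation probabilities P(s_i|t_i) and target-language bigram probabilities P(t_{i+1}|t_i), and keeps the best-scoring combinations; the probability tables and names are hypothetical placeholders, not the authors' data structures.

    from itertools import product

    def score(source, target, p_trans, p_bigram):
        # Equation (2): product of word translation probabilities and
        # target-language word bigram probabilities.
        p = 1.0
        for s, t in zip(source, target):
            p *= p_trans.get((s, t), 0.0)
        for t1, t2 in zip(target, target[1:]):
            p *= p_bigram.get((t1, t2), 0.0)  # P(t2|t1) from corpus bi-grams
        return p

    def best_translations(source, candidates, p_trans, p_bigram, k=3):
        # candidates: dict mapping each source base word to its possible
        # target-language translations (from the base word dictionary or
        # the transliteration module).
        combos = product(*[candidates[s] for s in source])
        scored = sorted(((score(source, t, p_trans, p_bigram), t)
                         for t in combos), reverse=True)
        return scored[:k]

The bigram factors penalize target word sequences that are rare in the document collection, which is how surrounding base words filter out spurious candidates (see Section 3.3).</Paragraph> </Section>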
<Section position="3" start_page="31" end_page="32" type="sub_section"> <SectionTitle> 3.3 Transliteration </SectionTitle> <Paragraph position="0"> Figure 2 shows example correspondences between English and (romanized) katakana words, where we insert hyphens between each katakana character for enhanced readability. The basis of our transliteration method is analogous to that for compound word translation described in Section 3.2. The source word S and one transliteration candidate T are represented as below:

    S = s_1, s_2, ..., s_n        T = t_1, t_2, ..., t_n

However, unlike the case of compound word translation, s_i and t_i denote the i-th &quot;symbols&quot; (which consist of one or more letters), respectively. Note that we consider only those T's that are indexed in the inverted file, because our transliteration method often outputs a number of incorrect words with great probabilities.</Paragraph> <Paragraph position="1"> Then, we compute P(T|S) for each T using Equations (1) and (2) (see Section 3.2), and select the k-best candidates with greater probabilities. The crucial issue here is the way to produce a bilingual dictionary for symbols. For this purpose, we used approximately 3,000 katakana entries and their English translations listed in our base word dictionary. To illustrate our dictionary production method, we consider Figure 2 again. Looking at this figure, one may notice that the first romanized letter of each katakana character tends to be contained in its corresponding English word. However, there are a few exceptions. A typical case is that, since Japanese has no distinction between the &quot;L&quot; and &quot;R&quot; sounds, the two English sounds collapse into the same Japanese sound. In addition, a single English letter can correspond to multiple katakana characters, such as &quot;x&quot; to &quot;ki-su&quot; in &quot;<text, te-ki-su-to>&quot;. To sum up, English and romanized katakana words are not exactly identical, but similar to each other.</Paragraph> <Paragraph position="2"> We first manually define the similarity between an English letter e and the first romanized letter j of each katakana character, as shown in Table 1. In this table, &quot;phonetically similar&quot; letters refer to certain pairs of letters, such as &quot;L&quot; and &quot;R&quot; (we identified approximately twenty pairs of phonetically similar letters). We then consider the similarity for any possible combination of letters in the English and romanized katakana words, which can be represented as a matrix, as shown in Figure 3. This figure shows the similarity between letters in &quot;<text, te-ki-su-to>&quot;. We put a dummy letter &quot;$&quot;, which has a positive similarity only to itself, at the end of both English and katakana words. One may notice that matching plausible symbols can be seen as finding the path which maximizes the total similarity from the first to the last letters. The best path can easily be found by, for example, Dijkstra's algorithm (Dijkstra, 1959). From Figure 3, we can derive the following correspondences: &quot;<te, te>&quot;, &quot;<x, ki-su>&quot; and &quot;<t, to>&quot;. The resultant correspondences contain 944 Japanese and 790 English symbol types, from which we also estimated P(s_i|t_i) and P(t_{i+1}|t_i).</Paragraph> <Paragraph position="3"> As can be predicted, a preliminary experiment showed that our transliteration method is not accurate when compared with word-based translation. For example, the Japanese word &quot;re-ji-su-ta (register)&quot; is transliterated to &quot;resister&quot;, &quot;resistor&quot; and &quot;register&quot;, in descending order of probability score. However, combined with the compound word translation, irrelevant transliteration outputs are expected to be discarded. For example, a compound word like &quot;re-ji-su-ta tensou gengo (register transfer language)&quot; is successfully translated, given the base words &quot;tensou (transfer)&quot; and &quot;gengo (language)&quot; as a context.</Paragraph> <Paragraph position="4"> Table 1: Similarity between an English letter e and the first romanized letter j of a katakana character: 3 if e and j are identical; 2 if e and j are phonetically similar; 1 if both e and j are vowels or both are consonants.</Paragraph>
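<Paragraph position="5"> The path search just described can be sketched as follows; we use a simple dynamic program over the similarity matrix, which is equivalent to a shortest-path search such as Dijkstra's algorithm on this acyclic grid, and the phonetic-similarity pairs listed are only an illustrative subset of the roughly twenty pairs mentioned above.

    VOWELS = set("aeiou")
    # Illustrative subset of the phonetically similar letter pairs.
    SIMILAR = {("l", "r"), ("r", "l"), ("b", "v"), ("v", "b")}

    def similarity(e, j):
        # Table 1: 3 identical, 2 phonetically similar, 1 same class
        # (both vowels or both consonants), 0 otherwise. The dummy "$"
        # matches only itself.
        if e == "$" or j == "$":
            return 3 if e == j else 0
        if e == j:
            return 3
        if (e, j) in SIMILAR:
            return 2
        if (e in VOWELS) == (j in VOWELS):
            return 1
        return 0

    def best_path_score(english, romaji):
        # Maximum-total-similarity monotone path through the matrix;
        # backtracking over `dp` recovers the letter correspondences,
        # from which symbol pairs such as (x, ki-su) are read off.
        n, m = len(english), len(romaji)
        dp = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                prev = 0
                if i and j:
                    prev = max(dp[i-1][j-1], dp[i-1][j], dp[i][j-1])
                elif i:
                    prev = dp[i-1][j]
                elif j:
                    prev = dp[i][j-1]
                dp[i][j] = similarity(english[i], romaji[j]) + prev
        return dp[n-1][m-1]

For instance, best_path_score(&quot;text$&quot;, &quot;tekisuto$&quot;) scores the alignment underlying the correspondences te/te, x/ki-su and t/to shown in Figure 3.</Paragraph> </Section> </Section> </Paper>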