<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0605">
  <Title>Cross-Language Information Retrieval for Technical Documents</Title>
  <Section position="2" start_page="0" end_page="29" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Cross-language information retrieval (CLIR), where the user presents queries in one language to retrieve documents in another language, has recently been one of the major topics within the information retrieval community. One strong motivation for CLIR is the growing number of documents in various languages accessible via the Internet. Since queries and documents are in different languages, CLIR requires a translation phase along with the usual monolingual retrieval phase. For this purpose, existing CLIR systems adopt various techniques explored in natural language processing (NLP) research. In brief, bilingual dictionaries, corpora, thesauri and machine translation (MT) systems are used to translate queries and/or documents.</Paragraph>
    <Paragraph position="1"> In this paper, we propose a Japanese/English CLIR system for technical documents, focusing on the translation of technical terms. We also aim to integrate different components within one framework. Our research is partly motivated by the &quot;NACSIS&quot; test collection for IR systems (Kando et al., 1998; http://www.rd.nacsis.ac.jp/~ntcadm/index-en.html), which consists of Japanese queries and Japanese/English abstracts extracted from technical papers (we elaborate on the NACSIS collection in Section 4). Using this collection, we investigate the effectiveness of each component as well as the overall performance of the system.</Paragraph>
    <Paragraph position="2"> As with MT systems, existing CLIR systems still find it difficult to translate technical terms and proper nouns, which are often unlisted in general dictionaries. Since most CLIR systems target newspaper articles, which consist mainly of general words, the problem of unlisted words has been less explored than other CLIR subtopics (such as the resolution of translation ambiguity). However, Pirkola (1998), for example, used a subset of the TREC collection related to health topics, and showed that a combination of general and domain-specific (i.e., medical) dictionaries improves CLIR performance over that obtained with a general dictionary alone. This result shows the potential contribution of technical term translation to CLIR. At the same time, note that even domain-specific dictionaries do not exhaustively list possible technical terms. We classify the problems associated with technical term translation as follows: (1) technical terms are often compound words, which can be progressively created simply by combining multiple existing morphemes (&quot;base words&quot;), and therefore it is not entirely satisfactory to exhaustively enumerate newly emerging terms in dictionaries; (2) Asian languages often represent loanwords using their special phonograms (primarily for technical terms and proper nouns), which progressively creates new base words (in the case of Japanese, the phonogram is called katakana).</Paragraph>
    <Paragraph position="3"> To counter problem (1), we use the compound word translation method we proposed previously (Fujii and Ishikawa, 1999), which selects appropriate translations based on the probability of occurrence of each combination of base words in the target language. For problem (2), we use &quot;transliteration&quot; (Chen et al., 1998; Knight and Graehl, 1998; Wan and Verspoor, 1998).</Paragraph>
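The selection idea for compound words can be illustrated with a minimal Python sketch. It is not the method of Fujii and Ishikawa (1999) itself: the romanized base words, the dictionary entries, the bigram counts, the crude smoothing, and the function names are all hypothetical and chosen only to show how candidate combinations might be ranked by their probability of occurrence in the target language.

```python
from itertools import product

# Hypothetical toy resources; none of these entries or counts come from the paper.
BASE_DICT = {
    "jouhou": ["information", "intelligence"],
    "kensaku": ["retrieval", "search"],
}
BIGRAM_COUNTS = {
    ("information", "retrieval"): 120,
    ("information", "search"): 30,
    ("intelligence", "search"): 2,
}
TOTAL_BIGRAMS = 1000  # assumed size of the target-language bigram sample


def bigram_prob(first, second, smoothing=0.5):
    """Crudely smoothed relative frequency of a target-language word pair."""
    count = BIGRAM_COUNTS.get((first, second), 0)
    return (count + smoothing) / (TOTAL_BIGRAMS + smoothing)


def translate_compound(base_words):
    """Return the candidate combination with the highest product of pair probabilities."""
    candidates = product(*(BASE_DICT[w] for w in base_words))

    def score(combo):
        result = 1.0
        for first, second in zip(combo, combo[1:]):
            result *= bigram_prob(first, second)
        return result

    return max(candidates, key=score)


best = translate_compound(["jouhou", "kensaku"])
print(" ".join(best))
```

With these assumed counts, the pair seen most often in the target-language sample wins over the other three combinations. A real system would estimate the probabilities from a large target-language corpus rather than a hand-written table.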
    <Paragraph position="4"> Chen et al. (1998) and Wan and Verspoor (1998) proposed English-Chinese transliteration methods that rely on properties of the Chinese phonetic system and therefore cannot be directly applied to transliteration between English and Japanese. Knight and Graehl (1998) proposed a Japanese-English transliteration method based on the mapping probability between English and Japanese katakana sounds. However, since their method needs large-scale phoneme inventories, we propose a simpler approach using surface mapping between English and katakana characters, rather than sounds.</Paragraph>
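A character-level surface mapping can be sketched as follows. This is only an illustration under stated assumptions, not the procedure described in Section 3: the mapping table, its probabilities, the small English vocabulary, and the vocabulary filter are all hypothetical simplifications.

```python
from itertools import product

# Hypothetical katakana-to-English surface mappings with made-up probabilities.
SURFACE_MAP = {
    "デ": [("da", 0.6), ("de", 0.4)],
    "ー": [("", 0.7), ("r", 0.3)],
    "タ": [("ta", 0.8), ("ter", 0.2)],
}
ENGLISH_VOCAB = {"data", "deter"}  # assumed list of known English words


def transliterate(katakana_word):
    """Rank English vocabulary words that can be spelled from the per-character mappings."""
    scored = {}
    for combo in product(*(SURFACE_MAP[ch] for ch in katakana_word)):
        spelling = "".join(piece for piece, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        # Keep only spellings found in the assumed vocabulary, with their best score.
        if spelling in ENGLISH_VOCAB:
            scored[spelling] = max(prob, scored.get(spelling, 0.0))
    return sorted(scored, key=scored.get, reverse=True)


print(transliterate("データ"))
```

The sketch works on written characters only, so it needs no phoneme inventory; the cost is that the mapping table must list several English spellings for each katakana character.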
    <Paragraph position="5"> Section 2 gives an overview of our CLIR system, and Section 3 elaborates on the translation module, focusing on compound word translation and transliteration. Section 4 then evaluates the effectiveness of our CLIR system using the standardized IR evaluation method used in TREC programs.</Paragraph>
  </Section>
</Paper>