<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0803"> <Title>Chinese-Japanese Cross Language Information Retrieval: A Han Character Based Approach</Title> <Section position="2" start_page="19" end_page="20" type="relat"> <SectionTitle> 1 Related Research and Motivation </SectionTitle> <Paragraph position="0"> Several approaches have been investigated in CJK text indexing for monolingual information retrieval (MLIR): (1) indexing single ideographic characters, (2) indexing n-grams¹ of ideographic characters, and (3) indexing words or phrases after segmentation and morphological analysis. MLIR of CJK languages is further complicated by the fact that CJK texts contain no word delimiters (e.g., a blank space after each word in English) to separate words. Words must first be extracted from the undelimited sequence of characters (this process is known as segmentation). For an inflectional ideographic language like Japanese, morphological analysis must also be performed.</Paragraph> <Paragraph position="1"> Sentences are segmented into words with the help of a dictionary and some machine learning techniques. Morphological analysis also requires intensive linguistic knowledge and computer processing. Segmentation and morphological analysis are tedious tasks, and the accuracy of automatic segmentation and morphological analysis varies drastically across domains. Word based indexing of CJK texts is therefore computationally expensive. Issues related to segmentation and morphological analysis of both Chinese and Japanese are addressed extensively elsewhere (Sproat et al., 1996; Matsumoto et al., 1997; and many others).</Paragraph> <Paragraph position="2"> N-gram (n > 1) character based indexing is computationally expensive as well. The number of indexing terms (n-grams) increases drastically as n increases. 
Moreover, not all n-grams are semantically meaningful words; smoothing and filtering heuristics must therefore be employed to extract linguistically meaningful n-grams for effective retrieval of information. See Nie et al. (1996, 1998, 1999), Chen et al. (1997), Fujii et al. (1993), and Kim et al. (1999) for details. In contrast, indexing single characters is straightforward and less demanding in terms of both space and time. In single character indexing, there is no need to (1) maintain a multilingual dictionary or thesaurus of words, (2) extract words and morphemes, or (3) employ machine learning and smoothing to prune the less important n-grams or to resolve ambiguity in word segmentation (Kwok, 1997; Ogawa et al., 1997; Lee et al., 1999; etc.). [Footnote 1: In this paper, we use the term n-gram to refer to the n > 1 cases. When n = 1, we use the term single character indexing.]</Paragraph> <Paragraph position="3"> Moreover, a CLIR system based on Han character semantics incurs no translation overhead for either queries or documents. In a single character based CLIR approach for CJK languages, some of the CLIR related problems discussed in Grefenstette (1998) can also be circumvented.</Paragraph> <Paragraph position="4"> Comparisons of experimental results for single character indexing, n-gram character indexing, and (segmented) word indexing in monolingual Chinese information retrieval are reported in Nie et al. (1996, 1998, 1999) and Kwok (1997). For the MLIR task, the n-gram based and word based approaches obtained better retrieval than the single character based indexing approach, at the cost of extra time and space complexity.</Paragraph> <Paragraph position="5"> Similar comparisons and conclusions for Japanese and Korean MLIR are made in Fujii et al. (1993) and Lee et al. 
(1999), respectively.</Paragraph> <Paragraph position="6"> Cross language information retrieval (CLIR; Oard and Dorr, 1996) refers to retrieval in which the query and the document collection are in different languages. Unlike MLIR, CLIR devotes a great deal of effort to maintaining multilingual dictionaries and thesauri, translating queries and documents, and so on. There are other approaches to CLIR in which techniques like latent semantic indexing (LSI) are used to automatically establish associations between queries and documents independent of language differences (Rehder et al., 1998).</Paragraph> <Paragraph position="7"> Due to the special nature (ideographic, undelimited, etc.) of the CJK languages, cross language information retrieval for these languages is extremely complicated. This is probably why only a few reports are available so far on Cross Asian Language Information Retrieval (CALIR).</Paragraph> <Paragraph position="8"> Tan and Nagao (1995) used correlated Han characters to align Japanese-Chinese bilingual texts. According to them, common Han characters (in Japanese and Chinese language texts) are sometimes so prevalent that even a monolingual reader could perform a partial alignment of the bilingual texts.</Paragraph> <Paragraph position="9"> One of the authors of this paper is not a native speaker of Chinese or Japanese but now has intermediate level proficiency in both languages. Even before learning Japanese, however, the author could roughly comprehend the theme of articles written in Japanese from the familiar Han characters that appeared in them (through their visual similarity and, therefore, their semantic relation). This is because, unlike Latin alphabets, Han characters carry significant semantic information. 
Since document retrieval is inherently a task of semantic distinction between queries and documents, a Han character based CLIR approach can be justified. It is worth mentioning here that the pronunciation of Han characters varies significantly across the CJK languages, but their visual appearance in written texts (across the CJK languages) retains a certain level of similarity.</Paragraph> <Paragraph position="10"> As discussed above, we can make use of the non-trivial semantic information encoded within the ideographic characters to find associations between queries and documents across the languages and so perform cross language information retrieval. By doing so, we can avoid the complicated segmentation and morphological analysis processes. At the same time, multilingual dictionary and thesaurus lookup and query-document translation can also be circumvented.</Paragraph> <Paragraph position="11"> In our research, we index single Han characters (common and/or semantically related) that appear in both Japanese and Chinese texts to model a new, simple CLIR approach for Japanese and Chinese. CJK languages use a significant number of common (or similar) Han characters in writing. Although some ambiguities² exist in the usage of Han characters across the languages, there are obvious contextual and semantic associations in the usage of Han characters in written texts across the CJK languages (Tan and Nagao, 1995). [Footnote 2: Ambiguities also exist at the word or phrase level.]</Paragraph> <Paragraph position="12"> 2 Encoding scenarios of CJK languages Character encoding schemes of CJK languages have several variations (e.g., Chinese: GB and BIG-5, etc.; Japanese: JIS, EUC, etc.)³. 
The number of Han characters encoded under a particular encoding scheme also varies.</Paragraph> <Paragraph position="13"> However, due to the continuing acceptance and popularity of Unicode (Unicode-2000) in the computer industry, we have a way to investigate these languages comprehensively.</Paragraph> <Paragraph position="14"> The Common CJK Ideograph section of the Unicode encoding scheme includes all characters encoded in each individual language and encoding scheme. Unicode version 3.0 assigned codes to 27,484 Han characters, a superset of characters encoded in other popular [...] However, Unicode encoding is not a linguistically based encoding scheme; it is rather an initiative to cope with the variants of different local standards. A critical analysis of Unicode and a proposal of Multicode can be found in Mudawwar (1997). The Unicode standard avoids duplicate encoding of the same character; for example, the character 'a' is encoded only once although it is used in several western languages. For ideographic characters, however, such efforts failed to a certain extent due to the variation in typefaces used in different situations and cultures. The characters in Figure 1, although they represent the same word (sword in English), are each given a unique code under the Unicode encoding scheme to satisfy the round-</Paragraph> </Section> </Paper>