<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0803"> <Title>Chinese-Japanese Cross Language Information Retrieval: A Han Character Based Approach</Title> <Section position="3" start_page="20" end_page="21" type="metho"> <SectionTitle> 3 A typical Internet search engine (like Yahoo) </SectionTitle> <Paragraph position="0"> sometimes asks users to specify not only the language but also the encoding scheme (e.g., simplified (GB) or traditional Chinese (BIG-5)) for a single language search.</Paragraph> <Paragraph position="1"> trip criteria 4, that is, to allow round-trip conversion between the source standard (in this case, JIS) and Unicode. The 27,484 Han characters encoded in Unicode, therefore, include semantic redundancy from both single-language and multiple-language perspectives. In the unified CJK ideograph section, Unicode maintains redundancy to accommodate typographical or cultural compatibility because the design goal of Unicode is mainly to attain compatibility with the existing corporate and national encoding standards. In a Han character based CLIR approach, such redundancy and multiplicity must be identified and resolved to achieve semantic uniformity and association.</Paragraph> <Paragraph position="2"> Such multiplicity resolution tasks, compared with maintaining multilingual (word) dictionaries, are less painstaking. In our Han character based CLIR, we use a table lookup mapping approach to resolve semantic ambiguities of the Han characters and associate the semantically related ideographs within and across CJK languages, as a preprocessing task.</Paragraph> </Section> <Section position="4" start_page="21" end_page="21" type="metho"> <SectionTitle> 3 Comparative analysis of Japanese and </SectionTitle> <Paragraph position="0"> Chinese language for Han character based CLIR Chinese text is written homogeneously using only Han characters.
There are no word delimiters and therefore, segmentation must be performed to extract words from the string of Han characters. Chinese is a non-inflectional language and therefore morphological analysis is not essential.</Paragraph> <Paragraph position="1"> In contrast, Japanese text is usually written as a mixture of Han characters, Hiragana and Katakana. Katakana is usually used to write non-Japanese words (except those borrowed from Chinese). Hiragana is mostly used to represent the inflectional part of a word and to substitute for complicated (and less common) Han characters in modern Japanese. Japanese texts are also written without word delimiters and therefore must be segmented. Prior to any word-based indexing, due to the inflectional nature of Japanese, text must be morphologically analyzed and the root words should be indexed</Paragraph> <Paragraph position="3"/> </Section> <Section position="5" start_page="21" end_page="22" type="metho"> <SectionTitle> 4 A detailed description of the Unicode ideographic </SectionTitle> <Paragraph position="0"> character unification rules can be found in Unicode2000, pp. 258-271.</Paragraph> <Paragraph position="1"> (equivalent to stemming in western languages) to cope with the inflectional variations.</Paragraph> <Paragraph position="2"> Due to historical evolution and cultural differences, the Han characters themselves have become ambiguous across the CJK languages. We discuss the semantic irregularities of Han characters in Japanese and Chinese below with examples.</Paragraph> <Paragraph position="3"> Han Characters: In Japanese, the ideographic character string 切手 means postal stamp. The constituent characters, if used independently in other contexts, represent &quot;to cut&quot; and &quot;hand&quot;, respectively. However, in Chinese, gl~ represents postal stamp and the constituent characters represent &quot;postal&quot; and &quot;ticket&quot;, respectively.
Interestingly, both in Japanese and in Chinese, the character string, gl~, represents post office. However, the majority of the postal service related words, in both Chinese and Japanese, contain the Han character, i!5 as a component. Although there are some idiosyncrasies, there are significant regularities in the usage of Han characters across the CJK languages. Like word sense disambiguation (WSD), Kanji Sense Disambiguation (KSD) within and across the CJK languages is an interesting area of research by itself. Lua (1995) reported an interesting neural network based experiment to predict the meaning of Han character based words using their constituent characters' semantics.</Paragraph> <Paragraph position="4"> For effective CLIR, we need to analyze the irregular Han characters and work out a relevant mapping algorithm to augment the query and document vectors. A simplistic approach (with binary weight) is illustrated in Table 1. Partial co-occurrence of characters such as i~J, ~:- and mid, etc. in a particular document or query requires adjustment of the document or query vector. We are aware that such manual modification is not feasible for a large heterogeneous document collection.</Paragraph> <Paragraph position="5"> Dimensionality reduction techniques, like LSI (Evans et al., 1998; Rehder et al., 1998) or Han character clustering, are potential solutions to automatically discover associations among Han characters.</Paragraph> <Paragraph position="6"> [.. 1 .. 1 .. * .. * ..]', [.. * .. * .. 1 .. 1 ..]', etc.</Paragraph> <Paragraph position="7"> [.. 1 .. 1 .. 1 .. 1 ..]'</Paragraph> <Paragraph position="9"> technological domain, Katakana is predominantly used to transliterate foreign words. For example, in modern Japanese, the words ツール and テクノロジー, etc.
(tool and technology, respectively) are very common.</Paragraph> <Paragraph position="10"> Their Han character equivalents are lEA and ~, etc., and they are similar to those used in Chinese. A Katakana to Kanji (Han character) mapping table is created to transfer the semantics of Katakana in the form of Han characters (relative positions of the document or query vector need to be adjusted) to help our Chinese-Japanese CLIR task. For this purpose, the definition part of a Japanese monolingual dictionary is used to find the relevant Han characters for a particular Katakana string.</Paragraph> <Paragraph position="11"> Manual correction is then conducted to retain the meaningful Han character(s).</Paragraph> <Paragraph position="12"> Proper Names: In Japanese, foreign proper names are consistently written in Katakana.</Paragraph> <Paragraph position="13"> However, in Chinese, they are written in Han characters. For a usable CLIR system for Chinese and Japanese, a mapping table is therefore indispensable. In our experiment, due to the nature of the text collection, we manually edited the small number of proper names to establish association. We are aware that such a manual approach is not feasible for a large-scale CLIR task. However, since proper name detection and manipulation is itself a major research issue for natural language processing, we will not address it here.</Paragraph> <Paragraph position="14"> Hiragana Strings: Continuous long strings of Hiragana need to be located and replaced 5 with the respective Han characters, and the document and the query vectors must be adjusted accordingly.
Shorter Hiragana strings can be ignored as stop words since such strings are mostly function words or inflectional attributes.</Paragraph> </Section> <Section position="6" start_page="22" end_page="23" type="metho"> <SectionTitle> 4 Vector Space Model: Western and Asian </SectionTitle> <Paragraph position="0"> language perspective The most popular IR model, the Vector Space Model, uses vectors to represent documents and queries. Each element of a document or a query vector represents the presence or absence of a particular term (binary), or its weight (entropy, frequency, etc.). Function words are eliminated; stemming and other preprocessing are also done prior to the vectorization. As a result, syntactic information is lost. The vector simply consists of an ordered list of terms, and therefore the contextual cues also disappear. The document and the query vectors are gross approximations of the original document or query (Salton et al., 1983). In vector space information retrieval, we sacrifice syntactic, contextual and other information for representational and computational simplicity.</Paragraph> <Paragraph position="1"> For western languages, phrase indexing is sometimes proposed to offset such losses and to achieve better retrieval quality. In the vector space model, a term usually refers to a word. For western languages, a document or a query vector constructed from the letters of the alphabet would not yield any effective retrieval.</Paragraph> <Paragraph position="2"> However, representing CJK documents and queries as vectors of Han characters yields reasonably effective retrieval. This is due to the fact that a Han character encodes non-trivial semantic information within itself, which is crucial for information retrieval. Han character based document and query representation is therefore justified. For CLIR,
For CLIR, s In Japan, it is common that materials written for young people uses t-Iiragana extensively to bypass complex Han characters.</Paragraph> <Paragraph position="3"> considering the inherent co~,lexity in query and document translation, multilingual dictionary and thesaurus malnleaance, etc., Han character based (both single clcaracter or n-gram characters) approaches under the vector space framework, despite of being a gross approximation, provide significant semantic cues for effective retrieval ckle to the same reason.</Paragraph> </Section> <Section position="7" start_page="23" end_page="24" type="metho"> <SectionTitle> 5 Experimental Setup </SectionTitle> <Paragraph position="0"> We collected the translated 'versions of the Lewis Carroll's &quot;Alice's A,Iventure in the Wonderland&quot; in Japanese and in Chinese. The original Chinese version (in GB code) and the original Japanese version (in S-JIS code) are then converted into Unicode. Preprocessing is also conducted to correlate the proper names, to resolve the semantic multiplicky of coding and to associate the language spe~tific irregularities, etc. as described in Section 2 aad 3.</Paragraph> <Paragraph position="1"> The mg system (a public domain indexing system from the New Zealantl Digital Library project, Witten et al., 1999) is adapted to handle Unicode and used to index the Unicode files. We consider each paragraph of th0 book as a single document. There are 835 paragraphs in the original book and the translated versions in both Japanese and Chinese also preserve the total number of paragraphs. In this; way, we have a collection of 1670 paragraplhs (hereafter, we refer to each paragraph as a document of our bilingual text collection) in lmth Chinese and Japanese. We used the mg system to index the collection based on TF.IDF weighting. 
For a particular query, the mg system is used to retrieve documents in order of relevance.</Paragraph> <Paragraph position="2"> We asked 2 native Japanese who have an intermediate-level understanding of the Chinese language and who are frequent users of Internet search engines to formulate 5 queries each in natural Japanese. Similarly, we asked 2 native Chinese who have an intermediate-level understanding of Japanese and who are frequent users of the Internet to formulate 5 queries each in Chinese. Therefore, 4 bilingual human subjects formulated a total of 20 queries in their respective native tongues (10 queries in Chinese and 10 queries in Japanese).</Paragraph> <Paragraph position="3"> The subjects were initially not told about the cross language issues involved in the experimental process; that is, the subjects formulated the queries as they would usually do for monolingual information retrieval.</Paragraph> <Paragraph position="4"> All 4 subjects are familiar with the story of Alice's Adventure in the Wonderland.</Paragraph> <Paragraph position="5"> However, we asked them to take a quick look at the electronic version of the book in their own language to help them formulate 5 different queries in their native language.</Paragraph> <Paragraph position="6"> Documents are retrieved with the queries from both the Japanese and the Chinese versions of the book. The top 10 documents in Chinese and the top 10 documents in Japanese are retrieved for each query. Each subject is then presented with the 20 extracted documents for each of his/her own original queries. Therefore, for the 5 queries formulated by a subject, a total of 100 documents (50 documents in his/her mother tongue and 50 documents in the other language) are given back to each subject for evaluation.
Subjects are asked to evaluate the documents extracted in their native language first and then, similarly, the documents extracted in the other language.</Paragraph> <Paragraph position="7"> As shown in Table 2, cross language information retrieval in this experimental framework performed about 63-74% as well as its monolingual counterpart. Cross language information retrieval of European languages, with the help of multilingual thesaurus enhancement, reaches about 75% of the performance of its monolingual counterpart (Eichman et al., 1998). The effectiveness of Han character based CLIR for CJK languages is therefore promising. It is important to note here that in business, political and natural science domains, Han characters are prevalently correlated across Japanese and Chinese documents. Our approach should perform even better if applied in those domains.</Paragraph> </Section> </Paper>