Chinese-Japanese Cross Language Information Retrieval: A Han 
Character Based Approach 
Md Maruf HASAN 
Computational Linguistic Laboratory 
Nara Institute of Science and Technology 
8916-5, Takayama, Ikoma, 
Nara, 630-0101 Japan 
maruf-h@is.aist-nara.ac.jp
Yuji MATSUMOTO 
Computational Linguistic Laboratory
Nara Institute of Science and Technology
8916-5, Takayama, Ikoma,
Nara, 630-0101 Japan
matsu@is.aist-nara.ac.jp
Abstract
In this paper, we investigate cross language
information retrieval (CLIR) for Chinese
and Japanese texts utilizing the Han
characters, the common ideographs used in
writing the Chinese, Japanese and Korean (CJK)
languages. The Unicode encoding scheme,
which encodes a superset of the Han
characters, is used as a common encoding
platform to deal with the multilingual
collection in a uniform manner. We discuss
the importance of Han character semantics
in document indexing and retrieval for the
ideographic languages. We also analyse the
baseline results of cross language
information retrieval using the common Han
characters that appear in both Chinese and
Japanese texts.
Keywords: Cross Language Information 
Retrieval, Multilingual Information 
Processing, Chinese, Japanese and Korean 
(CJK) Languages 
Introduction 
After the opening of the Cross Language
Information Retrieval (CLIR) track at the
TREC-6 conference (TREC-6, 1998), several
reports have been published on cross language
information retrieval in European languages and,
sometimes, in European languages along with one
of the Asian languages (e.g., Chinese, Japanese
or Korean). However, no report is found in cross
language IR that focuses on the Asian languages
exclusively. In 1999, Pergamon published a
special issue of the journal Information
Processing and Management focusing on
Information Retrieval with Asian Languages
(Pergamon-1999). Among the eight papers
included in that special issue, only one
addressed CLIR (Kim et al., 1999). Kim et al.
reported on multiple Asian language information
retrieval (English, Japanese and Korean CLIR)
using multilingual dictionaries and machine
translation techniques (to translate both queries
and documents).
In TREC, intensive research efforts are made for
the European languages, for example, English,
German, French and Spanish. Historically, these
languages share many similar linguistic
properties. However, no exclusive attention has
been given to Asian languages such as Chinese,
Japanese and Korean (CJK), which also share
significantly similar linguistic properties.
An enormous amount of CJK information is
currently on the Internet. The combined growth
rate of CJK electronic information is also
predicted to increase. Cross
language IR focusing on these Asian languages
is therefore inevitable.
In this paper, we investigate the potential of
indexing the semantically correlated Han
characters that appear in both Chinese and Japanese
documents and queries to facilitate cross
language information retrieval. Using Han
character oriented document and query vectors
within the framework of vector space
information retrieval, we then evaluate the
effectiveness of cross language IR with
respect to its monolingual counterpart. We
conclude with a discussion of further
research possibilities and the potential of Han
character oriented cross language information
retrieval for the CJK languages.
1 Related Research and Motivation 
Several approaches have been investigated in CJK text
indexing to address monolingual information
retrieval (MLIR), for example, (1) indexing
single ideographic characters, (2) indexing
n-gram [1] ideographic characters and (3) indexing
words or phrases after segmentation and
morphological analysis. MLIR of CJK languages is
further complicated by the fact that CJK texts
do not contain word delimiters (e.g., a blank
space after each word in English) to separate
words. From the undelimited sequence of
characters, words must be extracted first (this
process is known as segmentation). For an
inflectional ideographic language like Japanese,
morphological analysis must also be performed.
Sentences are segmented into words with the
help of a dictionary and some machine
learning techniques. Morphological analysis also
needs intensive linguistic knowledge and
computer processing. Segmentation and
morphological analysis are tedious tasks, and the
accuracy of automatic segmentation and
morphological analysis varies drastically across
domains. Word based indexing of
CJK texts is therefore computationally
expensive. Segmentation and morphological
analysis issues for both Chinese and
Japanese are addressed intensively elsewhere
(Sproat et al., 1996; Matsumoto et al., 1997 and
many others).
The n-gram (n > 1) character based indexing is
computationally expensive as well. The number
of indexing terms (n-grams) increases drastically
as n increases. Moreover, not all n-grams are
semantically meaningful words; therefore,
smoothing and filtering heuristics must be
employed to extract linguistically meaningful
n-grams for effective retrieval of information. See
Nie et al. (1996, 1998, 1999), Chen et al. (1997),
Fujii et al. (1993) and Kim et al. (1999) for details.
In contrast, indexing single characters is
straightforward and less demanding in terms of
both space and time. In single character
indexing, there is no need (1) to maintain a
multilingual dictionary or thesaurus of words,
(2) to extract words and morphemes, or (3) to
employ machine learning and smoothing to
prune the less important n-grams or to resolve
ambiguity in word segmentation (Kwok, 1997;
Ogawa et al., 1997; Lee et al., 1999; etc.).
Moreover, a CLIR system based on Han
character semantics incurs no translation
overhead for either queries or documents. In a
single character based CLIR approach for CJK
languages, some of the CLIR related problems
discussed in Grefenstette (1998) can also be
circumvented.
[1] In this paper, we use the term n-gram to refer to the
n > 1 cases. When n = 1, we use the term single character indexing.
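The difference between single character indexing and n-gram indexing can be sketched in a few lines of Python. This is an illustration only, not the indexing code used in the experiments reported here: the helper names `is_han` and `han_ngrams` are hypothetical, and the sketch assumes that Han characters are those in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
def is_han(ch):
    """True if ch lies in the basic CJK Unified Ideographs block (assumption)."""
    return "\u4e00" <= ch <= "\u9fff"


def han_ngrams(text, n=1):
    """Return overlapping n-grams of Han characters found in text.

    Any non-Han character (kana, punctuation, whitespace) ends the current
    run, so an n-gram never spans two separate runs of Han characters.
    """
    terms, run = [], []
    for ch in text + " ":  # the trailing space flushes the last run
        if is_han(ch):
            run.append(ch)
            continue
        terms.extend("".join(run[i:i + n]) for i in range(len(run) - n + 1))
        run = []
    return terms


sentence = "郵便局で切手を買う"
print(han_ngrams(sentence, 1))  # single character index terms
print(han_ngrams(sentence, 2))  # bigram index terms
```

With n = 1 every Han character becomes an index term, whereas with n = 2 only adjacent pairs inside a run are produced, which is why the vocabulary grows quickly as n increases.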
A comparison of experimental results in
monolingual IR using single character indexing,
n-gram character indexing and (segmented)
word indexing for Chinese information retrieval
is reported in Nie et al. (1996, 1998, 1999) and
Kwok (1997). For the MLIR task, in
comparison to the single character based
indexing approach, the n-gram based and word
based approaches obtained better retrieval at the
cost of extra time and space complexity.
Similar comparisons and conclusions for Japanese
and Korean MLIR are made in Fujii et al. (1993)
and Lee et al. (1999), respectively.
Cross language information retrieval (CLIR;
Oard and Dorr, 1996) refers to retrieval
when the query and the document collection are
in different languages. Unlike MLIR, in cross
language information retrieval, a great deal of
effort is allocated to maintaining
multilingual dictionaries and thesauri,
translating queries and documents, and so
on. There are other approaches to CLIR where
techniques like latent semantic indexing (LSI)
are used to automatically establish associations
between queries and documents independent of
language differences (Rehder et al., 1998).
Due to the special nature (ideographic,
undelimited, etc.) of the CJK languages, cross
language information retrieval for these
languages is extremely complicated. Probably,
this is the reason why only a few reports are
available so far in Cross Asian Language
Information Retrieval (CALIR).
Tan and Nagao (1995) used correlated Han
characters to align Japanese-Chinese bilingual
texts. According to them, the occurrence of
common Han characters (in Japanese and
Chinese language texts) is sometimes so
prevalent that even a monolingual reader could
perform a partial alignment of the bilingual
texts.
One of the authors of this paper is not a native
speaker of Chinese or Japanese but now has
intermediate level proficiency in both languages.
However, before learning Japanese, based
on the familiar Han characters (their visual
similarity and, therefore, their semantic relation)
that appeared in Japanese texts, the author could
roughly comprehend the theme of articles
written in Japanese. This is due to the fact that,
unlike the letters of the Latin alphabet, Han
characters capture significant semantic information. Since
document retrieval is inherently a task of
semantic distinction between queries and
documents, a Han character based CLIR approach
can be justified. It is worth mentioning
here that the pronunciation of the Han characters
varies significantly across the CJK languages,
but the visual appearance of the Han characters
in written texts (across the CJK languages) retains
a certain level of similarity.
As discussed above, we can make use of the
non-trivial semantic information encoded within
the ideographic characters to find associations
between queries and documents across the
languages and perform cross language
information retrieval. By doing so, we can avoid
complicated segmentation and morphological
analysis processes. At the same time, multilingual
dictionary and thesaurus lookup and the
translation of queries and documents can also be
circumvented.
In our research, we index single Han characters
(common and/or semantically related) that appear
in both Japanese and Chinese texts to model a
new, simple CLIR approach for Japanese and Chinese
cross language information retrieval. The CJK
languages use a significant number of common
(or similar) Han characters in writing. Although
some ambiguities [2] exist in the usage of Han
characters across the languages, there are
obvious contextual and semantic associations in
the usage of Han characters in written texts
across the CJK languages (Tan and Nagao,
1995).
[2] Ambiguities also exist at the word or phrase level.
2 Encoding scenarios of CJK languages
Character encoding schemes of the CJK languages
have several variations (e.g., Chinese: GB,
BIG-5, etc.; Japanese: JIS, EUC, etc.) [3]. The
number of Han characters encoded under a
particular encoding scheme also varies.
However, due to the continuing acceptance and
popularity of Unicode (Unicode-2000) in
the computer industry, we have a way to
investigate these languages comprehensively.
The Common CJK Ideograph section of the
Unicode encoding scheme includes all
characters encoded in each individual language
and encoding scheme. Unicode version 3.0
assigned codes to 27,484 Han characters, a
superset of the characters encoded in other popular
standards.
Figure 1: Different ideographs represent the same
concept, sword
However, Unicode is not a
linguistically based encoding scheme; it is rather
an initiative to cope with the variants of different
local standards. A critical analysis of Unicode
and a proposal of Multicode can be found in
Mudawwar (1997). The Unicode standard avoids
duplicate encoding of the same character; for
example, the character 'a' is encoded only once
although it is used in several western
languages. However, for ideographic characters,
such efforts failed to a certain extent due to the
variation of typefaces used in different
situations and cultures. The characters in Figure
1, although they represent the same word (sword
in English), are each given a unique code under
the Unicode encoding scheme to satisfy the
round-trip criterion [4], that is, to allow round-trip
conversion between the source standard (in this
case, JIS) and Unicode. The 27,484 Han
characters encoded in Unicode, therefore,
include semantic redundancy from both single
language and multiple language perspectives.
[3] A typical Internet search engine (like Yahoo)
sometimes asks users to specify not only the
language but also the encoding scheme (e.g.,
simplified (GB) or traditional Chinese (BIG-5)) for a
single language search.
In the unified CJK ideograph section, Unicode
maintains redundancy to accommodate
typographical or cultural compatibility because
the design goal of Unicode is mainly to attain
compatibility with the existing corporate and
national encoding standards. In a Han character
based CLIR approach, such redundancy and
multiplicity must be identified and resolved to
achieve semantic uniformity and association.
Such multiplicity resolution tasks, compared
with maintaining multilingual (word) dictionaries,
are less painstaking. In our Han character based
CLIR, we use a table lookup mapping approach
to resolve semantic ambiguities of the Han
characters and associate the semantically related
ideographs within and across the CJK languages,
as a preprocessing task.
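A table lookup of this kind can be sketched as a simple character-to-character mapping applied before indexing. The table below is hypothetical and purely illustrative: it folds several common written variants of the ideograph for "sword" onto one representative form (these are not necessarily the glyphs shown in Figure 1), and the names `VARIANT_TO_CANONICAL` and `fold_variants` are assumptions rather than part of the actual preprocessing code.

```python
# Hypothetical variant table: each key is folded onto one representative
# ideograph. A real table would be compiled by hand or from a variant list.
VARIANT_TO_CANONICAL = {
    "劍": "剣",  # form used in traditional Chinese text
    "剑": "剣",  # form used in simplified Chinese text
    "劒": "剣",  # further written variants of the same character
    "剱": "剣",
    "劔": "剣",
}


def fold_variants(text, table=VARIANT_TO_CANONICAL):
    """Replace every character listed in the table; keep all others as-is."""
    return "".join(table.get(ch, ch) for ch in text)


print(fold_variants("宝劍"))  # the variant is folded to the representative form
print(fold_variants("宝剑"))  # a different variant yields the same string
```

Because the mapping works on single characters, it needs no segmentation and can be applied identically to queries and documents in either language.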
3 Comparative analysis of Japanese and
Chinese languages for Han character based CLIR
Chinese text is written homogeneously using
only Han characters. There are no word
delimiters and, therefore, segmentation must be
performed to extract words from the string of
Han characters. Chinese is a non-inflectional
language and, therefore, morphological analysis is
not essential.
In contrast, Japanese text is usually written as a
mixture of Han characters, Hiragana and
Katakana. Katakana is usually used to write
non-Japanese words (except those borrowed from
Chinese). Hiragana is mostly used to represent
the inflectional part of a word and to substitute for
complicated (and less common) Han characters
in modern Japanese. Japanese texts are also
written without word delimiters and, therefore,
must be segmented. Prior to any word based
indexing, due to the inflectional nature of
Japanese, the text must be morphologically analyzed
and the root words should be indexed
(equivalent to stemming in western
languages) to cope with the inflectional
variations.
[4] A detailed description of the Unicode ideographic
character unification rules can be found in Unicode-2000,
pp. 258-271.
Due to historical evolution and cultural
differences, the Han character itself becomes
ambiguous across the CJK languages. We
discuss the semantic irregularities of Han
characters in Japanese and Chinese below with
examples.
Han Characters: In Japanese, the ideographic
character string 切手 means postal stamp. The
constituent characters, if used independently in
other contexts, represent "to cut" and "hand",
respectively. However, in Chinese, 郵票
represents postal stamp and the constituent
characters represent "postal" and "ticket",
respectively. Interestingly, both in Japanese and
in Chinese, a common character string
represents post office. Moreover, the majority of the
postal service related words, in both Chinese and
Japanese, contain the Han character 郵 as a
component. Although there are some
idiosyncrasies, there are significant regularities
in the usage of Han characters across the CJK
languages. Like word sense disambiguation
(WSD), Kanji Sense Disambiguation (KSD)
within and across the CJK languages is an
interesting area of research by itself. Lua (1995)
reported an interesting neural network based
experiment to predict the meaning of Han
character based words using their constituent
characters' semantics.
For effective CLIR, we need to analyze the
irregular Han characters and work out a relevant
mapping algorithm to augment the query and
document vectors. A simple approach (with
binary weights) is illustrated in Table 1. The
partial co-occurrence of characters such as those in
the postal stamp example above in a particular document or
query requires adjustment of the document or
query vector. We are aware that such manual
modification is not feasible for a large
heterogeneous document collection.
Dimensionality reduction techniques, like LSI
(Evans et al., 1998; Rehder et al., 1998), or Han
character clustering are potential solutions to
automatically discover associations among Han
characters.
Table 1: Enhancement of query or document vectors
to create semantic association (an example)
Document or Query Vector Representation (partial)
  Han characters appearing in a Japanese or a Chinese document or query:
    [..切.. 手.. 郵.. 票..]'
  Possible binary vectors representing a query or a document (before enhancement):
    [..1.. 1.. *.. *..]'
    [..*.. *.. 1.. 1..]'
    etc.
  Mapped binary vector representing a query or a document (after enhancement):
    [..1.. 1.. 1.. 1..]'
Asterisk (*) represents 0 or 1.
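The enhancement illustrated in Table 1 can be sketched as a small rule over binary vectors. The sketch below is an assumption-laden illustration, not the mapping algorithm itself: it takes the Japanese and Chinese words for postal stamp, 切手 and 郵票, as the two trigger pairs, represents a vector as a Python dictionary from character to 0 or 1, and uses the hypothetical names `TRIGGER_PAIRS`, `ASSOCIATED` and `enhance`.

```python
# Hypothetical association rule for the "postal stamp" example:
# if either pair is fully present, all four positions are switched on.
TRIGGER_PAIRS = [("切", "手"), ("郵", "票")]
ASSOCIATED = ["切", "手", "郵", "票"]


def enhance(vector):
    """Return an enhanced copy of a binary vector (dict: character -> 0 or 1)."""
    enhanced = dict(vector)
    pair_present = any(
        all(vector.get(ch, 0) == 1 for ch in pair) for pair in TRIGGER_PAIRS
    )
    if pair_present:
        for ch in ASSOCIATED:
            enhanced[ch] = 1
    return enhanced


japanese_doc = {"切": 1, "手": 1, "郵": 0, "票": 0}
print(enhance(japanese_doc))  # all four positions are set to 1
```

A vector in which only one character of each pair occurs is left unchanged, which reflects the fact that 切 or 手 on its own does not signal the postal sense.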
Katakana Strings: In Japanese, especially in the
technological domain, Katakana is
predominantly used to transliterate foreign
words. For example, in modern Japanese, the
words ツール and テクノロジー (tool
and technology, respectively) are very common.
Their Han character equivalents are 工具 and
技術, etc., and they are similar to those used in
Chinese. A Katakana to Kanji (Han character)
mapping table is created to transfer the
semantics of Katakana into the form of Han
characters (the corresponding positions of the document or
query vector need to be adjusted) to help our
Chinese-Japanese CLIR task. For this purpose,
the definition part of a Japanese monolingual
dictionary is used to find the relevant Han
characters for a particular Katakana string.
Manual correction is then conducted to retain the
meaningful Han character(s).
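How such a Katakana to Kanji table might be applied can be sketched as follows. The two entries, for the loanwords meaning "tool" and "technology", are illustrative assumptions, as are the names `KATAKANA_TO_HAN` and `map_katakana`; the sketch simply replaces each Katakana run that has a table entry and leaves every other run untouched.

```python
import re

# Hypothetical mapping table (Katakana loanword -> Han characters).
KATAKANA_TO_HAN = {
    "ツール": "工具",        # "tool"
    "テクノロジー": "技術",  # "technology"
}

# A run of Katakana letters, including the prolonged sound mark (U+30FC).
KATAKANA_RUN = re.compile(r"[\u30A1-\u30FA\u30FC]+")


def map_katakana(text, table=KATAKANA_TO_HAN):
    """Replace each Katakana run found in the table; keep unknown runs."""
    return KATAKANA_RUN.sub(lambda m: table.get(m.group(0), m.group(0)), text)


print(map_katakana("新しいツールとテクノロジー"))
```

Runs without an entry, such as transliterated proper names, pass through unchanged and would have to be handled by a separate table.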
Proper Names: In Japanese, foreign proper
names are consistently written in Katakana.
However, in Chinese, they are written in Han
characters. For a usable CLIR system for
Chinese and Japanese, a mapping table is
therefore indispensable. In our experiment, due to
the nature of the text collection, we manually
edited the small number of proper names to
establish associations. We are aware that such a
manual approach is not feasible for a large scale
CLIR task. However, since proper name
detection and manipulation is itself a major
research issue in natural language processing,
we will not address it here.
Hiragana Strings: Continuous long strings of
Hiragana need to be located and replaced [5] with
the respective Han characters, and the document
and query vectors must be adjusted
accordingly. Shorter Hiragana strings can be
ignored as stop words since such Hiragana strings
are mostly function words or inflectional
attributes.
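Locating long Hiragana strings and discarding short ones can be sketched with a regular expression over the Hiragana block. The length threshold `min_len=4` is an arbitrary assumption made for illustration (no threshold is specified here), and `split_hiragana_runs` is a hypothetical helper; replacing the long runs with Han characters would additionally require a dictionary lookup that is not shown.

```python
import re

# A run of Hiragana letters (U+3041 to U+3096).
HIRAGANA_RUN = re.compile(r"[\u3041-\u3096]+")


def split_hiragana_runs(text, min_len=4):
    """Return (long_runs, text_without_short_runs).

    Runs of at least min_len characters are candidates for replacement by
    Han characters; shorter runs are dropped as stop words.
    """
    long_runs = [
        m.group(0) for m in HIRAGANA_RUN.finditer(text) if len(m.group(0)) >= min_len
    ]
    stripped = HIRAGANA_RUN.sub(
        lambda m: m.group(0) if len(m.group(0)) >= min_len else "", text
    )
    return long_runs, stripped


runs, remaining = split_hiragana_runs("うさぎが穴に落ちる")
print(runs)       # long Hiragana runs that need a Han character lookup
print(remaining)  # text with the short Hiragana runs removed
```

In this example the long run still carries a trailing particle, which shows why manual checking of the located strings remains necessary.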
4 Vector Space Model: Western and Asian
language perspective
The most popular IR model, the Vector Space
Model, uses vectors to represent documents and
queries. Each element of a document or query
vector represents the presence or absence of a
particular term (binary) or its weight (entropy,
frequency, etc.). Function words are
eliminated; stemming and other preprocessing
are also done prior to vectorization. As a
result, syntactic information is lost. The vector
simply consists of an ordered list of terms and,
therefore, the contextual cues also
disappear. The document and query
vectors are gross approximations of the original
document or query (Salton et al., 1983). In
vector space information retrieval, we sacrifice
syntactic, contextual and other information for
representational and computational simplicity.
For western languages, phrase indexing is
sometimes proposed to offset such losses and to
achieve better retrieval quality. In the vector space
model, a term usually refers to a word. For
western languages, a document or query vector
constructed from the letters of the alphabet
would not yield any effective retrieval.
However, representing CJK documents and
queries as vectors of Han characters
yields reasonably effective retrieval. This is due
to the fact that a Han character encodes
non-trivial semantic information within itself, which
is crucial for information retrieval. Han
character based document and query
representation is therefore justified. For CLIR,
considering the inherent complexity of query
and document translation, multilingual
dictionary and thesaurus maintenance, etc., Han
character based (either single character or n-gram
character) approaches under the vector space
framework, despite being a gross
approximation, provide significant semantic
cues for effective retrieval for the same
reason.
[5] In Japan, it is common for materials written for
young people to use Hiragana extensively to bypass
complex Han characters.
5 Experimental Setup 
We collected translated versions of
Lewis Carroll's "Alice's Adventures in
Wonderland" in Japanese and in Chinese. The
original Chinese version (in GB code) and the
original Japanese version (in S-JIS code) were
then converted into Unicode. Preprocessing was
also conducted to correlate the proper names, to
resolve the semantic multiplicity of coding and
to associate the language specific irregularities,
etc., as described in Sections 2 and 3.
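The conversion step can be illustrated with Python's built-in codecs. This is a sketch of the idea rather than the conversion tool actually used: it assumes the Chinese text is in GB2312 and the Japanese text in Shift-JIS, decodes both into Unicode strings, and uses the hypothetical helper name `to_unicode`. The sample byte strings are created in the code itself so that the example is self-contained.

```python
def to_unicode(raw, source_encoding):
    """Decode legacy-encoded bytes into a Unicode string."""
    return raw.decode(source_encoding)


# Sample data standing in for the GB and S-JIS source files.
chinese_bytes = "爱丽丝".encode("gb2312")
japanese_bytes = "アリス".encode("shift_jis")

chinese_text = to_unicode(chinese_bytes, "gb2312")
japanese_text = to_unicode(japanese_bytes, "shift_jis")

# Once decoded, both texts live in one code space and can be indexed together.
print([hex(ord(ch)) for ch in chinese_text + japanese_text])
```

After decoding, a character shared by both languages has the same code point regardless of which legacy encoding it came from.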
The mg system (a public domain indexing
system from the New Zealand Digital Library
project, Witten et al., 1999) was adapted to handle
Unicode and used to index the Unicode files. We
consider each paragraph of the book as a single
document. There are 835 paragraphs in the
original book, and the translated versions in both
Japanese and Chinese preserve the total
number of paragraphs. In this way, we have a
collection of 1670 paragraphs (hereafter, we
refer to each paragraph as a document of our
bilingual text collection) in Chinese and
Japanese. We used the mg system to index the
collection based on TF.IDF weighting. For a
particular query, the mg system is used to
retrieve documents in order of relevance.
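The ranking principle can be sketched independently of the mg system. The code below is not mg and makes no claim about its internals: it is a minimal vector space sketch that treats each Han character as a term, weights it by term frequency times a smoothed inverse document frequency (the `+ 1.0` smoothing is an assumption), and ranks documents by cosine similarity. All function names are hypothetical.

```python
import math
from collections import Counter


def is_han(ch):
    return "\u4e00" <= ch <= "\u9fff"


def term_freq(text):
    """Count the Han characters of a text; all other characters are ignored."""
    return Counter(ch for ch in text if is_han(ch))


def build_idf(docs):
    """Smoothed inverse document frequency per character (assumed variant)."""
    n = len(docs)
    df = Counter(ch for d in docs for ch in set(term_freq(d)))
    return {ch: math.log(n / df[ch]) + 1.0 for ch in df}


def weigh(text, idf):
    """TF.IDF weighted character vector; unseen characters get weight 0."""
    return {ch: f * idf.get(ch, 0.0) for ch, f in term_freq(text).items()}


def cosine(u, v):
    dot = sum(w * v.get(ch, 0.0) for ch, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rank(query, docs, top_k=10):
    """Return the indices of the top_k documents in order of relevance."""
    idf = build_idf(docs)
    q = weigh(query, idf)
    scored = [(cosine(q, weigh(d, idf)), i) for i, d in enumerate(docs)]
    return [i for _, i in sorted(scored, key=lambda t: (-t[0], t[1]))[:top_k]]


docs = ["切手を買う", "郵便局へ行く", "邮局卖邮票"]
print(rank("切手", docs))
```

Because the same function scores Chinese and Japanese paragraphs, a query in one language retrieves documents in the other whenever they share (mapped) Han characters.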
We asked 2 native Japanese speakers who have an
intermediate level understanding of the Chinese
language and who are frequent users of
Internet search engines to formulate 5 queries
each in natural Japanese. Similarly, we
asked 2 native Chinese speakers who have an
intermediate level understanding of Japanese
and who are frequent users of the Internet to
formulate 5 queries each in Chinese. Therefore,
4 bilingual human subjects formulated a total of
20 queries in their respective native tongues (10
queries in Chinese and 10 queries in Japanese).
The subjects were initially not told about the
cross language issues involved in the
experimental process; that is, the subjects
formulated the queries as they would
usually do for monolingual information
retrieval.
All 4 subjects are familiar with the story of
Alice's Adventures in Wonderland.
However, we asked them to take a quick look at
the electronic version of the book in their own
language to help them formulate 5 different
queries in their own native language.
Table 2: Comparison of mono- and cross-language
information retrieval

                             Chinese documents  Japanese documents  CLIR to
                             judged relevant    judged relevant     MLIR
                             (out of 100        (out of 100         ratio
                             retrieved docs)    retrieved docs)
Queries in Chinese           35                 26                  74 %
(total 10 queries from 2
native Chinese subjects)
Queries in Japanese          19                 30                  63 %
(total 10 queries from 2
native Japanese subjects)

For each query, a total of 10 documents are retrieved in each language.
Documents are retrieved with the queries from
both the Japanese and the Chinese versions of
the book. The top 10 documents in Chinese and the top
10 documents in Japanese are
retrieved for each query. Each subject is then
presented with the 20 extracted documents for
each of his/her own original queries. Therefore,
for the total of 5 queries formulated by a subject, a
total of 100 documents (50 documents in his/her
mother tongue and 50 documents in the other
language) are given back to each subject for
evaluation. Subjects are asked to evaluate the
documents extracted in their native language
first and then, similarly, the documents extracted
in the other language.
As shown in Table 2, cross language information
retrieval in this experimental framework performed
about 63-74% as well as its monolingual counterpart.
Cross language information retrieval of
European languages, with the help of
multilingual thesaurus enhancement, reaches
about 75% of the performance of its monolingual
counterpart (Eichmann et al., 1998). The
effectiveness of Han character based CLIR for
CJK languages is therefore promising. It is
important to note here that in the business, political
and natural science domains, Han characters are
prevalently correlated across Japanese and
Chinese documents. Our approach should
perform even better if applied in those domains.
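The CLIR to MLIR ratios follow directly from the counts in Table 2, as the short calculation below shows. The counts are taken from the table; the function name `clir_to_mlir_ratio` and the dictionary layout are merely illustrative.

```python
def clir_to_mlir_ratio(cross_relevant, mono_relevant):
    """Relevant documents found across languages relative to the monolingual run."""
    return cross_relevant / mono_relevant


# Relevant documents out of 100 retrieved (10 queries x top 10), from Table 2.
results = {
    "Chinese queries": {"mono": 35, "cross": 26},
    "Japanese queries": {"mono": 30, "cross": 19},
}

for name, r in results.items():
    ratio = clir_to_mlir_ratio(r["cross"], r["mono"])
    print(f"{name}: precision mono={r['mono'] / 100:.2f}, "
          f"cross={r['cross'] / 100:.2f}, ratio={ratio:.0%}")
```

For Chinese queries the ratio is 26/35, about 74%, and for Japanese queries it is 19/30, about 63%, which gives the 63-74% range quoted above.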
6 Further Research 
In our experiment, we represent Chinese and
Japanese documents and queries as weighted
vectors of Han characters. Before
vectorisation, the necessary preprocessing is done to
cope with the multiplicity of coding of
semantically similar ideographs and to cope with
some obvious language specific issues. As in
the monolingual vector space information
retrieval approach, we measured the cosine
similarity between a query and a document to
retrieve relevant documents in order of
relevance. Similarity is measured for both cases,
that is, (1) monolingual: the query and the
document are in the same language, and (2)
cross language: the query and the document are
in different languages. The comparative result
shows that the effectiveness of cross language
information retrieval between Chinese and
Japanese in this way is comparable to that of
other CLIR experiments conducted mainly with
multiple western languages with the help of
thesauri and machine translation techniques.
One of the promising applications of this
approach is identifying and aligning
Chinese and Japanese documents online, for
example, retrieving relevant news articles
published in both languages from the Internet.
Several mathematical
techniques, like Han character clustering and
dimensionality reduction (Evans et
al., 1998), can augment and automate the process
of finding associations among the Han
characters within and across the CJK languages.
The vector space model is also flexible with respect to
the adjustment of the weighting scheme. Therefore, we
can flexibly augment the Han character based
query vectors (a pseudo query expansion
technique) and document vectors (a pseudo
relevance feedback technique) for effective
CLIR. We leave these parts as our immediate
future work.
As with MLIR, n-gram character
based indexing can also be tried.
However, due to the small document collection
and the small number of queries we had, n-gram based
indexing suffers from the data sparseness problem.
We therefore leave the n-gram character
based CLIR evaluation until a large collection of
documents and queries is ready.
Conclusion 
In this paper, we experimented on a small
collection of homogeneous bilingual texts and a
small set of queries. The results obtained support
the promise of using Han characters for
cross language information retrieval of the CJK
languages. Such an approach has its own
advantage since no translation of queries or
documents is needed. In comparison to
maintaining multilingual dictionaries or thesauri,
maintaining a Han character mapping table is
more effective because the mapping table need
not be updated so often. Sophisticated
mathematical analysis of Han characters can
bring a new dimension to retrieving cross Asian
language information. Kanji Sense
Disambiguation (KSD) techniques using
advanced machine learning techniques can make
the proposed CLIR method more effective. KSD
is a long neglected area of research.
Dimensionality reduction techniques, clustering,
independent component analysis (ICA) and
other mathematical methods can be exploited to
enhance Han character based processing of the CJK
languages.

References 
Chen, A., Jianzhang He, Liangjie Xu, Fredric C. Gey
and Jason Meggs (1997). Chinese Text Retrieval
Without Using a Dictionary. In Proceeding of the
Conference on Research and Development in
Information Retrieval, ACM SIGIR-97, pp. 42-49.
Eichmann, D., M.E. Ruiz and P. Srinivasan (1998).
Cross-Language Information Retrieval with the
UMLS Metathesaurus. In Proceeding of the
Conference on Research and Development in
Information Retrieval, ACM SIGIR-98, pp. 72-80.
Evans, D.A., S.K. Handerson, I.A. Monarch, J.
Pereiro, L. Delon, W.R. Hersh (1998). Mapping
Vocabularies Using Latent Semantics. In Gregory
Grefenstette Edited, Cross-Language Information
Retrieval, Kluwer Academic Publisher.
Grefenstette, G. (1998). The Problem of Cross-
Language Information Retrieval. In Gregory
Grefenstette Edited, Cross-Language Information
Retrieval, Kluwer Academic Publisher, pp. 1-10.
Fujii, H. and W.B. Croft (1993). A Comparison of
Indexing for Japanese Text Retrieval. In
Proceeding of the ACM SIGIR-93, pp. 237-246.
Kim T., Sim C.-M., Yuh S., Jung H., Kim Y.-K.,
Choi S.-K., Park D.-I., Choi K.S. (1999). FromTo-CLIR:
web-based natural language interface for
cross-language information retrieval. Journal of
Information Processing and Management,
Pergamon, Vol. 35. No.4. pp. 559-586.
Kwok, K.L. (1997). Comparing Representations in
Chinese Information Retrieval. In Proceeding of
the ACM SIGIR-97, pp. 34-41.
Lee, J.H., Hyun Yang Cho, Hyouk Ro Park (1999).
n-Gram-based Indexing for Korean Text Retrieval.
Journal of Information Processing and
Management, Pergamon, Vol. 35. No.4. pp. 427-441.
Lua K.T. (1995). Prediction of Meaning of
Bisyllabic Chinese Words Using Back Propagation
Neural Network. In Communications of COLIPS,
An International Journal of Chinese and Oriental
Languages Information Processing Society, Vol.5,
Singapore. URL:
http://www.comp.nus.edu.sg/~colips/commcolips/paper/p95.html
Matsumoto, Y., H. Kitauchi, T. Yamashita (1997).
User's Manual of Japanese Morphological
Analyzer, ChaSen version 1.0 (in Japanese).
Technical Report IS-TR970DT, Nara Institute of
Science and Technology (NAIST), Japan.
Mudawwar, M.F. (1997). Multicode: A Truly 
Multilingual Approach to Text Encoding. IEEE Computer, 
Vol. 30. No. 4, pp. 37-43. 
Nie, J.Y., Martin Brisebois and Xiaobo Ren (1996).
On Chinese Text Retrieval. In Proceeding of the
ACM SIGIR-96, pp. 225-233.
Nie, J.Y., Jean-Pierre Chevallet and Marie-France
Bruandet (1998). Between Terms and Words for
European Language IR and Between Words and
Bigrams for Chinese IR. In Proceeding of Text
REtrieval Conference (TREC-6), pp. 697-710.
Nie, J.Y. and Fuji Ren (1999). Chinese Information
Retrieval: using characters or words? Journal of
Information Processing and Management,
Pergamon, Vol. 35. No.4. pp. 443-462.
Oard, D.W. and Bonnie J. Dorr (1996). A Survey of
Multilingual Text Retrieval. University of
Maryland, Technical Report, UMIACS-TR-96-19,
CS-TR-3615.
Ogawa, Y. and Toru Matsuda (1997). Overlapping
Statistical Word Indexing: A New Indexing
Method for Japanese Text. In Proceeding of the
ACM SIGIR-97, pp. 226-234.
Pergamon-1999 (1999). Special issue on Information
Retrieval with Asian Languages, Journal of
Information Processing and Management, Vol. 35.
No.4. Pergamon Press, London.
Rehder, B., M.L. Littman, Susan Dumais and T.K.
Landauer (1998). Automatic 3-Language Cross-
Language Information Retrieval with Latent
Semantic Indexing. In Proceeding of Text
REtrieval Conference (TREC-6), pp. 233-240.
Salton, G. and M.J. McGill (1983). Introduction to
Modern Information Retrieval, McGraw-Hill, New
York, 1983.
Sproat, R., Chilin Shih, William Gale and Nancy
Chang (1996). A Stochastic Finite-State Word-Segmentation
Algorithm for Chinese, Computational Linguistics,
Vol. 22 No. 2, pp. 377-404.
Tan C.L. and Makoto Nagao (1995). Automatic
Alignment of Japanese-Chinese Bilingual Texts. In
IEICE Transactions on Information and Systems,
Japan. Vol. E78-D. No. 1. pp. 68-76.
TREC-6 (1998). Proceeding of Text REtrieval
Conference (TREC-6). National Institute of Standards
and Technology (NIST). URL:
http://trec.nist.gov/pubs/trec6/
Unicode-2000 (2000). The Unicode Standard,
Version 3.0, Addison Wesley, Reading, MA, URL:
http://www.unicode.org/
Witten I.H., Alistair Moffat and T.C. Bell (1999). 
Managing Gigabytes: Compressing and Indexing 
Documents and Images, Second Edition, Morgan 
Kaufmann Publishers. 
