File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/00/w00-0803_abstr.xml
Size: 3,127 bytes
Last Modified: 2025-10-06 13:41:47
<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0803"> <Title>Chinese-Japanese Cross Language Information Retrieval: A Han Character Based Approach</Title> <Section position="1" start_page="0" end_page="19" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> In this paper, we investigate cross language information retrieval (CLIR) for Chinese and Japanese texts utilizing the Han characters - common ideographs used in writing Chinese, Japanese and Korean (CJK) languages. The Unicode encoding scheme, which encodes the superset of Han characters, is used as a common encoding platform to deal with the multilingual collection in a uniform manner. We discuss the importance of Han character semantics in document indexing and retrieval of the ideographic languages. We also analyse the baseline results of cross language information retrieval using the common Han characters that appear in both Chinese and Japanese texts.</Paragraph> <Paragraph position="1"> reports have been published on cross language information retrieval in European languages, and sometimes on European languages along with one of the Asian languages (e.g., Chinese, Japanese or Korean). However, no report has been found on cross language IR that focuses on the Asian languages exclusively. In 1999, Pergamon published a special issue of the journal Information Processing and Management focusing on Information Retrieval with Asian Languages (Pergamon-1999). Among the eight papers included in that special issue, only one addressed CLIR (Kim et al., 1999). Kim et al. reported on multiple Asian language information retrieval (English, Japanese and Korean CLIR) using multilingual dictionaries and machine translation techniques (to translate both queries and documents).</Paragraph> <Paragraph position="2"> In TREC, intensive research efforts are made for the European languages, for example, English, German, French, Spanish, etc. 
Historically, these languages share many similar linguistic properties. However, no exclusive focus has been given to Asian languages, for example, Chinese, Japanese and Korean (CJK), which also share significantly similar linguistic properties. An enormous amount of CJK information is currently on the Internet. The combined volume of CJK electronic information is also predicted to grow at a faster rate. Cross language IR focusing on these Asian languages is therefore inevitable.</Paragraph> <Paragraph position="3"> In this paper, we investigate the potential of indexing the semantically correlated Han characters that appear in both Chinese and Japanese documents and queries to facilitate cross language information retrieval. Using Han character oriented document and query vectors, within the framework of vector space information retrieval, we then evaluate the effectiveness of cross language IR with respect to its monolingual counterparts. We conclude with a discussion of further research possibilities and the potential of Han character oriented cross language information retrieval for the CJK languages.</Paragraph> </Section> </Paper>