File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/concl/00/w00-0803_concl.xml

Size: 3,781 bytes

Last Modified: 2025-10-06 13:52:55

<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0803">
  <Title>Chinese-Japanese Cross Language Information Retrieval: A Han Character Based Approach</Title>
  <Section position="8" start_page="24" end_page="25" type="concl">
    <SectionTitle>
6 Further Research
</SectionTitle>
    <Paragraph position="0"> In our experiment, we represent Chinese and Japanese documents and queries as weighted vectors of Han characters. Before vectorisation, the necessary preprocessing is done to cope with the problem of multiple codings of semantically similar ideographs and with some obvious language-specific issues. As in the monolingual vector space approach to information retrieval, we measure the cosine similarity between a query and a document and retrieve documents in order of relevance. Similarity is measured in two settings: (1) monolingual, where the query and the document are in the same language, and (2) cross-language, where the query and the document are in different languages. The comparison shows that the effectiveness of cross-language information retrieval between Chinese and Japanese in this way is comparable to that of other CLIR experiments, which were conducted mainly on multiple western languages with the help of thesauri and machine translation techniques.</Paragraph>
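The retrieval step described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiment: the paper does not state its weighting scheme here, so raw character counts are assumed, the basic CJK Unified Ideographs block is assumed to stand in for "Han characters", and the names han_vector, cosine and rank are hypothetical.

```python
import math
from collections import Counter

# Assumption: the basic CJK Unified Ideographs block (U+4E00 to U+9FFF)
# stands in for "Han characters"; extension blocks are ignored.
HAN_BLOCK = range(0x4E00, 0xA000)

def han_vector(text):
    """Count each Han character in text (raw counts are an assumed weighting)."""
    return Counter(ch for ch in text if ord(ch) in HAN_BLOCK)

def cosine(u, v):
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(weight * v[ch] for ch, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, documents):
    """Return (score, doc_id) pairs sorted from most to least similar."""
    q = han_vector(query)
    scored = [(cosine(q, han_vector(text)), doc_id)
              for doc_id, text in documents.items()]
    return sorted(scored, reverse=True)

# A Japanese query scored against two short Chinese strings.
print(rank("情報検索", {"zh": "信息检索系统", "other": "天气预报"}))
```

Without any preprocessing, the Japanese query and the first Chinese string share only the character 索, because 検 and 检 are encoded as different code points. This is the coding multiplicity that the preprocessing step is meant to address.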
    <Paragraph position="1"> One promising application of this approach is identifying and aligning Chinese and Japanese documents online, for example retrieving relevant news articles published in both languages on the Internet. It is understood that several mathematical techniques, such as Han character clustering and dimensionality reduction (Evans et al., 1998), can augment and automate the process of finding associations among Han characters within and across the CJK languages.</Paragraph>
    <Paragraph position="2"> The vector space model also allows the weighting scheme to be adjusted flexibly. We can therefore augment the Han character based query vectors (a pseudo-query expansion technique) and document vectors (a pseudo-relevance feedback technique) for more effective CLIR. We leave these as our immediate future work.</Paragraph>
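One way such query augmentation might look is sketched below. The paper does not specify an update rule, so a Rocchio-style update is swapped in purely for illustration: the expanded query is alpha times the original query plus beta times the centroid of the top-ranked document vectors. The function name expand_query and the default values of alpha and beta are assumptions.

```python
from collections import Counter

def expand_query(query_vec, top_docs, alpha=1.0, beta=0.5):
    """Rocchio-style update: alpha * query + beta * centroid of top_docs.

    query_vec and each element of top_docs map Han characters to weights.
    """
    expanded = Counter()
    for ch, weight in query_vec.items():
        expanded[ch] += alpha * weight
    for doc in top_docs:
        for ch, weight in doc.items():
            # Dividing by the number of documents averages their contribution.
            expanded[ch] += beta * weight / len(top_docs)
    return expanded

# Characters from the top-ranked documents enter the query with a lower weight.
query = Counter("検索")
feedback = [Counter("索引"), Counter("索")]
print(expand_query(query, feedback))
```

Characters that occur only in the feedback documents, such as 引 here, enter the expanded query with a small weight, which is how the expansion may bring in documents that share no character with the original query.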
    <Paragraph position="3"> As has been done in MLIR, character n-gram based indexing could also be tried.</Paragraph>
    <Paragraph position="4"> However, because our document collection and our set of queries are small, n-gram based indexing suffers from a data sparseness problem.</Paragraph>
    <Paragraph position="5"> We therefore leave the character n-gram based CLIR evaluation until a much larger collection of documents and queries is available.</Paragraph>
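The sparseness concern can be illustrated with a small sketch. The helper names char_ngrams and singleton_ratio are hypothetical, and the fraction of n-grams seen only once is used here as an assumed, informal indicator of sparseness; it is not a measure taken from the paper.

```python
from collections import Counter

def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of text as a list."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def singleton_ratio(texts, n):
    """Fraction of distinct n-grams that occur exactly once across texts."""
    counts = Counter(g for t in texts for g in char_ngrams(t, n))
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)

# Toy collection: moving from single characters to bigrams raises the
# share of index terms that are seen only once.
texts = ["情報検索", "情報処理"]
print(char_ngrams(texts[0], 2))
print(singleton_ratio(texts, 1), singleton_ratio(texts, 2))
```

On a small collection, most longer n-grams occur only once, so few index terms are shared between a query and a document. A larger collection would be needed before n-gram weights become reliable.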
    <Paragraph position="6"> Conclusion: In this paper, we experimented on a small collection of homogeneous bilingual texts and a small set of queries. The results support the promise of using Han characters for cross-language information retrieval among CJK languages. Such an approach has its own advantage, since no translation of queries or documents is needed. Compared with maintaining multilingual dictionaries or thesauri, maintaining a Han character mapping table is more effective because the mapping table does not need to be updated as often. Sophisticated mathematical analysis of Han characters can bring a new dimension to cross-language information retrieval for Asian languages. Kanji Sense Disambiguation (KSD) using advanced machine learning techniques can make the proposed CLIR method more effective. KSD is a long-neglected area of research.</Paragraph>
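A Han character mapping table of the kind mentioned above might be applied as in the following sketch. The entries are a handful of hand-picked variant pairs chosen for illustration, not the table used in the experiment, and the choice of the traditional form as the canonical target is an assumption.

```python
# Illustrative entries only; the actual mapping table is not given in the text.
# Each variant is mapped to an assumed canonical (traditional) form.
VARIANT_TO_CANONICAL = {
    "検": "檢",  # Japanese simplified form
    "检": "檢",  # Chinese simplified form
    "报": "報",  # Chinese simplified form
    "汉": "漢",  # Chinese simplified form
}

# str.maketrans builds a translation table from single-character keys.
_TABLE = str.maketrans(VARIANT_TO_CANONICAL)

def normalise(text):
    """Replace known variant characters with their canonical form."""
    return text.translate(_TABLE)

# After normalisation the Japanese and Chinese spellings coincide.
print(normalise("検索"), normalise("检索"))
```

Because the table works on characters rather than words, new vocabulary in either language needs no new entries as long as its characters are already covered, which is consistent with the claim that such a table needs updating less often than a dictionary.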
    <Paragraph position="7"> Dimensionality reduction techniques, clustering, independent component analysis (ICA) and other mathematical methods can be exploited to enhance Han character based processing of CJK languages.</Paragraph>
  </Section>
</Paper>