<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1120">
  <Title>Cross-Language Information Retrieval Based on Category Matching Between Language Versions of a Web Directory</Title>
  <Section position="3" start_page="0" end_page="0" type="relat">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> Approaches to CLIR can be classified into three categories: document translation, query translation, and the use of an inter-lingual representation. The approach based on translating the target documents has the advantage that existing machine translation systems can be used, and these can exploit more content information for disambiguation. Thus, in general, it achieves better retrieval effectiveness than approaches based on query translation (Sakai, 2000). However, since it is impractical to translate a huge document collection beforehand and it is difficult to extend this method to new languages, this approach is not suitable for a multilingual, large-scale, and frequently updated collection such as the Web. The inter-lingual approach transfers both documents and queries into an inter-lingual representation, such as bilingual thesaurus classes or a language-independent vector space. The latter requires a training phase that uses a bilingual (parallel or comparable) corpus as training data.</Paragraph>
    <Paragraph position="1"> The major problem with the approach based on the translation and disambiguation of queries is that the queries submitted by ordinary users of Web search engines tend to be very short (approximately two words on average (Jansen et al., 2000)) and usually consist of just an enumeration of keywords (i.e., they provide no context). However, this approach has the advantage that the translated queries can simply be fed into existing monolingual search engines. In this approach, a source-language query is first translated into the target language using a bilingual dictionary, and the translated query is then disambiguated. Our method falls into this category.</Paragraph>
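The two steps of the query translation approach described above (dictionary lookup, then disambiguation) can be illustrated with a minimal Python sketch. It is not the method proposed in this paper; it assumes a generic corpus-based disambiguation that prefers the combination of candidate translations with the highest pairwise co-occurrence count. The toy English-to-French dictionary, the co-occurrence counts, and all function names are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical toy bilingual dictionary: source term to candidate translations.
BILINGUAL_DICT = {
    "bank": ["banque", "rive"],
    "loan": ["emprunt"],
}

# Hypothetical co-occurrence counts taken from some target-language corpus.
COOCCURRENCE = {
    frozenset(["banque", "emprunt"]): 12,
    frozenset(["rive", "emprunt"]): 1,
}


def translate_candidates(query_terms, dictionary):
    """Look up each source term; terms missing from the dictionary are kept as-is."""
    return [dictionary.get(term, [term]) for term in query_terms]


def coherence(terms, cooccurrence):
    """Sum the co-occurrence counts over all pairs of chosen translations."""
    total = 0
    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):
            total += cooccurrence.get(frozenset([terms[i], terms[j]]), 0)
    return total


def translate_query(query_terms, dictionary, cooccurrence):
    """Return the combination of candidate translations with the highest coherence."""
    candidates = translate_candidates(query_terms, dictionary)
    best = max(product(*candidates), key=lambda combo: coherence(combo, cooccurrence))
    return list(best)


if __name__ == "__main__":
    print(translate_query(["bank", "loan"], BILINGUAL_DICT, COOCCURRENCE))
    print(translate_query(["bank"], BILINGUAL_DICT, COOCCURRENCE))
```

With a two-word query, the second term supplies the context needed to choose between the candidate translations of the first. With a one-word query, every candidate scores zero and the sketch simply falls back to the first dictionary entry, which mirrors the difficulty that very short queries pose for disambiguation.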
    <Paragraph position="2"> It has been pointed out that corpus-based disambiguation methods are heavily affected by the difference in domain between the query and the corpus. Hull suggests that, in methods that use parallel or comparable corpora, a difference between the query and the corpus may degrade retrieval effectiveness (Hull, 1997). Lin et al. conducted comparative experiments on three monolingual corpora with different domains and sizes, and concluded that a large-scale, domain-consistent corpus is needed to obtain useful co-occurrence data (Lin et al., 1999).</Paragraph>
    <Paragraph position="3"> In Web retrieval, which is the target of our research, the system has to cope with queries on many different kinds of topics. However, it is impractical to prepare corpora that cover every possible domain. In our previous paper (Kimura et al., 2003), we proposed a CLIR method that, instead of using existing corpora, uses documents in a Web directory that has several language versions (such as Yahoo!) in order to improve retrieval effectiveness. In this paper, we propose an extension of that method which takes the hierarchical structure of Web directories into account. Dumais and Chen (2000) suggest that the precision of Web document classification can be improved to a certain extent by using the hierarchical structure of a Web directory to limit the target categories that are compared. Following this idea, we try to improve our previously proposed method by incorporating the hierarchical structure of a Web directory when merging categories.</Paragraph>
  </Section>
</Paper>