<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1120"> <Title>Cross-Language Information Retrieval Based on Category Matching Between Language Versions of a Web Directory</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> With the growing popularity of the Internet, more and more languages are being used for Web documents, and it is now much easier to access documents written in foreign languages. However, existing Web search engines only support the retrieval of documents written in the same language as the query, so monolingual users cannot efficiently retrieve documents written in non-native languages. Also, depending on the user's needs, there may be cases where more information is available in a language other than the user's native language. The demand for retrieving such information is presumably large. To satisfy this demand on a conventional monolingual retrieval system, the user has to translate the query manually, for example by using a dictionary. This process not only imposes a burden on the user but may also yield incorrect translations of the query, especially for languages that are unfamiliar to the user.</Paragraph> <Paragraph position="1"> To meet this demand, research on Cross-Language Information Retrieval (CLIR), a technique for retrieving documents written in one language using a query written in another, has been active in recent years. A variety of methods, including the use of corpus statistics for translating terms and for disambiguating the translated terms, have been studied, and some results have been obtained. However, corpus-based disambiguation methods are heavily affected by the domain of the training corpus, so retrieval effectiveness for other domains may drop significantly. 
Moreover, since the Web consists of documents in various domains and genres, a CLIR method for Web documents should be independent of any particular domain.</Paragraph> <Paragraph position="2"> In this paper, we propose a CLIR method that employs Web directories provided in multiple language versions (such as Yahoo!). Our system uses two or more language versions of a Web directory.</Paragraph> <Paragraph position="3"> One version is in the query language, and the others are in the target languages. From these language versions, category correspondences between languages are estimated in advance. First, feature terms are extracted from Web documents for each category in the source and target languages. Then, for each category, one or more corresponding categories in the other language are determined by comparing similarities between categories across languages. Using these category pairs, we aim to resolve the ambiguities of simple dictionary translation by narrowing the categories to be retrieved in the target language.</Paragraph> </Section></Paper>