File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/02/c02-1008_intro.xml

Size: 4,807 bytes

Last Modified: 2025-10-06 14:01:17

<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1008">
  <Title>A Transitive Model for Extracting Translation Equivalents of Web Queries through Anchor Text Mining</Title>
  <Section position="2" start_page="0" end_page="1" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Cross-language information retrieval (CLIR), which addresses the need for users to query in one language and retrieve relevant documents written or indexed in another, has become an important issue in information retrieval research (Dumais et al., 1996; Davis et al., 1997; Ballesteros &amp; Croft, 1998; Nie et al., 1999). However, its application to practical Web search services has not lived up to expectations, since such services suffer from a major bottleneck: the lack of up-to-date bilingual lexicons containing translations of popular query terms, such as proper nouns (Kwok, 2001).</Paragraph>
    <Paragraph position="1"> To support CLIR, existing IR systems mostly rely on bilingual dictionaries. In these systems, queries submitted in a source language are normally translated into a target language by simple dictionary lookup. Such dictionary-based techniques are limited in real-world applications, since user queries often contain proper nouns that the dictionaries do not cover.</Paragraph>
    <Paragraph position="2"> Another popular family of approaches to query translation is corpus-based and uses a parallel corpus of aligned sentences that are translations of each other (Brown et al., 1993; Dagan et al., 1993; Smadja et al., 1996). Although these techniques can extract more reliable translation equivalents, the unavailability of sufficiently large parallel corpora for various subject domains and multiple languages remains a thorny problem. Alternative approaches using comparable or unrelated text corpora were studied by Rapp (1999) and Fung et al. (1998). This task is more difficult because there is no parallel correlation between document or sentence pairs. In our collected query logs, most user queries contain only one or two words, so we use query term, query, and term interchangeably in this paper.</Paragraph>
    <Paragraph position="3"> In our previous research, we developed an approach to extracting translations of Web queries by mining Web anchor texts and link structures (Lu et al., 2001). This approach exploits Web anchor texts as live bilingual corpora to reduce the existing difficulties of query translation. An anchor text set, which is composed of the anchor texts linking to the same page, may contain similar descriptive texts in multiple languages, so users' queries and their corresponding translations are likely to appear together in the same anchor text sets. The anchor-text mining approach has been found particularly effective for proper names, such as international company names, names of foreign movie stars, and worldwide events, e.g., &amp;quot;Yahoo&amp;quot;, &amp;quot;Anthrax&amp;quot;, and &amp;quot;Harry Potter&amp;quot;.</Paragraph>
    <Paragraph position="4"> The discovery of useful knowledge from Web anchor texts has not yet been fully explored. According to our previous experiments, the extracted translation equivalents might not be reliable enough when a query term and its corresponding translations either appear together infrequently in the same anchor text sets or do not appear together at all.</Paragraph>
    <Paragraph position="5"> In particular, translation is not possible if there are too few anchor texts for a particular language pair. Although Web anchor texts are undoubtedly live multilingual resources, not every pair of languages has sufficient anchor texts.</Paragraph>
    <Paragraph position="6"> To deal with these problems, this paper extends the previous anchor-text-based approach by adding a phase of indirect translation via an intermediate language. For a query term that cannot be translated directly, our idea is to translate it into a set of translation candidates in an intermediate language, and then to seek the most likely translation among the candidates obtained by translating from the intermediate language into the target language (Gollins et al., 2001; Borin, 2000). We therefore propose a transitive translation model to further exploit anchor text mining for translating Web queries. A series of experiments has been conducted to evaluate the performance of the proposed approach. Preliminary experimental results show that many query translations that cannot be obtained using the previous approach can be extracted with the improved approach.</Paragraph>
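    <Paragraph position="7"> The idea of indirect translation can be illustrated with a minimal sketch, which is not the model evaluated in this paper. It makes several assumptions for illustration only: each anchor text set is represented as a set of (language, term) pairs, direct translation candidates are scored with a simple Dice-style co-occurrence measure, and a transitive score is the sum over intermediate candidates of the product of the two direct scores. The function names, the language labels and the toy data are hypothetical. In the toy data the source term never co-occurs with a target-language term, so direct extraction returns nothing, whereas the transitive step recovers a candidate through the intermediate term.</Paragraph>
    <Paragraph position="8"><![CDATA[
```python
from collections import defaultdict


def direct_scores(anchor_sets, source, target_lang):
    """Score target-language terms that co-occur with `source`.

    `source` is a (language, term) pair. The Dice-style score is an
    illustrative assumption, not the similarity measure of the paper.
    """
    n_source = sum(1 for s in anchor_sets if source in s)
    if n_source == 0:
        return {}
    joint = defaultdict(int)  # anchor sets containing both source and term
    total = defaultdict(int)  # anchor sets containing the term
    for s in anchor_sets:
        for lang, term in s:
            if lang != target_lang:
                continue
            total[term] += 1
            if source in s:
                joint[term] += 1
    return {t: 2.0 * c / (n_source + total[t]) for t, c in joint.items()}


def transitive_scores(anchor_sets, source, inter_lang, target_lang):
    """Combine two direct steps through an intermediate language.

    Assumed combination rule: sum over intermediate candidates of
    score(source, mid) * score(mid, target).
    """
    combined = defaultdict(float)
    first_step = direct_scores(anchor_sets, source, inter_lang)
    for mid, s1 in first_step.items():
        second_step = direct_scores(anchor_sets, (inter_lang, mid), target_lang)
        for tgt, s2 in second_step.items():
            combined[tgt] += s1 * s2
    return dict(combined)


# Hypothetical toy data: each set holds the anchor texts linking to one page.
anchor_sets = [
    {("src", "q"), ("mid", "yahoo")},
    {("src", "q"), ("mid", "yahoo")},
    {("mid", "yahoo"), ("tgt", "t1")},
    {("mid", "yahoo"), ("tgt", "t1")},
    {("mid", "portal"), ("tgt", "t2")},
]

direct = direct_scores(anchor_sets, ("src", "q"), "tgt")
indirect = transitive_scores(anchor_sets, ("src", "q"), "mid", "tgt")
best = max(indirect, key=indirect.get) if indirect else None
print(direct, indirect, best)
```
]]></Paragraph>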
  </Section>
</Paper>