<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1099">
  <Title>Query Translation by Text Categorization</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Cross Language Information Retrieval (CLIR) is increasingly relevant as network-based resources become commonplace. In the medical domain, it is of strategic importance in order to fill the gap between clinical records, written in national languages, and research reports, overwhelmingly written in English. There are several ways of handling CLIR. Historically, the most traditional approach to IR in general, and to multilingual retrieval in particular, uses a controlled vocabulary for indexing and retrieval. In this approach, a librarian selects for each document a few descriptors taken from a closed list of authorized terms. A good example of such human indexing is found in the MedLine database, whose records are manually annotated with Medical Subject Headings (MeSH). Ontological relations (synonyms, related terms, narrower terms, broader terms) can be used to help choose the right descriptors and to solve the sense problems of synonyms and homographs. The list of authorized terms and the semantic relations between them are contained in a thesaurus. A problem remains, however, since concepts expressed by a single term in one language are sometimes expressed by distinct terms in another. We can observe that terminology-based CLIR is a common approach in well-delimited fields for which multilingual thesauri already exist (not only in medicine, but also in the legal domain, energy, etc.), as well as in multinational organizations or countries with several official languages. This controlled vocabulary approach is often associated with Boolean-like engines; it gives acceptable results but prohibits precise queries that cannot be expressed with these authorized keywords.
The two main problems are: (1) it can be difficult for users to think in terms of a controlled vocabulary, so these systems, like most Boolean-supported engines, are often used by professionals rather than general users; (2) this retrieval method ignores the free-text portions of documents during indexing.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.1 Translation-based approach
</SectionTitle>
      <Paragraph position="0"> A second approach to multilingual interrogation is to use existing machine translation (MT) systems to automatically translate the queries (Davis, 1998), or even the entire textual database (Oard and Hackett, 1998; McCarley, 1999), from one language to another, thereby transforming the CLIR problem into a monolingual information retrieval (MLIR) problem. This kind of method would be satisfactory if current MT systems did not make errors. A certain amount of syntactic error can be accepted without perturbing the results of information retrieval systems, but MT errors in translating concepts can prevent relevant documents, indexed on the missing concepts, from being found. For example, if the French word traitement is translated as processing instead of prescription, the retrieval process will yield wrong results. This drawback is limited in MT systems that use huge transfer lexicons of noun phrases and take advantage of frequent collocations to help disambiguation, but in any collection of text, ambiguous nouns will still appear as isolated noun phrases untouched by this approach.</Paragraph>
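      <Paragraph position="1"> The effect of a concept-level translation error on retrieval can be illustrated with a minimal sketch. It assumes a toy English index and a word-by-word lexicon lookup standing in for a full MT system; the names INDEX, translate, retrieve, WRONG_SENSE and RIGHT_SENSE are hypothetical and are not taken from any of the cited systems.

```python
# Hypothetical English index: document id -> set of index terms.
INDEX = {
    "rec1": {"hypertension", "prescription"},
    "rec2": {"signal", "processing"},
}


def translate(query, lexicon):
    """Word-by-word translation; words missing from the lexicon are kept unchanged."""
    return [lexicon.get(word, word) for word in query.split()]


def retrieve(terms):
    """Return the ids of documents whose index terms contain every query term."""
    return sorted(
        doc for doc, index_terms in INDEX.items()
        if set(terms).issubset(index_terms)
    )


# Two hypothetical lexicons that differ only in the sense chosen for "traitement".
WRONG_SENSE = {"traitement": "processing", "hypertension": "hypertension"}
RIGHT_SENSE = {"traitement": "prescription", "hypertension": "hypertension"}

print(retrieve(translate("traitement hypertension", WRONG_SENSE)))
print(retrieve(translate("traitement hypertension", RIGHT_SENSE)))
```

With the wrong sense, the translated query matches no document in this toy index; with the right sense, it retrieves rec1. The strict conjunctive matching is a simplification chosen to make the failure visible, not a claim about how any particular engine ranks documents.</Paragraph>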
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.2 Using parallel resources
</SectionTitle>
      <Paragraph position="0"> A third approach receiving increasing attention is to automatically establish associations between queries and documents independent of language differences. Seminal research used latent semantic indexing (Dumais et al., 1997). The general strategy when working with parallel or comparable texts is the following: if some documents are translated into a second language, these documents can be observed both in the subspace related to the first language and in the subspace related to the second one; using a query expressed in the second language, the most relevant documents in the translated subset are extracted (usually using a cosine measure of proximity). These relevant documents are in turn used to extract close untranslated documents in the subspace of the first language. This approach uses implicit dependency links and co-occurrences that better approximate the notion of concept. Such a strategy has been tested with success on the English-French language pair using a sample of the Canadian Parliament bilingual corpus. It is reported that for 92% of the English text documents the closest document returned by the method was its correct French translation. Such an approach presupposes that the sample used for training is representative of the full database, and that sufficient parallel/comparable corpora are available or can be acquired. Other approaches are usually based on bilingual dictionaries and terminologies, sometimes combined with parallel corpora. These approaches attempt to infer a word-by-word transfer function: they typically begin by deriving a translation dictionary, which is then applied to query translation. To summarize, the performance of CLIR systems typically ranges between 60% and 90% of the corresponding monolingual run (Schäuble and Sheridan, 1998).
CLIR ratios above 100% have been reported (Xu et al., 2001); however, such results were obtained against a weak monolingual baseline.</Paragraph>
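      <Paragraph position="1"> The two-step strategy described above for parallel texts can be sketched as follows. This is only an illustration under stated assumptions: it uses raw term-count vectors and a cosine measure rather than latent semantic indexing, and the toy collection (FIRST, SECOND) and the function names are hypothetical.

```python
import math
from collections import Counter


def vec(text):
    """Bag-of-words term counts for a whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical toy collection in the first language (French).
FIRST = {
    "d1": "traitement de l hypertension",
    "d2": "diagnostic du diabete",
    "d3": "traitement hypertension chez le patient age",
    "d4": "chirurgie du genou",
}
# Only d1 and d2 have a translation in the second language (English).
SECOND = {
    "d1": "treatment of hypertension",
    "d2": "diagnosis of diabetes",
}


def cross_language_search(query):
    """Rank untranslated first-language documents for a second-language query."""
    q = vec(query)
    # Step 1: find the closest translated document in the second-language space.
    pivot = max(SECOND, key=lambda d: cosine(q, vec(SECOND[d])))
    # Step 2: use the pivot's first-language text to rank untranslated documents.
    p = vec(FIRST[pivot])
    untranslated = [d for d in FIRST if d not in SECOND]
    ranked = sorted(
        untranslated, key=lambda d: cosine(p, vec(FIRST[d])), reverse=True
    )
    return pivot, ranked


print(cross_language_search("hypertension treatment"))
```

For the English query "hypertension treatment", the translated document d1 is selected as the pivot, and its French text then ranks the untranslated document d3 ahead of d4. A single pivot is used here for brevity; the strategy in the text extracts several relevant translated documents, which would amount to combining the similarities over the top-ranked pivots.</Paragraph>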
    </Section>
  </Section>
</Paper>