<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1060">
  <Title>Rapidly Retargetable Interactive Translingual Retrieval</Title>
  <Section position="4" start_page="1" end_page="1" type="metho">
    <SectionTitle>
2. DOCUMENT TRANSLATION AND INDEXING
</SectionTitle>
    <Paragraph position="0"> We have adopted a document translation architecture for two reasons. First, we support a single query language (English) but multiple document languages, so indexing English terms simplifies query processing (where interactive response time can be a concern). Second, a document translation architecture simplifies the display of translated documents by decoupling the translation and display processes. Gigabyte collections require machine translation that is orders of magnitude faster than present commercial systems. We accomplish this using term-by-term translation, in which the basic operation is a simple hash table lookup. Any translation requires some source of translation knowledge--we use a bilingual term list containing English translation(s) for each foreign language term. We typically construct these term lists by harvesting Internet-available translation resources, so the foreign language terms for which translations are known are typically an eclectic mix of root and inflected forms. We accommodate this limitation using a four-stage backoff statistical stemming approach to enhance translation coverage.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
2.1 Preprocessing.
</SectionTitle>
      <Paragraph position="0"> Differences in use of diacritics, case, and punctuation can inhibit matching between term list entries and document terms, so normalization is important. In order to maximize the probability of matching document words with term list entries, we normalize the bilingual term list and the documents by converting characters in Western languages to lowercase, removing all accents and diacritics, and segmenting the text, which for Western languages merely involves separating punctuation from other text by the addition of white space.</Paragraph>
      <Paragraph position="1"> Our preprocessing also includes conversion of the bilingual term list and the document collection into standard formats. The preprocessing typically requires about half a day of programmer time.</Paragraph>
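      <Paragraph position="2"> As a minimal illustrative sketch (in Python, not the code used in our system), the three normalization steps described above could be written as follows. The function name normalize and the choice of Unicode NFD decomposition for removing diacritics are assumptions made for this example; letters that have no canonical decomposition would need an explicit mapping table that is not shown here.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip accents and diacritics, and split off punctuation."""
    # Lowercase first so that case differences cannot block a match.
    lowered = text.lower()
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Segmentation: surround each punctuation character with white space.
    spaced = re.sub(r"([^\w\s])", r" \1 ", stripped)
    # Collapse the runs of white space introduced by the previous step.
    return " ".join(spaced.split())


print(normalize("Élève, café!"))  # prints: eleve , cafe !
```

      Applying the same function to both the term list entries and the document text is what keeps the two sides comparable.</Paragraph>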
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
2.2 Four-Stage Backoff Translation.
</SectionTitle>
      <Paragraph position="0"> Bilingual term lists found on the Web often contain an eclectic mix of root forms and morphological variants. We thus developed a four-stage backoff strategy to maximize coverage while limiting spurious translations:  1. Match the surface form of a document term to surface forms of source language terms in the bilingual term list. 2. Match the stem of a document term to surface forms of source language terms in the bilingual term list.</Paragraph>
      <Paragraph position="1"> 3. Match the surface form of a document term to stems of source language terms in the bilingual term list.</Paragraph>
      <Paragraph position="2"> 4. Match the stem of a document term to stems of source language terms in the bilingual term list.</Paragraph>
      <Paragraph position="3"> The process terminates as soon as a match is found at any stage, and the known translations for that match are generated. Although this may produce an inappropriate morphological variant for a correct English translation, use of English stemming at indexing time minimizes the effect of that factor on retrieval effectiveness. Because we are ultimately interested in processing documents in any language, we may not have a hand-crafted stemmer available for the document language. We have thus explored the application of rule induction to learn stemming rules in an unsupervised fashion from the collection that is being indexed [2].</Paragraph>
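      <Paragraph position="4"> The four stages listed above can be sketched as two hash tables consulted in a fixed order. This Python fragment is illustrative only: the names build_stem_index, backoff_translate, and toy_stem are hypothetical, and toy_stem is a crude suffix stripper standing in for the induced stemming rules, which are not reproduced here. The small Spanish term list is likewise invented for the example.

```python
from typing import Callable


def toy_stem(word: str) -> str:
    """Hypothetical stand-in stemmer: strip one common suffix if enough remains."""
    for suffix in ("as", "os", "es", "a", "o", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_index(term_list: dict[str, list[str]],
                     stem: Callable[[str], str]) -> dict[str, list[str]]:
    """Map the stem of each term list entry to its merged translations."""
    index: dict[str, list[str]] = {}
    for source_term, translations in term_list.items():
        bucket = index.setdefault(stem(source_term), [])
        for translation in translations:
            if translation not in bucket:
                bucket.append(translation)
    return index


def backoff_translate(term: str,
                      term_list: dict[str, list[str]],
                      stem_index: dict[str, list[str]],
                      stem: Callable[[str], str]) -> tuple[int, list[str]]:
    """Return (stage, translations); stage 0 means no match at any stage."""
    stages = [
        (1, term, term_list),         # surface form against surface forms
        (2, stem(term), term_list),   # stem against surface forms
        (3, term, stem_index),        # surface form against stems
        (4, stem(term), stem_index),  # stem against stems
    ]
    for stage, key, table in stages:
        if key in table:
            return stage, table[key]  # terminate at the first matching stage
    return 0, []


# Invented example entries, not taken from any real term list.
term_list = {
    "casa": ["house", "home"],
    "papel": ["paper", "role"],
    "flores": ["flowers"],
    "gata": ["cat"],
}
stem_index = build_stem_index(term_list, toy_stem)

for word in ("casa", "papeles", "flor", "gatos", "xyz"):
    print(word, backoff_translate(word, term_list, stem_index, toy_stem))
```

      Because every stage is a single dictionary lookup, the cost per document term stays constant regardless of which stage finally matches.</Paragraph>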
    </Section>
    <Section position="3" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
2.3 Balanced Top-2 Translation.
</SectionTitle>
      <Paragraph position="0"> We produce exactly two English terms for each foreign-language term. For terms with no known translation, the untranslated term is generated twice (often appropriate for proper names in the Latin-1 character set). For terms with one translation, that translation is generated twice. For terms with two or more known translations, we generate the &quot;best&quot; two translations. In prior experiments we have found that this balanced translation strategy significantly outperforms the usual (unbalanced) technique of including all known translations [1]. We establish the &quot;best&quot; translations by sorting the bilingual term list in advance using only English resources. All single-word translations are ordered by decreasing unigram frequency in the Brown corpus, followed by all multi-word translations, and finally by any single word entries not found in the Brown corpus.</Paragraph>
      <Paragraph position="1"> This ordering minimizes the effect of infrequent words in non-standard usages and of misspellings that sometimes appear in bilingual term lists. This translation strategy allows balancing of translations in a modular fashion, even when one does not have access to the internal parameters of the information retrieval system. We translate 100 MB per hour using Perl on a SPARC Ultra 5.</Paragraph>
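      <Paragraph position="2"> The ordering and the two-term output rule described above can be illustrated with the following Python sketch. It is not our Perl implementation; the function names are hypothetical, and the frequency counts are invented placeholders rather than actual Brown corpus figures.

```python
def rank_translations(translations: list[str],
                      unigram_freq: dict[str, int]) -> list[str]:
    """Order translations: single words with known counts by decreasing
    frequency, then multi-word translations, then unseen single words."""
    def sort_key(translation: str) -> tuple[int, int]:
        if " " in translation:
            return (1, 0)                            # multi-word group
        if translation in unigram_freq:
            return (0, -unigram_freq[translation])   # most frequent first
        return (2, 0)                                # single word, no count
    return sorted(translations, key=sort_key)        # stable within groups


def balanced_top2(term: str, ranked: dict[str, list[str]]) -> list[str]:
    """Emit exactly two English terms for one document term."""
    candidates = ranked.get(term, [])
    if not candidates:
        return [term, term]              # untranslated term, generated twice
    if len(candidates) == 1:
        return [candidates[0]] * 2       # single translation, duplicated
    return candidates[:2]                # the two highest-ranked translations


# Placeholder counts for illustration only (not real Brown corpus values).
freq = {"house": 500, "home": 400, "cat": 20}
ranked = {
    "casa": rank_translations(["dwelling place", "hovel", "home", "house"], freq),
    "gata": rank_translations(["cat"], freq),
}

print(balanced_top2("casa", ranked))    # ['house', 'home']
print(balanced_top2("gata", ranked))    # ['cat', 'cat']
print(balanced_top2("madrid", ranked))  # ['madrid', 'madrid']
```

      Since every document term contributes the same number of English terms, no source term gains extra weight in the index merely by having many listed translations.</Paragraph>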
    </Section>
    <Section position="4" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
2.4 Post-translation Document Expansion.
</SectionTitle>
      <Paragraph position="0"> We implement post-translation document expansion for the foreign language stories after translation into English in order to enrich the indexing vocabulary beyond that which was available after term-by-term translation. This is analogous to the process that Singhal et al. applied to monolingual speech retrieval [4].</Paragraph>
      <Paragraph position="1"> Term-by-term translation produces a set of English terms that serve as a noisy representation of the original source language document.</Paragraph>
      <Paragraph position="2"> These terms are then treated as a query to a comparable English collection, typically contemporaneous newswire text, from which we retrieve the five highest ranked documents. From those five documents, we extract the most selective terms and use them to enrich the original translations of the documents. For this expansion process we select one instance of every term with an IDF value above an ad hoc threshold that was tuned to yield approximately 50 new terms. This optional step is the slowest processing stage, with a throughput of about 20 MB per hour.</Paragraph>
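      <Paragraph position="3"> The expansion step can be sketched in Python as follows, under several stated assumptions: the ranking function (a sum of IDF values over shared terms) is a simple stand-in for the retrieval system actually used, IDF is computed as log(N / df), terms already present in the translated document are skipped, and all function names and the toy collection are hypothetical.

```python
import math
from collections import Counter


def idf_table(collection: list[list[str]]) -> dict[str, float]:
    """Inverse document frequency, log(N / df), for every collection term."""
    doc_freq = Counter(term for doc in collection for term in set(doc))
    return {term: math.log(len(collection) / df) for term, df in doc_freq.items()}


def expand(doc_terms: list[str], collection: list[list[str]],
           idf_threshold: float, top_k: int = 5) -> list[str]:
    """Append one instance of each selective term from the top-ranked documents."""
    idf = idf_table(collection)
    query = set(doc_terms)

    def score(candidate: list[str]) -> float:
        # Stand-in ranking: sum of IDF over query terms found in the candidate.
        return sum(idf[term] for term in query.intersection(candidate))

    top_docs = sorted(collection, key=score, reverse=True)[:top_k]
    added: list[str] = []
    for doc in top_docs:
        for term in doc:
            if idf[term] > idf_threshold and term not in query and term not in added:
                added.append(term)   # one instance per selected term
    return doc_terms + added


# Toy comparable collection, invented for the example.
newswire = [
    ["election", "president", "vote", "ballot"],
    ["election", "vote", "parliament"],
    ["football", "match", "goal"],
    ["weather", "rain", "storm"],
]
print(expand(["election", "vote"], newswire, idf_threshold=1.0, top_k=2))
```

      In this sketch the threshold plays the role of the ad hoc IDF cutoff: raising it admits fewer, more selective terms, and lowering it admits more.</Paragraph>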
    </Section>
    <Section position="5" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
2.5 Indexing
</SectionTitle>
      <Paragraph position="0"> The resulting collection is then indexed using Inquery (version 3.1p1), with the kstem stemmer and default English stopword list.</Paragraph>
      <Paragraph position="1"> Indexing is the fastest stage in the process, with throughput exceeding one gigabyte per hour.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="1" end_page="1" type="metho">
    <SectionTitle>
3. INTERACTIVE RETRIEVAL
</SectionTitle>
    <Paragraph position="0"> Interactive searches are performed using a Web interface. Summary information for the top-ranked documents is displayed in groups of ten per page. Document summaries consist of the date and a gloss translation of the document title. Users can inspect a gloss translation of the full text of any document if the title is not sufficiently informative. For both title and full text, the gloss translations are generated in advance using the same process as translation for indexing, with the following differences in detail: Terms added as a result of document expansion are not displayed. The number of retained translations is separately selectable for the title and for full text indexing.</Paragraph>
    <Paragraph position="1"> Translations are not duplicated when fewer than the maximum allowable number of translations are known.</Paragraph>
    <Paragraph position="2"> Our goal is to support the process of finding documents, with the realization that the process of using documents may need to be supported in some other way (e.g., by forwarding relevant documents to someone who is able to read that language). We have therefore designed our interface to highlight the query terms in translated documents and to facilitate skimming by emphasizing the most common translation when multiple translations are displayed. We have found that such displays can support a classification task, even when the translation is not easy to read [3]. Documents must be classified by the user as relevant or not relevant, so our classification results suggest that this can be an effective user interface design.</Paragraph>
  </Section>
</Paper>