<?xml version="1.0" standalone="yes"?>
<Paper uid="P99-1068">
  <Title>Mining the Web for Bilingual Text</Title>
  <Section position="2" start_page="0" end_page="527" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Text in parallel translation is a valuable resource in natural language processing.</Paragraph>
    <Paragraph position="1"> Statistical methods in machine translation (e.g., Brown et al., 1990) typically rely on large quantities of bilingual text aligned at the document or sentence level, and a number of approaches in the burgeoning field of cross-language information retrieval exploit parallel corpora either in place of or in addition to mappings between languages based on information from bilingual dictionaries (Davis and Dunning, 1995; Landauer and Littman, 1990; Hull and Oard, 1997; Oard, 1997). Despite the utility of such data, however, sources of bilingual text are subject to such limitations as licensing restrictions, usage fees, restricted domains or genres, and dated text (such as 1980s Canadian politics); or such sources simply may not exist for language pairs of interest.</Paragraph>
    <Paragraph position="2"> * This work was supported by Department of Defense contract MDA90496C1250, DARPA/ITO Contract N66001-97-C-8540, and a research grant from Sun Microsystems Laboratories. The author gratefully acknowledges the comments of the anonymous reviewers, helpful discussions with Dan Melamed and Doug Oard, and the assistance of Jeff Allen in the French-English experimental evaluation.</Paragraph>
    <Paragraph position="3"> Although the majority of Web content is in English, the Web also shows great promise as a source of multilingual content. Using figures from the Babel survey of multilinguality on the Web (http://www.isoc.org/), it is possible to estimate that as of June 1997 there were on the order of 63,000 primarily non-English Web servers, ranging over 14 languages. Moreover, a follow-up investigation of the non-English servers suggests that nearly a third contain some useful cross-language data, such as parallel English on the page or links to parallel English pages; the follow-up also found pages in five languages not identified by the Babel study (Catalan, Chinese, Hungarian, Icelandic, and Arabic; Michael Littman, personal communication). Given the continued explosive increase in the size of the Web, the trend toward business organizations that cross national boundaries, and high levels of competition for consumers in a global marketplace, it seems impossible not to view multilingual content on the Web as an expanding resource. Moreover, it is a dynamic resource, changing in content as the world changes. For example, Diekema et al., in a presentation at the 1998 TREC-7 conference (Voorhees and Harman, 1998), observed that the performance of their cross-language information retrieval was hurt by lexical gaps such as Bosnia/Bosnie; this illustrates a highly topical pair missing from their static lexical resource (which was based on WordNet 1.5). Gey et al., also at TREC-7, observed that in doing cross-language retrieval using commercial machine translation systems, gaps in the lexicon (their example was acupuncture/Akupunktur) could make the difference between precision of 0.08 and precision of 0.83 on individual queries.</Paragraph>
    <Paragraph position="4"> Resnik (1998) presented an algorithm called STRAND (Structural Translation Recognition for Acquiring Natural Data) designed to explore the Web as a source of parallel text, demonstrating its potential with a small-scale evaluation based on the author's judgments.</Paragraph>
    <Paragraph position="6"> After briefly reviewing the STRAND architecture and preliminary results (Section 2), this paper goes beyond that preliminary work in two significant ways. First, the framework is extended to include a filtering stage that uses automatic language identification to eliminate an important class of false positives: documents that appear structurally to be parallel translations but are in fact not in the languages of interest. The system is then run on a somewhat larger scale and evaluated formally for English and Spanish using measures of agreement with independent human judges, precision, and recall (Section 3). Second, the algorithm is scaled up more seriously to generate large numbers of parallel documents, this time for English and French, and again subjected to formal evaluation (Section 4). The concrete end result reported here is an automatically acquired English-French parallel corpus of Web documents comprising 2491 document pairs, approximately 1.5 million words per language (without markup), containing little or no noise.</Paragraph>
  </Section>
</Paper>