<?xml version="1.0" standalone="yes"?>
<Paper uid="P91-1022">
  <Title>ALIGNING SENTENCES IN PARALLEL CORPORA</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Recent work by Brown et al., \[Brown et al., 1988, Brown et al., 1990\] has quickened anew the long dormant idea of using statistical techniques to carry out machine translation from one natural language to another. The lynchpin of their approach is a large collection of pairs of sentences that are mutual translations. Beyond providing grist to the statistical mill, such pairs of sentences are valuable to researchers in bilingual lexicography \[Klavans and Tzoukermann, 1990, Warwick and Russell, 1990\] and may be useful in other approaches to machine translation \[Sadler, 1989\].</Paragraph>
    <Paragraph position="1"> In this paper, we consider the problem of extracting from parallel French and English corpora pairs of sentences that are translations of one another. The task is not trivial because at times a single sentence in one language is translated as two or more sentences in the other language. At other times a sentence, or even a whole passage, may be missing from one or the other of the corpora.</Paragraph>
    <Paragraph position="2"> If a person is given two parallel texts and asked to match up the sentences in them, it is natural for him to look at the words in the sentences. Elaborating this intuitively appealing insight, researchers at Xerox and at ISSCO \[Kay, 1991, Catizone et al., 1989\] have developed alignment algorithms that pair sentences according to the words that they contain. Any such algorithm is necessarily slow and, despite the potential for highly accurate alignment, may be unsuitable for very large collections of text. Our algorithm makes no use of the lexical details of the corpora, but deals only with the number of words in each sentence.</Paragraph>
    <Paragraph position="3"> Although we have used it only to align parallel French and English corpora from the proceedings of the Canadian Parliament, we expect that our technique would work on other French and English corpora and even on other pairs of languages. The work of Gale and Church \[Gale and Church, 1991\], who use a very similar method but measure sentence lengths in characters rather than in words, supports this promise of wider applicability.</Paragraph>
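    <Paragraph position="4"> The difference between measuring sentence length in words and in characters can be illustrated with a minimal Python sketch. It is not the alignment algorithm itself, only the per-sentence quantity that a length-based method would take as input. The function name sentence_lengths, the whitespace tokenization, and the example sentences are assumptions made for illustration and are not taken from the corpora discussed here.

```python
def sentence_lengths(sentences, unit="words"):
    """Return one length per sentence, ignoring which words occur.

    unit="words" counts whitespace-separated tokens (an assumed,
    deliberately crude tokenization); unit="characters" counts
    non-whitespace characters.
    """
    if unit == "words":
        return [len(s.split()) for s in sentences]
    if unit == "characters":
        return [len("".join(s.split())) for s in sentences]
    raise ValueError("unit must be 'words' or 'characters'")


# Hypothetical parallel sentences, invented for this example.
english = ["The house is small.", "It is very old."]
french = ["La maison est petite.", "Elle est ancienne."]

for unit in ("words", "characters"):
    print(unit, sentence_lengths(english, unit), sentence_lengths(french, unit))
```

Under either unit, each text is reduced to a sequence of numbers, so no lexical comparison between the two languages is needed.</Paragraph>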
  </Section>
</Paper>