<?xml version="1.0" standalone="yes"?>
<Paper uid="P91-1023">
  <Title>A PROGRAM FOR ALIGNING SENTENCES IN BILINGUAL CORPORA</Title>
  <Section position="2" start_page="0" end_page="178" type="intro">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Researchers in both machine translation (e.g., Brown et al., 1990) and bilingual lexicography (e.g., Klavans and Tzoukermann, 1990) have recently become interested in studying bilingual corpora, bodies of text such as the Canadian Hansards (parliamentary debates) which are available in multiple languages (such as French and English). The sentence alignment task is to identify correspondences between sentences in one language and sentences in the other language.</Paragraph>
    <Paragraph position="1"> This task is a first step toward the more ambitious task of finding correspondences among words [1]. The input is a pair of texts such as those in Table 1.</Paragraph>
    <Paragraph position="2"> [1] In statistics, string matching problems are divided into two classes: alignment problems and correspondence problems. Crossing dependencies are possible in the latter, but not in the former.</Paragraph>
    <Section position="1" start_page="0" end_page="178" type="sub_section">
      <SectionTitle>
Table 1: Input to Alignment Program
</SectionTitle>
      <Paragraph position="0"> English: According to our survey, 1988 sales of mineral water and soft drinks were much higher than in 1987, reflecting the growing popularity of these products. Cola drink manufacturers in particular achieved above-average growth rates. The higher turnover was largely due to an increase in the sales volume. Employment and investment levels also climbed. Following a two-year transitional period, the new Foodstuffs Ordinance for Mineral Water came into effect on April 1, 1988. Specifically, it contains more stringent requirements regarding quality consistency and purity guarantees.</Paragraph>
      <Paragraph position="1"> French: Quant aux eaux minérales et aux limonades, elles rencontrent toujours plus d'adeptes. En effet, notre sondage fait ressortir des ventes nettement supérieures à celles de 1987, pour les boissons à base de cola notamment. La progression des chiffres d'affaires résulte en grande partie de l'accroissement du volume des ventes. L'emploi et les investissements ont également augmenté.</Paragraph>
      <Paragraph position="2"> La nouvelle ordonnance fédérale sur les denrées alimentaires concernant entre autres les eaux minérales, entrée en vigueur le 1er avril 1988 après une période transitoire de deux ans, exige surtout une plus grande constance dans la qualité et une garantie de la pureté. The output identifies the alignment between sentences. Most English sentences match exactly one French sentence, but it is possible for an English sentence to match two or more French sentences. The first two English sentences (below) illustrate a particularly hard case where two English sentences align to two French sentences. No smaller alignments are possible because the clause "... sales ... were higher..." in the first English sentence corresponds to (part of) the second French sentence. The next two alignments below illustrate the more typical case where one English sentence aligns with exactly one French sentence. The final alignment matches two English sentences to a single French sentence. These alignments agreed with the results produced by a human judge.</Paragraph>
      <Paragraph position="3">  According to our survey, 1988 sales of mineral water and soft drinks were much higher than in 1987, reflecting the growing popularity of these products. Cola drink manufacturers in particular achieved above-average growth rates.</Paragraph>
      <Paragraph position="4"> Quant aux eaux minérales et aux limonades, elles rencontrent toujours plus d'adeptes. En effet, notre sondage fait ressortir des ventes nettement supérieures à celles de 1987, pour les boissons à base de cola notamment.</Paragraph>
      <Paragraph position="5"> The higher turnover was largely due to an increase in the sales volume.</Paragraph>
      <Paragraph position="6"> La progression des chiffres d'affaires résulte en grande partie de l'accroissement du volume des ventes.</Paragraph>
      <Paragraph position="7"> Employment and investment levels also climbed. L'emploi et les investissements ont également augmenté.</Paragraph>
      <Paragraph position="8"> Following a two-year transitional period, the new Foodstuffs Ordinance for Mineral Water came into effect on April 1, 1988. Specifically, it contains more stringent requirements regarding quality consistency and purity guarantees.</Paragraph>
      <Paragraph position="9"> La nouvelle ordonnance fédérale sur les denrées alimentaires concernant entre autres les eaux minérales, entrée en vigueur le 1er avril 1988 après une période transitoire de deux ans, exige surtout une plus grande constance dans la qualité et une garantie de la pureté.</Paragraph>
      <Paragraph position="10"> Aligning sentences is just a first step toward constructing a probabilistic dictionary (Table 3) for use in aligning words in machine translation (Brown et al., 1990), or for constructing a bilingual concordance (Table 4) for use in lexicography (Klavans and Tzoukermann, 1990).</Paragraph>
      <Paragraph position="11"> bank/banque ("money" sense): "and the governor of the bank of canada have frequently" / "et le gouverneur de la banque du canada ont fréquemment"; "800 per cent in one week through bank action. SENT there" / "% en une semaine à cause d'une banque. SENT voilà". bank/banc ("place" sense): "such was the case in the georges bank issue which was settled betw" / "états-unis et le canada à propos du banc de george."; "he said the nose and tail of the bank were surrendered by" / "les extrémités du banc. SENT".</Paragraph>
      <Paragraph position="12"> Although there has been some previous work on sentence alignment, e.g., (Brown, Lai, and Mercer, 1991), (Kay and Röscheisen, 1988), (Catizone et al., to appear), the alignment task remains a significant obstacle preventing many potential users from reaping many of the benefits of bilingual corpora, because the proposed solutions are often unavailable, unreliable, and/or computationally prohibitive.</Paragraph>
      <Paragraph position="13"> The align program is based on a very simple statistical model of character lengths. The model makes use of the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and that shorter sentences tend to be translated into shorter sentences. A probabilistic score is assigned to each proposed pair of sentences, based on the ratio of the lengths of the two sentences (in characters) and the variance of this ratio. This probabilistic score is used in a dynamic programming framework in order to find the maximum likelihood alignment of sentences.</Paragraph>
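      <Paragraph> The length-based scoring and dynamic programming search described above can be illustrated with a minimal Python sketch. It is not the align program itself: the function names (length_cost, align), the use of a normal distribution to score the length difference, and every numeric constant (MEAN_RATIO, VARIANCE, PRIORS) are assumptions made for illustration, since this section does not specify them.
<![CDATA[
```python
import math

# All constants below are illustrative assumptions, not values given in this section.
MEAN_RATIO = 1.0   # assumed expected number of target characters per source character
VARIANCE = 6.8     # assumed variance of the length difference per character
# Assumed prior probability of each alignment pattern (source count, target count).
PRIORS = {
    (1, 1): 0.89,
    (1, 0): 0.00495, (0, 1): 0.00495,
    (2, 1): 0.0445, (1, 2): 0.0445,
    (2, 2): 0.011,
}


def length_cost(len_a, len_b, prior):
    """Cost (negative log probability) of pairing spans of len_a and len_b characters."""
    mean = (len_a + len_b / MEAN_RATIO) / 2.0
    if mean == 0:
        delta = 0.0
    else:
        delta = (len_b - len_a * MEAN_RATIO) / math.sqrt(mean * VARIANCE)
    # Two-tailed probability of a deviation at least this large under a standard normal.
    p = max(math.erfc(abs(delta) / math.sqrt(2.0)), 1e-300)
    return -math.log(p) - math.log(prior)


def align(src, tgt):
    """Return (source sentences, target sentences) groups with minimum total cost."""
    n, m = len(src), len(tgt)
    src_len = [len(s) for s in src]
    tgt_len = [len(t) for t in tgt]
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Forward pass: extend every reachable cell by each allowed alignment pattern.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = length_cost(sum(src_len[i:ni]), sum(tgt_len[j:nj]), prior)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Trace back from the end of both texts to recover the chosen groups.
    groups = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        groups.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    groups.reverse()
    return groups
```
]]>
 Calling align(english_sentences, french_sentences) returns the groups in document order. A group holding two English sentences and one French sentence corresponds to the case where two English sentences match a single French sentence.</Paragraph>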
      <Paragraph position="14">  It is remarkable that such a simple approach can work as well as it does. An evaluation was performed based on a trilingual corpus of 15 economic reports issued by the Union Bank of Switzerland (UBS) in English, French and German (N = 14,680 words, 725 sentences, and 188 paragraphs in English and corresponding numbers in the other two languages). The method correctly aligned all but 4% of the sentences. Moreover, it is possible to extract a large subcorpus which has a much smaller error rate. By selecting the best scoring 80% of the alignments, the error rate is reduced from 4% to 0.7%. There were roughly the same number of errors in each of the English-French and English-German alignments, suggesting that the method may be fairly language independent. We believe that the error rate is considerably lower in the Canadian Hansards because the translations are more literal.</Paragraph>
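      <Paragraph> Selecting the best-scoring portion of the alignments, as in the 80% subcorpus mentioned above, could be sketched as follows. The helper keep_best and the (cost, alignment) tuple layout are hypothetical, and the sketch assumes that a lower cost means a more confident alignment; the example values are made up for illustration.
<![CDATA[
```python
def keep_best(scored_alignments, fraction=0.8):
    """Keep the given fraction of alignments with the lowest cost.

    scored_alignments is a list of (cost, alignment) tuples (hypothetical layout);
    a lower cost is assumed to mean a more confident alignment.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    ranked = sorted(scored_alignments, key=lambda item: item[0])
    # The small epsilon guards against floating-point products such as 28.999999.
    keep = int(len(ranked) * fraction + 1e-9)
    return ranked[:keep]


# Made-up costs for five alignments; the worst-scoring one is dropped.
example = [(0.2, "a"), (5.1, "b"), (0.4, "c"), (1.0, "d"), (0.1, "e")]
best = keep_best(example, fraction=0.8)
```
]]>
 The discarded alignments are those the model is least sure about, which is why the retained subcorpus can have a lower error rate than the full set.</Paragraph>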
    </Section>
  </Section>
</Paper>