<?xml version="1.0" standalone="yes"?> <Paper uid="H91-1026"> <Title>Identifying Word Correspondences in Parallel Texts</Title> <Section position="1" start_page="0" end_page="152" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> Researchers in both machine translation (e.g., Brown et al., 1990) and bilingual lexicography (e.g., Klavans and Tzoukermann, 1990) have recently become interested in studying parallel texts (also known as bilingual corpora), bodies of text such as the Canadian Hansards (parliamentary debates) which are available in multiple languages (such as French and English). Much of the current excitement surrounding parallel texts was initiated by Brown et al. (1990), who outline a self-organizing method for using these parallel texts to build a machine translation system.</Paragraph> <Paragraph position="1"> Brown et al. begin by aligning the parallel texts at the sentence level. In our experience, 90% of the English sentences match exactly one French sentence, but other possibilities, especially two sentences matching one or one matching two, are not uncommon. There has been quite a bit of recent work on sentence alignment, e.g., (Brown, Lai and Mercer, 1990), (Kay and Röscheisen, 1988), (Catizone, Russell, and Warwick, to appear); we use a method described in (Gale and Church, 1991) which makes use of the fact that the length of a text (in characters) is highly correlated (0.991) with the length of its translation. A probabilistic score is assigned to each proposed match, based on the lengths of the two regions and some simple assumptions about the distributions of these two lengths. This probabilistic score is used in a dynamic programming framework to find the maximum likelihood alignment of sentences.</Paragraph> <Paragraph position="2"> After sentences have been aligned, the second step is to identify correspondences at the word level. 
That is, we would like to know which words in the English text correspond to which words in the French text. The identification of word level correspondences is the main topic of this paper.</Paragraph> <Paragraph position="3"> We wish to distinguish the terms alignment and correspondence. The term alignment will be used when order constraints must be preserved and the term correspondence will be used when order constraints need not be preserved and crossing dependencies are permitted. We refer to the matching problem at the word level as a correspondence problem because it is important to model crossing dependencies (e.g., sales volume and volume des ventes). In contrast, we refer to the matching problem at the sentence level as an alignment problem because we believe that it is not necessary to model crossing dependencies at the sentence level as they are quite rare and can be ignored for now.</Paragraph> <Paragraph position="4"> Here is an example of our word correspondence program. Given the input English and French sentences: English: we took the initiative in assessing and amending current legislation and policies to ensure that they reflect a broad interpretation of the charter .</Paragraph> <Paragraph position="5"> French: nous avons pris l' initiative d' évaluer et de modifier des lois et des politiques en vigueur afin qu' elles correspondent à une interprétation généreuse de la charte .</Paragraph> <Paragraph position="6"> The program would produce the following correspondences: Output: we/nous took/0 the/0 initiative/initiative in/0 assessing/évaluer and/et amending/modifier current/0 legislation/0 and/et policies/politiques to/à ensure/0 that/qu' they/elles reflect/0 a/une broad/0 interpretation/interprétation of/de the/la charter/charte ./.</Paragraph> <Paragraph position="7"> In this example, 15 out of the 23 (65%) English words were matched with a French word (with to/à in error), and 8 of the English words were left unmatched (paired with 
&quot;0&quot;). Throughout this work, we have focused our attention on robust statistics that tend to avoid making hard decisions when there isn't much confidence. In other words, we favor methods with relatively high precision and possibly low recall. For now, we are more concerned with errors of commission than errors of omission. Based on a sample of 800 sentences, we estimate that our word matching procedure matches 61% of the English words with some French word, and about 95% of these pairs match the English word with the appropriate French word.</Paragraph> <Paragraph position="8"> After word correspondences have been identified, it is possible to estimate a probabilistic transfer dictionary. The entry for &quot;the&quot; found in (Brown et al.) includes the estimates of Prob(le | the)=.61 and Prob(la | the)=.18. Brown et al. show how this probabilistic transfer dictionary can be combined with a trigram grammar in order to produce a machine translation system. Since this paper is primarily concerned with the identification of word correspondences, we will not go into these other very interesting issues here.</Paragraph> </Section> </Paper>