<?xml version="1.0" standalone="yes"?> <Paper uid="P94-1051"> <Title>AUTOMATIC ALIGNMENT IN PARALLEL CORPORA</Title> <Section position="3" start_page="0" end_page="334" type="intro"> <SectionTitle> INTRODUCTION </SectionTitle> <Paragraph position="0"> Parallel linguistically meaningful text units are indispensable in a number of NLP and lexicographic applications and, more recently, in so-called Example-Based Machine Translation (EBMT).</Paragraph> <Paragraph position="1"> In EBMT, a large number of bilingual or multilingual translation examples are stored in a database, and input expressions are rendered in the target language by retrieving from the database the example that is most similar to the input. A task of crucial importance in this framework is the establishment of correspondences between units of multilingual texts at sentence, phrase or even word level.</Paragraph> <Paragraph position="2"> The adopted criteria for ascertaining the embedded extra-linguistic data (tables, anchor points, SGML markers, etc.) and their possible inconsistencies.</Paragraph> <Paragraph position="3"> * it should be able to process a large amount of text in linear time and in a computationally efficient way.</Paragraph> <Paragraph position="4"> * in terms of performance, a considerable success rate (above 99% at sentence level) must be achieved in order to construct a database of truly corresponding units. It is desirable that the alignment method be language-independent. * the proposed method must be extensible to accommodate future improvements. In addition, any training or error correction mechanism should be reliable and fast, and should not require vast amounts of data when switching from one pair of languages to another or when dealing with corpora of different text types.</Paragraph> <Paragraph position="5"> Several approaches have been proposed, tackling the problem at various levels. 
[Catizone 89] proposed linking regions of text according to the regularity of word co-occurrences across texts.</Paragraph> <Paragraph position="6"> [Brown 91] described a method based on the number of words that sentences contain.</Paragraph> <Paragraph position="7"> Certain anchor points and paragraph markers are also considered. The method has been applied to the Hansard Corpus, achieving an accuracy of 96-97%.</Paragraph> <Paragraph position="8"> [Gale 91] [Church 93] proposed a method that relies on a simple statistical model of character lengths. The model is based on the observation that longer sentences in one language tend to be translated into longer sequences in the other language, while shorter ones tend to be translated into shorter ones. A probabilistic score is assigned to each proposed sentence pair, based on the ratio of the lengths of the two sentences and the variance of this ratio.</Paragraph> <Paragraph position="9"> Although the efficacy of the Gale-Church algorithm is undeniable and has been validated on different pairs of languages, it faces problems when handling complex alignments. The 2-1 alignments had five times the error rate of 1-1 alignments. The 2-2 category showed a 33% error rate, while the 1-0 or 0-1 alignments were missed entirely.</Paragraph> <Paragraph position="10"> To overcome the inherent weaknesses of the Gale-Church method, [Simard 92] proposed using cognates, that is, pairs of tokens from different languages that share &quot;obvious&quot; phonological or orthographic and semantic properties, since these are likely to be used as mutual translations.</Paragraph> <Paragraph position="11"> In this paper, an alignment scheme is proposed that deals in a systematic way with the varying requirements of different applications. For example, in EBMT, the requirements are strict in terms of information integrity but relaxed in terms of delay and response time. 
Our approach is based on several observations. First of all, we assume that the establishment of correspondences between units can be applied at sentence, clause, and phrase level. Alignment at each of these levels has to invoke a different set of textual and linguistic information (acting as unit delimiters). In this paper, alignment is tackled at sentence level.</Paragraph> </Section> </Paper>