File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/97/p97-1037_metho.xml
Size: 13,281 bytes
Last Modified: 2025-10-06 14:14:39
<?xml version="1.0" standalone="yes"?> <Paper uid="P97-1037"> <Title>A DP based Search Using Monotone Alignments in Statistical Translation</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Overview: The Statistical Approach to Translation </SectionTitle> <Paragraph position="0"> The goal is the translation of a text given in some source language into a target language. We are given a source ('French') string $f_1^J = f_1 \ldots f_j \ldots f_J$, which is to be translated into a target ('English') string $e_1^I = e_1 \ldots e_i \ldots e_I$. Among all possible target strings, we will choose the one with the highest probability, which is given by Bayes' decision rule (Brown et al., 1993):
$$\hat{e}_1^I = \arg\max_{e_1^I} \left\{ \Pr(e_1^I) \cdot \Pr(f_1^J \mid e_1^I) \right\}$$
$\Pr(e_1^I)$ is the language model of the target language, whereas $\Pr(f_1^J \mid e_1^I)$ is the string translation model. The argmax operation denotes the search problem.</Paragraph> <Paragraph position="1"> In this paper, we address
* the problem of introducing structures into the probabilistic dependencies in order to model the string translation probability $\Pr(f_1^J \mid e_1^I)$;
* the search procedure, i.e. an algorithm to perform the argmax operation in an efficient way;
* transformation steps for both the source and the target languages in order to improve the translation process.</Paragraph> <Paragraph position="2"> The transformations are very much dependent on the language pair and the specific translation task and are therefore discussed in the context of the task description. We have to keep in mind that in the search procedure both the language and the translation model are applied after the text transformation steps. However, to keep the notation simple we will not make this explicit distinction in the subsequent exposition. The overall architecture of the statistical translation approach is summarized in Figure 1.</Paragraph> <Paragraph position="3"> [Figure 1. Architecture of the translation approach based on Bayes' decision rule.] </Paragraph> </Section>
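<Paragraph> As a concrete illustration of the decision rule above, the following toy sketch (not part of the original paper; all probability tables, word lists, and function names are invented for illustration) ranks candidate target strings by the product of a language model score and a translation model score, working in log space:

import math

# Toy bigram language model p(second | first) and word lexicon p(f | e);
# the numbers are purely illustrative.
BIGRAM = {("BOS", "thanks"): 0.4, ("thanks", "EOS"): 0.5,
          ("BOS", "please"): 0.3, ("please", "EOS"): 0.4}
LEXICON = {("gracias", "thanks"): 0.9, ("gracias", "please"): 0.05}

def lm_logprob(e_words):
    words = ["BOS"] + e_words + ["EOS"]
    return sum(math.log(BIGRAM.get((w1, w2), 1e-6))
               for w1, w2 in zip(words, words[1:]))

def tm_logprob(f_words, e_words):
    # crude stand-in for Pr(f | e): best lexicon entry per source word
    return sum(math.log(max(LEXICON.get((f, e), 1e-6) for e in e_words))
               for f in f_words)

def decide(f_words, candidates):
    # Bayes decision rule: argmax_e Pr(e) * Pr(f | e), evaluated in log space
    return max(candidates, key=lambda e: lm_logprob(e) + tm_logprob(f_words, e))

print(decide(["gracias"], [["thanks"], ["please"]]))   # prints ['thanks']
</Paragraph>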
<Section position="3" start_page="0" end_page="291" type="metho"> <SectionTitle> 2 Alignment Models </SectionTitle> <Paragraph position="0"> A key issue in modeling the string translation probability $\Pr(f_1^J \mid e_1^I)$ is the question of how we define the correspondence between the words of the target sentence and the words of the source sentence. In typical cases, we can assume a sort of pairwise dependence by considering all word pairs $(f_j, e_i)$ for a given sentence pair $[f_1^J; e_1^I]$. We further constrain this model by assigning each source word to exactly one target word. Models describing these types of dependencies are referred to as alignment models (Brown et al., 1993), (Dagan et al., 1993), (Kay & Röscheisen, 1993), (Fung & Church, 1994), (Vogel et al., 1996).</Paragraph> <Paragraph position="1"> In this section, we introduce a monotone HMM based alignment and an associated DP based search algorithm for translation. Another approach to statistical machine translation using DP was presented in (Wu, 1996). The notational convention will be as follows. We use the symbol Pr(.) to denote general probability distributions with (nearly) no specific assumptions. In contrast, for model-based probability distributions, we use the generic symbol p(.).</Paragraph> <Section position="1" start_page="289" end_page="289" type="sub_section"> <SectionTitle> 2.1 Alignment with HMM </SectionTitle> <Paragraph position="0"> When aligning the words in parallel texts (for Indo-European language pairs like Spanish-English, German-English, Italian-German, ...), we typically observe a strong localization effect. Figure 2 illustrates this effect for the language pair Spanish-to-English. In many cases, although not always, there is an even stronger restriction: the difference in the position index is smaller than 3 and the alignment is essentially monotone. To be more precise, the sentences can be partitioned into a small number of segments, within each of which the alignment is monotone with respect to word order in both languages.</Paragraph> <Paragraph position="1"> To describe these word-by-word alignments, we introduce the mapping $j \to a_j$, which assigns a position $j$ (with source word $f_j$) to the position $i = a_j$ (with target word $e_i$). The concept of these alignments is similar to the ones introduced by (Brown et al., 1993), but we will use another type of dependence in the probability distributions. Looking at such alignments produced by a human expert, it is evident that the mathematical model should try to capture the strong dependence of $a_j$ on the preceding alignment $a_{j-1}$. Therefore the probability of alignment $a_j$ for position $j$ should have a dependence on the previous alignment position $a_{j-1}$:
$$p(a_j \mid a_{j-1})$$
A similar approach has been chosen by (Dagan et al., 1993) and (Vogel et al., 1996). Thus the problem formulation is similar to that of the time alignment problem in speech recognition, where the so-called Hidden Markov models have been successfully used for a long time (Jelinek, 1976). Using the same basic principles, we can rewrite the probability by introducing the 'hidden' alignments $a_1^J := a_1 \ldots a_j \ldots a_J$ for a sentence pair $[f_1^J; e_1^I]$:
$$\Pr(f_1^J \mid e_1^I) = \sum_{a_1^J} \Pr(f_1^J, a_1^J \mid e_1^I) = \sum_{a_1^J} \prod_{j=1}^{J} \Pr(f_j, a_j \mid f_1^{j-1}, a_1^{j-1}, e_1^I)$$
To avoid any confusion with the term 'hidden' in comparison with speech recognition, we observe that the model states as such (representing words) are not hidden but the actual alignments, i.e. the sequence of position index pairs $(j, i = a_j)$.</Paragraph> <Paragraph position="2"> So far there has been no basic restriction of the approach. We now assume a first-order dependence on the alignments $a_j$ only:
$$\Pr(f_j, a_j \mid f_1^{j-1}, a_1^{j-1}, e_1^I) = p(f_j, a_j \mid a_{j-1}, e_1^I) = p(a_j \mid a_{j-1}) \cdot p(f_j \mid e_{a_j})$$
where, in addition, we have assumed that the lexicon probability $p(f \mid e)$ depends only on $a_j$ and not on $a_{j-1}$.</Paragraph> <Paragraph position="3"> To reduce the number of alignment parameters, we assume that the HMM alignment probabilities $p(i \mid i')$ depend only on the jump width $(i - i')$. The monotony condition can then be formulated as: $p(i \mid i') = 0$ for $i \neq i', i'+1, i'+2$. This monotony requirement limits the applicability of our approach. However, by performing simple word reorderings, it is possible to approach this requirement (see Section 4.2). Additional countermeasures will be discussed later. Figure 3 gives an illustration of the possible alignments for the monotone hidden Markov model. To draw the analogy with speech recognition, we have to identify the states (along the vertical axis) with the positions $i$ of the target words $e_i$ and the time (along the horizontal axis) with the positions $j$ of the source words $f_j$.</Paragraph> </Section>
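<Paragraph> The following small sketch (not from the paper; the jump and lexicon tables are invented toy values and the function name is illustrative) scores one given monotone alignment under the first-order model of this subsection, i.e. it evaluates the product of jump probabilities $p(a_j \mid a_{j-1})$, assumed to depend only on the jump width, and lexicon probabilities $p(f_j \mid e_{a_j})$:

import math

JUMP = {0: 0.2, 1: 0.6, 2: 0.2}                      # p(i | i') as a function of i - i'
LEXICON = {("la", "the"): 0.9, ("casa", "house"): 0.8}

def alignment_logprob(f_words, e_words, alignment):
    """alignment[j] = aligned target position (1-based) for the (j+1)-th source word."""
    logp = 0.0
    for j, i in enumerate(alignment):
        # start condition handled loosely, as in the paper's presentation
        delta = 1 if j == 0 else i - alignment[j - 1]
        if delta not in JUMP:                         # monotony: only jumps of 0, 1, 2
            return -math.inf
        logp += math.log(JUMP[delta])
        logp += math.log(LEXICON.get((f_words[j], e_words[i - 1]), 1e-6))
    return logp

print(alignment_logprob(["la", "casa"], ["the", "house"], [1, 2]))
</Paragraph>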
<Section position="2" start_page="289" end_page="291" type="sub_section"> <SectionTitle> 2.2 Training </SectionTitle> <Paragraph position="0"> To train the alignment and the lexicon model, we use the maximum likelihood criterion in the so-called maximum approximation, i.e. the likelihood criterion covers only the most likely alignment rather than the set of all alignments:
$$\Pr(f_1^J \mid e_1^I) \cong \max_{a_1^J} \prod_{j=1}^{J} \left[ p(a_j \mid a_{j-1}) \cdot p(f_j \mid e_{a_j}) \right]$$
</Paragraph> <Paragraph position="1"> [Figure 2. Word-aligned Spanish-English sentence pair. Figure 3. Possible alignments for the monotone HMM (target position plotted over source position).] </Paragraph> <Paragraph position="2"> To find the optimal alignment, we use dynamic programming, for which we have the following typical recursion formula:
$$Q(i, j) = p(f_j \mid e_i) \cdot \max_{i'} \left[ p(i \mid i') \cdot Q(i', j - 1) \right]$$
Here, $Q(i, j)$ is a sort of partial probability as in time alignment for speech recognition (Jelinek, 1976). As a result, the training procedure amounts to a sequence of iterations, each of which consists of two steps:
* position alignment: Given the model parameters, determine the most likely position alignment.
* parameter estimation: Given the position alignment, i.e. going along the alignment paths for all sentence pairs, perform maximum likelihood estimation of the model parameters; for model-free distributions, these estimates result in relative frequencies.
The IBM model 1 (Brown et al., 1993) is used to find an initial estimate of the translation probabilities.</Paragraph> </Section> </Section> <Section position="5" start_page="291" end_page="292" type="metho"> <SectionTitle> 3 Search Algorithm for Translation </SectionTitle> <Paragraph position="0"> For the translation operation, we use a bigram language model, which is given in terms of the conditional probability of observing word $e_i$ given the predecessor word $e_{i-1}$: $p(e_i \mid e_{i-1})$. Using the conditional probability of the bigram language model, we have the overall search criterion in the maximum approximation:
$$\max_{e_1^I} \left\{ \prod_{i=1}^{I} p(e_i \mid e_{i-1}) \cdot \max_{a_1^J} \prod_{j=1}^{J} \left[ p(a_j \mid a_{j-1}) \cdot p(f_j \mid e_{a_j}) \right] \right\}$$
Here and in the following, we omit a special treatment of the start and end conditions like $j = 1$ or $j = J$ in order to simplify the presentation and avoid confusing details.</Paragraph> <Paragraph position="1"> Having the above criterion in mind, we try to associate the language model probabilities with the alignments $j \to i = a_j$. To this purpose, we exploit the monotony property of our alignment model, which allows only transitions from $a_{j-1}$ to $a_j$ if the difference $\delta = a_j - a_{j-1}$ is 0, 1 or 2. We define a modified probability $p_\delta(e \mid e')$ for the language model depending on the alignment difference $\delta$. We consider each of the three cases $\delta = 0, 1, 2$ separately:</Paragraph> <Paragraph position="2"> $\delta = 0$ (repetition): This case corresponds to a target word with two or more aligned source words and therefore requires $e = e'$ so that there is no contribution from the language model:
$$p_{\delta=0}(e \mid e') = \begin{cases} 1 & \text{if } e = e' \\ 0 & \text{otherwise} \end{cases}$$
</Paragraph> <Paragraph position="3"> $\delta = 1$ (regular transition): This case is the regular one, and we can use directly the probability of the bigram language model:
$$p_{\delta=1}(e \mid e') = p(e \mid e')$$
</Paragraph> <Paragraph position="4"> $\delta = 2$ (skip): This case corresponds to skipping a word, i.e. there is a word in the target string with no aligned word in the source string. We have to find the highest probability of placing a non-aligned word $\tilde{e}$ between a predecessor word $e'$ and a successor word $e$. Thus we optimize the following product over the non-aligned word $\tilde{e}$:
$$p_{\delta=2}(e \mid e') = \max_{\tilde{e}} \left[ p(e \mid \tilde{e}) \cdot p(\tilde{e} \mid e') \right]$$
This maximization is done beforehand and the result is stored in a table.</Paragraph>
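<Paragraph> A small sketch of the table precomputation just described (not from the paper; the bigram values and names are invented toy data): for every word pair $(e', e)$, the best skipped word is found once and $p_{\delta=2}(e \mid e')$ is stored.

# Toy bigram table p(second | first); values are purely illustrative.
BIGRAM = {("i", "would"): 0.3, ("would", "like"): 0.4, ("i", "like"): 0.05,
          ("would", "a"): 0.1, ("like", "a"): 0.3}
VOCAB = {w for pair in BIGRAM for w in pair}

def skip_table():
    # p_skip[(e_prev, e)] = max over skipped words g of p(e | g) * p(g | e_prev)
    table = {}
    for e_prev in VOCAB:
        for e in VOCAB:
            table[(e_prev, e)] = max(BIGRAM.get((g, e), 0.0) * BIGRAM.get((e_prev, g), 0.0)
                                     for g in VOCAB)
    return table

P_SKIP = skip_table()
print(P_SKIP[("i", "like")])   # best skipped word is "would": 0.4 * 0.3 = 0.12
</Paragraph>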
<Paragraph position="5"> Using this modified probability $p_\delta(e \mid e')$, we can rewrite the overall search criterion:
$$\max_{a_1^J,\, e_{a_1} \ldots e_{a_J}} \prod_{j=1}^{J} \left[ p(a_j \mid a_{j-1}) \cdot p_\delta(e_{a_j} \mid e_{a_{j-1}}) \cdot p(f_j \mid e_{a_j}) \right]$$
The problem now is to find the unknown mapping $j \to (a_j, e_{a_j})$, which defines a path through a network with a uniform trellis structure. For this trellis, we can still use Figure 3. However, in each position $i$ along the vertical axis, we have to allow all possible words $e$ of the target vocabulary. Due to the monotony of our alignment model and the bigram language model, we have only first-order type dependencies such that the local probabilities (or costs when using the negative logarithms of the probabilities) depend only on the arcs (or transitions) in the lattice. Each possible index triple $(i, j, e)$ defines a grid point in the lattice, and we have the following set of possible transitions from one grid point to another:
$$(i - \delta, j - 1, e') \to (i, j, e), \qquad \delta \in \{0, 1, 2\}$$
Each of these transitions is assigned a local probability:
$$p(i \mid i - \delta) \cdot p_\delta(e \mid e') \cdot p(f_j \mid e)$$
</Paragraph> <Paragraph position="6"> Using this formulation of the search task, we can now use the method of dynamic programming (DP) to find the best path through the lattice. To this purpose, we introduce the auxiliary quantity $Q(i, j, e)$: the probability of the best partial path which ends in the grid point $(i, j, e)$. Since we have only first-order dependencies in our model, it is easy to see that the auxiliary quantity must satisfy the following DP recursion equation:
$$Q(i, j, e) = p(f_j \mid e) \cdot \max_{\delta} \left\{ p(i \mid i - \delta) \cdot \max_{e'} \left[ p_\delta(e \mid e') \cdot Q(i - \delta, j - 1, e') \right] \right\}$$
To explicitly construct the unknown word sequence, it is convenient to make use of so-called back-pointers which store for each grid point $(i, j, e)$ the best predecessor grid point (Ney et al., 1992).</Paragraph> <Paragraph position="7"> The DP equation is evaluated recursively to find the best partial path to each grid point $(i, j, e)$. The resulting algorithm is depicted in Table 1:
Table 1. DP based search algorithm for translation.
  input: source string $f_1 \ldots f_j \ldots f_J$
  initialization
  for each position $j = 1, 2, \ldots, J$ in source sentence do
    for each position $i = 1, 2, \ldots, I_{max}$ in target sentence do
      for each target word $e$ do
        $Q(i, j, e) = p(f_j \mid e) \cdot \max_{\delta, e'} \{ p(i \mid i - \delta) \cdot p_\delta(e \mid e') \cdot Q(i - \delta, j - 1, e') \}$
  traceback:
    - find best end hypothesis: $\max Q(i, J, e)$
    - recover optimal word sequence
The complexity of the algorithm is $J \cdot I_{max} \cdot E^2$, where $E$ is the size of the target language vocabulary and $I_{max}$ is the maximum length of the target sentence considered. It is possible to reduce this computational complexity by using so-called pruning methods (Ney et al., 1992); due to space limitations, they are not discussed here.</Paragraph> </Section> </Paper>
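<Paragraph> The following self-contained sketch mirrors the DP search of Table 1 in a simplified form (it is not the authors' implementation: all probability tables and names are invented toy values, the start condition is handled loosely, and the word skipped in a $\delta = 2$ transition is not re-inserted into the output):

LEX    = {("muchas", "many"): 0.7, ("muchas", "thanks"): 0.1, ("gracias", "thanks"): 0.9}
JUMP   = {0: 0.2, 1: 0.6, 2: 0.2}                    # p(i | i - delta)
BIGRAM = {("BOS", "many"): 0.3, ("many", "thanks"): 0.5, ("BOS", "thanks"): 0.2}

def p_delta(delta, e, e_prev):
    if delta == 0:                                   # repetition: no LM contribution
        return 1.0 if e == e_prev else 0.0
    if delta == 1:                                   # regular bigram transition
        return BIGRAM.get((e_prev, e), 1e-6)
    # delta == 2: skip maximization (precomputed in practice, done on the fly here)
    return max(BIGRAM.get((g, e), 0.0) * BIGRAM.get((e_prev, g), 0.0)
               for g in {w for pair in BIGRAM for w in pair})

def search(f, e_vocab, i_max):
    Q, back = {}, {}
    for j, f_word in enumerate(f, start=1):
        for i in range(1, i_max + 1):
            for e in e_vocab:
                if j == 1:                           # loose start condition
                    Q[(i, j, e)] = LEX.get((f_word, e), 1e-9) * p_delta(1, e, "BOS")
                    back[(i, j, e)] = None
                    continue
                best, arg = 0.0, None
                for d in (0, 1, 2):                  # monotone jumps only
                    for e_prev in e_vocab:
                        cand = (JUMP[d] * p_delta(d, e, e_prev)
                                * Q.get((i - d, j - 1, e_prev), 0.0))
                        if cand > best:
                            best, arg = cand, (i - d, j - 1, e_prev)
                Q[(i, j, e)] = LEX.get((f_word, e), 1e-9) * best
                back[(i, j, e)] = arg
    # traceback from the best end hypothesis, then collapse consecutive repeats
    # produced by delta = 0 transitions
    node = max(((i, len(f), e) for i in range(1, i_max + 1) for e in e_vocab),
               key=lambda k: Q[k])
    words = []
    while node is not None:
        words.append(node[2])
        node = back[node]
    words.reverse()
    return [w for k, w in enumerate(words) if k == 0 or w != words[k - 1]]

print(search(["muchas", "gracias"], ["many", "thanks"], i_max=2))   # prints ['many', 'thanks']
</Paragraph>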