<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2141">
  <Title>HMM-Based Word Alignment in Statistical Translation</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Review: Translation Model
</SectionTitle>
    <Paragraph position="0"> The goal is the translation of a text given in some language F into a target language E. For convenience, we choose for the following exposition as language pair French and English, i.e. we are given a French string f~ = fx ...fj...fJ, which is to be translated into an English string e / = el...ei...cl.</Paragraph>
    <Paragraph position="1"> Among all possible English strings, we will choose the one with the highest probability which is given by Bayes' decision rule:</Paragraph>
    <Paragraph position="3"> Pr(e{) is the language model of the target language, whereas Pr(fJle{) is the string translation model. The argmax operation denotes the search problem. In this paper, we address the problem of introducing structures into the probabilistic dependencies in order to model the string translation probability Pr(f~ le{).</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="837" type="metho">
    <SectionTitle>
3 Alignment Models
</SectionTitle>
    <Paragraph position="0"> A key issne in modeling the string translation probability Pr(J'~le I) is the question of how we define the correspondence between the words of the English sentence and the words of the French sentence. In typical cases, we can assume a sort of pairwise dependence by considering all word pairs (fj, ei) for a given sentence pair I.-/1\[~'J', elqlj' We further constrain this model by assigning each French word to exactly one English word. Models describing these types of dependencies are referred to as alignment models. In this section, we describe two models for word alignrnent in detail: ,. a mixture-based alignment model, which was introduced in (Brown et al., 1990); * an HMM-based alignment model.</Paragraph>
    <Paragraph position="1"> In this paper, we address the question of how to define specific models for the alignment probabilities. The notational convention will be as follows. We use the symbol Pr(.) to denote general  probability distributions with (nearly) no Sl)eeitic asSUml)tions. In contrast, for modcl-t)ased prol)-ability distributions, we use the generic symbol v(.).</Paragraph>
    <Section position="1" start_page="836" end_page="837" type="sub_section">
      <SectionTitle>
3.1 Alignment with Mixture Distribution
</SectionTitle>
      <Paragraph position="0"> Here, we describe the mixture-based alignment model in a fornmlation which is different fronl the original formulation ill (Brown el, a\[., 1990).</Paragraph>
      <Paragraph position="1"> We will ,is(: this model as reference tbr the IIMM-based alignments to lie 1)resented later.</Paragraph>
      <Paragraph position="2"> The model is based on a decomposition of the joint probability \[br ,l'~ into a product over the probabilities for each word J): a j=l wheFe~ fo\[' norll-la\]iz;i, tion 17(~/SOllS~ the 8elltC\]\[ce length probability p(J\] l) has been included. The next step now is to assutne a sort O\['l,airwise interact, ion between tim French word fj an(l each, F,nglish word ci, i = 1, ...l. These dep('ndencies are captured in the lbrm of a rnixtnre distritmtion:</Paragraph>
      <Paragraph position="4"> with the following ingredients:  .p(ilj, 1) = 7 we arrive at the lh'st model proposed t)y (Brown et al., 1990). This model will be referred to as IB M 1 model.</Paragraph>
      <Paragraph position="5"> To train the translation probabilities p(J'fc), we use a bilingual (;orpus consisting of sentence pairs \[:/';4&amp;quot;1 : ', .,s Using the ,,laxin,ul , likelihood criterion, we ol)tain the following iterative L a equation (Brown et al., 1990):</Paragraph>
      <Paragraph position="7"> For unilbrm alignment probabilities, it can be shown (Brown et al., 1990), that there is only one optinnnn and therefore the I,',M algorithm (Baum, 1!)72) always tinds the global optimum.</Paragraph>
      <Paragraph position="8"> For mixture alignment model with nonunilbrm alignment probabilities (subsequently referred to as IBM2 model), there ~tre to() many alignrnent parameters Pill j, I) to be estimated for smMl col pora. Therefore, a specific model tbr tile Mignment in:obabilities is used: r(i-j~-) (~) p(ilj , 1) = l . I Ei':l &amp;quot;( it --&amp;quot; J J-) This model assumes that the position distance relative to the diagonal line of the (j, i) plane is the dominating factor (see Fig. 1). 'lb train this model, we use the ,naximutn likelihood criterion in the so-called ulaximmn al)proximation, i.e. the likelihood criterion covers only tile most lik(-.ly align: inch, rather than the set of all alignm(,nts:</Paragraph>
      <Paragraph position="10"> In training, this criterion amounts to a sequence of iterations, each of which consists of two steps: * posilion alignmcnl: (riven the model parameters, deLerlniim the mosL likely position align\]lient. null * paramc, lcr cstimalion: Given the position alignment, i.e. goiug along the alignment paths for all sentence pairs, perform maxitnulu likelihood estimation of the model parameters; for model-De(' distributions, these estimates result in relative frequencies.</Paragraph>
      <Paragraph position="11"> l)ue to the natnre of tile nfixture tnod(:l, there is no interaction between djacent word positions. Theretbre, the optimal position i for each position j can be determined in(lependently of the neighbouring positions. Thus l.he resulting training procedure is straightforward.</Paragraph>
      <Paragraph position="12"> a.2 Alignment with HMM We now propose all HMM-based alignment model.</Paragraph>
      <Paragraph position="13"> '\['he motivation is that typicMly we have a strong localization effect in aligning the words in parallel texts (for language pairs fi:om \]ndoeuropean languages): the words are not distrilmted arbitrarily over the senteuce \])ositions, but tend to form clusters. Fig. 1 illustrates this effect for the language pair German- 15'nglish.</Paragraph>
      <Paragraph position="14"> Each word of the German sentence is assigned to a word of the English sentence. The alignments have a strong tendency to preserve the local neighborhood when going from the one langnage to the other language. In mm,y cases, although not al~ ways, there is an even stronger restriction: the differeuce in the position index is smMler than 3.</Paragraph>
      <Paragraph position="16"> To describe these word-by-word aligmnents, we introduce the mapping j ---+ aj, which assigns a word fj in position j to a word el in position { = aj. The concept of these alignments is similar to the ones introduced by (Brown et al., 1990), but we wilt use another type of dependence in the probability distributions. Looking at such alignments produced by a hmnan expert, it is evident that the mathematical model should try to capture the strong dependence of aj on the previous aligmnent. Therefore the probability of alignment aj for position j should have a dependence on the previous alignment aj _ 1 : p(ajiaj_l,i) , where we have inchided the conditioning on the total length \[ of the English sentence for normalization reasons. A sinfilar approach has been chosen by (Da.gan et al., 1993). Thus the problem formulation is similar to that of the time alignment problem in speech recognition, where the so-called IIidden Markov models have been successfully used for a long time (Jelinek, 1976). Using the same basic principles, we can rewrite the probability by introducing the 'hidden' alignments</Paragraph>
      <Paragraph position="18"> a I j=l So far there has been no basic restriction of the approach. We now assume a first-order dependence on the alignments aj only: Vr(fj,aslf{ -~, J-* a I , el) where, in addition, we have assmned that tile translation probability del)ends only oil aj and not oil aj-:l. Putting everything together, we have the ibllowing llMM-based model:</Paragraph>
      <Paragraph position="20"> with the following ingredients: * IlMM alignment probability: p(i\]i', I) or p(aj laj_l, I); * translation probabflity: p(f\]e). In addition, we assume that the t{MM alignment probabilities p(i\[i', \[) depend only on the jump width (i - i'). Using a set of non-negative parameters {s(i- i')}, we can write the IIMM alignment probabilities in the form:</Paragraph>
      <Paragraph position="22"> This form ensures that for each word position i', i' = 1, ..., I, the ItMM alignment probabilities satisfy the normMization constraint.</Paragraph>
      <Paragraph position="23"> Note the similarity between Equations (2) and (5). The mixtm;e model can be interpreted as a zeroth-order model in contrast to the first-order tlMM model.</Paragraph>
      <Paragraph position="24"> As with the IBM2 model, we use again the max-</Paragraph>
      <Paragraph position="26"> In this case, the task of finding the optimal alignment is more involved than in the case of the mixture model (lBM2). Thereibre, we have to resort to dynainic programming for which we have the following typical reeursion formula: Q(i, j) = p(fj lel) ,nvax \[p(ili', 1). Q(i', j - 1)\] i =l,.,,I Here, Q(i, j) is a sort of partial probability as in time alignment for speech recognition (Jelinek, 197@.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>