<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0831">
  <Title>Novel Reordering Approaches in Phrase-Based Statistical Machine Translation</Title>
  <Section position="5" start_page="167" end_page="167" type="metho">
    <SectionTitle>
2 Machine Translation using WFSTs
</SectionTitle>
    <Paragraph position="0"> Let fJ1 and eIi be two sentences from a source and target language. Assume that we have word level alignments A of all sentence pairs from a bilingual training corpus. We denote with ~eJ1 the segmentation of a target sentence eI1 into J phrases such that fJ1 and ~eJ1 can be aligned to form bilingual tuples (fj,~ej). If alignments are only functions of target words Aprime : {1,...,I} - {1,...,J}, the bilingual tuples (fj,~ej) can be inferred with e. g. the GIATI method of (Casacuberta et al., 2004), or with our novel monotonization technique (see Sec. 3). Each source word will be mapped to a target phrase of one or more words or an &amp;quot;empty&amp;quot; phrase e. In particular, the source words which will remain non-aligned due to the alignment functionality restriction are paired with the empty phrase.</Paragraph>
    <Paragraph position="1"> We can then formulate the problem of finding the best translation ^eI1 of a source sentence fJ1 :</Paragraph>
    <Paragraph position="3"> In other words: if we assume a uniform distribution for Pr(A), the translation problem can be mapped to the problem of estimating an m-gram language model over a learned set of bilingual tuples (fj,~ej). Mapping the bilingual language model to a WFST T is canonical and it has been shown in (Kanthak et al., 2004) that the search problem can then be rewritten using finite-state terminology:</Paragraph>
    <Paragraph position="5"> This implementation of the problem as WFSTs may be used to efficiently solve the search problem in machine translation.</Paragraph>
  </Section>
  <Section position="6" start_page="167" end_page="168" type="metho">
    <SectionTitle>
3 Reordering in Training
</SectionTitle>
    <Paragraph position="0"> When the alignment function Aprime is not monotonic, target language phrases ~e can become very long.</Paragraph>
    <Paragraph position="1"> For example in a completely non-monotonic alignment all target words are paired with the last aligned source word, whereas all other source words form tuples with the empty phrase. Therefore, for language pairs with big differences in word order, probability estimates may be poor.</Paragraph>
    <Paragraph position="2"> This problem can be solved by reordering either source or target training sentences such that alignments become monotonic for all sentences. We suggest the following consistent source sentence re-ordering and alignment monotonization approach in which we compute optimal, minimum-cost alignments. null First, we estimate a cost matrix C for each sentence pair (fJ1 ,eI1). The elements of this matrix cij are the local costs of aligning a source word fj to a target word ei. Following (Matusov et al., 2004), we compute these local costs by interpolating state occupation probabilities from the source-to-target and target-to-source training of the HMM and IBM-4 models as trained by the GIZA++ toolkit (Och et al., 2003). For a given alignment A [?] I xJ, we define the costs of this alignment c(A) as the sum of the local costs of all aligned word pairs:</Paragraph>
    <Paragraph position="4"> The goal is to find an alignment with the minimum costs which fulfills certain constraints.</Paragraph>
    <Section position="1" start_page="167" end_page="168" type="sub_section">
      <SectionTitle>
3.1 Source Sentence Reordering
</SectionTitle>
      <Paragraph position="0"> To reorder a source sentence, we require the alignment to be a function of source words A1:</Paragraph>
      <Paragraph position="2"> We do not allow for non-aligned source words. A1 naturally defines a new order of the source words fJ1 which we denote by VfJ1 . By computing this permutation for each pair of sentences in training and applying it to each source sentence, we create a corpus of reordered sentences.</Paragraph>
    </Section>
    <Section position="2" start_page="168" end_page="168" type="sub_section">
      <SectionTitle>
3.2 Alignment Monotonization
</SectionTitle>
      <Paragraph position="0"> In order to create a &amp;quot;sentence&amp;quot; of bilingual tuples ( VfJ1 ,~eJ1) we required alignments between reordered source and target words to be a function of target words A2 : {1,...,I} - {1,...,J}. This alignment can be computed in analogy to Eq. 2 as:</Paragraph>
      <Paragraph position="2"> where Vcij are the elements of the new cost matrix VC which corresponds to the reordered source sentence. We can optionally re-estimate this matrix by repeating EM training of state occupation probabilities with GIZA++ using the reordered source corpus and the original target corpus. Alternatively, we can get the cost matrix VC by reordering the columns of the cost matrix C according to the permutation given by alignment A1.</Paragraph>
      <Paragraph position="3"> In alignment A2 some target words that were previously unaligned in A1 (like &amp;quot;the&amp;quot; in Fig. 1) may now still violate the alignment monotonicity. The monotonicity of this alignment can not be guaranteed for all words if re-estimation of the cost matrices had been performed using GIZA++.</Paragraph>
      <Paragraph position="4"> The general GIATI technique (Casacuberta et al., 2004) is applicable and can be used to monotonize the alignment A2. However, in our experiments the following method performs better. We make use of the cost matrix representation and compute a monotonic minimum-cost alignment with a dynamic programming algorithm similar to the Levenshtein string edit distance algorithm. As costs of each &amp;quot;edit&amp;quot; operation we consider the local alignment costs. The resulting alignment A3 represents a minimum-cost monotonic &amp;quot;path&amp;quot; through the cost matrix. To make A3 a function of target words we do not consider the source words non-aligned in A2 and also forbid &amp;quot;deletions&amp;quot; (&amp;quot;many-to-one&amp;quot; source word alignments) in the DP search.</Paragraph>
      <Paragraph position="5"> An example of such consistent reordering and monotonization is given in Fig. 1. Here, we re-order the German source sentence based on the initial alignment A1, then compute the function of target words A2, and monotonize this alignment to A3 theverybeginningofMaywouldsuitme.</Paragraph>
      <Paragraph position="6"> theverybeginningofMaywouldsuitme.</Paragraph>
      <Paragraph position="7"> sehrgutAnfangMaiwurdepassenmir.</Paragraph>
      <Paragraph position="8"> sehrgutAnfangMaiwurdepassenmir.</Paragraph>
      <Paragraph position="9"> theverybeginningofMaywouldsuitme.</Paragraph>
      <Paragraph position="10"> mir sehrwurde gutAnfangMaipassen.</Paragraph>
      <Paragraph position="11">  gual tuples.</Paragraph>
      <Paragraph position="12"> with the dynamic programming algorithm. Fig. 1 also shows the resulting bilingual tuples ( Vfj,~ej).</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="168" end_page="170" type="metho">
    <SectionTitle>
4 Reordering in Search
</SectionTitle>
    <Paragraph position="0"> When searching the best translation ~eJ1 for a given source sentence fJ1 , we permute the source sentence as described in (Knight et al., 1998):</Paragraph>
    <Paragraph position="2"> Permuting an input sequence of J symbols results in J! possible permutations and representing the permutations as a finite-state automaton requires at least 2J states. Therefore, we opt for computing the permutation automaton on-demand while applying beam pruning in the search.</Paragraph>
    <Section position="1" start_page="168" end_page="169" type="sub_section">
      <SectionTitle>
4.1 Lazy Permutation Automata
</SectionTitle>
      <Paragraph position="0"> For on-demand computation of an automaton in the flavor described in (Kanthak et al., 2004) it is sufficient to specify a state description and an algorithm that calculates all outgoing arcs of a state from the state description. In our case, each state represents a permutation of a subset of the source words fJ1 , which are already translated.</Paragraph>
      <Paragraph position="1"> This can be described by a bit vector bJ1 (Zens et al., 2002). Each bit of the state bit vector corresponds to an arc of the linear input automaton and is set to one if the arc has been used on any path from the initial to the current state. The bit vectors of two states connected by an arc differ only in a single bit.</Paragraph>
      <Paragraph position="2"> Note that bit vectors elegantly solve the problem of recombining paths in the automaton as states with  the same bit vectors can be merged. As a result, a fully minimized permutation automaton has only a single initial and final state.</Paragraph>
      <Paragraph position="3"> Even with on-demand computation, complexity using full permutations is unmanagable for long sentences. We further reduce complexity by additionally constraining permutations. Refer to Figure 2 for visualizations of the permutation constraints which we describe in the following.</Paragraph>
    </Section>
    <Section position="2" start_page="169" end_page="169" type="sub_section">
      <SectionTitle>
4.2 IBM Constraints
</SectionTitle>
      <Paragraph position="0"> The IBM reordering constraints are well-known in the field of machine translation and were first described in (Berger et al., 1996). The idea behind these constraints is to deviate from monotonic translation by postponing translations of a limited number of words. More specifically, at each state we can translate any of the first l yet uncovered word positions. The implementation using a bit vector is straightforward. For consistency, we associate window size with the parameter l for all constraints presented here.</Paragraph>
    </Section>
    <Section position="3" start_page="169" end_page="169" type="sub_section">
      <SectionTitle>
4.3 Inverse IBM Constraints
</SectionTitle>
      <Paragraph position="0"> The original IBM constraints are useful for a large number of language pairs where the ability to skip some words reflects the differences in word order between the two languages. For some other pairs, it is beneficial to translate some words at the end of the sentence first and to translate the rest of the sentence nearly monotonically. Following this idea we can define the inverse IBM constraints. Let j be the first uncovered position. We can choose any position for translation, unless l [?] 1 words on positions jprime &gt; j have been translated. If this is the case we must translate the word in position j. The inverse IBM constraints can also be expressed by invIBM(x) = transpose(IBM(transpose(x))).</Paragraph>
      <Paragraph position="1"> As the transpose operation can not be computed on-demand, our specialized implementation uses bit vectors bJ1 similar to the IBM constraints.</Paragraph>
    </Section>
    <Section position="4" start_page="169" end_page="169" type="sub_section">
      <SectionTitle>
4.4 Local Constraints
</SectionTitle>
      <Paragraph position="0"> For some language pairs, e.g. Italian - English, words are moved only a few words to the left or right. The IBM constraints provide too many alternative permutations to chose from as each word can be moved to the end of the sentence. A solution that allows only for local permutations and therefore has  of a source sentence f1f2f3f4 using a window size of 2 for b) IBM constraints, c) inverse IBM constraints and d) local constraints.</Paragraph>
      <Paragraph position="1"> very low complexity is given by the following permutation rule: the next word for translation comes from the window of l positions1 counting from the first yet uncovered position. Note, that the local constraints define a true subset of the permutations defined by the IBM constraints.</Paragraph>
    </Section>
    <Section position="5" start_page="169" end_page="170" type="sub_section">
      <SectionTitle>
4.5 ITG Constraints
</SectionTitle>
      <Paragraph position="0"> Another type of reordering can be obtained using Inversion Transduction Grammars (ITG) (Wu, 1997).</Paragraph>
      <Paragraph position="1"> These constraints are inspired by bilingual bracketing. They proved to be quite useful for machine translation, e.g. see (Bender et al., 2004). Here, we interpret the input sentence as a sequence of segments. In the beginning, each word is a segment of its own. Longer segments are constructed by recursively combining two adjacent segments. At each  combination step, we either keep the two segments in monotonic order or invert the order. This process continues until only one segment for the whole sentence remains. The on-demand computation is implemented in spirit of Earley parsing.</Paragraph>
      <Paragraph position="2"> We can modify the original ITG constraints to further limit the number of reorderings by forbidding segment inversions which violate IBM constraints with a certain window size. Thus, the resulting reordering graph contains the intersection of the reorderings with IBM and the original ITG constraints. null</Paragraph>
    </Section>
    <Section position="6" start_page="170" end_page="170" type="sub_section">
      <SectionTitle>
4.6 Weighted Permutations
</SectionTitle>
      <Paragraph position="0"> So far, we have discussed how to generate the permutation graphs under different constraints, but permutations were equally probable. Especially for the case of nearly monotonic translation it is make sense to restrict the degree of non-monotonicity that we allow when translating a sentence. We propose a simple approach which gives a higher probability to the monotone transitions and penalizes the non-monotonic ones.</Paragraph>
      <Paragraph position="1"> A state description bJ1 , for which the following condition holds: Mon(j) : bjprime = d(jprime [?] j) [?] 1 [?] jprime [?] J represents the monotonic path up to the word fj. At each state we assign the probability a to that out-going arc where the target state description fullfills Mon(j+1) and distribute the remaining probability mass 1[?]a uniformly among the remaining arcs. In case there is no such arc, all outgoing arcs get the same uniform probability. This weighting scheme clearly depends on the state description and the out-going arcs only and can be computed on-demand.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML