<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-1019">
  <Title>A Weighted Finite State Transducer Implementation of the Alignment Template Model for Statistical Machine Translation</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Alignment Template Translation Models
</SectionTitle>
    <Paragraph position="0"> We present here a derivation of the alignment template translation model (ATTM) (Och et al., 1999; Och, 2002) and give an implementation of the model using weighted finite state transducers (WFSTs). The finite state modeling is performed using the AT&amp;T FSM Toolkit (Mohri et al., 1997).</Paragraph>
    <Paragraph position="1"> In this model, the translation of a source language sentence to a target language sentence is described by a joint probability distribution over all possible segmentations and alignments. This distribution is presented in Figure 1 and Equations 1-7. The components of the overall translation model are the source language model (Term 2), the source segmentation model (Term 3), the phrase permutation model (Term 4), the template sequence model (Term 5), the phrasal translation model (Term 6) and the target language model (Term 7). Each of these conditional distributions is modeled independently and we now define each in turn and present its implementation as a weighted finite state acceptor or transducer.</Paragraph>
    <Paragraph position="3"> We begin by distinguishing words and phrases. We assume that a9 is a phrase in the target language sentence that has length a45 and consists of words a3 a7 a8a15a3a47a46a48a8a50a49a51a49a52a49a51a8a24a3a48a53 . Similarly, a phrase a21 in the source language sentence contains words a26a6a54a48a8a15a26 a7 a8a50a49a51a49a52a49a51a8a15a26a48a55 , where a26a56a54 is the NULL token. We assume that each word in each language can be assigned to a unique class so that a9 unambiguously specifies a class sequence a57</Paragraph>
    <Paragraph position="5"> segmented into phrases a21 a11a7 , we say a21 a11a7 a30a59a26 a27a7 to indicate that the words in the phrase sequence agree with the original sentence.</Paragraph>
    <Paragraph position="6"> Source Language Model The model assigns probability to any sentence a26 a27a7 in the source language; this probability is not actually needed by the translation process when a26 a27a7 is given. As the first component in the model, a finite state acceptor a60 is constructed for a26 a27a7 .</Paragraph>
    <Paragraph position="7"> Source Segmentation Model We introduce the phrase count random variable a23 which specifies the number of phrases in a particular segmentation of the source language sentence. For a sentence of length a61 , there are</Paragraph>
    <Paragraph position="9"> a7a66a65 ways to segment it into a23 phrases. Motivated by this, we choose the distribution a0a2a1a4a23a67a35 a26 a27a7 a28 as</Paragraph>
    <Paragraph position="11"> so that a76 a11 a0a2a1a4a23a36a35 a26 a27a7 a28a31a30a77a74 .</Paragraph>
    <Paragraph position="12"> We construct a joint distribution over all phrase seg-</Paragraph>
    <Paragraph position="14"> a28 is a &amp;quot;unigram&amp;quot; distribution over source language phrases; we assume that we have an inventory of phrases from which this quantity can be estimated. In this way, the likelihood of a particular segmentation is determined by the likelihood of the phrases that result.</Paragraph>
    <Paragraph position="15"> We now describe the finite state implementation of the source segmentation model and show how to compute the most likely segmentation under the model:</Paragraph>
    <Paragraph position="17"> 1. For each source language sentence a26 a27a7 to be translated, we implement a weighted finite state transducer a120 that segments the sentence into all possible phrase sequences a21 a11a7 permissible given the inventory of phrases. The score of a segmentation a21 a11a7</Paragraph>
    <Paragraph position="19"> a lattice of segmentations of a26 a27a7 (implemented as an acceptor a60 ) by composing it with the transducer a120 , i.e. a121 a30 a60a40a122a123a120 .</Paragraph>
    <Paragraph position="20"> 2. We then decompose a121 into a61 disjoint subsets</Paragraph>
    <Paragraph position="22"> contains all segmentations of the source language sentence with exactly a23 phrases. To construct a121 a11 , we create an unweighted acceptor a0 a11 that accepts any phrase sequence of length a23 ; for efficiency, the phrase vocabulary is restricted to the phrases in a121 .</Paragraph>
    <Paragraph position="23"> a121 a11 is then obtained by the finite state composition</Paragraph>
    <Paragraph position="25"> The normalization factors a102 a11 are obtained by summing the probabilities of all segmentations in a121 a11 .</Paragraph>
    <Paragraph position="26"> This sum can be computed efficiently using lattice forward probabilities (Wessel et al., 1998). For a fixed a23 , the most likely segmentation in a121 a11 is found as</Paragraph>
    <Paragraph position="28"> A portion of the segmentation transducer a120 for the French sentence nous avons une inflation galopante is presented in Figure 2. When composed with a60 , a120 generates the following two phrase segmentations: nous avons une inflation galopante and nous avons une inflation galopante. The &amp;quot; &amp;quot; symbol is used to indicate phrases formed by concatenation of consecutive words.</Paragraph>
    <Paragraph position="29"> The phrases specified by the source segmentation model remain in the order that they appear in the source sentence. null  a120 for the sentence &amp;quot;nous avons une inflation galopante&amp;quot;. Phrase Permutation Model We now define a model for the reordering of phrase sequences as determined by the previous model. The phrase alignment sequence a18 a11a7 specifies a reordering of phrases into target language phrase order; the words within the phrases remain in the source language order. The phrase sequence a21 a11a7 is re-ordered into a21 a0 a105 a8a10a21 a0a2a1 a8a130a49a52a49a51a49a51a8a10a21 a0 a82 . The phrase alignment sequence is modeled as a first order Markov process</Paragraph>
    <Paragraph position="31"> bution is constructed to assign decreasing likelihood to phrase re-orderings that diverge from the original word order. Suppose a21 a0a4a3 a30 a26a6a5</Paragraph>
    <Paragraph position="33"> In the above equations, a87 a54 is a tuning factor and we normalize the probabilities a0a2a1a32a18</Paragraph>
    <Paragraph position="35"> The finite state implementation of this model involves two acceptors. We first build a unweighted permutation acceptora23 a132 that contains all permutations of the phrase sequence a21 a11a7 in the source language (Knight and Al-Onaizan, 1998) . We note that a path througha23 a132 corresponds to an alignment sequence a18 a11a7 . Figure 3 shows the acceptora23 a132 for the source phrase sequence nous avons une inflation galopante.</Paragraph>
    <Paragraph position="36"> A source phrase sequence a131 of length a23 words requires a permutation acceptor a23 a132 of a68 a11 states. For long phrase sequences we compute a score a116a118a113a48a119 a18 a0a2a1a32a18</Paragraph>
    <Paragraph position="38"> a30a17a21a16a28 for each arc and then prune the arcs by this score, i.e. phrase alignments containing a18  a30 a99 are included only if this score is above a threshold. Pruning can therefore be applied whilea23  source-language phrase sequence nous avons une inflation galopante.</Paragraph>
    <Paragraph position="39"> The second acceptor a24 in the implementation of the phrase permutation model assigns alignment probabilities (Equation 13) to a given permutation a18 a11a7 of the source phrase sequence a21 a11a7 (Figure 4). In this example, the phrases in the source phrase sequence are specified as follows: a21 a7 a30 a26 a7 (nous), a21a48a46a2a30 a26a6a46 (avons) and a21a26a25a17a30 a26a11a27a25 (une inflation galopante). We now show the computation of some of the alignment probabilities (Equation 13) in this example (a87 a54 a30 a92 a49a29a28 )</Paragraph>
    <Paragraph position="41"> Normalizing these terms gives a0a2a1a4a18a39a25 a30a77a74a19a35 a18a19a46a100a30a40a31a43a28a31a30 a92 a49a41a43a42 and a0a2a1a4a18a30a25 a30 a68 a35 a18a19a46a100a30a32a31a16a28a31a30 a92 a49a36a35a37a31 .</Paragraph>
    <Paragraph position="42"> Template Sequence Model Here we describe the main component of the model. An alignment template</Paragraph>
    <Paragraph position="44"> a28 specifies the allowable alignments be-</Paragraph>
    <Paragraph position="46"> a54 is the class sequence for a21 .</Paragraph>
    <Paragraph position="47"> In Section 4.1, we will outline a procedure to build a library of alignment templates from bitext word-level alignments. Each template a14 a30a129a1 a57</Paragraph>
    <Paragraph position="49"> model has an index a99 in this template library. Therefore any operation that involves a mapping to (from) template sequences will be implemented as a mapping to (from) a sequence of these indices.</Paragraph>
    <Paragraph position="50"> We have described the segmentation and permutation processes that transform a source language sentence into phrases in target language phrase order. The next step is to generate a consistent sequence of alignment templates. We assume that the templates are conditionally independent of each other and depend only on the source language phrase which generated each of them</Paragraph>
    <Paragraph position="52"> We will implement this model using the transducer a0 that maps any permutation a21 a0 a105 a8a24a21 a0a1 a8a130a49a52a49a51a49a52a8a24a21 a0 a82 of the phrase sequence a21 a11a7 into a template sequence a14 a11a7 with probability as in Equation 14. For every phrase a21 , this transducer allows only the templates a14 that are consistent with a21 with probability a0a2a1a4a14a111a35 a21a22a28 , i.e. a0a2a1a4a14</Paragraph>
    <Paragraph position="54"> a28 enforces the consistency between each source phrase and alignment template. null Phrasal Translation Model We assume that a target phrase is generated independently by each alignment template and source phrase</Paragraph>
    <Paragraph position="56"> This allows us to describe the phrase-internal translation model a0a2a1a34a9 a35 a21a89a8a24a14a19a28 as follows. We assume that each word in the target phrase is produced independently and that the consistency is enforced between the words in a9 and the class sequence a57</Paragraph>
    <Paragraph position="58"> We now introduce the word alignment variables a3 a49 a8a15a99a29a30</Paragraph>
    <Paragraph position="60"> The term a0a2a1a32a3 a49 a35 a26 a18 a28 is a translation dictionary (Och and Ney, 2000) and a0a2a1 a3 a49 a30 a21a48a8 a2 a28 is obtained as</Paragraph>
    <Paragraph position="62"> We have assumed that a0a2a1 a3a11a49 a35 a21a89a8 a2 a28 a30 a0a2a1 a3 a49 a35a2 a28 , i.e. that given the template, word alignments do not depend on the source language phrase.</Paragraph>
    <Paragraph position="63"> For a given phrase a21 and a consistent alignment tem-</Paragraph>
    <Paragraph position="65"> a28 , a weighted acceptor a102 can be constructed to assign probability to translated phrases according to Equations 16 and 17. a102 is constructed from four component machines a11 , a12 , a13 and a14 , constructed as follows.</Paragraph>
    <Paragraph position="66"> The first acceptor a11 implements the alignment matrix a2 . It has a45 a47a40a74 states and between any pair of states a99a16a15a38a74 and a99 , each arca21 corresponds to a word alignment variable a3a6a49 a30a22a21 . Therefore the number of transitions between states a99 and a99 a47 a74 is equal to the number of non-zero values of a3 a49. Thea21a18a17a20a19 arc from state a99a21a15 a74 to a99 has probability</Paragraph>
    <Paragraph position="68"> The second machine a12 is an unweighted transducer that maps the index a99 a71a25a73 a92 a8a47a74a48a8a50a49a51a49a51a49a52a8a9a45 a75 in the phrase a21a17a30a91a26</Paragraph>
    <Paragraph position="70"> the corresponding word a26 a49.</Paragraph>
    <Paragraph position="71"> The third transducer is the lexicon transducer a13 that maps the source word a26a84a71 a131a23a22 to the target word a3a44a71 a131 a9 with probability a0a2a1a32a3a22a35 a26a12a28 .</Paragraph>
    <Paragraph position="72"> The fourth acceptor a14 is unweighted and allows all target word sequences a3</Paragraph>
    <Paragraph position="74"> a47a77a74 states. The number of transitions between states a99 a15 a74 and a99 is equal to the number of target language words with class specified by a57 a49.</Paragraph>
    <Paragraph position="75"> Figure 5 shows all the four component FSTs for building the transducer a102 corresponding to an alignment template from our library. Having built these four machines, we obtain a102 as follows. We first compose the four transducers, project the resulting transducer onto the output labels, and determinize it under the a1a47 a8 a44 a28 semiring. This is implemented using AT&amp;T FSM tools as follows fsmcompose O I D C a35 fsmproject -o a35 a0 fsmrmepsilon a35 fsmdeterminize a1 a102 .</Paragraph>
    <Paragraph position="76"> Given an alignment template a14 and a consistent source phrase a21 , we note that the composition and determinization operations assign the probability a0a2a1a34a9 a35 a14 a8a24a21a19a28 (Equation 16) to each consistent target phrase a9 . This summarizes the construction of a transducer for a single alignment template.</Paragraph>
    <Paragraph position="77"> We now implement a transducer a2 that maps sequences of alignment templates to target language word sequences. We identify all templates consistent with the phrases in the source language phrase sequence a21 a11a7 . The transducer a2 is constructed via the FSM union operation of the transducers that implement these templates.</Paragraph>
    <Paragraph position="78"> For the source phrase sequence a21</Paragraph>
    <Paragraph position="80"> une inflation galopante), we show the transducer a2 in Figure 6. Our example library consists of three templates a14 a7 , a14a56a46 and a14a20a25 . a14 a7 maps the source word nous to the target word we via the word alignment matrix  where a74a56a73a48a3 a5a7 a30a129a9 a11a7 a75 enforces the requirement that words in the translation agree with those in the phrase sequence. We note that a0 a9 a1a4a3 a5a7a130a28 is modeled as a standard backoff trigram language model (Stolcke, 2002). Such a language model can be easily compiled as a weighted finite state acceptor (Mohri et al., 2002).</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Alignment and Translation Via WFSTs
</SectionTitle>
    <Paragraph position="0"> We will now describe how the alignment template translation model can be used to perform word-level alignment of bitexts and translation of source language sentences.</Paragraph>
    <Paragraph position="1"> Given a source language sentence a26 a27a7 and a target sentence a3 a5a7 , the word-to-word alignment between the sentences can be found as  a75 specify the alignment between source phrases and target phrases while a110a14 a11a7 gives the word-to-word alignment within the phrase sequences. Given a source language sentence a26 a27a7 , the translation can be found as  where a110a3 a5a7 is the translation of a26 a27a7 . We implement the alignment and translation procedures in two steps. We first segment the source sentence into phrases, as described earlier  mentation process decomposes the source sentence a26 a27a7 into a phrase sequence a110a21 a112a11a7 . This process also tags each source phrase a110a21  with its position a16 in the phrase sequence. We will now describe the alignment and translation processes using finite state operations.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Bitext Word Alignment
</SectionTitle>
      <Paragraph position="0"> Given a collection of alignment templates, it is not guaranteed that every sentence pair in a bitext can be segmented into phrases for which there exist the consistent alignment templates needed to create an alignment between the sentences. We find in practice that this problem arises frequently enough that most sentence pairs are assigned a probability of zero under the template model. To overcome this limitation, we add several types of &amp;quot;dummy&amp;quot; templates to the library that serve to align phrases when consistent templates could not otherwise be found.</Paragraph>
      <Paragraph position="1"> The first type of dummy template we introduce allows any source phrase a110a21 a85 to align with any single word target phrase a9 a49. This template is defined as a triple  fied to be ones. The second type of dummy template allows source phrases to be deleted during the alignment process. For a source phrase a110a21  of template allows for insertions of single word target phrases. For a target phrase a9 a49 we specify this template as  these added templates are not estimated; they are fixed as a global constant which is set so as to discourage their use except when no other suitable templates are available. A lattice of possible alignments between a3 a5a7 and a26 a27a7 is then obtained by the finite state composition</Paragraph>
      <Paragraph position="3"> where a3 is an acceptor for the target sentence a3 a5a7 . We then  ; and insertions of target words a3 a49. To determine the word-level alignment between the sentences a3 a5a7 and a26 a27a7 ,we are primarily interested in the first of these types of alignments. Given that the  contains the the word-to-word alignment between these phrases.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Translation and Translation Lattices
</SectionTitle>
      <Paragraph position="0"> The lattice of possible translations of a26 a27a7 is obtained using the weighted finite state composition:</Paragraph>
      <Paragraph position="2"> The translation with the highest probability (Equation 20) can now be computed by obtaining the path with the highest score in a6 .</Paragraph>
      <Paragraph position="3"> In terms of AT&amp;T FSM tools, this can be done as fol- null A translation lattice (Ueffing et al., 2002) can be generated by pruning a6 based on likelihoods or number of states. Similarly, an alignment lattice can be generated by pruning a2 .</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML