A Weighted Finite State Transducer Implementation of the Alignment
Template Model for Statistical Machine Translation
Shankar Kumar and William Byrne
Center for Language and Speech Processing, Johns Hopkins University,
3400 North Charles Street, Baltimore, MD, 21218, USA
{skumar,byrne}@jhu.edu
Abstract
We present a derivation of the alignment tem-
plate model for statistical machine translation
and an implementation of the model using
weighted finite state transducers. The approach
we describe allows us to implement each con-
stituent distribution of the model as a weighted
finite state transducer or acceptor. We show
that bitext word alignment and translation un-
der the model can be performed with standard
FSM operations involving these transducers.
One of the benefits of using this framework
is that it obviates the need to develop special-
ized search procedures, even for the generation
of lattices or N-Best lists of bitext word align-
ments and translation hypotheses. We evaluate
the implementation of the model on the French-
to-English Hansards task and report alignment
and translation performance.
1 Introduction
The Alignment Template Translation Model
(ATTM) (Och et al., 1999) has emerged as a promising
modeling framework for statistical machine translation.
The ATTM attempts to overcome the deficiencies of
word-to-word translation models (Brown et al., 1993)
through the use of phrasal translations. The overall
model is based on a two-level alignment between the
source and the target sentence: a phrase-level alignment
between source and target phrases and a word-level
alignment between words in these phrase pairs.
The goal of this paper is to reformulate the ATTM so that the operations we
intend to perform under a statistical translation model, namely bitext word
alignment and translation, can be implemented using standard weighted finite
state transducer (WFST) operations. Our main motivation for a WFST modeling
framework lies in the resulting simplicity of the alignment and translation
processes compared to dynamic programming or $A^*$ decoders. The WFST
implementation allows us to use standard optimized algorithms available from
an off-the-shelf FSM toolkit (Mohri et al., 1997). This avoids the need to
develop specialized search procedures, even for the generation of lattices or
N-best lists of bitext word alignment or translation hypotheses.

[Figure 1 sketches the ATTM architecture: a source language sentence $f_1^J$
is segmented into source language phrases $u_1^K$ by the source segmentation
model, reordered by the phrase permutation model ($a_1^K$), mapped to
alignment templates $z_1^K$ by the template sequence model, translated into
target language phrases $v_1^K$ by the phrasal translation model, and scored
by the target language model to yield the target language sentence $e_1^I$.]
Figure 1: ATTM Architecture.
Weighted Finite State Transducers for Statistical Ma-
chine Translation (SMT) have been proposed in the
literature to implement word-to-word translation mod-
els (Knight and Al-Onaizan, 1998) or to perform trans-
lation in an application domain such as the call routing
task (Bangalore and Riccardi, 2001). One of the objec-
tives of these approaches has been to provide an imple-
mentation for SMT that uses standard FSM algorithms
to perform model computations and therefore make SMT
techniques accessible to a wider community. Our WFST
implementation of the ATTM has been developed with
similar objectives.
We start off by presenting a derivation of the ATTM
that identifies the conditional independence assumptions
that underlie the model. The derivation allows us to spec-
ify each component distribution of the model and imple-
ment it as a weighted finite state transducer. We then
show that bitext word alignment and translation can be
performed with standard FSM operations involving these
transducers. Finally we report bitext word alignment
and translation performance of the implementation on the
Canadian French-to-English Hansards task.
2 Alignment Template Translation Models
We present here a derivation of the alignment template
translation model (ATTM) (Och et al., 1999; Och, 2002)
and give an implementation of the model using weighted
finite state transducers (WFSTs). The finite state model-
ing is performed using the AT&T FSM Toolkit (Mohri et
al., 1997).
In this model, the translation of a source language sen-
tence to a target language sentence is described by a joint
probability distribution over all possible segmentations
and alignments. This distribution is presented in Figure 1
and Equations 1-7. The components of the overall trans-
lation model are the source language model (Term 2),
the source segmentation model (Term 3), the phrase per-
mutation model (Term 4), the template sequence model
(Term 5), the phrasal translation model (Term 6) and the
target language model (Term 7). Each of these condi-
tional distributions is modeled independently and we now
define each in turn and present its implementation as a
weighted finite state acceptor or transducer.
P(e_1^I, v_1^K, z_1^K, a_1^K, u_1^K, K, f_1^J) =                      (1)
    P(f_1^J)                                                           (2)
    \times P(u_1^K, K \mid f_1^J)                                      (3)
    \times P(a_1^K \mid u_1^K, K, f_1^J)                               (4)
    \times P(z_1^K \mid a_1^K, u_1^K, K, f_1^J)                        (5)
    \times P(v_1^K \mid z_1^K, a_1^K, u_1^K, K, f_1^J)                 (6)
    \times P(e_1^I \mid v_1^K, z_1^K, a_1^K, u_1^K, K, f_1^J)          (7)
We begin by distinguishing words and phrases. We assume that $v$ is a phrase
in the target language sentence that has length $M$ and consists of words
$e_1, e_2, \ldots, e_M$. Similarly, a phrase $u$ in the source language
sentence contains words $f_0, f_1, \ldots, f_N$, where $f_0$ is the NULL
token. We assume that each word in each language can be assigned to a unique
class so that $v$ unambiguously specifies a class sequence $E_1^M$ and $u$
specifies the class sequence $F_0^N$. Throughout the model, if a sentence
$f_1^J$ is segmented into phrases $u_1^K$, we say $u_1^K = f_1^J$ to indicate
that the words in the phrase sequence agree with the original sentence.
Source Language Model The model assigns probability to any sentence $f_1^J$
in the source language; this probability is not actually needed by the
translation process when $f_1^J$ is given. As the first component in the
model, a finite state acceptor $F$ is constructed for $f_1^J$.
Source Segmentation Model We introduce the phrase count random variable $K$
which specifies the number of phrases in a particular segmentation of the
source language sentence. For a sentence of length $J$, there are
$\binom{J-1}{K-1}$ ways to segment it into $K$ phrases. Motivated by this, we
choose the distribution $P(K \mid f_1^J)$ as

    P(K \mid f_1^J) = \binom{J-1}{K-1} / 2^{J-1},  K \in \{1, 2, \ldots, J\},   (8)

so that $\sum_K P(K \mid f_1^J) = 1$.
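As a quick check of Equation 8, the short Python sketch below (ours, not part
of the original implementation) computes $P(K \mid f_1^J)$ for a sentence of
length $J$ and verifies that the distribution normalizes; the binomial
numerators sum to $2^{J-1}$ over $K = 1, \ldots, J$.

```python
from math import comb

def segmentation_count_dist(J):
    """P(K | f_1^J) from Equation 8: binom(J-1, K-1) / 2^(J-1)."""
    return {K: comb(J - 1, K - 1) / 2 ** (J - 1) for K in range(1, J + 1)}

# J = 5, e.g. "nous avons une inflation galopante"
dist = segmentation_count_dist(5)
print(dist)  # {1: 0.0625, 2: 0.25, 3: 0.375, 4: 0.25, 5: 0.0625}
assert abs(sum(dist.values()) - 1.0) < 1e-12
```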
We construct a joint distribution over all phrase segmentations
$u_1^K = u_1, u_2, \ldots, u_K$ as

    P(u_1^K, K \mid f_1^J) = P(u_1^K \mid K, f_1^J) \, P(K \mid f_1^J)   (9)

where

    P(u_1^K \mid K, f_1^J) = \frac{1}{\alpha_K} \prod_{k=1}^{K} \phi(u_k)
        if u_1^K = f_1^J, and 0 otherwise.

The normalization constant
$\alpha_K = \sum_{\tilde{u}_1^K} \prod_{k=1}^{K} \phi(\tilde{u}_k)$ is chosen
so that $\sum_K \sum_{u_1^K} P(u_1^K, K \mid f_1^J) = 1$.
Here, $\phi(u_k)$ is a "unigram" distribution over source language phrases;
we assume that we have an inventory of phrases from which this quantity can
be estimated. In this way, the likelihood of a particular segmentation is
determined by the likelihood of the phrases that result.
We now describe the finite state implementation of the source segmentation
model and show how to compute the most likely segmentation under the model:

    \{\hat{u}_1^{\hat K}, \hat{K}\} =
        \operatorname{argmax}_{u_1^K, K} P(u_1^K \mid K, f_1^J) \, P(K \mid f_1^J).
1. For each source language sentence $f_1^J$ to be translated, we implement a
weighted finite state transducer $W$ that segments the sentence into all
possible phrase sequences $u_1^K$ permissible given the inventory of phrases.
The score of a segmentation $u_1^K$ under $W$ is given by
$\prod_{k=1}^{K} \phi(u_k)$. We then generate a lattice of segmentations of
$f_1^J$ (implemented as an acceptor $F$) by composing it with the transducer
$W$, i.e. $S = F \circ W$.
2. We then decompose $S$ into $J$ disjoint subsets $S_K$,
$K \in \{1, 2, \ldots, J\}$, with $\bigcup_{K=1}^{J} S_K = S$, so that $S_K$
contains all segmentations of the source language sentence with exactly $K$
phrases. To construct $S_K$, we create an unweighted acceptor $P_K$ that
accepts any phrase sequence of length $K$; for efficiency, the phrase
vocabulary is restricted to the phrases in $S$. $S_K$ is then obtained by the
finite state composition $S_K = S \circ P_K$.
3. For $K = 1, 2, \ldots, J$: the normalization factors $\alpha_K$ are
obtained by summing the probabilities of all segmentations in $S_K$. This sum
can be computed efficiently using lattice forward probabilities (Wessel et
al., 1998). For a fixed $K$, the most likely segmentation in $S_K$ is found
as

    \hat{u}_K = \operatorname{argmax}_{u_1^K \in S_K}
        \frac{1}{\alpha_K} \prod_{k=1}^{K} \phi(u_k).   (10)
4. Finally we select the optimal segmentation as

    \hat{u} = \operatorname{argmax}_{K \in \{1, 2, \ldots, J\}}
        P(\hat{u}_K \mid K, f_1^J) \, P(K \mid f_1^J).   (11)
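The lattice forward probabilities used in step 3 above can be computed in a
single pass over the acyclic segmentation lattice. The sketch below is our
illustration of that computation, run directly over word positions rather
than over the FSM objects; the dictionary `phi` is a hypothetical stand-in
for the phrase inventory, with scores echoing the example weights of
Figure 2. The value `forward_sums(...)[K]` is exactly the normalizer
$\alpha_K$ of Equation 10, and tracking an argmax alongside the sum yields
$\hat{u}_K$.

```python
def forward_sums(words, phi, max_len=5):
    """forward[j][k] = total mass (sum over segmentations of the first j
    words into exactly k phrases of the product of phi scores)."""
    J = len(words)
    forward = [[0.0] * (J + 1) for _ in range(J + 1)]
    forward[0][0] = 1.0
    for j in range(1, J + 1):
        for i in range(max(0, j - max_len), j):
            phrase = "_".join(words[i:j])
            if phrase in phi:                       # phrase inventory lookup
                for k in range(1, j + 1):
                    forward[j][k] += forward[i][k - 1] * phi[phrase]
    # alpha_K, the normalizer in Equation 10, is forward[J][K]
    return {K: forward[J][K] for K in range(1, J + 1) if forward[J][K] > 0}

# hypothetical unigram phrase scores, echoing Figure 2
phi = {"nous": 0.0024, "avons": 0.0003, "nous_avons_une": 5.82e-6,
       "une_inflation_galopante": 4.8e-7, "inflation_galopante": 4.8e-7}
print(forward_sums("nous avons une inflation galopante".split(), phi))
```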
A portion of the segmentation transducer $W$ for the French sentence nous
avons une inflation galopante is presented in Figure 2. When composed with
$F$, $W$ generates the following two phrase segmentations:
nous_avons_une inflation_galopante and nous avons une_inflation_galopante.
The "_" symbol is used to indicate phrases formed by concatenation of
consecutive words. The phrases specified by the source segmentation model
remain in the order that they appear in the source sentence.
[Figure 2 shows a fragment of $W$ with input arcs consuming the words nous,
avons, une, inflation and galopante, and weighted output arcs emitting the
phrases nous/0.0024, avons/0.0003, nous_avons_une/5.82e-6,
inflation_galopante/4.8e-7 and une_inflation_galopante/4.8e-7.]
Figure 2: A portion of the phrase segmentation transducer $W$ for the
sentence "nous avons une inflation galopante".
Phrase Permutation Model We now define a model for the reordering of phrase
sequences as determined by the previous model. The phrase alignment sequence
$a_1^K$ specifies a reordering of phrases into target language phrase order;
the words within the phrases remain in the source language order. The phrase
sequence $u_1^K$ is reordered into $u_{a_1}, u_{a_2}, \ldots, u_{a_K}$. The
phrase alignment sequence is modeled as a first order Markov process

    P(a_1^K \mid u_1^K, K, f_1^J) = P(a_1^K \mid u_1^K)   (12)
        = P(a_1) \prod_{k=2}^{K} P(a_k \mid a_{k-1}, u_1^K),
with $a_k \in \{1, 2, \ldots, K\}$. The alignment sequence distribution is
constructed to assign decreasing likelihood to phrase re-orderings that
diverge from the original word order. Suppose $u_{a_{k-1}} = f_{j_1}^{j_2}$
and $u_{a_k} = f_{j_1'}^{j_2'}$; we set the Markov chain probabilities as
follows (Och et al., 1999):

    P(a_k \mid a_{k-1}) \propto p_0^{|j_1' - j_2 - 1|}
    P(a_1 = j) = 1/K,  j \in \{1, 2, \ldots, K\}.   (13)

In the above equations, $p_0$ is a tuning factor and we normalize the
probabilities $P(a_k \mid a_{k-1})$ so that
$\sum_{j \notin \{a_1, \ldots, a_{k-1}\}} P(a_k = j \mid a_{k-1}) = 1$.
The finite state implementation of this model involves two acceptors. We
first build an unweighted permutation acceptor $\Pi_u$ that contains all
permutations of the phrase sequence $u_1^K$ in the source language (Knight
and Al-Onaizan, 1998). We note that a path through $\Pi_u$ corresponds to an
alignment sequence $a_1^K$. Figure 3 shows the acceptor $\Pi_u$ for the
source phrase sequence nous avons une_inflation_galopante.

A source phrase sequence of length $K$ requires a permutation acceptor
$\Pi_u$ of $2^K$ states. For long phrase sequences we compute a score
$\max_a P(a_k = i \mid a_{k-1} = j)$ for each arc and then prune the arcs by
this score, i.e. phrase alignments containing $a_k = i$ are included only if
this score is above a threshold. Pruning can therefore be applied while
$\Pi_u$ is constructed.
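The $2^K$ state count comes from the standard subset construction for
permutation automata (Knight and Al-Onaizan, 1998): each state records which
phrases have already been consumed. The sketch below is our illustration of
that construction; the arc-list representation is a hypothetical stand-in for
the FSM toolkit's data structures, and the score-based pruning described
above could be applied inside the inner loop.

```python
def permutation_acceptor(phrases):
    """Build the 2^K-state permutation acceptor as a list of arcs
    (state, next_state, label). States are bitmasks of consumed phrases;
    state 0 is initial and state 2^K - 1 is final."""
    K = len(phrases)
    arcs = []
    for state in range(2 ** K):
        for k in range(K):
            if not state & (1 << k):           # phrase k not yet consumed
                arcs.append((state, state | (1 << k), phrases[k]))
    return arcs

arcs = permutation_acceptor(["nous", "avons", "une_inflation_galopante"])
print(len(arcs))   # 12 arcs over 8 states for K = 3
```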
[Figure 3 shows an acceptor whose paths are the six permutations of the three
phrases nous, avons and une_inflation_galopante.]
Figure 3: The permutation acceptor $\Pi_u$ for the source-language phrase
sequence nous avons une_inflation_galopante.
The second acceptor $R$ in the implementation of the phrase permutation model
assigns alignment probabilities (Equation 13) to a given permutation $a_1^K$
of the source phrase sequence $u_1^K$ (Figure 4). In this example, the
phrases in the source phrase sequence are specified as follows:
$u_1 = f_1$ (nous), $u_2 = f_2$ (avons) and $u_3 = f_3^5$
(une_inflation_galopante). We now show the computation of some of the
alignment probabilities (Equation 13) in this example ($p_0 = 0.9$):

    P(a_3 = 1 \mid a_2 = 3) \propto p_0^{|1 - 5 - 1|} = 0.59
    P(a_3 = 2 \mid a_2 = 3) \propto p_0^{|2 - 5 - 1|} = 0.66.

Normalizing these terms gives $P(a_3 = 1 \mid a_2 = 3) = 0.47$ and
$P(a_3 = 2 \mid a_2 = 3) = 0.53$.

[Figure 4 shows a weighted acceptor whose paths are the permutations of the
three phrases, with arc weights such as nous/0.45, nous/0.47, avons/0.53,
une_inflation_galopante/0.55 and une_inflation_galopante/0.33.]
Figure 4: Acceptor $R$ that assigns probabilities to permutations of the
source language phrase sequence nous avons une_inflation_galopante
($p_0 = 0.9$).
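The arc weights in Figure 4 can be reproduced directly from Equation 13. The
following sketch (ours) computes the normalized successor distribution used
in the example above.

```python
def successor_probs(spans, prev, used, p0=0.9):
    """P(a_k = i | a_{k-1} = prev) per Equation 13, normalized over the
    phrases not yet used. spans[i] = (j1, j2), word positions of phrase i."""
    _, j2 = spans[prev]
    scores = {i: p0 ** abs(spans[i][0] - j2 - 1)
              for i in spans if i not in used}
    total = sum(scores.values())
    return {i: s / total for i, s in scores.items()}

# phrase spans for: nous = f_1, avons = f_2, une_inflation_galopante = f_3^5
spans = {1: (1, 1), 2: (2, 2), 3: (3, 5)}
print(successor_probs(spans, prev=3, used={3}))   # ~ {1: 0.47, 2: 0.53}
```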
Template Sequence Model Here we describe the main component of the model. An
alignment template $z = (E_1^M, F_0^N, A)$ specifies the allowable alignments
between the class sequences $E_1^M$ and $F_0^N$. $A$ is an $M \times (N+1)$
binary, 0/1 valued matrix which is constructed as follows: if $E_i$ can be
aligned to $F_j$, then $A_{ij} = 1$; otherwise $A_{ij} = 0$. This process may
allow $E_i$ to align with the NULL token $F_0$, i.e. $A_{i0} = 1$, so that
words can be freely inserted in translation. Given a pair of class sequences
$E_1^M$ and $F_0^N$, we specify exactly one matrix $A$.

We say that $z = (E_1^M, F_0^N, A)$ is consistent with the target language
phrase $v$ and the source language phrase $u$ if $E_1^M$ is the class
sequence for $v$ and $F_0^N$ is the class sequence for $u$.
In Section 4.1, we will outline a procedure to build a library of alignment
templates from bitext word-level alignments. Each template
$z = (E_1^M, F_0^N, A)$ used in our model has an index $i$ in this template
library. Therefore any operation that involves a mapping to (from) template
sequences will be implemented as a mapping to (from) a sequence of these
indices.
We have described the segmentation and permutation processes that transform a
source language sentence into phrases in target language phrase order. The
next step is to generate a consistent sequence of alignment templates. We
assume that the templates are conditionally independent of each other and
depend only on the source language phrase which generated each of them:

    P(z_1^K \mid a_1^K, u_1^K, K, f_1^J)
        = \prod_{k=1}^{K} P(z_k \mid a_1^K, u_1^K, K, f_1^J)
        = \prod_{k=1}^{K} P(z_k \mid u_{a_k}).   (14)
We will implement this model using a transducer $Y$ that maps any permutation
$u_{a_1}, u_{a_2}, \ldots, u_{a_K}$ of the phrase sequence $u_1^K$ into a
template sequence $z_1^K$ with probability as in Equation 14. For every
phrase $u$, this transducer allows only the templates $z$ that are consistent
with $u$, with probability $P(z \mid u)$; i.e. $P(z_k \mid u_{a_k})$ enforces
the consistency between each source phrase and alignment template.
Phrasal Translation Model We assume that a target phrase is generated
independently by each alignment template and source phrase:

    P(v_1^K \mid z_1^K, a_1^K, u_1^K, K, f_1^J)
        = \prod_{k=1}^{K} P(v_k \mid z_1^K, a_1^K, u_1^K, K, f_1^J)
        = \prod_{k=1}^{K} P(v_k \mid z_k, u_{a_k}).   (15)
This allows us to describe the phrase-internal translation model
$P(v \mid z, u)$ as follows. We assume that each word in the target phrase is
produced independently and that consistency is enforced between the words in
$v$ and the class sequence $E_1^M$, so that $P(e_i \mid z, u) = 0$ if
$e_i \notin E_i$.
We now introduce the word alignment variables $b_i$, $i = 1, 2, \ldots, M$,
where $b_i = j$ indicates that $e_i$ is aligned to $f_j$ within $v$ and $u$.
    P(v \mid z = (E_1^M, F_0^N, A), u)
        = \prod_{i=1}^{M} P(e_i \mid z, u)
        = \prod_{i=1}^{M} \sum_{j=0}^{N} P(e_i, b_i = j \mid z, u)
        = \prod_{i=1}^{M} \sum_{j=0}^{N}
            P(e_i \mid b_i = j, z, u) \, P(b_i = j \mid A, u)
        = \prod_{i=1}^{M} \sum_{j=0}^{N}
            P(e_i \mid f_j) \, P(b_i = j \mid A) \, 1_{E_i}(e_i).   (16)
The term $P(e_i \mid f_j)$ is a translation dictionary (Och and Ney, 2000)
and $P(b_i = j \mid A)$ is obtained as

    P(b_i = j \mid A) = A_{ij} / \sum_{j'=0}^{N} A_{ij'}.   (17)

We have assumed that $P(b_i \mid u, A) = P(b_i \mid A)$, i.e. that given the
template, word alignments do not depend on the source language phrase.
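To make Equations 16 and 17 concrete, the sketch below (ours) scores a target
phrase against a source phrase and template. The alignment matrix is our
reading of the run away inflation example shown later in Figure 5, the
dictionary values echo the arc weights in that figure, and class consistency
is omitted for brevity, i.e. we take $1_{E_i}(e_i) = 1$ throughout.

```python
def phrase_translation_prob(v, u, A, dictionary):
    """P(v | z, u) per Equations 16 and 17. v: target words e_1..e_M,
    u: source words f_0..f_N with u[0] = 'NULL', A: M x (N+1) 0/1 matrix."""
    prob = 1.0
    for i, e in enumerate(v):
        row = A[i]
        # Equation 17: P(b_i = j | A) = A_ij / sum_j' A_ij'
        align = [a / sum(row) for a in row]
        # Equation 16: sum over source positions j of P(e_i|f_j) P(b_i=j|A)
        prob *= sum(dictionary.get((e, f), 0.0) * align[j]
                    for j, f in enumerate(u))
    return prob

u = ["NULL", "une", "inflation", "galopante"]
v = ["run", "away", "inflation"]
A = [[0, 0, 0, 1],        # run       <- galopante
     [1, 0, 0, 0],        # away      <- NULL
     [0, 0, 1, 1]]        # inflation <- inflation or galopante
dictionary = {("run", "galopante"): 0.50, ("away", "NULL"): 0.01,
              ("inflation", "inflation"): 0.85, ("inflation", "galopante"): 0.04}
print(phrase_translation_prob(v, u, A, dictionary))
# 0.50 * 0.01 * (0.5*0.85 + 0.5*0.04) = 0.002225
```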
For a given phrase $u$ and a consistent alignment template
$z = (E_1^M, F_0^N, A)$, a weighted acceptor $Z$ can be constructed to assign
probability to translated phrases according to Equations 16 and 17. $Z$ is
constructed from four component machines O, I, D and C, built as follows.

The first acceptor O implements the alignment matrix $A$. It has $M + 1$
states, and between any pair of states $i - 1$ and $i$ each arc corresponds
to a word alignment variable $b_i = j$; the number of transitions between
states $i - 1$ and $i$ is therefore equal to the number of non-zero entries
in the $i$th row of $A$. The arc for $b_i = j$ has probability
$P(b_i = j \mid A)$ (Equation 17).

The second machine I is an unweighted transducer that maps the index
$j \in \{0, 1, \ldots, N\}$ in the phrase $u = f_0^N$ to the corresponding
word $f_j$.

The third transducer is the lexicon transducer D that maps a source word $f$
to a target word $e$ with probability $P(e \mid f)$.

The fourth acceptor C is unweighted and allows all target word sequences
$e_1^M$ that can be specified by the class sequence $E_1^M$. C has $M + 1$
states; the number of transitions between states $i - 1$ and $i$ is equal to
the number of target language words with the class specified by $E_i$.

[Figure 5 gives these component machines for the template that maps
F = une inflation galopante to E = run away inflation, with lexicon arcs such
as galopante:run/0.50, galopante:inflation/0.04, inflation:inflation/0.85 and
NULL:away/0.01.]
Figure 5: Component transducers used to construct the acceptor $Z$ for an
alignment template $z$.
Figure 5 shows all four component FSTs for building the acceptor $Z$
corresponding to an alignment template from our library. Having built these
four machines, we obtain $Z$ as follows: we first compose the four
transducers, project the resulting transducer onto the output labels, and
determinize it under the $(+, \times)$ semiring. This is implemented using
the AT&T FSM tools as

    fsmcompose O I D C | fsmproject -o | fsmrmepsilon | fsmdeterminize > Z

Given an alignment template $z$ and a consistent source phrase $u$, we note
that the composition and determinization operations assign the probability
$P(v \mid z, u)$ (Equation 16) to each consistent target phrase $v$. This
summarizes the construction of a transducer for a single alignment template.
We now implement a transducer $\Omega$ that maps sequences of alignment
templates to target language word sequences. We identify all templates
consistent with the phrases in the source language phrase sequence $u_1^K$.
The transducer $\Omega$ is constructed via the FSM union operation over the
transducers that implement these templates.
For the source phrase sequence $u_1^3$ (nous avons une_inflation_galopante),
we show the transducer $\Omega$ in Figure 6. Our example library consists of
three templates $z_1$, $z_2$ and $z_3$. $z_1$ maps the source word nous to
the target word we via the word alignment matrix $A$ specified by $b_1 = 1$.
$z_2$ maps the source word avons to the target phrase have a via the word
alignment matrix $A$ specified by $b_1 = 1$, $b_2 = 0$. $z_3$ maps the source
phrase une_inflation_galopante to the target phrase run away inflation via
the word alignment matrix $A$ specified by $b_1 = 3$, $b_2 = 0$,
$b_3 \in \{2, 3\}$.

[Figure 6 shows $\Omega$ as a union of three template acceptors $Z_1$, $Z_2$
and $Z_3$, with arcs such as ε:we/0.72, ε:have/0.42, ε:a/0.07, ε:run/0.5,
ε:away/0.01 and ε:inflation/0.44.]
Figure 6: Transducer $\Omega$ that maps the template sequence $z_1^3$ into
target phrase sequences $v_1^3$.

$\Omega$ is built out of the three component acceptors $Z_1$, $Z_2$ and
$Z_3$. The acceptor $Z_k$ corresponds to the mapping from the template $z_k$
and the source phrase $u_k$ to all consistent target phrases $v_k$.
Target Language Model We specify this model as

    P(e_1^I \mid v_1^K, z_1^K, a_1^K, u_1^K, K, f_1^J)
        = P_e(e_1^I) \, 1\{e_1^I = v_1^K\},

where $1\{e_1^I = v_1^K\}$ enforces the requirement that words in the
translation agree with those in the phrase sequence. We note that
$P_e(e_1^I)$ is modeled as a standard backoff trigram language model
(Stolcke, 2002). Such a language model can be easily compiled as a weighted
finite state acceptor $L$ (Mohri et al., 2002).
3 Alignment and Translation Via WFSTs
We will now describe how the alignment template trans-
lation model can be used to perform word-level alignment
of bitexts and translation of source language sentences.
Given a source language sentence $f_1^J$ and a target sentence $e_1^I$, the
word-to-word alignment between the sentences can be found as

    \{\hat{v}_1^{\hat K}, \hat{z}_1^{\hat K}, \hat{a}_1^{\hat K},
      \hat{u}_1^{\hat K}, \hat{K}\}
        = \operatorname{argmax}_{v_1^K, z_1^K, a_1^K, u_1^K, K}
          P(v_1^K, z_1^K, a_1^K, u_1^K, K \mid e_1^I, f_1^J).

The variables $\{\hat{v}_1^{\hat K}, \hat{a}_1^{\hat K}, \hat{u}_1^{\hat K},
\hat{K}\}$ specify the alignment between source phrases and target phrases,
while $\hat{z}_1^{\hat K}$ gives the word-to-word alignment within the phrase
sequences.
Given a source language sentence $f_1^J$, the translation can be found as

    \{\hat{e}_1^{\hat I}, \hat{v}_1^{\hat K}, \hat{z}_1^{\hat K},
      \hat{a}_1^{\hat K}, \hat{u}_1^{\hat K}, \hat{K}\}
        = \operatorname{argmax}_{e_1^I, v_1^K, z_1^K, a_1^K, u_1^K, K}
          P(e_1^I, v_1^K, z_1^K, a_1^K, u_1^K, K \mid f_1^J),

where $\hat{e}_1^{\hat I}$ is the translation of $f_1^J$.
We implement the alignment and translation procedures in two steps. We first
segment the source sentence into phrases, as described earlier:

    \{\hat{u}_1^{\hat K}, \hat{K}\}
        = \operatorname{argmax}_{u_1^K, K}
          P(u_1^K \mid K, f_1^J) \, P(K \mid f_1^J).   (18)
After segmenting the source sentence, the alignment of a sentence pair
$(e_1^I, f_1^J)$ is obtained as

    \{\hat{v}_1^{\hat K}, \hat{z}_1^{\hat K}, \hat{a}_1^{\hat K}\}
        = \operatorname{argmax}_{v_1^{\hat K}, z_1^{\hat K}, a_1^{\hat K}}
          P(v_1^{\hat K}, z_1^{\hat K}, a_1^{\hat K} \mid
            \hat{u}_1^{\hat K}, \hat{K}, f_1^J, e_1^I).   (19)
The translation is obtained in the same way as

    \{\hat{e}_1^{\hat I}, \hat{v}_1^{\hat K}, \hat{z}_1^{\hat K},
      \hat{a}_1^{\hat K}\}
        = \operatorname{argmax}_{e_1^I, v_1^{\hat K}, z_1^{\hat K},
                                 a_1^{\hat K}}
          P(e_1^I, v_1^{\hat K}, z_1^{\hat K}, a_1^{\hat K} \mid
            \hat{u}_1^{\hat K}, \hat{K}, f_1^J).   (20)
We have described how to compute the optimal segmentation
$\{\hat{u}_1^{\hat K}, \hat{K}\}$ (Equation 18) in Section 2. The
segmentation process decomposes the source sentence $f_1^J$ into a phrase
sequence $\hat{u}_1^{\hat K}$. This process also tags each source phrase
$\hat{u}_k$ with its position $k$ in the phrase sequence. We will now
describe the alignment and translation processes using finite state
operations.
3.1 Bitext Word Alignment
Given a collection of alignment templates, it is not guaranteed that every
sentence pair in a bitext can be segmented into phrases for which consistent
alignment templates exist to create an alignment between the sentences.
lem arises frequently enough that most sentence pairs
are assigned a probability of zero under the template
model. To overcome this limitation, we add several types
of “dummy” templates to the library that serve to align
phrases when consistent templates could not otherwise
be found.
The first type of dummy template we introduce allows any source phrase
$\hat{u}_k$ to align with any single word target phrase $v_i$. This template
is defined as a triple $z_{ik} = (v_i, \hat{u}_k, A)$, where
$k \in \{1, 2, \ldots, \hat{K}\}$ and $i \in \{1, 2, \ldots, I\}$. All the
entries of the matrix $A$ are specified to be ones. The second type of dummy
template allows source phrases to be deleted during the alignment process.
For a source phrase $\hat{u}_k$ we specify this template as
$z_k = (\hat{u}_k, \epsilon)$, $k = 1, 2, \ldots, \hat{K}$. The third type of
template allows for insertions of single word target phrases. For a target
phrase $v_i$ we specify this template as $z_i = (\epsilon, v_i)$,
$i = 1, 2, \ldots, I$. The probabilities $P(z \mid u)$ for these added
templates are not estimated; they are fixed as a global constant which is set
so as to discourage their use except when no other suitable templates are
available.
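In a lattice implementation these dummy templates simply contribute extra
arcs with a fixed score. The sketch below (ours; the record layout and the
value of `DUMMY_PROB` are hypothetical) enumerates the dummy templates added
for one sentence pair, with the empty string marking an epsilon phrase.

```python
DUMMY_PROB = 1e-9   # hypothetical fixed P(z|u), chosen to discourage use

def dummy_templates(source_phrases, target_words):
    """Enumerate the three kinds of dummy templates of Section 3.1 as
    (source_phrase, target_phrase, prob) records; '' marks epsilon."""
    templates = []
    for u in source_phrases:
        for v in target_words:
            templates.append((u, v, DUMMY_PROB))   # any phrase <-> any word
        templates.append((u, "", DUMMY_PROB))      # source phrase deletion
    for v in target_words:
        templates.append(("", v, DUMMY_PROB))      # target word insertion
    return templates

ts = dummy_templates(["nous", "avons", "une_inflation_galopante"],
                     ["we", "have", "runaway", "inflation"])
print(len(ts))   # 3*4 + 3 + 4 = 19
```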
A lattice of possible alignments between $e_1^I$ and $f_1^J$ is then obtained
by the finite state composition

    B = \Pi_{\hat{u}} \circ R \circ Y \circ \Omega \circ E,   (21)

where $E$ is an acceptor for the target sentence $e_1^I$. We then compute the
ML alignment (Equation 19) by obtaining the path with the highest probability
in $B$. This path determines three types of alignments: phrasal alignments
between a source phrase $\hat{u}_k$ and a target phrase $\hat{v}_k$;
deletions of source phrases $\hat{u}_k$; and insertions of target words
$e_i$. To determine the word-level alignment between the sentences $e_1^I$
and $f_1^J$, we are primarily interested in the first of these types of
alignments. Given that the source phrase $\hat{u}_k$ has aligned to the
target phrase $\hat{v}_k$, we look up the hidden template variable
$\hat{z}_k$ that yielded this alignment; $\hat{z}_k$ contains the
word-to-word alignment between these phrases.
3.2 Translation and Translation Lattices
The lattice of possible translations of $f_1^J$ is obtained using the
weighted finite state composition

    T = \Pi_{\hat{u}} \circ R \circ Y \circ \Omega \circ L.   (22)

The translation with the highest probability (Equation 20) can now be
computed by obtaining the path with the highest score in $T$. In terms of the
AT&T FSM tools, this can be done as

    fsmbestpath T | fsmproject -o | fsmrmepsilon > Ê

A translation lattice (Ueffing et al., 2002) can be generated by pruning $T$
based on likelihoods or number of states. Similarly, an alignment lattice can
be generated by pruning $B$.
4 Translation and Alignment Experiments
We now evaluate this implementation of the alignment
template translation model.
4.1 Building the Alignment Template Library
To create the template library, we follow the procedure reported in Och
(2002). We first obtain word alignments of the bitext using IBM-4 translation
models trained in each translation direction (IBM-4 F and IBM-4 E), and then
form the union of these alignments (IBM-4 F∪E). We extract the library of
alignment templates from the bitext alignment using the phrase-extract
algorithm reported in Och (2002). This procedure identifies several alignment
templates $z = (E_1^M, F_0^N, A)$ that are consistent with a source phrase
$u$. We do not use word classes in the experiments reported here; therefore
templates are specified by phrases rather than by class sequences. For a
given pair of source and target phrases, we retain only the matrix of
alignments that occurs most frequently in the training corpus. This is
consistent with the intended application of these templates for translation
and alignment under the maximum likelihood criterion; in the current
formulation, only one alignment will survive in any application of the models
and there is no reason to retain any of the less frequently occurring
alignments. We estimate the probability $P(z \mid u)$ by the relative
frequency of phrasal translations found in bitext alignments. To restrict the
memory requirements of the model, we extract only the templates which have at
most 5 words in the source phrase. Furthermore, we restrict ourselves to the
templates which have a probability $P(z \mid u) > 0.01$ for some source
phrase $u$.
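The phrase-extract procedure collects all phrase pairs whose alignment links
stay entirely within the pair. The sketch below is our simplified
illustration of that consistency check (contiguous phrases only, no NULL
handling, and none of the count or probability cutoffs described above).

```python
def phrase_extract(alignment, J, I, max_len=5):
    """alignment: set of (j, i) word-alignment links between a source
    sentence of length J and a target sentence of length I. Returns
    source/target span pairs consistent with the alignment."""
    pairs = []
    for j1 in range(J):
        for j2 in range(j1, min(J, j1 + max_len)):
            # target positions linked to the source span [j1, j2]
            ts = [i for (j, i) in alignment if j1 <= j <= j2]
            if not ts:
                continue
            i1, i2 = min(ts), max(ts)
            # consistent iff no link leaves the box in either direction
            if all(j1 <= j <= j2 for (j, i) in alignment if i1 <= i <= i2):
                pairs.append(((j1, j2), (i1, i2)))
    return pairs

# toy alignment: nous-we, avons-have, une..galopante -- runaway inflation
links = {(0, 0), (1, 1), (2, 3), (3, 3), (4, 2)}
print(phrase_extract(links, J=5, I=4))
```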
4.2 Bitext Word Alignment
We present results on the French-to-English Hansards
translation task (Och and Ney, 2000). We measured
the alignment performance using precision, recall, and
Alignment Error Rate (AER) metrics (Och and Ney,
2000).
Our training set is a subset of the Canadian Hansards which consists of
50,000 French-English sentence pairs (Och and Ney, 2000). The English side of
the bitext had a total of 743,633 words (18,430 unique tokens) and the French
side contained 816,545 words (24,096 unique tokens). Our template library
consisted of 1,071,128 templates.
Our test set consists of 500 unseen French sentences
from Hansards for which both reference translations and
word alignments are available (Och and Ney, 2000). We
present the results under the ATTM in Table 1, where we
distinguish word alignments produced by the templates
from the template library against those produced by the
templates introduced for alignment in Section 3.1. For
comparison, we also align the bitext using IBM-4 trans-
lation models.
Model Alignment Metrics (%)
Precision Recall AER
IBM-4 F 88.9 89.8 10.8
IBM-4 E 89.2 89.4 10.7
IBM-4 F∪E 84.3 93.8 12.3
ATTM-C 64.2 63.8 36.2
ATTM-A 94.5 55.8 27.3
Table 1: Alignment Performance on the French-to-
English Hansards Alignment Task.
We first observe that the complete set of word align-
ments generated by the ATTM (ATTM-C) is relatively
poor. However, when we consider only those word align-
ments generated by actual alignment templates (ATTM-
A) (and discard the alignments generated by the dummy
templates introduced as described in Section 3.1), we
obtain very high alignment precision. This implies that
word alignments within the templates are very accurate.
However, the poor performance under the recall measure
suggests that the alignment template library has relatively
poor coverage of the phrases in the alignment test set.
4.3 Translation and Lattice Quality
We next measured the translation performance of ATTM
on the same test set. The translation performance was
measured using the BLEU (Papineni et al., 2001) and the
NIST MT-eval metrics (Doddington, 2002), and Word Er-
ror Rate (WER). The target language model was a trigram
language model with modified Kneser-Ney smoothing
trained on the English side of the bitext using the SRILM
toolkit (Stolcke, 2002). The performance of the model is
reported in Table 2. For comparison, we also report per-
formance of the IBM-4 translation model trained on the
same corpus. The IBM Model-4 translations were ob-
tained using the ReWrite decoder (Marcu and Germann,
2002). The results in Table 2 show that the alignment
Model BLEU NIST WER (%)
IBM-4 0.1711 5.0823 67.5
ATTM 0.1941 5.3337 64.7
Table 2: Translation Performance on the French-to-
English Hansards Translation Task.
template model outperforms the IBM Model 4 under all
three metrics. This verifies that WFST implementation of
the ATTM can obtain a performance that compares favor-
ably to other well known research tools.
We generate N-best lists from each translation lattice, and show the
variation of their oracle-best BLEU scores in Table 3. We observe that the
oracle-best BLEU score increases with the size of the N-best list. We can
therefore expect to rescore these lattices with more sophisticated models and
achieve improvements in translation quality.

Size of N-best list:  1       10      100     400     1000
BLEU:                 0.1941  0.2264  0.2550  0.2657  0.2735

Table 3: Variation of oracle-best BLEU scores on N-best lists generated by
the ATTM.
5 Discussion
The main motivation for our investigation into this WFST
modeling framework for statistical machine translation
lies in the simplicity of the alignment and translation pro-
cesses relative to other dynamic programming or $A^*$
decoders (Och, 2002). Once the components of the align-
ment template translation model are implemented as WF-
STs, alignment and translation can be performed using
standard FSM operations that have already been imple-
mented and optimized. It is not necessary to develop spe-
cialized search procedures, even for the generation of lat-
tices and N-best lists of alignment and translation alter-
natives.
The derivation of the ATTM was presented with the in-
tent of clearly identifying the conditional independence
assumptions that underlie the WFST implementation.
This approach leads to modular implementations of the
component distributions of the translation model. These
components can be refined and improved by changing the
corresponding transducers without requiring changes to
the overall search procedure. However some of the mod-
eling assumptions are extremely strong. We note in par-
ticular that segmentation and translation are carried out
independently in that phrase segmentation is followed by
phrasal translation; performing these steps independently
can easily lead to search errors.
It is a strength of the ATTM that it can be directly
constructed from available bitext word alignments. How-
ever this construction should only be considered an ini-
tialization of the ATTM model parameters. Alignment
and translation can be expected to improve as the model
is refined and in future work we will investigate iterative
parameter estimation procedures.
We have presented a novel approach to generate align-
ments and alignment lattices under the ATTM. These lat-
tices will likely be very helpful in developing ATTM pa-
rameter estimation procedures, in that they can be used
to provide conditional distributions over the latent model
variables. We have observed that poor coverage of
the test set by the template library may be why the over-
all word alignments produced by the ATTM are relatively
poor; we will therefore also explore new strategies for
template selection.
The alignment template model is a powerful model-
ing framework for statistical machine translation. It is
our goal to improve its performance through new training
procedures while refining the basic WFST architecture.
Acknowledgments
We would like to thank F. J. Och of ISI, USC for pro-
viding us the GIZA++ SMT toolkit, the mkcls toolkit to
train word classes, the Hansards 50K training and test
data, and the reference word alignments and AER met-
ric software. We thank AT&T Labs - Research for use
of the FSM Toolkit and Andreas Stolcke for use of the
SRILM Toolkit. This work was supported by an ONR
MURI grant N00014-01-1-0685.
References
S. Bangalore and G. Riccardi. 2001. A finite-state ap-
proach to machine translation. In Proc. of the North
American Chapter of the Association for Computa-
tional Linguistics, Pittsburgh, PA, USA.
P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, and
R. L. Mercer. 1993. The mathematics of statistical
machine translation: Parameter estimation. Computa-
tional Linguistics, 19(2):263–311.
G. Doddington. 2002. Automatic evaluation of machine
translation quality using n-gram co-occurrence statis-
tics. In Proc. of HLT 2002, San Diego, CA. USA.
K. Knight and Y. Al-Onaizan. 1998. Translation with
finite-state devices. In Proc. of the AMTA Conference,
pages 421–437, Langhorne, PA, USA.
D. Marcu and U. Germann, 2002. The ISI ReWrite
Decoder Release 0.7.0b. http://www.isi.edu/licensed-
sw/rewrite-decoder/.
M. Mohri, F. Pereira, and M. Riley, 1997. AT&T
General-purpose finite-state machine software tools.
http://www.research.att.com/sw/tools/fsm/.
M. Mohri, F. Pereira, and M. Riley. 2002. Weighted
finite-state transducers in speech recognition. Com-
puter Speech and Language, 16(1):69–88.
F. Och and H. Ney. 2000. Improved statistical alignment
models. In Proc. of ACL-2000, pages 440–447, Hong
Kong, China.
F. Och, C. Tillmann, and H. Ney. 1999. Improved align-
ment models for statistical machine translation. In
Proc. of the Joint Conf. of Empirical Methods in Nat-
ural Language Processing and Very Large Corpora,
pages 20–28, College Park, MD, USA.
F. Och. 2002. Statistical Machine Translation: From
Single Word Models to Alignment Templates. Ph.D.
thesis, RWTH Aachen, Germany.
K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2001.
Bleu: a method for automatic evaluation of machine
translation. Technical Report RC22176 (W0109-022),
IBM Research Division.
A. Stolcke. 2002. SRILM – an extensible language mod-
eling toolkit. In Proc. of the International Conference
on Spoken Language Processing, pages 901–904, Den-
ver, CO, USA. http://www.speech.sri.com/projects/srilm/.
N. Ueffing, F. Och, and H. Ney. 2002. Generation of
word graphs in statistical machine translation. In Proc.
of the Conference on Empirical Methods in Natural
Language Processing, pages 156–163, Philadelphia,
PA, USA.
F. Wessel, K. Macherey, and R. Schlueter. 1998. Using
word probabilities as confidence measures. In Proc. of
ICASSP-98, pages 225–228, Seattle, WA, USA.
