Efficient Decoding for Statistical Machine Translation
with a Fully Expanded WFST Model
Hajime Tsukada
NTT Communication Science Labs.
2-4 Hikaridai Seika-cho Soraku-gun
Kyoto 619-0237
Japan
tsukada@cslab.kecl.ntt.co.jp
Masaaki Nagata
NTT Cyber Space Labs.
1-1 Hikari-no-Oka Yokosuka-shi
Kanagawa 239-0847
Japan
nagata.masaaki@lab.ntt.co.jp
Abstract
This paper proposes a novel method to compile sta-
tistical models for machine translation to achieve
efficient decoding. In our method, each statistical
submodel is represented by a weighted finite-state
transducer (WFST), and all of the submodels are ex-
panded into a composition model beforehand. Fur-
thermore, the ambiguity of the composition model
is reduced by the statistics of hypotheses while de-
coding. The experimental results show that the pro-
posed model representation drastically improves the
efficiency of decoding compared to the dynamic
composition of the submodels, which corresponds
to conventional approaches.
1 Introduction
Recently, research on statistical machine translation
has grown along with the increase in computational
power as well as the amount of bilingual corpora.
The basic idea of modeling machine translation was
proposed by Brown et al. (1993), who assumed that
machine translation can be modeled as a noisy channel: a source-language sentence is generated from a target-language sentence through the channel, and translation is performed by decoding from the source language back to the target language.
Knight (1999) showed that the translation prob-
lem defined by Brown et al. (1993) is NP-
complete. Therefore, with this model it is al-
most impossible to search for optimal solutions in
the decoding process. Several studies have pro-
posed methods for searching suboptimal solutions.
Berger et al. (1996) and Och et al. (2001) proposed depth-first search methods such as stack decoding. Wang and Waibel (1997) and Tillmann and Ney (2003) proposed breadth-first search methods, i.e., beam search. Germann et al. (2001) and Watanabe and Sumita (2003) proposed greedy decoding methods. In all of these search algorithms, better
representation of the statistical model in systems
can improve the search efficiency.
For model representation, a search method based
on weighted finite-state transducer (WFST) (Mohri
et al., 2002) has achieved great success in the speech
recognition field. The basic idea is that each statis-
tical model is represented by a WFST and they are
composed beforehand; the composed model is op-
timized by WFST operations such as determiniza-
tion and minimization. This fully expanded model
permits efficient searches. Our motivation is to ap-
ply this approach to machine translation. However,
WFST optimization operations such as determiniza-
tion are nearly impossible to apply to the WFSTs of machine translation because these are much more ambiguous than those of speech recognition. To reduce the am-
biguity, we propose a WFST optimization method
that considers the statistics of hypotheses while de-
coding.
Some approaches have applied WFST to sta-
tistical machine translation. Knight and Al-
Onaizan (1998) proposed the representation of
IBM model 3 with WFSTs; Bangalore and Ric-
cardi (2001) studied WFST models in call-routing
tasks, and Kumar and Byrne (2003) modeled
phrase-based translation by WFSTs. All of these
studies mainly focused on the representation of each
submodel used in machine translation. However,
few studies have focused on the integration of each
WFST submodel to improve the decoding efficiency
of machine translation.
To this end, we propose a method that expands
all of the submodels into a composition model, re-
ducing the ambiguity of the expanded model by the
statistics of hypotheses while decoding. First, we
explain the translation model (Brown et al., 1993;
Knight and Al-Onaizan, 1998) that we used as a
base for our decoding research. Second, our pro-
posed method is introduced. Finally, experimental
results show that our proposed method drastically
improves decoding efficiency.
2 IBM Model
For our decoding research, we assume the IBM-
style modeling for translation proposed in Brown et
al. (1993). In this model, translation from Japanese
J to English E attempts to find the E that maximizes P(E|J). Using Bayes' rule, this is rewritten as

\[
\arg\max_E P(E \mid J) = \arg\max_E P(J \mid E)\, P(E),
\]

where P(E) is referred to as a language model and P(J|E) is referred to as a translation model. In this paper, we use a word trigram for the language model and IBM model 3 for the translation model.

The translation model is represented as follows, considering all possible word alignments A:

\[
P(J \mid E) = \sum_{A} P(J, A \mid E).
\]

The IBM model only assumes a one-to-many word alignment, where the Japanese word J_j in the j-th position connects to the English word E_{a_j} in the a_j-th position.
IBM model 3 uses the following P(J, A|E):

\[
P(J, A \mid E) = \binom{m - \phi_0}{\phi_0}\, p_0^{\,m - 2\phi_0}\, (1 - p_0)^{\phi_0}
\prod_{i=1}^{l} \phi_i!\; n(\phi_i \mid E_i)
\prod_{j=1}^{m} t(J_j \mid E_{a_j})\; d(j \mid a_j, l, m). \tag{1}
\]
Here φ_i is the number of J words connecting to E_i and is called the fertility of E_i. Note, however, that φ_0 is the number of words connecting to the null word. n(φ|E_i) is the conditional probability that English word E_i connects to φ words in J and is called the fertility probability. t(J_j|E_i) is the conditional probability that English word E_i is translated into Japanese word J_j and is called the translation probability. d(j|i, l, m) is the conditional probability that the English word in the i-th position connects to the Japanese word in the j-th position, given that the lengths of the English sentence E and the Japanese sentence J are l and m, respectively; it is called the distortion probability. In our experiment, we used IBM model 3 while assuming a constant distortion probability for simplicity.
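As an illustration of Equation (1), the following minimal Python sketch (our own, not part of the original system) scores a single alignment under IBM model 3, assuming the fertility table n, translation table t, and NULL parameter p0 are given as plain dictionaries and values, and using a constant distortion probability as in our experiment.

from math import comb, factorial

def ibm3_alignment_prob(J, E, a, n, t, p0, d_const):
    # P(J, A | E) of Equation (1) for a single alignment.
    # J, E: lists of Japanese / English words; a[j] is the English
    # position (0 = NULL) connected to the (j+1)-th Japanese word.
    # n[(phi, e)]: fertility probability, t[(f, e)]: translation
    # probability, d_const: constant distortion probability.
    m, l = len(J), len(E)
    phi = [0] * (l + 1)               # phi[0] counts NULL-aligned words
    for i in a:
        phi[i] += 1
    p1 = 1.0 - p0
    prob = comb(m - phi[0], phi[0]) * p0 ** (m - 2 * phi[0]) * p1 ** phi[0]
    for i in range(1, l + 1):         # fertility term
        prob *= factorial(phi[i]) * n[(phi[i], E[i - 1])]
    for j in range(m):                # translation and distortion terms
        e_word = 'NULL' if a[j] == 0 else E[a[j] - 1]
        prob *= t[(J[j], e_word)]
        if a[j] != 0:
            prob *= d_const
    return prob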
3 WFST Cascade Model
WFST is a finite-state device in which output sym-
bols and output weights are defined as well as in-
put symbols. Using composition (Pereira and Riley,
1997), we can obtain the combined WFST T_1 ∘ T_2 by connecting each output of T_1 to an input of T_2.
If we assume that each submodel of Equation (1) is
represented by a WFST, a conventional decoder can
be considered to compose submodels dynamically.
[Figure 2: T Model — word-for-word translation arcs such as kaku:each/t(kaku|each), tekisuto:text/t(tekisuto|text), and ha:NULL/t(ha|NULL)]
[Figure 3: NULL Model — arcs such as NULL:ε/1−p0 and ε:ε/p0 delete the special word NULL, while other words pass through unchanged, e.g., each:each/1.0 and text:text/1.0]
The main idea of the proposed approach is to com-
pute the composition beforehand.
Figure 1 shows the translation process modeled
by a WFST cascade. This WFST cascade model
(Knight and Al-Onaizan, 1998) represents the IBM
model 3 described in the previous section. All possible permutations of the Japanese sentence are input to the cascade. First, the T model (T) translates each Japanese word into an English word. The NULL model (N) deletes the special word NULL. The fertility model (F) merges identical consecutive words into one word. At each stage, the probability represented by the weights of the WFST is accumulated. Finally, the weight of the language model (L) is accumulated. If a WFST P represents all permutations of the input sentence, decoding can be considered a search for the best path of P ∘ T ∘ N ∘ F ∘ L. Therefore, computing T ∘ N ∘ F ∘ L in advance can improve the efficiency of the decoder.
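For intuition, the following Python sketch (our own simplification, not the actual implementation) composes two WFSTs given as arc lists; ε handling and the other refinements of Pereira and Riley (1997) are omitted, and weights are treated as negative log probabilities, so composition adds them. The full-expansion model would then be obtained, in principle, as compose(compose(compose(T, N), F), L).

def compose(wfst1, wfst2):
    # Each WFST is (start_state, final_states, arcs), where an arc is
    # (src, in_sym, out_sym, weight, dst) and weights are -log probs.
    # Simplified composition: no epsilon transitions, no optimization,
    # and unreachable pair states are not pruned.
    start1, finals1, arcs1 = wfst1
    start2, finals2, arcs2 = wfst2
    arcs = []
    for (p, i1, o1, w1, q) in arcs1:
        for (r, i2, o2, w2, s) in arcs2:
            if o1 == i2:              # output of the first feeds the second
                arcs.append(((p, r), i1, o2, w1 + w2, (q, s)))
    finals = {(f1, f2) for f1 in finals1 for f2 in finals2}
    return ((start1, start2), finals, arcs)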
For T, N, and F, we adopt the representation of Knight and Al-Onaizan (1998). For L, we adopt the representation of Mohri et al. (2002). Figures 2–5 show examples of submodel representation with WFSTs. b(·) in Figure 5 stands for a back-off parameter. Conditional branches are represented by nondeterministic paths in the WFST.
4 Ambiguity Reduction
If we can determinize a fully-expanded WFST, we
can achieve the best performance of the decoder.
[Figure 1: Translation with WFST Cascade Model — a permuted Japanese input (kaku tekisuto SGML de ko-do ka sareru ha) is mapped by the T model (T), NULL model (N), fertility model (F), and language model (L) into the English sentence "each text is encoded in SGML"]
[Figure 4: Fertility Model — example arcs for the word "encoded", e.g., ε:encoded/n(0|encoded), encoded:encoded/n(1|encoded), encoded:encoded/n(2|encoded), encoded:ε/n(3|encoded)/n(2|encoded), and encoded:ε/1.0]
However, the composed WFST for machine trans-
lation is not obviously determinizable. The word-
to-word translation model T strongly contributes to the WFST's ambiguity, while the ε transitions of the other submodels also contribute to ambiguity. Mohri et
al. (2002) proposed a technique that added special
symbols allowing the WFST to be determinizable.
Determinization using this technique, however, is
not expected to achieve efficient decoding in ma-
chine translation because the WFSTs of machine
translation are inherently ambiguous.
To overcome this problem, we propose a novel
WFST optimization approach that uses decoding in-
formation. First, our method merges WFST states
by considering the statistics of hypotheses while de-
coding. After merging the states, redundant edges
whose beginning states, end states, input symbols,
and output symbols are the same are also reduced.
IBM models consider all possible alignments while
a decoder searches for only the most appropriate
alignment. Therefore, there are many redundant
states in the full-expansion WFST from the view-
point of decoding.
We adopted a standard decoding algorithm in
the speech recognition field, where the forward pass is a beam search and the backward pass is an A* search. Since beam search is adopted in the forward pass, the obtained results are not optimal but suboptimal. All input permutations are represented by a finite-state acceptor (Figure 6), where each state corresponds to the set of input positions that have already been read. In the forward
search, hypotheses are maintained for each state of
the finite-state acceptor.
[Figure 5: Trigram Language Model — states correspond to n-gram histories (e.g., ab, bc, ca, a, b, c, ε); word arcs carry trigram probabilities such as c:c/P(c|ab), and ε:ε/b(·) arcs are back-off transitions]
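As a rough sketch of the acceptor in Figure 6 (our illustration; the reordering restriction of Section 5, which limits how far a word may move, is omitted here), each state is the set of input positions already consumed:

from itertools import combinations

def permutation_fsa(words):
    # States are frozensets of consumed input positions; reading the
    # word at position k moves from state S to state S | {k}.
    n = len(words)
    states = [frozenset(c) for r in range(n + 1)
              for c in combinations(range(n), r)]
    arcs = [(S, words[k], S | {k})
            for S in states for k in range(n) if k not in S]
    return frozenset(), frozenset(range(n)), arcs   # start, final, arcs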
The WFST states that always appear together in
the same hypothesis list of the forward beam-search
should be equated if the states contribute to cor-
rect translation. Let M be a full-expansion WFST model and C(J) be a WFST that represents the correct translation of an input sentence J. For each J, the states of M that always appear together in the same hypothesis list in the course of decoding J with M ∘ C(J) are merged in our method. Simply merging states of M may increase model errors, but C(J) corrects the errors caused by merging states.
Unlike ordinary FSA minimization, states are
merged without considering their successor states.
If the weights represent probabilities, the sum of the weights of the outgoing transitions may not be 1.0 after merging states, so the probabilistic constraint may be violated. Since the decoder does not sum
up all possible paths but searches for the most ap-
propriate paths, this kind of state merging does not
pose a serious problem in practice.
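A minimal sketch of the merge step (our illustration): all states in a group are collapsed into one representative, and redundant arcs that then share the same source, destination, input symbol, and output symbol are collapsed as well. The text does not specify which weight survives for such redundant arcs; keeping the lowest cost, as below, is an assumption.

def merge_states(arcs, group):
    # arcs: list of (src, in_sym, out_sym, weight, dst), weights as costs.
    # group: a set of states judged equivalent by the phi^2 criterion below.
    rep = min(group, key=str)                     # arbitrary representative
    remap = lambda s: rep if s in group else s
    best = {}
    for (src, i, o, w, dst) in arcs:
        key = (remap(src), i, o, remap(dst))
        if key not in best or w < best[key]:      # keep the cheapest arc
            best[key] = w
    return [(src, i, o, w, dst) for (src, i, o, dst), w in best.items()]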
In the following experiment, we measured the
association between states by the φ² statistic of Gale and Church (1991). φ² is a χ²-like statistic that is bounded between 0 and 1. If the φ² of two states is higher than a specified threshold, the two states are merged. The definition of φ² is as follows, where a = freq(s_1, s_2), b = freq(s_1) − a, c = freq(s_2) − a, and d = N − a − b − c. N is the total number of hypothesis lists, and freq(s) (freq(s_1, s_2)) is the number of hypothesis lists in which s appears (in which both s_1 and s_2 appear).
\[
\phi^2 = \frac{(ad - bc)^2}{(a+b)(a+c)(b+d)(c+d)}.
\]
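A direct transcription of this statistic (a sketch; the counts would be collected from the hypothesis lists of the forward beam search):

def phi_squared(n_both, n_s1, n_s2, n_lists):
    # n_both: lists containing both states; n_s1, n_s2: lists containing
    # each state; n_lists: total number of hypothesis lists N.
    a = n_both
    b = n_s1 - a
    c = n_s2 - a
    d = n_lists - a - b - c
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    # Two states are merged if this value exceeds the chosen threshold.
    return (a * d - b * c) ** 2 / denom if denom else 0.0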
[Figure 6: FSA for All Input Permutations — states are labeled with the set of input positions already read: {}, {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3}]
Merging the beginning and end states of a tran-
sition whose input is ε (an ε transition for short) may cause a problem when decoding. In our implementation, a weight is basically a negative log probability, whose lower bound is 0 in theory. However, there exist negative ε transitions that originate from the back-off values of the n-gram. If we merge the beginning and end states of a negative ε transition, the search process will not stop because of the negative ε loop. To avoid this problem, we round the negative weight to 0 if a negative ε loop appears during merging.
In the preliminary experiment, a weight-pushing
operation (Mohri and Riley, 2001) was also effec-
tive for deleting the negative ε transitions of our full-
expansion models. However, pushing causes an im-
balance of weights among paths if the WFST is not
deterministic. As a result of this imbalance, we can-
not compare path costs when pruning. In fact, our
preliminary experiment showed that a pushed full-expansion WFST does not work well. Therefore, we adopted the simpler method described above to deal with negative ε loops.
5 Experiments
5.1 Effect of Full Expansion
To clarify the effectiveness of a full-expansion ap-
proach, we compared computational costs using the same decoder with both dynamic composition and static composition, i.e., a full-expansion model. In the forward beam search, any hypothesis whose probability falls below a fixed fraction of the probability of the top hypothesis in the list is pruned. In this experiment, permutation is restricted so that a word can be moved at most 6 positions. The translation model
was trained by GIZA++ (Och and Ney, 2003), and
the trigram was trained by the CMU-Cambridge
Statistical Language Modeling Toolkit v2 (Clarkson
and Rosenfeld, 1997).
For the experiment, we used a Japanese-to-
English bilingual corpus consisting of example sen-
tences for a rule-based machine translation sys-
tem. Each language sentence is aligned in the cor-
pus. The total number of sentence pairs is 20,204.
We used 17,678 pairs for training and 2,526 pairs
for the test. The average length of Japanese sen-
tences was 8.4 words, and that of English sentences
was 6.7 words. The Japanese vocabulary consisted
of 15,510 words, and the English vocabulary was
11,806 words. Table 1 shows the size of the WFSTs
used in the experiment. In these WFSTs, special
symbols that express beginning and end of sentence
are added to the WFSTs described in the previous
section. The NIST score (Doddington, 2002) and
BLEU Score (Papineni et al., 2002) were used to
measure translation accuracy.
Table 2 shows the experimental results. The full-
expansion model provided translations more than 10
times faster than the conventional dynamic composition of submodels without degrading accuracy. However,
the NIST scores are slightly different. In the course
of composition, some paths that do not reach the fi-
nal states are produced. In the full-expansion model
these paths are trimmed. These trimmed paths may
cause a slight difference in NIST scores.
5.2 Effect of Ambiguity Reduction
To show the effect of ambiguity reduction, we com-
pared the translation results of three different mod-
els. Model M is the full-expansion model described above. Model R is a model reduced by our proposed method with a φ² threshold of 0.9. Model R′ is a model reduced using the statistics of the decoder without the correct-translation WFST C(J); in other words, R′ reduces the states of the full-expansion model more roughly than R. The φ² threshold for R′ is set to 0.85 so that the size of the produced WFST is almost the same as that of R. Table 3 shows the model sizes. To obtain the decoder statistics for calculating φ², all of the sentence pairs in the training set were used. When obtaining the statistics, any hypothesis whose probability falls below a fixed fraction of the probability of the top hypothesis in the list is pruned in the forward beam search.
The translation experiment was conducted by
successively changing the beam width of the for-
ward search. Figures 7 and 8 show the results of
the translation experiments, revealing that our pro-
posed model can reduce the decoding time by ap-
proximately half. This model can reduce decoding
time to a much greater extent than the rough reduc-
tion model, indicating that our state merging criteria
are valid.
6 Conclusions
We proposed a method to compile statistical mod-
els to achieve efficient decoding in a machine trans-
lation system. In our method, each statistical sub-
model is represented by a WFST, and all submodels
are composed beforehand. To reduce the ambiguity
of the composed WFST, the states are merged ac-
cording to the statistics of hypotheses while decod-
ing. As a result, we reduced decoding time to ap-
proximately a48a100a99 a103 a101 of dynamic composition of sub-
models, which corresponds to the conventional ap-
proach.
In this paper, we applied the state merging
method to a fully-expanded WFST and showed the
effectiveness of this approach. However, the state
merging method itself is general and independent
of the fully-expanded WFST. We can apply this
method to each submodel of machine translation.
More generally, we can apply it to all WFST-like
models, including HMMs.
Acknowledgements
We would like to thank F. J. Och for providing
GIZA++ and mkcls toolkits, and P. R. Clarkson for
the CMU-Cambridge statistical language modeling
toolkit v2. We also thank T. Hori for providing the
n-gram conversion program for WFSTs and F. Bond
and S. Fujita for providing the bilingual corpus.

References
Srinivas Bangalore and Giuseppe Riccardi. 2001. A finite-state approach to machine translation. In Proc. of the North American Chapter of the Association for Computational Linguistics (NAACL 2001), May.
Adam L. Berger, Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, Andrew S. Kehler, and Robert L. Mercer. 1996. Language translation apparatus and method of using context-based translation models. United States Patent.
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.
P. R. Clarkson and R. Rosenfeld. 1997. Statistical language modeling using the CMU-Cambridge toolkit. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH'97).
G. Doddington. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proc. of HLT 2002.
William A. Gale and Kenneth W. Church. 1991. Identifying word correspondences in parallel texts. In Proc. of Fourth DARPA Speech and Natural Language Processing Workshop, pages 152–157.
Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2001. Fast decoding and optimal decoding for machine translation. In Proc. of the 39th Annual Meeting of the Association for Computational Linguistics (ACL), pages 228–235, July.
Kevin Knight and Yaser Al-Onaizan. 1998. Translation with finite-state devices. In Proc. of the 4th AMTA Conference.
Kevin Knight. 1999. Decoding complexity in word-replacement translation models. Computational Linguistics, 25(4):607–615.
Shankar Kumar and William Byrne. 2003. A weighted finite state transducer implementation of the alignment template model for statistical machine translation. In Proc. of Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pages 142–149, May - June.
Mehryar Mohri and Michael Riley. 2001. A weight pushing algorithm for large vocabulary speech recognition. In Proc. of European Conference on Speech Communication and Technology (EUROSPEECH’01), September.
Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. 2002. Weighted finite-state transducers in speech recognition. Computer Speech and Language, 16(1):69–88.
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
Franz Josef Och, Nicola Ueffing, and Hermann Ney. 2001. An efficient A* search algorithm for statistical machine translation. In Proc. of the ACL 2001 Workshop on Data-Driven Machine Translation, pages 55–62, July.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311–318, July.
Fernando Pereira and Michael Riley. 1997. Speech recognition by composition of weighted finite automata. In Emmanuel Roche and Yves Schabes, editors, Finite-State Language Processing, chapter 15, pages 431–453. MIT Press, Cambridge, Massachusetts.
Christoph Tillmann and Hermann Ney. 2003. Word reordering and a dynamic programming beam search algorithm for statistical machine translation. Computational Linguistics, 29(1):97–133, March.
Ye-Yi Wang and Alex Waibel. 1997. Decoding algorithm in statistical machine translation. In Proc. of the 35th Annual Meeting of the Association for Computational Linguistics. 
Taro Watanabe and Eiichiro Sumita. 2003. Example-based decoding for statistical machine translation. In Proc. of MT Summit IX.
