Proceedings of the Workshop on Statistical Machine Translation, pages 64–71,
New York City, June 2006. c©2006 Association for Computational Linguistics
Generalized Stack Decoding Algorithms for Statistical Machine Translation∗
Daniel Ortiz Mart´ınez
Inst. Tecnol´ogico de Inform´atica
Univ. Polit´ecnica de Valencia
46071 Valencia, Spain
dortiz@iti.upv.es
Ismael Garc´ıa Varea
Dpto. de Informatica
Univ. de Castilla-La Mancha
02071 Albacete, Spain
ivarea@info-ab.uclm.es
Francisco Casacuberta Nolla
Dpto. de Sist Inf. y Comp.
Univ. Polit´ec. de Valencia
46071 Valencia, Spain
fcn@dsic.upv.es
Abstract
In this paper we propose a generalization
of the Stack-based decoding paradigm for
Statistical Machine Translation. The well
known single and multi-stack decoding
algorithms defined in the literature have
been integrated within a new formalism
which also defines a new family of stack-
based decoders. These decoders allows
a tradeoff to be made between the ad-
vantages of using only one or multiple
stacks. The key point of the new formal-
ism consists in parameterizeing the num-
ber of stacks to be used during the de-
coding process, and providing an efficient
method to decide in which stack each par-
tial hypothesis generated is to be inserted-
during the search process. Experimental
results are also reported for a search algo-
rithm for phrase-based statistical transla-
tion models.
1 Introduction
The translation process can be formulated from a
statistical point of view as follows: A source lan-
guage string fJ1 = f1 ...fJ is to be translated into
a target language string eI1 = e1 ...eI. Every tar-
get string is regarded as a possible translation for the
source language string with maximum a-posteriori
probability Pr(eI1|fJ1 ). According to Bayes’ theo-
rem, the target string ˆeI1 that maximizes1 the product
∗This work has been partially supported by the Spanish
project TIC2003-08681-C02-02, the Agencia Valenciana de
Ciencia y Tecnolog´ıa under contract GRUPOS03/031, the Gen-
eralitat Valenciana, and the project HERMES (Vicerrectorado
de Investigaci´on - UCLM-05/06)
1Note that the expression should also be maximized by I;
however, for the sake of simplicity we suppose that it is known.
of both the target language model Pr(eI1) and the
string translation model Pr(fJ1 |eI1) must be chosen.
The equation that models this process is:
ˆeI1 = argmax
eI1
{Pr(eI1) ·Pr(fJ1 |eI1)} (1)
The search/decoding problem in SMT consists in
solving the maximization problem stated in Eq. (1).
In the literature, we can find different techniques to
deal with this problem, ranging from heuristic and
fast (as greedy decoders) to optimal and very slow
decoding algorithms (Germann et al., 2001). Also,
under certain circumstances, stack-based decoders
can obtain optimal solutions.
Many works (Berger et al., 1996; Wang and
Waibel, 1998; Germann et al., 2001; Och et al.,
2001; Ort´ız et al., 2003) have adopted different types
of stack-based algorithms to solve the global search
optimization problem for statistical machine trans-
lation. All these works follow two main different
approaches according to the number of stacks used
in the design and implementation of the search algo-
rithm (the stacks are used to store partial hypotheses,
sorted according to their partial score/probability,
during the search process) :
• On the one hand, in (Wang and Waibel, 1998;
Och et al., 2001) a single stack is used. In
that case, in order to make the search feasible,
the pruning of the number of partial hypothe-
ses stored in the stack is needed. This causes
many search errors due to the fact that hy-
potheses covering a different number of source
(translated) words compete in the same condi-
tions. Therefore, the greater number of covered
words the higher possibility to be pruned.
• On the other hand (Berger et al., 1996; Ger-
mann et al., 2001) make use of multiple stacks
64
(one for each set of source covered/translated
words in the partial hypothesis) in order to
solve the disadvantages of the single-stack ap-
proach. By contrast, the problem of finding
the best hypothesis to be expanded introduces
an exponential term in the computational com-
plexity of the algorithm.
In (Ort´ız et al., 2003) the authors present an em-
pirical comparison (about efficiency and translation
quality) of the two approaches paying special atten-
tion to the advantages and disadvantages of the two
approaches.
In this paper we present a new formalism consist-
ing of a generalization of the classical stack-based
decoding paradigm for SMT. This new formalism
defines a new family of stack-based decoders, which
also integrates the well known stack-based decoding
algorithms proposed so far within the framework of
SMT, that is single and multi-stack decoders.
The rest of the paper is organized as follows: in
section 2 the phrase-based approach to SMT is de-
picted; in section 3 the main features of classical
stack-based decoders are presented; in section 4 the
new formalism is presented and in section 5 exper-
imental results are shown; finally some conclusions
are drawn in section 6.
2 Phrase Based Statistical Machine
Translation
Different translation models (TMs) have been pro-
posed depending on how the relation between the
source and the target languages is structured; that is,
the way a target sentence is generated from a source
sentence. This relation is summarized using the con-
cept of alignment; that is, how the constituents (typ-
ically words or group-of-words) of a pair of sen-
tences are aligned to each other. The most widely
used single-word-based statistical alignment mod-
els (SAMs) have been proposed in (Brown et al.,
1993; Ney et al., 2000). On the other hand, models
that deal with structures or phrases instead of single
words have also been proposed: the syntax trans-
lation models are described in (Yamada and Knight,
2001) , alignment templates are used in (Och, 2002),
and the alignment template approach is re-framed
into the so-called phrase based translation (PBT)
in (Marcu and Wong, 2002; Zens et al., 2002; Koehn
et al., 2003; Tom´as and Casacuberta, 2003).
For the translation model (Pr(fJ1 |eI1)) in Eq. (1),
PBT can be explained from a generative point of
view as follows (Zens et al., 2002):
1. The target sentence eI1 is segmented into K
phrases (˜eK1 ).
2. Each target phrase ˜ek is translated into a source
phrase ˜f.
3. Finally, the source phrases are reordered in or-
der to compose the source sentence ˜fK1 = fJ1 .
In PBT, it is assumed that the relations between
the words of the source and target sentences can
be explained by means of the hidden variable ˜aK1 ,
which contains all the decisions made during the
generative story.
Pr(fJ1 |eI1) =
summationdisplay
K,˜aK1
Pr(, ˜fK1 ,˜aK1 |˜eK1 )
=
summationdisplay
K,˜aK1
Pr(˜aK1 |˜eK1 )Pr( ˜fK1 |˜aK1 ,˜eK1 )
(2)
Different assumptions can be made from the pre-
vious equation. For example, in (Zens et al., 2002)
the following model is proposed:
pθ(fJ1 |eI1) = α(eI1)
summationdisplay
K,˜aK1
Kproductdisplay
k=1
p( ˜fk|˜e˜ak) (3)
where ˜ak notes the index of the source phrase ˜e
which is aligned with the k-th target phrase ˜fk and
that all possible segmentations have the same proba-
bility. In (Tom´as and Casacuberta, 2001; Zens et al.,
2002), it also is assumed that the alignments must be
monotonic. This led us to the following equation:
pθ(fJ1 |eI1) = α(eI1)
summationdisplay
K,˜aK1
Kproductdisplay
k=1
p( ˜fk|˜ek) (4)
In both cases the model parameters that have to be
estimated are the translation probabilities between
phrase pairs (θ = {p( ˜f|˜e)}), which typically are es-
timated as follows:
p( ˜f|˜e) = N(
˜f,˜e)
N(˜e) (5)
65
where N( ˜f|˜e) is the number of times that ˜f have
been seen as a translation of ˜e within the training
corpus.
3 Stack-Decoding Algorithms
The stack decoding algorithm, also called A∗ algo-
rithm, was first introduced by F. Jelinek in (Jelinek,
1969). The stack decoding algorithm attempts to
generate partial solutions, called hypotheses, until a
complete translation is found2; these hypotheses are
stored in a stack and ordered by their score. Typi-
cally, this measure or score is the probability of the
product of the translation and the language models
introduced above. The A∗ decoder follows a se-
quence of steps for achieving a complete (and possi-
bly optimal) hypothesis:
1. Initialize the stack with an empty hypothesis.
2. Iterate
(a) Pop h (the best hypothesis) off the stack.
(b) If h is a complete sentence, output h and
terminate.
(c) Expand h.
(d) Go to step 2a.
The search is started from a null string and obtains
new hypotheses after an expansion process (step 2c)
which is executed at each iteration. The expansion
process consists of the application of a set of op-
erators over the best hypothesis in the stack, as it
is depicted in Figure 1. Thus, the design of stack
decoding algorithms involves defining a set of oper-
ators to be applied over every hypothesis as well as
the way in which they are combined in the expansion
process. Both the operators and the expansion algo-
rithm depend on the translation model that we use.
For the case of the phrase-based translation models
described in the previous section, the operator add is
defined, which adds a sequence of words to the tar-
get sentence, and aligns it with a sequence of words
of the source sentence.
The number of hypotheses to be stored during the
search can be huge. In order then to avoid mem-
2Each hypothesis has associated a coverage vector of length
J, which indicates the set of source words already cov-
ered/translated so far. In the following we will refer to this
simply as coverage.
Figure 1: Flow chart associated to the expansion of
a hypothesis when using an A⋆ algorithm.
ory overflow problems, the maximum number of hy-
potheses that a stack may store has to be limited. It
is important to note that for a hypothesis, the higher
the aligned source words, the worse the score. These
hypotheses will be discarded sooner when an A∗
search algorithm is used due to the stack length lim-
itation. Because of this, the multi-stack algorithms
were introduced.
Multi-stack algorithms store those hypotheses
with different subsets of source aligned words in dif-
ferent stacks. That is to say, given an input sentence
fJ1 composed of J words, multi-stack algorithms
employes 2J stacks to translate it. Such an organi-
zation improves the pruning of the hypotheses when
the stack length limitation is exceeded, since only
hypotheses with the same number of covered posi-
tions can compete with each other.
All the search steps given for A∗ algorithm can
also be applied here, except step 2a. This is due
to the fact that multiple stacks are used instead of
only one. Figure 2 depicts the expansion process
that the multi-stack algorithms execute, which is
slightly different than the one presented in Figure 1.
Multi-stack algorithms have the negative property of
spending significant amounts of time in selecting the
hypotheses to be expanded, since at each iteration,
the best hypothesis in a set of 2J stacks must be
searched for (Ort´ız et al., 2003). By contrast, for the
A∗ algorithm, it is not possible to reduce the length
of the stack in the same way as in the multi-stack
case without loss of translation quality.
Additionally, certain translation systems, e.g. the
Pharaoh decoder (Koehn, 2003) use an alternative
66
Figure 2: Flow chart associated to the expansion of
a hypothesis when using a multi-stack algorithm.
approach which consists in assigning to the same
stack, those hypotheses with the same number of
source words covered.
4 Generalized Stack-Decoding Algorithms
As was mentioned in the previous section, given a
sentence fJ1 to be translated, a single stack decod-
ing algorithm employs only one stack to perform the
translation process, while a multi-stack algorithm
employs 2J stacks. We propose a possible way to
make a tradeoff between the advantages of both al-
gorithms that introduces a new parameter which will
be referred to as the granularity of the algorithm.
The granularity parameter determines the number of
stacks used during the decoding process.
4.1 Selecting the granularity of the algorithm
The granularity (G) of a generalized stack algorithm
is an integer which takes values between 1 and J,
where J is the number of words which compose the
sentence to translate.
Given a sentence fJ1 to be translated, a general-
ized stack algorithm with a granularity parameter
equal to g, will have the following features:
• The algorithm will use at most 2g stacks to per-
form the translation
• Each stack will contain hypotheses which have
2J−g different coverages of fJ1
• If the algorithm can store at most S = s hy-
potheses, then, the maximum size of each stack
will be equal to s2g
4.2 Mapping hypotheses to stacks
Generalized stack-decoding algorithms require a
mechanism to decide in which stack each hypothesis
is to be inserted. As stated in section 4.1, given an
input sentence fJ1 and a generalized stack-decoding
algorithm with G = g, the decoder will work with
2g stacks, and each one will contain 2J−g different
coverages. Therefore, the above mentioned mecha-
nism can be expressed as a function which will be
referred to as the µ function. Given a hypothesis
coverage composed of J bits, the µ function return
a stack identifier composed of only g bits:
µ : ({0,1})J −→ ({0,1})g (6)
Generalized stack algorithms are strongly in-
spired by multi-stack algorithms; however, both
types of algorithms differ in the way the hypothesis
expansion is performed. Figure 3 shows the expan-
sion algorithm of a generalized stack decoder with
a granularity parameter equal to g and a function µ
which maps hypotheses coverages to stacks.
Figure 3: Flow chart associated to the expansion of
a hypothesis when using a generalized-stack algo-
rithm.
The function µ can be defined in many ways,
but there are two essential principles which must be
taken into account:
• The µ function must be efficiently calculated
• Hypotheses whose coverage have a similar
number of bits set to one must be assigned to
the same stack. This requirement allows the
pruning of the stacks to be improved, since the
67
hypotheses with a similar number of covered
words can compete fairly
A possible way to implement the µ function,
namely µ1, consists in simply shifting the coverage
vector J −g positions to the right, and then keeping
only the first g bits. Such a proposal is very easy
to calculate, however, it has a poor performance ac-
cording to the second principle explained above.
A better alternative to implement the µ function,
namely µ2, can be formulated as a composition of
two functions. A constructive definition of such a
implementation is detailed next:
1. Let us suppose that the source sentence is com-
posed by J words, we order the set of J bit
numbers as follows: first the numbers which do
not have any bit equal to one, next, the numbers
which have only one bit equal to one, and so on
2. Given the list of numbers described above, we
define a function which associates to each num-
ber of the list, the order of the number within
this list
3. Given the coverage of a partial hypothesis, x,
the stack on which this partial hypothesis is to
be inserted is obtained by a two step process:
First, we obtain the image of x returned by the
function described above. Next, the result is
shifted J −g positions to the right, keeping the
first g bits
Let β be the function that shifts a bit vector J −g
positions to the right, keeping the first g bits; and let
α be the function that for each coverage returns its
order:
α : ({0,1})J −→ ({0,1})J (7)
Then, µ2 is expressed as follows:
µ2(x) = β ◦ α(x) (8)
Table 1 shows an example of the values which re-
turns the µ1 and the µ2 functions when the input sen-
tence has 4 words and the granularity of the decoder
is equal to 2. As it can be observed, µ2 function
performs better than µ1 function according to the
second principle described at the beginning of this
section.
x µ1(x) α(x) µ2(x)
0000 00 0000 00
0001 00 0001 00
0010 00 0010 00
0100 01 0011 00
1000 10 0100 01
0011 00 0101 01
0101 01 0110 01
0110 01 0111 01
1001 10 1000 10
1010 10 1001 10
1100 11 1010 10
0111 01 1011 10
1011 10 1100 11
1101 11 1101 11
1110 11 1110 11
1111 11 1111 11
Table 1: Values returned by the µ1 and µ2 function
defined as a composition of the α and β functions
4.3 Single and Multi Stack Algorithms
The classical single and multi-stack decoding al-
gorithms can be expressed/instantiated as particular
cases of the general formalism that have been pro-
posed.
Given the input sentence fJ1 , a generalized stack
decoding algorithm with G = 0 will have the fol-
lowing features:
• The algorithm works with 20 = 1 stacks.
• Such a stack may store hypotheses with 2J dif-
ferent coverages. That is to say, all possible
coverages.
• The mapping function returns the same stack
identifier for each coverage
The previously defined algorithm has the same
features as a single stack algorithm.
Let us now consider the features of a generalized
stack algorithm with a granularity value of J:
• The algorithm works with 2J stacks
• Each stack may store hypotheses with only
20 = 1 coverage.
• The mapping function returns a different stack
identifier for each coverage
The above mentioned features characterizes the
multi-stack algorithms described in the literature.
68
EUTRANS-I XEROX
Spanish English Spanish English
Training
Sentences 10,000 55,761
Words 97,131 99,292 753,607 665,400
Vocabulary size 686 513 11,051 7,957
Average sentence leng. 9.7 9.9 13.5 11.9
Test
Sentence 2,996 1,125
Words 35,023 35,590 10,106 8,370
Perplexity (Trigrams) – 3.62 – 48.3
Table 2: EUTRANS-I and XEROX corpus statistics
5 Experiments and Results
In this section, experimental results are presented for
two well-known tasks: the EUTRANS-I (Amengual
et al., 1996), a small size and easy translation task,
and the XEROX (Cubel et al., 2004), a medium size
and difficult translation task. The main statistics of
these corpora are shown in Table 2. The translation
results were obtained using a non-monotone gener-
alized stack algorithm. For both tasks, the training
of the different phrase models was carried out us-
ing the publicly available Thot toolkit (Ortiz et al.,
2005).
Different translation experiments have been car-
ried out, varying the value of G (ranging from 0 to
8) and the maximum number of hypothesis that the
algorithm is allow to store for all used stacks (S)
(ranging from 28 to 212). In these experiments the
following statistics are computed: the average score
(or logProb) that the phrase-based translation model
assigns to each hypothesis, the translation quality
(by means of WER and Bleu measures), and the av-
erage time (in secs.) per sentence3.
In Figures 4 and 5 two plots are shown: the av-
erage time per sentence (left) and the average score
(right), for EUTRANS and XEROX corpora respec-
tively. As can be seen in both figures, the bigger the
value of G the lower the average time per sentence.
This is true up to the value of G = 6. For higher
values of G (keeping fixed the value of S) the aver-
age time per sentence increase slightly. This is due
to the fact that at this point the algorithm start to
spend more time to decide which hypothesis is to be
expanded. With respect to the average score similar
values are obtained up to the value of G = 4. Higher
3All the experiments have been executed on a PC with a
2.60 Ghz Intel Pentium 4 processor with 2GB of memory. All
the times are given in seconds.
values of G slightly decreases the average score. In
this case, as G increases, the number of hypothe-
ses per stack decreases, taking into account that the
value of S is fixed, then the “optimal” hypothesis
can easily be pruned.
In tables 3 and 4 detailed experiments are shown
for a value of S = 212 and different values of G, for
EUTRANS and XEROX corpora respectively.
G WER Bleu secsXsent logprob
0 6.6 0.898 2.4 -18.88
1 6.6 0.898 1.9 -18.80
2 6.6 0.897 1.7 -18.81
4 6.6 0.898 1.3 -18.77
6 6.7 0.896 1.1 -18.83
8 6.7 0.896 1.5 -18.87
Table 3: Translation experiments for EUTRANS cor-
pus using a generalized stack algorithm with differ-
ent values of G and a fixed value of S = 212
G WER Bleu secsXsent logProb
0 32.6 0.658 35.1 -33.92
1 32.8 0.657 20.4 -33.86
2 33.1 0.656 12.8 -33.79
4 32.9 0.657 7.0 -33.70
6 33.7 0.652 6.3 -33.69
8 36.3 0.634 13.7 -34.10
Table 4: Translation experiments for XEROX cor-
pus using a generalized stack algorithm with differ-
ent values of G and a fixed value of S = 212
According to the experiments presented here we
can conclude that:
• The results correlates for the two considered
tasks: one small and easy, and other larger and
difficult.
• The proposed generalized stack decoding
paradigm can be used to make a tradeoff be-
69
 0
 0.5
 1
 1.5
 2
 2.5
 0  1  2  3  4  5  6  7  8
time
G
S=512
S=1024
S=2048
S=4096
-20
-19.5
-19
-18.5
-18
 0  1  2  3  4  5  6  7  8
Avg. Score
G
S=512
S=1024
S=2048
S=4096
Figure 4: Average time per sentence (in secs.) and average score per sentence. The results are shown for
different values of G and S for the EUTRANS corpus.
 0
 5
 10
 15
 20
 25
 30
 35
 40
 0  1  2  3  4  5  6  7  8
time
G
S=512
S=1024
S=2048
S=4096
-37
-36
-35
-34
-33
-32
-31
 0  1  2  3  4  5  6  7  8
Avg. Score
G
S=512
S=1024
S=2048
S=4096
Figure 5: Average time per sentence (in secs.) and average score per sentence. The results are shown for
different values of G and S for the XEROX corpus.
tween the advantages of classical single and
multi-stack decoding algorithms.
• As we expected, better results (regarding effi-
ciency and accuracy) are obtained when using
a value of G between 0 and J.
6 Concluding Remarks
In this paper, a generalization of the stack-decoding
paradigm has been proposed. This new formalism
includes the well known single and multi-stack de-
coding algorithms and a new family of stack-based
algorithms which have not been described yet in the
literature.
Essentially, generalized stack algorithms use a pa-
rameterized number of stacks during the decoding
process, and try to assign hypotheses to stacks such
that there is ”fair competition” within each stack,
i.e., brother hypotheses should cover roughly the
same number of input words (and the same words)
if possible.
The new family of stack-based algorithms allows
a tradeoff to be made between the classical single
and multi-stack decoding algorithms. For this pur-
pose, they employ a certain number of stacks be-
tween 1 (the number of stacks used by a single stack
algorithm) and 2J (the number of stacks used by a
multiple stack algorithm to translate a sentence with
J words.)
According to the experimental results, it has been
proved that an appropriate value of G yields in a
stack decoding algorithm that outperforms (in effi-
70
ciency and acuraccy) the single and multi-stack al-
gorithms proposed so far.
As future work, we plan to extend the experimen-
tation framework presented here to larger and more
complex tasks as HANSARDS and EUROPARL cor-
pora.

References
J.C. Amengual, J.M. Bened´ı, M.A. Castao, A. Marzal,
F. Prat, E. Vidal, J.M. Vilar, C. Delogu, A. di Carlo,
H. Ney, and S. Vogel. 1996. Definition of a ma-
chine translation task and generation of corpora. Tech-
nical report d4, Instituto Tecnol´ogico de Inform´atica,
September. ESPRIT, EuTrans IT-LTR-OS-20268.
Adam L. Berger, Peter F. Brown, Stephen A. Della Pietra,
Vincent J. Della Pietra, John R. Gillett, A. S. Kehler,
and R. L. Mercer. 1996. Language translation ap-
paratus and method of using context-based translation
models. United States Patent, No. 5510981, April.
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della
Pietra, and R. L. Mercer. 1993. The mathematics of
statistical machine translation: Parameter estimation.
Computational Linguistics, 19(2):263–311.
E. Cubel, J. Civera, J. M. Vilar, A. L. Lagarda,
E. Vidal, F. Casacuberta, D. Pic´o, J. Gonz´alez, and
L. Rodr´ıguez. 2004. Finite-state models for computer
assisted translation. In Proceedings of the 16th Euro-
pean Conference on Artificial Intelligence (ECAI04),
pages 586–590, Valencia, Spain, August. IOS Press.
Ulrich Germann, Michael Jahr, Kevin Knight, Daniel
Marcu, and Kenji Yamada. 2001. Fast decoding and
optimal decoding for machine translation. In Proc.
of the 39th Annual Meeting of ACL, pages 228–235,
Toulouse, France, July.
F. Jelinek. 1969. A fast sequential decoding algorithm
using a stack. IBM Journal of Research and Develop-
ment, 13:675–685.
P. Koehn, F. J. Och, and D. Marcu. 2003. Statisti-
cal phrase-based translation. In Proceedings of the
HLT/NAACL, Edmonton, Canada, May.
Phillip Koehn. 2003. Pharaoh: a beam search decoder
for phrase-based statistical machine translation mod-
els. User manual and description. Technical report,
USC Information Science Institute, December.
Daniel Marcu and William Wong. 2002. A phrase-based,
joint probability model for statistical machine transla-
tion. In Proceedings of the EMNLP Conference, pages
1408–1414, Philadelphia, USA, July.
Hermann Ney, Sonja Nießen, Franz J. Och, Hassan
Sawaf, Christoph Tillmann, and Stephan Vogel. 2000.
Algorithms for statistical translation of spoken lan-
guage. IEEE Trans. on Speech and Audio Processing,
8(1):24–36, January.
Franz J. Och, Nicola Ueffing, and Hermann Ney. 2001.
An efficient A* search algorithm for statistical ma-
chine translation. In Data-Driven Machine Transla-
tion Workshop, pages 55–62, Toulouse, France, July.
Franz Joseph Och. 2002. Statistical Machine Trans-
lation: From Single-Word Models to Alignment Tem-
plates. Ph.D. thesis, Computer Science Department,
RWTH Aachen, Germany, October.
D. Ort´ız, Ismael Garc´ıa-Varea, and Francisco Casacu-
berta. 2003. An empirical comparison of stack-based
decoding algorithms for statistical machine transla-
tion. In New Advance in Computer Vision, Lec-
ture Notes in Computer Science. Springer-Verlag. 1st
Iberian Conference on Pattern Recongnition and Im-
age Analysis -IbPRIA2003- Mallorca. Spain. June.
D. Ortiz, I. Garca-Varea, and F. Casacuberta. 2005. Thot:
a toolkit to train phrase-based statistical translation
models. In Tenth Machine Translation Summit, pages
141–148, Phuket, Thailand, September.
J. Tom´as and F. Casacuberta. 2001. Monotone statistical
translation using word groups. In Procs. of the Ma-
chine Translation Summit VIII, pages 357–361, Santi-
ago de Compostela, Spain.
J. Tom´as and F. Casacuberta. 2003. Combining phrase-
based and template-based models in statistical ma-
chine translation. In Pattern Recognition and Image
Analisys, volume 2652 of LNCS, pages 1021–1031.
Springer-Verlag. 1st bPRIA.
Ye-Yi Wang and Alex Waibel. 1998. Fast decoding
for statistical machine translation. In Proc. of the
Int. Conf. on Speech and Language Processing, pages
1357–1363, Sydney, Australia, November.
Kenji Yamada and Kevin Knight. 2001. A syntax-
based statistical translation model. In Proc. of the 39th
Annual Meeting of ACL, pages 523–530, Toulouse,
France, July.
R. Zens, F.J. Och, and H. Ney. 2002. Phrase-based sta-
tistical machine translation. In Advances in artificial
intelligence. 25. Annual German Conference on AI,
volume 2479 of Lecture Notes in Computer Science,
pages 18–32. Springer Verlag, September.
