Architectures for speech-to-speech translation
using finite-state models
Francisco Casacuberta Enrique Vidal
Dpt. de Sistemes Inform`atics i Computaci´o &
Institut Tecnol`ogic d’Inform`atica
Universitat Polit`ecnica de Val`encia
46071 Val`encia, SPAIN.
fcn@iti.upv.es, evidal@iti.upv.es
Juan Miguel Vilar
Dpt. de Llenguatges i Sistemes Inform`atics
Universitat Jaume I
Castell´o, SPAIN.
jvilar@lsi.uji.es
Abstract
Speech-to-speech translation can be ap-
proached using finite state models and
several ideas borrowed from automatic
speech recognition. The models can be
Hidden Markov Models for the accous-
tic part, language models for the source
language and finite state transducers for
the transfer between the source and target
language. A “serial architecture” would
use the Hidden Markov and the language
models for recognizing input utterance
and the transducer for finding the transla-
tion. An “integrated architecture”, on the
other hand, would integrate all the mod-
els in a single network where the search
process takes place. The output of this
search process is the target word sequence
associated to the optimal path. In both
architectures, HMMs can be trained from
a source-language speech corpus, and the
translation model can be learned automat-
ically from a parallel text training cor-
pus. The experiments presented here cor-
respond to speech-input translations from
Spanish to English and from Italian to En-
glish, in applications involving the inter-
action (by telephone) of a customer with
the front-desk of a hotel.
1 Introduction
Present finite-state technology allows us to build
speech-to-speech translation (ST) systems using
ideas very similar to those of automatic speech
recognition (ASR). In ASR the acoustic hidden
Markov models (HMMs) can be integrated into the
language model, which is typically a finite-state
grammar (e.g. a N-gram). In ST the same HMMs
can be integrated in a translation model which con-
sists in a stochastic finite-state transducer (SFST).
Thanks to this integration, the translation process
can be efficiently performed by searching for an
optimal path of states through the integrated net-
work by using well-known optimization procedures
such as (beam-search accelerated) Viterbi search.
This “integrated architecture” can be compared with
the more conventional “serial architecture”, where
the HMMs, along with a suitable source language
model, are used as a front-end to recognize a se-
quence of source-language words which is then pro-
cessed by the translation model. A related approach
has been proposed in (Bangalore and Ricardi, 2000;
Bangalore and Ricardi, 2001).
In any case, a pure pattern-recognition approach
can be followed to build the required systems.
Acoustic models can be trained from a suffi-
ciently large source-language speech training set,
in the very same way as in speech recognition.
On the other hand, using adequate learning algo-
rithms (Casacuberta, 2000; Vilar, 2000), the trans-
lation model can also be learned from a sufficiently
large training set consisting of source-target parallel
text.
In this paper, we comment the results obtained us-
ing this approach in EUTRANS, a five-year joint ef-
fort of four European institutions, partially funded
by the European Union.
                                            Association for Computational Linguistics.
                           Algorithms and Systems, Philadelphia, July 2002, pp. 39-44.
                          Proceedings of the Workshop on Speech-to-Speech Translation:
2 Finite-state transducers and speech
translation
The statistical framework allow us to formulate the
speech translation problem as follows: Let x be an
acoustic representation of a given utterance; typi-
cally a sequence of acoustic vectors or “frames”.
The translation of x into a target-language sentence
can be formulated as the search for a word se-
quence, ˆt, from the target language such that:
ˆt = argmax
t
Pr(t|x). (1)
Conceptually, the translation can be viewed as a
two-step process (Ney, 1999; Ney et al., 2000):
x → s → t,
where s is a sequence of source-language words
which would match the observed acoustic sequence
x and t is a target-language word sequence associ-
ated with s. Consequently,
Pr(t|x) =
summationdisplay
s
Pr(t,s|x), (2)
and, with the natural assumption that Pr(x|s,t) does
not depend on the target sentence t,
ˆt = argmax
t
parenleftBiggsummationdisplay
s
Pr(s,t) · Pr(x|s)
parenrightBigg
. (3)
Using a SFST as a model for Pr(s,t) and HMMs
to model Pr(x|s), Eq. 3 is transformed in the opti-
mization problem:
ˆt = argmax
t
parenleftBiggsummationdisplay
s
PrT (s,t) · PrM(x|s)
parenrightBigg
, (4)
where PrT (s,t) is the probability supplied by the
SFST and PrM(x|s) is the density value supplied
by the corresponding HMMs associated to s for the
acoustic sequence x.
2.1 Finite-state transducers
A SFST, T , is a tuple 〈Q,Σ,∆,R,q0,F, P〉, where
Q is a finite set of states; q0 is the initial state; Σ and
∆ are finite sets of input symbols (source words) and
output symbols (target words), respectively (Σ∩∆ =
∅); R is a set of transitions of the form (q,a,ω,qprime)
for q,qprime ∈ Q, a ∈ Σ, ω ∈ ∆star and1 P : R → IR+
(transition probabilities) and F : Q → IR+ (final-
state probabilities) are functions such that ∀q ∈ Q:
F(q) +
summationdisplay
∀(a,ω,qprime) ∈ Σ×∆star ×Q :
(q,a,ω,qprime) ∈ R
P(q,a,ω,qprime) = 1.
Fig. 1 shows a small fragment of a SFST for Spanish
to English translation.
A particular case of finite-state transducers are
known as subsequential transducers (SSTs). These
are finite-state transducers with the restriction of be-
ing deterministic (if (q,a,ω,q), (q,a,ωprime,qprime) ∈ R,
then ω = ωprime and q = qprime). SSTs also have output
strings associated to the (final) states. This can fit
well under the above formulation by simply adding
an end-off-sentence marker to each input sentence.
For a pair (s,t) ∈ Σstar × ∆star, a translation form,
φ, is a sequence of transitions in a SFST T :
φ : (q0,s1,˜t1,q1),(q1,s2,˜t2,q2),
...,(qI−1,sI,˜tI ,qI),
where ˜tj denotes a substring of target words (the
empty string for ˜tj is also possible), such that
˜t1 ˜t2 ...˜tI = t and I is the length of the source sen-
tence s. The probability of φ is
PrT (φ) = F(qI) ·
Iproductdisplay
i=0
P(qi−1,si,˜ti,qi). (5)
Finally, the probability of the pair (s,t) is
PrT (s,t) =
summationdisplay
φ∈d(s,t)
PrT (φ) (6)
≈ max
φ∈d(s,t)
PrT (φ), (7)
where d(s,t) is the set of all translation forms for
the pair (s,t).
These models have implicit source and target lan-
guage models embedded in their definitions, which
are simply the marginal distributions of PrT . In
practice, the source (target) language model can be
obtained by removing the target (source) words from
each transition of the model.
1By∆star andΣstar we denote the sets of finite-length strings on
∆ and Σ, respectively
0 1una / a  (0.5)la / the (0.5)
2habitación / room (0.1)
4habitación / room (0.3)3habitación / λ   (0.6)
doble / with two beds (1)
doble / double room (0.3)
individual / single room (0.7)
Figure 1: Example of SFST. λ denotes the empty string. The source sentence “una habitaci´on doble” can
be translated to either “a double room” or “a room with two beds”. The most probable translation is the
first one with probability of 0.09.
The structural (states and transitions) and the
probabilistic components of a SFST can be learned
automatically from training pairs in a single process
using the MGTI technique (Casacuberta, 2000). Al-
ternatively, the structural component can be learned
using the OMEGA technique (Vilar, 2000), while
the probabilistic component is estimated in a second
step using maximum likelihood or other possible cri-
teria (Pic´o and Casacuberta, 2001). One of the main
problems that appear during the learning process is
the modelling of events that have not been seen in
the training set. This problem can be confronted,
in a similar way as in language modelling, by using
smoothing techniques in the estimation process of
the probabilistic components of the SFST (Llorens,
2000). Alternatively, smoothing can be applied in
the process of learning both components (Casacu-
berta, 2000).
2.2 Architectures for speech translation
Using Eq. 7 as a model for Pr(s,t) in Eq. 4,
ˆt = argmax
t
parenleftBiggsummationdisplay
s
max
φ∈d(s,t)
PrT (φ) · PrM(x|s)
parenrightBigg
,
(8)
For the computation of PrM(x|s) in Eq. 8, let
b be an arbitrary segmentation of x into I acous-
tic subsequences, each of which associated with a
source word (therefore, I is the number of words in
s). Then:
PrM(x|s) =
summationdisplay
b
Iproductdisplay
i=1
PrM(¯xi|si), (9)
where ¯xi is the i-th. acoustic segment of b, and each
source word si has an associated HMM that supplies
the density value PrM(¯xi|si).
Finally, by substituting Eq. 5 and Eq. 9 into Eq. 8
and approximating sums by maximisations:
ˆt = argmax
φ∈d(s,t),b
Iproductdisplay
i=1
P(qi−1,si,¯ti,qi) · PrM(¯xi|si).
(10)
Solving this maximisation yields (an approximation
to) the most likely target-language sentence ˆt for the
observed source-language acoustic sequence x.
This computation can be accomplished using the
well known Viterbi algorithm. It searches for an op-
timal sequence of states in an integrated network (in-
tegrated architecture) which is built by substituting
each edge of the SFST by the corresponding HMM
of the source word associated to the edge.
This integration process is illustrated in Fig. 2. A
small SFST is presented in the first panel (a) of this
figure. In panel (b), the source words in each edge
are substituted by the corresponding phonetic tran-
scription. In panel (c) each phoneme is substituted
by the corresponding HMM of the phone. Clearly,
this direct integration approach often results in huge
finite-state networks. Correspondingly, a straight-
forward (dynamic-programming) search for an op-
timal target sentence may require a prohibitively
high computational effort. Fortunately, this compu-
tational cost can be dramatically reduced by means
of standard heuristic acceleration techniques such as
beam search.
An alternative, which sacrifices optimality more
drastically, is to break the search down into two
steps, leading to a so-called “serial architecture”. In
the first step a conventional source-language speech
decoding system (using just a source-language lan-
guage model) is used to obtain a single (may be mul-
tiple) hypothesis for the sequence of uttered words.
In the second step, this text sequence is translated
into a target-language sentence.
0 1la / the 2maleta / λ 3bolsa / λ 4azul / blue suitcaseazul / blue bag
a) Original FST.
0 l / λ 1a / the m / λb / λ a / λo / λ l / λ e / λ
t / λ 2t / λ a / λ a / λ z / λs / λ
l / λ s / λ
3s / λ a / λ a / λ
z / λs / λ u / λ 4l / blue suitcaseu / λ l / blue bag
b) Lexical expansion.
0 l a 1   the   
m
b
a l e
t
t
a
2 a
 
o l s
s
a
3 a
 
z
s u l
4
             blue suitcase           
z
s u
l        blue bag
c) Phonetic expansion.
Figure 2: Example of the integration process of the lexical knowledge (figure b) and the phonetic knowledge
(figure c) in a FST (figure a). λ denotes the empty string in panels a and b. In panel c, source symbols are
typeset in small fonts, target strings are typeset in large fonts and edges with no symbols denote empty
transitions.
Using Pr(s,t) = Pr(t | s) · Pr(s) in Eq. 3 and
approximating the sum by the maximum, the opti-
mization problem can be presented as
(ˆt,ˆs) = argmax
t,s
(Pr(t|s) · Pr(s) · Pr(x|s)),
(11)and the two-step approximation reduces to
ˆs ≈ argmax
s
{Pr(s) · Pr(x|s)}, (12)
ˆt ≈ argmax
t
Pr(t|ˆs) (13)
= argmax
t
Pr(ˆs,t). (14)
In other words, the search for an optimal target-
language sentence is now approximated as follows:
1. Word decoding of x. A source-language sen-
tence ˆs is searched for using a source language
model, PrN(s), for Pr(s) and the correspond-
ing HMMs, PrM(x|s), to model Pr(x|s):
ˆs ≈ argmax
s
(PrN(s) · PrM(x|s)).
2. Translation of ˆs. A target-language sentence ˆt
is searched for using a SFST, PrT (ˆs,t), as a
model of Pr(ˆs,t)
ˆt ≈ argmax
t
PrT (ˆs,t).
A better alternative for this crude “two-step” ap-
proach is to use Pr(s,t) = Pr(s | t)·Pr(t) in Eq. 3.
Now, approximating the sum by the maximum, the
optimization problem can be presented as
(ˆt,ˆs) = argmax
t,s
(Pr(s | t) · Pr(t) · Pr(x | s)),
(15)and now the two-step approximation reduces to
ˆs ≈ argmax
s
{Pr(s | t) · Pr(x | s)}, (16)
ˆt ≈ argmax
t
Pr(ˆs | t) · Pr(t) (17)
= argmax
t
Pr(ˆs,t). (18)
The main problem of this approach is the term
t that appears in the first maximisation (Eq. 16).
A possible solution is to follow an iterative proce-
dure where t, that is used for computing ˆs, is the
one obtained from argmaxt Pr(ˆs,t) in the previous
iteration (Garc´ıa-Varea et al., 2000). In this case,
Pr(s | t) can be modelled by a source language
model that depends on a previously computed ˜t:
PrN,˜t(s). In the first iteration no ˆt is known, but
PrN,˜t(s) can be approximated by PrN(s). Follow-
ing this idea, the search can be formulated as:
Initialization:
Let PrN,t(s) be approximated by a source lan-
guage model PrN(s).
while not convergence
1. Word decoding of x. A source-language sen-
tence ˆs is searched for using a source lan-
guage model that depends on the target sen-
tence, PrN,˘t(s), for Pr(s | t) (˘t is the ˆt com-
puted in the previous iteration) and the corre-
sponding HMMs, PrM(x | s), to model Pr(x |
s):
ˆs ≈ argmax
s
parenleftBig
PrN,˘t(s) · PrM(x | s)
parenrightBig
.
2. Translation of ˆs. A target-language sentence ˆt
is searched for using a SFST, PrT (ˆs,t), as a
model of Pr(ˆs,t)
ˆt ≈ argmax
t
PrT (ˆs,t).
end of while
The first iteration corresponds to the sequential ar-
chitecture proposed above.
While this seems a promising idea, only very
preliminary experiments were carried out (Garc´ıa-
Varea et al., 2000) and it has not been considered in
the experiments presented in the present paper.
3 Experiments and results
Three sets of speech-to-speech translation proto-
types have been implemented for Spanish to English
and for Italian to English. In all of them, the appli-
cation was the translation of queries, requests and
complaints made by telephone to the front desk of
a hotel. Three tasks of different degree of difficulty
have been considered.
In the first one (EUTRANS-0), Spanish-to-English
translation systems were learned from a big and
well controlled training corpus: about 170k differ-
ent pairs (≈ 2M running words), with a lexicon of
about 700 words. In the second one (EUTRANS-
I), also from Spanish to English, the systems were
learned from a random subset of 10k pairs (≈ 100k
running words) from the previous corpus; this was
established as a more realistic training corpus for the
kind of application considered. In the third and most
difficult one, from Italian to English (EUTRANS-II),
the systems were learned from a small training cor-
pus that was obtained from a transcription of a spon-
taneous speech corpus: about 3k pairs (≈ 60k run-
ning words), with a lexicon of about 2,500 words.
For the serial architecture, the speech decoding
was performed in a conventional way, using the
same acoustic models as with the integrated archi-
tecture and trigrams of the source language models.
For the integrated architecture, the speech decoding
of an utterance is a sub-product of the translation
process (the sequence of source words associated to
the optimal sequence of transitions that produces the
sequence of target words).
The acoustic models of phone units were trained
with the HTK Toolkit (Woodland, 1997). For the
EUTRANS-0 and EUTRANS-I prototypes, a training
speech corpus of 57,000 Spanish running words was
used, while the EUTRANS-II Italian acoustic models
were trained from another corpus of 52,000 running
words
Performance was assessed on the base of 336
Spanish sentences in the case of EUTRANS-0
and EUTRANS-I and 278 Italian sentences in
EUTRANS-II. In all the cases, the test sentences (as
well as the corresponding speakers) were different
from those appearing in the training data.
For the easiest task, EUTRANS-0, (well controlled
and a large training set), the best result was achieved
with an integrated architecture and a SFST obtained
with the OMEGA learning technique. A Transla-
tion Word Error Rate of 7.6% was achieved, while
the corresponding source-language speech decoding
Word Error Rate was 8.4%. Although these figures
may seem strange (and they would certainly be in
the case of a serial architecture), they are in fact con-
sistent with the fact that, in this task (corpus), the tar-
get language exhibits a significantly lower perplex-
ity than the source language.
For the second, less easy task EUTRANS-I, (well
controlled task but a small training set), the best
result was achieved with an integrated architecture
and a SFST obtained with the MGTI learning tech-
nique (10.5% of word error rate corresponding to the
speech decoding and 12.6% of translation word er-
ror rate).
For the most difficult task, EUTRANS-II (spon-
taneous task and a small training set), the best result
was achieved with a serial architecture and a SFST
obtained with the MGTI learning technique (22.1%
of word error rate corresponding to the speech de-
coding and 37.9% of translation word error rate).
4 Conclusions
Several systems have been implemented for speech-
to-speech translation based on SFSTs. Some of them
were implemented for translation from Italian to En-
glish and the others for translation from Spanish to
English. All of them support all kinds of finite-state
translation models and run on low-cost hardware.
They are currently accessible through standard tele-
phone lines with response times close to or better
than real time.
From the results presented, it appears that the in-
tegrated architecture allows for the achievement of
better results than the results achieved with a serial
architecture when enough training data is available
to train the SFST. However, when the training data
is insufficient, the results obtained by the serial ar-
chitecture were better than the results obtained by
the integrated architecture. This effect is possible
because the source language models for the exper-
iments with the serial architecture were smoothed
trigrams. In the case of sufficient training data, the
source language model associated to a SFST learnt
by the MGTI or OMEGA is better than trigrams
(Section 2.1). However, in the other case (not suf-
ficient training data) these source languages were
worse than trigrams. Consequently an important
degradation is produced in the implicit decoding of
the input utterance.
Acknowledgments
The authors would like to thank the researchers that
participated in the EUTRANS project and have de-
veloped the methodologies that are presented in this
paper.
This work has been partially supported by the Eu-
ropean Union under grant IT-LTR-OS-30268, by the
project TT2 in the “IST, V Framework Programme”,
and Spanish project TIC 2000-1599-C02-01.

References
S. Bangalore and G. Ricardi. 2000. Stochastic finite-
state models for spoken language machine translation.
In Workshop on Embeded Machine Translation Sys-
tems.
S. Bangalore and G. Ricardi. 2001. A finite-state ap-
proach to machine translation. In The Second Meeting
of the North American Chapter of the Association for
Computational Linguistics.
F. Casacuberta. 2000. Inference of finite-state trans-
ducers by using regular grammars and morphisms.
In Grammatical Inference: Algorithms and Applica-
tions, volume 1891 of Lecture Notes in Artificial Intel-
ligence, pages 1–14. Springer-Verlag.
I. Garc´ıa-Varea, A. Sanchis, and F. Casacuberta. 2000.
A new approach to speech-input statistical translation.
In Proceedings of the International Conference on Pat-
tern Recognition (ICPR2000), volume 2, pages 907–
910, Barcelona, Sept. IAPR, IEEE Press.
D. Llorens. 2000. Suavizado de aut´omatas y traduc-
tores finitos estoc´asticos. Ph.D. thesis, Universitat
Polit`ecnica de Val`encia.
H. Ney, S. Nießen, F. Och, H. Sawaf, C. Tillmann, and
S. Vogel. 2000. Algorithms for statistical translation
of spoken language. IEEE Transactions on Speech and
Audio Processing, 8(1):24–36.
H. Ney. 1999. Speech translation: Coupling of recogni-
tion and translation. In Proceedins of the IEEE Inter-
national Conference on Acoustic, Speech and Signal
Processing, pages 517–520, Phoenix, AR, March.
D. Pic´o and F. Casacuberta. 2001. Some statistical-
estimation methods for stochastic finite-state transduc-
ers. Machine Learning, 44:121–141.
J.M. Vilar. 2000. Improve the learning of subsequen-
tial transducers by using alignments and dictionaries.
In Grammatical Inference: Algorithms and Applica-
tions, volume 1891 of Lenture Notes in Artificial Intel-
ligence, pages 298–312. Springer-Verlag.
S. Young; J. Odell; D. Ollason; V. Valtchev; P. Wood-
land. 1997. The HTK Book (Version 2.1). Cambridge
University Department and Entropic Research Labora-
tories Inc.
