A Unified Approach in Speech-to-Speech Translation: Integrating
Features of Speech Recognition and Machine Translation
Ruiqiang Zhang and Genichiro Kikui and Hirofumi Yamamoto
Taro Watanabe and Frank Soong and Wai Kit Lo
ATR Spoken Language Translation Research Laboratories
2-2 Hikaridai, Seiika-cho, Soraku-gun, Kyoto, 619-0288, Japan
{ruiqiang.zhang, genichiro.kikui}@atr.jp
Abstract
Based upon a statistically trained speech
translation system, in this study, we try
to combine distinctive features derived from
the two modules: speech recognition and
statistical machine translation, in a log-
linear model. The translation hypotheses
are then rescored and translation perfor-
mance is improved. The standard trans-
lation evaluation metrics, including BLEU,
NIST, multiple reference word error rate
and its position independent counterpart,
were optimized to solve the weights of the
features in the log-linear model. The experimental
results have shown significant improvement
over the baseline IBM model 4
in all automatic translation evaluation met-
rics. The largest was for BLEU, by 7.9%
absolute.
1 Introduction
Current translation systems are typically of a
cascaded structure: speech recognition followed
by machine translation. This structure, while
explicit, lacks some joint optimality in per-
formance since the speech recognition module
and translation module are running rather in-
dependently. Moreover, the translation module
of a speech translation system, a natural offspring
of text-input based translation systems,
usually takes a single-best recognition hypoth-
esis transcribed in text and performs standard
text-based translation. Much supplementary
information available from speech recognition,
such as N-best recognition hypotheses and the
likelihoods of acoustic and language models,
is not well utilized in the translation process.
This information can be effective for improving
translation quality if employed properly.
The supplementary information can be exploited
either by a tight coupling of speech recognition
and machine translation (Ney, 1999) or by keeping
the cascaded structure unchanged and using an
integration model, a log-linear model, to rescore
the translation hypotheses. In this study the
latter approach was used due to its explicitness.
In this paper we aim to improve speech
translation by exploiting this information.
Moreover, a number of advanced features from
the machine translation module were also added
in the models. All the features from the speech
recognition and machine translation module
were combined by the log-linear models seam-
lessly.
In order to test our results broadly, we used
four automatic translation evaluation metrics:
BLEU, NIST, multiple reference word error rate and
position independent word error rate, to measure
the translation improvement.
In the following, in section 2 we introduce the
speech translation system. In section 3, we describe
the optimization algorithm used to find
the weight parameters in the log-linear model.
In section 4 we demonstrate the effectiveness
of our technique in speech translation experiments.
In the final two sections we discuss the
results and present our conclusions.
2 Feature-based Log-linear Models
in Speech Translation
The speech translation experimental system
used in this study illustrated in Fig. 1 is a typi-
cal, statistics-based one. It consists of two ma-
jor cascaded components: an automatic speech
recognition (ASR) module and a statistical ma-
chine translation (SMT) module. Additionally,
a third module, ‘Rescore’, has been added to the
system and it forms a key component in the sys-
tem. Features derived from ASR and SMT are
combined in this module to rescore translation
candidates.
Without loss of generality, in this paper
we use Japanese-to-English translation to explain
the generic speech translation process. Let
$X$ denote the acoustic observations of a Japanese
utterance, typically a sequence of short-time
spectral vectors received at a frame rate of one
per centisecond. It is first recognized as a
Japanese sentence, $J$. The recognized sentence
is then translated into a corresponding English
sentence, $E$.

[Figure 1: Current framework of speech translation. X (utterance) → ASR → J_1^N (recognized text) → SMT → E_1^{N×K} (target translation) → Rescore → E (best translation)]
The conversion from $X$ to $J$ is performed in
the ASR module. Based on Bayes' rule, $P(J \mid X)$
can be written as

$$P(J \mid X) = \frac{P_{am}(X \mid J)\, P_{lm}(J)}{P(X)}$$

where $P_{am}(X \mid J)$ is the acoustic model likelihood
of the observations given the recognized
sentence $J$; $P_{lm}(J)$, the source language model
probability; and $P(X)$, the probability of all
acoustic observations.
In the experiment we generated a set of N-best
hypotheses, $J_1^N = \{J_1, J_2, \cdots, J_N\}$ ¹, and
each $J_i$ is determined by

$$J_i = \arg\max_{J \in \Omega_i} P_{am}(X \mid J)\, P_{lm}(J)$$

where $\Omega_i$ is the set of all possible source sentences
excluding all higher ranked $J_k$'s, $1 \le k \le i-1$.
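The ranking rule above can be illustrated with a minimal sketch. It assumes the recognizer exposes a finite list of candidate sentences with their acoustic and language model log-likelihoods; the function name `n_best` and the tuple layout are hypothetical and not part of the system described here. Taking the arg max repeatedly over the remaining candidates is equivalent to sorting by the combined log score.

```python
import heapq

def n_best(candidates, n):
    """Return the n highest-scoring sentences, best first.

    candidates: iterable of (sentence, log_p_am, log_p_lm) tuples
    (an assumed data layout, for illustration only).
    """
    # log(P_am * P_lm) = log P_am + log P_lm, so the two log scores are added.
    scored = [(log_am + log_lm, sent) for sent, log_am, log_lm in candidates]
    return [sent for _, sent in heapq.nlargest(n, scored)]

# Toy candidates with made-up scores.
toy = [
    ("kore wa pen desu", -120.0, -8.0),
    ("kore wa ben desu", -119.5, -12.0),
    ("kore wa pen", -125.0, -7.5),
]
top_two = n_best(toy, 2)
```

In this toy list the first sentence has the highest combined score, so it plays the role of the single-best hypothesis.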
The conversion from $J$ to $E$ in Fig. 1 is
the machine translation process. According
to the statistical machine translation formalism
(Brown et al., 1993), the translation process
is to search for the best sentence $\hat{E}$ such that

$$\hat{E} = \arg\max_{E} P(E \mid J) = \arg\max_{E} P(J \mid E)\, P(E)$$

where $P(J \mid E)$ is a translation model characterizing
the correspondence between $E$ and $J$;
$P(E)$, the English language model probability.
In the IBM model 4, the translation model
$P(J \mid E)$ is further decomposed into four sub-models:
• Lexicon model, $t(j \mid e)$: probability of a
word $j$ in the Japanese language being
translated into a word $e$ in the English language.
• Fertility model, $n(\phi \mid e)$: probability of
an English language word $e$ generating $\phi$
words.
• Distortion model, $d$: probability of distortion,
which is decomposed into the distortion
probabilities of head words and non-head words.
• NULL translation model, $p_1$: a fixed probability
of inserting a NULL word after determining
each English word.

¹ Hereafter, $J_1$ is called the single-best hypothesis of
speech recognition; $J_1^N$, the N-best hypotheses.
In the above we listed seven features: two
from ASR ($P_{am}(X \mid J)$, $P_{lm}(J)$) and five from
SMT ($P(E)$, $t(j \mid e)$, $n(\phi \mid e)$, $d$, $p_1$).
The third module in Fig. 1 is to rescore trans-
lation hypotheses from SMT by using a feature-
based log-linear model. All translation can-
didates output through the speech recognition
and translation modules are re-evaluated by us-
ing all relevant features and searching for the
best translation candidate of the highest score.
The log-linear model used in our speech translation
process, $P(E \mid X)$, is

$$P_{\Lambda}(E \mid X) = \frac{\exp\left(\sum_{i=1}^{M} \lambda_i f_i(X, E)\right)}{\sum_{E'} \exp\left(\sum_{i=1}^{M} \lambda_i f_i(X, E')\right)}, \qquad \Lambda = \{\lambda_1^M\} \tag{1}$$

In Eq. 1, $f_i(X, E)$ is the logarithm value
of the i-th feature; $\lambda_i$ is the weight of the i-th
feature. Integrating different features in the
equation results in different models. In the experiments
performed in section 4, four different
models will be trained by increasing the number
of features successively to investigate the effect
of different features for improving speech translation.
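As an illustration of Eq. 1, the sketch below evaluates the log-linear posterior over a finite candidate list. Restricting the sum over $E'$ to an explicit list is an assumption made here for the example, and the function name `loglinear_posterior` is hypothetical.

```python
import math

def loglinear_posterior(feature_vectors, weights):
    """Posterior of Eq. 1 over a finite candidate list.

    feature_vectors: one list of M log-feature values f_i(X, E) per candidate.
    weights: the M feature weights lambda_i.
    """
    scores = [sum(w * f for w, f in zip(weights, fv)) for fv in feature_vectors]
    top = max(scores)  # subtract the maximum before exponentiating, for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Two toy candidates with M = 2 made-up log features each.
posterior = loglinear_posterior([[-2.0, -1.0], [-3.0, -0.5]], [1.0, 1.0])
```

Changing the weights changes which candidate receives the larger posterior, which is what the parameter optimization in section 3 exploits.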
In addition to the above seven features, the
following features are also incorporated.
• Part-of-speech language models: English
part-of-speech language models were used.
POS dependence of a translated English
sentence is an effective constraint in pruning
English sentence candidates. In our experiments
81 part-of-speech tags and a 5-gram
POS language model were used.
• Length model $P(l \mid E, J)$: $l$ is the length
(number of words) of a translated English
sentence.
• Jump weight: Jump width for adjacent
cepts in Model 4 (Marcu and Wong, 2002).
• Example matching score: The translated
English sentence is matched with phrase
translation examples. A score is derived
based on the count of matches (Watanabe
and Sumita, 2003).
• Dynamic example matching score: Similar
to the example matching score but phrases
were extracted dynamically from sentence
examples (Watanabe and Sumita, 2003).
Altogether, we used $M$ (=12) different features.
In section 3, we review Powell's algorithm
(Press et al., 2000) as our tool to optimize
the model parameters, $\lambda_1^M$, based on different
objective translation metrics.
3 Parameter Optimization Based
on Translation Metrics
The denominator in Eq. 1 can be ignored since
the normalization is applied equally to every hypothesis.
Hence, the choice of the best translation,
$\hat{E}$, out of all possible translations, $E$, is
independent of the denominator,

$$\hat{E} = \arg\max_{E} \sum_{i=1}^{M} \lambda_i \log P_i(X, E) \tag{2}$$

where we write the features, $f_i(X, E)$, explicitly in
logarithm, $\log P_i(X, E)$.
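The step from Eq. 1 to Eq. 2 can be spelled out as follows. Here $Z(X)$ is a symbol introduced only for this illustration to denote the denominator of Eq. 1, and $\lambda_i$ denotes the weight of the i-th feature.

```latex
\begin{aligned}
\hat{E} &= \arg\max_{E} P(E \mid X)
         = \arg\max_{E} \frac{\exp\left(\sum_{i=1}^{M} \lambda_i f_i(X, E)\right)}{Z(X)} \\
        &= \arg\max_{E} \exp\left(\sum_{i=1}^{M} \lambda_i f_i(X, E)\right)
         && \text{($Z(X) > 0$ does not depend on $E$)} \\
        &= \arg\max_{E} \sum_{i=1}^{M} \lambda_i f_i(X, E)
         && \text{($\exp$ is strictly increasing)}
\end{aligned}
```

Rescoring therefore only needs the weighted sum of log features for each candidate, not the normalized probability.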
The effectiveness of the model in Eq. 2 depends
upon the optimization of the
parameter set $\lambda_1^M$ with respect to some objectively
measurable but subjectively relevant metrics.
Suppose we have $L$ speech utterances and
for each utterance, we generate $N$ best speech
recognition hypotheses. For each recognition
hypothesis, $K$ English language translation
hypotheses are generated. For the $l$-th
input speech utterance, there are then
$C_l = \{E_{l1}, \cdots, E_{l(N \times K)}\}$ translations.
All $L$ speech utterances generate $L \times N \times K$
translations in total.
Our goal is to minimize the translation "distortion"
between the reference translations, $R$,
and the translated sentences, $\hat{E}$.

$$\lambda_1^M = \operatorname{optimize}\; D(\hat{E}, R) \tag{3}$$

where $\hat{E} = \{\hat{E}_1, \cdots, \hat{E}_L\}$ is the set of translations
of all utterances. The translation $\hat{E}_l$ of the $l$-th
utterance is produced by Eq. 2, where
$E \in C_l$.
Let $R = \{R_1, \cdots, R_L\}$ be the set of translation
references for all utterances. Human translators
paraphrased 16 reference sentences for
each utterance, i.e., $R_l$ contains 16 reference
candidates for the $l$-th utterance.
$D(\hat{E}, R)$ is a translation "distortion" or an
objective translation assessment. The following
four metrics were used specifically in this study:
• BLEU (Papineni et al., 2002): A weighted
geometric mean of the n-gram matches between
test and reference sentences multiplied
by a brevity penalty that penalizes
short translation sentences.
• NIST: An arithmetic mean of the n-gram
matches between test and reference sentences
multiplied by a length factor which
again penalizes short translation sentences.
• mWER (Niessen et al., 2000): Multiple reference
word error rate, which computes the
edit distance (minimum number of insertions,
deletions, and substitutions) between
test and reference sentences.
• mPER: Multiple reference position independent
word error rate, which computes
the edit distance without considering the
word order.
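The two error-rate metrics above can be sketched in a few lines. This is an illustrative approximation rather than the evaluation tool used in the experiments: it assumes whitespace-tokenized sentences, takes the lowest rate over the available references, and normalizes by the length of each reference. The function names are hypothetical.

```python
from collections import Counter

def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1,              # delete h
                            curr[j - 1] + 1,          # insert r
                            prev[j - 1] + (h != r)))  # substitute or match
        prev = curr
    return prev[-1]

def bag_distance(hyp, ref):
    """Edit distance when word order is ignored (bag-of-words comparison)."""
    h, r = Counter(hyp), Counter(ref)
    surplus = sum((h - r).values())  # hypothesis words unmatched in the reference
    missing = sum((r - h).values())  # reference words unmatched in the hypothesis
    return max(surplus, missing)

def multi_ref_rate(hyp, refs, distance):
    """Lowest length-normalized distance over all (non-empty) references."""
    return min(distance(hyp, ref) / len(ref) for ref in refs)

refs = ["where is the train station".split(),
        "where can i find the station".split()]
hyp = "where is the station".split()
shuffled = "station the where is".split()

mwer = multi_ref_rate(hyp, refs, edit_distance)
mper = multi_ref_rate(shuffled, refs, bag_distance)
```

The shuffled hypothesis keeps the same position independent rate as the ordered one, while its word error rate rises, which is the distinction between mPER and mWER.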
The BLEU score and NIST score are calculated
by the downloadable tool ².
Because the objective function in the model
(Eq. 3) is not a smooth function, we used Powell's
search method to find a solution. The Powell's
algorithm used in this work is similar to the
one from (Press et al., 2000), but we modified the
line optimization code, a subroutine of Powell's
algorithm, with reference to (Och, 2003).
Finding a global optimum is usually difficult
in a high dimensional vector space. To make
sure that we had found a good local optimum,
we restarted the algorithm by using various initializations
and used the best local optimum as
the final solution.
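The restart strategy can be illustrated with a deliberately simplified sketch. It is not Powell's method and not the modified line optimization of (Och, 2003): it substitutes a coordinate-wise grid search, and it assumes each candidate carries a precomputed error that can be summed over utterances (true for word errors, not for corpus-level scores such as BLEU). All names, the grid, and the toy data are assumptions for illustration.

```python
import random

def corpus_error(dev_set, weights):
    """Total error of the candidates selected by Eq. 2; a stand-in for D in Eq. 3.

    dev_set: one list per utterance of (features, error) pairs.
    """
    total = 0
    for candidates in dev_set:
        best = max(candidates,
                   key=lambda c: sum(w * f for w, f in zip(weights, c[0])))
        total += best[1]
    return total

def coordinate_search(dev_set, start, grid, sweeps=5):
    """Greedy derivative-free search: tune one weight at a time over a grid."""
    weights = list(start)
    best = corpus_error(dev_set, weights)
    for _ in range(sweeps):
        improved = False
        for i in range(len(weights)):
            for value in grid:
                trial = weights[:i] + [value] + weights[i + 1:]
                err = corpus_error(dev_set, trial)
                if err < best:
                    weights, best, improved = trial, err, True
        if not improved:
            break
    return weights, best

def search_with_restarts(dev_set, dim, restarts=10, seed=0):
    """Run the local search from several random starts and keep the best run."""
    rng = random.Random(seed)
    grid = [x / 4 for x in range(-8, 9)]  # candidate weight values in [-2, 2]
    runs = [coordinate_search(dev_set,
                              [rng.uniform(-2, 2) for _ in range(dim)], grid)
            for _ in range(restarts)]
    return min(runs, key=lambda run: run[1])

# Toy development set: feature 0 prefers the erroneous candidates,
# feature 1 prefers the error-free ones.
dev = [
    [([-1.0, -5.0], 3), ([-2.0, -1.0], 0)],
    [([-1.0, -4.0], 2), ([-3.0, -0.5], 0)],
]
best_weights, best_error = search_with_restarts(dev, dim=2)
```

Because the error is piecewise constant in the weights, gradient-based methods do not apply, which is why a derivative-free search with several starting points is used.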
4 Experiments
4.1 Corpus & System
The data used in this study was the Basic
Travel Expression Corpus (BTEC) (Kikui et al.,
2003), consisting of commonly used sentences
listed in travel guidebooks and tour conversations.
The corpus was designed for developing
multiple language speech-to-speech translation
systems. It contains four different languages:
Chinese, Japanese, Korean and English. Only
Japanese-English parallel data was used in this
study. The speech data was recorded by multiple
speakers and was used to train the acoustic
models, while the text database was used for
training the language and translation models.
The standard BTEC training corpus, the first
file and the second file from BTEC standard test
corpus #01 were used for training, development
and test respectively. The statistics of the corpus
are shown in Table 1.

Table 1: Training, development and test data
from Basic Travel Expression Corpus (BTEC)

                     Japanese     English
Train  Sentences          162,318
       Words        1,288,767     949,377
Dev.   Sentences              510
       Words            4,015       2,983
Test   Sentences              508
       Words            4,112       2,951

² http://www.nist.gov/speech/tests/mt/
The speech recognition engine used in the experiments
was an HMM-based, large vocabulary
continuous speech recognizer. The acoustic
HMMs were triphone models with 2,100 states
in total, using 25-dimensional, short-time spectrum
features. In the first and second pass of
decoding, a multiclass word bigram of a lexicon
of 37,000 words plus 10,000 compound words
was used. A word trigram was used in rescoring
the results.
The machine translation system is a graph-based
decoder (Ueffing et al., 2002). The first
pass of the decoder generates a word-graph, a
compact representation of alternative translation
candidates, using a beam search based on
the scores of the lexicon and language models.
In the second pass an A* search traverses
the graph. The edges of the word-graph, or
the phrase translation candidates, are generated
by the list of word translations obtained
from the inverted lexicon model. The phrase
translations extracted from the Viterbi alignments
of the training corpus also constitute the
edges. Similarly, the edges are also created from
dynamically extracted phrase translations from
the bilingual sentences (Watanabe and Sumita,
2003). The decoder used the IBM Model 4
with a trigram language model and a 5-gram
part-of-speech language model. The training of
IBM model 4 was implemented by the GIZA++
package (Och and Ney, 2003).
4.2 Model Training
In order to quantify the translation improvement
contributed by features from speech recognition and
machine translation respectively, we built four
log-linear models by adding features successively. The
four models are:
• Standard translation model (stm): Only
features from the IBM model 4 (M=5) described
in section 2 were used in the log-linear
model. We did not perform parameter
optimization on this model, which is equivalent
to setting all the $\lambda_1^M$ to 1. This model
is the standard model used in most statistical
machine translation systems. It is
referred to as the baseline model.
• Optimized standard translation model
(ostm): This model consists of the same
features as the previous model "stm" but
the parameters were optimized by Powell's
algorithm. We intended to exhibit the effect
of parameter optimization by comparing
this model with the baseline "stm".
• Optimized enhanced translation model
(oetm): We incorporated the additional translation
features described in section 2 to
enrich the model "ostm". In this model
the total number of features, M, is 10.
Model parameters were optimized. We intended
to show how much the enhanced
features can improve translation quality.
• Optimized enhanced speech translation
model (oestm): Features from speech
recognition, the likelihood scores of acoustic
and language models, were incorporated
additionally into the model "oetm". All
the 12 features described in section 2 were
used. Model parameters were optimized.
To optimize the $\lambda$ parameters of the log-linear
models, we used the development data of 510
speech utterances. We adopted an N-best
hypothesis approach (Och, 2003) to train $\lambda$.
For each input speech utterance, $N \times K$ candidate
translations were generated, where $N$ is
the number of generated recognition hypotheses
and $K$ is the number of translation hypotheses.
A vector of dimension $M$, corresponding to the
multiple features used in the translation model,
was generated for each translation candidate.
Powell's algorithm was used to optimize
these parameters. We used a large $K$ to ensure
that promising translation candidates were not
pruned out. In the training, we set N=100 and
K=1,000.

Table 2: Comparisons of single-best and N-best
hypotheses of speech recognition performance
in terms of word accuracy, sentence accuracy,
insertion, deletion and substitution error rates

              word acc. (%)  sent. acc. (%)  ins. (%)  del. (%)  sub. (%)
single-best       93.5           78.7          2.0       0.8       3.6
N-best            96.1           87.0          1.2       0.3       2.2
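To give a sense of scale for these training settings, the candidate count from section 3 can be worked out as below. This is an upper bound that assumes every N-best and K-best list is completely filled, which the text does not state.

```latex
L \times N \times K = 510 \times 100 \times 1{,}000 = 51{,}000{,}000
\quad \text{candidate translations}
```

Each candidate carries one feature vector of dimension M, so the number of stored feature values is at most M times this count.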
By using the different objective translation evaluation
metrics described in section 3, for each
model we obtained four sets of optimized parameters
with respect to the BLEU, NIST, mWER
and mPER metrics, respectively.
4.3 Translation Improvement by
Additional Features
All 508 utterances in the test data were used to
evaluate the models. Similar to processing the
development data, the speech recognizer gen-
erated N-best (N=100) recognition hypothe-
ses for each test speech utterance. Table 2
shows speech recognition results of the test data
set in single-best and N-best hypotheses. We
observed that over 8% sentence accuracy im-
provement was obtained from the single-best to
the N-best recognition hypotheses. The recog-
nized sentences were then translated into corre-
sponding English sentences. 1,000 such trans-
lation candidates were produced for each recog-
nition hypothesis. These candidates were then
rescored by each of the four models with four
sets of optimized parameters obtained in the
training respectively. The candidates with the
best score were chosen.
The best translations generated by a model
were evaluated by the translation assessment
metrics used to optimize the model parameters
in the development. The experimental results
are shown in Table 3.
In the experiments we changed the number
of speech recognition hypotheses, N, to see how
translation performance changes with N. We
found that the best translation was achieved
when a relatively small set of hypotheses,
N=5, was used. Hence, the values in Table 3
were obtained when N was set to 5.
We tested each model by employing the single-best
recognition hypothesis translations and
the N-best recognition hypothesis translations.
The single-best translation was from the translation
of the single best hypothesis of the speech
recognition, and the N-best hypothesis translation
was from the translations of all the hypotheses
produced by speech recognition.

Table 3: Translation improvement from the
baseline model (stm) to the optimized enhanced
speech translation model (oestm): Models are
optimized using the same metric as shown in
the columns. Numbers are in percentage except
NIST score.

         BLEU   NIST   mWER   mPER
Single-best recognition hypothesis translation
stm      54.2    7.5   39.8   34.8
ostm     59.0    8.9   36.2   34.0
oetm     59.2    9.9   34.3   31.5
N-best recognition hypothesis translation
stm      55.5    7.3   39.8   35.4
ostm     61.1    8.8   36.4   33.9
oetm     61.1   10.0   34.0   31.1
oestm    62.1   10.2   33.7   29.4
In Table 3, we observe that a large improvement
is achieved from the baseline model "stm"
to the final model "oestm". The BLEU, NIST,
mWER and mPER scores are improved by 7.9%,
2.7, 6.1% and 5.4% respectively. Note that a higher
BLEU or NIST score means a better
translation, while a higher mWER or mPER
means a worse one. Consistent performance improvement
was achieved in the single-best and N-best
recognition hypothesis translations. We
observed that the improvement was due to the
following reasons:
• Optimization. Models with optimized parameters
yielded a better translation than
the models with unoptimized parameters.
This can be seen by comparing the model
"stm" with the model "ostm" for both the
single-best and the N-best results.
• N-best recognition hypotheses. In the majority
of the cells in Table 3, the translation
performance of the N-best recognition is better
than that of the corresponding single-best
recognition. The N-best BLEU score of "ostm"
improved over the single-best of "ostm" by
2.1%. However, the NIST score is indifferent
to the change. It appears that the NIST score
is insensitive to slight translation
changes.
• Enhanced features. Translation performance
is improved steadily when more features
are incorporated into the log-linear
models. The translation performance of model
"oetm" is better than that of model "ostm" because
more effective translation features
are used. Model "oestm" is better than
model "oetm" due to its added speech
recognition features. This confirms that our
approach of integrating features from speech
recognition and translation works
very well.

Table 4: Translation improvement of incorrectly
recognized utterances from single-best (oetm) to
N-best (oestm)

              BLEU   NIST   mWER   mPER
single-best   29.0    6.1   59.7   51.8
N-best        36.3    7.2   54.4   47.9
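The improvements quoted for Table 3 can be checked with a short calculation. The sketch assumes that the baseline is the single-best "stm" row and the final model is the N-best "oestm" row; the text does not say this explicitly, but those rows reproduce the quoted figures. The variable and function names are illustrative only.

```python
# Scores copied from Table 3, in the order (BLEU, NIST, mWER, mPER).
single_best_stm = (54.2, 7.5, 39.8, 34.8)
single_best_ostm = (59.0, 8.9, 36.2, 34.0)
n_best_ostm = (61.1, 8.8, 36.4, 33.9)
n_best_oestm = (62.1, 10.2, 33.7, 29.4)

def absolute_change(before, after):
    """Per-metric absolute difference, rounded to one decimal place."""
    return tuple(round(abs(a - b), 1) for b, a in zip(before, after))

baseline_to_final = absolute_change(single_best_stm, n_best_oestm)
ostm_single_to_n_best = absolute_change(single_best_ostm, n_best_ostm)
```

The second comparison shows the 2.1 point BLEU gain of "ostm" next to a NIST change of only 0.1, in line with the observation that the NIST score barely moves between the single-best and N-best settings.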
4.4 Recognition Improvement of
Incorrectly Recognized Sentences
In the previous experiments we demonstrated that
speech translation performance was improved
by the proposed enhanced speech translation
model "oestm". In this section we show
that this improvement comes from the significant
improvement of incorrectly recognized sentences
when N-best recognition hypotheses are
used.
We carried out the following experiment.
Only incorrectly recognized sentences were extracted
for translation and rescored by the
model "oetm" for the single-best case and the
model "oestm" for the N-best case. The translation
results are shown in Table 4. Translations
of incorrectly recognized sentences are improved
significantly, as shown in the table.
Because we used N-best recognition hypotheses,
the log-linear model chose, among the N
hypotheses, the recognition hypothesis which
yielded the best translation. As a result, speech
recognition could be improved if a more accurate
recognition hypothesis was chosen for
translation. This effect can be observed clearly
if we extract the chosen recognition hypotheses
of incorrectly recognized sentences. Table 5
shows the word accuracy and sentence accuracy
of the recognition hypotheses selected by the
translation module. The sentence accuracy of
incorrectly recognized sentences was improved
by 7.5%. The word accuracy was also improved.

Table 5: Recognition accuracy of incorrectly
recognized utterances improved by N-best hypothesis
translation.

                     word acc. (%)   sent. acc. (%)
single-best              74.6             0
N-best   BLEU            76.4             7.5
         mWER            75.9             6.5
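The mechanism described in this section, where choosing the best translation also selects a recognition hypothesis, can be sketched as follows. The candidate layout, the function name and the toy scores are assumptions for illustration; in the toy data, feature 0 stands for a recognition score shared by all translations of one hypothesis and feature 1 for a translation score.

```python
def rescore_with_source(candidates, weights):
    """Return (recognition_index, translation) of the top-scoring candidate.

    candidates: (recognition_index, translation, features) triples pooled over
    all N recognition hypotheses and the K translations of each.
    """
    def score(candidate):
        return sum(w * f for w, f in zip(weights, candidate[2]))
    index, translation, _ = max(candidates, key=score)
    return index, translation

pooled = [
    (0, "i want go station", [-10.0, -9.0]),
    (1, "i want to go to the station", [-11.0, -4.0]),
    (1, "i want go to station", [-11.0, -6.0]),
]
joint_choice = rescore_with_source(pooled, [1.0, 1.0])
asr_only_choice = rescore_with_source(pooled, [1.0, 0.0])
```

With only the recognition feature active the top-ranked hypothesis (index 0) is kept, while the joint score moves the choice to the second hypothesis, whose translation scores better.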
5 Discussions
Regarding the integration of speech recognition
with translation, a coupling structure (Ney,
1999) was proposed as a speech translation infrastructure
that multiplies acoustic probabilities
with translation probabilities in a one-step
decoding procedure. But no experimental results
have been given on whether and how this
coupling structure improved speech translation.
(Casacuberta et al., 2002) used a finite-state
transducer where scores from acoustic information
sources and lexicon translation models
were integrated together. Word pairs of source
and target languages were tied in the decoding
graph. However, this method was only tested
for a pair of similar languages, i.e., Spanish to
English. For translating between languages of
different families where the syntactic structures
can be quite different, like Japanese and English,
the effectiveness of rigid tying of word pairs
remains to be shown.
Our approach is rather general, easy to implement
and flexible to expand. In the experiments
we incorporated features from acoustic models
and language models, but the framework is
flexible enough to include more effective features.
Indeed, log-linear models of the kind used in the
proposed speech translation paradigm have been
shown effective in many applications (Beyerlein, 1998)
(Vergyri, 2000) (Och, 2003).
In order to use speech recognition features,
the N-best speech recognition hypotheses were
needed. Using N-best hypotheses adds a computational
burden. However, our experiments have shown
that a small N seems to be adequate to achieve
most of the translation improvement without a
significant increase in computation.
6 Conclusion
In this paper we presented our approach of incorporating
both speech recognition and machine
translation features into a log-linear
speech translation model to improve speech
translation.
Under this new approach, translation performance
was significantly improved. The performance
improvement was confirmed by consistent
experimental results and measured by using
various objective translation metrics. In
particular, the BLEU score was improved by 7.9%
absolute.
We showed that features derived from speech
recognition, the likelihoods of acoustic and language
models, helped improve speech translation. The
N-best recognition hypotheses are better than
the single-best ones when they are used in translation.
We also showed that N-best recognition
hypothesis translation can improve the speech
recognition accuracy of incorrectly recognized
sentences.
The success of the experiments is owed to the
use of statistical machine translation and log-linear
models, so that a variety of effective features
can be combined and balanced to output the
optimal translation results.
Acknowledgments
We would like to thank Eiichiro Sumita,
Yoshinori Sagisaka, Seiichi Yamamoto
and the anonymous reviewers for their assistance.
The research reported here was supported in
part by a contract with the National Institute
of Information and Communications Technology
of Japan entitled "A study of speech dialogue
translation technology based on a large
corpus".

References

Peter Beyerlein. 1998. Discriminative model
combination. In Proc. of ICASSP'1998,
volume 1, pages 481–484.

Peter F. Brown, Vincent J. Della Pietra,
Stephen A. Della Pietra, and Robert L. Mercer.
1993. The mathematics of statistical
machine translation: Parameter estimation.
Computational Linguistics, 19(2):263–311.

Francisco Casacuberta, Enrique Vidal, and
Juan M. Vilar. 2002. Architectures for
speech-to-speech translation using finite-state
models. In Proc. of speech-to-speech translation
workshop, pages 39–44, Philadelphia,
PA, July.

Genichiro Kikui, Eiichiro Sumita, Toshiyuki
Takezawa, and Seiichi Yamamoto. 2003. Creating
corpora for speech-to-speech translation.
In Proc. of EUROSPEECH'2003, pages
381–384, Geneva.

Daniel Marcu and William Wong. 2002. A
phrase-based, joint probability model for statistical
machine translation. In Proc. of
EMNLP-2002, Philadelphia, PA, July.

Hermann Ney. 1999. Speech translation: Coupling
of recognition and translation. In Proc.
of ICASSP'1999, volume 1, pages 517–520,
Phoenix, AZ, March.

Sonja Niessen, Franz J. Och, Gregor Leusch,
and Hermann Ney. 2000. An evaluation tool
for machine translation: Fast evaluation for
machine translation research. In Proc. of the
LREC (2000), pages 39–45, Athens, Greece,
May.

Franz Josef Och and Hermann Ney. 2003. A
systematic comparison of various statistical
alignment models. Computational Linguistics,
29(1):19–51.

Franz Josef Och. 2003. Minimum error rate
training in statistical machine translation. In
Proc. of ACL'2003, pages 160–167.

Kishore A. Papineni, Salim Roukos, Todd
Ward, and Wei-Jing Zhu. 2002. Bleu: A
method for automatic evaluation of machine
translation. In Proc. of ACL'2002, pages
311–318, Philadelphia, PA, July.

William H. Press, Saul A. Teukolsky,
William T. Vetterling, and Brian P. Flannery.
2000. Numerical Recipes in C++.
Cambridge University Press, Cambridge,
UK.

Nicola Ueffing, Franz Josef Och, and Hermann
Ney. 2002. Generation of word graphs in statistical
machine translation. In Proc. of the
Conference on Empirical Methods for Natural
Language Processing (EMNLP02), pages
156–163, Philadelphia, PA, July.

Dimitra Vergyri. 2000. Use of word level side
information to improve speech recognition. In
Proc. of the IEEE International Conference
on Acoustics, Speech and Signal Processing,
2000.

Taro Watanabe and Eiichiro Sumita. 2003.
Example-based decoding for statistical machine
translation. In Machine Translation
Summit IX, pages 410–417, New Orleans,
Louisiana.
