Evaluation of Direct Speech Translation Method Using Inductive Learning
for Conversations in the Travel Domain
Koji MURAKAMI
Makoto HIROSHIGE
Kenji ARAKI
Graduate School of Engineering
Hokkaido University, Japan
{mura, hiro, araki}@media.eng.hokudai.ac.jp
Koji TOCHINAI
Graduate School of Business Administration
Hokkai Gakuen University, Japan
tochinai@econ.hokkai-s-u.ac.jp
Abstract
This paper evaluates a direct speech translation method with waveforms using the Inductive Learning Method for short conversations. The method is able to work
without conventional speech recognition
and speech synthesis because syntactic ex-
pressions are not needed for translation in
the proposed method. We focus only on
acoustic characteristics of speech wave-
forms of source and target languages with-
out obtaining character strings from ut-
terances. This speech translation method
can be utilized for any language because
the system has no processing that depends on the individual characters of a specific language. Therefore, we can utilize the speech of a handicapped person that cannot be handled by conventional speech recognition systems, because we do not need to segment the speech into phonemes, syllables, or words to realize speech translation. Our method is realized by inductively learning translation rules that hold acoustic correspondences between the two languages. In this paper, we deal
with a translation between Japanese and
English.
1 Introduction
Speech is the most common means of communi-
cation for us because the information contained in
speech is sufficient to play a fundamental role in conversation. Thus, it is preferable that the processing deal with speech directly. However, conventional approaches to speech translation need a text result, obtained by speech recognition, as input to machine translation, although that result may contain errors or unrecognized portions.

Figure 1: Comparison of conventional and our approach.
A text is translated through morphological anal-
ysis, syntactic analysis, and parsing of the sentence
of the target language. Finally, the speech synthesis
stage produces speech output of the target language.
Figure 1(A) shows the whole procedure of a tradi-
tional speech translation approach.
The procedure has several complicated processes that do not give satisfactory results, and the lack of accuracy in each stage culminates in a poor final result. For example, character strings obtained by speech recognition may represent different information than the original speech.

Figure 2: Processing structure.
Murakami et al. (1997) attempted to recognize several vowels and consonants using neural networks whose structures differed from TDNN (ATR Lab., 1995); however, they could not obtain high recognition accuracy. They confirmed that distinguishing the boundaries of words, syllables, or phonemes is a task of great difficulty. They therefore focused only on the speech waveform itself, not on character strings obtained by speech recognition, to realize speech translation, and decided to deal with the correspondence of acoustic characteristics of speech waveforms between two utterances instead of character strings.
Our approach handles the acoustic characteristics of speech without lexical expressions, through a much simpler structure than those reported by Takizawa et al. (1998), Müller et al. (1999), or Lavie et al. (1997), because we believe that simplification of the system prevents inaccuracies in the translation. Figure 1(B) shows the processing stages of
our approach. If speech translation can be realized
by analyzing the correspondence in character strings
obtained by speech recognition, we can also build
up speech translation by dealing with the correspon-
dence in acoustic characteristics. In our method, we
extract acoustic common parts and different parts
by comparing two examples of acoustic characteris-
tics of speech between two translation pairs within
the same language. Then we generate translation
rules and register them in a translation dictionary.
The rules also have the location information of ac-
quired parts for speech synthesis on time-domain.
The translation rules are acquired not only by com-
paring speech utterances but also using the Inductive
Learning Method (K. Araki et al., 2001), still keep-
ing acoustic information within the rules. Deciding the correspondence of meaning between the two languages is the only condition required to realize our method.
In the translation phase, when an unknown utterance of the source language is applied for translation, the system compares this utterance with the acoustic information of all rules within the source language. Several matched rules are then used, and their corresponding parts in the target language are referred to. Finally, we obtain roughly synthesized target speech by simply concatenating suitable parts of the rules in the target language according to the location information. Figure 2 shows an overview of the processing structure of our method.
Our method has several advantages over other ap-
proaches. First, the performance of the translation is
not affected by the lack of accuracy in speech recog-
nition because we do not need the segmentation of
speech into words, syllables, or phonemes. There-
fore, our method can be applied to any language without changing the processing in the machine translation stage, because there is no processing dependent on any specific language. With conventional methods, several processes in the machine translation stage must be altered whenever the target language is changed, because morphological analysis and syntactic analysis depend completely on the individual characters of each language. Differences between languages thus have no effect on the ability of the proposed method, fundamentally because we focus on the acoustic characteristics of speech, not on the character strings of languages. It is very important to approach speech translation with a new methodology that is independent of the individual characters of any language.
We also expect that our approach can be utilized in speech recuperation systems for people with a speech impediment, because our method is able to deal with various types of speech that cannot be handled by conventional speech recognition systems designed for normal voices.
Murakami et al. (2002) have successfully obtained several samples of translation by applying our method to locally recorded speech data and spontaneous conversation speech.
In this paper, we apply speech data from travel conversations to the proposed method. We evaluate the performance of the method through experiments and discuss the behavior of the system.
2 Speech processing
2.1 Speech data
It is necessary to extract time-varying spectral char-
acteristics in utterances and apply them to the sys-
tem. We used several conversation sets from an
English conversation book (GEOS Publishing Inc.,
1999). The Japanese speech data was recorded with
a 48kHz sampling rate on DAT, and downsampled
to 8kHz. All speech data in the source language
was spoken by Japanese male students of our lab-
oratory. The speech data was spoken by 2 people in
the source and target languages, respectively.
The content of the data sets consists of conversa-
tions between a client and the front desk at a hotel
and conversations between a client and train station
staff.
Table 1: Experimental conditions of speech process-
ing.
Frame size       30 msec
Frame cycle      10 msec
Speech window    Hamming window
AR order         14
2.2 Spectral characteristics of speech
In our approach, the acoustic characteristics of
speech are very important because we must find
common and different acoustic parts by comparing
them. It is assumed that acoustic characteristics are
not dependent on any language. Table 1 shows the
conditions for speech analysis. The same conditions
and the same kind of characteristic parameters of
speech are used throughout the experiments.
In this report, LPC coefficients are applied as spectral parameters because Murakami et al. (2002) obtained better results with these parameters than with other representations of speech characteristics.
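To make the conditions in Table 1 concrete, the following is a minimal sketch (in Python, assuming 8 kHz speech) of frame-based LPC extraction with 30 msec Hamming-windowed frames, a 10 msec cycle, and AR order 14; the function names and the plain Levinson-Durbin recursion are our own illustration, not the implementation used in the experiments.

import numpy as np

def lpc(frame, order=14):
    # Autocorrelation-method LPC via the Levinson-Durbin recursion.
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:                      # degenerate (e.g. silent) frame
            break
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a

def lpc_features(signal, sr=8000, frame_ms=30, hop_ms=10, order=14):
    # One LPC coefficient vector per 30 msec Hamming-windowed frame, 10 msec hop.
    frame_len = int(sr * frame_ms / 1000)   # 240 samples at 8 kHz
    hop_len = int(sr * hop_ms / 1000)       # 80 samples
    window = np.hamming(frame_len)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        feats.append(lpc(signal[start:start + frame_len] * window, order)[1:])
    return np.array(feats)                  # shape: (num_frames, order)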
2.3 Searching for the start point of parts
between utterances
When comparing speech samples, we have to consider how to normalize the elasticity on the time domain. Many methods have been investigated to resolve this problem. We sought a method able to obtain a result similar to dynamic programming (H. Sakoe et al., 1978; H. F. Silverman et al., 1990) for performing time-domain normalization. We adopted a method that investigates the difference between two characteristic vectors of speech samples to determine common and different acoustic parts. The least-squares distance method was adopted for calculating the similarity between these vectors.
Figure 3: Comparison of vector sequences.

Figure 4: Difference between utterances (1): "All right, Mr. Brown."

Figure 5: Difference between utterances (2): "All right, Mr. Brown." - "Good afternoon."

Two sequences of characteristic vectors, called the "test vector" and the "reference vector", are prepared. The test vector is picked out from the test speech by a window of definite length. At the same time, the reference vector is prepared from the reference speech. A distance value is calculated by comparing the current test vector with a portion of the reference vector. The calculation is then repeated between the current test vector and every portion of the reference vector, picked out and shifted at a constant interval on the time domain. When the portion of the reference vector reaches the end of the whole reference vector, a sequence of distance values is obtained as a result. The procedure of comparing the two vectors is shown in Figure 3. Next, a new test vector is picked out at the constant interval, and the calculation described above is repeated until the end of the test speech. Finally, we obtain several distance curves as the result of comparing the two speech samples.
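As an illustration of this sliding comparison, the sketch below computes one distance curve per test-vector position using a least-squares distance over LPC feature frames; the window and step sizes (in frames) are placeholders, not the exact settings used in the experiments.

import numpy as np

def distance_curves(test_feats, ref_feats, win=40, step=5):
    # test_feats, ref_feats: arrays of per-frame LPC vectors (frames x order).
    # For each test window, slide a window over the reference features and
    # record a least-squares distance at every shift; one row = one curve.
    curves = []
    for t0 in range(0, len(test_feats) - win + 1, step):
        test_vec = test_feats[t0:t0 + win]
        curve = [np.sum((test_vec - ref_feats[r0:r0 + win]) ** 2)
                 for r0 in range(0, len(ref_feats) - win + 1, step)]
        curves.append(np.array(curve))
    return curves   # curves[i][j]: distance of test window i at reference shift j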
Figure 4 and Figure 5 show examples of the differ-
ence between two utterances. These applied speech
samples are spoken by the same speaker. The con-
tents of the compared utterances are the same in Fig-
ure 4, and are quite different in Figure 5. The hori-
zontal axis shows the shift number of reference vec-
tor on time-domain and the vertical axis shows the
shift number of test vector, i.e., the portion of test
speech. In the figures, the curve in the lowest location has been drawn by comparing the beginning of the test speech with the whole reference speech. If a distance value in a distance curve is obviously lower than the other distance values, it means that the two vectors have strong acoustic similarity.
As shown in Figure 5, the obvious local minimum
distance point is not discovered even if there is the
lowest point in each distance curve. On the other
hand, as shown in Figure 4, when the test and refer-
ence speech have the same content, the minimum
distance values are found sequentially in distance
curves. According to these results, if a distance curve contains an obviously smallest distance point, that portion should be regarded as a "common part". Moreover, if such points appear sequentially across several distance curves, they are considered together as one common part. In that case, the part may correspond to semantic segments longer than a phoneme or a syllable.
2.4 Evaluation of the obvious minimal distance
value
To determine that the obviously lowest distance
value in the distance curve is a common part, we
adopt a threshold calculated by statistical informa-
tion. We calculate the variance σ² of the distance values within the curve, together with their mean value. The threshold is defined as θ = 4σ², motivated by the Gaussian distribution and the standardized normal distribution.

The smallest distance value within a curve is represented by x, and the parameter m denotes the mean value of the distances. A common part is detected if (x - m)² > θ, because at that point the portion of reference speech has strong similarity with the test vector of the distance curve; such a common part is labeled "0". Otherwise, the speech portion for the test vector is regarded as a different part and labeled "1". If several common parts are detected continuously, we treat them as one common part, and its first point finally becomes the start point. In our method, the acoustic similarities evaluated by these calculations are the only factor for judging whether portions of the speech samples are common or different parts.
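A minimal sketch of this decision rule, assuming one distance curve per test-vector position as in the previous sketch; the "0"/"1" labels follow the notation in the text.

import numpy as np

def classify_curve(curve):
    # theta = 4 * sigma^2; a curve marks a common part ("0") if its minimum x
    # satisfies (x - m)^2 > theta, where m is the mean distance in the curve.
    m = curve.mean()
    theta = 4.0 * curve.var()
    x = curve.min()
    if (x - m) ** 2 > theta:
        return "0", int(curve.argmin())   # common part and its reference shift
    return "1", None                      # different part

def label_utterance(curves):
    # One label per test-vector position, e.g. "000111..." over the utterance.
    return "".join(classify_curve(c)[0] for c in curves)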
3 Generation and application of
translation rule
3.1 Correction of acquired parts
The two reference speech samples are divided into
several common and different parts by comparison.
However, these parts may include errors of elasticity normalization, because the distance calculation cannot perfectly resolve this problem on the time domain. We therefore attempt to correct incomplete common and different parts using heuristic techniques when a common part is divided by an isolated different part, or a different part is divided by an isolated common part.
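One simple way to realize such a correction, shown here as a hypothetical sketch rather than the exact heuristics used in the system, is to absorb a single label that interrupts a run of the opposite label:

def correct_parts(labels):
    # Flip an isolated label sandwiched between two identical neighbours, so a
    # lone "1" inside a run of "0"s (or vice versa) is absorbed into the run.
    fixed = list(labels)
    for i in range(1, len(fixed) - 1):
        if fixed[i - 1] == fixed[i + 1] != fixed[i]:
            fixed[i] = fixed[i - 1]
    return "".join(fixed)

# Example: correct_parts("0001011100") -> "0000011100"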
3.2 Acquisition of translation rules
Common and different parts corrected in 3.1 are ap-
plied to determine the rule elements needed to gen-
erate translation rules. Figures 6 and 7 show the results of comparing utterances. In the first case, a part containing continuous values of "0" represents a common part. In the second case, a part consisting only of "1" is regarded as a different part. In Figure 6, the two utterances are calculated as one long common part. On the contrary, the two utterances are calculated as one long different part in Figure 7. These results are consistent with the lexical contents, because the syntactic sentence structures are the same in both cases.
Moreover, when a sentence structure includes
common and different parts at the same time, we can
treat this structure as a third case. We deal with these
three cases of sentence structure as rule types. In all
the above-mentioned cases, several sets of common and different parts are acquired, whether the utterances match almost completely or do not match at all. Combined sets of common parts of the source and target languages become elements of the translation rules. At this time, only those sets of common parts extracted from the source language that have a correspondence of meaning with sets of common parts in the target language are kept. The sets of different parts become elements of the translation rules as well.

Figure 6: Common and different parts (1): "All right, Mr. Brown."

Figure 7: Common and different parts (2): "All right, Mr. Brown." - "Good afternoon."
Finally, these translation rules are generated by completing all of the elements described below. It is important that rules are acquired only if the sentence types in both languages are the same. When the types of sentence structures differ, translation rules cannot be obtained and registered in the rule dictionary, because we cannot decide the correspondence between the samples of the two languages uniquely. Acquired rules are categorized into the following types:
Rule type 1: those with a very high sentence similarity
Rule type 2: those with sentences including common and different parts
Rule type 3: those with very low sentence similarity
When a new rule containing the information of several common parts is generated, the rule keeps the sentence form, with the different parts of the speech sample replaced by variables. The information that a translation rule holds is as follows (a data-structure sketch is given after the list):
- rule types as mentioned above
- index number of a source-language utterance
- sets of start and end points of each common and different part
- index number of the corresponding target-language utterance

Figure 8: Rule acquisition using the Inductive Learning Method.
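A hypothetical data-structure sketch of the information listed above; the field names are our own, and part boundaries are assumed to be locations on the time domain.

from dataclasses import dataclass
from typing import List, Tuple

# A part is (label, start, end): label "0" = common, "1" = different,
# with start/end given as locations on the time domain.
Part = Tuple[str, int, int]

@dataclass
class TranslationRule:
    rule_type: int            # 1, 2 or 3, as categorized above
    source_index: int         # index number of the source-language utterance
    target_index: int         # index number of the target-language utterance
    source_parts: List[Part]  # common/different parts on the source side
    target_parts: List[Part]  # corresponding parts on the target side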
3.3 Translation and speech synthesis
When an unknown speech utterance of the source language is applied for translation, the acoustic information of the acquired parts in the translation rules is compared in turn with the unknown speech, and the matched rules become candidates for the translation. The input utterance should be reproducible by a combination of several candidate rules. Then, the corresponding parts of the target language in the candidate rules are referred to in order to obtain the translated speech. Although the final synthesized target speech is produced only roughly, it can be concatenated directly from suitable parts of the rules in the target language, using the location information on the time domain stored in the rules.
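The sketch below illustrates this matching-and-concatenation step under simplifying assumptions: it reuses the hypothetical TranslationRule structure above, treats part boundaries as indices valid for both the feature sequences and the target waveforms, and uses an ad-hoc similarity score in place of the rate of agreement defined in Table 2.

import numpy as np

def similarity(a, b):
    # Crude agreement score in [0, 1]: 1.0 for identical feature sequences.
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return 1.0 / (1.0 + float(np.mean((a[:n] - b[:n]) ** 2)))

def translate(unknown_feats, rules, source_feats, target_waves, agreement=0.95):
    # Compare the acquired source-side parts of every rule with the unknown
    # utterance; for parts that clear the agreement rate, splice in the
    # corresponding target-side samples, ordered by source-side location.
    pieces = []
    for rule in rules:
        for (_, s, e), (_, ts, te) in zip(rule.source_parts, rule.target_parts):
            part = source_feats[rule.source_index][s:e]
            if similarity(unknown_feats[s:e], part) >= agreement:
                pieces.append((s, target_waves[rule.target_index][ts:te]))
    pieces.sort(key=lambda p: p[0])
    return np.concatenate([w for _, w in pieces]) if pieces else np.array([])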
4 The Inductive Learning Method
The Inductive Learning Method proposed by Araki et al. (2001) acquires rules by extracting common and different parts through the comparison of two samples. This method is designed from the assumption that a human being is able to find common and different parts between two samples even when both are unknown. The method is also able to obtain further rules by repeated comparison of the acquired rules registered in the rule dictionary.
Figure 8 shows an overview of recursive rule acquisition by this learning method. Two rules acquired as rule(i) and rule(j) are prepared and compared to extract common and different acoustic parts, in the same way as in comparisons between speech samples. These obtained parts are then registered as new rules. If the compared rules consist of several common or different parts, the calculation is repeated within each part. It is assumed that these new rules are more reliable for translation.
If several rules prove not useful for translation, they are eliminated by optimally generalizing the rule dictionary so as to keep a designed memory size. This ability of optimal generalization in the Inductive Learning Method is an advantage, as fewer examples have to be prepared beforehand, whereas conventional approaches need a large amount of sample data to acquire many suitable rules.
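As a toy illustration of this recursive acquisition, the sketch below compares label sequences pairwise, registers the extracted patterns as new rules, and prunes the dictionary back to a designed size; real rules carry acoustic vectors and location information rather than plain strings.

from itertools import combinations

def extract_pattern(rule_a, rule_b):
    # Position-wise comparison of two label strings: "0" where they agree
    # (a common part), "1" where they differ, over the overlapping length.
    return "".join("0" if x == y else "1" for x, y in zip(rule_a, rule_b))

def induce(rules, max_rules=10000):
    # One round of inductive learning: compare every pair of registered rules,
    # add the extracted patterns as new rules, then prune to the designed size.
    dictionary = set(rules)
    for a, b in combinations(rules, 2):
        dictionary.add(extract_pattern(a, b))
    return sorted(dictionary)[:max_rules]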
5 Evaluation Experiments
5.1 Experiments of rule acquisition
All data in the experiments are obtained through the speech processing explained in Section 2.1. Table 2 shows the conditions for the experiments. The parameters concerning the frame settings have been decided from the results of several preliminary experiments on rule acquisition.
Table 2: Conditions for experiments.
Frame length of test vector              400 msec
Frame rate of both vectors               50 msec
Rate of agreement for adopting rules     95%
Table 3: Translation rules.
Set of data    Utterances    Registered rules
Hotel          50            8,500
Station        32            22,846
Table 4: Appropriately acquired parts with correspondence.
Sentence ID   Rule Type   Corresponding Part/Length   Speech
ja110g        common      (22-40)/41                  SOREDEWA, BRAUN-SAMA.
ja110t        common      (106-124)/128               KOCHIRANI GOKICYOWO ONEGAIITASHIMASU, BRAUN-SAMA.
en110g        common      (17-32)/33                  All right, Mr. Brown.
en110t        common      (57-69)/71                  Please fill out this form, Mr. Brown.
Many sets of common and different parts were
extracted by comparing acoustic characteristics of
speech in each language, and translation rules were
registered in the translation rule dictionary. Table 3
shows the number of speech utterances and regis-
tered translation rules between two languages.
5.2 Experimental results of translation
If an unknown speech utterance of a source language can be replaced with acoustic information from rules in the dictionary, the speech will be translated and synthesized roughly without losing its meaning.
Each matched rule includes equivalent corresponding parts of the target language. The system needs to decide the most suitable rule candidates from the rule dictionary for each translation. If the level of similarity between the applied unknown speech and the parts of the rules is higher than the rate of agreement given in Table 2, the rules that include appropriate parts become candidates for the current translation.
82 utterances from the limited domain were applied to the system for translation. Regretfully, we could not obtain any completely translated utterances, although several samples were incompletely translated by applying translation rules.
5.3 Discussion
We have to investigate several sources of these experimental results. The first cause of the failure in translation can be found in the speech data used in these experiments. The contents of the utterances do not always contain exactly the same expressions, because the speech samples are prepared with various ways of speaking even when the semantic information is the same among them.

Table 5: Failures of rule acquisition.
                     Whole rule acquisition   Comparisons of the same content
Number of failures   527                      22

Figure 9: Difference between utterances: "Good afternoon."

Figure 10: A failed result of parts extraction: "Good afternoon."
Moreover, the small amount of speech data is another factor, because more translation rules should be acquired and applied for translation.
The system is able to perform the task when many suitable rules are registered in the rule dictionary. A sample of properly acquired parts is shown in Table 4; in this table, Japanese words are written in romanized form. These parts were successfully acquired through the learning stage, so many suitable rules can be applied to other unknown speech utterances.
Therefore, we need to increase the number of
speech samples to obtain more translation rules, and
it is also necessary to consider the contents of utter-
ances for more effective rule acquisition and appli-
cation.
In addition, we have paid attention to the parts themselves that are acquired as translation rules. We have to consider why the same sentence type is sometimes not determined correctly even when the contents are the same. Table 5 shows the number of failures in whole rule acquisition and in the case of comparisons of identical utterances. The sentence types are determined by the results of the parts extraction stage, in which thresholds play a very important role in deciding common and different parts. Figure 9 shows the distance curves of identical utterances that were not determined to be a common part by the threshold, and Figure 10 shows the corresponding result of the extraction of common and different parts. Several minimum points of the distance curves were determined to be different parts by the threshold, although the two portions of the utterances have the highest similarity at these points. This kind of failure means that the definition of the threshold is problematic. Therefore, the definition of the threshold needs to be reconsidered in order to extract common and different parts more correctly.
6 Conclusion and future works
In this paper, we have described the proposed method and have evaluated its translation performance for conversations in travel English. We have confirmed that much appropriate acoustic information is extracted by comparing speech, and that rules are generated, even though no complete target speech was obtained from the system.
Many rules become candidates for each translation because all registered rules are evaluated, at a high calculation cost. Therefore, we will need to apply a method for selecting the most suitable rules from the candidates, and a clustering algorithm to decrease the number of registered rules and the calculation cost.
We will consider adopting a new approach for re-
alizing a more effective threshold without statistical
information.
We will also consider the possibility of a direct speech translation system that translates speech from a person with a handicap in the speech production organs into normal speech, because conventional speech recognition methods are not able to assist those with a speech impediment.
Acknowledgement This work is partially supported by Grants from the Government subsidy for aiding scientific researches (No. 14658097) of the Ministry of Education, Culture, Sports, Science and Technology of Japan.
References
A. Lavie, A. Waibel, L. Levin, M. Finke, D. Gates and M. Gavaldà. 1997. JANUS-III: Speech-to-speech translation in multiple languages. In Proceedings of ICASSP '94, pages 99-102.

ATR Lab. 1995. Application of Neural Network.

GEOS Publishing Inc. 1999. English for Salespeople.

H. F. Silverman and D. P. Morgan. 1990. The application of dynamic programming to connected speech recognition. IEEE ASSP Magazine, pages 6-25.

H. Sakoe and S. Chiba. 1978. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on ASSP, pages 43-49.

J. Müller and H. Stahl. 1999. Speech understanding and speech translation by maximum a-posteriori semantic decoding. In Proceedings of Artificial Intelligence in Engineering, pages 373-384.

K. Araki and K. Tochinai. 2001. Effectiveness of natural language processing method using inductive learning. In Artificial Intelligence and Soft Computing (ASC) '01, pages 295-300.

K. Murakami, M. Hiroshige, K. Araki and K. Tochinai. 2002. Evaluation of rule acquisition for a new speech translation method with waveforms using inductive learning. In Proceedings of Applied Informatics '02, pages 288-293.

K. Murakami, M. Hiroshige, K. Araki and K. Tochinai. 2002. Behaviors and problem of the speech machine translation system for various speech data. In Proceedings of the 2002 spring meeting of the ASJ, pages 385-386.

K. Murakami, M. Hiroshige, Y. Miyanaga and K. Tochinai. 1997. A prototype system for continuous speech recognition using group training based on neural network. In Proc. ITC-CSCC '97, pages 1013-1023.

T. Takizawa, T. Morimoto, Y. Sagisaka, N. Campbell, H. Iida, F. Sugaya, A. Yokoo and S. Yamamoto. 1998. A Japanese-to-English speech translation system: ATR-MATRIX. In Proc. of ICSLP '98, pages 2779-2782.