Quality-Sensitive Test Set Selection for a Speech Translation System

Fumiaki Sugaya 1, Keiji Yasuda 2, Toshiyuki Takezawa and Seiichi Yamamoto
ATR Spoken Language Translation Research Laboratories
2-2-2 Hikari-dai, Seika-cho, Soraku-gun, Kyoto, 619-0288, Japan
{fumiaki.sugaya, keiji.yasuda, toshiyuki.takezawa, seiichi.yamamoto}@atr.co.jp

1 Current affiliation: KDDI R&D Laboratories. Also at Graduate School of Science and Technology, Kobe University.
2 Also at Graduate School of Engineering, Doshisha University.
  
Abstract
We propose a test set selection method to sensitively evaluate the performance of a speech translation system. The proposed method chooses the most sensitive test sentences by iteratively removing insensitive sentences. Experiments are conducted on the ATR-MATRIX speech translation system, developed at ATR Interpreting Telecommunications Research Laboratories. The results show the effectiveness of the proposed method: it can reduce the test set to less than 40% of its original size while improving evaluation reliability.
1 Introduction
The translation paired comparison method precisely measures the capability of a speech translation system. In this method, native speakers compare a system's translation with translations made by examinees who have various TOEIC scores. The method incurs two kinds of human cost: collecting the examinees' translations and having native speakers perform the comparisons. In this paper, we propose a test set size reduction method that reduces the number of test set utterances. The method chooses the most sensitive test utterances by iteratively removing the most insensitive utterances.
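A minimal sketch of this kind of iterative reduction loop, assuming a precomputed per-utterance sensitivity score (the paper's actual sensitivity criterion is not reproduced here; the scoring function and stopping rule below are illustrative assumptions):

```python
def reduce_test_set(utterances, sensitivity, target_size):
    """Iteratively remove the least sensitive utterance until
    only target_size utterances remain."""
    selected = list(utterances)
    while len(selected) > target_size:
        # Drop the utterance that contributes least to evaluation sensitivity.
        least_sensitive = min(selected, key=sensitivity)
        selected.remove(least_sensitive)
    return selected

# Toy usage with hypothetical sensitivity scores per utterance id.
scores = {"utt1": 0.9, "utt2": 0.1, "utt3": 0.5, "utt4": 0.2}
kept = reduce_test_set(scores, scores.get, target_size=2)
print(sorted(kept))  # the two most sensitive utterances remain
```

Removing one utterance at a time, rather than selecting a subset in a single pass, lets each removal decision account for the utterances already discarded.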
In section 2, the translation paired comparison 
method is described. Section 3 explains the 
proposed method. In section 4, evaluation results 
for ATR-MATRIX are shown. Section 5 discusses 
the experimental results. In section 6, we state our 
conclusions. 
2 Translation paired comparison method
The translation paired comparison method  
(Sugaya, 2000) is an effective evaluation method 
for precisely measuring the capability of a speech 
translation system. In this section, a description of 
the method is given. 
2.1 Methodology of the translation paired 
comparison method 
Figure 1 shows a diagram of the translation paired 
comparison method in the case of Japanese to 
English translation. The Japanese native-speaking 
examinees are asked to listen to Japanese text and 
provide an English translation on paper.  The 
Japanese text is spoken twice within one minute, 
with a pause in-between. To measure the English 
capability of the Japanese native speakers, the 
TOEIC score is used. The examinees are requested 
to present an official TOEIC score certificate 
showing that they have taken the test within the 
past six months. A questionnaire is given to them 
and the results show that the answer time is 
moderately difficult for the examinees. 
The test text is the SLTA1 test set, which 
consists of 330 utterances in 23 conversations from 
a bilingual travel conversation database (Morimoto, 
1994; Takezawa, 1999). The SLTA1 test set is 
open for both speech recognition and language translation. The answers written on paper are typed.
[Proceedings of the Workshop on Speech-to-Speech Translation: Algorithms and Systems, Philadelphia, July 2002, pp. 109-116. Association for Computational Linguistics.]
In the proposed method, the typed translations made by the examinees and the outputs of the system are merged into evaluation sheets and then compared by an evaluator who is a native English speaker. For each utterance, the evaluation sheet shows the Japanese test text and the two translation results, i.e., the translations by an examinee and by the system. The two translations are presented in random order to eliminate bias on the part of the evaluator. The evaluator is asked to follow the procedure illustrated in Figure 2. The four ranks in Figure 2 are the same as those used in Sumita (1999). The ranks A, B, C, and D indicate: (A) Perfect: no problems in either information or grammar; (B) Fair: easy to understand, with some unimportant information missing or flawed grammar; (C) Acceptable: broken but understandable with effort; (D) Nonsense: important information has been translated incorrectly.
2.2 Evaluation result using the translation 
paired comparison method 
Figure 3 shows the result of a comparison between a language translation subsystem (TDMT) and the examinees. The inputs to TDMT were accurate transcriptions. The total number of examinees was thirty, with five people having scores in each hundred-point TOEIC range between the 300s and the 800s. In Figure 3, the horizontal axis represents the TOEIC score and the vertical axis the system winning rate (SWR), given by the following equation:
SWR = (N_TDMT + 0.5 × N_EVEN) / N_TOTAL        (1)

where N_TOTAL denotes the total number of utterances in the test set, N_TDMT represents the number of "TDMT won" utterances, and N_EVEN indicates the number of even (non-winner) utterances, i.e., utterances for which there is no difference between the results of TDMT and the human. The SWR ranges from 0 to 1.0, signifying the degree of capability of the MT system relative to that of the examinee. An SWR of 0.5 means that TDMT has the same capability as the human examinee.

Figure 3 shows that the SWR of TDMT is greater than 0.5 at TOEIC scores of around 300 and 400, i.e., the TDMT system wins over humans with TOEIC scores of 300 and 400. Examinees, in contrast, win at scores of around 800. The capability-balanced area is around a score of 600 to
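Equation (1) is straightforward to compute once the paired comparison has produced win and tie counts; a small helper (the example counts below are hypothetical):

```python
def system_winning_rate(n_tdmt, n_even, n_total):
    """Equation (1): SWR = (N_TDMT + 0.5 * N_EVEN) / N_TOTAL."""
    return (n_tdmt + 0.5 * n_even) / n_total

# Hypothetical counts: out of 330 utterances, TDMT wins 100 and ties 130.
swr = system_winning_rate(n_tdmt=100, n_even=130, n_total=330)
print(round(swr, 3))  # 0.5, i.e., system and examinee are evenly matched
```

Counting each tie as half a win is what keeps the SWR on a 0-to-1 scale with 0.5 as the equal-capability point.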
Figure 1: Diagram of the translation paired comparison method (speech recognition by Japanese SPREC; Japanese-to-English language translation by J-E TDMT)
Figure 2: Procedure of comparison by native speaker (choose an A, B, C, or D rank for each translation; if the ranks differ, select the better result; if the ranks are the same, consider naturalness; if still the same, judge the pair EVEN)
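The decision flow of Figure 2 can be sketched as follows; the numeric naturalness score is a hypothetical stand-in for the evaluator's subjective naturalness judgment:

```python
RANKS = "ABCD"  # A is best, D is worst

def compare(rank_system, rank_human, naturalness_system=0, naturalness_human=0):
    """Figure 2 procedure: compare ranks first; on a rank tie,
    compare naturalness; if still tied, the pair is judged EVEN."""
    if rank_system != rank_human:
        # A lower index in RANKS means a better rank.
        return "system" if RANKS.index(rank_system) < RANKS.index(rank_human) else "human"
    if naturalness_system != naturalness_human:
        return "system" if naturalness_system > naturalness_human else "human"
    return "EVEN"

print(compare("B", "C"))  # system wins: rank B beats rank C
print(compare("A", "A"))  # EVEN: same rank, same naturalness
```

Note that every outcome is one of "system", "human", or "EVEN", which is exactly the three-way split that the counts N_TDMT and N_EVEN in equation (1) are drawn from.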

References

Morimoto, T., Uratani, N., Takezawa, T., Furuse, O., Sobashima, Y., Iida, H., Nakamura, A., Sagisaka, Y., Higuchi, N. and Yamazaki, Y. 1994. A speech and language database for speech translation research. In Proceedings of ICSLP '94, pages 1791-1794.

Sugaya, F., Takezawa, T., Yokoo, A., Sagisaka, Y. and Yamamoto, S. 2000. Evaluation of the ATR-MATRIX speech translation system with a pair comparison method between the system and humans. In Proceedings of ICSLP 2000, pages 1105-1108.

Sumita, E., Yamada, S., Yamamoto, K., Paul, M., Kashioka, H., Ishikawa, K. and Shirai, S. 1999. Solutions to problems inherent in spoken-language translation: The ATR-MATRIX approach. In Proceedings of MT Summit '99, pages 229-235.

Takezawa, T. 1999. Building a bilingual travel conversation database for speech recognition research. In Proceedings of Oriental COCOSDA Workshop, pages 17-20.

Takezawa, T., Morimoto, T., Sagisaka, Y., Campbell, N., Iida, H., Sugaya, F., Yokoo, A. and Yamamoto, S. 1998. A Japanese-to-English speech translation system: ATR-MATRIX. In Proceedings of ICSLP 1998, pages 2779-2782.
