Automatic Measuring of English Language Proficiency using MT
Evaluation Technology
Keiji Yasuda
ATR Spoken Language Translation
Research Laboratories
Department of SLR
2-2-2 Hikaridai,
“Keihanna Science City”
Kyoto 619-0288 Japan
keiji.yasuda@atr.jp
Fumiaki Sugaya
KDDI R&D Laboratories
2-1-15, Ohara, Kamifukuoka-city,
Saitama, 356-8502, Japan
fsugaya@kddilabs.jp
Eiichiro Sumita
ATR Spoken Language Translation
Research Laboratories
Department of NLR
2-2-2 Hikaridai,
“Keihanna Science City”
Kyoto 619-0288 Japan
eiichiro.sumita@atr.jp
Toshiyuki Takezawa
ATR Spoken Language Translation
Research Laboratories
Department of SLR
2-2-2 Hikaridai,
“Keihanna Science City”
Kyoto 619-0288 Japan
toshiyuki.takezawa@atr.jp
Genichiro Kikui
ATR Spoken Language Translation
Research Laboratories
Department of SLR
2-2-2 Hikaridai,
“Keihanna Science City”
Kyoto 619-0288 Japan
genichiro.kikui@atr.jp
Seiichi Yamamoto
ATR Spoken Language Translation
Research Laboratories
2-2-2 Hikaridai,
“Keihanna Science City”
Kyoto 619-0288 Japan
seiichi.yamamoto@atr.jp
Abstract
Assisting in foreign language learning is one of
the major areas in which natural language pro-
cessing technology can contribute. This paper
proposes a computerized method of measuring
communicative skill in English as a foreign lan-
guage. The proposed method consists of two
parts. The first part selects test sentences
so as to achieve precise measurement
with a small test set. The second part is the ac-
tual measurement, which has three steps. Step
one asks proficiency-known human subjects to
translate Japanese sentences into English. Step
two gauges the match between the translations
of the subjects and correct translations based
on the n-gram overlap or the edit distance be-
tween translations. Step three learns the rela-
tionship between proficiency and match. By re-
gression it finds a straight-line fitting for the
scatter plot representing the proficiency and
matches of the subjects. Then, it estimates pro-
ficiency of proficiency-unknown users by using
the line and the match. Based on this approach,
we conducted experiments on estimating the
Test of English for International Communica-
tion (TOEIC) score. We collected two sets of
data consisting of English sentences translated
from Japanese. The first set consists of 330 sentences,
each translated into English by 29 subjects
with varied English proficiency. The second set
consists of 510 sentences translated in a similar
manner by a separate group of 18 subjects. We
found that the estimated scores correlated with
the actual scores.
1 Introduction
For effective second language learning, it is ab-
solutely necessary to test proficiency in the sec-
ond language. This testing can help in selecting
educational materials before learning, checking
learners’ understanding after learning, and so
on.
To make learning efficient, it is important to
achieve testing with a short turnaround time.
Computer-based testing is one solution for this,
and several kinds of tests have been developed,
including CASEC (CASEC, 2004) and TOEFL-
CBT (TOEFL, 2004). However, these tests are
mainly based on cloze testing or multiple-choice
questions. Consequently, they incur labour
costs for expert examination designers to write
the questions and the alternative “distractor”
answers.
In this paper, we propose a method for the automatic
measurement of English language pro-
ficiency by applying automatic evaluation tech-
niques. The proposed method selects adequate
test sentences from an existing corpus. Then,
it automatically evaluates the translations of
test sentences done by users. The core tech-
nology of the proposed method, i.e., the auto-
matic evaluation of translations, was developed
in research aiming at the efficient development
of Machine Translation (MT) technology (Su et
al., 1992; Papineni et al., 2002; NIST, 2002).
In the proposed method, we apply these MT
evaluation technologies to the measurement of
human English language proficiency. The pro-
posed method focuses on measuring the commu-
nicative skill of structuring sentences, which is
indispensable for writing and speaking. It does
not measure elementary capabilities including
vocabulary or grammar. This method also pro-
poses a test sentence selection scheme to enable
efficient testing.
Section 2 describes several automatic evaluation
methods applied in the proposed method.
Section 3 introduces the proposed evaluation
scheme. Section 4 shows the evaluation results
obtained by the proposed method. Section 5
concludes the paper.
2 MT Evaluation Technologies
In this section, we briefly describe automatic
evaluation methods of translation. These meth-
ods were proposed to evaluate MT output, but
they are applicable to translation by humans.
All of these methods are based on the same
idea, that is, to compare the target transla-
tion for evaluation with high-quality reference
translations that are usually done by skilled
translators. Therefore, these methods require a
corpus of high-quality human reference translations.
We call these translations “references”.
2.1 DP-based Method
The DP score between a translation output and
references can be calculated by DP matching
(Su et al., 1992; Takezawa et al., 1999). First,
we define the DP score between sentence (i.e.,
word array) Wa and sentence Wb by the follow-
ing formula.
$$S_{DP}(W_a, W_b) = \frac{T - S - I - D}{T} \qquad (1)$$
where T is the total number of words in Wa, S is
the number of substitution words for comparing
Wa to Wb, I is the number of inserted words for
comparing Wa to Wb, and D is the number of
deleted words for comparing Wa to Wb.
Using Equation 1, SDPi(j), that is, the test
sentence unit DP-score of the translation of test
sentence j done by subject i, can be calculated
by the following formula.
$$S_{DP_i}(j) = \max_{k = 1, \ldots, N_{ref}} \left\{ S_{DP}\left(W_{ref(k)}(j),\, W_{sub(i)}(j)\right),\ 0 \right\} \qquad (2)$$
where Nref is the number of references,
Wref(k)(j) is the k-th reference of the test sen-
tence j, and Wsub(i)(j) is the translation of the
test sentence j done by subject i.
Finally, SDPi, which is the test set unit DP-
score of subject i, can be calculated by the fol-
lowing formula.
$$S_{DP_i} = \frac{1}{N_{sent}} \sum_{j=1}^{N_{sent}} S_{DP_i}(j) \qquad (3)$$
where Nsent is the number of test sentences.
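Equations 1 to 3 can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the experiments: it assumes that S, I and D come from a minimum edit-distance alignment with unit costs (the cost settings are not stated here), and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, insertions and deletions
    needed to turn the word list ref into the word list hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def dp_score(ref, hyp):
    """Equation 1: (T - S - I - D) / T, where T is the length of ref.
    Assumption: S + I + D equals the unit-cost edit distance."""
    t = len(ref)
    return (t - edit_distance(ref, hyp)) / t


def sentence_dp_score(refs, hyp):
    """Equation 2: best score over all references, floored at 0."""
    return max(max(dp_score(r, hyp) for r in refs), 0.0)


def set_dp_score(all_refs, hyps):
    """Equation 3: mean of the sentence unit scores over the test set."""
    scores = [sentence_dp_score(refs, hyp) for refs, hyp in zip(all_refs, hyps)]
    return sum(scores) / len(scores)
```

Following Equation 2, the reference is passed as the first argument, so T is the length of the reference rather than of the subject's translation.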
2.2 N-gram-based Method
Papineni et al. (2002) proposed BLEU, which is
an automatic method for evaluating MT qual-
ity using N-gram matching. The National Insti-
tute of Standards and Technology also proposed
an automatic evaluation method called NIST
(2002), which is a modified method of BLEU.
In this research we use two kinds of units to
apply BLEU and NIST. One is a test sentence
unit and the other is a test set unit. The test
sentence unit corresponds to the “segment” unit
in the original BLEU and NIST studies (Papineni
et al., 2002; NIST, 2002).
Equation 4 is the test sentence unit BLEU
score formulation of the translation of test sentence
j done by subject i.
$$S_{BLEU_i}(j) = \exp\left\{ \sum_{n=1}^{N} w_n \log(p_n) - \max\left( \frac{L^{*}_{ref}}{L_{sys}} - 1,\ 0 \right) \right\} \qquad (4)$$
where
$$p_n = \frac{\sum_{C \in \{Candidates\}} \sum_{n\text{-gram} \in \{C\}} Count_{clip}(n\text{-gram})}{\sum_{C \in \{Candidates\}} \sum_{n\text{-gram} \in \{C\}} Count(n\text{-gram})}$$
$$w_n = N^{-1}$$
and
L*ref = the number of words in the reference
translation that is closest in length to the
translation being scored
Lsys = the number of words in the translation
being scored
Equation 5 is the test sentence unit NIST
score formulation of the translation of test sentence
j done by subject i.
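$$S_{NIST_i}(j) = \sum_{n=1}^{N} \left\{ \frac{\sum_{\text{all } w_1 \ldots w_n \text{ in sys output}} info(w_1 \ldots w_n)}{\sum_{\text{all } w_1 \ldots w_n \text{ in sys output}} (1)} \right\} \times \exp\left\{ \beta \log^{2}\left[ \min\left( \frac{L_{sys}}{L_{ref}},\ 1 \right) \right] \right\} \qquad (5)$$
where
$$info(w_1 \ldots w_n) = \log_2\left( \frac{\text{the number of occurrences of } w_1 \ldots w_{n-1}}{\text{the number of occurrences of } w_1 \ldots w_n} \right)$$
Lref = the average number of words in a reference
translation, averaged over all reference
translations
Lsys = the number of words in the translation
being scored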
and β is chosen to make the brevity penalty factor
equal to 0.5 when the number of words in the system
translation is 2/3 of the average number
of words in the reference translation. In Equations
4 and 5, N indicates the maximum n-gram
length. In this research we set N to 4 for BLEU
and to 5 for NIST.
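As an illustration of Equation 4, the test sentence unit BLEU score can be sketched as below. This is a sketch under stated assumptions rather than the original scoring tool: the function names are hypothetical, a score of 0 is returned whenever some p_n is zero (log(0) is undefined, and how this case was handled is not stated here), and ties for the closest reference length are broken in favour of the shorter reference.

```python
import math
from collections import Counter


def ngrams(words, n):
    """Multiset of the n-grams in a list of words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Equation 4 for one candidate translation and its references."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        total = sum(cand.values())
        if total == 0:            # candidate shorter than n words
            return 0.0            # assumption: treat p_n = 0 as score 0
        # Clip each n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
        if clipped == 0:
            return 0.0            # assumption: treat p_n = 0 as score 0
        log_sum += math.log(clipped / total) / max_n   # w_n = 1 / N
    l_sys = len(candidate)
    # L*_ref: length of the reference closest in length to the candidate.
    l_ref = min((abs(len(r) - l_sys), len(r)) for r in references)[1]
    penalty = max(l_ref / l_sys - 1, 0)
    return math.exp(log_sum - penalty)
```

With N = 4, any candidate of fewer than four words scores 0 under this sketch, so very short test sentences carry little information at the sentence unit.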
We may consider the unit of the test set cor-
responding to the unit of “document” or “sys-
tem” in BLEU and NIST. However, we use for-
mulations for the test set unit scores that are
different from those of the original BLEU and
NIST.
Figure 1: Flow of Test Set Selection
[Flow diagram. Boxes: Corpus; Japanese test set; References translated by bilinguals; English writing by proficiency-known human subjects; English sentences by proficiency; Automatic evaluation (sentence unit evaluation); Calculate correlation between TOEIC score and sentence unit automatic score; Select test sentences based on correlation.]
The test set unit scores of BLEU and NIST
are calculated by Equations 6 and 7.
$$S_{BLEU_i} = \frac{1}{N_{sent}} \sum_{j=1}^{N_{sent}} S_{BLEU_i}(j) \qquad (6)$$
$$S_{NIST_i} = \frac{1}{N_{sent}} \sum_{j=1}^{N_{sent}} S_{NIST_i}(j) \qquad (7)$$
3 The Proposed Method
The proposed method described in this paper
consists of two parts. One is the test set selec-
tion part and the other is the actual measure-
ment part. The measurement part is divided
into two phases: a parameter-estimation phase
and a testing phase. Here, we use the term “subjects”
to refer to the human subjects in the test
set selection part and the parameter-estimation
phase of the measurement part; we use “users”
to refer to the humans in the testing phase of
the measurement part.
Figure 2: Flow of English Proficiency Measurement
[Flow diagram. Parameter-estimation phase: Corpus; Japanese test set; References translated by bilinguals; English writing by proficiency-known human subjects; English sentences by proficiency; Automatic evaluation (test set unit evaluation); Regression analysis using proficiency and automatic scores; Regression coefficient. Testing phase: English writing by a user; English sentences; Automatic evaluation; Automatic score; Estimation of English proficiency; English proficiency.]
We employ the Test of English for Interna-
tional Communication (TOEIC, 2004) as an ob-
jective measure of English proficiency.
3.1 Test Sentence Selection Method
Figure 1 shows the flow of the test sentence selection.
We first calculate the test sentence
unit automatic score by using Equation 2, 4 or
5 for each test sentence and subject. Second,
for each test sentence, we calculate the correlation
between the automatic scores and subjects’
TOEIC scores. Finally, using the above results,
we choose the test sentences that give high correlation.
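The selection procedure above can be sketched as follows. This is a minimal illustration with hypothetical names: it assumes the Pearson correlation coefficient (the type of correlation is not specified here), assigns a correlation of 0 to a sentence on which all subjects score identically, and keeps the k highest-correlated sentences.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def select_sentences(scores, toeic, k):
    """scores[i][j] is the sentence unit score of subject i on sentence j;
    toeic[i] is the TOEIC score of subject i.
    Returns the indices of the k best sentences and all correlations."""
    n_sent = len(scores[0])
    corr = [pearson([row[j] for row in scores], toeic) for j in range(n_sent)]
    ranked = sorted(range(n_sent), key=lambda j: corr[j], reverse=True)
    return ranked[:k], corr
```

For example, with three subjects whose TOEIC scores are 400, 600 and 800, a sentence on which every subject obtains the same score gets correlation 0 and is ranked below any sentence whose scores rise with proficiency.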
3.2 Method of Measuring English
Proficiency
Figure 2 shows the flow of measuring English
proficiency. In the parameter-estimation phase,
for each subject, we first calculate the test set
unit automatic score by using Equation 3, 6 or
7. Next, we apply regression analysis using the
automatic scores and subjects’ TOEIC scores.
In the testing phase, we calculate a user’s
TOEIC score using the automatic score of the
user and the regression line calculated in the
parameter-estimation phase.
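The two phases can be sketched as an ordinary least-squares line fit followed by a prediction. This is a sketch only: it assumes that the TOEIC score is regressed on the test set unit automatic score, the function names are hypothetical, and the numbers in the usage note are invented for illustration.

```python
def fit_line(auto_scores, toeic_scores):
    """Parameter-estimation phase: least-squares fit of
    toeic = slope * auto_score + intercept."""
    n = len(auto_scores)
    mx = sum(auto_scores) / n
    my = sum(toeic_scores) / n
    sxx = sum((x - mx) ** 2 for x in auto_scores)
    sxy = sum((x - mx) * (y - my) for x, y in zip(auto_scores, toeic_scores))
    slope = sxy / sxx if sxx != 0 else 0.0   # flat line if all scores are equal
    return slope, my - slope * mx


def estimate_toeic(auto_score, slope, intercept):
    """Testing phase: map a user's automatic score onto the regression line."""
    return slope * auto_score + intercept
```

With invented subject data `fit_line([0.2, 0.4, 0.6], [400, 600, 800])`, the fitted line has slope 1000 and intercept 200, so a user with an automatic score of 0.5 would be assigned an estimated TOEIC score of about 700.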
4 Experiments
4.1 Experimental Conditions
4.1.1 Test sets
For the experiments, we employ two differ-
ent test sets. One is BTEC (Basic Travel
Expression Corpus) (Takezawa et al., 2002)
and the other is SLTA1 (Takezawa, 1999).
Both BTEC and SLTA1 are parts of bilingual
corpora that have been collected for research
on speech translation systems. However, they
have different features. A detailed analysis
of these corpora was done by Kikui et al.
(2003). Here, we briefly explain these test sets.
In this study, we use the Japanese side as a
test set and the English side as a reference for
automatic evaluation.
BTEC
BTEC was designed to cover expressions for
every potential subject in travel conversation.
This test set was collected by investigating
“phrasebooks” that contain Japanese/English
sentence pairs that experts consider useful for
tourists traveling abroad. One sentence con-
tains 8 words on average. The test set for this
experiment consists of 510 sentences from the
BTEC corpus.
The total number of examinees is 18, and
the range of their TOEIC scores is between the
400s and 900s. Every hundred-point range has
3 examinees.
SLTA1
SLTA1 consists of 330 sentences in 23 conver-
sations from the ATR bilingual travel conver-
sation database (Takezawa, 1999). One sen-
tence contains 13 words on average. This corpus
was collected by simulated dialogues between
Japanese and English speakers through a pro-
fessional interpreter. The topics of the conver-
sations are mainly hotel conversations, such as
reservations, enquiries and so on.
The total number of examinees is 29, and the
range of their TOEIC score is between the 300s
and 800s. Excluding the 600s, every hundred-
point range has 5 examinees.
4.1.2 Reference
For the automatic evaluation, we collected 16
references for each test sentence. One of them
is from the English side of the test set, and the
remaining 15 were translated by 5 bilinguals (3
references per bilingual).
4.2 Experimental Results
4.2.1 Experimental Results of Test Set
Selection
Figures 3 and 4 show the correlation between
the test sentence unit automatic score and the
subjects’ TOEIC score. Here, the automatic
score is calculated using Equation 2, 4 or 5. Figure
3 shows the results on BTEC, and Fig. 4
shows the results on SLTA1. In these figures,
the ordinate represents the correlation.
The filled circles indicate the results using the
DP-based automatic evaluation method. The
gray circles indicate the results using BLEU.
The empty circles indicate the results using
NIST. Looking at these figures, we find that
the three automatic evaluation methods show
a similar tendency. Comparing BTEC and
SLTA1, BTEC contains more problematic test
sentences. In BTEC, about 20% of the test sentences
give a correlation of less than 0. Meanwhile,
in SLTA1, this percentage is about
10%.
Figure 3: Correlation between test sentence unit
automatic scores and subjects’ TOEIC scores
(BTEC)
[Plot. Abscissa: test sentence (sorted by correlation), 0 to 510. Ordinate: correlation, -1 to 1. Series: DP, BLEU, NIST.]
Table 1 shows examples of low-correlated test
sentences. As shown in the table, BTEC contains
more short and frequently used expressions
than does SLTA1. Such expressions are probably
too easy to discriminate between proficiency levels,
which would explain the low correlation.
SLTA1 still contains a few sentences of
this kind (“Example 1” of SLTA1 in the table).
Additionally, there is another factor contributing
to the low correlation in SLTA1.
“Example 2” of SLTA1 in the table
is not very easy to translate, and
several different expressions are acceptable
as its English translation. Thus,
automatic evaluation methods cannot evaluate
such translations correctly when the variety of
references is insufficient. Considering these results,
the selection method can remove test sentences that
are inadequate not only because they are too easy but
also because they are difficult to evaluate automatically.
Figures 5 and 6 show the relationship
between the number of test sentences and correlation.
This correlation is calculated between
the test set unit automatic scores and the subjects’
TOEIC scores. Here, the automatic score
is calculated using Equation 3, 6 or 7. Figure
5 shows the results on BTEC, and Fig. 6 shows
the results on SLTA1.
In these figures, the abscissa represents the
number of test sentences, i.e., Nsent in Equa-
tions 3, 6 and 7, and the ordinate represents
the correlation. Definitions of the circles are
the same as those in the previous figure. Here,
the test sentence selection is based on the cor-
relation shown in Figs. 3 and 4.
Table 1: Examples of low-correlated test sentences
Test set  Example    Japanese                  English
BTEC      Example 1  [garbled in extraction]   Good night.
BTEC      Example 2  [garbled in extraction]   Can I see a menu, please?
SLTA1     Example 1  [garbled in extraction]   Yes, with my Mastercard please
SLTA1     Example 2  [garbled in extraction]   I wish I could take that but we have a limited budget so how much will that cost?
Figure 4: Correlation between test sentence unit
automatic scores and subjects’ TOEIC scores
(SLTA1)
[Plot. Abscissa: test sentence (sorted by correlation), 0 to 330. Ordinate: correlation, -1 to 1. Series: DP, BLEU, NIST.]
Figure 5: Correlation between test set unit
automatic scores and subjects’ TOEIC scores
(BTEC)
[Plot. Abscissa: number of test sentences, 0 to 510. Ordinate: correlation, 0.6 to 1. Series: DP, BLEU, NIST.]
Comparing Fig. 5 to Fig. 6, in the case of
using the full test set (510 test sentences for
BTEC, 330 test sentences for SLTA1), the correlation
of BTEC is lower than that of SLTA1.
As we mentioned above, the ratio of
low-correlated test sentences in BTEC is higher than
that in SLTA1 (see Figs. 3 and 4). This
is thought to cause the decrease in the correlation
shown in Fig. 5. However, by applying the
selection based on sentence unit correlation, these
obstructive test sentences can be removed. This
permits the selection of highly correlated, small
test sets. In these figures, the highest
correlations are around 0.95.
Figure 6: Correlation between test set unit
automatic scores and subjects’ TOEIC scores
(SLTA1)
[Plot. Abscissa: number of test sentences, 0 to 330. Ordinate: correlation, 0.6 to 1. Series: DP, BLEU, NIST.]
Figure 7: Standard error (BTEC)
[Plot. Abscissa: number of test sentences, 0 to 510. Ordinate: standard error, 50 to 350. Series: DP, BLEU, NIST.]
4.2.2 Experimental Results of English
Proficiency Measurement
For the experiments on English proficiency measurement,
we carried out a leave-one-out cross
validation test. The leave-one-out cross validation
test is conducted not only for the measurement
of the English proficiency but also for the
test set selection.
Figure 8: Standard error (SLTA1)
[Plot. Abscissa: number of test sentences, 0 to 330. Ordinate: standard error, 50 to 350. Series: DP, BLEU, NIST.]
To evaluate the proficiency measurement by
the proposed method, we calculate the standard
error of the results of a leave-one-out cross validation
test. The following formula is the definition
of the standard error.
$$\sigma_E = \sqrt{\frac{1}{N_{user}} \sum_{i=1}^{N_{user}} (T_i - A_i)^2} \qquad (8)$$
where Nuser is the number of users, Ti is the
actual TOEIC score of user i, and Ai is user i’s
estimated TOEIC score by using the proposed
method.
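The leave-one-out procedure and Equation 8 can be sketched together as follows. This is an illustrative sketch with hypothetical names, not the experimental code. It assumes Pearson correlation for the selection, a least-squares line for the regression, and a test set unit score computed as the mean of the selected sentence unit scores (as in Equations 3, 6 and 7). Each subject is held out in turn and treated as the user, and both the test sentence selection and the regression use only the remaining subjects.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx != 0 else 0.0
    return slope, my - slope * mx


def standard_error(actual, estimated):
    """Equation 8: root mean squared difference between T_i and A_i."""
    n = len(actual)
    return math.sqrt(sum((t - a) ** 2 for t, a in zip(actual, estimated)) / n)


def loo_standard_error(scores, toeic, k):
    """scores[i][j]: sentence unit score of subject i on sentence j."""
    n_sub, n_sent = len(scores), len(scores[0])
    estimates = []
    for held in range(n_sub):
        train = [i for i in range(n_sub) if i != held]
        train_toeic = [toeic[i] for i in train]
        # Test sentence selection on the remaining subjects only.
        corr = [pearson([scores[i][j] for i in train], train_toeic)
                for j in range(n_sent)]
        chosen = sorted(range(n_sent), key=lambda j: corr[j], reverse=True)[:k]
        # Test set unit score: mean over the selected sentences.
        unit = [sum(scores[i][j] for j in chosen) / len(chosen)
                for i in range(n_sub)]
        slope, intercept = fit_line([unit[i] for i in train], train_toeic)
        estimates.append(slope * unit[held] + intercept)
    return standard_error(toeic, estimates)
```

Repeating `loo_standard_error` for increasing values of k would give one curve of standard error against the number of test sentences for a given automatic evaluation method.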
Figures 7 and 8 show the relationship between
the number of test sentences and the standard
error.
In these figures, the abscissa represents the
number of test sentences, and the ordinate rep-
resents the standard error. Definitions of the
circles are the same as in the previous figure.
Here, the test sentence selection is based on the
correlation shown in Figs. 3 and 4.
Looking at Figs. 7 and 8, we can observe differences
between the standard errors of BTEC
and SLTA1. This is thought to be due to the
difference in the number of subjects in the experiments
(for the leave-one-out cross validation
test, 17 subjects with BTEC and 28 subjects
with SLTA1). Even though these were
closed experiments, the results in Figs. 5 and
6 show an even higher correlation with BTEC
than with SLTA1 at the highest point. Therefore,
there is room for improvement by increasing
the number of subjects with BTEC.
In the test using 30 to 60 test sentences in
Figs. 7 and 8, the standard errors are much
smaller than in the test using the full test set
(510 test sentences for BTEC, 330 test sentences
for SLTA1). These results imply that the test
set selection works very well and that it enables
precise testing using a smaller size test set.
5 Conclusion
We proposed an automatic measurement
method for English language proficiency. The
proposed method applies automatic MT evalu-
ation to measure human English language pro-
ficiency. This method focuses on measuring the
communicative skill of structuring sentences,
which is indispensable in writing and speaking.
However, it does not measure elementary capa-
bilities such as vocabulary and grammar. The
method also involves a new test sentence selec-
tion scheme to enable efficient testing.
In the experiments, we used TOEIC as an ob-
jective measure of English language proficiency.
We then applied some currently available auto-
matic evaluation methods: BLEU, NIST and a
DP-based method. We carried out experiments
on two test sets: BTEC and SLTA1. Accord-
ing to the experimental results, the proposed
method gave a good measurement result on a
small-sized test set. The standard error of measurement
is around 120 points on the TOEIC
score with BTEC and less than 100 points
with SLTA1. In both cases, the
optimum size of the test set is 30 to 60 test sentences.
The proposed method still needs human
labour to make the references. To obtain higher
portability, we will apply an automatic para-
phrase scheme (Finch et al., 2002; Shimohata
and Sumita, 2002) to make the references auto-
matically.
6 Acknowledgements
The research reported here was supported in
part by a contract with the National Institute
of Information and Communications Technology
entitled “A study of speech dialogue translation
technology based on a large corpus”.

References
CASEC. 2004. Computer Assessment
System for English Communication.
http://www.ets.org/toefl/.
A. Finch, T. Watanabe, and E. Sumita. 2002.
“Paraphrasing by Statistical Machine Trans-
lation”. In Proceedings of the 1st Forum on
Information Technology (FIT2002), volume
E-53, pages 187–188.
G. Kikui, E. Sumita, T. Takezawa, and
S. Yamamoto. 2003. “Creating Corpora for
Speech-to-Speech Translation”. In Proceed-
ings of EUROSPEECH, pages 381–384.
NIST. 2002. Automatic Evaluation
of Machine Translation Quality Using
N-gram Co-Occurrence Statistics.
http://www.nist.gov/speech/tests/mt/mt2001/resource/.
K. Papineni, S. Roukos, T. Ward, and W.-
J. Zhu. 2002. Bleu: a method for auto-
matic evaluation of machine translation. In
Proceedings of the 40th Annual Meeting of
the Association for Computational Linguis-
tics (ACL), pages 311–318.
M. Shimohata and E. Sumita. 2002. “Auto-
matic Paraphrasing Based on Parallel Corpus
for Normalization”. In Proceedings of Inter-
national Conference on Language Resources
and Evaluation (LREC), pages 453–457.
K.-Y. Su, M.-W. Wu, and J.-S. Chang. 1992.
A new quantitative quality measure for machine
translation systems. In Proceedings of
the 14th International Conference on Computational
Linguistics (COLING), pages 433–439.
T. Takezawa, F. Sugaya, A. Yokoo, and S. Yamamoto.
1999. A new evaluation method for
speech translation systems and a case study
on ATR-MATRIX from Japanese to English.
In Proceedings of Machine Translation Summit
(MT Summit), pages 299–307.
T. Takezawa, E. Sumita, F. Sugaya, H. Ya-
mamoto, and S. Yamamoto. 2002. “Toward a
Broad-Coverage Bilingual Corpus for Speech
Translation of Travel Conversations in the
Real World”. In Proceedings of International
Conference on Language Resources and Eval-
uation (LREC), pages 147–152.
T. Takezawa. 1999. Building a bilingual travel
conversation database for speech translation
research. In Proceedings of the 2nd International
Workshop on East-Asian Language
Resources and Evaluation – Oriental COCOSDA
Workshop ’99 –, pages 17–20.
TOEFL. 2004. Test of English as a Foreign
Language. http://www.ets.org/toefl/.
TOEIC. 2004. Test of English
for International Communication.
http://www.ets.org/toeic/.
