Evaluation Metrics for Knowledge-Based Machine Translation
Eric H. Nyberg, 3rd
Teruko Mitamura
Jaime G. Carbonell
Center for Machine Translation
Carnegie Mellon University 
Pittsburgh, PA 15213 
Topical Paper: machine translation 
Abstract 
A methodology is presented for component-based machine
translation (MT) evaluation through causal error analysis to
complement existing global evaluation methods. This methodology
is particularly appropriate for knowledge-based machine
translation (KBMT) systems. After a discussion of MT evaluation
criteria and the particular evaluation metrics proposed for
KBMT, we apply this methodology to a large-scale application of
the KANT machine translation system, and present some sample
results.
1 Introduction 
Machine Translation (MT) is considered the paradigm task of
Natural Language Processing (NLP) by some researchers because it
combines almost all NLP research areas: syntactic parsing,
semantic disambiguation, knowledge representation, language
generation, lexical acquisition, and morphological analysis and
synthesis. However, the evaluation methodologies for MT systems
have heretofore centered on black-box approaches, where global
properties of the system are evaluated, such as semantic fidelity
of the translation or comprehensibility of the target language
output. There is a long tradition of such black-box MT
evaluations (Van Slype, 1979; Nagao, 1985; JEIDA, 1989; Wilks,
1991), to the point that Yorick Wilks has stated: "MT Evaluation
is better understood than MT" (Carbonell & Wilks, 1991). While
these evaluations are extremely important, they should be
augmented with detailed error analyses and with component
evaluations in order to produce causal analyses pinpointing
errors and therefore leading to system improvement. In essence,
we advocate both causal component analyses and global behavioral
analyses, preferably with the latter consistent with the former
via composition of the component analyses.
The advent of Knowledge Based Machine Translation (KBMT)
facilitates component evaluation and error attribution because of
its modular nature, though this observation by no means excludes
transfer-based systems from similar analyses. After reviewing the
reasons and criteria for MT evaluation, this paper describes a
specific evaluation methodology and its application to the KANT
system, developed at CMU's Center for Machine Translation
(Mitamura, et al., 1991). The KANT KBMT architecture is
particularly well-suited for detailed evaluation because of its
relative simplicity compared to other KBMT systems, and because
it has been scaled up to industrial-sized applications.
2 Reasons for Evaluation
Machine Translation is evaluated for a number of different
reasons, and when possible these should be kept clear and
separate, as different types of evaluation are best suited to
measure different aspects of an MT system. Let us review the
reasons why MT systems may be evaluated:
• Comparison with Humans. It is useful to establish a global
comparison with human-quality translation as a function of task.
For general-purpose accurate translation, most MT systems have a
long way to go. A behavioral black-box evaluation is appropriate
here.
• Decision to use or buy a particular MT system. This evaluation
is task dependent, and must take both quality of translation and
economics into account (e.g., cost of purchase and of adapting
the MT system to the task, vs. human translator cost).
Behavioral black-box evaluations are appropriate here too.
• Comparison of multiple MT systems. The comparison may be to
evaluate research progress, as in the ARPA MT evaluations, or to
determine which system should be considered for purchase and use.
If the systems employ radically different MT paradigms, such as
EBMT and KBMT, only black-box evaluations are meaningful, but if
they employ similar methods, then both forms of evaluation are
appropriate. It can be very informative to determine which system
has the better parser, or which is able to perform certain
difficult disambiguations better, and so on, with an eye towards
future synthesis of the best ideas from different systems. The
speech-recognition community has benefited from such comparisons.
• Tracking technological progress. In order to determine how a
system evolves over time it is very useful to know which
components are improving and which are not, as well as their
contribution to overall MT performance. Moreover, a
phenomena-based evaluation is useful here: Which previously
problematic linguistic phenomena are being handled better and by
having improved which module or knowledge source? This is exactly
the kind of information that other MT researchers would find
extremely valuable to improve their own systems, much more so
than a relatively empty global statement such as: "KANT is doing
5% better this month."
• Improvement of a particular system. Here is where component
analysis and error attribution are most valuable. System
engineers and linguistic knowledge source maintainers (such as
lexicographers) perform best when given a causal analysis of each
error. Hence module-by-module performance metrics are key, as
well as an analysis of how each potentially problematic
linguistic phenomenon is handled by each module.
Different communities will benefit from different evalua- 
tions. For instance, the MT user community (actual or poten- 
tial) will benefit most from global black-box evaluations, as 
their reasons are most clearly aligned with the first three items 
above. The funding community (e.g., EEC, ARPA, MITI), 
wants to improve the technological infrastructure and deter- 
mine which approaches work best. Thus, their interests are 
most clearly aligned with the third and fourth reasons above, 
and consequently with both global and component evalua- 
tions. The system developers and researchers need to know 
where to focus their efforts in order to improve system per- 
formance, and thus are most interested in the last two items: 
the causal error analysis and component evaluation both for 
their own systems and for those of their colleagues. In the 
latter case, researchers learn both from blame-assignment in
error analysis of their own systems, as well as from successes
of specific mechanisms tested by their colleagues, leading to 
importation and extension of specific ideas and methods that 
have worked well elsewhere. 
3 MT Evaluation Criteria 
There are three major criteria that we use to evaluate the
performance of a KBMT system: Completeness, Correctness,
and Stylistics.
3.1 Completeness 
A system is complete if it assigns some output string to every 
input string it is given to translate. There are three types of 
completeness which must be considered: 
• Lexical Completeness. A system is lexically complete if it has
source and target language lexicon entries for every word or
phrase in the translation domain.
• Grammatical Completeness. A system is grammatically complete if
it can analyze all of the grammatical structures encountered in
the source language, and it can generate all of the grammatical
structures necessary in the target language translation. Note
that the notion of "grammatical structure" may be extended to
include constructions like SGML tagging conventions, etc. found
in technical documentation.
• Mapping Rule Completeness. A system is complete with respect to
mapping rules if it assigns an output structure to every input
structure in the translation domain, regardless of whether this
mapping is direct or via an interlingua. This implies
completeness of either transfer rules in transfer systems or the
semantic interpretation rules and structure selection rules in
interlingua systems.
3.2 Correctness 
A system is correct if it assigns a correct output string to every 
input string it is given to translate. There are three types of 
correctness to consider: 
• Lexical Correctness. Each of the words selected in the 
target sentence is correctly chosen for the concept that it 
is intended to realize. 
• Syntactic Correctness. The grammatical structure of each target
sentence should be completely correct (no grammatical errors).
• Semantic Correctness. Semantic correctness presupposes lexical
correctness, but also requires that the compositional meaning of
each target sentence should be equivalent to the meaning of the
source sentence.
3.3 Stylistics 
A correct output text must be meaning invariant and
understandable. System evaluation may go beyond correctness and
test additional, interrelated stylistic factors:
• Syntactic Style. An output sentence may contain a gram- 
matical structure which is correct, but less appropriate for 
the context than another structure which was not chosen. 
• Lexical Appropriateness. Each of the words chosen is not only a
correct choice but the most appropriate choice for the context.
• Usage Appropriateness. The most conventional or natural
expression should be chosen, whether technical nomenclature or
common figures of speech.
• Other. Formality, level of difficulty of the text, and other
such parameters should be preserved in the translation or
appropriately selected when absent from the source.
4 KBMT Evaluation Criteria and Correctness Metrics
In order to evaluate an interlingual KBMT system, we define the
following KBMT evaluation criteria, which are based on the
general criteria discussed in the previous section:
• Analysis Coverage (AC). The percentage of test sentences for
which the analysis module produces an interlingua expression.
• Analysis Correctness (AA). The percentage of the interlinguas
produced which are complete and correct representations of the
meaning of the input sentence.
• Generation Coverage (GC). The percentage of complete and
correct interlingua expressions for which the generation module
produces a target language sentence.
• Generation Correctness (GA). The percentage of target language
sentences which are complete and correct realizations of the
given complete and correct interlingua expression.
More precise definitions of these four quantities, as well as
weighted versions thereof, are presented in Figure 1.¹

Criterion              Formula
No. Sentences          S
No. Sent. w/IL         S_{IL}
No. Comp./Corr. IL     S_{IL-CC}
Analysis Coverage      AC = S_{IL} / S
Analysis Accuracy      AA = S_{IL-CC} / S_{IL}
IL Error               IL_i
Weighted AA            WAA = 1 - \sum_i W_i (S_{IL_i}) / S_{IL}
No. TL Produced        S_{TL}
No. Correct TL         S_{TLC}
No. Fluent TL          S_{TLF}
Generation Coverage    GC = S_{TL} / S_{IL-CC}
Generation Accuracy    GA = S_{TLC} / S_{TL}
TL Corr. Error         TL_i
TL Fluency Error       TLF_i
Weighted GA            WGA = 1 - \sum_i W_i (S_{TL_i}) / S_{TL}
Generation Fluency     FA = S_{TLF} / S_{TLC}
Weighted FA            WFA = 1 - \sum_i W_i (S_{TLF_i}) / S_{TLC}
Figure 1: Definitions and Formulas for Calculating Strict
and Error-Weighted Evaluation Measures in Analysis and
Generation Components

¹ An additional quantity shown in Figure 1 is the fluency of the
target language generation (FA), which will not be discussed
further in this paper.

Given these four basic quantities, we can define translation
correctness as follows:
• Translation Correctness (TA). This is the percentage of the
input sentences for which the system produces a complete and
correct output sentence, and can be calculated by multiplying
together Analysis Coverage, Analysis Correctness, Generation
Coverage, and Generation Correctness:
$$TA = AC \times AA \times GC \times GA \qquad (1)$$
For example, consider a test scenario where 100 sentences are
given as input; 90 sentences produce interlinguas; 85 of the
interlinguas are correct; for 82 of these interlinguas the system
produces French output; and 80 of those output sentences are
correct. Then
$$TA = \frac{90}{100} \times \frac{85}{90} \times \frac{82}{85} \times \frac{80}{82} \qquad (2)$$
$$= .90 \times .94 \times .96 \times .98 = .80$$
Of course, we can easily calculate TA overall if we know the
number of input sentences and the number of correct output
sentences for a given test suite, but often modules are tested
separately and it is useful to combine the analysis and
generation figures in this way. It is also important to note that
even if each module in the system introduces only a small error,
the cumulative effect can be very substantial.
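The composition of the four module-level quantities into TA can be sketched in a few lines of Python. This is a minimal illustration rather than part of the KANT tooling: the `ModuleCounts` class, its field names and the `component_metrics` function are assumptions introduced here, and the counts are those of the 100-sentence example.

```python
from dataclasses import dataclass


@dataclass
class ModuleCounts:
    """Raw sentence counts from one test run (hypothetical container)."""
    sentences: int              # S: input sentences
    interlinguas: int           # S_IL: sentences that produced an interlingua
    correct_interlinguas: int   # S_IL-CC: complete and correct interlinguas
    targets: int                # S_TL: target sentences produced
    correct_targets: int        # S_TLC: complete and correct target sentences


def component_metrics(c: ModuleCounts) -> dict:
    """Return AC, AA, GC, GA and their product TA as fractions in [0, 1]."""
    ac = c.interlinguas / c.sentences
    aa = c.correct_interlinguas / c.interlinguas
    gc = c.targets / c.correct_interlinguas
    ga = c.correct_targets / c.targets
    return {"AC": ac, "AA": aa, "GC": gc, "GA": ga, "TA": ac * aa * gc * ga}


metrics = component_metrics(ModuleCounts(100, 90, 85, 82, 80))
for name, value in metrics.items():
    print(f"{name} = {value:.2f}")
```

Because each denominator is the previous numerator, the product reduces to 80/100. The same multiplication shows the cumulative effect of small errors: four modules that are each 95% accurate would give roughly 0.95^4, or about 0.81, overall.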
All interlingua-based systems contain separate analysis and
generation modules, and therefore all can be subjected to the
style of evaluation presented in this paper. Some systems,
however, further modularize the translation process. KANT, for
example, has two sequential analysis modules (source text to
syntactic f-structures; f-structures to interlingua) (Mitamura,
et al., 1991). Hence the evaluation could be conducted at a
finer-grained level. Of course, for transfer-based systems the
modular decomposition is analysis, transfer and generation
modules, and for example-based MT (Nagao, 1984) the modules are
the matcher and the modifier. Appropriate metrics for
completeness and correctness can be defined for each MT paradigm
based on its modular decomposition.
5 Preliminary Evaluation of KANT 
In order to test a particular application of the KANT system, we
identify a set of test suites which meet certain criteria:
• Grammar Test Suite. This test suite contains sentences which
exemplify all of the grammatical constructions allowed in the
controlled input text, and is intended to test whether the system
can translate all of them.
• Domain Lexicon Test Suite. This test suite contains texts which
exemplify all the ways in which general domain terms (especially
verbs) are used in different contexts. It is intended to test
whether the system can translate all of the usage variants for
general domain terms.
• Preselected Input Texts. These test suites contain texts from
different parts of the domain (e.g., different types of manuals
for different products), selected in advance. These are intended
to demonstrate that the system can translate well in all parts of
the customer domain.
• Randomly Selected Input Texts. These test suites consist of
texts that are selected randomly by the evaluator, and which have
not been used to test the system before. These are intended to
illustrate how well the system will do on text it has not seen
before, which gives the best completeness-in-context measure.
The first three types of test suite are employed for regression
testing as the system evolves, whereas the latter type is
generated anew for each major evaluation. During development,
each successive version of the system is tested on the available
test data to produce aggregate figures for AC, AA, GC, and GA.
5.1 Coverage Testing
The coverage results (AC and GC) are calculated automatically by
a program which counts output structures during analysis and
generation. During evaluation, the translation system is split
into two halves: Source-to-Interlingua and Interlingua-to-Target.
For a given text, this allows us to automatically count how many
sentences produced interlinguas, thus deriving AC. This also
allows us to automatically count how many interlinguas produced
output sentences, thus deriving GC.
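The automatic counting step can be sketched as follows. This is only an illustration under stated assumptions: the `coverage` function and the list-of-optional-strings representation (with `None` standing for "no output structure produced") are hypothetical and are not taken from the KANT counting program.

```python
from typing import Optional, Sequence


def coverage(outputs: Sequence[Optional[str]]) -> float:
    """Fraction of inputs for which a module produced some output structure."""
    if not outputs:
        raise ValueError("no inputs were processed")
    produced = sum(1 for out in outputs if out is not None)
    return produced / len(outputs)


# Hypothetical run: one entry per source sentence, None where analysis failed.
interlinguas = ["il-1", None, "il-3", "il-4"]
ac = coverage(interlinguas)

# One entry per interlingua passed on to generation, None where it failed.
targets = ["tl-1", "tl-3", None]
gc = coverage(targets)
```

Which interlinguas are passed to generation (all of them, or only those judged complete and correct, as the definition of GC requires) is a choice the sketch leaves to the caller.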
5.2 Correctness Testing
The correctness results (AA and GA) are calculated for a given
text by a process of human evaluation. This requires the effort
of a human evaluator who is skilled in the source language,
target language, and translation domain. We have developed a
method for calculating the correctness of the output which
involves the following steps:
1. The text to be evaluated is translated, and the input and
output sentences are aligned in a separate file for evaluation.
2. A scoring program presents each translation to the evaluator.
Each translation is assigned a score from the following set of
possibilities:
• C (Correct). The output sentence is completely correct; it
preserves the meaning of the input sentence completely, is
understandable without difficulty, and does not violate any rules
of grammar.
• I (Incorrect). The output sentence is incomplete (or empty), or
not easily understandable.
• A (Acceptable). The sentence is complete and easily
understandable, but is not completely grammatical or violates
some SGML tagging convention.
3. The score for the whole text is calculated by tallying the
different scores. The overall correctness of the translation is
stated in terms of a range between the strictly correct (C) and
the acceptable (C + A) (cf. Figure 2).²
² In the general case, one may assign a specific error
coefficient to each error type, and multiply that coefficient by
the number of sentences exhibiting the error. The summation of
these products across all the errorful sentences is then used to
produce a weighted error rate. This level of detail has not yet
proven to be necessary in current KANT evaluation. See Figure 1
for examples of formulas weighted by error.
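The tallying step and the weighted error rate described in the footnote can be sketched as below. The function names, the error-type labels and the coefficient values are hypothetical and chosen only for illustration; the weighted form follows the pattern 1 - sum(W_i * S_i) / S used for the weighted measures in Figure 1.

```python
from collections import Counter


def correctness_range(scores):
    """Return (strict, acceptable) correctness for a list of C/A/I scores."""
    tally = Counter(scores)
    total = len(scores)
    strict = tally["C"] / total
    acceptable = (tally["C"] + tally["A"]) / total
    return strict, acceptable


def weighted_accuracy(error_counts, weights, total):
    """1 minus the weighted error rate: sum of W_i * S_i over total sentences."""
    penalty = sum(weights[err] * n for err, n in error_counts.items())
    return 1 - penalty / total


# A four-sentence text with three "I" scores and one "A" score.
strict, acceptable = correctness_range(["I", "I", "I", "A"])

# Hypothetical coefficients and counts, for illustration only.
wa = weighted_accuracy({"LEX": 2, "ORD": 1}, {"LEX": 1.0, "ORD": 0.5}, total=10)
```

With these inputs the text scores between 0% (strict) and 25% (acceptable), and the weighted accuracy is 1 - 2.5/10 = 0.75.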
5.3 Causal Component Analysis 
The scoring program used to present translations for evaluation
also displays intermediate data structures (syntactic parse,
interlingua, etc.) if the evaluator wishes to perform component
analysis in tandem with correctness evaluation. In this case, the
evaluator may assign different machine-readable error codes to
each sentence, indicating the location of the error and its type,
along with any comments that are appropriate. The
machine-readable error codes allow all of the scored output to be
sorted and forwarded to maintainers of different modules, while
the unrestricted comments capture more detailed information.
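Sorting scored output by module can be sketched as follows. The record layout (dictionaries with `id`, `score`, `errors` and `comment` keys) and the `route_by_module` function are assumptions made for this illustration; only the module-plus-type shape of the error codes comes from the paper.

```python
from collections import defaultdict


def route_by_module(scored_sentences):
    """Group error reports by the module code named in each error pair."""
    queues = defaultdict(list)
    for item in scored_sentences:
        for module, code in item["errors"]:
            queues[module].append((item["id"], code, item.get("comment", "")))
    return dict(queues)


# Hypothetical scored output; codes follow the ":MODULE :TYPE" pattern.
scored = [
    {"id": 1, "score": "I", "errors": [(":GEN", ":ORD")]},
    {"id": 2, "score": "I", "errors": [(":MAP", ":LEX")], "comment": "wrong verb"},
    {"id": 3, "score": "I", "errors": [(":INT", ":IR"), (":MAP", ":SNM")]},
]
queues = route_by_module(scored)
```

Each maintainer then receives only the entries filed under their own module code, and a sentence with errors in two modules appears in both queues.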
For example, in Figure 2, Sentence 2 is marked with the error
codes (:MAP :LEX), indicating that the error is the selection of
an incorrect target lexeme (ouvrez), occurring in the Target
Language Mapper.³ It is interesting to note that our evaluation
method will assign a correctness score of 0% (strictly correct)
to 25% (acceptable) to this small text, since no sentences are
marked with "C" and only one sentence is marked with "A".
However, if we use the metric of "counting the percentage of
words translated correctly" this text would score much higher
(37/44, or 84%). A sample set of error codes used for KANT
evaluation is shown in Figure 3.
1. "Do not heat above the following temperature:"
"Ne réchauffez pas la température suivante au-dessus:"
Score: I ; Error: :GEN :ORD
2. "Cut the bolt to a length of 203.2 mm."
"Ouvrez le boulon à une longueur de 203,2 mm."
Score: I ; Error: :MAP :LEX
3. "Typical location of the 3F0025 Bolts, which must be
used on the 826C Compactors:"
"Position typique des boulons 3F0025 sur les
compacteurs:"
Score: I ; Error: :INT :IR; :MAP :SNM
4. "Use spacers (2) evenly on both sides to eliminate
side movement of the frame assembly."
"Employez les entretoises (2) sur les deux côtés
pour éliminer jeu latéral de l'ensemble de bâti
uniformément."
Score: A ; Error: :MAP :ORD
Figure 2: Sample Excerpt from Scoring Sheet
5.4 Current Results
The process described above is performed for each of the test 
suites used to evaluate the system. Then, an aggregate table is 
produced which derives AC, AA, GC, and GA for the system 
over all the test suites. 
At the time of this writing, we are in the process of completing
a large-scale English-to-French application of KANT in the domain
of heavy equipment documentation. We have used the process
detailed in this section to evaluate the system on a bi-weekly
basis during development, using a randomly-selected set of texts
each time. An example containing aggregate results for a set of
17 randomly-selected texts is shown in Figure 4.
In the strict case, a correct sentence receives a value of 1
and a sentence containing any error receives a value of zero.
³ For brevity, the sample excerpt does not show the intermediate
data structures that the evaluator would have examined to make
this decision.
Module  Code  Comment
:PAR    :LEX  Source lexicon, word missing/incorrect
        :GRA  Ungrammatical sentence accepted,
              Grammatical sentence not accepted
:INT    :SNI  F-structure slot not interpreted
        :FNI  F-structure feature not interpreted
        :IR   Incorrect interlingua representation
:MAP    :LEX  Target lexicon, word missing/incorrect
        :SNM  Semantic role not mapped
        :FNM  Semantic feature not mapped
:GEN    :GRA  Ungrammatical sentence produced
        :ORD  Incorrect constituent ordering

:PAR  Syntactic Parser
:INT  Semantic Interpreter
:MAP  Target Language Mapper
:GEN  Target Language Generator
Figure 3: Sample Error Codes Used in KANT Evaluation
NAME      S    S_{TL}  S_{TLC}     GA      TA
Result 1  608  546     467-491     86-90%  77-81%
Result 2  608  546     467-519.46  86-95%  77-85%
Figure 4: KANT Evaluation Results, 17 Randomly-Selected Texts,
4/21/94
In the weighted case, a sentence containing an error receives
a partial score which is equal to the percentage of
correctly-translated words. When the weighted method is used, the
percentages are considerably higher. For both Result 1 and
Result 2, the number of correct target language sentences
(given as S_{TLC}) is shown as ranging between completely
correct (C) and acceptable (C + A).
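The difference between strict and weighted scoring can be sketched as follows. The function names and the per-sentence word counts are hypothetical, introduced only for illustration; the paper does not specify how the word-level judgments are recorded.

```python
def sentence_value(correct_words, total_words, weighted):
    """Score one sentence: 1 if fully correct, else 0 (strict) or a word fraction."""
    if correct_words == total_words:
        return 1.0
    return correct_words / total_words if weighted else 0.0


def text_score(sentences, weighted):
    """Average sentence value over a text; sentences are (correct, total) pairs."""
    values = [sentence_value(c, t, weighted) for c, t in sentences]
    return sum(values) / len(values)


# Hypothetical word counts for a three-sentence text.
sample = [(8, 8), (6, 8), (9, 12)]
strict = text_score(sample, weighted=False)
weighted = text_score(sample, weighted=True)
```

Here only the first sentence counts under strict scoring (1/3), while the weighted method credits the partially correct sentences with 0.75 each, giving 2.5/3, which is why weighted percentages come out considerably higher.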
We are still working to improve both coverage and accuracy of the
heavy-equipment KANT application. These numbers should not be
taken as the upper bound for KANT accuracy, since we are still in
the process of improving the system. Nevertheless, our ongoing
evaluation results are useful, both to illustrate the evaluation
methodology and also to focus the effort of the system developers
in increasing accuracy.
6 Discussion 
Our ongoing evaluation of the first large-scale KANT application
has benefitted from the detailed error analysis presented here.
Following the tabulation of error codes produced during causal
component analysis, we can attribute the majority of the
completeness problems to identifiable gaps in lexical coverage,
and the majority of the accuracy problems to areas of the domain
model which are known to be incomplete or insufficiently general.
On the other hand, the grammars of both source and target
language, as well as the software modules, are relatively solid,
as very few errors can be attributed thereto. As lexical coverage
and domain model generalization reach completion, the component
and global evaluation of the KANT system will become a more
accurate reflection of the potential of the underlying technology
in large-scale applications.
As illustrated in Figure 5, traditional transfer-based MT systems
start with general coverage, and gradually seek to improve
accuracy and later fluency. In contrast, the KBMT philosophy has
been to start with high accuracy and gradually improve coverage
and fluency. In the KANT system, we combine both approaches by
starting with coverage of a large specific domain and achieving
high accuracy and fluency within that domain.

[Figure 5 diagram: axes Fluency, Coverage and Accuracy, each
running to 100%. KBMT: start with high accuracy; improve coverage
and fluency. Traditional MT: start with high coverage; improve
accuracy and fluency.]
Figure 5: Longitudinal Improvement in Coverage, Accuracy and
Fluency
The evaluation methodology developed here is meant to be used in
conjunction with global black-box evaluation methods, independent
of the course of development. The component evaluations are meant
to provide insight for the system developers, and to identify
problematic phenomena prior to system completion and delivery. In
particular, the method presented here can combine component
evaluation and global evaluation to support efficient system
testing and maintenance beyond development.
7 Acknowledgements
We would like to thank Radha Rao, Todd Kaufmann, and all of our
colleagues on the KANT project, including James Altucher, Kathy
Baker, Alex Franz, Mildred Galarza, Sue Holm, Kathi Iannamico,
Pam Jordan, Kevin Keck, Marion Kee, Sarah Law, John Leavitt,
Daniela Lonsdale, Deryle Lonsdale, Jeanne Mier, Venkatesh
Narayan, Amalio Nieto, and Will Walker. We would also like to
thank our sponsors at Caterpillar, Inc. and our colleagues at
Carnegie Group, Inc.
References 
[1] Carbonell, J., Mitamura, T., and E. Nyberg (1993).
"Evaluating KBMT in the Large," Japan-US Workshop on
Machine-Aided Translation, Nov. 22-24, Washington, D.C.
[2] Carbonell, J. and Y. Wilks (1991). "Machine Translation: An
In-Depth Tutorial," 29th Annual Meeting of the Association for
Computational Linguistics, University of California, Berkeley,
CA, June 18-21.
[3] Goodman and Nirenburg, eds. (1991). A Case Study in
Knowledge-Based Machine Translation, San Mateo, CA: Morgan
Kaufmann.
[4] Isahara, Sin-nou, Yamabana, Moriguchi and Nomura (1993).
"JEIDA's Proposed Method for Evaluating Machine Translation
(Translation Quality)," Proceedings of SIGNLP 93-NL-96, July.
[5] Japan Electronic Industry Development Association, A Japanese
View of Machine Translation in Light of the Considerations and
Recommendations Reported by ALPAC, U.S.A., JEIDA Machine
Translation System Research Committee, Tokyo.
[6] King, M. (1993). "Panel on Evaluation: MT Summit IV.
Introduction," Proceedings of MT Summit IV, July 20-22, Kobe,
Japan.
[7] Mitamura, T., E. Nyberg and J. Carbonell (1991). "An
Efficient Interlingua Translation System for Multi-lingual
Document Production," Proceedings of Machine Translation Summit
III, Washington, DC, July 2-4.
[8] Nagao, M. (1984). "A Framework of a Mechanical Translation
between Japanese and English by Analogy Principle," Artificial
and Human Intelligence, Elithorn, A. and Banerji, R. (eds.),
Elsevier Science Publishers, B.V. 1984.
[9] Nagao, M. (1985). "Evaluation of the Quality of
Machine-Translated Sentences and the Control of Language,"
Journal of Information Processing Society of Japan, 26(10):
1197-1202.
[10] Nakaiwa, Morimoto, Matsudaira, Narita and Nomura (1993).
"JEIDA's Proposed Method for Evaluating Machine Translation
(Developer's Guidelines)," Proceedings of SIGNLP 93-NL-96, July.
[11] Nomura, H. (1993). "Evaluation Method of Machine
Translation: from the Viewpoint of Natural Language Processing,"
Proceedings of MT Summit IV, July 20-22, Kobe, Japan.
[12] Nyberg, E. and T. Mitamura (1992). "The KANT System: Fast,
Accurate, High-Quality Translation in Practical Domains,"
Proceedings of COLING 1992, Nantes, France, July.
[13] Rinsche, Adriane (1993). "Towards a MT Evaluation
Methodology," Proceedings of the Fifth International Conference
on Theoretical and Methodological Issues in Machine Translation,
July 14-16, Kyoto, Japan.
[14] Rolling, L. (1993). "Panel Contribution on MT Evaluation,"
Proceedings of MT Summit IV, July 20-22, Kobe, Japan.
[15] Takayama, Itoh, Yagisawa, Mogi and Nomura (1993). "JEIDA's
Proposed Method for Evaluating Machine Translation (End User
System Selection)," Proceedings of SIGNLP 93-NL-96, July.
[16] Van Slype, G. (1979). "Evaluation of the 1978 Version of the
SYSTRAN English-French Automatic System of the Commission of the
European Communities," The Incorporated Linguist 18:86-89.
[17] Vasconcellos, M. (1993). "Panel Discussion: Evaluation
Method of Machine Translation," Proceedings of MT Summit IV,
July 20-22, Kobe, Japan.
[18] Wilks, Y. (1991). "SYSTRAN: It Obviously Works, but How Much
Can it be Improved?," Technical Report MCCS-91-215, Computing
Research Laboratory, New Mexico State University, Las Cruces.
