<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-1013">
  <Title>Evaluation Metrics for Knowledge-Based Machine Translation</Title>
  <Section position="3" start_page="0" end_page="95" type="metho">
    <SectionTitle>
2 Reasons for Evaluation
</SectionTitle>
    <Paragraph position="0"> Machine Translation is evaluated for a number of different reqsons, and when possihle these should be kept clear and separate, as diflerent types of ev,'duation are best suited to measure different aspects of an MT system, l.et ns review the reasons wily MT systems may be evaluated: * Com/)arison with llumans. It is useltd to establish a global comparison with hurmm-qu:.dity translation as a function of task. For general-ptnl)OSe accurate trallslation, most MT systelns have a long way to go. A behavioral black-box evahmtion is appropriate here.</Paragraph>
    <Paragraph position="1"> (r) Decision to use or buy a particular MT syMet~.t. This evahmliou is task dependent, aud nmst take both quality of trallslation as well as economics inR) accf)nllt (e.g.</Paragraph>
    <Paragraph position="2"> cost of purchase and of adapting the MT system to the task, vs. hum:in translator cost). Behavioral black-box evaluations arc appropriate here too.</Paragraph>
    <Paragraph position="3"> ,, Comparison of multiple MT' systems. The compariso~l may be to evahmte research progress ;is iu the ARPA MT evahmtions, or to determine which system should be considered for Imrchase and use. If the systems eml)loy radically different MT paradigms, such ;is EBMT and KP, MT, only 1)lack-box evahmtions are meaningful, but if they employ similar methods, then I)oth forms of evaluation tire appropriate. It can he very informative to determine which system has the better parser, or which is able to perform certain difficult (lisaml)iguatkms helter, atRl SO O11, wi 1\[1 ;Ill eye towards futt,re synthesis of the best ideas l,onl differeut systems. The Sl~CeCh-recognilion cmnmunily has benelited from such comparisons.</Paragraph>
    <Paragraph position="4"> * Trackit,g technological progress. In order to determine how a system evolves over time it is very useful lO know which components ,'ue improving and which are not, as well tls their contribution Io overall MT 1)erformance.</Paragraph>
    <Paragraph position="5"> Moreover, a phenomena-based evaluation is useful here: Which l)reviously problematic linguistic phenomena are being handled better and by having improved which module or knowledge source? This is exactly the kind of information that other MT researchers would find extremely valu,:thle to improve their own systems - much more so than a relalively empty glohal statement such as: &amp;quot;KANT is doing 5% better this month.&amp;quot; ,, Improvement of a particular system. Ilere is where COlnponent an,'llysis and error attribution are most vahlable. Systcul engineers and! linguistic knowledge source nlainiamers (such tls lexicographers) perforni hest when  given a causal analysis of each error, lleuce moduleby-module performance metrics ,are key, as well as an analysis of how each potentially problematic linguistic phenomenon is handled by each module.</Paragraph>
    <Paragraph position="6"> Different communities will benefit from different evaluations. For instance, the MT user community (actual or potential) will benefit most from global black-box evaluations, as their reasons are most clearly aligned with the first three items above. The funding community (e.g., EEC, ARPA, MITI), wants to improve the technological infrastructure and determine which approaches work best. Thus, their interests are most clearly aligned with the third and fourth reasons above, and consequently with both global and component evaluations. The system developers and researchers need to know where to focus their efforts in order to improve system performance, and thus are most interested in the last two items: the causal error analysis and component evaluation both for their own systems and for those of their colleagues. In the latter case, researchers learn both from blame-assigmnent in error analysis of their own systems, as well as fiom successes of specific mechanisms tested by their colleagues, leading to importation and extension of specific ideas and methods that have worked well elsewhere.</Paragraph>
  </Section>
  <Section position="4" start_page="95" end_page="95" type="metho">
    <SectionTitle>
3 MT Evaluation Criteria
</SectionTitle>
    <Paragraph position="0"> There are three major criteria that we use to evaluate tile performance ofa KBMT system: Completeness, Correctness, and Stylistics.</Paragraph>
    <Section position="1" start_page="95" end_page="95" type="sub_section">
      <SectionTitle>
3.1 Completeness
</SectionTitle>
      <Paragraph position="0"> A system is complete if it assigns some output string to every input string it is given to translate. There are three types of completeness which must be considered: * Lexical Completeness. A system is lexieally complete if it has source and target language lexicon entries for every word or phrase in the translation domain.</Paragraph>
      <Paragraph position="1"> ,, Grammatical Completeness. A system is grammatically complete if it can analyze of the grammatical structures encountered in the source language, and it can generate all of the grammatical structures necessary in the target language translation. Note that the notion of &amp;quot;grammatical structure&amp;quot; may be extended to include constructions like SGML tagging conventions, etc. found in technical documentation.</Paragraph>
      <Paragraph position="2"> * Mapping Rule Completeness. A system is complete with respect to mapping rules if it assigns an output structure to every input structure in the translation domain, regardless of whether this mapping is direct or via an interlingua. This implies completeness of either transfer rules in transfer systems or tile semantic inteq)retation rules and structure selection rules in interlingtta systems.</Paragraph>
    </Section>
    <Section position="2" start_page="95" end_page="95" type="sub_section">
      <SectionTitle>
3.2 Correctness
</SectionTitle>
      <Paragraph position="0"> A system is correct if it assigns a correct output string to every input string it is given to translate. There are three types of correctness to consider:  * Lexical Correctness. Each of the words selected in the target sentence is correctly chosen for the concept that it is intended to realize.</Paragraph>
      <Paragraph position="1"> * Syntactic Correctness. The grammatical structure of each target sentence should be completely correct (no grammatical errors); * Setnanlic Correctness. Senlanlic correctness presup null poses lexical correctness, but also requires that the cornpositional meaning of each target sentence should be equivalent to tile meaning of the source sentence.</Paragraph>
    </Section>
    <Section position="3" start_page="95" end_page="95" type="sub_section">
      <SectionTitle>
3.3 Stylistics
</SectionTitle>
      <Paragraph position="0"> A correct OUtpUt text must be ineaning invariall\[ and untlerstandable. System evahmtion may go beyond correctness and test additional, interrelated stylistic factors: * Syntactic Style. An output sentence may contain a grammatical structure which is correct, but less appropriate for the context than another structure which was not chosen.</Paragraph>
      <Paragraph position="1"> * Lexical Appropriateness. Each of the words chosen is not only a correct choice but tile most appropriate choice for the context.</Paragraph>
      <Paragraph position="2"> ,, Usage Appropriateness. The most conventional or natural expression should be chosen, whether technical nomenclature or comlnou figures of speech.</Paragraph>
      <Paragraph position="3"> * Oilier. l:orm'41ity, level of difficulty of the text, and othe,' snch parameters shotlJd be preserved in the translation or appropriately selected when absent from the source.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="95" end_page="97" type="metho">
    <SectionTitle>
4 KBMT Evaluation Criteria and Correctness
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="95" end_page="96" type="sub_section">
      <SectionTitle>
Metrics
</SectionTitle>
      <Paragraph position="0"> In order to evahmte an inlerlingnal KBMT system, we define the following KBMT evahmtion criteria, which are based on the general criteria discussed in the previous section: * Analysis Coverage (AC). Tile percentage of test sentences for which tile analysis module produces all inter-lingua expression.</Paragraph>
      <Paragraph position="1"> * Analysis Correctness (AA). &amp;quot;File percentage of the interlinguas produced which are complete and correct represenlatious of the meaning of tile input sentence.</Paragraph>
      <Paragraph position="2"> * GenerationCoverage(GC).Thepercentageofcoml)lete and correct iuterlingna expressions R}r which the generation module produces a target language sentence.</Paragraph>
      <Paragraph position="3"> * Generation Correctness (GA). The percentage of target language senlences which are complete and correct realizations of the given complete and correct interlingua expression.</Paragraph>
      <Paragraph position="4"> More precise deliuitions of these Rnu quantities, as well as weighted ve,sions thereof, are preseuted ill Figure 11.</Paragraph>
      <Paragraph position="5"> Given these four basic quantities, we can define translation corrccmess as follows: * Translation Correctness (TA). This is tile percentage of the input sentences for which the system produces a complete and correct ot,tput sentence, and call be c,'ltculated by mt,ltiplying together Analysis Coverage, Analysis Correctness, Generatiou Coverage, and Generation Correctness: TA = ACx AA x GC x (,'A (I) For example, consider a test scenario where 100 sentences are given .'Is input; 90 sentences produce interliuguas; 85 of tile interlinguas are correct; for 82 of these IAn additional quantity shown i!n Figure 1 is the fluency of the target hmguage generation (leA), which will not be discussed further in this paper.</Paragraph>
      <Paragraph position="6">  interlingnas tile system produces French otutpt~t; ,'lnd 80 of those culprit sentences fire correct. Then</Paragraph>
      <Paragraph position="8"> Of course, we can easily calctlltlte TA ovcii.lll if we know tile number of input sentences arid the numl)er el corrk'ct output sentences for a given test suite, but often ntodules are tested separately and it is usclul to comhine the analysis and generation ligures in this way. It is also important to note that even if each module in tile system introduces only a small error, the cuutuhttive effect can be very substantial.</Paragraph>
      <Paragraph position="9"> All interlingua-based systems contain separate analysis and generation modules, aud therefore all can be subjected to the style of evalnation preseuted in this paper. Some systems, however, fttrthcr modularize the trausl.'ttion process. KANT, for example, has two SeXluential analysis modules (source text to syntactic f-structures; f-structures to interlingua) (Mitamnra, et al., 1991). Ilence tile evahtation could be conducted at a finer-grained level. Of course, for transfer-based systems the modular decomposition is analysis, transfer and gorierat;on moclules, and for example-based MT (Nagao, 1984) modnles are the tnatcher and the modifier. APl~ropriate metties for completeness and correctness can be detined for each  MT paradigm hated on its modular decomposition.</Paragraph>
      <Paragraph position="10"> 5 Preliminary Evaluation of KANT In order to test a partictdar application of tile KANT system, we identify a set of test suites which meet certain criteria: * Grammar Test Suite. This test suite contains senteuces which exemplify all of the grammatical constructions allowed in the controlled input text, anti is inttended to test whether 1he system can trauslate all of them, * Domain Lexicon Test Suite. This test suite ctmtai~ts texts which exemplify all the ways in which general domaiut te,ms (especially verbs) are used in different corttexts. It is intended to test whether the systent can translate ;ill of the usage variants for general domaill ISills.</Paragraph>
      <Paragraph position="11"> * Preselected hJput Texts. These test suites cont,'tin lexts  from different parts of the domain (e.g., different types of nlanmtls for different pmducls), selecled in advance.</Paragraph>
      <Paragraph position="12"> These are intended to demonstrate that the system can transl;tte well in all parts of tile ct~stomer domain.</Paragraph>
      <Paragraph position="13"> ,, &amp;mdomly Selet:tcd Ilq)ttl Texts. These test suites tire comprised of texts that are selected randomly by the evaluator, and which have not been used to lest the system before. These ztre inteuded to illustrate how well the system will do on text it has not seeu before, which gives the l)esl cnmpleteness-in-context measure.</Paragraph>
      <Paragraph position="14"> The first three types of test suite fire employed for regression testing as the system evolves, whereas tile latter type is ~generated anew for each major evaluation, l)uring development, each successive version of the system is tested on the available test data to prodt ce ~ gg egate lil?ures for AC, AA, (;(2, and (CA.</Paragraph>
    </Section>
    <Section position="2" start_page="96" end_page="96" type="sub_section">
      <SectionTitle>
5.1 Coverage Testing
</SectionTitle>
      <Paragraph position="0"> The coverage rcsults (AC aucl GC) are ealct,lated atltomat;tally by a program which cotmts output structt,res during analysis and generation. During evaluatiou, the translation system is split into two halves: SotLrce-to-lnterlingua anti Interliulgua-to-'lhrget. l:or ,I j;ivt;u text, this allows us to ,'ltllomatically count how many sellteuces l)rOduccd inlerlingttas, thus deriving AC. This also allows t,s to automatically count how ilia.lily iuterlingtias prodtlce(I otttput sentences, thtzs tie.rivitlg (;C.</Paragraph>
    </Section>
    <Section position="3" start_page="96" end_page="97" type="sub_section">
      <SectionTitle>
5.2 Correctness Testing
</SectionTitle>
      <Paragraph position="0"> The correctness results (AA anti (;A) are calcuhtted l'of ,'l given text by a process of hunlan evaluation. Tiffs requires tile effort of a humau evah~ator who is skilled in lhe source language, target lauguage&gt; ,'ttld translation domain. We have developed a method for calculating the correctness of the OUtl)Ut which involves tile following steps:  1. The text to be evaluated is translated, and the input and outi)ut Senlences are aligned ill a sop:irate lilt for evaluatiolt. null 2. A scoring program presenls each translation to the oval null uator, l{ach transl,&lt;ltimt is assigned a score frorfl tile following sot of l)ossihilities: * C (Ct/rrt!cI). The OUtllul sentence is COml)letely correct; it preserves the liieailiug of llie iUl)tlt seritenco conipletcly, is understandal)le without diflieillty, a~itl does liot violtlte any rules of gran/m;ir. * 1 (Incorrect). The C//tllpUt seutencc is inconipletc (or einpty), or not easily undcrsi;iudable.</Paragraph>
      <Paragraph position="1">  * A (Accq/table). The sentence is complete ,'utd easily ullclerslaltdablo, I)tlt is IlOt COmliletoly gramm,'ltical or violates some ~q(iMl. lagging convention. 3. The score lor the whole text is calculated by tallying the  different scores. TIle overall correctlleSS of the translatioli is staled in terms of a range between the strictly correct (C) aud the acceptahle (C + A) (cf. Figure 2) 2. 2111 tile gerieral case, one I y ssigll a specific em)r coeflicient to each citer type, and multiply that coeflicient I)y lhe nunlber of selltel/ces exhibiting the error. The StilnlllatiOll of these products across all the erroiful sellLences is then used to lm~duce a we;pilled error rate. Tilts level of detail llas not yet proven lo be necessary in current KANTewiluatioi~..qee Figure 1 I~r exainplesoflorlnulas weighted by elror.</Paragraph>
    </Section>
    <Section position="4" start_page="97" end_page="97" type="sub_section">
      <SectionTitle>
5.3 Causal Component Analysis
</SectionTitle>
      <Paragraph position="0"> The scoring program used to present translations for evaluation also displays intermediate data structures (syntactic parse, interlingua, etc.) if the evahmtor wishes to perform component analysis in tandem with correctness evaluation.</Paragraph>
      <Paragraph position="1"> ht this case, the evaluator may assign different machine-readable error codes to each sentence, indicating the It)cation of the error and its type, along with any comments that are appropriate. The machine-readable error codes allow all of the scored output to be sorted and forwarded to maintainers of different modules, while the unrestricted comntents capture more detailed information.</Paragraph>
      <Paragraph position="2"> For example, in figure 2, Sentence 2 is marked with the error codes ( :NAP : SEX), indicating that tile error is the selection of an incorrect target lexeme (ouvrez), occurring in the q,uget Language Mapper 3. It is interesting to note that our evaluation method will assign a correctness score of 0% (strictly correct) 25% (acceptable) to this small text, since no sentences are marked with &amp;quot;C&amp;quot; and only one sentences is markexl with &amp;quot;A&amp;quot;. However, if we use the metric of&amp;quot;counting the percentage of words translated correctly&amp;quot; this text would score much higher (37/44, or 84%). A sample set of error codes used for KANT evahmtion is shown in Figure 3.</Paragraph>
      <Paragraph position="3">  1. &amp;quot;Do not heat above the following temaperature:&amp;quot; &amp;quot;Ne rdchauffez pas la tempdrature st, ivante au-dessus:&amp;quot; Score: I ; Error: :GEN :ORD 2. &amp;quot;Cut the bolt to a length of 203.2 ,'am.&amp;quot; &amp;quot;Ouvrez le boulon fi une longueur de 203,2 nam.&amp;quot; Score: 1 ; Error: :MAP :LEX 3. &amp;quot;Typical location of the 3F0025 Bolts, which must be used on the 826C Compactors:&amp;quot; &amp;quot;Position typique des boulons 319025 sur les compacteurs:&amp;quot; Score: I ; Error: :INT :IR; :MAP :SNM 4. &amp;quot;Use spacers (2) evenly on both sides to eliminate side movement of the frame assembly.&amp;quot;</Paragraph>
    </Section>
    <Section position="5" start_page="97" end_page="97" type="sub_section">
      <SectionTitle>
5.4 Current Results
</SectionTitle>
      <Paragraph position="0"> The process described above is performed for each of the test suites used to evaluate the system. Then, an aggregate table is produced which derives AC, AA, GC, and GA for the system over all the test suites.</Paragraph>
      <Paragraph position="1"> At the time of this writing, we arc in the process or completing a large-scale English-to-French application of KANT in the domain of heavy equipment documentation. We have . used the process detailed in this section to evaluate tile system on a bi-wcckly basis during developmcnt, using a randomly-selected sct of texts each time. An example containing ,qggregate results for a set of 17 randomly-selected texts is shown in Figure 4.</Paragraph>
      <Paragraph position="2"> In the strict case, a correct sentence rcccivcs a vahle of l and a scntence containing any error receives a value of zero.</Paragraph>
      <Paragraph position="3">  In tile weighted case, a sentence containing an error receives a partial score which is equal to the percentage of correctlytranslated words. When the weighted method is used, the percentages are considerably higher. For both Result 1 and Result 2, the nt, maber of correct target language sentences (given as .5&amp;quot;vrc) is shown as ranging between comapletely correct (C) and acceptable (C + A).</Paragraph>
      <Paragraph position="4"> We are still working to improve both coverage and accaracy of the heavy-equipment KANT application. These numbers should ,tot be taken as the upper bound for KANT accuracy, since we are still in tile l)roccss of i,nproving the system. Nevertheless, our ongoing evahmtion results are useful, both to illustrate the evaluation methodology and also to focus the effort of the system dcvelol)ers in increasing accur:lcy.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>