<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1016">
  <Title>Extending MT evaluation tools with translation complexity metrics</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Automated evaluation tools for MT systems aim at producing scores that are consistent with the results of human assessment of translation quality parameters, such as adequacy and fluency.</Paragraph>
    <Paragraph position="1"> Automated metrics such as BLEU (Papineni et al., 2002), RED (Akiba et al, 2001), Weighted N-gram model (WNM) (Babych, 2004), syntactic relation / semantic vector model (Rajman and Hartley, 2001) have been shown to correlate closely with scoring or ranking by different human evaluation parameters. Automated evaluation is much quicker and cheaper than human evaluation.</Paragraph>
    <Paragraph position="2"> Another advantage of the scores produced by automated MT evaluation tools is that intuitive human scores depend on the exact formulation of an evaluation task, on the granularity of the measuring scale and on the relative quality of the presented translation variants: human judges may adjust their evaluation scale in order to discriminate between slightly better and slightly worse variants - but only those variants which are present in the evaluation set. For example, absolute figures for a human evaluation of a set which includes MT output only are not directly comparable with the figures for another evaluation which might include MT plus a non-native human translation, or several human translations of different quality. Because of the instability of this intuitive scale, human evaluation figures should be treated as relative rather than absolute. They capture only a local picture within an evaluated set, but not the quality of the presented texts in a larger context. Although automated evaluation scores are always calibrated with respect to human evaluation results, only the relative performance of MT systems within one particular evaluation exercise provide meaningful information for such calibration.</Paragraph>
    <Paragraph position="3"> In this respect, automated MT evaluation scores have some added value: they rely on objective parameters in the evaluated texts, so their results are comparable across different evaluations.</Paragraph>
    <Paragraph position="4"> Furthermore, they are also comparable for different types of texts translated by the same MT system, which is not the case for human scores.</Paragraph>
    <Paragraph position="5"> For example, automated scores are capable of distinguishing improved MT performance on easier texts or degraded performance on harder texts, so the automated scores also give information on whether one collection of texts is easier or harder than the other for an MT system: the complexity of the evaluation task is directly reflected in the evaluation scores.</Paragraph>
    <Paragraph position="6"> However, there may be a need to avoid such sensitivity. MT developers and users are often more interested in scores that would be stable across different types of texts for the same MT system, i.e., would reliably characterise a system's performance irrespective of the material used for evaluation. Such characterisation is especially important for state-of-the-art commercial MT systems, which typically target a wide range of general-purpose text types and are not specifically tuned to any particular genre, like weather reports or aircraft maintenance manuals.</Paragraph>
    <Paragraph position="7"> The typical problem of having &amp;quot;task-dependent&amp;quot; evaluation scores (which change according to the complexity of the evaluated texts) is that the reported scores for different MT systems are not directly comparable. Since there is no standard collection of texts used for benchmarking all MT systems, it is not clear how a system that achieves, e.g., BLEUr4n4 1 score 0.556 tested on &amp;quot;490 utterances selected from the WSJ&amp;quot; (Cmejrek et al, 2003:89) may be compared to another system which achieves, e.g., the BLEUr1n4 score 0.240 tested on 10,150 sentences from the &amp;quot;Basic Travel Expression Corpus&amp;quot; (Imamura et al., 2003:161).</Paragraph>
    <Paragraph position="8"> Moreover, even if there is no comparison involved, there is a great degree of uncertainty in how to interpret the reported automated scores. For example, BLEUr2n4 0.3668 is the highest score for a top MT system if MT performance is measured on news reports, but it is a relatively poor score for a corpus of e-mails, and a score that is still beyond the state-of-the-art for a corpus of legal documents. These levels of perfection have to be established experimentally for each type of text, and there is no way of knowing whether some reported automated score is better or worse if a new type of text is involved in the evaluation.</Paragraph>
    <Paragraph position="9"> The need to use stable evaluation scores, normalised by the complexity of the evaluated task, has been recognised in other NLP areas, such as anaphora resolution, where the results may be relative with regard to a specific evaluation set. So &amp;quot;more absolute&amp;quot; figures are obtained if we use some measure which quantifies the complexity of anaphors to be resolved (Mitkov, 2002).</Paragraph>
    <Paragraph position="10"> MT evaluation is harder than evaluation of other NLP tasks, which makes it partially dependent on intuitive human judgements about text quality.</Paragraph>
    <Paragraph position="11"> However, automated tools are capable of capturing and representing the &amp;quot;absolute&amp;quot; level of performance for MT systems, and this level could then be projected into task-dependent figures for harder or easier texts. In this respect, there is another &amp;quot;added value&amp;quot; in using automated scores for MT evaluation.</Paragraph>
    <Paragraph position="12"> Stable evaluation scores could be achieved if a formal measure of a text's complexity for translation could be cheaply computed for a source text. Firstly, the score for translation complexity allows the user to predict &amp;quot;absolute&amp;quot; performance figures of an MT system on harder or easier texts, by computing the &amp;quot;absolute&amp;quot; evaluation figures and the complexity scores for just one type of text.</Paragraph>
    <Paragraph position="13"> Secondly, it lets the user compute &amp;quot;standardised&amp;quot; performance figures for an MT system that do not depend on the complexity of a text (they are reliably within some relatively small range for any type of evaluated texts).</Paragraph>
    <Paragraph position="14"> Designing such standardised evaluation scores requires choosing a point of reference for the complexity measure: e.g., one may choose an 1 BLEUrXnY means the BLEU score with produced with X reference translations and the maximum size of compared N-grams = Y.</Paragraph>
    <Paragraph position="15"> average complexity of texts usually translated by MT as the reference point. Then the absolute scores for harder or easier texts will be corrected to fit the region of absolute scores for texts of average complexity.</Paragraph>
    <Paragraph position="16"> In this paper we report on the results of an experiment in measuring the complexity of translation tasks using resource-light parameters such as the average number of syllables per word (ASW), which is also used for computing the readability of a text. On the basis of these parameters we compute normalised BLEU and WNM scores which are relatively stable across translations produced by the same general-purpose MT systems for texts of varying difficulty. We suggest that further testing and fine-tuning of the proposed approach on larger corpora of different text types and using additional source text parameters and normalisation techniques can give better prediction of translation complexity and increase the stability of the normalised MT evaluation scores.</Paragraph>
  </Section>
class="xml-element"></Paper>