<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1016"> <Title>Extending MT evaluation tools with translation complexity metrics</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Set-up of the experiment </SectionTitle> <Paragraph position="0"> We compared the results of the human and automated evaluation of translations from French into English of three different types of texts which vary in size and style: an EU whitepaper on child and youth policy (120 sentences), a collection of 36 business and private e-mails and 100 news texts from the DARPA 94 MT evaluation corpus (White et al., 1994). The translations were produced by two leading commercial MT systems. Human evaluation results are available for all of the texts, with the exception of the news reports translated by System-2, which was not part of the DARPA 94 evaluation. However, the human evaluation scores were collected at different times under different experimental conditions using different formulations of the evaluation tasks, which leads to substantial differences between human scores across different evaluations, even if the evaluations were done at the same time.</Paragraph> <Paragraph position="1"> Further, we produced two sets of automated scores: BLEUr1n4, which have a high correlation with human scores for fluency, and WNM Recall, which strongly correlate with human scores for adequacy. These scores were produced under the same experimental conditions, but they uniformly differ for both evaluated systems: BLEU and WNM scores were relatively higher for e-mails and relatively low for the whitepaper, with the news texts coming in between. We interpreted these differences as reflecting the relative complexity of texts for translation.</Paragraph> <Paragraph position="2"> For the French originals of all three sets of texts we computed resource-light parameters used in standard readability measures (Flesch Reading Ease score or Flesch-Kincaid Grade Level score), i.e. average sentence length (ASL - the number of words divided by the number of sentences) and average number of syllables per word (ASW - the number of syllables divided by the number of words).</Paragraph> <Paragraph position="3"> We computed Pearson's correlation coefficient r between the automated MT evaluation scores and each of the two readability parameters. Differences in the ASL parameter were not strongly linked to the differences in automated scores, but for the ASW parameter a strong negative correlation was found.</Paragraph> <Paragraph position="4"> Finally, we computed normalised (&quot;absolute&quot;) BLEU and WNM scores using the automated evaluation results for the DARPA news texts (the medium complexity texts) as a reference point. We compared the stability of these scores with the stability of the standard automated scores by computing standard deviations for the different types of text. The absolute automated scores can be computed on any type of text and they will indicate what score is achievable if the same MT system runs on DARPA news reports. The normalised scores allow the user to make comparisons between different MT systems evaluated on different texts at different times. 
</Section> <Section position="4" start_page="0" end_page="5" type="metho"> <SectionTitle> 3 Results of human evaluations </SectionTitle> <Paragraph position="0"> The human evaluation results were produced under different experimental conditions. The output of the compared systems was evaluated each time within a different evaluation set, in some cases together with different MT systems, or with native or non-native human translations. As a result, human evaluation scores are not comparable across different evaluations.</Paragraph> <Paragraph position="1"> The human scores available from the DARPA 94 MT corpus of news reports were the result of a comparison of five MT systems (one of which was a statistical MT system) and a professional (&quot;expert&quot;) human translation. For our experiment we used the DARPA scores for adequacy and fluency for one of the participating systems.</Paragraph> <Paragraph position="2"> We obtained human scores for translations of the whitepaper and the e-mails from one of our MT evaluation projects at the University of Leeds. This had involved the evaluation of French-to-English versions of two leading commercial MT systems - System 1 and System 2 - in order to assess the quality of their output and to determine whether updating the system dictionaries brought about an improvement in performance. (An earlier version of System 1 also participated in the DARPA evaluation.) Although the human evaluations of both texts were carried out at the same time, the experimental set-up was different in each case.</Paragraph> <Paragraph position="3"> The evaluation of the whitepaper for adequacy was performed by 20 postgraduate students who knew little or no French. A professional human translation of each segment was available to the judges as a gold-standard reference. Using a five-point scale in each case, judgments on adequacy were solicited by means of the following question: &quot;For each segment, read carefully the reference text on the left. Then judge how much of the same content you can find in the candidate text.&quot; Five independent judgments were collected for each segment.</Paragraph> <Paragraph position="4"> The whitepaper fluency evaluation was performed by 8 postgraduate students and 16 business users under similar experimental conditions, with the exception that the gold-standard reference text was not available to the judges. The following question was asked: &quot;Look carefully at each segment of text and give each one a score according to how much you think the text reads like fluent English written by a native speaker.&quot; For the e-mails a different quality evaluation parameter was used: 26 human judges (business users) evaluated the usability (or utility) of the translations. We also included translations produced by a non-professional, French-speaking translator in the evaluation set for the e-mails. (This was intended to simulate a situation where, in the absence of MT, the author of the e-mail would have to write in a foreign language, here English; we anticipated that the quality would be judged lower than that of professional, native-speaker translations.) The non-native translations were dispersed anonymously in the data set and so were also judged.
The following question was asked: &quot;Using each reference e-mail on the left, rate the three alternative versions on the right according to how usable you consider them to be for getting business done.&quot; Figure 1 and Table 1 summarise the human evaluation scores for the two compared MT systems. The judges had scored versions of the e-mails (&quot;em&quot;) and the whitepaper (&quot;wp&quot;) produced both before and after dictionary update (&quot;DA&quot;), although no judge saw the before and after variants of the same text. (The scores for the DARPA news texts are converted from a [0, 1] to a [0, 5] scale.)</Paragraph> <Paragraph position="5"> It can be inferred from the data that human evaluation scores do not allow any meaningful comparison outside a particular evaluation experiment, and must therefore be interpreted as relative rather than absolute.</Paragraph> <Paragraph position="6"> We can see that dictionary update consistently improves the performance of both systems and that System 1 is slightly better than System 2 in all cases, although after dictionary update System 2 is capable of reaching the baseline quality of System 1. However, the usability scores for the supposedly easier texts (the e-mails) are considerably lower than the adequacy scores for the harder texts (the whitepaper), although the experimental set-up for adequacy and usability is very similar: both used a gold-standard human reference translation. We suggest that the presence of a higher-quality translation done by a human non-native speaker of the target language &quot;over-shadowed&quot; the lower-quality MT output, which dragged down the evaluation scores for e-mail usability. No such higher-quality translation was present in the evaluation set for whitepaper adequacy, so those scores were higher.</Paragraph> <Paragraph position="7"> Therefore, no meaning can be given to any absolute value of the evaluation scores across different experiments involving intuitive human judgements. Only a relative comparison of evaluation scores produced within the same experiment is possible.</Paragraph> </Section> <Section position="5" start_page="5" end_page="5" type="metho"> <SectionTitle> 4 Results of automated evaluations </SectionTitle> <Paragraph position="0"> Automated evaluation scores use objective parameters, such as the number of N-gram matches in the evaluated text and in a gold-standard reference translation. Therefore, these scores are more consistent and comparable across different evaluation experiments. The comparison of the scores indicates the relative complexity of the texts for translation. For the output of both MT systems under consideration we generated two sets of automated evaluation scores: BLEUr1n4 and WNM Recall.</Paragraph> <Paragraph position="1"> BLEU computes the modified precision of N-gram matches between the evaluated text and a professional human reference translation. It was found to produce automated scores which strongly correlate with human judgements about translation fluency (Papineni et al., 2002).</Paragraph> <Paragraph position="2"> WNM is an extension of BLEU which weights each matched term by its salience within a given text. As compared to BLEU, the WNM recall-based evaluation score was found to produce a higher correlation with human judgements about adequacy (Babych, 2004). The salience weights are similar to standard tf.idf scores and are computed as follows:</Paragraph> <Paragraph position="3"> S(i,j) = log ( ( (Pdoc(i,j) - Pcorp-doc(i)) x (N - dfi) / N ) / Pcorp(i) ), where:</Paragraph> <Paragraph position="4"> - Pdoc(i,j) is the relative frequency of the word wi in the text j (&quot;Relative frequency&quot; is the number of tokens of this word-type divided by the total number of tokens);</Paragraph> <Paragraph position="5"> - Pcorp-doc(i) is the relative frequency of the same word wi in the rest of the corpus, without this text; - dfi is the number of documents in the corpus where the word wi occurs; - N is the total number of documents in the corpus; - Pcorp(i) is the relative frequency of the word wi in the whole corpus, including this particular text.</Paragraph>
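The following is a sketch of how the salience weight defined above might be computed over a small tokenised corpus. It is illustrative rather than the authors' implementation: the toy documents are invented, and skipping words whose log argument would be non-positive (words no more frequent in this document than elsewhere, or occurring in every document) is an assumption not specified by the formula.

```python
import math
from collections import Counter

def salience_weights(corpus_docs, doc_index):
    """Sketch of the S-score above for every word of one document.

    corpus_docs: list of documents, each a list of lowercased tokens.
    Returns a dict mapping word -> S(i, j) for document j = doc_index.
    """
    doc = corpus_docs[doc_index]
    rest = [tok for k, d in enumerate(corpus_docs) if k != doc_index for tok in d]
    all_tokens = [tok for d in corpus_docs for tok in d]

    doc_counts = Counter(doc)
    rest_counts = Counter(rest)
    corp_counts = Counter(all_tokens)
    n_docs = len(corpus_docs)

    scores = {}
    for word, count in doc_counts.items():
        p_doc = count / len(doc)                                   # Pdoc(i,j)
        p_corp_doc = rest_counts[word] / len(rest) if rest else 0.0  # Pcorp-doc(i)
        p_corp = corp_counts[word] / len(all_tokens)               # Pcorp(i)
        df = sum(1 for d in corpus_docs if word in d)              # dfi
        numerator = (p_doc - p_corp_doc) * (n_docs - df) / n_docs
        # The log is only defined for a positive argument; other words are
        # skipped here (an assumption, not part of the formula above).
        if numerator > 0:
            scores[word] = math.log(numerator / p_corp)
    return scores

# Hypothetical three-document toy corpus.
docs = [
    "the commission adopted the youth policy paper".split(),
    "please send the invoice by e-mail tomorrow".split(),
    "the match report was published in the newspaper".split(),
]
print(salience_weights(docs, 0))
```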
<Paragraph position="6"> Figures 2 and 3 and Table 2 summarise the automated evaluation scores for the two MT systems.</Paragraph> <Paragraph position="7"> It can be seen from the charts that the automated scores change consistently according to the type of the evaluated text: for both evaluated systems, BLEU and WNM are lowest for the whitepaper texts, which emerge as the most complex to translate; the news reports are in the middle; and the highest scores are given to the e-mails, which appear to be relatively easy. A similar tendency also holds for the systems after dictionary update. However, strictly speaking the compared systems are then no longer the same, because the dictionary update was done individually for each system, so the quality of the update is an additional factor in a system's performance, alongside the complexity of the translated texts.</Paragraph> <Paragraph position="8"> The complexity of the translation task is thus integrated into the automated MT evaluation scores, but for texts of the same type the scores are directly comparable. For example, for the DARPA news texts, the newly generated BLEU and WNM scores confirm the observation made on the basis of the comparison of the whitepaper and the e-mail texts, namely that S1 produces higher translation quality than S2, although there is no human evaluation experiment in which these translations are directly compared.</Paragraph> <Paragraph position="9"> Thus the automated MT evaluation scores derive both from the &quot;absolute&quot; output quality of an evaluated general-purpose MT system and from the complexity of the translated text.</Paragraph> </Section> <Section position="6" start_page="5" end_page="5" type="metho"> <SectionTitle> 5 Readability parameters </SectionTitle> <Paragraph position="0"> In order to isolate the &quot;absolute&quot; MT quality and to filter out the contribution of the complexity of the evaluated text from the automated scores, we need to find a formal parameter of translation complexity. This parameter should preferably be resource-light, so that it can easily be computed for any source text in any language submitted to an MT system.</Paragraph> <Paragraph position="1"> Since automated scores already integrate the translation complexity of the evaluated text, we can validate such a parameter by its correlation with automated MT evaluation scores computed on a set that includes different text types.</Paragraph> <Paragraph position="2"> In our experiment, we examined the following resource-light parameters for their correlation with both automated scores: - the Flesch Reading Ease score, which rates text on a 100-point scale according to how easy it is to understand and is computed as: 206.835 - (1.015 x ASL) - (84.6 x ASW), where ASL is the average sentence length (the number of words divided by the number of sentences) and ASW is the average number of syllables per word (the number of syllables divided by the number of words); - the Flesch-Kincaid Grade Level score, which rates texts on US grade-school level and is computed as: (0.39 x ASL) + (11.8 x ASW) - 15.59; - each of the ASL and ASW parameters individually.</Paragraph>
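A minimal sketch of the two readability formulas listed above, using invented ASL and ASW values; the helper names are illustrative.

```python
def flesch_reading_ease(asl, asw):
    # Flesch Reading Ease: higher values mean easier text (roughly a 0-100 scale).
    return 206.835 - (1.015 * asl) - (84.6 * asw)

def flesch_kincaid_grade(asl, asw):
    # Flesch-Kincaid Grade Level: approximate US school grade.
    return (0.39 * asl) + (11.8 * asw) - 15.59

# Invented values for a source text (not taken from Table 3).
asl, asw = 20.0, 1.6
print(flesch_reading_ease(asl, asw))   # about 51.2
print(flesch_kincaid_grade(asl, asw))  # about 11.1
```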
<Paragraph position="7"> Table 3 presents the averaged readability parameters for all the French original texts used in our evaluation experiment, together with the r correlation between these parameters and the corresponding automated evaluation scores. A strong negative correlation exists between ASW (average number of syllables per word) and the automated evaluation scores. Therefore the ASW parameter can be used to normalise MT evaluation scores. This also suggests that translation complexity is highly dependent on the complexity of the lexicon, which is approximated by the ASW parameter.</Paragraph> <Paragraph position="8"> The other parameter used to compute readability - ASL (average sentence length in words) - has a much weaker influence on the quality of MT. This may be due to the fact that local context is in many cases sufficient to produce an accurate translation, while the use of global sentence structure in MT analysis is limited.</Paragraph> </Section> <Section position="7" start_page="5" end_page="5" type="metho"> <SectionTitle> 6 Normalised evaluation scores </SectionTitle> <Paragraph position="0"> We used the ASW parameter to normalise the automated evaluation scores in order to obtain absolute figures for MT performance, in which the influence of translation complexity is neutralised.</Paragraph> <Paragraph position="1"> Normalisation requires choosing some reference point - some average level of translation complexity - to which all other scores for the same MT system will be scaled. We suggest using the difficulty of the news texts in the DARPA 94 MT evaluation corpus as one such &quot;absolute&quot; reference point. Normalised figures obtained on other types of texts then mean that, if the same general-purpose MT system were run on the DARPA news texts, it would produce raw BLEU or WNM scores approximately equal to the normalised scores. This allows users to make a fairer comparison between MT systems evaluated on different types of texts.</Paragraph> <Paragraph position="2"> We found that for the WNM scores the best normalisation is achieved by multiplying the score by the complexity normalisation coefficient C, which is the ratio: C = ASW(evaluated text) / ASW(DARPA news).</Paragraph> <Paragraph position="3"> For BLEU the best normalisation is achieved by multiplying the score by C^2, the squared value of this ratio.</Paragraph> <Paragraph position="4"> Normalisation makes the evaluation relatively stable - in general, the scores for the same system are the same up to the first rounded decimal point. Table 4 summarises the normalised automated scores for the evaluated systems.</Paragraph> <Paragraph position="5"> The accuracy of the normalisation can be measured by the standard deviation of the normalised scores across texts of different types. We also measured the improvement in stability of the normalised scores as compared to the stability of the raw scores generated on different text types. Standard deviation was computed using the standard formula s = sqrt( sum((xi - mean)^2) / (n - 1) ), where mean is the average score and n is the number of text types. It can be seen from the table that the standard deviation of the normalised BLEU scores across different text types is 3.3 times smaller, and the standard deviation of the normalised WNM scores is 2.25 times smaller, than for the corresponding raw scores. So the normalised scores are much more stable than the raw scores across different evaluated text types.</Paragraph>
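A short sketch of the normalisation step described above. The function name and the numeric values are illustrative only; the scaling itself (WNM multiplied by C, BLEU by C squared) follows the description in this section.

```python
def normalise_scores(bleu, wnm, asw_eval_text, asw_darpa_news):
    """Normalise raw scores to the DARPA news reference point.

    C is the complexity normalisation coefficient; WNM is scaled by C
    and BLEU by C squared.
    """
    c = asw_eval_text / asw_darpa_news
    return bleu * c ** 2, wnm * c

# Invented example: an "easy" text (lower ASW than the DARPA news texts)
# gets its scores scaled down towards what the same system would score
# on the DARPA news reports.
norm_bleu, norm_wnm = normalise_scores(bleu=0.33, wnm=0.45,
                                       asw_eval_text=1.45, asw_darpa_news=1.60)
print(round(norm_bleu, 3), round(norm_wnm, 3))  # 0.271 0.408
```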
</Section> </Paper>