<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2095">
  <Title>Using comparable corpora to solve problems difficult for human translators</Title>
  <Section position="5" start_page="741" end_page="744" type="evalu">
    <SectionTitle>
3 Evaluation
</SectionTitle>
    <Paragraph position="0"> There are several attributes of our system which can be evaluated, and many of them are crucial for its efficient use in the workflow of professional translators, including: usability, quality of final solutions, trade-off between adequacy and fluency across usable examples, precision and recall of potentially relevant suggestions, as well as real-text evaluation, i.e. &amp;quot;What is the coverage of difficult translation problems typically found in a text that can be successfully tackled?&amp;quot; In this paper we focus on evaluating the quality of potentially relevant translation solutions, which is the central point for developing and calibrating our methodology. The evaluation experiment discussed below was specifically designed to assess the usefulness of translation suggestions generated by our tool - in cases where translators have doubts about the usefulness of dictionary solutions. In this paper we do not evaluate other equally important aspects of the system's functionality, which will be the matter of future research. null</Paragraph>
    <Section position="1" start_page="741" end_page="742" type="sub_section">
      <SectionTitle>
3.1 Set-up of the experiment
</SectionTitle>
      <Paragraph position="0"> For each translation direction we collected ten examples of possibly recalcitrant translation problems - words or phrases whose translation is not straightforward in a given context. Some of these examples were sent to us by translators in response to our request for difficult cases. For each example, which we included in the evaluation kit, the word or phrase either does not have a translation in ORD (which is a kind of a baseline standard reference for Russian translators), or its translation has significantly lower frequency in a target language corpus in comparison to the frequency of the source expression. If an MWE is not listed in available dictionaries, we produced compositional (word-for-word) translations using ORD. In order to remove a possible anti-dictionary bias from our experiment, we also checked translations in Multitran, an on-line translation dictionary, which was often quoted as one of the best resources for translation from and into Russian.</Paragraph>
      <Paragraph position="1"> For each translation problem five solutions were presented to translators for evaluation. One or two of these solutions were taken from a dictionary (usually from Multitran, and if available and different, from ORD). The other suggestions were manually selected from lists of possible solutions returned by ASSIST. Again, the criteria for selection were intuitive: we included those suggestions which made best sense in the given context. Dictionary suggestions and the output of ASSIST were indistinguishable in the questionnaires to the evaluators. The segments were presented in sentence context and translators had an option of providing their own solutions and comments. Table 2 shows one of the questions sent to evaluators. The problem example is a247a229a242a234a224a255 a239a240a238a227a240a224a236a236a224 ('precise programme'), which is presented in the context of a Russian sentence with the following (non-literal) translation This team should be put together by responsible politicians, who have a  clear strategy for resolving the current crisis. The third translation equivalent (clear programme) in the table is found in the Multitran dictionary (ORD offers no translation for a247a229a242a234a224a255 a239a240a238a227a240a224a236a236a224). The example was included because clear programme is much less frequent in English (2 examples in the BNC) in comparison to a247a229a242a234a224a255 a239a240a238a227a240a224a236a236a224 in Russian (70). Other translation equivalents in Table 2 are generated by ASSIST.</Paragraph>
      <Paragraph position="2"> We then asked professional translators affiliated to a translator's association (identity witheld at this stage) to rate these five potential equivalents using a five-point scale:  5 = The suggestion is an appropriate translation as it is.</Paragraph>
      <Paragraph position="3"> 4 = The suggestion can be used with some minor amendment (e.g. by turning a verb into a participle). null 3 = The suggestion is useful as a hint for another, appropriate translation (e.g. suggestion elated cannot be used, but its close synonym exhilarated can).</Paragraph>
      <Paragraph position="4"> 2 = The suggestion is not useful, even though it is  still in the same domain (e.g. fear is proposed for a problem referring to hatred).</Paragraph>
      <Paragraph position="6"> We received responses from eight translators.</Paragraph>
      <Paragraph position="7"> Some translators did not score all solutions, but there were at least four independent judgements for each of the 100 translation variants. An example of the combined answer sheet for all responses to the question from Table 2 is given in Table 3 (t1,  t2,. . . denote translators; the dictionary translation is clear programme).</Paragraph>
    </Section>
    <Section position="2" start_page="742" end_page="744" type="sub_section">
      <SectionTitle>
3.2 Interpretation of the results
</SectionTitle>
      <Paragraph position="0"> The results were surprising in so far as for the majority of problems translators preferred very different translation solutions and did not agree in their scores for the same solutions. For instance, concrete plan in Table 3 received the score 1 from translator t1 and 5 from t2.</Paragraph>
      <Paragraph position="1"> In general, the translators very often picked up on different opportunities presented by the suggestions from the lists, and most suggestions were equally legitimate ways of conveying the intended content, cf. the study of legitimate translation variation with respect to the BLEU score in (Babych and Hartley, 2004). In this respect it may be unfair to compute average scores for each potential solution, since for most interesting cases the scores do not fit into the normal distribution model. So averaging scores would mask the potential usability of really inventive solutions.</Paragraph>
      <Paragraph position="2"> In this case it is more reasonable to evaluate two sets of solutions - the one generated by ASSIST and the other found in dictionaries - but not each solution individually. In order to do that for each translation problem the best scores given by each translator in each of these two sets were selected. This way of generalising data characterises the general quality of suggestion sets, and exactly meets the needs of translators, who collectively get ideas from the presented sets rather than from individual examples. This also allows us to measure inter-evaluator agreement on the dictionary set and the ASSIST set, for instance, via computing the standard deviation s of absolute scores across evaluators (Table 3). This appeared to be a very informative measure for dictionary solutions.</Paragraph>
      <Paragraph position="3"> In particular, standard deviation scores for the dictionary set (threshold s = 0.5) clearly split  our 20 problems into two distinct groups: the first group below the threshold contains 8 examples, for which translators typically agree on the quality of dictionary solutions; and the second group above the threshold contains 12 examples, for which there is less agreement. Table 4 shows some examples from both groups and Table 5 presents average evaluation scores and standard deviation figures for both groups.</Paragraph>
      <Paragraph position="4"> Overall performance on all 20 examples is the same for the dictionary responses and for the system's responses: average of the mean top scores is about 4.2 and average standard deviation of the scores is 0.8 in both cases (for set-best responses). This shows that ASSIST can reach the level of performance of a combination of two authoritative dictionaries for MWEs, while for its own translation step it uses just a subset of one-word translation equivalents from ORD. However, there is another side to the evaluation experiment. In fact, we are less interested in the system's performance on all of these examples than on those examples for which there is greater disagreement among translators, i.e. where there is some degree of dissatisfaction with dictionary suggestions.</Paragraph>
      <Paragraph position="5">  Interestingly, dictionary scores for the agreement group are always higher than 4, which means that whenever translators agreed on the dictionary scores they were usually satisfied with the dictionary solution. But they never agreed on the inappropriateness of the dictionary: inappropriateness revealed itself in the form of low scores from some translators.</Paragraph>
      <Paragraph position="6"> This agreement/disagreement threshold can be said to characterise two types of translation problems: those for which there exist generally accepted dictionary solutions, and those for which translators doubt whether the solution is appropriate. Best-set scores for these two groups of dictionary solutions - the agreement and disagreement group - are plotted on the radar charts in Figures 1 and 2 respectively. The identifiers on the charts are problematic source language expressions as used in the questionnaire (not translation solutions to these problems, because a problem may have several solutions preferred by different judges). Scores for both translation directions are presented on the same chart, since both follow the same pattern and receive the same interpretation.</Paragraph>
      <Paragraph position="7"> Figure 1 shows that whenever there is little doubt about the quality of dictionary solutions, the radar chart approaches a circle shape near the edge of the chart. In Figure 2 the picture is different: the circle is disturbed, and some scores frequently approach the centre. Therefore the disagreement group contains those translation problems where dictionaries provide little help.</Paragraph>
      <Paragraph position="8"> The central problem in our evaluation experiment is whether ASSIST is helpful for problems in the second group, where translators doubt the quality of dictionary solutions.</Paragraph>
      <Paragraph position="9"> Firstly, it can be seen from the charts that judge- null ments on the quality of the system output are more consistent: score lines for system output are closer to the circle shape in Figure 1 than those for dictionary solutions in Figure 2 (formally: the standard deviation of evaluation scores, presented in Table 4, is lower).</Paragraph>
      <Paragraph position="10"> Secondly, as shown in Table 4, in this group average evaluation scores are slightly higher for ASSIST output than for dictionary solutions (3.97 vs 3.77) - in the eyes of human evaluators ASSIST outperforms good dictionaries. For good dictionary solutions ASSIST performance is slightly lower: (4.49 vs 4.81), but the standard deviation is about the same.</Paragraph>
      <Paragraph position="11"> Having said this, solutions from our system are really not in competition with dictionary solutions: they provide less literal translations, which often emerge in later stages of the translation task, when translators correct and improve an initial draft, where they have usually put more literal equivalents (Shveitser, 1988). It is a known fact in translation studies that non-literal solutions are harder to see and translators often find them only upon longer reflection. Yet another fact is that non-literal translations often require re-writing other segments of the sentence, which may not be obvious at first glance.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>