<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1057">
  <Title>Tiejun Zhao +</Title>
  <Section position="3" start_page="21" end_page="40" type="intro">
    <SectionTitle>
2 Experiments and Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="21" end_page="21" type="sub_section">
      <SectionTitle>
2.1 Experimental Setup
</SectionTitle>
      <Paragraph position="0"> Our evaluation method is designed to help in developing the EBMT system. It is supposed to sort the translations by quality. Experiments show that it works well sorting the sentences by order of it's being good or bad translations. In order to justify the effectiveness of the evaluation method, we also design experiments to compare the automatic evaluation with human evaluation. The result shows good compatibility between the automatic and human evaluation results. Followed are details of the experimental setup and results.</Paragraph>
      <Paragraph position="1"> In order to evaluate the performance of our EBMT system, a sample from a bilingual corpus of Microsoft Software Manual is taken as the standard test set. Denote the source sentences in the test set as set S, and the target T. Sentences in S are fed into the EBMT system. We denote the output translation set as R. Every sentence ti in T is compared with the corresponding sentence ri in R. Evaluation results are got via the functions cosine(ti, ri), Dice(ti, ri), and normalized edit distance normal_editDistance(ti, ri). As discussed in the previous section, good translations tend to have higher values of cosine correlation, Dice coefficient and lower edit distance. After sorting the translations by these values, we will see clearly which sentences are translated with high quality and which are not.</Paragraph>
      <Paragraph position="2"> Knowledge engineers can obtain much help finding the weakness of the EBMT system.</Paragraph>
      <Paragraph position="3"> Some sample sentences and evaluation results are attached in the Appendix. In our experience, with Dice as example, the translations scored above 0.7 are fairly good translations with only some minor faults; those between 0.5 and 0.7 are faulty ones with some good points; while those scored under 0.4 are usually very bad translations. From these examples, we can see that the three criteria really help sorting the good translation from those bad ones. This greatly aids the developers to find out the key faults in sentence types and grammar points.</Paragraph>
    </Section>
    <Section position="2" start_page="21" end_page="40" type="sub_section">
      <SectionTitle>
2.2 Comparison with Human Evaluation
</SectionTitle>
      <Paragraph position="0"> In the above descriptions, we have presented our theoretical analysis and experimental results of our string similarity based evaluation method.</Paragraph>
      <Paragraph position="1"> The evaluation has gained the following achievements: 1) It helps distinguishing &amp;quot;good&amp;quot; translations from &amp;quot;bad&amp;quot; ones in developing the EBMT system; 2) The scores give us a clear view of the quality of the translations in localization based EBMT. In this section we will make a direct comparison between human evaluation and our automatic machine evaluation to test the effectiveness of the string similarity evaluation method. To tackle this problem, we carry out another experiment, in which human scoring of systems are compared with the machine scoring.</Paragraph>
      <Paragraph position="2"> The human scoring is carried out with a test suite of High School English. Six undergraduate students are asked to score the translations independent from each other. The average of their scoring is taken as human scoring result.</Paragraph>
      <Paragraph position="3"> The method is similar to ALPAC scoring system.</Paragraph>
      <Paragraph position="4"> We score the translations with a 6-point scale system. The best translations are scored 1. If it's not so perfect, with small errors, the translation gets a score of 0.8. If a fatal error occurs in the translation but it's still understandable, a point of 0.6 is scored. The worst translation gets 0  point of score. Table 1 shows the manual evaluation results for 6 general-purpose machine translation systems available to us. In table 1, Error5 means the worst translation. Error4 to Error1 are better when the numbering becomes smaller. A translation is labelled &amp;quot;Perfect&amp;quot; when it's a translation without any fault in it.</Paragraph>
      <Paragraph position="5"> &amp;quot;Good%&amp;quot; is the sum of percent of &amp;quot;Error1&amp;quot; and &amp;quot;Perfect&amp;quot;. Because &amp;quot;Error1&amp;quot; translations refer to those have small imperfections. &amp;quot;Score&amp;quot; is the weighted sum of scores of the 6 kinds of translations. E.g. for machine translation system MTS1, the score is calculated as follows:  =x+x+x +x+x+x=MTSscore In table 2, the human scorings and automatic scorings of the 6 machine translation systems are listed. The translations of system #1 are taken as standard for automatic evaluations, i.e. all scorings are made on the basis of the result of system #1. In principle this will introduce some errors, but we suppose it not so great as to invalidate the automatic evaluation result. This is also why the scorings of system #1 are 100.</Paragraph>
      <Paragraph position="6"> The last row labele AutoAver is the average of automatic evaluations.</Paragraph>
      <Paragraph position="7">  Figure 3 presents the scorings of Dice coefficient, cosine correlation, edit distance and the average of the three automatic criterions in a chart, we can clearly see the consistency among these parameters.</Paragraph>
      <Paragraph position="8">  In Figure 3, the numbers on X-axis are the numbering of machine translation systems, while the Y-axis denotes the evaluation scores.  The human and automatic average scoring is shown in Figure 4. The Automatic data refers to the average of Dice, cosine correlation and edit distance scorings. On the whole, human and automatic evaluations tend to present similar scores for a specific system, e.g. 78/74 for system #2, while 69/63 for system #3.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>