
<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-4003">
  <Title>Example-based Rescoring of Statistical Machine Translation Output</Title>
  <Section position="6" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5 Evaluation
</SectionTitle>
    <Paragraph position="0"> The evaluation of our approach is carried out using a collection of Japanese sentences and their English translations that are commonly found in phrasebooks for tourists going abroad (Takezawa et al., 2002). The Basic Travel Expression Corpus (BTEC) contains 157K sentence pairs and the average lengths in words of Japanese and English sentences are 7.7 and 5.5, respectively. The corpus was split randomly into three parts for training (155K), parameter tuning (10K), and evaluation (10K) purposes.</Paragraph>
    <Paragraph position="1"> The experiments described below were carried out on 510 sentences selected randomly as the test set.</Paragraph>
    <Paragraph position="2"> For the evaluation, we used the following automatic scoring measures and human assessment.</Paragraph>
    <Paragraph position="3"> Word Error Rate (WER), which penalizes the edit distance against reference translations (Su et al., 1992) BLEU: the geometric mean of n-gram precision for the translation results found in reference translations (Papineni et al., 2002) Translation Accuracy (ACC): subjective evaluation ranks ranging from A to D (A: perfect, B: fair, C: acceptable and D: nonsense), judged blindly by a native speaker (Sumita et al., 1999) In contrast to WER, higher BLEU and ACC scores indicate better translations. For the automatic scoring measures we utilized up to 16 human reference translations.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Downgrading Effects During Decoding
</SectionTitle>
      <Paragraph position="0"> In order to get an idea about how much degradation is to be expected in the translation candidates modified by the statistical decoder, we conducted an experiment using the reference translations of the test set as the input of the example-based decoder. These seed sentences are already accurate translations, thus simulating the &amp;quot;optimal&amp;quot; translation example retrieval case resulting in an upper boundary of the statistical decoder performance.</Paragraph>
      <Paragraph position="1">  The results summarized in Table 1 show a large degradation (WER=25.5%, BLEU=0.744) in the reference translations when modified by the statistical decoder (TM LM). Only 66.0% of the decoder output are still perfect and 14.6% even result in unacceptable translations. The rescoring function TM LM EDP enables us to recover some of the decoder problems gaining 4.4% in accuracy compared to the statistical decoder. The best performance is achieved by the weight-based rescoring function TM LM EDW. However, around 10% of the selected translations are not yet perfect.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Baseline Comparison
</SectionTitle>
      <Paragraph position="0"> In the second experiment, we used two types of retrieval methods (tf idf-based, MT -based), as introduced in Section 2, and compared the results with the baseline system TM LM, i.e., the example-based decoding approach of (Watanabe and Sumita, 2003) using the tf idf criteria for the retrieval of translation examples and only the statistical scores for the selection of the translation.</Paragraph>
      <Paragraph position="1"> For the MT-based retrieval method we used eight machine translation systems for Japanese-to-English. Three of them were in-house EBMT systems which differ in the translation unit (sentence-based vs. phrase-based). They were trained on the same corpus as the statistical decoder.</Paragraph>
      <Paragraph position="2"> The remaining five systems were (off-the-shelf) general-purpose translation engines with quite different levels of performance (cf. Table 2).</Paragraph>
      <Paragraph position="3">  best when used in combination with the tf idf-based retrieval method, achieving around 80% translation accuracy. Moderate improvements of around 2% can be seen when the proposed rescoring functions are used together with the seed sentences obtained for the baseline system.</Paragraph>
      <Paragraph position="4"> However, the largest gain in performance is achieved when the decoder is applied to the output of multiple machine translation systems and the translation is selected using the weight-based rescoring function.</Paragraph>
      <Paragraph position="5">  and the TM LM EDW system. 67.5% of the translations are assigned to the same rank, out of which 29.2% of the translations are identical. TM LM EDW achieves higher grades for 27% of the sentences, whereas 5.5% of the baseline system translations are better. In total, the translation accuracy improved by 11.9% to 92.7%. Examples of differing translation ratings are given in Table 5.</Paragraph>
      <Paragraph position="6"> One of the reasons for the improved performance is  input: Zutsuu ga shimasu asupirin wa arimasu ka TM LM [D] aspirin do i have a headache TM LM EDW [A] i have a headache do you have any aspirin input: kore wa nani de dekiteimasu ka TM LM [C] what is this made TM LM EDW [A] what is this made of input: nanjikan no okure ni narimasu ka TM LM [B] how many hours are we behind schedule TM LM EDW [A] how many hours are we delayed input: watashi wa waruku arimasen TM LM [A] it 's not my fault TM LM EDW [B] I 'm not bad input: omedetou onnanoko ga umareta sou desu ne TM LM [A] i hear you had a baby girl congratulations TM LM EDW [C] congratulations i heard you were born a boy or a girl input: ima me o akete mo ii desu ka TM LM [A] is it all right to open my eyes now TM LM EDW [D] do you mind opening the eye  that the seed sentences obtained by the tf idf-based retrieval method are not translations of the input sentence. Moreover, the translations of the MT-based retrieval method cover a large variation of expressions due to different MT output styles, whereby the reduced quality of these seed sentences seems to be successfully compensated by the statistical models. In contrast, the translation examples retrieved by the tf idf-based method are quite similar to each other. Thus, local optimization might result in the same decoder output.</Paragraph>
      <Paragraph position="7"> In addition, the statistical decoder has the tendency to select shorter translations (4.8 words/sentence for TM LM and 5.5 words/sentence for TM LM EDW, which might indicate some problems in the utilized translation models as well as the language model.</Paragraph>
      <Paragraph position="8"> (Watanabe and Sumita, 2003) try to overcome these problems by skipping the decoding process of seed sentences whose tf idf-score indicates an exact match and output the obtained seed sentence instead. However, this shortcut method (WER=0.295, BLEU=0.641, ACC=0.898) is out-performed by the proposed rescoring method by 2.9% in translation accuracy, because our method takes advantage of translations successfully modified by the decoder and is able to identify and reject wrongly modified ones.</Paragraph>
      <Paragraph position="9"> Moreover, the rescoring function is language-independent and thus can be easily applied to other language-pairs as well.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML