<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1047">
  <Title>Using a Mixture of N-Best Lists from Multiple MT Systems in Rank-Sum-Based Confidence Measure for MT Outputs</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The Existing RSCM
</SectionTitle>
    <Paragraph position="0"> The existing confidence measures include the rank-sum-based confidence measure (RSCM) for SMT systems (Ueffing et al., 2003). The basic idea of this RSCM is to roughly approximate the word posterior probability by using the ranks of MT outputs in the N-best list of an SMT system. That is, the ranks of the probabilities of MT outputs in the N-best list are used instead of the probabilities themselves, as in non-parametric statistical tests.</Paragraph>
    <Paragraph position="1"> Hereafter, $\hat{e}_1^I$ and $w_1^{I_n}$ denote the top output and the n-th best output in the N-best list, respectively, and $\hat{e}_i$ denotes the i-th word in the top MT output $\hat{e}_1^I$. $L_i(\hat{e}_1^I, w_1^{I_n})$ denotes the Levenshtein alignment (Levenshtein, 1966) of $\hat{e}_i$ on the n-th best output $w_1^{I_n}$ according to the top output $\hat{e}_1^I$. The existing RSCM of the word $\hat{e}_i$ is the sum of the ranks of the MT outputs in the N-best list that contain the word $\hat{e}_i$ in a position aligned to i in the Levenshtein alignment, normalized by the total rank sum:</Paragraph>
    <Paragraph position="3"> Here, $\delta(x, y)$ is an indicator function: if the words/morphemes x and y are the same, $\delta(x, y) = 1$; otherwise, $\delta(x, y) = 0$. Thus, the rank weight of the MT output $w_1^{I_n}$, $N - n$, is summed only when $\hat{e}_i$ and $L_i(\hat{e}_1^I, w_1^{I_n})$ are the same. In the calculation of $C_{rank}$, $N - n$ is summed instead of the rank n so that outputs near the top of the N-best list contribute more to the score $C_{rank}$.</Paragraph>
    <Paragraph position="4">  i, in the calculation of edit distance from the top MT output $\hat{e}_1^I$ to the n-th best output $w_1^{I_n}$.</Paragraph>
    <Paragraph position="5"> In this paper, the calculation of $C_{rank}$ is slightly modified to sum $N - n + 1$, so that the total summation is equal to $N(N + 1)/2$. Moreover, when several MT outputs have the same score, they are assigned their average rank, as is standard practice in non-parametric statistical tests.</Paragraph>
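    <Paragraph> The modified rank-sum calculation can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `levenshtein_alignment` and `rank_sum_confidence` are hypothetical, outputs are assumed to be pre-tokenized word lists with the top output first, one optimal alignment is chosen arbitrarily when several exist, and the average-rank treatment of tied scores is omitted.

```python
from typing import List, Optional


def levenshtein_alignment(top: List[str], cand: List[str]) -> List[Optional[str]]:
    """For each word of `top`, return the word of `cand` aligned to it (None if unaligned)."""
    rows, cols = len(top) + 1, len(cand) + 1
    # dist[i][j] is the edit distance between top[:i] and cand[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            sub = dist[i - 1][j - 1] + (top[i - 1] != cand[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Backtrace from the bottom-right corner to recover one optimal alignment.
    aligned: List[Optional[str]] = [None] * len(top)
    i, j = len(top), len(cand)
    while i > 0 and j > 0:
        if dist[i][j] == dist[i - 1][j - 1] + (top[i - 1] != cand[j - 1]):
            aligned[i - 1] = cand[j - 1]  # match or substitution
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1  # top[i - 1] has no counterpart in cand
        else:
            j -= 1  # cand[j - 1] has no counterpart in top
    return aligned


def rank_sum_confidence(nbest: List[List[str]]) -> List[float]:
    """Confidence of each word of nbest[0], using the weights N - n + 1."""
    n_total = len(nbest)
    top = nbest[0]
    total = n_total * (n_total + 1) / 2  # sum of the weights N, N - 1, ..., 1
    scores = [0.0] * len(top)
    for n, cand in enumerate(nbest, start=1):
        weight = n_total - n + 1
        for i, word in enumerate(levenshtein_alignment(top, cand)):
            if word == top[i]:  # the indicator function equals 1
                scores[i] += weight
    return [score / total for score in scores]
```

    For example, with the three-best list [a b c], [a b d], [a x c], the weights are 3, 2, and 1 and the total is 6, so the word confidences of the top output are 6/6, 5/6, and 4/6.</Paragraph>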
    <Paragraph position="6"> As shown in Section 1, the existing RSCM does not always work well on types of MT systems other than SMT systems. This is because the system's N-best list does not always give a good approximation of the total summation of the probability of all candidate translations given the source sentence/utterance. The N-best list is expected to approximate the total summation as closely as possible.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Proposed Method
</SectionTitle>
    <Paragraph position="0"> In this section, the authors propose a method that eliminates unsatisfactory top outputs by using an alternative RSCM based on a mixture of N-best lists from multiple MT systems. The judgment that a top output is satisfactory is based on the same threshold comparison as the judgment that a top output is perfect, as mentioned in Section 1. The elimination system and the alternative RSCM are explained in Sections 3.1 and 3.2, respectively.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Elimination system
</SectionTitle>
      <Paragraph position="0"> This section proposes a method that eliminates unsatisfactory top outputs by using an alternative RSCM based on a mixture of N-best lists from multiple MT systems (Figure 3). This elimination system is intended to be used in the selector architecture (Figure 2). The elimination system receives an M-best list from each element MT system and outputs only those top outputs whose translation quality is at least as good as the level the user permits. In the case of Figure 3, the number of MT systems is three; thus, the elimination system can output zero to three top MT outputs, depending on how many top outputs are eliminated.</Paragraph>
      <Paragraph position="1"> The proposed elimination system judges whether a top output is satisfactory by using a threshold comparison, as in Ueffing et al. (2003). When the confidence values of all words in the top output, calculated by using the alternative RSCM explained in Section 3.2, are larger than a fixed threshold, the top output is judged as satisfactory; otherwise, it is judged as unsatisfactory. The threshold was optimized on a development corpus.</Paragraph>
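      <Paragraph> The threshold comparison described above can be sketched as follows. This is only an illustration: the names `is_satisfactory` and `eliminate` and the dictionary layout (system name mapped to the word confidences of its top output) are assumptions introduced here, and the threshold value would in practice come from a development corpus.

```python
from typing import Dict, List, Sequence


def is_satisfactory(word_confidences: Sequence[float], threshold: float) -> bool:
    """Accept a top output only if every word's confidence exceeds the threshold."""
    return all(conf > threshold for conf in word_confidences)


def eliminate(top_outputs: Dict[str, Sequence[float]], threshold: float) -> List[str]:
    """Return the names of the MT systems whose top output is kept."""
    return [
        name
        for name, confidences in top_outputs.items()
        if is_satisfactory(confidences, threshold)
    ]
```

      With three element MT systems, the returned list contains zero to three system names. Note that an empty word list is accepted by this sketch, a corner case the text does not discuss.</Paragraph>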
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 The proposed RSCM
</SectionTitle>
      <Paragraph position="0"> The proposed RSCM is an extension of the existing RSCM outlined in Section 2. The proposed RSCM differs from the existing RSCM in the adopted N-best list (Figure 3). The proposed RSCM receives an M-best list from each element MT system. Next, the proposed RSCM sorts the mixture of all the MT outputs in order of the average product of the scores of a language model and a translation model (Akiba et al., 2002). This sorted mixture is used in place of the system's N-best list in the existing RSCM. That is, the proposed RSCM decides whether to accept or reject each top MT output in the original M-best lists by using the sorted mixture, whereas the existing RSCM decides whether to accept or reject the top MT output in the system's N-best list by using that same N-best list.</Paragraph>
      <Paragraph position="1"> For scoring MT outputs, the proposed RSCM uses a score based on a translation model called IBM4 (Brown et al., 1993) (TM-score) and a score based on a language model for the translation target language (LM-score). As Akiba et al. (2002) reported, the products of TM-scores and LM-scores are statistical variables. Even in the case where the translation model (TM) and the language model for the translation target language (LM) are trained on a sub-corpus of the same size, changing the training corpus also changes the TM-score, the LM-score, and their product. Each pair of TM-score and LM-score orders the MT outputs differently.</Paragraph>
      <Paragraph position="2"> For robust scoring, the authors adopt the multiple scoring technique presented in (Akiba et al., 2002). The multiple scoring technique prepares</Paragraph>
      <Paragraph position="4"> trains a translation model $TM_i$ on $C_i$ ($= C - V_i$) and a language model $LM_i$ on the target-language part of $C_i$ (Figure 4). MT outputs in the mixture are sorted by using the average of the product scores by $TM_i$ and $LM_i$ over all i. In Akiba et al. (2002), this multiple scoring technique was shown to select the best translation more reliably than a single scoring technique that uses a TM and an LM trained on the full corpus.</Paragraph>
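      <Paragraph> The sorting of the mixture by the average product score can be sketched as follows. This is a hedged illustration rather than the authors' code: the scorer signatures, the names `average_product_score` and `sort_mixture`, and the decision to keep duplicate outputs from different systems are all assumptions made here; real TM and LM scores would come from trained models.

```python
from typing import Callable, List, Sequence, Tuple

TmScorer = Callable[[str, str], float]  # (source, translation) -> TM-score
LmScorer = Callable[[str], float]       # translation -> LM-score


def average_product_score(
    source: str,
    translation: str,
    model_pairs: Sequence[Tuple[TmScorer, LmScorer]],
) -> float:
    """Average, over all model pairs i, of TM_i-score times LM_i-score."""
    products = [tm(source, translation) * lm(translation) for tm, lm in model_pairs]
    return sum(products) / len(products)


def sort_mixture(
    source: str,
    m_best_lists: Sequence[Sequence[str]],
    model_pairs: Sequence[Tuple[TmScorer, LmScorer]],
) -> List[str]:
    """Merge the M-best lists of all systems and sort them, best score first."""
    mixture = [output for m_best in m_best_lists for output in m_best]
    return sorted(
        mixture,
        key=lambda output: average_product_score(source, output, model_pairs),
        reverse=True,
    )
```

      The sorted list returned by `sort_mixture` plays the role of the N-best list in the rank-sum calculation of Section 2.</Paragraph>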
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experimental Comparison
</SectionTitle>
    <Paragraph position="0"> The authors conducted an experimental comparison between the proposed RSCM and the existing RSCM in the framework of the elimination system.</Paragraph>
    <Paragraph position="1"> The task of both RSCMs was to judge whether each top MT output from an MT system is satisfactory, that is, whether the translation quality of the top MT output is at least as good as the level the user permits.</Paragraph>
    <Paragraph position="2"> In this experiment, the translation quality of MT outputs was assigned one of four grades, A, B, C, or D, as follows: (A) Perfect: no problems in either information or grammar; (B) Fair: easy-to-understand, with either some unimportant information missing or flawed grammar; (C) Acceptable: broken, but understandable with effort; (D) Nonsense: important information has been translated incorrectly. This evaluation standard was introduced by Sumita et al. (1999) to evaluate S2SMT systems.</Paragraph>
    <Paragraph position="3"> In advance, each top MT output was evaluated by nine native speakers of the target language, who were also familiar with the source language, and then assigned the median grade of the nine grades.</Paragraph>
    <Paragraph position="4"> To conduct a fair comparison, the number of MT outputs in the system's N-best list and the number of MT outputs in the mixture should be the same. Thus, the authors used either a three-best list from each of three MT systems or a five-best list from each of two non-SMT MT systems for the proposed RSCM and a ten-best list for the existing RSCM. Naturally, this setting is not disadvantageous for the existing RSCM.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Evaluation metrics
</SectionTitle>
      <Paragraph position="0"> The performances of both RSCMs were evaluated by using three different metrics: ROC Curve, H-mean, and Accuracy. For each MT system, these metrics were separately calculated by using a confusion matrix (Table 1). For example, for J2E D3 (Section 4.2.1), the proposed RSCM checked each top MT output from J2E D3 by using the input mixture of three-best lists from the three J2E MT systems (Section 4.2.1); on the other hand, the existing RSCM checked each top MT output from J2E D3 by using the input ten-best list from J2E D3. For J2E D3, the results were counted up into the confusion matrix of each RSCM, and the metrics were calculated as follows: ROC Curve plots the correct acceptance rate versus the correct rejection rate for different values of the threshold. Correct acceptance rate (CAR) is defined as the number of satisfactory outputs that have been accepted, divided by the total number of satisfactory outputs, that is, $V_{s,a}/V_s$ (Table 1). Correct rejection rate (CRR) is defined as the number of unsatisfactory outputs that have been rejected, divided by the total number of unsatisfactory outputs, that is, $V_{u,r}/V_u$ (Table 1).</Paragraph>
      <Paragraph position="1"> H-mean is defined as a harmonic mean of the CAR and the CRR (Table 1), $2 \cdot CAR \cdot CRR / (CAR + CRR)$.</Paragraph>
      <Paragraph position="2"> Accuracy is defined as a weighted mean of the CAR and the CRR (Table 1), $(V_s \cdot CAR + V_u \cdot CRR) / (V_s + V_u)$.</Paragraph>
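      <Paragraph> The metrics defined above can be computed from the four cells of the confusion matrix as in the following sketch. The argument names `v_sa`, `v_sr`, `v_ua`, and `v_ur` (satisfactory/unsatisfactory outputs that were accepted/rejected) are naming assumptions made here, and Accuracy is assumed to be the weighted mean of CAR and CRR with weights $V_s$ and $V_u$; the sketch also assumes that both classes are non-empty and that CAR and CRR are not both zero.

```python
from typing import Tuple


def car_crr(v_sa: int, v_sr: int, v_ua: int, v_ur: int) -> Tuple[float, float]:
    """Correct acceptance rate and correct rejection rate."""
    car = v_sa / (v_sa + v_sr)  # accepted satisfactory / all satisfactory
    crr = v_ur / (v_ua + v_ur)  # rejected unsatisfactory / all unsatisfactory
    return car, crr


def h_mean(car: float, crr: float) -> float:
    """Harmonic mean of CAR and CRR."""
    return 2 * car * crr / (car + crr)


def accuracy(v_sa: int, v_sr: int, v_ua: int, v_ur: int) -> float:
    """Mean of CAR and CRR weighted by the class sizes V_s and V_u."""
    v_s, v_u = v_sa + v_sr, v_ua + v_ur
    car, crr = car_crr(v_sa, v_sr, v_ua, v_ur)
    return (v_s * car + v_u * crr) / (v_s + v_u)
```

      Under this assumption Accuracy reduces to the fraction of correct decisions, $(V_{s,a} + V_{u,r}) / (V_s + V_u)$, which is why it depends on the ratio of satisfactory to unsatisfactory outputs while H-mean does not.</Paragraph>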
      <Paragraph position="4"> [Footnote 5: This harmonic mean is used for summarizing two measures that have a trade-off relationship with each other. For example, F-measure, the harmonic mean of precision and recall, is widely used in the discipline of information retrieval.] [Table note: Each number in the AB row indicates the ratio of A-or-B-graded translations by each MT system. Each number in the other rows similarly indicates the corresponding ratio.] For both H-mean and Accuracy, 10-fold cross validation was conducted. The threshold was fixed such that the performance was maximized on each non-held-out subset, and the performance was then calculated on the corresponding held-out subset. To statistically test the differences in performance (H-mean or Accuracy) between the confidence measures, the authors conducted a pairwise t-test (Mitchell, 1997) based on the results of the 10-fold cross validation. When the difference in performance meets the following condition, the difference is statistically significant at a given confidence level:</Paragraph>
      <Paragraph position="5">  where $p_{pro}$ and $p_{ext}$, respectively, denote the average performance of the proposed RSCM and the existing RSCM, $t_{(\alpha; 10-1)}$ denotes the upper $\alpha$ point of Student's t-distribution with $(10 - 1)$ degrees of freedom, and S denotes the estimated standard deviation of the average difference in performance.</Paragraph>
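      <Paragraph> A paired t-test over the folds can be sketched as follows. The exact form of the condition is an assumption here: the sketch takes it to be $|p_{pro} - p_{ext}| > t \cdot S$, with S computed as the sample standard deviation of the fold-wise differences divided by the square root of the number of folds, as in the paired test described by Mitchell (1997). The names `paired_t` and `is_significant` are hypothetical, the critical value is passed in by the caller, and the fold-wise differences are assumed not to be all identical.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t(perf_pro: Sequence[float], perf_ext: Sequence[float]) -> Tuple[float, float]:
    """Return the average fold-wise difference and its t-value."""
    diffs = [p - e for p, e in zip(perf_pro, perf_ext)]
    mean_diff = statistics.mean(diffs)
    # Estimated standard deviation of the average difference.
    s = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, mean_diff / s


def is_significant(
    perf_pro: Sequence[float],
    perf_ext: Sequence[float],
    t_critical: float,
) -> bool:
    """True when the absolute t-value exceeds the supplied critical value."""
    _, t_value = paired_t(perf_pro, perf_ext)
    return abs(t_value) > t_critical
```

      For ten folds there are nine degrees of freedom; a two-sided test at the 95% level would use a critical value of about 2.262, although whether the test in the paper is one-sided or two-sided is not stated in this section.</Paragraph>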
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Experimental conditions
</SectionTitle>
      <Paragraph position="0"> Three English-to-Japanese (E2J) MT systems and three Japanese-to-English (J2E) MT systems of the three types described below were used. Table 2 shows the performances of these MT systems.</Paragraph>
      <Paragraph position="1"> D3 (DP-match Driven transDucer) is an example-based MT system using online-generated translation patterns (Doi and Sumita, 2003).</Paragraph>
      <Paragraph position="2"> HPAT (Hierarchical Phrase Alignment based Translation) is a pattern-based system using automatically generated syntactic transfer (Imamura et al., 2003).</Paragraph>
      <Paragraph position="3"> SAT (Statistical ATR Translator) is an SMT system using a retrieved seed translation as the start point for decoding/translation (Watanabe et al., 2003).</Paragraph>
      <Paragraph position="4"> The test set consists of 510 pairs of English and Japanese sentences, which were randomly selected from the Basic Travel Expression Corpus (BTEC) (Takezawa et al., 2002). BTEC contains a variety of expressions used in a number of situations related to overseas travel. The corpora used for training the TMs and LMs described in Section 3.2 were merged corpora (Table 3). The number of trained TMs/LMs was three. The translation models and language models were learned by using GIZA++ (Och and Ney, 2000) and the CMU-Cambridge Toolkit (Clarkson and Rosenfeld, 1997), respectively. [Table fragment: Ave. sent. length 7.7 6.6]</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Experimental results and discussion
</SectionTitle>
      <Paragraph position="0"> In order to plot the ROC Curve, the authors conducted the same experiment as shown in Figure 1.</Paragraph>
      <Paragraph position="1"> That is, in the case where the grade of satisfactory translations is only grade A, each of the proposed and existing RSCMs tried to accept grade A MT outputs and to reject grade B, C, or D MT outputs.</Paragraph>
      <Paragraph position="2"> Figures 5 to 7 show the ROC Curves for each of the three J2E MT systems (D3, HPAT, and SAT).</Paragraph>
      <Paragraph position="3"> The curves with diamond marks, cross marks, triangle marks, and circle marks show the ROC Curves for the existing RSCM, the proposed RSCM using the mixture of three-best lists from D3, HPAT, and SAT, the proposed RSCM using the mixture of five-best lists from D3 and HPAT, and the existing RSCM with reordering, respectively. [Table caption fragment: ... experimental results of each of the three MT systems: D3, HPAT, and SAT. Each number in the first to third columns of each MT system indicates the average performance of the proposed RSCM, the average difference between the performance of the proposed RSCM and that of the existing RSCM, and the t-value of that difference, respectively. Bold numbers indicate that the difference in the column to their left is significant at a confidence level of 95%. The numbers in the three rows for each MT system, whose row heads are "A | BCD", "AB | CD", or "ABC | D", correspond to the three types of experiments in which each RSCM tried to accept/reject the MT outputs assigned one of the grades to the left/right of "|", respectively.]</Paragraph>
      <Paragraph position="4"> In the existing RSCM with reordering, the system's original N-best list was sorted by using the average of the product scores from the multiple scoring technique described in Section 3.2, and the existing RSCM with reordering used this sorted N-best list instead of the system's original N-best list.</Paragraph>
      <Paragraph position="5"> The dotted lines indicate the contours of H-mean from 0.7 to 0.8. The ideal ROC curve is a square through (0, 1), (1, 1), and (1, 0); thus, the closer the curve is to a square, the better the performance of the RSCM.</Paragraph>
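      <Paragraph> The points of such an ROC curve can be obtained by sweeping the threshold, as in the following sketch. It assumes, for illustration only, that each top output is summarized by the minimum confidence over its words (so that accepting an output is equivalent to all word confidences exceeding the threshold) and that both satisfactory and unsatisfactory outputs are present; the name `roc_points` is hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_points(
    min_confidences: Sequence[float],
    satisfactory: Sequence[bool],
    thresholds: Sequence[float],
) -> List[Tuple[float, float]]:
    """Return one (CAR, CRR) point per threshold."""
    n_sat = sum(satisfactory)
    n_unsat = len(satisfactory) - n_sat
    points = []
    for threshold in thresholds:
        accepted = [conf > threshold for conf in min_confidences]
        pairs = list(zip(accepted, satisfactory))
        car = sum(acc and sat for acc, sat in pairs) / n_sat
        crr = sum((not acc) and (not sat) for acc, sat in pairs) / n_unsat
        points.append((car, crr))
    return points
```

      When the confidence separates the two classes perfectly, sweeping the threshold from low to high traces the points (1, 0), (1, 1), and (0, 1), that is, the corners of the ideal square.</Paragraph>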
      <Paragraph position="6"> In Figures 5 and 6, the curves of the proposed RSCM using the mixture of three-best lists from the three MT systems are much closer to a square than those of the existing RSCM; moreover, the curves of the proposed RSCM using the mixture of five-best lists from the two MT systems are also much closer to a square than those of the existing RSCM.</Paragraph>
      <Paragraph position="7"> Note that the superiority of the proposed RSCM to the existing RSCM is maintained even in the case where an M-best list from the SMT system was not used. The curves of the existing RSCM with reordering are closer to a square than those of the existing RSCM. Thus, the performance of the proposed RSCM on the non-SMT systems, D3 and HPAT, is much better than that of the existing RSCM. The difference between the performance of the proposed and existing RSCMs is due to both re-sorting the MT outputs and using a mixture of N-best lists.</Paragraph>
      <Paragraph position="8"> In Figure 7, the curve of the proposed RSCM is a little closer to a square when CRR is larger than CAR, and the curve of the existing RSCM is a little closer to a square when CAR is larger than CRR. Thus, the performance of the proposed RSCM on the SMT system, SAT, is a little better than that of the existing RSCM in the case where CRR is regarded as important; similarly, the performance of the proposed RSCM on the SMT system is a little worse than that of the existing RSCM in the case where CAR is regarded as important.</Paragraph>
      <Paragraph position="9">  Tables 4 and 5 show the experimental results of ten-fold cross-validated pairwise t-tests of the performance of H-mean and Accuracy, respectively.</Paragraph>
      <Paragraph position="10"> On the non-SMT systems, Table 4 shows that at every level of translation quality that the user would permit, the H-mean of the proposed RSCM is significantly better than that of the existing RSCM. On the SMT system, Table 4 shows that at every permitted level of translation quality, there is no significant difference between the H-mean of the proposed RSCM and that of the existing RSCM except in two cases: "ABC | D" for E2J-SAT and "AB | CD" for J2E-SAT.</Paragraph>
      <Paragraph position="11"> Table 5 shows almost the same tendency as Table 4. The difference is that, in the case where the translation quality that the user would permit is better than D, there is no significant difference between the Accuracy of the proposed RSCM and that of the existing RSCM except in the one case of "ABC | D" for E2J-HPAT.</Paragraph>
      <Paragraph position="12"> As defined in Section 4.1, Accuracy is an evaluation metric whose value is sensitive to the ratio of the number of satisfactory translations to the number of unsatisfactory translations, whereas H-mean is an evaluation metric whose value is independent of this ratio. These evaluation metrics should be chosen according to the situation encountered. For general purposes, the ratio-independent metric, H-mean, is preferable. In the case where the test set reflects the special situation encountered, Accuracy is useful.</Paragraph>
      <Paragraph position="13"> Regardless of whether we encounter any special situation, in most cases on a non-SMT system, the proposed RSCM proved to be significantly better than the existing RSCM. In most cases on an SMT system, the proposed RSCM proved to be as good in performance as the existing RSCM.</Paragraph>
      <Paragraph position="14"> This paper reports a case study in which a mixture of N-best lists from multiple MT systems boosted the performance of the RSCM for MT outputs. The authors believe the proposed RSCM will work well only when each of the element MT systems complements the others, but the authors leave the question of the best combination of complementary MT systems open for future study.</Paragraph>
    </Section>
  </Section>
</Paper>