<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1005">
  <Title>Computing Consensus Translation from Multiple Machine Translation Systems Using Enhanced Hypotheses Alignment</Title>
  <Section position="3" start_page="33" end_page="35" type="metho">
    <SectionTitle>
2 Description of the Algorithm
</SectionTitle>
    <Paragraph position="0"> The proposed approach takes advantage of multiple translations for a whole test corpus to compute a consensus translation for each sentence in this corpus. Given a single source sentence in the test corpus, we combine M translation hypotheses E1, ..., EM from M MT engines. We first choose one of the hypotheses Em as the primary one. We consider this primary hypothesis to have the &quot;correct&quot; word order. We then align and reorder the other, secondary hypotheses En (n = 1, ..., M; n ≠ m) to match this word order. Since each hypothesis may have an acceptable word order, we let every hypothesis play the role of the primary translation once, and thus align all pairs of hypotheses (En, Em), n ≠ m.</Paragraph>
    <Paragraph position="1"> In the following subsections, we will explain the word alignment procedure, the reordering approach, and the construction of confusion networks.</Paragraph>
    <Section position="1" start_page="33" end_page="34" type="sub_section">
      <SectionTitle>
2.1 Statistical Alignment
</SectionTitle>
      <Paragraph position="0"> The word alignment is performed in analogy to the training procedure in SMT. The difference is that the two sentences that have to be aligned are in the same language. We consider the conditional probability Pr(En|Em) of the event that, given Em, another hypothesis En is generated from Em.</Paragraph>
      <Paragraph position="1"> Then, the alignment between the two hypotheses is introduced as a hidden variable:</Paragraph>
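      <Paragraph position="2"> The displayed equation that followed the preceding sentence is missing from this extracted text. Assuming it has the standard hidden-alignment form used in statistical machine translation, with A denoting an alignment between the words of En and Em (the symbol A is an assumption and is not taken from the surrounding text), the marginalization can be sketched as:

```latex
\Pr(E_n \mid E_m) = \sum_{A} \Pr(E_n, A \mid E_m)
```
      </Paragraph>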
      <Paragraph position="3"> As in statistical machine translation, we make modelling assumptions. We use the IBM Model 1 (Brown et al., 1993) (uniform distribution) and the Hidden Markov Model (HMM, first-order dependency, (Vogel et al., 1996)) to estimate the alignment model. The lexicon probability of a sentence pair is modelled as a product of single-word based probabilities of the aligned words.</Paragraph>
      <Paragraph position="4"> The training corpus for alignment is created from a test corpus of N sentences (usually a few hundred) translated by all of the involved MT engines. However, the effective size of the training corpus is larger than N, since all pairs of different hypotheses have to be aligned. Thus, the effective size of the training corpus is M · (M − 1) · N. The single-word based lexicon probabilities p(en|em) are initialized with normalized lexicon counts collected over the sentence pairs (En, Em) on this corpus. Since all of the hypotheses are in the same language, we count co-occurring equal words, i.e. cases in which en is the same word as em.</Paragraph>
      <Paragraph position="5"> In addition, we add a fraction of a count for words with identical prefixes. The initialization could be further improved by using word classes, part-of-speech tags, or a list of synonyms.</Paragraph>
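      <Paragraph><![CDATA[The initialization described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the prefix length of 4 characters, and the fractional count of 0.5 are hypothetical choices, since the text does not state the actual values.

```python
from collections import defaultdict
from itertools import permutations


def hypothesis_pairs(hypotheses_per_sentence):
    """Yield every ordered pair (En, Em) with n != m for each sentence.

    For M hypotheses and N sentences this gives M * (M - 1) * N pairs.
    """
    for hyps in hypotheses_per_sentence:
        yield from permutations(hyps, 2)


def init_lexicon(pairs, prefix_len=4, prefix_weight=0.5):
    """Initialize p(en | em) from co-occurrence counts of equal words.

    prefix_len and prefix_weight are assumed values for illustration.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for secondary, primary in pairs:
        for e_m in primary:
            for e_n in secondary:
                if e_n == e_m:
                    # Full count for identical words.
                    counts[e_m][e_n] += 1.0
                elif len(e_n) >= prefix_len and e_n[:prefix_len] == e_m[:prefix_len]:
                    # Fractional count for words sharing a prefix.
                    counts[e_m][e_n] += prefix_weight
    # Normalize each row so that the probabilities p(. | em) sum to one.
    lexicon = {}
    for e_m, row in counts.items():
        total = sum(row.values())
        lexicon[e_m] = {e_n: c / total for e_n, c in row.items()}
    return lexicon
```

With three hypotheses for one sentence, hypothesis_pairs yields 3 · 2 · 1 = 6 training pairs, matching the effective corpus size given above. In the actual procedure these values only serve as a starting point for the iterative training.]]></Paragraph>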
      <Paragraph position="6"> The model parameters are trained iteratively in an unsupervised manner with the EM algorithm using the GIZA++ toolkit (Och and Ney, 2003).</Paragraph>
      <Paragraph position="7"> The training is performed in the directions En → Em and Em → En. The updated lexicon tables from the two directions are interpolated after each iteration.</Paragraph>
      <Paragraph position="8"> The final alignments are determined using cost matrices defined by the state occupation probabilities of the trained HMM (Matusov et al., 2004). The alignments are used for reordering each secondary translation En and for computing the confusion network.</Paragraph>
      <Paragraph position="9"> Figure 1: Example of creating a confusion network from monotone one-to-one word alignments (denoted with symbol |). The words of the primary hypothesis are printed in bold. The symbol $ denotes a null alignment or an ε-arc in the corresponding part of the confusion network.
        original hypotheses:
          1. would you like coffee or tea
          2. would you have tea or coffee
          3. would you like your coffee or
          4. I have some coffee tea would you like
        alignment and reordering:
          would|would you|you have|like coffee|coffee or|or tea|tea
          would|would you|you like|like your|$ coffee|coffee or|or $|tea
          I|$ would|would you|you like|like have|$ some|$ coffee|coffee $|or tea|tea
        confusion network:
          $ would you like $ $ coffee or tea
          $ would you have $ $ coffee or tea
          $ would you like your $ coffee or $
          I would you like have some coffee $ tea</Paragraph>
    </Section>
    <Section position="2" start_page="34" end_page="34" type="sub_section">
      <SectionTitle>
2.2 Word Reordering
</SectionTitle>
      <Paragraph position="0"> The alignment between En and the primary hypothesis Em used for reordering is computed as a function of words in the secondary translation En with minimal costs, with an additional constraint that identical words in En cannot all be aligned to the same word in Em. This constraint is necessary to avoid producing reordered hypotheses with, e.g., multiple consecutive articles &quot;the&quot; when fewer articles were used in the primary hypothesis.</Paragraph>
      <Paragraph position="1"> The new word order for En is obtained by sorting the words in En by the indices of the words in Em to which they are aligned. Two words in En which are aligned to the same word in Em are kept in the original order. After reordering each secondary hypothesis En, we determine M − 1 monotone one-to-one alignments between Em and En, n = 1, ..., M; n ≠ m. In case of many-to-one connections of words in En to a single word in Em, we only keep the connection with the lowest alignment costs. The one-to-one alignments are convenient for constructing a confusion network in the next step of the algorithm.</Paragraph>
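      <Paragraph><![CDATA[The two steps above (stable sorting by aligned position, then pruning many-to-one links) can be sketched in Python as follows. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, the alignment is assumed to be given as a list mapping each word position in En to a word position in Em, and the costs are assumed to be given per word of En.

```python
def reorder(secondary, aligned_to):
    """Sort the words of En by the index of the Em word they are aligned to.

    Python's sort is stable, so words aligned to the same Em word keep
    their original relative order.
    """
    order = sorted(range(len(secondary)), key=lambda j: aligned_to[j])
    return [secondary[j] for j in order], [aligned_to[j] for j in order]


def one_to_one(aligned_to, costs):
    """Keep only the lowest-cost link for each Em word.

    Words of En that lose their link are marked with None (null alignment).
    """
    best = {}
    for j, i in enumerate(aligned_to):
        if i not in best or costs[j] < costs[best[i]]:
            best[i] = j
    return [i if best[i] == j else None for j, i in enumerate(aligned_to)]


# Hypothesis 2 of Figure 1, aligned to the primary hypothesis
# "would you like coffee or tea" (alignment indices chosen to match the figure).
words, links = reorder("would you have tea or coffee".split(), [0, 1, 2, 5, 4, 3])
print(" ".join(words))  # would you have coffee or tea
```

Relying on a stable sort is what implements the rule that two words aligned to the same primary word stay in their original order.]]></Paragraph>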
    </Section>
    <Section position="3" start_page="34" end_page="34" type="sub_section">
      <SectionTitle>
2.3 Building Confusion Networks
</SectionTitle>
      <Paragraph position="0"> Given the M − 1 monotone one-to-one alignments, the transformation to a confusion network as described by Bangalore et al. (2001) is straightforward. It is explained by the example in Figure 1. Here, the original 4 hypotheses are shown, followed by the alignment of the reordered secondary hypotheses 2-4 with the primary hypothesis 1. The alignment is shown with the | symbol, and the words of the primary hypothesis are to the right of this symbol. The symbol $ denotes a null alignment or an ε-arc in the corresponding part of the confusion network, which is shown at the bottom of the figure.</Paragraph>
      <Paragraph position="1"> Note that the word &quot;have&quot; in translation 2 is aligned to the word &quot;like&quot; in translation 1. This alignment is acceptable considering the two translations alone. However, given the presence of the word &quot;have&quot; in translation 4, this is not the best alignment. Problems of this type can in part be solved by the proposed approach, since every translation plays the role of the primary translation once. For each sentence, we obtain a total of M confusion networks and unite them in a single lattice. The consensus translation can be chosen among different alignment and reordering paths in this lattice.</Paragraph>
      <Paragraph position="2"> The &quot;voting&quot; on the union of confusion networks is straightforward and analogous to the ROVER system. We sum up the probabilities of the arcs which are labeled with the same word and have the same start and the same end state.</Paragraph>
      <Paragraph position="3"> These probabilities are the global probabilities assigned to the different MT systems. They are manually adjusted based on the performance of the involved MT systems on a held-out development set. In general, a better consensus translation can be produced if the words hypothesized by a better-performing system get a higher probability. Additional scores like word confidence measures can be used to score the arcs in the lattice.</Paragraph>
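      <Paragraph><![CDATA[As a rough illustration of the voting step, the following Python sketch sums system probabilities over arcs that carry the same word in the same slot of a single confusion network, using the network from Figure 1. It is a simplification under stated assumptions: the approach described here votes on the union of M networks, whereas this sketch handles only one; the function names and the weights 0.4, 0.3, 0.2, 0.1 are hypothetical.

```python
from collections import defaultdict

EPS = "$"  # null alignment / epsilon arc, written as in Figure 1


def vote(rows, system_probs):
    """Sum the system probabilities of identical words in each slot.

    rows[k] is the k-th system's row of the confusion network; all rows
    are assumed to have the same number of slots.
    """
    scores = [defaultdict(float) for _ in range(len(rows[0]))]
    for row, prob in zip(rows, system_probs):
        for slot, word in enumerate(row):
            scores[slot][word] += prob
    return scores


def best_path(scores):
    """Pick the highest-scoring arc in every slot and drop epsilon arcs."""
    path = [max(slot, key=slot.get) for slot in scores]
    return [word for word in path if word != EPS]


network = [
    "$ would you like $ $ coffee or tea".split(),
    "$ would you have $ $ coffee or tea".split(),
    "$ would you like your $ coffee or $".split(),
    "I would you like have some coffee $ tea".split(),
]
weights = [0.4, 0.3, 0.2, 0.1]  # assumed system probabilities
print(" ".join(best_path(vote(network, weights))))  # would you like coffee or tea
```

Because the slots of a single confusion network are independent, choosing the best arc per slot gives the best path through that network; a lattice built from several networks would need a proper shortest-path search instead.]]></Paragraph>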
    </Section>
    <Section position="4" start_page="34" end_page="35" type="sub_section">
      <SectionTitle>
2.4 Extracting Consensus Translation
</SectionTitle>
      <Paragraph position="0"> In the final step, the consensus translation is extracted as the best path from the union of confusion networks. Note that the extracted consensus translation can be different from the original M translations. Alternatively, the N-best hypotheses can be extracted for rescoring by additional models. We performed experiments with both approaches. Since M confusion networks are used, the lattice may contain two best paths with the same probability, the same words, but different word order. We extended the algorithm to favor more well-formed word sequences. We assign a higher probability to each arc of the primary (unreordered) translation in each of the M confusion networks. Experimentally, this extension improved translation fluency on some tasks.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="35" end_page="36" type="metho">
    <SectionTitle>
3 Experimental Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="35" end_page="35" type="sub_section">
      <SectionTitle>
3.1 Corpus Statistics
</SectionTitle>
      <Paragraph position="0"> The alignment and voting algorithm was evaluated on both small and large vocabulary tasks. Initial experiments were performed on the IWSLT 2004 Chinese-English and Japanese-English tasks (Akiba et al., 2004). The data for these tasks come from the Basic Travel Expression corpus (BTEC), consisting of tourism-related sentences. We combined the outputs of several MT systems that had officially been submitted to the IWSLT 2004 evaluation. Each system had used 20K sentence pairs (180K running words) from the BTEC corpus for training.</Paragraph>
      <Paragraph position="1"> Experiments with translations of automatically recognized speech were performed on the BTEC Italian-English task (Federico, 2003). Here, the involved MT systems had used about 60K sentence pairs (420K running words) for training.</Paragraph>
      <Paragraph position="2"> Finally, we also computed consensus translation from some of the submissions to the TC-STAR 2005 evaluation campaign (TC-STAR, 2005). The TC-STAR participants had submitted translations of manually transcribed speeches from the European Parliament Plenary Sessions (EPPS). In our experiments, we used the translations from Spanish to English. The MT engines for this task had been trained on 1.2M sentence pairs (32M running words).</Paragraph>
      <Paragraph position="3"> Table 1 gives an overview of the test corpora on which the enhanced hypotheses alignment was computed, and for which the consensus translations were determined. The official IWSLT04 test corpus was used for the IWSLT 04 tasks; the CSTAR03 test corpus was used for the speech translation task. The March 2005 test corpus of the TC-STAR evaluation (verbatim condition) was used for the EPPS task. In Table 1, the number of running words in English is the average number of running words in the hypotheses from which the consensus translation was computed; the vocabulary of English is the merged vocabulary of these hypotheses. For the BTEC IWSLT04 corpus, the statistics for English are given for the experiments described in Sections 3.3 and 3.5, respectively.</Paragraph>
    </Section>
    <Section position="2" start_page="35" end_page="36" type="sub_section">
      <SectionTitle>
3.2 Evaluation Criteria
</SectionTitle>
      <Paragraph position="0"> Well-established objective evaluation measures like the word error rate (WER), position-independent word error rate (PER), and the BLEU score (Papineni et al., 2002) were used to assess the translation quality. All measures were computed with respect to multiple reference translations. The evaluation (as well as the alignment training) was case-insensitive, without considering the punctuation marks.</Paragraph>
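      <Paragraph><![CDATA[For readers unfamiliar with the two error rates, the following Python sketch shows one common way to compute them. It is not the evaluation tool used in these experiments. The assumptions are: the score against multiple references is taken as the minimum over the references, PER counts max(hypothesis length, reference length) minus the number of matching words regardless of position, references are non-empty, and normalization simply lowercases the text and replaces punctuation with spaces. All function names are hypothetical.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase the text and remove punctuation (a simplifying assumption)."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cur.append(min(prev[j] + 1,              # extra hypothesis word
                           cur[j - 1] + 1,           # missing reference word
                           prev[j - 1] + (h != r)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(hyp, refs):
    """Word error rate: best edit distance ratio over the references."""
    return min(edit_distance(hyp, ref) / len(ref) for ref in refs)


def per(hyp, refs):
    """Position-independent error rate based on bag-of-words overlap."""
    def errors(ref):
        matches = sum((Counter(hyp) & Counter(ref)).values())
        return max(len(hyp), len(ref)) - matches
    return min(errors(ref) / len(ref) for ref in refs)


hyp = normalize("Would you have tea or coffee?")
refs = [normalize("Would you like coffee or tea?")]
print(wer(hyp, refs), per(hyp, refs))
```

In this example three of six words are wrong when position matters (WER 0.5), but only one word is wrong when order is ignored (PER 1/6), which shows why the two measures respond differently to reordering.]]></Paragraph>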
    </Section>
    <Section position="3" start_page="36" end_page="36" type="sub_section">
      <SectionTitle>
3.3 Chinese-English Translation
</SectionTitle>
      <Paragraph position="0"> Different applications of the proposed combination method have been evaluated. First, we focused on combining different MT systems which have the same source and target language. The initial experiments were performed on the BTEC Chinese-English task. We combined translations produced by 5 different MT systems. Table 2 shows the performance of the best and the worst of these systems in terms of the BLEU score. The results for the consensus translation show a dramatic improvement in translation quality. The word error rate is reduced, e.g., from 54.6% to 47.8%. The research group which had submitted the best translation in 2004 translated the same test set a year later with an improved system. We compared the consensus translation with this new translation (last line of Table 2). It can be observed that the consensus translation based on the MT systems developed in 2004 is still superior to this 2005 single system translation in terms of all error measures. We also checked how many sentences in the consensus translation of the test corpus are different from the 5 original translations. 185 out of 500 sentences (37%) had new translations. Computing the error measures on these sentences only, we observed significant improvements in WER and PER and a small improvement in BLEU with respect to the original translations. Thus, the quality of previously unseen consensus translations as generated from the original translations is acceptable. In this experiment, the global system probabilities for scoring the confusion networks were tuned manually on a development set. The distribution was 0.35, 0.25, 0.2, 0.1, 0.1, with 0.35 for the words of the best single system and 0.1 for the words of the worst single system. We observed that the consensus translation did not change significantly with small perturbations of these values. However, the relation between the probabilities is very important for good performance.</Paragraph>
      <Paragraph position="1"> No improvement can be achieved with a uniform probability distribution; it is necessary to penalize translations of low quality.</Paragraph>
    </Section>
    <Section position="4" start_page="36" end_page="36" type="sub_section">
      <SectionTitle>
3.4 Spanish-English Translation
</SectionTitle>
      <Paragraph position="0"> The improvements in translation quality are also significant on the TC-STAR EPPS Spanish-English task. Here, we combined four different systems which performed best in the TC-STAR 2005 evaluation (see Table 3). Compared to the best performing single system, the consensus hypothesis reduces the WER from 41.0% to 39.1%. This result is further improved by rescoring the N-best lists derived from the confusion networks (N = 1000). For rescoring, a word penalty feature, the IBM Model 1, and a 4-gram target language model were included. The linear interpolation weights of these models and the score from the confusion network were optimized on a separate development set with respect to word error rate.</Paragraph>
      <Paragraph position="1"> Table 4 gives examples of improved translation quality by using the consensus translation as derived from the rescored N-best lists.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="36" end_page="37" type="metho">
    <SectionTitle>
3.5 Multi-source Translation
</SectionTitle>
    <Paragraph position="0"> In the IWSLT 2004 evaluation, the English reference translations for the Chinese-English and Japanese-English test corpora were the same, except for a permutation of the sentences. Thus, we could combine MT systems which have different source and the same target language, performing multi-source machine translation (described e.g. by Och and Ney (2001)).</Paragraph>
    <Paragraph position="1"> We combined two Japanese-English and two Chinese-English systems. The best performing system was a Japanese-English system with a BLEU score of 44.7%, see Table 5. By computing the consensus translation, we improved this score to 49.6%, and also significantly reduced the error rates.</Paragraph>
    <Paragraph position="2"> To investigate the potential of the proposed approach, we generated the N-best lists (N = 1000) of consensus translations. Then, for each sentence, we selected the hypothesis in the N-best list with the lowest word error rate with respect to the multiple reference translations for the sentence. We then evaluated the quality of these &quot;oracle&quot; translations with all error measures. In a contrastive experiment, for each sentence we simply selected the translation with the lowest WER from the original 4 MT system outputs. Table 6 shows that the potential for improvement is significantly larger for the consensus-based combination of translation outputs than for simple selection of the best translation (footnote 1: similar &quot;oracle&quot; results were observed on other tasks). In our future work, we plan to improve the scoring of hypotheses in the confusion networks to explore this large potential.</Paragraph>
    <Paragraph position="3"> Table 4: Examples of improved translation quality with the consensus translations on the Spanish-English TC-STAR EPPS task (case-insensitive output).
      best system: I also authorised to committees to certain reports
      consensus:   I also authorised to certain committees to draw up reports
      reference:   I have also authorised certain committees to prepare reports

      best system: human rights which therefore has fought the european union
      consensus:   human rights which the european union has fought
      reference:   human rights for which the european union has fought so hard

      best system: we of the following the agenda
      consensus:   moving on to the next point on the agenda
      reference:   we go on to the next point of the agenda

      [Table caption fragment:] ... in translation quality when computing consensus translation using the output of two Chinese-English and two Japanese-English systems on the ...</Paragraph>
    <Section position="1" start_page="37" end_page="37" type="sub_section">
      <SectionTitle>
3.6 Speech Translation
</SectionTitle>
      <Paragraph position="0"> Some state-of-the-art speech translation systems can translate either the first best recognition hypotheses or the word lattices of an ASR system. It has been previously shown that word lattice input generally improves translation quality. In practice, however, the translation system may choose, for some sentences, the paths in the lattice with many recognition errors and thus produce inferior translations. These translations can be improved if we compute a consensus translation from the output of at least two different speech translation systems.</Paragraph>
      <Paragraph position="1"> From each system, we take the translation of the single best ASR output, and the translation of the ASR word lattice.</Paragraph>
      <Paragraph position="2"> Two different statistical MT systems capable of translating ASR word lattices have been compared by Matusov and Ney (2005). Both systems produced translations of better quality on the BTEC Italian-English speech translation task when using lattices instead of single best ASR output. We obtained the output of each of the two systems under each of these translation scenarios on the CSTAR03 test corpus. The first-best recognition word error rate on this corpus is 22.3%. The objective error measures for the 4 translation hypotheses are given in Table 7. We then computed a consensus translation of the 4 outputs with the proposed method. The better performing word lattice translations were given higher system probabilities. With the consensus hypothesis, the word error rate went down from 29.5% to 28.5%. Thus, the negative effect of recognition errors on the translation quality was further reduced.</Paragraph>
    </Section>
  </Section>
</Paper>